1
0
Fork 0
Commit Graph

3247 Commits (ccc78ec5d463e6c99f4a384be52b31222ffe2e21)

Author SHA1 Message Date
Thomas Gleixner 1c10c49d83 net: replace commatas with semicolons
Impact: syntax fix

Interestingly enough this compiles w/o any complaints:

	orphans = percpu_counter_sum_positive(&tcp_orphan_count),
	sockets = percpu_counter_sum_positive(&tcp_sockets_allocated),

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-16 00:08:56 -08:00
Patrick Ohly 51f31cabe3 ip: support for TX timestamps on UDP and RAW sockets
Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:38 -08:00
Herbert Xu aa6320d336 gro: Optimise TCP packet reception
gro: Optimise TCP packet reception

As this function can be called more than half a million times for
10GbE, it's important to optimise it as much as we can.

This patch uses bit ops to logical ops, as well as open coding
memcmp to exploit alignment properties.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-08 20:22:19 -08:00
Herbert Xu a5ad24be72 gro: Optimise IPv4 packet reception
As this function can be called more than half a million times for
10GbE, it's important to optimise it as much as we can.

This patch does some obvious changes to use 2-byte and 4-byte
operations instead of byte-oriented ones where possible.  Bit
ops are also used to replace logical ops to reduce branching.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-08 20:22:19 -08:00
David S. Miller 409f0a9014 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2009-02-07 02:52:44 -08:00
Ilpo Järvinen 69ebbf58f3 ipmr: use goto to common label instead of opencoding
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-06 23:46:51 -08:00
Jesper Dangaard Brouer 2783ef2312 udp: Fix potential wrong ip_hdr(skb) pointers
Like the UDP header fix, pskb_may_pull() can potentially
alter the SKB buffer.  Thus the saddr and daddr, pointers
may point to the old skb->data buffer.

I haven't seen corruptions, as its only seen if the old
skb->data buffer were reallocated by another user and
written into very quickly (or poison'd by SLAB debugging).

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-06 01:59:12 -08:00
David S. Miller a23f4bbd8d Revert "tcp: Always set urgent pointer if it's beyond snd_nxt"
This reverts commit 64ff3b938e.

Jeff Chua reports that it breaks rlogin for him.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 15:38:31 -08:00
Jesper Dangaard Brouer 7b5e56f9d6 udp: Fix UDP short packet false positive
The UDP header pointer assignment must happen after calling
pskb_may_pull().  As pskb_may_pull() can potentially alter the SKB
buffer.

This was exposted by running multicast traffic through the NIU driver,
as it won't prepull the protocol headers into the linear area on
receive.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 15:05:45 -08:00
David S. Miller b3ff29d2cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/Kconfig
2009-02-03 00:15:35 -08:00
Eric Dumazet e408b8dcb5 udp: increments sk_drops in __udp_queue_rcv_skb()
Commit 93821778de (udp: Fix rcv socket
locking) accidentally removed sk_drops increments for UDP IPV4
sockets.

This field can be used to detect incorrect sizing of socket receive
buffers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-02 13:41:57 -08:00
Herbert Xu f15fbcd7d8 ipv4: Delete redundant sk_family assignment
sk_alloc now sets sk_family so this is redundant.  In fact it caught
my eye because sock_init_data already uses sk_family so this is too
late anyway.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 22:24:43 -08:00
Eric Dumazet 24dd1fa184 net: move bsockets outside of read only beginning of struct inet_hashinfo
And switch bsockets to atomic_t since it might be changed in parallel.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 12:31:33 -08:00
Stephen Hemminger 5add300975 inet: Fix virt-manager regression due to bind(0) changes.
From: Stephen Hemminger <shemminger@vyatta.com>

Fix regression introduced by a9d8f9110d
("inet: Allowing more than 64k connections and heavily optimize
bind(0) time.")

Based upon initial patches and feedback from Evegniy Polyakov and
Eric Dumazet.

From Eric Dumazet:
--------------------
Also there might be a problem at line 175

if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
	spin_unlock(&head->lock);
	goto again;

If we entered inet_csk_get_port() with a non null snum, we can "goto again"
while it was not expected.
--------------------

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 01:40:17 -08:00
Stephen Hemminger eefef1cf76 net: add ARP notify option for devices
This adds another inet device option to enable gratuitous ARP
when device is brought up or address change. This is handy for
clusters or virtualization.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 01:04:33 -08:00
Harvey Harrison 09640e6365 net: replace uses of __constant_{endian}
Base versions handle constant folding now.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 00:45:17 -08:00
David S. Miller 05bee47377 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/e1000/e1000_main.c
2009-01-30 14:31:07 -08:00
Herbert Xu 86911732d3 gro: Avoid copying headers of unmerged packets
Unfortunately simplicity isn't always the best.  The fraginfo
interface turned out to be suboptimal.  The problem was quite
obvious.  For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.

LRO didn't have this problem because it directly read the headers
from the frags structure.

This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it.  Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:33:03 -08:00
Benjamin Zores 9d8dba6c97 ipv4: fix infinite retry loop in IP-Config
Signed-off-by: Benjamin Zores <benjamin.zores@alcatel-lucent.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:19:13 -08:00
Dimitris Michailidis 9fa5fdf291 tcp: Fix length tcp_splice_data_recv passes to skb_splice_bits.
tcp_splice_data_recv has two lengths to consider: the len parameter it
gets from tcp_read_sock, which specifies the amount of data in the skb,
and rd_desc->count, which is the amount of data the splice caller still
wants.  Currently it passes just the latter to skb_splice_bits, which then
splices min(rd_desc->count, skb->len - offset) bytes.

Most of the time this is fine, except when the skb contains urgent data.
In that case len goes only up to the urgent byte and is less than
skb->len - offset.  By ignoring len tcp_splice_data_recv may a) splice
data tcp_read_sock told it not to, b) return to tcp_read_sock a value > len.

Now, tcp_read_sock doesn't handle used > len and leaves the socket in a
bad state (both sk_receive_queue and copied_seq are bad at that point)
resulting in duplicated data and corruption.

Fix by passing min(rd_desc->count, len) to skb_splice_bits.

Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-26 22:15:31 -08:00
Eric Dumazet 98322f22ec udp: optimize bind(0) if many ports are in use
commit 9088c56095
(udp: Improve port randomization) introduced a regression for UDP bind() syscall
to null port (getting a random port) in case lot of ports are already in use.

This is because we do about 28000 scans of very long chains (220 sockets per chain),
with many spin_lock_bh()/spin_unlock_bh() calls.

Fix this using a bitmap (64 bytes for current value of UDP_HTABLE_SIZE)
so that we scan chains at most once.

Instead of 250 ms per bind() call, we get after patch a time of 2.9 ms 

Based on a report from Vitaly Mayatskikh

Reported-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-26 21:35:35 -08:00
Timo Teras afcf12422e gre: optimize hash lookup
Instead of keeping candidate tunnel device from all categories,
keep only one candidate with best score. This optimizes stack
usage and speeds up exit code.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-26 20:56:10 -08:00
Benjamin Thery 4feb88e5c6 netns: ipmr: enable namespace support in ipv4 multicast routing code
This last patch makes the appropriate changes to use and propagate the
network namespace where needed in IPv4 multicast routing code.

This consists mainly in replacing all the remaining init_net occurences
with current netns pointer retrieved from sockets, net devices or
mfc_caches depending on the routines' contexts.

Some routines receive a new 'struct net' parameter to propagate the current
netns:
* vif_add/vif_delete
* ipmr_new_tunnel
* mroute_clean_tables
* ipmr_cache_find
* ipmr_cache_report
* ipmr_cache_unresolved
* ipmr_mfc_add/ipmr_mfc_delete
* ipmr_get_route
* rt_fill_info (in route.c)

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:41 -08:00
Benjamin Thery f6bb451476 netns: ipmr: declare ipmr /proc/net entries per-namespace
Declare IPv4 multicast forwarding /proc/net entries per-namespace:
/proc/net/ip_mr_vif
/proc/net/ip_mr_cache

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:41 -08:00
Benjamin Thery 6c5143dbcf netns: ipmr: declare reg_vif_num per-namespace
Preliminary work to make IPv4 multicast routing netns-aware.

Declare variable 'reg_vif_num' per-namespace, move into struct netns_ipv4.

At the moment, this variable is only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:40 -08:00
Benjamin Thery 6f9374a934 netns: ipmr: declare mroute_do_assert and mroute_do_pim per-namespace
Preliminary work to make IPv4 multicast routing netns-aware.

Declare IPv multicast routing variables 'mroute_do_assert' and
'mroute_do_pim' per-namespace in struct netns_ipv4.

At the moment, these variables are only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:40 -08:00
Benjamin Thery 1e8fb3b6a4 netns: ipmr: declare counter cache_resolve_queue_len per-namespace
Preliminary work to make IPv4 multicast routing netns-aware.

Declare variable cache_resolve_queue_len per-namespace: move it into
struct netns_ipv4.

This variable counts the number of unresolved cache entries queued in the
list mfc_unres_queue. This list is kept global to all netns as the number
of entries per namespace is limited to 10 (hardcoded in routine
ipmr_cache_unresolved).
Entries belonging to different namespaces in mfc_unres_queue will be
identified by matching the mfc_net member introduced previously in
struct mfc_cache.

Keeping this list global to all netns, also allows us to keep a single
timer (ipmr_expire_timer) to handle their expiration.
In some places cache_resolve_queue_len value was tested for arming
or deleting the timer. These tests were equivalent to testing
mfc_unres_queue value instead and are replaced in this patch.

At the moment, cache_resolve_queue_len is only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:39 -08:00
Benjamin Thery 2bb8b26c3e netns: ipmr: dynamically allocate mfc_cache_array
Preliminary work to make IPv4 multicast routing netns-aware.

Dynamically allocate IPv4 multicast forwarding cache, mfc_cache_array,
and move it to struct netns_ipv4.

At the moment, mfc_cache_array is only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:38 -08:00
Benjamin Thery 5c0a66f5f3 netns: ipmr: store netns in struct mfc_cache
This patch stores into struct mfc_cache the network namespace each
mfc_cache belongs to. The new member is mfc_net.

mfc_net is assigned at cache allocation and doesn't change during
the rest of the cache entry life.
A new net parameter is added to ipmr_cache_alloc/ipmr_cache_alloc_unres.

This will help to retrieve the current netns around the IPv4 multicast
routing code.

At the moment, all mfc_cache are allocated in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:36 -08:00
Benjamin Thery cf958ae377 netns: ipmr: dynamically allocate vif_table
Preliminary work to make IPv6 multicast routing netns-aware.

Dynamically allocate interface table vif_table and move it to
struct netns_ipv4, and update MIF_EXISTS() macro.

At the moment, vif_table is only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:34 -08:00
Benjamin Thery 70a269e6c9 netns: ipmr: allocate mroute_socket per-namespace.
Preliminary work to make IPv4 multicast routing netns-aware.

Make IPv4 multicast routing mroute_socket per-namespace,
moves it into struct netns_ipv4.

At the moment, mroute_socket is only referenced in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-22 13:57:34 -08:00
Timo Teras 749c10f931 gre: strict physical device binding
Check the device on receive path and allow otherwise identical devices
as long as the physical device differs.

This is useful for NBMA tunnels, where you want to use different gre IP
for each public IP available via different physical devices.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-21 14:34:54 -08:00
Evgeniy Polyakov a9d8f9110d inet: Allowing more than 64k connections and heavily optimize bind(0) time.
With simple extension to the binding mechanism, which allows to bind more
than 64k sockets (or smaller amount, depending on sysctl parameters),
we have to traverse the whole bind hash table to find out empty bucket.
And while it is not a problem for example for 32k connections, bind()
completion time grows exponentially (since after each successful binding
we have to traverse one bucket more to find empty one) even if we start
each time from random offset inside the hash table.

So, when hash table is full, and we want to add another socket, we have
to traverse the whole table no matter what, so effectivelly this will be
the worst case performance and it will be constant.

Attached picture shows bind() time depending on number of already bound
sockets.

Green area corresponds to the usual binding to zero port process, which
turns on kernel port selection as described above. Red area is the bind
process, when number of reuse-bound sockets is not limited by 64k (or
sysctl parameters). The same exponential growth (hidden by the green
area) before number of ports reaches sysctl limit.

At this time bind hash table has exactly one reuse-enbaled socket in a
bucket, but it is possible that they have different addresses. Actually
kernel selects the first port to try randomly, so at the beginning bind
will take roughly constant time, but with time number of port to check
after random start will increase. And that will have exponential growth,
but because of above random selection, not every next port selection
will necessary take longer time than previous. So we have to consider
the area below in the graph (if you could zoom it, you could find, that
there are many different times placed there), so area can hide another.

Blue area corresponds to the port selection optimization.

This is rather simple design approach: hashtable now maintains (unprecise
and racely updated) number of currently bound sockets, and when number
of such sockets becomes greater than predefined value (I use maximum
port range defined by sysctls), we stop traversing the whole bind hash
table and just stop at first matching bucket after random start. Above
limit roughly corresponds to the case, when bind hash table is full and
we turned on mechanism of allowing to bind more reuse-enabled sockets,
so it does not change behaviour of other sockets.

Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
Tested-by: Denys Fedoryschenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-21 14:34:31 -08:00
Herbert Xu 4e704ee3c2 gso: Ensure that the packet is long enough
When we get a GSO packet from an untrusted source, we need to
ensure that it is sufficiently long so that we don't end up
crashing.

Based on discovery and patch by Ian Campbell.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-14 20:41:12 -08:00
Willy Tarreau 33966dd0e2 tcp: splice as many packets as possible at once
As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
often.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call. With typical 1460 bytes
of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
16*1460 = 23360 bytes.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-13 16:04:36 -08:00
Patrick McHardy 71320afcdb netfilter 06/09: nf_conntrack: fix ICMP/ICMPv6 timeout sysctls on big-endian
An old bug crept back into the ICMP/ICMPv6 conntrack protocols: the timeout
values are defined as unsigned longs, the sysctl's maxsize is set to
sizeof(unsigned int). Use unsigned int for the timeout values as in the
other conntrack protocols.

Reported-by: Jean-Mickael Guerin <jean-mickael.guerin@6wind.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-12 21:18:35 -08:00
Patrick McHardy 88843104a1 netfilter 01/09: remove "happy cracking" message
Don't spam logs for locally generated short packets. these can only
be generated by root.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-12 21:18:33 -08:00
Linus Torvalds d9e8a3a5b8 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (22 commits)
  ioat: fix self test for multi-channel case
  dmaengine: bump initcall level to arch_initcall
  dmaengine: advertise all channels on a device to dma_filter_fn
  dmaengine: use idr for registering dma device numbers
  dmaengine: add a release for dma class devices and dependent infrastructure
  ioat: do not perform removal actions at shutdown
  iop-adma: enable module removal
  iop-adma: kill debug BUG_ON
  iop-adma: let devm do its job, don't duplicate free
  dmaengine: kill enum dma_state_client
  dmaengine: remove 'bigref' infrastructure
  dmaengine: kill struct dma_client and supporting infrastructure
  dmaengine: replace dma_async_client_register with dmaengine_get
  atmel-mci: convert to dma_request_channel and down-level dma_slave
  dmatest: convert to dma_request_channel
  dmaengine: introduce dma_request_channel and private channels
  net_dma: convert to dma_find_channel
  dmaengine: provide a common 'issue_pending_all' implementation
  dmaengine: centralize channel allocation, introduce dma_find_channel
  dmaengine: up-level reference counting to the module level
  ...
2009-01-09 11:52:14 -08:00
David S. Miller 7f46b1343f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-01-08 11:05:59 -08:00
Herbert Xu 684f217601 tcp6: Add GRO support
This patch adds GRO support for TCP over IPv6.  The code is exactly
the same as the IPv4 version except for the pseudo-header checksum
computation.

Note that I've removed the unused tcphdr argument from tcp_v6_check
rather than invent a bogus value for GRO.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-08 10:41:23 -08:00
James Morris ac8cc0fa53 Merge branch 'next' into for-linus 2009-01-07 09:58:22 +11:00
Dan Williams f67b459992 net_dma: convert to dma_find_channel
Use the general-purpose channel allocation provided by dmaengine.

Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-01-06 11:38:15 -07:00
Dan Williams 6f49a57aa5 dmaengine: up-level reference counting to the module level
Simply, if a client wants any dmaengine channel then prevent all dmaengine
modules from being removed.  Once the clients are done re-enable module
removal.

Why?, beyond reducing complication:
1/ Tracking reference counts per-transaction in an efficient manner, as
   is currently done, requires a complicated scheme to avoid cache-line
   bouncing effects.
2/ Per-transaction ref-counting gives the false impression that a
   dma-driver can be gracefully removed ahead of its user (net, md, or
   dma-slave)
3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but
   if such an engine were built one day we still would not need to notify
   clients of remove events.  The driver can simply return NULL to a
   ->prep() request, something that is much easier for a client to handle.

Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-01-06 11:38:14 -07:00
David S. Miller 7945cc6464 tcp: Kill extraneous SPLICE_F_NONBLOCK checks.
In splice TCP receive, the SPLICE_F_NONBLOCK flag is used
to compute the "timeo" value.  So checking it again inside
of the main receive loop to trigger -EAGAIN processing is
entirely unnecessary.

Noticed by Jarek P. and Lennert Buytenhek.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-05 00:59:00 -08:00
Lennert Buytenhek 4f7d54f59b tcp: don't mask EOF and socket errors on nonblocking splice receive
Currently, setting SPLICE_F_NONBLOCK on splice from a TCP socket
results in masking of EOF (RDHUP) and error conditions on the socket
by an -EAGAIN return.  Move the NONBLOCK check in tcp_splice_read()
to be after the EOF and error checks to fix this.

Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-05 00:00:12 -08:00
Herbert Xu b530256d2e gro: Use gso_size to store MSS
In order to allow GRO packets without frag_list at all, we need to
store the MSS in the packet itself.  The obvious place is gso_size.
The only thing to watch out for is if the packet ends up not being
GRO then we need to clear gso_size before pushing the packet into
the stack.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-04 16:13:19 -08:00
Paul Moore 6c2e8ac095 netlabel: Update kernel configuration API
Update the NetLabel kernel API to expose the new features added in kernel
releases 2.6.25 and 2.6.28: the static/fallback label functionality and network
address based selectors.

Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-12-31 12:54:11 -05:00
Herbert Xu eb4dea5853 net: Fix percpu counters deadlock
When we converted the protocol atomic counters such as the orphan
count and the total socket count deadlocks were introduced due to
the mismatch in BH status of the spots that used the percpu counter
operations.

Based on the diagnosis and patch by Peter Zijlstra, this patch
fixes these issues by disabling BH where we may be in process
context.

Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-29 23:04:08 -08:00
Rusty Russell 0f23174aa8 cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits: net
In future all cpumask ops will only be valid (in general) for bit
numbers < nr_cpu_ids.  So use that instead of NR_CPUS in iterators
and other comparisons.

This is always safe: no cpu number can be >= nr_cpu_ids, and
nr_cpu_ids is initialized to NR_CPUS at boot.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-29 22:44:47 -08:00
Linus Torvalds 0191b625ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
  net: Allow dependancies of FDDI & Tokenring to be modular.
  igb: Fix build warning when DCA is disabled.
  net: Fix warning fallout from recent NAPI interface changes.
  gro: Fix potential use after free
  sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
  sfc: When disabling the NIC, close the device rather than unregistering it
  sfc: SFT9001: Add cable diagnostics
  sfc: Add support for multiple PHY self-tests
  sfc: Merge top-level functions for self-tests
  sfc: Clean up PHY mode management in loopback self-test
  sfc: Fix unreliable link detection in some loopback modes
  sfc: Generate unique names for per-NIC workqueues
  802.3ad: use standard ethhdr instead of ad_header
  802.3ad: generalize out mac address initializer
  802.3ad: initialize ports LACPDU from const initializer
  802.3ad: remove typedef around ad_system
  802.3ad: turn ports is_individual into a bool
  802.3ad: turn ports is_enabled into a bool
  802.3ad: make ntt bool
  ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
  ...

Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
to the conversion to %pI (in this networking merge) and the addition of
doing IPv6 addresses (from the earlier merge of CIFS).
2008-12-28 12:49:40 -08:00
Herbert Xu f2712fd0b4 ipsec: Remove useless ret variable
This patch removes a useless ret variable from the IPv4 ESP/UDP
decapsulation code.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-26 01:31:18 -08:00
Herbert Xu 64ff3b938e tcp: Always set urgent pointer if it's beyond snd_nxt
Our TCP stack does not set the urgent flag if the urgent pointer
does not fit in 16 bits, i.e., if it is more than 64K from the
sequence number of a packet.

This behaviour is different from the BSDs, and clearly contradicts
the purpose of urgent mode, which is to send the notification
(though not necessarily the associated data) as soon as possible.
Our current behaviour may in fact delay the urgent notification
indefinitely if the receiver window does not open up.

Simply matching BSD however may break legacy applications which
incorrectly rely on the out-of-band delivery of urgent data, and
conversely the in-band delivery of non-urgent data.

Alexey Kuznetsov suggested a safe solution of following BSD only
if the urgent pointer itself has not yet been transmitted.  This
way we guarantee that when the remote end sees the packet with
non-urgent data marked as urgent due to wrap-around we would have
advanced the urgent pointer beyond, either to the actual urgent
data or to an as-yet untransmitted packet.

The only potential downside is that applications on the remote
end may see multiple SIGURG notifications.  However, this would
occur anyway with other TCP stacks.  More importantly, the outcome
of such a duplicate notification is likely to be harmless since
the signal itself does not carry any information other than the
fact that we're in urgent mode.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25 17:12:58 -08:00
Alexey Dobriyan 7091e728c5 netns: igmp: make /proc/net/{igmp,mcfilter} per netns
This patch makes the followinf proc entries per-netns:
/proc/net/igmp
/proc/net/mcfilter

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25 16:42:51 -08:00
Alexey Dobriyan b4ee07df3d netns: igmp: allow IPPROTO_IGMP sockets in netns
Looks like everything is already ready.

Required for ebtables(8) for one thing.

Also, required for ipmr per-netns (coming soon). (Benjamin)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-25 16:42:23 -08:00
James Morris cbacc2c7f0 Merge branch 'next' into for-linus 2008-12-25 11:40:09 +11:00
Matt Mackall 6086ebca13 tcp: Stop scaring users with "treason uncloaked!"
The original message was unhelpful and extremely alarming to our poor
users, despite its charm. Make it less frightening.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-18 22:27:42 -08:00
Ilpo Järvinen b1879204dd ipmr: merge common code
Also removes redundant skb->len < x check which can't
be true once pskb_may_pull(skb, x) succeeded.

$ diff-funcs pim_rcv ipmr.c ipmr.c pim_rcv_v1
  --- ipmr.c:pim_rcv()
  +++ ipmr.c:pim_rcv_v1()
@@ -1,22 +1,27 @@
-static int pim_rcv(struct sk_buff * skb)
+int pim_rcv_v1(struct sk_buff * skb)
 {
-	struct pimreghdr *pim;
+	struct igmphdr *pim;
 	struct iphdr   *encap;
 	struct net_device  *reg_dev = NULL;

 	if (!pskb_may_pull(skb, sizeof(*pim) + sizeof(*encap)))
 		goto drop;

-	pim = (struct pimreghdr *)skb_transport_header(skb);
-	if (pim->type != ((PIM_VERSION<<4)|(PIM_REGISTER)) ||
-	    (pim->flags&PIM_NULL_REGISTER) ||
-	    (ip_compute_csum((void *)pim, sizeof(*pim)) != 0 &&
-	     csum_fold(skb_checksum(skb, 0, skb->len, 0))))
+	pim = igmp_hdr(skb);
+
+	if (!mroute_do_pim ||
+	    skb->len < sizeof(*pim) + sizeof(*encap) ||
+	    pim->group != PIM_V1_VERSION || pim->code != PIM_V1_REGISTER)
 		goto drop;

-	/* check if the inner packet is destined to mcast group */
 	encap = (struct iphdr *)(skb_transport_header(skb) +
-				 sizeof(struct pimreghdr));
+				 sizeof(struct igmphdr));
+	/*
+	   Check that:
+	   a. packet is really destinted to a multicast group
+	   b. packet is not a NULL-REGISTER
+	   c. packet is not truncated
+	 */
 	if (!ipv4_is_multicast(encap->daddr) ||
 	    encap->tot_len == 0 ||
 	    ntohs(encap->tot_len) + sizeof(*pim) > skb->len)
@@ -40,9 +45,9 @@
 	skb->ip_summed = 0;
 	skb->pkt_type = PACKET_HOST;
 	dst_release(skb->dst);
+	skb->dst = NULL;
 	reg_dev->stats.rx_bytes += skb->len;
 	reg_dev->stats.rx_packets++;
-	skb->dst = NULL;
 	nf_reset(skb);
 	netif_rx(skb);
 	dev_put(reg_dev);

$ codiff net/ipv4/ipmr.o.old net/ipv4/ipmr.o.new

net/ipv4/ipmr.c:
  pim_rcv_v1 | -283
  pim_rcv    | -284
 2 functions changed, 567 bytes removed

net/ipv4/ipmr.c:
  __pim_rcv | +307
 1 function changed, 307 bytes added

net/ipv4/ipmr.o.new:
 3 functions changed, 307 bytes added, 567 bytes removed, diff: -260

(Tested on x86_64).

It seems that pimlen arg could be left out as well and
eq-sizedness of structs trapped with BUILD_BUG_ON but
I don't think that's more than a cosmetic flaw since there
aren't that many args anyway.

Compile tested.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 01:15:11 -08:00
Herbert Xu bf296b125b tcp: Add GRO support
This patch adds the TCP-specific portion of GRO.  The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation.  Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15 23:43:36 -08:00
Herbert Xu 73cc19f155 ipv4: Add GRO infrastructure
This patch adds GRO support for IPv4.

The criteria for merging is more stringent than LRO, in particular,
we require all fields in the IP header to be identical except for
the length, ID and checksum.  In addition, the ID must form an
arithmetic sequence with a difference of one.

The ID requirement might seem overly strict, however, most hardware
TSO solutions already obey this rule.  Linux itself also obeys this
whether GSO is in use or not.

In future we could relax this rule by storing the IDs (or rather
making sure that we don't drop them when pulling the aggregate
skb's tail).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15 23:41:09 -08:00
David S. Miller eb14f01959 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/e1000e/ich8lan.c
2008-12-15 20:03:50 -08:00
Steven Rostedt be70ed189b netfilter: update rwlock initialization for nat_table
The commit e099a17357
(netfilter: netns nat: per-netns NAT table) renamed the
nat_table from __nat_table to nat_table without updating the
__RW_LOCK_UNLOCKED(__nat_table.lock).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15 00:19:14 -08:00
Ilpo Järvinen 857a6e0a4d icsk: join error paths using goto
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-14 23:13:08 -08:00
Doug Leith 8d3a564da3 tcp: tcp_vegas cong avoid fix
This patch addresses a book-keeping issue in tcp_vegas.c.  At present
tcp_vegas does separate book-keeping of cwnd based on packet sequence
numbers.  A mismatch can develop between this book-keeping and
tp->snd_cwnd due, for example, to delayed acks acking multiple
packets.  When vegas transitions to reno operation (e.g. following
loss), then this mismatch leads to incorrect behaviour (akin to a cwnd
backoff).  This seems mostly to affect operation at low cwnds where
delayed acking can lead to a significant fraction of cwnd being
covered by a single ack, leading to the book-keeping mismatch.  This
patch modifies the congestion avoidance update to avoid the need for
separate book-keeping while leaving vegas congestion avoidance
functionally unchanged.  A secondary advantage of this modification is
that the use of fixed-point (via V_PARAM_SHIFT) and 64 bit arithmetic
is no longer necessary, simplifying the code.

Some example test measurements with the patched code (confirming no functional
change in the congestion avoidance algorithm) can be seen at:

http://www.hamilton.ie/doug/vegaspatch/

Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-09 00:13:04 -08:00
Ilpo Järvinen a2acde0771 tcp: fix tso_should_defer in 64bit
Since jiffies is unsigned long, the types get expanded into
that and after long enough time the difference will therefore
always be > 1 (and that probably happens near boot as well as
iirc the first jiffies wrap is scheduler close after boot to
find out problems related to that early).

This was originally noted by Bill Fink in Dec'07 but nobody
never ended fixing it.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:56:07 -08:00
Ilpo Järvinen d5dd9175bc tcp: use tcp_write_xmit also in tcp_push_one
tcp_minshall_update is not significant difference since it only
checks for not full-sized skb which is BUG'ed on the push_one
path anyway.

tcp_snd_test is tcp_nagle_test+tcp_cwnd_test+tcp_snd_wnd_test,
just the order changed slightly.

net/ipv4/tcp_output.c:
  tcp_snd_test              |  -89
  tcp_mss_split_point       |  -91
  tcp_may_send_now          |  +53
  tcp_cwnd_validate         |  -98
  tso_fragment              | -239
  __tcp_push_pending_frames | -1340
  tcp_push_one              | -146
 7 functions changed, 53 bytes added, 2003 bytes removed, diff: -1950

net/ipv4/tcp_output.c:
  tcp_write_xmit | +1772
 1 function changed, 1772 bytes added, diff: +1772

tcp_output.o.new:
 8 functions changed, 1825 bytes added, 2003 bytes removed, diff: -178

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:56:06 -08:00
David S. Miller 730c30ec64 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/iwlwifi/iwl-core.c
	drivers/net/wireless/iwlwifi/iwl-sta.c
2008-12-05 22:54:40 -08:00
Ilpo Järvinen 726e07a8a3 tcp: move some parts from tcp_write_xmit
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:43:56 -08:00
Ilpo Järvinen 41834b7332 tcp: share code through function, not through copy-paste. :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:43:26 -08:00
Ilpo Järvinen ee6aac5950 tcp: drop tcp_bound_rto, merge content of it tcp_set_rto
Both are called by the same sites.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:43:08 -08:00
Ilpo Järvinen 50133161a8 tcp: no need to pass prev skb around, reduces arg pressure
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:42:41 -08:00
Ilpo Järvinen a1197f5a6f tcp: introduce struct tcp_sacktag_state to reduce arg pressure
There are just too many args to some sacktag functions. This
idea was first proposed by David S. Miller around a year ago,
and the current situation is much worse that what it was back
then.

tcp_sacktag_one can be made a bit simpler by returning the
new sacked (it can be achieved with a single variable though
the previous code "caching" sacked into a local variable and
therefore it is not exactly equal but the results will be the
same).

codiff on x86_64
  tcp_sacktag_one         |  -15
  tcp_shifted_skb         |  -50
  tcp_match_skb_to_sack   |   -1
  tcp_sacktag_walk        |  -64
  tcp_sacktag_write_queue |  -59
  tcp_urg                 |   +1
  tcp_event_data_recv     |   -1
 7 functions changed, 1 bytes added, 190 bytes removed, diff: -189

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:42:22 -08:00
Ilpo Järvinen 775ffabf77 tcp: make mtu probe failure to not break gso'ed skbs unnecessarily
I noticed that since skb->len has nothing to do with actual segment
length with gso, we need to figure it out separately, reuse
a function from the recent shifting stuff (generalize it).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:41:26 -08:00
Ilpo Järvinen 9969ca5f20 tcp: Fix thinko making the not-shiftable to cover S|R as well
S|R won't result in S if just SACK is received. DSACK is
another story (but it is covered correctly already).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:41:06 -08:00
Ilpo Järvinen f0bc52f38b tcp: force mss equality with the next skb too.
Also make if-goto forest nicer looking.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-05 22:40:47 -08:00
Doug Leith a6af2d6ba5 tcp: tcp_vegas ssthresh bug fix
This patch fixes a bug in tcp_vegas.c.  At the moment this code leaves
ssthresh untouched.  However, this means that the vegas congestion
control algorithm is effectively unable to reduce cwnd below the
ssthresh value (if the vegas update lowers the cwnd below ssthresh,
then slow start is activated to raise it back up).  One example where
this matters is when during slow start cwnd overshoots the link
capacity and a flow then exits slow start with ssthresh set to a value
above where congestion avoidance would like to adjust it.

Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-04 17:17:18 -08:00
Benjamin Thery 999890b21a net: /proc/net/ip_mr_cache, display Iif as a signed short
Today, iproute2 fails to show multicast forwarding unresolved cache
entries while scanning /proc/net/ip_mr_cache.

Indeed, it expects to see -1 in 'Iif' column to identify unresolved
entries but the kernel outputs 65535. It's a signed/unsigned issue:

'Iif', the source interface, is retrieved from member mfc_parent in
struct mfc_cache. mfc_parent is a vifi_t: unsigned short, but is
displayed in ipmr_mfc_seq_show() as "%-3d", signed integer.

In unresolevd entries, the 65535 value (0xFFFF) comes from this define:
#define ALL_VIFS    ((vifi_t)(-1))

That may explains why the guy who added support for this in iproute2
thought a -1 should be expected.

I don't know if this must be fixed in kernel or in iproute2. Who is
right? What is the correct API? How was it designed originally?

I let you decide if it should goes in the kernel or be fixed in iproute2.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-03 22:22:16 -08:00
Benjamin Thery 1ea472e2de net: fix /proc/net/ip_mr_cache display - V2
/proc/net/ip_mr_cache and /proc/net/ip6_mr_cache displays garbage when
showing unresolved mfc_cache entries.

[root@qemu tests]# cat /proc/net/ip_mr_cache
Group    Origin   Iif     Pkts    Bytes    Wrong Oifs
014C00EF 010014AC 1         10    10050        0  2:1    3:1
024C00EF 010014AC 65535      514        2 -559067475

The first line is correct. It is a resolved cache entry, 10 packets used it...
The second line represents an unresolved entry, and the columns Pkts(4th),
Bytes(5th) and Wrong(6th) just show garbage.

In struct mfc_cache, there's an union to store data for resolved and
unresolved cases. And what ipmr_mfc_seq_show() is printing in these 
columns for the unresolved entries is some bytes from mfc_cache.mfc_un.res.
Bad.
(eg. In our case -559067475 is in fact 0xdead4ead which is the spinlock
magic from mfc_cache.mfc_un.unres.unresolved.lock.magic).

This patch replaces the garbage data written in these columns for the
unresolved entries by '0' (zeros) which is more correct.
This change doesn't break the ABI.

Also, mfc->mfc_un.res.pkt, mfc->mfc_un.res.bytes, mfc->mfc_un.res.wrong_if
are unsigned long.

It applies on top of net-next-2.6.

The patch for net-2.6 is slightly different because of the NIP6_FMT to
%pI6 conversion that was made in the seq_printf.

Changelog:
==========
V2:
* Instead of breaking the ABI by suppressing the columns that have no
  meaning for unresolved entries, fill them with 0 values.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-03 22:21:47 -08:00
James Morris ec98ce480a Merge branch 'master' into next
Conflicts:
	fs/nfsd/nfs4recover.c

Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().

Signed-off-by: James Morris <jmorris@namei.org>
2008-12-04 17:16:36 +11:00
Ilpo Järvinen f8269a495a tcp: make urg+gso work for real this time
I should have noticed this earlier... :-) The previous solution
to URG+GSO/TSO will cause SACK block tcp_fragment to do zig-zig
patterns, or even worse, a steep downward slope into packet
counting because each skb pcount would be truncated to pcount
of 2 and then the following fragments of the later portion would
restore the window again.

Basically this reverts "tcp: Do not use TSO/GSO when there is
urgent data" (33cf71cee1). It also removes some unnecessary code
from tcp_current_mss that didn't work as intented either (could
be that something was changed down the road, or it might have
been broken since the dawn of time) because it only works once
urg is already written while this bug shows up starting from
~64k before the urg point.

The retransmissions already are split to mss sized chunks, so
only new data sending paths need splitting in case they have
a segment otherwise suitable for gso/tso. The actually check
can be improved to be more narrow but since this is late -rc
already, I'll postpone thinking the more fine-grained things.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-03 21:24:48 -08:00
Eric Dumazet 6976a1d6c2 net: percpu_counter_inc() should not be called in BH-disabled section
Based upon a lockdep report by Alexey Dobriyan.

I checked all per_cpu_counter_xxx() usages in network tree, and I
think all call sites are BH enabled except one in
inet_csk_listen_stop().

commit dd24c00191
(net: Use a percpu_counter for orphan_count)
replaced atomic_t orphan_count to a percpu_counter.

atomic_inc()/atomic_dec() can be called from any context, while
percpu_counter_xxx() should be called from a consistent state.

For orphan_count, this context can be the BH-enabled one.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-01 23:37:17 -08:00
David S. Miller ed77a89c30 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
Conflicts:

	net/netfilter/nf_conntrack_netlink.c
2008-11-28 02:19:15 -08:00
David S. Miller 5b9ab2ec04 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/hp-plus.c
	drivers/net/wireless/ath5k/base.c
	drivers/net/wireless/ath9k/recv.c
	net/wireless/reg.c
2008-11-26 23:48:40 -08:00
Eric Dumazet dd24c00191 net: Use a percpu_counter for orphan_count
Instead of using one atomic_t per protocol, use a percpu_counter
for "orphan_count", to reduce cache line contention on
heavy duty network servers. 

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 21:17:14 -08:00
Eric Dumazet 1748376b66 net: Use a percpu_counter for sockets_allocated
Instead of using one atomic_t per protocol, use a percpu_counter
for "sockets_allocated", to reduce cache line contention on
heavy duty network servers. 

Note : We revert commit (248969ae31
net: af_unix can make unix_nr_socks visbile in /proc),
since it is not anymore used after sock_prot_inuse_add() addition

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 21:16:35 -08:00
Alexey Dobriyan 4fb236bac9 netns xfrm: AH/ESP in netns!
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:59:27 -08:00
Alexey Dobriyan fbda33b2b8 netns xfrm: ->get_saddr in netns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:56:49 -08:00
Alexey Dobriyan c5b3cf46ea netns xfrm: ->dst_lookup in netns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:51:25 -08:00
Alexey Dobriyan ddcfd79680 netns xfrm: dst garbage-collecting in netns
Pass netns pointer to struct xfrm_policy_afinfo::garbage_collect()

	[This needs more thoughts on what to do with dst_ops]
	[Currently stub to init_net]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:37:23 -08:00
Alexey Dobriyan 52479b623d netns xfrm: lookup in netns
Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns
to flow_cache_lookup() and resolver callback.

Take it from socket or netdevice. Stub DECnet to init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:35:18 -08:00
Alexey Dobriyan 221df1ed33 netns xfrm: state lookup in netns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:30:50 -08:00
Alexey Dobriyan 673c09be45 netns xfrm: add struct xfrm_state::xs_net
To avoid unnecessary complications with passing netns around.

* set once, very early after allocating
* once set, never changes

For a while create every xfrm_state in init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:15:16 -08:00
Eric Dumazet 723b46108f net: udp_unhash() can test if sk is hashed
Impact: Optimization

Like done in inet_unhash(), we can avoid taking a chain lock if
socket is not hashed in udp_unhash()

Triggered by close(socket(AF_INET, SOCK_DGRAM, 0));

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:55:15 -08:00
Ilpo Järvinen 8eecaba900 tcp: tcp_limit_reno_sacked can become static
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:45:29 -08:00
Eric Leblond 5f145e44ae netfilter: nfmark routing in OUTPUT, mangle, NFQUEUE
This patch let nfmark to be evaluated for routing decision for OUTPUT
packet, in mangle table, when process paquet in NFQUEUE
Until now, only change (in NFQUEUE process) on fields src_addr,
dest_addr and tos could make netfilter to reevalute the routing.

From: Laurent Licour <laurent@licour.com>
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-25 12:15:16 +01:00
Alexey Dobriyan fb7e06748c xfrm: remove useless forward declarations
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 01:05:54 -08:00
Alexey Dobriyan 6daad37230 ah4/ah6: remove useless NULL assignments
struct will be kfreed in a moment, so...

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 01:05:09 -08:00
Ilpo Järvinen 111cc8b913 tcp: add some mibs to track collapsing
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:27:22 -08:00
Ilpo Järvinen 92ee76b6d9 tcp: Make shifting not clear the hints
The earlier version was just very basic one which is "playing
safe" by always clearing the hints. However, clearing of a hint
is extremely costly operation with large windows, so it must be
avoided at all cost whenever possible, there is a way with
shifting too achieve not-clearing.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:26:56 -08:00
Ilpo Järvinen 832d11c5cd tcp: Try to restore large SKBs while SACK processing
During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
Then we're in problems when cleanup work for them has to be done
when a large cumulative ACK comes. Try to return back to pre-split
state already while more and more SACK info gets discovered by
combining newly discovered SACK areas with the previous skb if
that's SACKed as well.

This approach has a number of benefits:

1) The processing overhead is spread more equally over the RTT
2) Write queue has less skbs to process (affect everything
   which has to walk in the queue past the sacked areas)
3) Write queue is consistent whole the time, so no other parts
   of TCP has to be aware of this (this was not the case with
   some other approach that was, well, quite intrusive all
   around).
4) Clean_rtx_queue can release most of the pages using single
   put_page instead of previous PAGE_SIZE/mss+1 calls

In case a hole is fully filled by the new SACK block, we attempt
to combine the next skb too which allows construction of skbs
that are even larger than what tso split them to and it handles
hole per on every nth patterns that often occur during slow start
overshoot pretty nicely. Though this to be really useful also
a retransmission would have to get lost since cumulative ACKs
advance one hole at a time in the most typical case.

TODO: handle upwards only merging. That should be rather easy
when segment is fully sacked but I'm leaving that as future
work item (it won't make very large difference anyway since
this current approach already covers quite a lot of normal
cases).

I was earlier thinking of some sophisticated way of tracking
timestamps of the first and the last segment but later on
realized that it won't be that necessary at all to store the
timestamp of the last segment. The cases that can occur are
basically either:
  1) ambiguous => no sensible measurement can be taken anyway
  2) non-ambiguous is due to reordering => having the timestamp
     of the last segment there is just skewing things more off
     than does some good since the ack got triggered by one of
     the holes (besides some substle issues that would make
     determining right hole/skb even harder problem). Anyway,
     it has nothing to do with this change then.

I choose to route some abnormal looking cases with goto noop,
some could be handled differently (eg., by stopping the
walking at that skb but again). In general, they either
shouldn't happen at all or are rare enough to make no difference
in practice.

In theory this change (as whole) could cause some macroscale
regression (global) because of cache misses that are taken over
the round-trip time but it gets very likely better because of much
less (local) cache misses per other write queue walkers and the
big recovery clearing cumulative ack.

Worth to note that these benefits would be very easy to get also
without TSO/GSO being on as long as the data is in pages so that
we can merge them. Currently I won't let that happen because
DSACK splitting at fragment that would mess up pcounts due to
sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
avoided, we have some conditions that can be made less strict.

TODO: I will probably have to convert the excessive pointer
passing to struct sacktag_state... :-)

My testing revealed that considerable amount of skbs couldn't
be shifted because they were cloned (most likely still awaiting
tx reclaim)...

[The rest is considering future work instead since I got
repeatably EFAULT to tcpdump's recvfrom when I added
pskb_expand_head to deal with clones, so I separated that
into another, later patch]

...To counter that, I gave up on the fifth advantage:

5) When growing previous SACK block, less allocs for new skbs
   are done, basically a new alloc is needed only when new hole
   is detected and when the previous skb runs out of frags space

...which now only happens of if reclaim is fast enough to dispose
the clone before the SACK block comes in (the window is RTT long),
otherwise we'll have to alloc some.

With clones being handled I got these numbers (will be somewhat
worse without that), taken with fine-grained mibs:

                  TCPSackShifted 398
                   TCPSackMerged 877
            TCPSackShiftFallback 320
      TCPSACKCOLLAPSEFALLBACKGSO 0
  TCPSACKCOLLAPSEFALLBACKSKBBITS 0
  TCPSACKCOLLAPSEFALLBACKSKBDATA 0
    TCPSACKCOLLAPSEFALLBACKBELOW 0
    TCPSACKCOLLAPSEFALLBACKFIRST 1
 TCPSACKCOLLAPSEFALLBACKPREVBITS 318
      TCPSACKCOLLAPSEFALLBACKMSS 1
   TCPSACKCOLLAPSEFALLBACKNOHEAD 0
    TCPSACKCOLLAPSEFALLBACKSHIFT 0
          TCPSACKCOLLAPSENOOPSEQ 0
  TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
     TCPSACKCOLLAPSENOOPSMALLLEN 0
             TCPSACKCOLLAPSEHOLE 12

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:20:15 -08:00
Ilpo Järvinen f58b22fd3c tcp: make tcp_sacktag_one able to handle partial skb too
This is preparatory work for SACK combiner patch which may
have to count TCP state changes for only a part of the skb
because it will intentionally avoids splitting skb to SACKed
and not sacked parts.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:14:43 -08:00
Ilpo Järvinen adb92db857 tcp: Make SACK code to split only at mss boundaries
Sadly enough, this adds possible divide though we try to avoid
it by checking one mss as common case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:13:50 -08:00
Ilpo Järvinen e8bae275d9 tcp: more aggressive skipping
I knew already when rewriting the sacktag that this condition
was too conservative, change it now since it prevent lot of
useless work (especially in the sack shifter decision code
that is being added by a later patch). This shouldn't change
anything really, just save some processing regardless of the
shifter.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:12:28 -08:00
Ilpo Järvinen e1aa680fa4 tcp: move tcp_simple_retransmit to tcp_input
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:11:55 -08:00
Ilpo Järvinen 4a17fc3add tcp: collapse more than two on retransmission
I always had thought that collapsing up to two at a time was
intentional decision to avoid excessive processing if 1 byte
sized skbs are to be combined for a full mtu, and consecutive
retransmissions would make the size of the retransmittee
double each round anyway, but some recent discussion made me
to understand that was not the case. Thus make collapse work
more and wait less.

It would be possible to take advantage of the shifting
machinery (added in the later patch) in the case of paged
data but that can be implemented on top of this change.

tcp_skb_is_last check is now provided by the loop.

I tested a bit (ss-after-idle-off, fill 4096x4096B xfer,
10s sleep + 4096 x 1byte writes while dropping them for
some a while with netem):

. 16774097:16775545(1448) ack 1 win 46
. 16775545:16776993(1448) ack 1 win 46
. ack 16759617 win 2399
P 16776993:16777217(224) ack 1 win 46
. ack 16762513 win 2399
. ack 16765409 win 2399
. ack 16768305 win 2399
. ack 16771201 win 2399
. ack 16774097 win 2399
. ack 16776993 win 2399
. ack 16777217 win 2399
P 16777217:16777257(40) ack 1 win 46
. ack 16777257 win 2399
P 16777257:16778705(1448) ack 1 win 46
P 16778705:16780153(1448) ack 1 win 46
FP 16780153:16781313(1160) ack 1 win 46
. ack 16778705 win 2399
. ack 16780153 win 2399
F 1:1(0) ack 16781314 win 2399

While without drop-all period I get this:

. 16773585:16775033(1448) ack 1 win 46
. ack 16764897 win 9367
. ack 16767793 win 9367
. ack 16770689 win 9367
. ack 16773585 win 9367
. 16775033:16776481(1448) ack 1 win 46
P 16776481:16777217(736) ack 1 win 46
. ack 16776481 win 9367
. ack 16777217 win 9367
P 16777217:16777218(1) ack 1 win 46
P 16777218:16777219(1) ack 1 win 46
P 16777219:16777220(1) ack 1 win 46
  ...
P 16777247:16777248(1) ack 1 win 46
. ack 16777218 win 9367
. ack 16777219 win 9367
  ...
. ack 16777233 win 9367
. ack 16777248 win 9367
P 16777248:16778696(1448) ack 1 win 46
P 16778696:16780144(1448) ack 1 win 46
FP 16780144:16781313(1169) ack 1 win 46
. ack 16780144 win 9367
F 1:1(0) ack 16781314 win 9367

The window seems to be 30-40 segments, which were successfully
combined into: P 16777217:16777257(40) ack 1 win 46

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:03:43 -08:00
Eric Dumazet a21bba9454 net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames()
We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_push_pending_frames() steal the refcount its
callers had to take when filling inet->cork.dst.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 16:07:50 -08:00
Eric Dumazet 2e77d89b2f net: avoid a pair of dst_hold()/dst_release() in ip_append_data()
We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_append_data() eventually steal the refcount its
callers had to take on the dst entry.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 15:52:46 -08:00
Eric Dumazet 920de804bc net: Make sure BHs are disabled in sock_prot_inuse_add()
The rule of calling sock_prot_inuse_add() is that BHs must
be disabled.  Some new calls were added where this was not
true and this tiggers warnings as reported by Ilpo.

Fix this by adding explicit BH disabling around those call sites,
or moving sock_prot_inuse_add() call inside an existing BH disabled
section.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 00:09:29 -08:00
Alexey Dobriyan be77e59307 net: fix tunnels in netns after ndo_ changes
dev_net_set() should be the very first thing after alloc_netdev().

"ndo_" changes turned simple assignment (which is OK to do before netns
assignment) into quite non-trivial operation (which is not OK, init_net was
used). This leads to incomplete initialisation of tunnel device in netns.

BUG: unable to handle kernel NULL pointer dereference at 00000004
IP: [<c02efdb5>] ip6_tnl_exit_net+0x37/0x4f
*pde = 00000000 
Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
last sysfs file: /sys/class/net/lo/operstate

Pid: 10, comm: netns Not tainted (2.6.28-rc6 #1) 
EIP: 0060:[<c02efdb5>] EFLAGS: 00010246 CPU: 0
EIP is at ip6_tnl_exit_net+0x37/0x4f
EAX: 00000000 EBX: 00000020 ECX: 00000000 EDX: 00000003
ESI: c5caef30 EDI: c782bbe8 EBP: c7909f50 ESP: c7909f48
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process netns (pid: 10, ti=c7908000 task=c7905780 task.ti=c7908000)
Stack:
 c03e75e0 c7390bc8 c7909f60 c0245448 c7390bd8 c7390bf0 c7909fa8 c012577a
 00000000 00000002 00000000 c0125736 c782bbe8 c7909f90 c0308fe3 c782bc04
 c7390bd4 c0245406 c084b718 c04f0770 c03ad785 c782bbe8 c782bc04 c782bc0c
Call Trace:
 [<c0245448>] ? cleanup_net+0x42/0x82
 [<c012577a>] ? run_workqueue+0xd6/0x1ae
 [<c0125736>] ? run_workqueue+0x92/0x1ae
 [<c0308fe3>] ? schedule+0x275/0x285
 [<c0245406>] ? cleanup_net+0x0/0x82
 [<c0125ae1>] ? worker_thread+0x81/0x8d
 [<c0128344>] ? autoremove_wake_function+0x0/0x33
 [<c0125a60>] ? worker_thread+0x0/0x8d
 [<c012815c>] ? kthread+0x39/0x5e
 [<c0128123>] ? kthread+0x0/0x5e
 [<c0103b9f>] ? kernel_thread_helper+0x7/0x10
Code: db e8 05 ff ff ff 89 c6 e8 dc 04 f6 ff eb 08 8b 40 04 e8 38 89 f5 ff 8b 44 9e 04 85 c0 75 f0 43 83 fb 20 75 f2 8b 86 84 00 00 00 <8b> 40 04 e8 1c 89 f5 ff e8 98 04 f6 ff 89 f0 e8 f8 63 e6 ff 5b 
EIP: [<c02efdb5>] ip6_tnl_exit_net+0x37/0x4f SS:ESP 0068:c7909f48
---[ end trace 6c2f2328fccd3e0c ]---

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 17:26:26 -08:00
Eric Dumazet c25eb3bfb9 net: Convert TCP/DCCP listening hash tables to use RCU
This is the last step to be able to perform full RCU lookups
in __inet_lookup() : After established/timewait tables, we
add RCU lookups to listening hash table.

The only trick here is that a socket of a given type (TCP ipv4,
TCP ipv6, ...) can now flight between two different tables
(established and listening) during a RCU grace period, so we
must use different 'nulls' end-of-chain values for two tables.

We define a large value :

#define LISTENING_NULLS_BASE (1U << 29)

So that slots in listening table are guaranteed to have different
end-of-chain values than slots in established table. A reader can
still detect it finished its lookup in the right chain.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 17:22:55 -08:00
Eric Dumazet 04f258ce7f net: some optimizations in af_inet
1) Use eq_net() in inet_netns_ok() to speedup socket creation if
   !CONFIG_NET_NS

2) Reorder the tests about inet_ehash_secret generation (once only)
   Use the unlikely() macro when testing if inet_ehash_secret already
   generated.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 15:42:23 -08:00
Petr Tesarik 33cf71cee1 tcp: Do not use TSO/GSO when there is urgent data
This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12014

Since most (if not all) implementations of TSO and even the in-kernel
software GSO do not update the urgent pointer when splitting a large
segment, it is necessary to turn off TSO/GSO for all outgoing traffic
with the URG pointer set.

Looking at tcp_current_mss (and the preceding comment) I even think
this was the original intention. However, this approach is insufficient,
because TSO/GSO is turned off only for newly created frames, not for
frames which were already pending at the arrival of a message with
MSG_OOB set. These frames were created when TSO/GSO was enabled,
so they may be large, and they will have the urgent pointer set
in tcp_transmit_skb().

With this patch, such large packets will be fragmented again before
going to the transmit routine.

As a side note, at least the following NICs are known to screw up
the urgent pointer in the TCP header when doing TSO:

	Intel 82566MM (PCI ID 8086:1049)
	Intel 82566DC (PCI ID 8086:104b)
	Intel 82541GI (PCI ID 8086:1076)
	Broadcom NetXtreme II BCM5708 (PCI ID 14e4:164c)

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-21 16:42:58 -08:00
David S. Miller 7e3aab4a9c inet_diag: Missed conversion after changing inet ehash lockl to spinlocks.
They are no longer a rwlocks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-21 16:39:19 -08:00
Eric Dumazet 9db66bdcc8 net: convert TCP/DCCP ehash rwlocks to spinlocks
Now TCP & DCCP use RCU lookups, we can convert ehash rwlocks to spinlocks.

/proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'.

This should speedup writers, since spin_lock()/spin_unlock()
only use one atomic operation instead of two for write_lock()/write_unlock()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 20:39:09 -08:00
Stephen Hemminger b8c26a33c8 ipgre: convert to netdevice_ops
Convert ipgre tunnel to netdevice ops.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 20:34:29 -08:00
Stephen Hemminger 23a12b1471 ipip: convert to net_device_ops
Convert to network device ops. Needed to change to directly call
the init routine since two sides share same ops.  In the process
found by inspection a device ref count leak if register_netdevice failed.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 20:33:21 -08:00
Stephen Hemminger 007c3838d9 ipmr: convert ipmr virtual interface to net_device_ops
Convert to new network device ops interface.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 20:28:35 -08:00
David S. Miller 6ab33d5171 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/ixgbe/ixgbe_main.c
	include/net/mac80211.h
	net/phonet/af_phonet.c
2008-11-20 16:44:00 -08:00
Harvey Harrison 21d1a161f6 net: ip_sockglue.c add static, annotate ports' endianness
Fixes sparse warnings:
net/ipv4/ip_sockglue.c:146:15: warning: incorrect type in assignment (different base types)
net/ipv4/ip_sockglue.c:146:15:    expected restricted __be16 [assigned] [usertype] sin_port
net/ipv4/ip_sockglue.c:146:15:    got unsigned short [unsigned] [short] [usertype] <noident>
net/ipv4/ip_sockglue.c:130:6: warning: symbol 'ip_cmsg_recv_dstaddr' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 01:54:27 -08:00
Balazs Scheidler c828384582 TPROXY: supply a struct flowi->flags argument in inet_sk_rebuild_header()
inet_sk_rebuild_header() does a new route lookup if the dst_entry
    associated with a socket becomes stale. However inet_sk_rebuild_header()
    didn't use struct flowi->flags, causing the route lookup to
    fail for foreign-bound IP_TRANSPARENT sockets, causing an error
    state to be set for the sockets in question.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 01:08:06 -08:00
Balazs Scheidler a134f85c13 TPROXY: fill struct flowi->flags in udp_sendmsg()
udp_sendmsg() didn't fill struct flowi->flags, which means that
    the route lookup would fail for non-local IPs even if the
    IP_TRANSPARENT sockopt was set.

    This prevents sendto() to work properly for UDP sockets, whereas
    bind(foreign-ip) + connect() + send() worked fine.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 01:07:24 -08:00
Eric Dumazet 5caea4ea70 net: listening_hash get a spinlock per bucket
This patch prepares RCU migration of listening_hash table for
TCP/DCCP protocols.

listening_hash table being small (32 slots per protocol), we add
a spinlock for each slot, instead of a single rwlock for whole table.

This should reduce hold time of readers, and writers concurrency.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20 00:40:07 -08:00
Stephen Hemminger 5bc3eb7e2f ip: convert to net_device_ops for ioctl
Convert to net_device_ops function table pointer for ioctl.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-19 22:42:41 -08:00
Joe Perches 07f0757a68 include/net net/ - csum_partial - remove unnecessary casts
The first argument to csum_partial is const void *
casts to char/u8 * are not necessary

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-19 15:44:53 -08:00
Eric Dumazet a7a0d6a87b net: inet_diag_handler structs can be const
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-19 15:43:27 -08:00
Benjamin Thery c3e388964b net: fix ip_mr_init() error path
Similarly to IPv6 ip6_mr_init() (fixed last week), the order of cleanup
operations in the error/exit section of ip_mr_init() is completely 
inversed. It should be the other way around.
Also a del_timer() is missing in the error path.

I should have guessed last week that this same error existed in ipmr.c
too, as ip6mr.c is largely inspired by ipmr.c.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-19 14:07:41 -08:00
David S. Miller 198d6ba4d7 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/isdn/i4l/isdn_net.c
	fs/cifs/connect.c
2008-11-18 23:38:23 -08:00
James Morris f3a5c54701 Merge branch 'master' into next
Conflicts:
	fs/cifs/misc.c

Merge to resolve above, per the patch below.

Signed-off-by: James Morris <jmorris@namei.org>

diff --cc fs/cifs/misc.c
index ec36410,addd1dc..0000000
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@@ -347,13 -338,13 +338,13 @@@ header_assemble(struct smb_hdr *buffer
  		/*  BB Add support for establishing new tCon and SMB Session  */
  		/*      with userid/password pairs found on the smb session   */
  		/*	for other target tcp/ip addresses 		BB    */
 -				if (current->fsuid != treeCon->ses->linux_uid) {
 +				if (current_fsuid() != treeCon->ses->linux_uid) {
  					cFYI(1, ("Multiuser mode and UID "
  						 "did not match tcon uid"));
- 					read_lock(&GlobalSMBSeslock);
- 					list_for_each(temp_item, &GlobalSMBSessionList) {
- 						ses = list_entry(temp_item, struct cifsSesInfo, cifsSessionList);
+ 					read_lock(&cifs_tcp_ses_lock);
+ 					list_for_each(temp_item, &treeCon->ses->server->smb_ses_list) {
+ 						ses = list_entry(temp_item, struct cifsSesInfo, smb_ses_list);
 -						if (ses->linux_uid == current->fsuid) {
 +						if (ses->linux_uid == current_fsuid()) {
  							if (ses->server == treeCon->ses->server) {
  								cFYI(1, ("found matching uid substitute right smb_uid"));
  								buffer->Uid = ses->Suid;
2008-11-18 18:52:37 +11:00
Eric Dumazet 3ab5aee7fe net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls
RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
  price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:40:17 -08:00
Eric Dumazet 88ab1932ea udp: Use hlist_nulls in UDP RCU code
This is a straightforward patch, using hlist_nulls infrastructure.

RCUification already done on UDP two weeks ago.

Using hlist_nulls permits us to avoid some memory barriers, both
at lookup time and delete time.

Patch is large because it adds new macros to include/net/sock.h.
These macros will be used by TCP & DCCP in next patch.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:39:21 -08:00
Balazs Scheidler e8b2dfe9b4 TPROXY: implemented IP_RECVORIGDSTADDR socket option
In case UDP traffic is redirected to a local UDP socket,
the originally addressed destination address/port
cannot be recovered with the in-kernel tproxy.

This patch adds an IP_RECVORIGDSTADDR sockopt that enables
a IP_ORIGDSTADDR ancillary message in recvmsg(). This
ancillary message contains the original destination address/port
of the packet being received.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:32:39 -08:00
Ben Greear 8164f1b797 ipv4: Fix ARP behavior with many mac-vlans
Ben Greear wrote:
> I have 500 mac-vlans on a system talking to 500 other
> mac-vlans.  My problem is that the arp-table gets extremely
> huge because every time an arp-request comes in on all mac-vlans,
> a stale arp entry is added for each mac-vlan.  I have filtering
> turned on, but that doesn't help because the neigh_event_ns call
> below will cause a stale neighbor entry to be created regardless
> of whether a replay will be sent or not.
> Maybe the neigh_event code should be below the checks for dont_send,
> and only create check neigh_event_ns if we are !dont_send?

The attached patch makes it work much better for me.  The patch
will cause the code to NOT create a stale neighbor entry if we
are not going to respond to the ARP request.  The old code
*would* create a stale entry even if we are not going to respond.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-16 19:19:38 -08:00
James Morris 2b82892565 Merge branch 'master' into next
Conflicts:
	security/keys/internal.h
	security/keys/process_keys.c
	security/keys/request_key.c

Fixed conflicts above by using the non 'tsk' versions.

Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 11:29:12 +11:00
David Howells d76b0d9b2d CRED: Use creds in file structs
Attach creds to file structs and discard f_uid/f_gid.

file_operations::open() methods (such as hppfs_open()) should use file->f_cred
rather than current_cred().  At the moment file->f_cred will be current_cred()
at this point.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
2008-11-14 10:39:25 +11:00
Alexey Dobriyan 9c0188acf6 net: shy netns_ok check
Failure to pass netns_ok check is SILENT, except some MIB counter is
incremented somewhere.

And adding "netns_ok = 1" (after long head-scratching session) is
usually the last step in making some protocol netns-ready...

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 23:23:51 -08:00
Doug Leith 8f65b5354b tcp_htcp: last_cong bug fix
This patch fixes a minor bug in tcp_htcp.c which has been
highlighted by Lachlan Andrew and Lawrence Stewart.  Currently, the
time since the last congestion event, which is stored in variable
last_cong, is reset whenever there is a state change into
TCP_CA_Open.  This includes transitions of the type
TCP_CA_Open->TCP_CA_Disorder->TCP_CA_Open which are not associated
with backoff of cwnd.  The patch changes last_cong to be updated
only on transitions into TCP_CA_Open that occur after experiencing
the congestion-related states TCP_CA_Loss, TCP_CA_Recovery,
TCP_CA_CWR.

Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 01:41:09 -08:00
Eric Dumazet 7a9546ee35 net: ib_net pointer should depends on CONFIG_NET_NS
We can shrink size of "struct inet_bind_bucket" by 50%, using
read_pnet() and write_pnet()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-12 00:54:20 -08:00
Alexey Dobriyan 6bb3ce25d0 net: remove struct dst_entry::entry_size
Unused after kmem_cache_zalloc() conversion.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-11 17:25:22 -08:00
David S. Miller 7e452baf6b Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/message/fusion/mptlan.c
	drivers/net/sfc/ethtool.c
	net/mac80211/debugfs_sta.c
2008-11-11 15:43:02 -08:00
Eric Dumazet b971e7ac83 net: fix /proc/net/snmp as memory corruptor
icmpmsg_put() can happily corrupt kernel memory, using a static
table and forgetting to reset an array index in a loop.

Remove the static array since its not safe without proper locking.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-10 21:43:08 -08:00
David S. Miller 9eeda9abd1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/ath5k/base.c
	net/8021q/vlan_core.c
2008-11-06 22:43:03 -08:00
David S. Miller 518a09ef11 tcp: Fix recvmsg MSG_PEEK influence of blocking behavior.
Vito Caputo noticed that tcp_recvmsg() returns immediately from
partial reads when MSG_PEEK is used.  In particular, this means that
SO_RCVLOWAT is not respected.

Simply remove the test.  And this matches the behavior of several
other systems, including BSD.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 03:36:01 -08:00
Andreas Steffen 79654a7698 xfrm: Have af-specific init_tempsel() initialize family field of temporary selector
While adding MIGRATE support to strongSwan, Andreas Steffen noticed that
the selectors provided in XFRM_MSG_ACQUIRE have their family field
uninitialized (those in MIGRATE do have their family set).

Looking at the code, this is because the af-specific init_tempsel()
(called via afinfo->init_tempsel() in xfrm_init_tempsel()) do not set
the value.

Reported-by: Andreas Steffen <andreas.steffen@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
2008-11-04 14:49:19 -08:00
Simon Arlott 6e3354c1e9 netfilter: nf_nat: remove warn_if_extra_mangle
In net/ipv4/netfilter/nf_nat_rule.c, the function warn_if_extra_mangle was added
in commit 5b1158e909 (2006-12-02). I have a DNAT
target in the OUTPUT chain than changes connections with dst 2.0.0.1 to another
address which I'll substitute with 66.102.9.99 below.

On every boot I get the following message:
[  146.252505] NAT: no longer support implicit source local NAT
[  146.252517] NAT: packet src 66.102.9.99 -> dst 2.0.0.1

As far as I can tell from reading the function doing this, it should warn if the
source IP for the route to 66.102.9.99 is different from 2.0.0.1 but that is not
the case. It doesn't make sense to check the DNAT target against the local route
source.

Either the function should be changed to correctly check the route, or it should
be removed entirely as it's been nearly 2 years since it was added.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-04 14:35:39 +01:00
Alexey Dobriyan 19223f26d9 netfilter: arptable_filter: merge forward hook
It's identical to NF_ARP_IN hook.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-04 14:22:13 +01:00
Alexey Dobriyan d4ec52bae7 netfilter: netns-aware ipt_addrtype
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-04 14:21:48 +01:00
Alexey Dobriyan 6d9f239a1e net: '&' redux
I want to compile out proc_* and sysctl_* handlers totally and
stub them to NULL depending on config options, however usage of &
will prevent this, since taking adress of NULL pointer will break
compilation.

So, drop & in front of every ->proc_handler and every ->strategy
handler, it was never needed in fact.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 18:21:05 -08:00
Jianjun Kong 5799de0b12 net: clean up net/ipv4/tcp_ipv4.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:49:10 -08:00
Jianjun Kong 539afedfcc net: clean up net/ipv4/devinet.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:48:48 -08:00
Jianjun Kong f4cca7ffb2 net: clean up net/ipv4/pararp.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:48:14 -08:00
Jianjun Kong fd3f8c4cb6 net: clean up net/ipv4/ip_fragment.c tcp_timer.c ip_input.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:47:38 -08:00
Jianjun Kong c354e12463 net: clean up net/ipv4/ipmr.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 00:28:02 -08:00
Jianjun Kong 09cb105ea7 net: clean up net/ipv4/ip_sockglue.c tcp_output.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 00:27:11 -08:00
Jianjun Kong a7e9ff735b net: clean up net/ipv4/igmp.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 00:26:09 -08:00
Jianjun Kong 6ed2533e55 net: clean up net/ipv4/fib_frontend.c fib_hash.c ip_gre.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 00:25:16 -08:00
Jianjun Kong 5a5f3a8db9 net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c xfrm4_policy.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 00:24:34 -08:00
Jianjun Kong d9319100c1 net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 00:23:42 -08:00
Sangtae Ha ae27e98a51 [TCP] CUBIC v2.3
Signed-off-by: Sangtae Ha <sha2@ncsu.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-02 00:28:10 -07:00
Eric Dumazet 920a46115c udp: multicast packets need to check namespace
Current UDP multicast delivery is not namespace aware.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-01 21:22:23 -07:00
Eric Dumazet c37ccc0d4e udp: add a missing smp_wmb() in udp_lib_get_port()
Corey Minyard spotted a missing memory barrier in udp_lib_get_port()

We need to make sure a reader cannot read the new 'sk->sk_next' value
and previous value of 'sk->sk_hash'. Or else, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
before the move, 'next' pointer is NULL, and lockless reader can
not detect it missed following items in original chain.

This patch is temporary, since we expect an upcoming patch
to introduce another way of handling the problem.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-01 21:19:18 -07:00
Harvey Harrison 673d57e723 net: replace NIPQUAD() in net/ipv4/ net/ipv6/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:53:57 -07:00
Harvey Harrison cffee385d7 net: replace NIPQUAD() in net/ipv4/netfilter/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:53:08 -07:00
David S. Miller a1744d3bee Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/p54/p54common.c
2008-10-31 00:17:34 -07:00
Eric Dumazet c8db3fec5b udp: Should use spin_lock_bh()/spin_unlock_bh() in udp_lib_unhash()
Spotted by Alexander Beregalov

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-30 14:00:53 -07:00
roel kluin 00af5c6959 cipso: unsigned buf_len cannot be negative
unsigned buf_len cannot be negative

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-10-29 15:55:53 -04:00
Harvey Harrison 5b095d9892 net: replace %p6 with %pI6
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 12:52:50 -07:00
Eric Dumazet 96631ed16c udp: introduce sk_for_each_rcu_safenext()
Corey Minyard found a race added in commit 271b72c7fa
(udp: RCU handling for Unicast packets.)

 "If the socket is moved from one list to another list in-between the
 time the hash is calculated and the next field is accessed, and the
 socket has moved to the end of the new list, the traversal will not
 complete properly on the list it should have, since the socket will
 be on the end of the new list and there's not a way to tell it's on a
 new list and restart the list traversal.  I think that this can be
 solved by pre-fetching the "next" field (with proper barriers) before
 checking the hash."

This patch corrects this problem, introducing a new
sk_for_each_rcu_safenext() macro.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 11:19:58 -07:00
Eric Dumazet f52b5054ec udp: udp_get_next() should use spin_unlock_bh()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 11:19:11 -07:00
Eric Dumazet 8203efb3c6 udp: calculate udp_mem based on low memory instead of all memory
This patch mimics commit 57413ebc4e
(tcp: calculate tcp_mem based on low memory instead of all memory)

The udp_mem array which contains limits on the total amount of memory
used by UDP sockets is calculated based on nr_all_pages.  On a 32 bits
x86 system, we should base this on the number of lowmem pages.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 02:32:32 -07:00
Eric Dumazet 271b72c7fa udp: RCU handling for Unicast packets.
Goals are :

1) Optimizing handling of incoming Unicast UDP frames, so that no memory
 writes should happen in the fast path.

 Note: Multicasts and broadcasts still will need to take a lock,
 because doing a full lockless lookup in this case is difficult.

2) No expensive operations in the socket bind/unhash phases :
  - No expensive synchronize_rcu() calls.

  - No added rcu_head in socket structure, increasing memory needs,
  but more important, forcing us to use call_rcu() calls,
  that have the bad property of making sockets structure cold.
  (rcu grace period between socket freeing and its potential reuse
   make this socket being cold in CPU cache).
  David did a previous patch using call_rcu() and noticed a 20%
  impact on TCP connection rates.
  Quoting Cristopher Lameter :
   "Right. That results in cacheline cooldown. You'd want to recycle
    the object as they are cache hot on a per cpu basis. That is screwed
    up by the delayed regular rcu processing. We have seen multiple
    regressions due to cacheline cooldown.
    The only choice in cacheline hot sensitive areas is to deal with the
    complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

  - Because udp sockets are allocated from dedicated kmem_cache,
  use of SLAB_DESTROY_BY_RCU can help here.

Theory of operation :
---------------------

As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.

Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.

In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the begining. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we dont want to send same message several times to the same socket.

We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 02:11:14 -07:00
Eric Dumazet 645ca708f9 udp: introduce struct udp_table and multiple spinlocks
UDP sockets are hashed in a 128 slots hash table.

This hash table is protected by *one* rwlock.

This rwlock is readlocked each time an incoming UDP message is handled.

This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (close time)

This is not scalable on SMP machines :

1) Even in read mode, lock() and unlock() are atomic operations and
 must dirty a contended cache line, shared by all cpus.

2) A writer might be starved if many readers are 'in flight'. This can
 happen on a machine with some NIC receiving many UDP messages. User
 process can be delayed a long time at socket creation/dismantle time.

This patch prepares RCU migration, by introducing 'struct udp_table
and struct udp_hslot', and using one spinlock per chain, to reduce
contention on central rwlock.

Introducing one spinlock per chain reduces latencies, for port
randomization on heavily loaded UDP servers. This also speedup
bindings to specific ports.

udp_lib_unhash() was uninlined, becoming to big.

Some cleanups were done to ease review of following patch
(RCUification of UDP Unicast lookups)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 01:41:45 -07:00
Harvey Harrison 0c6ce78abf net: replace uses of NIP6_FMT with %p6
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-28 23:02:31 -07:00
Alexey Dobriyan 93adcc80f3 net: don't use INIT_RCU_HEAD
call_rcu() will unconditionally rewrite RCU head anyway.
Applies to 
	struct neigh_parms
	struct neigh_table
	struct net
	struct cipso_v4_doi
	struct in_ifaddr
	struct in_device
	rt->u.dst

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-28 13:25:09 -07:00
Alexey Dobriyan def8b4faff net: reduce structures when XFRM=n
ifdef out
* struct sk_buff::sp		(pointer)
* struct dst_entry::xfrm	(pointer)
* struct sock::sk_policy	(2 pointers)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-28 13:24:06 -07:00
Neil Horman 1080d709fb net: implement emergency route cache rebulds when gc_elasticity is exceeded
This is a patch to provide on demand route cache rebuilding.  Currently, our
route cache is rebulid periodically regardless of need.  This introduced
unneeded periodic latency.  This patch offers a better approach.  Using code
provided by Eric Dumazet, we compute the standard deviation of the average hash
bucket chain length while running rt_check_expire.  Should any given chain
length grow to larger that average plus 4 standard deviations, we trigger an
emergency hash table rebuild for that net namespace.  This allows for the common
case in which chains are well behaved and do not grow unevenly to not incur any
latency at all, while those systems (which may be being maliciously attacked),
only rebuild when the attack is detected.  This patch take 2 other factors into
account:
1) chains with multiple entries that differ by attributes that do not affect the
hash value are only counted once, so as not to unduly bias system to rebuilding
if features like QOS are heavily used
2) if rebuilding crosses a certain threshold (which is adjustable via the added
sysctl in this patch), route caching is disabled entirely for that net
namespace, since constant rebuilding is less efficient that no caching at all

Tested successfully by me.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-27 17:06:14 -07:00
Florian Westphal 8b5f12d04b syncookies: fix inclusion of tcp options in syn-ack
David Miller noticed that commit
33ad798c92 '(tcp: options clean up')
did not move the req->cookie_ts check.
This essentially disabled commit 4dfc281702
'[Syncookies]: Add support for TCP options via timestamps.'.

This restores the original logic.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-26 23:10:12 -07:00
Ilpo Järvinen fd6149d332 tcp: Restore ordering of TCP options for the sake of inter-operability
This is not our bug! Sadly some devices cannot cope with the change
of TCP option ordering which was a result of the recent rewrite of
the option code (not that there was some particular reason steming
from the rewrite for the reordering) though any ordering of TCP
options is perfectly legal. Thus we restore the original ordering
to allow interoperability with/through such broken devices and add
some warning about this trap. Since the reordering just happened
without any particular reason, this change shouldn't cost us
anything.

There are already couple of known failure reports (within close
proximity of the last release), so the problem might be more
wide-spread than a single device. And other reports which may
be due to the same problem though the symptoms were less obvious.
Analysis of one of the case revealed (with very high probability)
that sack capability cannot be negotiated as the first option
(SYN never got a response).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Aldo Maggi <sentiniate@tiscali.it>
Tested-by: Aldo Maggi <sentiniate@tiscali.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-23 14:06:35 -07:00
Ilpo Järvinen 75e3d8db53 tcp: should use number of sack blocks instead of -1
While looking for the recent "sack issue" I also read all eff_sacks
usage that was played around by some relevant commit. I found
out that there's another thing that is asking for a fix (unrelated
to the "sack issue" though).

This feature has probably very little significance in practice.
Opposite direction timeout with bidirectional tcp comes to me as
the most likely scenario though there might be other cases as
well related to non-data segments we send (e.g., response to the
opposite direction segment). Also some ACK losses or option space
wasted for other purposes is necessary to prevent the earlier
SACK feedback getting to the sender.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-21 16:28:36 -07:00
Linus Torvalds 5fdf11283e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  netfilter: replace old NF_ARP calls with NFPROTO_ARP
  netfilter: fix compilation error with NAT=n
  netfilter: xt_recent: use proc_create_data()
  netfilter: snmp nat leaks memory in case of failure
  netfilter: xt_iprange: fix range inversion match
  netfilter: netns: use NFPROTO_NUMPROTO instead of NUMPROTO for tables array
  netfilter: ctnetlink: remove obsolete NAT dependency from Kconfig
  pkt_sched: sch_generic: Fix oops in sch_teql
  dccp: Port redirection support for DCCP
  tcp: Fix IPv6 fallout from 'Port redirection support for TCP'
  netdev: change name dropping error codes
  ipvs: Update CONFIG_IP_VS_IPV6 description and help text
2008-10-20 09:06:35 -07:00
Jan Engelhardt fdc9314cbe netfilter: replace old NF_ARP calls with NFPROTO_ARP
(Supplements: ee999d8b95)

NFPROTO_ARP actually has a different value from NF_ARP, so ensure all
callers use the new value so that packets _do_ get delivered to the
registered hooks.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-20 03:34:51 -07:00
Ilpo Järvinen 311670f3ea netfilter: snmp nat leaks memory in case of failure
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-20 03:33:24 -07:00
Linus Torvalds b225ee5bed Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)
  ipv4: Add a missing rcu_assign_pointer() in routing cache.
  [netdrvr] ibmtr: PCMCIA IBMTR is ok on 64bit
  xen-netfront: Avoid unaligned accesses to IP header
  lmc: copy_*_user under spinlock
  [netdrvr] myri10ge, ixgbe: remove broken select INTEL_IOATDMA
2008-10-17 08:58:52 -07:00
Johannes Berg 95a5afca4a net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)
Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-16 15:24:51 -07:00
Eric Dumazet 00269b54ed ipv4: Add a missing rcu_assign_pointer() in routing cache.
rt_intern_hash() is doing an update of a RCU guarded hash chain
without using rcu_assign_pointer() or equivalent barrier.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-16 14:18:29 -07:00
Linus Torvalds cb23832e39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (26 commits)
  decnet: Fix compiler warning in dn_dev.c
  IPV6: Fix default gateway criteria wrt. HIGH/LOW preference radv option
  net/802/fc.c: Fix compilation warnings
  netns: correct mib stats in ip6_route_me_harder()
  netns: fix net_generic array leak
  rt2x00: fix regression introduced by "mac80211: free up 2 bytes in skb->cb"
  rtl8187: Add USB ID for Belkin F5D7050 with RTL8187B chip
  p54usb: Device ID updates
  mac80211: fixme for kernel-doc
  ath9k/mac80211: disallow fragmentation in ath9k, report to userspace
  libertas : Remove unused variable warning for "old_channel" from cmd.c
  mac80211: Fix scan RX processing oops
  orinoco: fix unsafe locking in spectrum_cs_suspend
  orinoco: fix unsafe locking in orinoco_cs_resume
  cfg80211: fix debugfs error handling
  mac80211: fix debugfs netdev rename
  iwlwifi: fix ct kill configuration for 5350
  mac80211: fix HT information element parsing
  p54: Fix compilation problem on PPC
  mac80211: fix debugfs lockup
  ...
2008-10-16 11:26:26 -07:00
Alexey Dobriyan f221e726bf sysctl: simplify ->strategy
name and nlen parameters passed to ->strategy hook are unused, remove
them.  In general ->strategy hook should know what it's doing, and don't
do something tricky for which, say, pointer to original userspace array
may be needed (name).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net> [ networking bits ]
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16 11:21:47 -07:00
Pablo Neira Ayuso e6a7d3c04f netfilter: ctnetlink: remove bogus module dependency between ctnetlink and nf_nat
This patch removes the module dependency between ctnetlink and
nf_nat by means of an indirect call that is initialized when
nf_nat is loaded. Now, nf_conntrack_netlink only requires
nf_conntrack and nfnetlink.

This patch puts nfnetlink_parse_nat_setup_hook into the
nf_conntrack_core to avoid dependencies between ctnetlink,
nf_conntrack_ipv4 and nf_conntrack_ipv6.

This patch also introduces the function ctnetlink_change_nat
that is only invoked from the creation path. Actually, the
nat handling cannot be invoked from the update path since
this is not allowed. By introducing this function, we remove
the useless nat handling in the update path and we avoid
deadlock-prone code.

This patch also adds the required EAGAIN logic for nfnetlink.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-14 11:58:31 -07:00
Patrick McHardy 38f7ac3eb7 netfilter: restore lost #ifdef guarding defrag exception
Nir Tzachar <nir.tzachar@gmail.com> reported a warning when sending
fragments over loopback with NAT:

[ 6658.338121] WARNING: at net/ipv4/netfilter/nf_nat_standalone.c:89 nf_nat_fn+0x33/0x155()

The reason is that defragmentation is skipped for already tracked connections.
This is wrong in combination with NAT and ip_conntrack actually had some ifdefs
to avoid this behaviour when NAT is compiled in.

The entire "optimization" may seem a bit silly, for now simply restoring the
lost #ifdef is the easiest solution until we can come up with something better.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-14 11:56:59 -07:00
Linus Torvalds 43096597a4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  qlge: Fix page size ifdef test.
  net: Rationalise email address: Network Specific Parts
  dsa: fix compile bug on s390
  netns: mib6 section fixlet
  enic: Fix Kconfig headline description
  de2104x: wrong MAC address fix
  s390: claw compile fixlet
  net: export genphy_restart_aneg
  cxgb3: extend copyrights to 2008
  cxgb3: update driver version
  net/phy: add missing kernel-doc
  pktgen: fix skb leak in case of failure
  mISDN/dsp_cmx.c: fix size checks
  misdn: use nonseekable_open()
  net: fix driver build errors due to missing net/ip6_checksum.h include
2008-10-14 10:28:49 -07:00
Alan Cox 113aa838ec net: Rationalise email address: Network Specific Parts
Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-13 19:01:08 -07:00
James Morris 93db628658 Merge branch 'next' into for-linus 2008-10-13 09:35:14 +11:00
Herbert Xu 7bb82d9245 gre: Initialise rtnl_link tunnel parameters properly
Brown paper bag error of calling memset with sizeof(p) instead
of sizeof(*p).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-11 12:20:15 -07:00
Patrick McHardy 4d74f8ba1f gre: minor cleanups in netlink interface
- use typeful helpers for IFLA_GRE_LOCAL/IFLA_GRE_REMOTE
- replace magic value by FIELD_SIZEOF
- use MODULE_ALIAS_RTNL_LINK macro

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-10 12:11:06 -07:00
Patrick McHardy ba9e64b1c2 gre: fix copy and paste error
The flags are dumped twice, the keys not at all.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-10 12:10:30 -07:00
Paul Moore 15c45f7b2e cipso: Add support for native local labeling and fixup mapping names
This patch accomplishes three minor tasks: add a new tag type for local
labeling, rename the CIPSO_V4_MAP_STD define to CIPSO_V4_MAP_TRANS and
replace some of the CIPSO "magic numbers" with constants from the header
file.  The first change allows CIPSO to support full LSM labels/contexts,
not just MLS attributes.  The second change brings the mapping names inline
with what userspace is using, compatibility is preserved since we don't
actually change the value.  The last change is to aid readability and help
prevent mistakes.

Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-10-10 10:16:34 -04:00
Paul Moore 014ab19a69 selinux: Set socket NetLabel based on connection endpoint
Previous work enabled the use of address based NetLabel selectors, which while
highly useful, brought the potential for additional per-packet overhead when
used.  This patch attempts to solve that by applying NetLabel socket labels
when sockets are connect()'d.  This should alleviate the per-packet NetLabel
labeling for all connected sockets (yes, it even works for connected DGRAM
sockets).

Signed-off-by: Paul Moore <paul.moore@hp.com>
Reviewed-by: James Morris <jmorris@namei.org>
2008-10-10 10:16:33 -04:00
Paul Moore 948bf85c1b netlabel: Add functionality to set the security attributes of a packet
This patch builds upon the new NetLabel address selector functionality by
providing the NetLabel KAPI and CIPSO engine support needed to enable the
new packet-based labeling.  The only new addition to the NetLabel KAPI at
this point is shown below:

 * int netlbl_skbuff_setattr(skb, family, secattr)

... and is designed to be called from a Netfilter hook after the packet's
IP header has been populated such as in the FORWARD or LOCAL_OUT hooks.

This patch also provides the necessary SELinux hooks to support this new
functionality.  Smack support is not currently included due to uncertainty
regarding the permissions needed to expand the Smack network access controls.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Reviewed-by: James Morris <jmorris@namei.org>
2008-10-10 10:16:32 -04:00
Paul Moore b1edeb1023 netlabel: Replace protocol/NetLabel linking with refrerence counts
NetLabel has always had a list of backpointers in the CIPSO DOI definition
structure which pointed to the NetLabel LSM domain mapping structures which
referenced the CIPSO DOI struct.  The rationale for this was that when an
administrator removed a CIPSO DOI from the system all of the associated
NetLabel LSM domain mappings should be removed as well; a list of
backpointers made this a simple operation.

Unfortunately, while the backpointers did make the removal easier they were
a bit of a mess from an implementation point of view which was making
further development difficult.  Since the removal of a CIPSO DOI is a
realtively rare event it seems to make sense to remove this backpointer
list as the optimization was hurting us more then it was helping.  However,
we still need to be able to track when a CIPSO DOI definition is being used
so replace the backpointer list with a reference count.  In order to
preserve the current functionality of removing the associated LSM domain
mappings when a CIPSO DOI is removed we walk the LSM domain mapping table,
removing the relevant entries.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Reviewed-by: James Morris <jmorris@namei.org>
2008-10-10 10:16:31 -04:00
Eric Dumazet f24d43c07e udp: complete port availability checking
While looking at UDP port randomization, I noticed it
was litle bit pessimistic, not looking at type of sockets
(IPV6/IPV4) and not looking at bound addresses if any.

We should perform same tests than when binding to a
specific port.

This permits a cleanup of udp_lib_get_port()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 14:51:27 -07:00
Ilpo Järvinen 78e645cb89 tcpv[46]: fix md5 pseudoheader address field ordering
Maybe it's just me but I guess those md5 people made a mess
out of it by having *_md5_hash_* to use daddr, saddr order
instead of the one that is natural (and equal to what csum
functions use). For the segment were sending, the original
addresses are reversed so buff's saddr == skb's daddr and
vice-versa.

Maybe I can finally proceed with unification of some code
after fixing it first... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 14:37:47 -07:00
Herbert Xu 64194c31a0 inet: Make tunnel RX/TX byte counters more consistent
This patch makes the RX/TX byte counters for IPIP, GRE and SIT more
consistent.  Previously we included the external IP headers on the
way out but not when the packet is inbound.

The new scheme is to count payload only in both directions.  For
IPIP and SIT this simply means the exclusion of the external IP
header.  For GRE this means that we exclude the GRE header as
well.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 12:03:17 -07:00
Herbert Xu e1a8000228 gre: Add Transparent Ethernet Bridging
This patch adds support for Ethernet over GRE encapsulation.
This is exposed to user-space with a new link type of "gretap"
instead of "gre".  It will create an ARPHRD_ETHER device in
lieu of the usual ARPHRD_IPGRE.

Note that to preserver backwards compatibility all Transparent
Ethernet Bridging packets are passed to an ARPHRD_IPGRE tunnel
if its key matches and there is no ARPHRD_ETHER device whose
key matches more closely.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 12:00:17 -07:00
Herbert Xu c19e654ddb gre: Add netlink interface
This patch adds a netlink interface that will eventually displace
the existing ioctl interface.  It utilises the elegant rtnl_link_ops
mechanism.

This also means that user-space no longer needs to rely on the
tunnel interface being of type GRE to identify GRE tunnels.  The
identification can now occur using rtnl_link_ops.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 11:59:55 -07:00
Herbert Xu 42aa916265 gre: Move MTU setting out of ipgre_tunnel_bind_dev
This patch moves the dev->mtu setting out of ipgre_tunnel_bind_dev.
This is in prepartion of using rtnl_link where we'll need to make
the MTU setting conditional on whether the user has supplied an
MTU.  This also requires the move of the ipgre_tunnel_bind_dev
call out of the dev->init function so that we can access the user
parameters later.

This patch also adds a check to prevent setting the MTU below
the minimum of 68.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 11:59:32 -07:00
Herbert Xu c95b819ad7 gre: Use needed_headroom
Now that we have dev->needed_headroom, we can use it instead of
having a bogus dev->hard_header_len.  This also allows us to
include dev->hard_header_len in the MTU computation so that when
we do have a meaningful hard_harder_len in future it is included
automatically in figuring out the MTU.

Incidentally, this fixes a bug where we ignored the needed_headroom
field of the underlying device in calculating our own hard_header_len.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-09 11:58:54 -07:00
David S. Miller 4dd565134e Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/e1000e/ich8lan.c
	drivers/net/e1000e/netdev.c
2008-10-08 14:56:41 -07:00
Sven Wegener 071d7ab664 ipvs: Remove stray file left over from ipvs move
Commit cb7f6a7b71 ("IPVS: Move IPVS to
net/netfilter/ipvs") has left a stray file in the old location of ipvs.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-08 14:41:35 -07:00
David S. Miller db2bf2476b Merge branch 'lvs-next-2.6' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/lvs-2.6
Conflicts:

	net/netfilter/Kconfig
2008-10-08 14:26:36 -07:00
Eric Dumazet 3c689b7320 inet: cleanup of local_port_range
I noticed sysctl_local_port_range[] and its associated seqlock
sysctl_local_port_range_lock were on separate cache lines.
Moreover, sysctl_local_port_range[] was close to unrelated
variables, highly modified, leading to cache misses.

Moving these two variables in a structure can help data
locality and moving this structure to read_mostly section
helps sharing of this data among cpus.

Cleanup of extern declarations (moved in include file where
they belong), and use of inet_get_local_port_range()
accessor instead of direct access to ports values.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-08 14:18:04 -07:00
Eric Dumazet 9088c56095 udp: Improve port randomization
Current UDP port allocation is suboptimal.
We select the shortest chain to chose a port (out of 512)
that will hash in this shortest chain.

First, it can lead to give not so ramdom ports and ease
give attackers more opportunities to break the system.

Second, it can consume a lot of CPU to scan all table
in order to find the shortest chain.

Third, in some pathological cases we can fail to find
a free port even if they are plenty of them.

This patch zap the search for a short chain and only
use one random seed. Problem of getting long chains
should be addressed in another way, since we can
obtain long chains with non random ports.

Based on a report and patch from Vitaly Mayatskikh

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-08 11:44:17 -07:00
Ilpo Järvinen 52cd5750e8 tcp: fix length used for checksum in a reset
While looking for some common code I came across difference
in checksum calculation between tcp_v6_send_(reset|ack) I
couldn't explain. I checked both v4 and v6 and found out that
both seem to have the same "feature". I couldn't find anything
in rfc nor anywhere else which would state that md5 option
should be ignored like it was in case of reset so I came to
a conclusion that this is probably a genuine bug. I suspect
that addition of md5 just was fooled by the excessive
copy-paste code in those functions and the reset part was
never tested well enough to find out the problem.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-08 11:34:06 -07:00
David S. Miller 364ae953a4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2008-10-08 09:50:38 -07:00
Jan Engelhardt 916a917dfe netfilter: xtables: provide invoked family value to extensions
By passing in the family through which extensions were invoked, a bit
of data space can be reclaimed. The "family" member will be added to
the parameter structures and the check functions be adjusted.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:20 +02:00
Jan Engelhardt a2df1648ba netfilter: xtables: move extension arguments into compound structure (6/6)
This patch does this for target extensions' destroy functions.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:19 +02:00
Jan Engelhardt af5d6dc200 netfilter: xtables: move extension arguments into compound structure (5/6)
This patch does this for target extensions' checkentry functions.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:19 +02:00
Jan Engelhardt 7eb3558655 netfilter: xtables: move extension arguments into compound structure (4/6)
This patch does this for target extensions' target functions.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:19 +02:00
Jan Engelhardt 6be3d8598e netfilter: xtables: move extension arguments into compound structure (3/6)
This patch does this for match extensions' destroy functions.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:19 +02:00
Jan Engelhardt 9b4fce7a35 netfilter: xtables: move extension arguments into compound structure (2/6)
This patch does this for match extensions' checkentry functions.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:18 +02:00
Jan Engelhardt f7108a20de netfilter: xtables: move extension arguments into compound structure (1/6)
The function signatures for Xtables extensions have grown over time.
It involves a lot of typing/replication, and also a bit of stack space
even if they are not used. Realize an NFWS2008 idea and pack them into
structs. The skb remains outside of the struct so gcc can continue to
apply its optimizations.

This patch does this for match extensions' match functions.

A few ambiguities have also been addressed. The "offset" parameter for
example has been renamed to "fragoff" (there are so many different
offsets already) and "protoff" to "thoff" (there is more than just one
protocol here, so clarify).

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:18 +02:00
Jan Engelhardt c2df73de24 netfilter: xtables: use "if" blocks in Kconfig
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:18 +02:00
Jan Engelhardt aba0d34800 netfilter: xtables: sort extensions alphabetically in Kconfig
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:17 +02:00
Jan Engelhardt 367c679007 netfilter: xtables: do centralized checkentry call (1/2)
It used to be that {ip,ip6,etc}_tables called extension->checkentry
themselves, but this can be moved into the xtables core.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:17 +02:00
KOVACS Krisztian 73e4022f78 netfilter: split netfilter IPv4 defragmentation into a separate module
Netfilter connection tracking requires all IPv4 packets to be defragmented.
Both the socket match and the TPROXY target depend on this functionality, so
this patch separates the Netfilter IPv4 defrag hooks into a separate module.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:12 +02:00
Alexey Dobriyan cfd6e3d747 netfilter: netns nat: PPTP NAT in netns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:11 +02:00
Alexey Dobriyan 9174c1538f netfilter: netns nf_conntrack: fixup DNAT in netns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:11 +02:00
Alexey Dobriyan 0c4c9288ad netfilter: netns nat: per-netns bysource hash
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:11 +02:00
Alexey Dobriyan e099a17357 netfilter: netns nat: per-netns NAT table
Same story as with iptable_filter, iptables_raw tables.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:10 +02:00
Alexey Dobriyan b8b8063e0d netfilter: netns nat: fix ipt_MASQUERADE in netns
First, allow entry in notifier hook.
Second, start conntrack cleanup in netns to which netdevice belongs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:10 +02:00
Alexey Dobriyan c2a2c7e0cc netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_log_invalid sysctl
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:08 +02:00
Alexey Dobriyan c04d05529a netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_checksum sysctl
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:08 +02:00
Alexey Dobriyan 8e9df80180 netfilter: netns nf_conntrack: per-netns /proc/net/stat/nf_conntrack, /proc/net/stat/ip_conntrack
Show correct conntrack count, while I'm at it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:08 +02:00
Alexey Dobriyan 0d55af8791 netfilter: netns nf_conntrack: per-netns statistics
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:07 +02:00
Alexey Dobriyan a71996fccc netfilter: netns nf_conntrack: pass conntrack to nf_conntrack_event_cache() not skb
This is cleaner, we already know conntrack to which event is relevant.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:07 +02:00
Alexey Dobriyan 5e6b29972b netfilter: netns nf_conntrack: per-netns /proc/net/ip_conntrack, /proc/net/stat/ip_conntrack, /proc/net/ip_conntrack_expect
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:06 +02:00
Alexey Dobriyan 74c51a1497 netfilter: netns nf_conntrack: pass netns pointer to L4 protocol's ->error hook
Again, it's deducible from skb, but we're going to use it for
nf_conntrack_checksum and statistics, so just pass it from upper layer.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:05 +02:00
Alexey Dobriyan a702a65fc1 netfilter: netns nf_conntrack: pass netns pointer to nf_conntrack_in()
It's deducible from skb->dev or skb->dst->dev, but we know netns at
the moment of call, so pass it down and use for finding and creating
conntracks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:04 +02:00
Alexey Dobriyan 9b03f38d04 netfilter: netns nf_conntrack: per-netns expectations
Make per-netns a) expectation hash and b) expectations count.

Expectations always belongs to netns to which it's master conntrack belong.
This is natural and doesn't bloat expectation.

Proc files and leaf users are stubbed to init_net, this is temporary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:03 +02:00
Alexey Dobriyan b21f890193 netfilter: netns: fix {ip,6}_route_me_harder() in netns
Take netns from skb->dst->dev. It should be safe because, they are called
from LOCAL_OUT hook where dst is valid (though, I'm not exactly sure about
IPVS and queueing packets to userspace).

[Patrick: its safe everywhere since they already expect skb->dst to be set]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:03 +02:00
Alexey Dobriyan 400dad39d1 netfilter: netns nf_conntrack: per-netns conntrack hash
* make per-netns conntrack hash

  Other solution is to add ->ct_net pointer to tuplehashes and still has one
  hash, I tried that it's ugly and requires more code deep down in protocol
  modules et al.

* propagate netns pointer to where needed, e. g. to conntrack iterators.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:03 +02:00
Alexey Dobriyan 49ac8713b6 netfilter: netns nf_conntrack: per-netns conntrack count
Sysctls and proc files are stubbed to init_net's one. This is temporary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:03 +02:00
Alexey Dobriyan 48dc7865aa netfilter: netns: remove nf_*_net() wrappers
Now that dev_net() exists, the usefullness of them is even less. Also they're
a big problem in resolving circular header dependencies necessary for
NOTRACK-in-netns patch. See below.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:01 +02:00
Jan Engelhardt ee999d8b95 netfilter: x_tables: use NFPROTO_* in extensions
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:01 +02:00
Jan Engelhardt e948b20a71 netfilter: rename ipt_recent to xt_recent
Like with other modules (such as ipt_state), ipt_recent.h is changed
to forward definitions to (IOW include) xt_recent.h, and xt_recent.c
is changed to use the new constant names.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:00 +02:00
Jan Engelhardt 76108cea06 netfilter: Use unsigned types for hooknum and pf vars
and (try to) consistently use u_int8_t for the L3 family.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-10-08 11:35:00 +02:00
Daniele Lacamera 9d2c27e17b tcp: Fix tcp_hybla zero congestion window growth with small rho and large cwnd.
Because of rounding, in certain conditions, i.e. when in congestion
avoidance state rho is smaller than 1/128 of the current cwnd, TCP
Hybla congestion control starves and the cwnd is kept constant
forever.

This patch forces an increment by one segment after #send_cwnd calls
without increments(newreno behavior).

Signed-off-by: Daniele Lacamera <root@danielinux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 15:58:17 -07:00
Rami Rosen b8bae41ed6 ipv4: add mc_count to in_device.
This patch add mc_count to struct in_device and updates
increment/decrement/initilaize of this field in IPv4 and in IPv6.

- Also printing the vfs /proc entry (/proc/net/igmp) is adjusted to
use the new mc_count.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 15:34:37 -07:00
Ali Saidi 53240c2087 tcp: Fix possible double-ack w/ user dma
From: Ali Saidi <saidi@engin.umich.edu>

When TCP receive copy offload is enabled it's possible that
tcp_rcv_established() will cause two acks to be sent for a single
packet. In the case that a tcp_dma_early_copy() is successful,
copied_early is set to true which causes tcp_cleanup_rbuf() to be
called early which can send an ack. Further along in
tcp_rcv_established(), __tcp_ack_snd_check() is called and will
schedule a delayed ACK. If no packets are processed before the delayed
ack timer expires the packet will be acked twice.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 15:31:19 -07:00
Denis V. Lunev 0c7ed677fb netns: make udpv6 mib per/namespace
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:49:36 -07:00
Ilpo Järvinen 4a7e56098f tcp: cleanup messy initializer
I'm quite sure that if I give this function in its old format
for you to inspect, you start to wonder what is the type of
demanded or if it's a global variable.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:43:31 -07:00
Ilpo Järvinen 33f5f57eeb tcp: kill pointless urg_mode
It all started from me noticing that this urgent check in
tcp_clean_rtx_queue is unnecessarily inside the loop. Then
I took a longer look to it and found out that the users of
urg_mode can trivially do without, well almost, there was
one gotcha.

Bonus: those funny people who use urg with >= 2^31 write_seq -
snd_una could now rejoice too (that's the only purpose for the
between being there, otherwise a simple compare would have done
the thing). Not that I assume that the rest of the tcp code
happily lives with such mind-boggling numbers :-). Alas, it
turned out to be impossible to set wmem to such numbers anyway,
yes I really tried a big sendfile after setting some wmem but
nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
So I hacked a bit variable to long and found out that it seems
to work... :-)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:43:06 -07:00
Peter Zijlstra c57943a1c9 net: wrap sk->sk_backlog_rcv()
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:18:42 -07:00