Commit graph

5520 commits

Author SHA1 Message Date
Daniel Borkmann 7921895a5e net: esp{4,6}: fix potential MTU calculation overflows
Commit 91657eafb ("xfrm: take net hdr len into account for esp payload
size calculation") introduced a possible interger overflow in
esp{4,6}_get_mtu() handlers in case of x->props.mode equals
XFRM_MODE_TUNNEL. Thus, the following expression will overflow

  unsigned int net_adj;
  ...
  <case ipv{4,6} XFRM_MODE_TUNNEL>
         net_adj = 0;
  ...
  return ((mtu - x->props.header_len - crypto_aead_authsize(esp->aead) -
           net_adj) & ~(align - 1)) + (net_adj - 2);

where (net_adj - 2) would be evaluated as <foo> + (0 - 2) in an unsigned
context. Fix it by simply removing brackets as those operations here
do not need to have special precedence.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Benjamin Poirier <bpoirier@suse.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05 12:26:50 -07:00
David S. Miller 0e76a3a587 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge net into net-next to setup some infrastructure Eric
Dumazet needs for usbnet changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 21:36:46 -07:00
Stefan Tomanek 73f5698e77 fib_rules: fix suppressor names and default values
This change brings the suppressor attribute names into line; it also changes
the data types to provide a more consistent interface.

While -1 indicates that the suppressor is not enabled, values >= 0 for
suppress_prefixlen or suppress_ifgroup  reject routing decisions violating the
constraint.

This changes the previously presented behaviour of suppress_prefixlen, where a
prefix length _less_ than the attribute value was rejected. After this change,
a prefix length less than *or* equal to the value is considered a violation of
the rule constraint.

It also changes the default values for default and newly added rules (disabling
any suppression for those).

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03 10:40:23 -07:00
Stefan Tomanek 6ef94cfafb fib_rules: add route suppression based on ifgroup
This change adds the ability to suppress a routing decision based upon the
interface group the selected interface belongs to. This allows it to
exclude specific devices from a routing decision.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 15:24:22 -07:00
Daniel Borkmann 446266b0c7 net: rtm_to_ifaddr: free ifa if ifa_cacheinfo processing fails
Commit 5c766d642 ("ipv4: introduce address lifetime") leaves the ifa
resource that was allocated via inet_alloc_ifa() unfreed when returning
the function with -EINVAL. Thus, free it first via inet_free_ifa().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02 14:56:06 -07:00
Stefan Tomanek 7764a45a8f fib_rules: add .suppress operation
This change adds a new operation to the fib_rules_ops struct; it allows the
suppression of routing decisions if certain criteria are not met by its
results.

The first implemented constraint is a minimum prefix length added to the
structures of routing rules. If a rule is added with a minimum prefix length
>0, only routes meeting this threshold will be considered. Any other (more
general) routing table entries will be ignored.

When configuring a system with multiple network uplinks and default routes, it
is often convinient to reference the main routing table multiple times - but
omitting the default route. Using this patch and a modified "ip" utility, this
can be achieved by using the following command sequence:

  $ ip route add table secuplink default via 10.42.23.1

  $ ip rule add pref 100            table main prefixlength 1
  $ ip rule add pref 150 fwmark 0xA table secuplink

With this setup, packets marked 0xA will be processed by the additional routing
table "secuplink", but only if no suitable route in the main routing table can
be found. By using a minimal prefixlength of 1, the default route (/0) of the
table "main" is hidden to packets processed by rule 100; packets traveling to
destinations with more specific routing entries are processed as usual.

Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:27:17 -07:00
fan.du ca4c3fc24e net: split rt_genid for ipv4 and ipv6
Current net name space has only one genid for both IPv4 and IPv6, it has below
drawbacks:

- Add/delete an IPv4 address will invalidate all IPv6 routing table entries.
- Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table
  entries even when the policy is only applied for one address family.

Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6
separately in a fine granularity.

Signed-off-by: Fan Du <fan.du@windriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 14:56:36 -07:00
Dmitry Popov c0155b2da4 tcp: Remove unused tcpct declarations and comments
Remove declaration, 4 defines and confusing comment that are no longer used
since 1a2c6181c4 ("tcp: Remove TCPCT").

Signed-off-by: Dmitry Popov <dp@highloadlab.com>
Acked-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 12:16:45 -07:00
Dan Carpenter 4299c8a94f net: remove an unneeded check
"ifa->ifa_label" is an array inside the in_ifaddr struct.  It can never
be NULL so we can remove this check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 19:24:16 -07:00
Hannes Frederic Sowa 5ad37d5dee tcp: add tcp_syncookies mode to allow unconditionally generation of syncookies
| If you want to test which effects syncookies have to your
| network connections you can set this knob to 2 to enable
| unconditionally generation of syncookies.

Original idea and first implementation by Eric Dumazet.

Cc: Florian Westphal <fw@strlen.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 16:15:18 -07:00
Hannes Frederic Sowa 9d4a031464 ipv4, ipv6: send igmpv3/mld packets with TC_PRIO_CONTROL
v2:
a) Also send ipv4 igmp messages with TC_PRIO_CONTROL

Cc: William Manley <william.manley@youview.com>
Cc: Lukas Tribus <luky-37@hotmail.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-28 11:13:55 -07:00
Thomas Graf c26bf4a513 pktgen: Add UDPCSUM flag to support UDP checksums
UDP checksums are optional, hence pktgen has been omitting them in
favour of performance. The optional flag UDPCSUM enables UDP
checksumming. If the output device supports hardware checksumming
the skb is prepared and marked CHECKSUM_PARTIAL, otherwise the
checksum is generated in software.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27 22:16:36 -07:00
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Eric Dumazet 64dc61306c net: add sk_stream_is_writeable() helper
Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Jerry Snitselaar f585a991e1 fib_trie: potential out of bounds access in trie_show_stats()
With the <= max condition in the for loop, it will be always go 1
element further than needed. If the condition for the while loop is
never met, then max is MAX_STAT_DEPTH, and for loop will walk off the
end of nodesizes[].

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 16:05:14 -07:00
Amerigo Wang b9959fd3b0 vti: switch to new ip tunnel code
GRE tunnel and IPIP tunnel already switched to the new
ip tunnel code, VTI tunnel can use it too.

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Saurabh Mohan <saurabh.mohan@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 17:27:32 -07:00
Rami Rosen c4854ec8c4 ipmr: change the prototype of ip_mr_forward().
This patch changes the prototpye of the ip_mr_forward() method to return void
instead of int.

The ip_mr_forward() method always returns 0; moreover, the return value of this
method is not checked anywhere.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 17:01:05 -07:00
Jiri Pirko 4aa5dee4d9 net: convert resend IGMP to notifier event
Until now, bond_resend_igmp_join_requests() looks for vlans attached to
bonding device, bridge where bonding act as port manually. It does not
care of other scenarios, like stacked bonds or team device above. Make
this more generic and use netdev notifier to propagate the event to
upper devices and to actually call ip_mc_rejoin_groups().

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 16:52:47 -07:00
Yuchung Cheng ed08495c31 tcp: use RTT from SACK for RTO
If RTT is not available because Karn's check has failed or no
new packet is acked, use the RTT measured from SACK to estimate
the RTO. The sender can continue to estimate the RTO during loss
recovery or reordering event upon receiving non-partial ACKs.

This also changes when the RTO is re-armed. Previously it is
only re-armed when some data is cummulatively acknowledged (i.e.,
SND.UNA advances), but now it is re-armed whenever RTT estimator
is updated. This feature is particularly useful to reduce spurious
timeout for buffer bloat including cellular carriers [1], and
RTT estimation on reordering events.

[1] "An In-depth Study of LTE: Effect of Network Protocol and
 Application Behavior on Performance", In Proc. of SIGCOMM 2013

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng 59c9af4234 tcp: measure RTT from new SACK
Take RTT sample if an ACK selectively acks some sequences that
have never been retransmitted. The Karn's algorithm does not apply
even if that ACK (s)acks other retransmitted sequences, because it
must been generated by an original but perhaps out-of-order packet.
There is no ambiguity. In case when multiple blocks are newly
sacked because of ACK losses the earliest block is used to
measure RTT, similar to cummulative ACKs.

Such RTT samples allow the sender to estimate the RTO during loss
recovery and packet reordering events. It is still useful even with
TCP timestamps. That's because during these events the SND.UNA may
not advance preventing RTT samples from TS ECR (thus the FLAG_ACKED
check before calling tcp_ack_update_rtt()).  Therefore this new
RTT source is complementary to existing ACK and TS RTT mechanisms.

This patch does not update the RTO. It is done in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng 5b08e47caf tcp: prefer packet timing to TS-ECR for RTT
Prefer packet timings to TS-ecr for RTT measurements when both
sources are available. That's because broken middle-boxes and remote
peer can return packets with corrupted TS ECR fields. Similarly most
congestion controls that require RTT signals favor timing-based
sources as well. Also check for bad TS ECR values to avoid RTT
blow-ups. It has happened on production Web servers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Yuchung Cheng 375fe02c91 tcp: consolidate SYNACK RTT sampling
The first patch consolidates SYNACK and other RTT measurement to use a
central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
RTT measurement happens after PAWS check, potentially reducing the
impact of RTO seeding on bad TCP timestamps values.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 17:53:42 -07:00
Michal Tesar 651e92716a sysctl net: Keep tcp_syn_retries inside the boundary
Limit the min/max value passed to the
/proc/sys/net/ipv4/tcp_syn_retries.

Signed-off-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-19 17:36:03 -07:00
Eric Dumazet 21d1196a35 ipv4: set transport header earlier
commit 45f00f99d6 ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
performance regression for non GRO traffic, basically disabling
IP early demux.

IPv6 stack resets transport header in ip6_rcv() before calling
IP early demux in ip6_rcv_finish(), while IPv4 does this only in
ip_local_deliver_finish(), _after_ IP early demux.

GRO traffic happened to enable IP early demux because transport header
is also set in inet_gro_receive()

Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
same : transport_header should be set in ip_rcv() instead of
ip_local_deliver_finish()

ip_local_deliver_finish() can also use skb_network_header_len() which is
faster than ip_hdrlen()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-16 12:59:28 -07:00
Yuchung Cheng 24ab6bec80 tcp: account all retransmit failures
Change snmp RETRANSFAILS stat to include timeout retransmit failures
in addition to other loss recoveries.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-12 16:15:56 -07:00
Alexander Duyck 8c91e162e0 gre: Fix MTU sizing check for gretap tunnels
This change fixes an MTU sizing issue seen with gretap tunnels when non-gso
packets are sent from the interface.

In my case I was able to reproduce the issue by simply sending a ping of
1421 bytes with the gretap interface created on a device with a standard
1500 mtu.

This fix is based on the fact that the tunnel mtu is already adjusted by
dev->hard_header_len so it would make sense that any packets being compared
against that mtu should also be adjusted by hard_header_len and the tunnel
header instead of just the tunnel header.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reported-by: Cong Wang <amwang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11 16:12:03 -07:00
Alexander Duyck cdbaa0bb26 gso: Update tunnel segmentation to support Tx checksum offload
This change makes it so that the GRE and VXLAN tunnels can make use of Tx
checksum offload support provided by some drivers via the hw_enc_features.
Without this fix enabling GSO means sacrificing Tx checksum offload and
this actually leads to a performance regression as shown below:

            Utilization
            Send
Throughput  local         GSO
10^6bits/s  % S           state
  6276.51   8.39          enabled
  7123.52   8.42          disabled

To resolve this it was necessary to address two items.  First
netif_skb_features needed to be updated so that it would correctly handle
the Trans Ether Bridging protocol without impacting the need to check for
Q-in-Q tagging.  To do this it was necessary to update harmonize_features
so that it used skb_network_protocol instead of just using the outer
protocol.

Second it was necessary to update the GRE and UDP tunnel segmentation
offloads so that they would reset the encapsulation bit and inner header
offsets after the offload was complete.

As a result of this change I have seen the following results on a interface
with Tx checksum enabled for encapsulated frames:

            Utilization
            Send
Throughput  local         GSO
10^6bits/s  % S           state
  7123.52   8.42          disabled
  8321.75   5.43          enabled

v2: Instead of replacing refrence to skb->protocol with
    skb_network_protocol just replace the protocol reference in
    harmonize_features to allow for double VLAN tag checks.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11 12:18:49 -07:00
Camelia Groza 3b8ccd4473 inet: fix spacing in assignment
Found using checkpatch.pl

Signed-off-by: Camelia Groza <camelia.groza@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11 12:02:39 -07:00
Eliezer Tamir 8b80cda536 net: rename ll methods to busy-poll
Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all useres of these functions.
Update comments and defines  in include/net/busy_poll.h

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir 076bb0c82a net: rename include/net/ll_poll.h to include/net/busy_poll.h
Rename the file and correct all the places where it is included.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Linus Torvalds 496322bc91 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "This is a re-do of the net-next pull request for the current merge
  window.  The only difference from the one I made the other day is that
  this has Eliezer's interface renames and the timeout handling changes
  made based upon your feedback, as well as a few bug fixes that have
  trickeled in.

  Highlights:

   1) Low latency device polling, eliminating the cost of interrupt
      handling and context switches.  Allows direct polling of a network
      device from socket operations, such as recvmsg() and poll().

      Currently ixgbe, mlx4, and bnx2x support this feature.

      Full high level description, performance numbers, and design in
      commit 0a4db187a9 ("Merge branch 'll_poll'")

      From Eliezer Tamir.

   2) With the routing cache removed, ip_check_mc_rcu() gets exercised
      more than ever before in the case where we have lots of multicast
      addresses.  Use a hash table instead of a simple linked list, from
      Eric Dumazet.

   3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from
      Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
      Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

   4) Support reporting the TUN device persist flag to userspace, from
      Pavel Emelyanov.

   5) Allow controlling network device VF link state using netlink, from
      Rony Efraim.

   6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

   7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
      Daniel Borkmann and Eric Dumazet.

   8) Allow controlling of TCP quickack behavior on a per-route basis,
      from Cong Wang.

   9) Several bug fixes and improvements to vxlan from Stephen
      Hemminger, Pravin B Shelar, and Mike Rapoport.  In particular,
      support receiving on multiple UDP ports.

  10) Major cleanups, particular in the area of debugging and cookie
      lifetime handline, to the SCTP protocol code.  From Daniel
      Borkmann.

  11) Allow packets to cross network namespaces when traversing tunnel
      devices.  From Nicolas Dichtel.

  12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
      manner akin to how we monitor real network traffic via ptype_all.
      From Daniel Borkmann.

  13) Several bug fixes and improvements for the new alx device driver,
      from Johannes Berg.

  14) Fix scalability issues in the netem packet scheduler's time queue,
      by using an rbtree.  From Eric Dumazet.

  15) Several bug fixes in TCP loss recovery handling, from Yuchung
      Cheng.

  16) Add support for GSO segmentation of MPLS packets, from Simon
      Horman.

  17) Make network notifiers have a real data type for the opaque
      pointer that's passed into them.  Use this to properly handle
      network device flag changes in arp_netdev_event().  From Jiri
      Pirko and Timo Teräs.

  18) Convert several drivers over to module_pci_driver(), from Peter
      Huewe.

  19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
      O(1) calculation instead.  From Eric Dumazet.

  20) Support setting of explicit tunnel peer addresses in ipv6, just
      like ipv4.  From Nicolas Dichtel.

  21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

  22) Prevent a single high rate flow from overruning an individual cpu
      during RX packet processing via selective flow shedding.  From
      Willem de Bruijn.

  23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
      Dumazet.

  24) Don't just drop GSO packets which are above the TBF scheduler's
      burst limit, chop them up so they are in-bounds instead.  Also
      from Eric Dumazet.

  25) VLAN offloads are missed when configured on top of a bridge, fix
      from Vlad Yasevich.

  26) Support IPV6 in ping sockets.  From Lorenzo Colitti.

  27) Receive flow steering targets should be updated at poll() time
      too, from David Majnemer.

  28) Fix several corner case regressions in PMTU/redirect handling due
      to the routing cache removal, from Timo Teräs.

  29) We have to be mindful of ipv4 mapped ipv6 sockets in
      upd_v6_push_pending_frames().  From Hannes Frederic Sowa.

  30) Fix L2TP sequence number handling bugs, from James Chapman."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
  drivers/net: caif: fix wrong rtnl_is_locked() usage
  drivers/net: enic: release rtnl_lock on error-path
  vhost-net: fix use-after-free in vhost_net_flush
  net: mv643xx_eth: do not use port number as platform device id
  net: sctp: confirm route during forward progress
  virtio_net: fix race in RX VQ processing
  virtio: support unlocked queue poll
  net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
  Documentation: Fix references to defunct linux-net@vger.kernel.org
  net/fs: change busy poll time accounting
  net: rename low latency sockets functions to busy poll
  bridge: fix some kernel warning in multicast timer
  sfc: Fix memory leak when discarding scattered packets
  sit: fix tunnel update via netlink
  dt:net:stmmac: Add dt specific phy reset callback support.
  dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
  dt:net:stmmac: Allocate platform data only if its NULL.
  net:stmmac: fix memleak in the open method
  ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
  net: ipv6: fix wrong ping_v6_sendmsg return value
  ...
2013-07-09 18:24:39 -07:00
Eliezer Tamir cbf55001b2 net: rename low latency sockets functions to busy poll
Rename functions in include/net/ll_poll.h to busy wait.
Clarify documentation about expected power use increase.
Rename POLL_LL to POLL_BUSY_LOOP.
Add need_resched() testing to poll/select busy loops.

Note, that in select and poll can_busy_poll is dynamic and is
updated continuously to reflect the existence of supported
sockets with valid queue information.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-08 19:25:45 -07:00
Jiang Liu 0ed5fd1385 mm: use totalram_pages instead of num_physpages at runtime
The global variable num_physpages is scheduled to be removed, so use
totalram_pages instead of num_physpages at runtime.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:35 -07:00
David S. Miller 0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
Daniel Borkmann c50cd35788 net: gre: move GSO functions to gre_offload
Similarly to TCP/UDP offloading, move all related GRE functions to
gre_offload.c to make things more explicit and similar to the rest
of the code.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:37:39 -07:00
Pravin B Shelar 23a3647bc4 ip_tunnels: Use skb-len to PMTU check.
In path mtu check, ip header total length works for gre device
but not for gre-tap device.  Use skb len which is consistent
for all tunneling types.  This is old bug in gre.
This also fixes mtu calculation bug introduced by
commit c544193214 (GRE: Refactor GRE tunneling code).

Reported-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 16:43:35 -07:00
Hannes Frederic Sowa 8822b64a0f ipv6: call udp_push_pending_frames when uncorking a socket with AF_INET pending data
We accidentally call down to ip6_push_pending_frames when uncorking
pending AF_INET data on a ipv6 socket. This results in the following
splat (from Dave Jones):

skbuff: skb_under_panic: text:ffffffff816765f6 len:48 put:40 head:ffff88013deb6df0 data:ffff88013deb6dec tail:0x2c end:0xc0 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:126!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: dccp_ipv4 dccp 8021q garp bridge stp dlci mpoa snd_seq_dummy sctp fuse hidp tun bnep nfnetlink scsi_transport_iscsi rfcomm can_raw can_bcm af_802154 appletalk caif_socket can caif ipt_ULOG x25 rose af_key pppoe pppox ipx phonet irda llc2 ppp_generic slhc p8023 psnap p8022 llc crc_ccitt atm bluetooth
+netrom ax25 nfc rfkill rds af_rxrpc coretemp hwmon kvm_intel kvm crc32c_intel snd_hda_codec_realtek ghash_clmulni_intel microcode pcspkr snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep usb_debug snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
CPU: 2 PID: 8095 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #37
task: ffff8801f52c2520 ti: ffff8801e6430000 task.ti: ffff8801e6430000
RIP: 0010:[<ffffffff816e759c>]  [<ffffffff816e759c>] skb_panic+0x63/0x65
RSP: 0018:ffff8801e6431de8  EFLAGS: 00010282
RAX: 0000000000000086 RBX: ffff8802353d3cc0 RCX: 0000000000000006
RDX: 0000000000003b90 RSI: ffff8801f52c2ca0 RDI: ffff8801f52c2520
RBP: ffff8801e6431e08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88022ea0c800
R13: ffff88022ea0cdf8 R14: ffff8802353ecb40 R15: ffffffff81cc7800
FS:  00007f5720a10740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000005862000 CR3: 000000022843c000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff88013deb6dec 000000000000002c 00000000000000c0 ffffffff81a3f6e4
 ffff8801e6431e18 ffffffff8159a9aa ffff8801e6431e90 ffffffff816765f6
 ffffffff810b756b 0000000700000002 ffff8801e6431e40 0000fea9292aa8c0
Call Trace:
 [<ffffffff8159a9aa>] skb_push+0x3a/0x40
 [<ffffffff816765f6>] ip6_push_pending_frames+0x1f6/0x4d0
 [<ffffffff810b756b>] ? mark_held_locks+0xbb/0x140
 [<ffffffff81694919>] udp_v6_push_pending_frames+0x2b9/0x3d0
 [<ffffffff81694660>] ? udplite_getfrag+0x20/0x20
 [<ffffffff8162092a>] udp_lib_setsockopt+0x1aa/0x1f0
 [<ffffffff811cc5e7>] ? fget_light+0x387/0x4f0
 [<ffffffff816958a4>] udpv6_setsockopt+0x34/0x40
 [<ffffffff815949f4>] sock_common_setsockopt+0x14/0x20
 [<ffffffff81593c31>] SyS_setsockopt+0x71/0xd0
 [<ffffffff816f5d54>] tracesys+0xdd/0xe2
Code: 00 00 48 89 44 24 10 8b 87 d8 00 00 00 48 89 44 24 08 48 8b 87 e8 00 00 00 48 c7 c7 c0 04 aa 81 48 89 04 24 31 c0 e8 e1 7e ff ff <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55
RIP  [<ffffffff816e759c>] skb_panic+0x63/0x65
 RSP <ffff8801e6431de8>

This patch adds a check if the pending data is of address family AF_INET
and directly calls udp_push_ending_frames from udp_v6_push_pending_frames
if that is the case.

This bug was found by Dave Jones with trinity.

(Also move the initialization of fl6 below the AF_INET check, even if
not strictly necessary.)

Cc: Dave Jones <davej@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 12:44:18 -07:00
Cong Wang 3b7b514f44 ipip: fix a regression in ioctl
This is a regression introduced by
commit fd58156e45 (IPIP: Use ip-tunneling code.)

Similar to GRE tunnel, previously we only check the parameters
for SIOCADDTUNNEL and SIOCCHGTUNNEL, after that commit, the
check is moved for all commands.

So, just check for SIOCADDTUNNEL and SIOCCHGTUNNEL.

Also, the check for i_key, o_key etc. is suspicious too,
which did not exist before, reset them before passing
to ip_tunnel_ioctl().

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02 01:13:09 -07:00
Cong Wang ab6c7a0a43 vti: remove duplicated code to fix a memory leak
vti module allocates dev->tstats twice: in vti_fb_tunnel_init()
and in vti_tunnel_init(), this lead to a memory leak of
dev->tstats.

Just remove the duplicated operations in vti_fb_tunnel_init().

(candidate for -stable)

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Saurabh Mohan <saurabh.mohan@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 23:37:14 -07:00
Cong Wang 6c734fb859 gre: fix a regression in ioctl
When testing GRE tunnel, I got:

 # ip tunnel show
 get tunnel gre0 failed: Invalid argument
 get tunnel gre1 failed: Invalid argument

This is a regression introduced by commit c544193214
("GRE: Refactor GRE tunneling code.") because previously we
only check the parameters for SIOCADDTUNNEL and SIOCCHGTUNNEL,
after that commit, the check is moved for all commands.

So, just check for SIOCADDTUNNEL and SIOCCHGTUNNEL.

After this patch I got:

 # ip tunnel show
 gre0: gre/ip  remote any  local any  ttl inherit  nopmtudisc
 gre1: gre/ip  remote 192.168.122.101  local 192.168.122.45  ttl inherit

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-01 23:35:22 -07:00
Timo Teräs 2ffae99d1f ipv4: use next hop exceptions also for input routes
Commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops)
assmued that "locally destined, and routed packets, never trigger
PMTU events or redirects that will be processed by us".

However, it seems that tunnel devices do trigger PMTU events in certain
cases. At least ip_gre, ip6_gre, sit, and ipip do use the inner flow's
skb_dst(skb)->ops->update_pmtu to propage mtu information from the
outer flows. These can cause the inner flow mtu to be decreased. If
next hop exceptions are not consulted for pmtu, IP fragmentation will
not be done properly for these routes.

It also seems that we really need to have the PMTU information always
for netfilter TCPMSS clamp-to-pmtu feature to work properly.

So for the time being, cache separate copies of input routes for
each next hop exception.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-28 21:27:47 -07:00
Pablo Neira 3a36515f72 netlink: fix splat in skb_clone with large messages
Since (c05cdb1 netlink: allow large data transfers from user-space),
netlink splats if it invokes skb_clone on large netlink skbs since:

* skb_shared_info was not correctly initialized.
* skb->destructor is not set in the cloned skb.

This was spotted by trinity:

[  894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001
[  894.991034] IP: [<ffffffff81a212c4>] skb_clone+0x24/0xc0
[...]
[  894.991034] Call Trace:
[  894.991034]  [<ffffffff81ad299a>] nl_fib_input+0x6a/0x240
[  894.991034]  [<ffffffff81c3b7e6>] ? _raw_read_unlock+0x26/0x40
[  894.991034]  [<ffffffff81a5f189>] netlink_unicast+0x169/0x1e0
[  894.991034]  [<ffffffff81a601e1>] netlink_sendmsg+0x251/0x3d0

Fix it by:

1) introducing a new netlink_skb_clone function that is used in nl_fib_input,
   that sets our special skb->destructor in the cloned skb. Moreover, handle
   the release of the large cloned skb head area in the destructor path.

2) not allowing large skbuffs in the netlink broadcast path. I cannot find
   any reasonable use of the large data transfer using netlink in that path,
   moreover this helps to skip extra skb_clone handling.

I found two more netlink clients that are cloning the skbs, but they are
not in the sendmsg path. Therefore, the sole client cloning that I found
seems to be the fib frontend.

Thanks to Eric Dumazet for helping to address this issue.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-27 22:44:16 -07:00
Nicolas Dichtel 5e6700b3bf sit: add support of x-netns
This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-27 22:30:47 -07:00
Nicolas Dichtel 963b89e80d sit: fix 4in4 + IPsec scenario
Since commit 32b8a8e59c "sit: add IPv4 over IPv4 support",
tunnel->parms.iph.protocol is 0 when both 4in4 and 6in4 are setup, but
xfrm_lookup() is called only when proto is != 0, thus we need to pass the real
value.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26 13:42:03 -07:00
Eric Dumazet bd8a7036c0 gre: fix a possible skb leak
commit 68c3316311 ("v4 GRE: Add TCP segmentation offload for GRE")
added a possible skb leak, because it frees only the head of segment
list, in case a skb_linearize() call fails.

This patch adds a kfree_skb_list() helper to fix the bug.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:07:44 -07:00
David S. Miller a3d9dd89b7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains five fixes for Netfilter/IPVS, they are:

* A skb leak fix in fragmentation handling in case that helpers are in place,
  it occurs since the IPV6 NAT infrastructure, from Phil Oester.

* Fix SCTP port mangling in ICMP packets for IPVS, from Julian Anastasov.

* Fix event delivery in ctnetlink regarding the new connlabel infrastructure,
  from Florian Westphal.

* Fix mangling in the SIP NAT helper, from Balazs Peter Odor.

* Fix crash in ipt_ULOG introduced while adding netnamespace support,
  from Gao Feng.

I'll take care of passing several of these patches to -stable once they hit
Linus' tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 12:45:24 -07:00
Gao feng c8fc51cfa7 netfilter: ipt_ULOG: fix incorrect setting of ulog timer
The parameter of setup_timer should be &ulog->nlgroup[i].
the incorrect parameter will cause kernel panic in
ulog_timer.

Bug introducted in commit 355430671a
"netfilter: ipt_ULOG: add net namespace support for ipt_ULOG"

ebt_ULOG doesn't have this problem.

[ I have mangled this patch to fix nlgroup != 0 case, we were
  also crashing there --pablo ]

Tested-by: George Spelvin <linux@horizon.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-06-24 17:10:44 +02:00
Rami Rosen af92e5425e inet: frag , remove an empty ifdef.
This patch removes an empty ifdef from inet_frag_intern()
in net/ipv4/inet_fragment.c.

commit b67bfe0d42
(hlist: drop the node parameter from iterators) removed hlist from
net/ipv4/inet_fragment.c, but did not remove the enclosing ifdef command,
which is now empty.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 23:06:52 -07:00
Cong Wang bcefe17cff tcp: introduce a per-route knob for quick ack
In previous discussions, I tried to find some reasonable heuristics
for delayed ACK, however this seems not possible, according to Eric:

	"ACKS might also be delayed because of bidirectional
	traffic, and is more controlled by the application
	response time. TCP stack can not easily estimate it."

	"ACK can be incredibly useful to recover from losses in
	a short time.

	The vast majority of TCP sessions are small lived, and we
	send one ACK per received segment anyway at beginning or
	retransmits to let the sender smoothly increase its cwnd,
	so an auto-tuning facility wont help them that much."

and according to David:

	"ACKs are the only information we have to detect loss.

	And, for the same reasons that TCP VEGAS is fundamentally
	broken, we cannot measure the pipe or some other
	receiver-side-visible piece of information to determine
	when it's "safe" to stretch ACK.

	And even if it's "safe", we should not do it so that losses are
	accurately detected and we don't spuriously retransmit.

	The only way to know when the bandwidth increases is to
	"test" it, by sending more and more packets until drops happen.
	That's why all successful congestion control algorithms must
	operate on explicited tested pieces of information.

	Similarly, it's not really possible to universally know if
	it's safe to stretch ACK or not."

It still makes sense to enable or disable quick ack mode like
what TCP_QUICK_ACK does.

Similar to TCP_QUICK_ACK option, but for people who can't
modify the source code and still wants to control
TCP delayed ACK behavior. As David suggested, this should belong
to per-path scope, since different pathes may want different
behaviors.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 23:06:51 -07:00
Weiping Pan 9ef71e0c82 tcp:typo unset should be unsent
Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 22:21:09 -07:00
Aydin Arik c0353c7b5d ipv4: Fixed MD5 key lookups when adding/ removing MD5 to/ from TCP sockets.
MD5 key lookups on a given TCP socket were being performed
incorrectly. This fix alters parameter inputs to the MD5
lookup function tcp_md5_do_lookup, which is called by functions
tcp_md5_do_add and tcp_md5_do_del. Specifically, the change now
inputs the correct address and address family required to make
a proper lookup.

Signed-off-by: Aydin Arik <aydin.arik@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 21:21:53 -07:00
Pravin B Shelar 3d7b46cd20 ip_tunnel: push generic protocol handling to ip_tunnel module.
Process skb tunnel header before sending packet to protocol handler.
this allows code sharing between gre and ovs gre modules.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:41 -07:00
Pravin B Shelar 0e6fbc5b6c ip_tunnels: extend iptunnel_xmit()
Refactor various ip tunnels xmit functions and extend iptunnel_xmit()
so that there is more code sharing.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:41 -07:00
Pravin B Shelar 45f2e9976c gre: export gre_handle_offloads() function.
This is required for OVS GRE offloading.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:41 -07:00
Pravin B Shelar 752f36da68 gre: export gre_build_header() function.
This is required for ovs gre module.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:40 -07:00
Pravin B Shelar bda7bb4634 gre: Allow multiple protocol listener for gre protocol.
Currently there is only one user is allowed to register for gre
protocol.  Following patch adds de-multiplexer.  So that multiple
modules can listen on gre protocol e.g. kernel gre devices and ovs.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:40 -07:00
Pravin B Shelar 20fd4d1f04 gre: Simplify gre protocol registration locking.
Use cmpxchg() for atomic protocol registration which saves
code and data space.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:07:40 -07:00
David S. Miller d98cae64e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/Kconfig
	drivers/net/xen-netback/netback.c
	net/batman-adv/bat_iv_ogm.c
	net/wireless/nl80211.c

The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.

The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().

Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.

The nl80211 conflict is a little more involved.  In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported.  Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.

However, the dump handlers to not use this logic.  Instead they have
to explicitly do the locking.  There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so.  So I fixed that whilst handling the overlapping changes.

To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 16:49:39 -07:00
Eric Dumazet d3b6f61418 ip_tunnel: remove __net_init/exit from exported functions
If CONFIG_NET_NS is not set then __net_init is the same as __init and
__net_exit is the same as __exit. These functions will be removed from
memory after the module loads or is removed. Functions that are exported
for use by other functions should never be labeled for removal.

Bug introduced by commit c544193214
("GRE: Refactor GRE tunneling code.")

Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 03:00:59 -07:00
Saurabh Mohan baafc77b32 net/ipv4: ip_vti clear skb cb before tunneling.
If users apply shaper to vti tunnel then it will cause a kernel crash. The
problem seems to be due to the vti_tunnel_xmit function not clearing
skb->opt field before passing the packet to xfrm tunneling code.

Signed-off-by: Saurabh Mohan <saurabh@vyatta.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:47:46 -07:00
Yuchung Cheng 85f16525a2 tcp: properly send new data in fast recovery in first RTT
Linux sends new unset data during disorder and recovery state if all
(suspected) lost packets have been retransmitted ( RFC5681, section
3.2 step 1 & 2, RFC3517 section 4, NexSeg() Rule 2).  One requirement
is to keep the receive window about twice the estimated sender's
congestion window (tcp_rcv_space_adjust()), assuming the fast
retransmits repair the losses in the next round trip.

But currently it's not the case on the first round trip in either
normal or Fast Open connection, beucase the initial receive window
is identical to (expected) sender's initial congestion window. The
fix is to double it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:46:29 -07:00
Joe Perches fe2c6338fd net: Convert uses of typedef ctl_table to struct ctl_table
Reduce the uses of this unnecessary typedef.

Done via perl script:

$ git grep --name-only -w ctl_table net | \
  xargs perl -p -i -e '\
	sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
        s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

Reflow the modified lines that now exceed 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:36:09 -07:00
Wu Fengguang a06a2d378d net: ping_check_bind_addr() etc. can be static
net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?

net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:36:41 -07:00
Eric Dumazet 7c0cadc69c udp: fix two sparse errors
commit ba418fa357 ("soreuseport: UDP/IPv4 implementation")
added following sparse errors :

net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:433:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:433:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:514:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:514:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:03:24 -07:00
Eric Dumazet 5b9b626377 gro: remove a sparse error
Fix following sparse error :

net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
integer

added in commit db8caf3dbc
("gro: should aggregate frames without DF")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:03:24 -07:00
Eric Dumazet c70eba7453 igmp: fix new sparse errors
Fix following sparse errors :

net/ipv4/igmp.c:1222:25: warning: cast from restricted __be32
net/ipv4/igmp.c🔢31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c🔢31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c🔢31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:1250:31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c:1250:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c:1250:31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:2380:37: warning: cast from restricted __be32

These were added by commit e989707135
("igmp: hash a hash table to speedup ip_check_mc_rcu()")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 14:14:55 -07:00
Daniel Borkmann da5bab079f net: udp4: move GSO functions to udp_offload
Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:47:25 -07:00
Shawn Bohrer 946d3bd723 igmp: remove unnecessary in_device member zeroing
ip_mc_init_dev() is passed a freshly kzalloc'd in_device so it is
unnecessary to explicitly zero out the members.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:41:15 -07:00
Eric Dumazet e989707135 igmp: hash a hash table to speedup ip_check_mc_rcu()
After IP route cache removal, multicast applications using
a lot of multicast addresses hit a O(N) behavior in ip_check_mc_rcu()

Add a per in_device hash table to get faster lookup.

This hash table is created only if the number of items in mc_list is
above 4.

Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:25:23 -07:00
Cong Wang 30f3a40f9a net: remove last caller of skb_tail_offset() and itself
Similar to the following commits:

commit 00f97da17a (netpoll: fix position of network header)
commit 525cebedb3 (pktgen: Fix position of ip and udp header)

using skb_tail_offset() seems not correct since the offset
is based on head pointer.

With the last caller removed, skb_tail_offset() can be killed
finally.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 22:22:23 -07:00
Eliezer Tamir d30e383bb8 tcp: add low latency socket poll support.
Adds low latency socket poll support for TCP.
In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
from the skb to the sk.
In tcp_recvmsg(), when there is no data in the socket we busy-poll.
This is a good example of how to add busy-poll support to more protocols.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:36 -07:00
Eliezer Tamir a5b50476f7 udp: add low latency socket poll support
Add upport for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram When there is no data and the user
tries to read we busy poll.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:36 -07:00
Eliezer Tamir 0602129286 net: add low latency socket poll
Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:35 -07:00
Daniel Borkmann 28850dc7c7 net: tcp: move GRO/GSO functions to tcp_offload
Would be good to make things explicit and move those functions to
a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c.
While moving all related functions into tcp_offload.c, we can also
make some of them static, since they are only used there. Also, add
an explicit registration function.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 14:39:05 -07:00
Daniel Borkmann 5ee9859157 net: minor: tcp: use tcp_skb_mss helper in tcp_tso_segment
We have the minimal inline helper tcp_skb_mss to access
skb_shinfo(skb)->gso_size, so also use it here to get mss.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 14:39:05 -07:00
David S. Miller 143554ace8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Conflicts:
	net/netfilter/nf_log.c

The conflict in nf_log.c is that in 'net' we added CONFIG_PROC_FS
protection around foo_proc_entry() calls to fix a build failure,
whereas in Pablo's tree a guard if() test around a call is
remove_proc_entry() was removed.  Trivially resolved.

Pablo Neira Ayuso says:

====================
The following patchset contains the first batch of
Netfilter/IPVS updates for your net-next tree, they are:

* Three patches with improvements and code refactorization
  for nfnetlink_queue, from Florian Westphal.

* FTP helper now parses replies without brackets, as RFC1123
  recommends, from Jeff Mahoney.

* Rise a warning to tell everyone about ULOG deprecation,
  NFLOG has been already in the kernel tree for long time
  and supersedes the old logging over netlink stub, from
  myself.

* Don't panic if we fail to load netfilter core framework,
  just bail out instead, from myself.

* Add cond_resched_rcu, used by IPVS to allow rescheduling
  while walking over big hashtables, from Simon Horman.

* Change type of IPVS sysctl_sync_qlen_max sysctl to avoid
  possible overflow, from Zhang Yanfei.

* Use strlcpy instead of strncpy to skip zeroing of already
  initialized area to write the extension names in ebtables,
  from Chen Gang.

* Use already existing per-cpu notrack object from xt_CT,
  from Eric Dumazet.

* Save explicit socket lookup in xt_socket now that we have
  early demux, also from Eric Dumazet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06 01:03:06 -07:00
David S. Miller 6bc19fb82d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge 'net' bug fixes into 'net-next' as we have patches
that will build on top of them.

This merge commit includes a change from Emil Goode
(emilgoode@gmail.com) that fixes a warning that would
have been introduced by this merge.  Specifically it
fixes the pingv6_ops method ipv6_chk_addr() to add a
"const" to the "struct net_device *dev" argument and
likewise update the dummy_ipv6_chk_addr() declaration.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-05 16:37:30 -07:00
Cong Wang c26d6b46da ping: always initialize ->sin6_scope_id and ->sin6_flowinfo
If we don't need scope id, we should initialize it to zero.
Same for ->sin6_flowinfo.

Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04 16:58:42 -07:00
Jean Sacren 4960c2c6fa Kconfig: remove dangling references to the deleted file
Commit 202dc3fc59 (Documentation: remove
obsolete networking/multicast.txt file) deleted the obsolete file. After
the file has been removed, clean up a couple of places where references
to the deleted file were made so that users wouldn't be confused when
they consult the Help menu.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04 15:17:39 -07:00
Lorenzo Colitti d862e54614 net: ipv6: Implement /proc/net/icmp6.
The format is based on /proc/net/icmp and /proc/net/{udp,raw}6.

Compiles and displays reasonable results with CONFIG_IPV6={n,m,y}
Couldn't figure out how to test without CONFIG_PROC_FS enabled.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04 12:56:14 -07:00
Lorenzo Colitti 8cc785f6f4 net: ipv4: make the ping /proc code AF-independent
Introduce a ping_seq_afinfo structure (similar to its UDP
equivalent) and use it to make some of the ping /proc functions
address-family independent. Rename the remaining ping /proc
functions from ping_* to ping_v4_*.

Compiles and displays reasonable results with CONFIG_IPV6={n,m,y}

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04 12:56:14 -07:00
Cong Wang 9a99d4a50c icmp: avoid allocating large struct on stack
struct icmp_bxm is a large struct, reduce stack usage
by allocating it on heap.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03 00:28:44 -07:00
Rami Rosen 08578d8d4e ] icmp: fix icmp_unreach() comment.
ICMP_PARAMETERPROB is handled by icmp_unreach(); This patch adds
ICMP_PARAMETERPROB to the list of ICMP message types handled by icmp_unreach().

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03 00:27:15 -07:00
Timo Teräs 5aad1de5ea ipv4: use separate genid for next hop exceptions
commit 13d82bf5 (ipv4: Fix flushing of cached routing informations)
added the support to flush learned pmtu information.

However, using rt_genid is quite heavy as it is bumped on route
add/change and multicast events amongst other places. These can
happen quite often, especially if using dynamic routing protocols.

While this is ok with routes (as they are just recreated locally),
the pmtu information is learned from remote systems and the icmp
notification can come with long delays. It is worthy to have separate
genid to avoid excessive pmtu resets.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03 00:07:43 -07:00
Timo Teräs f016229e30 ipv4: rate limit updating of next hop exceptions with same pmtu
The tunnel devices call update_pmtu for each packet sent, this causes
contention on the fnhe_lock. Ignore the pmtu update if pmtu is not
actually changed, and there is still plenty of time before the entry
expires.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03 00:07:43 -07:00
Timo Teräs 387aa65a89 ipv4: properly refresh rtable entries on pmtu/redirect events
This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
IPsec pmtu events). Flushing all cached entries is not needed.

Instead, invalidate only the related next hop dsts to recheck for
the added next hop exception where needed. This also fixes a subtle
race due to bumping generation id's before updating the pmtu.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-03 00:07:42 -07:00
Nicolas Dichtel 32b8a8e59c sit: add IPv4 over IPv4 support
This patch adds the support of IPv4 over Ipv4 for the module sit. The gain of
this feature is to be able to have 4in4 and 6in4 over the same interface
instead of having one interface for 6in4 and another for 4in4 even if
encapsulation addresses are the same.

To avoid conflicting with ipip module, sit IPv4 over IPv4 protocol is
registered with a smaller priority.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31 17:19:05 -07:00
Nicolas Dichtel bf3d6a8f79 iptunnel: specify protocol outside IP header
Before this patch, ip_tunnel_xmit() was using the field protocol from the IP
header passed into argument.
There is no functional change, this patch prepares the support of IPv4 over
IPv4 for module sit.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31 17:19:05 -07:00
Eric Dumazet db8caf3dbc gro: should aggregate frames without DF
GRO on IPv4 doesn't aggregate frames if they don't have DF bit set.

Some servers use IP_MTU_DISCOVER/IP_PMTUDISC_PROBE, so linux receivers
are unable to aggregate this kind of traffic.

The right thing to do is to allow aggregation as long as the DF bit has
same value on all segments.

bnx2x LRO does this correctly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31 16:25:56 -07:00
David Majnemer c3f1dbaf6e net: Update RFS target at poll for tcp/udp
The current state of affairs is that read()/write() will setup
RFS (Receive Flow Steering) for internet protocol sockets while
poll()/epoll() does not.

When poll() gets called with a TCP or UDP socket, we should update
the flow target.

This permits to RFS (if enabled) to select the appropriate CPU for
following incoming packets.

Note: Only connected UDP sockets can benefit from RFS.

Signed-off-by: David Majnemer <majnemer@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-31 16:24:43 -07:00
Yuchung Cheng c7d9d6a185 tcp: undo on DSACK during recovery
If the receiver supports DSACK, sender can detect false recoveries and
revert cwnd reductions triggered by either severe network reordering or
concurrent reordering and loss event.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-30 18:06:11 -07:00
Yuchung Cheng 7026b912f9 tcp: fix undo on partial ack in recovery
Upon detecting spurious fast retransmit via timestamps during recovery,
use PRR to clock out new data packet instead of retransmission. Once
all retransmission are proven spurious, the sender then reverts the
cwnd reduction and congestion state to open or disorder.

The current code does the opposite: it undoes cwnd as soon as any
retransmission is spurious and continues to retransmit until all
data are acked. This nullifies the point to undo the cwnd because
the sender is still retransmistting spuriously. This patch fixes
it. The undo_ssthresh argument of tcp_undo_cwnd_reductiuon() is no
longer needed and is removed.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-30 18:06:11 -07:00
Yuchung Cheng 6a63df46a7 tcp: refactor undo functions
Refactor and relocate various functions or variables to prepare the
undo fix.  Remove some unused function arguments. Rename tcp_undo_cwr
to tcp_undo_cwnd_reduction to be consistent with the rest of
CWR related function names.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-30 18:06:11 -07:00
Yuchung Cheng 6804973ffb tcp: consolidate PRR packet accounting
This patch series fixes an undo bug in fast recovery: the sender
mistakenly undos the cwnd too early but continues fast retransmits
until all pending data are acked. This also multiplies the SNMP
stat PARTIALUNDO events by the degree of the network reordering.

The first patch prepares the fix by consolidating the accounting
of newly_acked_sacked in tcp_cwnd_reduction(), instead of updating
newly_acked_sacked everytime sacked_out is adjusted.  Also pass
acked and prior_unsacked as const type because they are readonly
in the rest of recovery processing.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-30 18:06:11 -07:00
David S. Miller 73ce00d4d6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter/IPVS fixes for 3.10-rc3,
they are:

* fix xt_addrtype with IPv6, from Florian Westphal. This required
  a new hook for IPv6 functions in the netfilter core to avoid
  hard dependencies with the ipv6 subsystem when this match is
  only used for IPv4.

* fix connection reuse case in IPVS. Currently, if an reused
  connection are directed to the same server. If that server is
  down, those connection would fail. Therefore, clear the
  connection and choose a new server among the available ones.

* fix possible non-nul terminated string sent to user-space if
  ipt_ULOG is used as the default netfilter logging stub, from
  Chen Gang.

* fix mark logging of IPv6 packets in xt_LOG, from Michal Kubecek.
  This bug has been there since 2.6.26.

* Fix breakage ip_vs_sh due to incorrect structure layout for
  RCU, from Jan Beulich.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-30 16:38:38 -07:00
Simon Horman 7cc4619005 net, ipv4, ipv6: Correct assignment of skb->network_header to skb->tail
This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer however skb->network_header is now
an offset.

This patch corrects the problem by adding a wrapper to return skb tail as
an offset regardless of the value of NET_SKBUFF_DATA_USES_OFFSET. It seems
that skb->tail that this offset may be more than 64k and some care has been
taken to treat such cases as an error.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:49:07 -07:00
Simon Horman f7c0c2ae84 ipv4: Correct comparisons and calculations using skb->tail and skb-transport_header
This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer whereas skb->transport_header
will be an offset from head. This is corrected by using wrappers that
ensure that comparisons and calculations are always made using pointers.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:49:07 -07:00
Cong Wang 75538c2b85 net: always pass struct netdev_notifier_info to netdevice notifiers
commit 351638e7de (net: pass info struct via netdevice notifier)
breaks booting of my KVM guest, this is due to we still forget to pass
struct netdev_notifier_info in several places. This patch completes it.

Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 21:58:54 -07:00
Timo Teräs 6c8b4e3ff8 arp: flush arp cache on IFF_NOARP change
IFF_NOARP affects what kind of neighbor entries are created
(nud NOARP or nud INCOMPLETE). If the flag changes, flush the arp
cache to refresh all entries.

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: shortened notifier_info struct name
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:02 -07:00
Jiri Pirko 351638e7de net: pass info struct via netdevice notifier
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>

v2->v3: fix typo on simeth
	shortened dev_getter
	shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 13:11:01 -07:00