Commit graph

24448 commits

Author SHA1 Message Date
Eric Dumazet 404e0a8b6a net: ipv4: fix RCU races on dst refcounts
commit c6cffba4ff (ipv4: Fix input route performance regression.)
added various fatal races with dst refcounts.

crashes happen on tcp workloads if routes are added/deleted at the same
time.

The dst_free() calls from free_fib_info_rcu() are clearly racy.

We need instead regular dst refcounting (dst_release()) and make
sure dst_release() is aware of RCU grace periods :

Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period
before dst destruction for cached dst

Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
to make sure we dont increase a zero refcount (On a dst currently
waiting an rcu grace period before destruction)

rt_cache_route() must take a reference on the new cached route, and
release it if was not able to install it.

With this patch, my machines survive various benchmarks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-30 14:53:22 -07:00
Eric Dumazet cca32e4bf9 net: TCP early demux cleanup
early_demux() handlers should be called in RCU context, and as we
use skb_dst_set_noref(skb, dst), caller must not exit from RCU context
before dst use (skb_dst(skb)) or release (skb_drop(dst))

Therefore, rcu_read_lock()/rcu_read_unlock() pairs around
->early_demux() are confusing and not needed :

Protocol handlers are already in an RCU read lock section.
(__netif_receive_skb() does the rcu_read_lock() )

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-30 14:53:21 -07:00
Guanjun He a2a3258417 libceph: prevent the race of incoming work during teardown
Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.

Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:53 -07:00
Sage Weil a16cb1f707 libceph: fix messenger retry
In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time.  This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:52 -07:00
Sage Weil cd43045c2d libceph: initialize rb, list nodes in ceph_osd_request
These don't strictly need to be initialized based on how they are used, but
it is good practice to do so.

Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:51 -07:00
Sage Weil d50b409fb8 libceph: initialize msgpool message types
Initialize the type field for messages in a msgpool.  The caller was doing
this for osd ops, but not for the reply messages.

Reported-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 09:29:50 -07:00
Li Wei 8253947e2c ipv6: fix incorrect route 'expires' value passed to userspace
When userspace use RTM_GETROUTE to dump route table, with an already
expired route entry, we always got an 'expires' value(2147157)
calculated base on INT_MAX.

The reason of this problem is in the following satement:
	rt->dst.expires - jiffies < INT_MAX
gcc promoted the type of both sides of '<' to unsigned long, thus
a small negative value would be considered greater than INT_MAX.

With the help of Eric Dumazet, do the out of bound checks in
rtnl_put_cacheinfo(), _after_ conversion to clock_t.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-29 23:18:31 -07:00
Lin Ming 61648d91fc ipv4: clean up put_child
The first parameter struct trie *t is not used anymore.
Remove it.

Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-29 23:18:30 -07:00
Lin Ming 4ea4bf7ebc ipv4: fix debug info in tnode_new
It should print size of struct rt_trie_node * allocated instead of size
of struct rt_trie_node.

Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-29 23:18:30 -07:00
Al Viro faf0201029 clean unix_bind() up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:15 +04:00
Al Viro a8104a9fcd pull mnt_want_write()/mnt_drop_write() into kern_path_create()/done_path_create() resp.
One side effect - attempt to create a cross-device link on a read-only fs fails
with EROFS instead of EXDEV now.  Makes more sense, POSIX allows, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:15 +04:00
Al Viro 921a1650de new helper: done_path_create()
releases what needs to be released after {kern,user}_path_create()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:13 +04:00
Jiri Kosina 59ea33a68a tcp: perform DMA to userspace only if there is a task waiting for it
Back in 2006, commit 1a2449a87b ("[I/OAT]: TCP recv offload to I/OAT")
added support for receive offloading to IOAT dma engine if available.

The code in tcp_rcv_established() tries to perform early DMA copy if
applicable. It however does so without checking whether the userspace
task is actually expecting the data in the buffer.

This is not a problem under normal circumstances, but there is a corner
case where this doesn't work -- and that's when MSG_TRUNC flag to
recvmsg() is used.

If the IOAT dma engine is not used, the code properly checks whether
there is a valid ucopy.task and the socket is owned by userspace, but
misses the check in the dmaengine case.

This problem can be observed in real trivially -- for example 'tbench' is a
good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing
IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they
have been already early-copied in tcp_rcv_established() using dma engine.

This patch introduces the same check we are performing in the simple
iovec copy case to the IOAT case as well. It fixes the indefinite
recvmsg(MSG_TRUNC) hangs.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27 13:45:51 -07:00
Jesse Gross 6081030769 Revert "openvswitch: potential NULL deref in sample()"
This reverts commit 5b3e7e6cb5.

The problem that the original commit was attempting to fix can
never happen in practice because validation is done one a per-flow
basis rather than a per-packet basis.  Adding additional checks at
runtime is unnecessary and inconsistent with the rest of the code.

CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27 13:45:51 -07:00
Eric Dumazet 505fbcf035 ipv4: fix TCP early demux
commit 92101b3b2e (ipv4: Prepare for change of rt->rt_iif encoding.)
invalidated TCP early demux, because rx_dst_ifindex is not properly
initialized and checked.

Also remove the use of inet_iif(skb) in favor or skb->skb_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27 13:45:51 -07:00
Jiri Benc b1beb681cb net: fix rtnetlink IFF_PROMISC and IFF_ALLMULTI handling
When device flags are set using rtnetlink, IFF_PROMISC and IFF_ALLMULTI
flags are handled specially. Function dev_change_flags sets IFF_PROMISC and
IFF_ALLMULTI bits in dev->gflags according to the passed value but
do_setlink passes a result of rtnl_dev_combine_flags which takes those bits
from dev->flags.

This can be easily trigerred by doing:

tcpdump -i eth0 &
ip l s up eth0

ip sets IFF_UP flag in ifi_flags and ifi_change, which is combined with
IFF_PROMISC by rtnl_dev_combine_flags, causing __dev_change_flags to set
IFF_PROMISC in gflags.

Reported-by: Max Matveev <makc@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27 13:45:50 -07:00
Hangbin Liu 4249357010 tcp: Add TCP_USER_TIMEOUT negative value check
TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But
patch "tcp: Add TCP_USER_TIMEOUT socket option"(dca43c75) didn't check the negative
values. If a user assign -1 to it, the socket will set successfully and wait
for 4294967295 miliseconds. This patch add a negative value check to avoid
this issue.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-27 13:45:50 -07:00
Linus Torvalds aa0b3b2bee Merge branch 'for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds
Pull LED subsystem update from Bryan Wu.

* 'for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: (50 commits)
  leds-lp8788: forgotten unlock at lp8788_led_work
  LEDS: propagate error codes in blinkm_detect()
  LEDS: memory leak in blinkm_led_common_set()
  leds: add new lp8788 led driver
  LEDS: add BlinkM RGB LED driver, documentation and update MAINTAINERS
  leds: max8997: Simplify max8997_led_set_mode implementation
  leds/leds-s3c24xx: use devm_gpio_request
  leds: convert Network Space v2 LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert DAC124S085 LED driver to devm_kzalloc()
  leds: convert LM3530 LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert TCA6507 LED driver to devm_kzalloc()
  leds: convert Freescale MC13783 LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert ADP5520 LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert PCA955x LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert Sun Fire LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert PCA9532 LED driver to devm_kzalloc()
  leds: convert LT3593 LED driver to devm_kzalloc()
  leds: convert Renesas TPU LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert LP5523 LED driver to devm_kzalloc() and cleanup error exit path
  leds: convert PCA9633 LED driver to devm_kzalloc()
  ...
2012-07-26 20:26:27 -07:00
Linus Torvalds 1e30c1b386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates and fixes from David Miller:

1) Reinstate the no-ref optimization for input route lookups in ipv4 to
   fix some routing cache removal perf regressions.

2) Make TCP socket pre-demux work on ipv6 side too, from Eric Dumazet.

3) Get RX hash value from correct place in be2net driver, from
   Sarveshwar Bandi.

4) Validation of FIB cached routes missing critical check, from Eric
   Dumazet.

5) EEH support in mlx4 driver, from Kleber Sacilotto de Souza.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits)
  ipv6: Early TCP socket demux
  ipv4: Fix input route performance regression.
  pch_gbe: vlan skb len fix
  pch_gbe: add extra clean tx
  pch_gbe: fix transmit watchdog timeout
  ixgbe: fix panic while dumping packets on Tx hang with IOMMU
  be2net: Fix to parse RSS hash from Receive completions correctly.
  net/mlx4_en: Limit the RFS filter IDs to be < RPS_NO_FILTER
  hyperv: Add error handling to rndis_filter_device_add()
  hyperv: Add a check for ring_size value
  ipv4: rt_cache_valid must check expired routes
  net/pch_gpe: Cannot disable ethernet autonegation
  qeth: repair crash in qeth_l3_vlan_rx_kill_vid()
  netiucv: cleanup attribute usage
  net: wiznet add missing HAS_IOMEM dependency
  be2net: Missing byteswap in be_get_fw_log_level causes oops on PowerPC
  mlx4: Add support for EEH error recovery
  cdc-ncm: tag Ericsson WWAN devices (eg F5521gw) with FLAG_WWAN
  wanmain: comparing array with NULL
  caif: fix NULL pointer check
  ...
2012-07-26 18:09:01 -07:00
Eric Dumazet c7109986db ipv6: Early TCP socket demux
This is the IPv6 missing bits for infrastructure added in commit
41063e9dd1 (ipv4: Early TCP socket demux.)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26 15:50:39 -07:00
David S. Miller c6cffba4ff ipv4: Fix input route performance regression.
With the routing cache removal we lost the "noref" code paths on
input, and this can kill some routing workloads.

Reinstate the noref path when we hit a cached route in the FIB
nexthops.

With help from Eric Dumazet.

Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26 15:50:39 -07:00
Eric Dumazet 4331debc51 ipv4: rt_cache_valid must check expired routes
commit d2d68ba9fe (ipv4: Cache input routes in fib_info nexthops.)
introduced rt_cache_valid() helper. It unfortunately doesn't check if
route is expired before caching it.

I noticed sk_setup_caps() was constantly called on a tcp workload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-25 15:24:14 -07:00
Alan Cox 8b72ff6484 wanmain: comparing array with NULL
gcc really should warn about these !

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-24 13:55:21 -07:00
Eric Dumazet 9cb429d692 tcp: early_demux fixes
1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux()
   and fix a potential bug if skb->head was reallocated
   (iph & th pointers were not reloaded)

TCP stack will pull/check headers anyway.

2) must reload iph in ip_rcv_finish() after early_demux()
 call since skb->head might have changed.

3) skb->dev->ifindex can be now replaced by skb->skb_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-24 13:54:15 -07:00
Linus Torvalds d14b7a419a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "Trivial updates all over the place as usual."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
  Fix typo in include/linux/clk.h .
  pci: hotplug: Fix typo in pci
  iommu: Fix typo in iommu
  video: Fix typo in drivers/video
  Documentation: Add newline at end-of-file to files lacking one
  arm,unicore32: Remove obsolete "select MISC_DEVICES"
  module.c: spelling s/postition/position/g
  cpufreq: Fix typo in cpufreq driver
  trivial: typo in comment in mksysmap
  mach-omap2: Fix typo in debug message and comment
  scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
  Change email address for Steve Glendinning
  Btrfs: fix typo in convert_extent_bit
  via: Remove bogus if check
  netprio_cgroup.c: fix comment typo
  backlight: fix memory leak on obscure error path
  Documentation: asus-laptop.txt references an obsolete Kconfig item
  Documentation: ManagementStyle: fixed typo
  mm/vmscan: cleanup comment error in balance_pgdat
  mm: cleanup on the comments of zone_reclaim_stat
  ...
2012-07-24 13:34:56 -07:00
Linus Torvalds 3c4cfadef6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David S Miller:

 1) Remove the ipv4 routing cache.  Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache.  Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making.  Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought the be simple to fix at this
    point.  Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

 2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

 3) TCP SYN/ACK performance tweaks from Eric Dumazet.

 4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

 5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

 6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

 7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

 8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

 9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer.  Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket.  The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too.  From Eric
    Dumazet.

15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
  genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
  r8169: revert "add byte queue limit support".
  ipv4: Change rt->rt_iif encoding.
  net: Make skb->skb_iif always track skb->dev
  ipv4: Prepare for change of rt->rt_iif encoding.
  ipv4: Remove all RTCF_DIRECTSRC handliing.
  ipv4: Really ignore ICMP address requests/replies.
  decnet: Don't set RTCF_DIRECTSRC.
  net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
  ipv4: Remove redundant assignment
  rds: set correct msg_namelen
  openvswitch: potential NULL deref in sample()
  tcp: dont drop MTU reduction indications
  bnx2x: Add new 57840 device IDs
  tcp: avoid oops in tcp_metrics and reset tcpm_stamp
  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
  niu: Fix to check for dma mapping errors.
  net: Fix references to out-of-scope variables in put_cmsg_compat()
  net: ethernet: davinci_emac: add pm_runtime support
  net: ethernet: davinci_emac: Remove unnecessary #include
  ...
2012-07-24 10:01:50 -07:00
WANG Cong 320f5ea0ce genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
lockdep_is_held() is defined when CONFIG_LOCKDEP, not CONFIG_PROVE_LOCKING.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-24 00:01:30 -07:00
Shuah Khan 19cd67e2d5 leds: Rename led_brightness_set() to led_set_brightness()
Rename leds external interface led_brightness_set() to led_set_brightness().
This is the second phase of the change to reduce confusion between the
leds internal and external interfaces that set brightness. With this change,
now the external interface is led_set_brightness(). The first phase renamed
the internal interface led_set_brightness() to __led_set_brightness().
There are no changes to the interface implementations.

Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
2012-07-24 07:52:34 +08:00
David S. Miller 13378cad02 ipv4: Change rt->rt_iif encoding.
On input packet processing, rt->rt_iif will be zero if we should
use skb->dev->ifindex.

Since we access rt->rt_iif consistently via inet_iif(), that is
the only spot whose interpretation have to adjust.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:27 -07:00
David S. Miller b68581778c net: Make skb->skb_iif always track skb->dev
Make it follow device decapsulation, from things such as VLAN and
bonding.

The stuff that actually cares about pre-demuxed device pointers, is
handled by the "orig_dev" variable in __netif_receive_skb().  And
the only consumer of that is the po->origdev feature of AF_PACKET
sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:27 -07:00
David S. Miller 92101b3b2e ipv4: Prepare for change of rt->rt_iif encoding.
Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the ipv4
specific code, and as well it should since doing the route caching for
ipv6 is pointless at the moment since it is not inspected in the ipv6
input paths yet.

Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:26 -07:00
David S. Miller fe3edf4579 ipv4: Remove all RTCF_DIRECTSRC handliing.
The last and final kernel user, ICMP address replies,
has been removed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 13:22:20 -07:00
David S. Miller 838942a594 ipv4: Really ignore ICMP address requests/replies.
Alexey removed kernel side support for requests, and the
only thing we do for replies is log a message if something
doesn't look right.

As Alexey's comment indicates, this belongs in userspace (if
anywhere), and thus we can safely just get rid of this code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 13:20:26 -07:00
David S. Miller 8acfaa9484 decnet: Don't set RTCF_DIRECTSRC.
It's an ipv4 defined route flag, and only ipv4 uses it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 13:16:59 -07:00
Saurabh e7d4b18cbe net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
With CONFIG_SPARSE_RCU_POINTER=y sparse identified references which did not
specificy __rcu in ip_vti.c

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 13:00:54 -07:00
Lin Ming 8fe5cb873b ipv4: Remove redundant assignment
It is redundant to set no_addr and accept_local to 0 and then set them
with other values just after that.

Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 13:00:54 -07:00
Linus Torvalds a66d2c8f7e Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull the big VFS changes from Al Viro:
 "This one is *big* and changes quite a few things around VFS.  What's in there:

   - the first of two really major architecture changes - death to open
     intents.

     The former is finally there; it was very long in making, but with
     Miklos getting through really hard and messy final push in
     fs/namei.c, we finally have it.  Unlike his variant, this one
     doesn't introduce struct opendata; what we have instead is
     ->atomic_open() taking preallocated struct file * and passing
     everything via its fields.

     Instead of returning struct file *, it returns -E...  on error, 0
     on success and 1 in "deal with it yourself" case (e.g.  symlink
     found on server, etc.).

     See comments before fs/namei.c:atomic_open().  That made a lot of
     goodies finally possible and quite a few are in that pile:
     ->lookup(), ->d_revalidate() and ->create() do not get struct
     nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
     flags instead, ->create() gets "do we want it exclusive" flag.

     With the introduction of new helper (kern_path_locked()) we are rid
     of all struct nameidata instances outside of fs/namei.c; it's still
     visible in namei.h, but not for long.  Come the next cycle,
     declaration will move either to fs/internal.h or to fs/namei.c
     itself.  [me, miklos, hch]

   - The second major change: behaviour of final fput().  Now we have
     __fput() done without any locks held by caller *and* not from deep
     in call stack.

     That obviously lifts a lot of constraints on the locking in there.
     Moreover, it's legal now to call fput() from atomic contexts (which
     has immediately simplified life for aio.c).  We also don't need
     anti-recursion logics in __scm_destroy() anymore.

     There is a price, though - the damn thing has become partially
     asynchronous.  For fput() from normal process we are guaranteed
     that pending __fput() will be done before the caller returns to
     userland, exits or gets stopped for ptrace.

     For kernel threads and atomic contexts it's done via
     schedule_work(), so theoretically we might need a way to make sure
     it's finished; so far only one such place had been found, but there
     might be more.

     There's flush_delayed_fput() (do all pending __fput()) and there's
     __fput_sync() (fput() analog doing __fput() immediately).  I hope
     we won't need them often; see warnings in fs/file_table.c for
     details.  [me, based on task_work series from Oleg merged last
     cycle]

   - sync series from Jan

   - large part of "death to sync_supers()" work from Artem; the only
     bits missing here are exofs and ext4 ones.  As far as I understand,
     those are going via the exofs and ext4 trees resp.; once they are
     in, we can put ->write_super() to the rest, along with the thread
     calling it.

   - preparatory bits from unionmount series (from dhowells).

   - assorted cleanups and fixes all over the place, as usual.

  This is not the last pile for this cycle; there's at least jlayton's
  ESTALE work and fsfreeze series (the latter - in dire need of fixes,
  so I'm not sure it'll make the cut this cycle).  I'll probably throw
  symlink/hardlink restrictions stuff from Kees into the next pile, too.
  Plus there's a lot of misc patches I hadn't thrown into that one -
  it's large enough as it is..."

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
  ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
  btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
  switch dentry_open() to struct path, make it grab references itself
  spufs: shift dget/mntget towards dentry_open()
  zoran: don't bother with struct file * in zoran_map
  ecryptfs: don't reinvent the wheels, please - use struct completion
  don't expose I_NEW inodes via dentry->d_inode
  tidy up namei.c a bit
  unobfuscate follow_up() a bit
  ext3: pass custom EOF to generic_file_llseek_size()
  ext4: use core vfs llseek code for dir seeks
  vfs: allow custom EOF in generic_file_llseek code
  vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
  vfs: Remove unnecessary flushing of block devices
  vfs: Make sys_sync writeout also block device inodes
  vfs: Create function for iterating over block devices
  vfs: Reorder operations during sys_sync
  quota: Move quota syncing to ->sync_fs method
  quota: Split dquot_quota_sync() to writeback and cache flushing part
  vfs: Move noop_backing_dev_info check from sync into writeback
  ...
2012-07-23 12:27:27 -07:00
Weiping Pan 06b6a1cf6e rds: set correct msg_namelen
Jay Fenlason (fenlason@redhat.com) found a bug,
that recvfrom() on an RDS socket can return the contents of random kernel
memory to userspace if it was called with a address length larger than
sizeof(struct sockaddr_in).
rds_recvmsg() also fails to set the addr_len paramater properly before
returning, but that's just a bug.
There are also a number of cases wher recvfrom() can return an entirely bogus
address. Anything in rds_recvmsg() that returns a non-negative value but does
not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
at the end of the while(1) loop will return up to 128 bytes of kernel memory
to userspace.

And I write two test programs to reproduce this bug, you will see that in
rds_server, fromAddr will be overwritten and the following sock_fd will be
destroyed.
Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is
better to make the kernel copy the real length of address to user space in
such case.

How to run the test programs ?
I test them on 32bit x86 system, 3.5.0-rc7.

1 compile
gcc -o rds_client rds_client.c
gcc -o rds_server rds_server.c

2 run ./rds_server on one console

3 run ./rds_client on another console

4 you will see something like:
server is waiting to receive data...
old socket fd=3
server received data from client:data from client
msg.msg_namelen=32
new socket fd=-1067277685
sendmsg()
: Bad file descriptor

/***************** rds_client.c ********************/

int main(void)
{
	int sock_fd;
	struct sockaddr_in serverAddr;
	struct sockaddr_in toAddr;
	char recvBuffer[128] = "data from client";
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if (sock_fd < 0) {
		perror("create socket error\n");
		exit(1);
	}

	memset(&serverAddr, 0, sizeof(serverAddr));
	serverAddr.sin_family = AF_INET;
	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serverAddr.sin_port = htons(4001);

	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
		perror("bind() error\n");
		close(sock_fd);
		exit(1);
	}

	memset(&toAddr, 0, sizeof(toAddr));
	toAddr.sin_family = AF_INET;
	toAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	toAddr.sin_port = htons(4000);
	msg.msg_name = &toAddr;
	msg.msg_namelen = sizeof(toAddr);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = strlen(recvBuffer) + 1;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	if (sendmsg(sock_fd, &msg, 0) == -1) {
		perror("sendto() error\n");
		close(sock_fd);
		exit(1);
	}

	printf("client send data:%s\n", recvBuffer);

	memset(recvBuffer, '\0', 128);

	msg.msg_name = &toAddr;
	msg.msg_namelen = sizeof(toAddr);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;
	if (recvmsg(sock_fd, &msg, 0) == -1) {
		perror("recvmsg() error\n");
		close(sock_fd);
		exit(1);
	}

	printf("receive data from server:%s\n", recvBuffer);

	close(sock_fd);

	return 0;
}

/***************** rds_server.c ********************/

int main(void)
{
	struct sockaddr_in fromAddr;
	int sock_fd;
	struct sockaddr_in serverAddr;
	unsigned int addrLen;
	char recvBuffer[128];
	struct msghdr msg;
	struct iovec iov;

	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
	if(sock_fd < 0) {
		perror("create socket error\n");
		exit(0);
	}

	memset(&serverAddr, 0, sizeof(serverAddr));
	serverAddr.sin_family = AF_INET;
	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
	serverAddr.sin_port = htons(4000);
	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
		perror("bind error\n");
		close(sock_fd);
		exit(1);
	}

	printf("server is waiting to receive data...\n");
	msg.msg_name = &fromAddr;

	/*
	 * I add 16 to sizeof(fromAddr), ie 32,
	 * and pay attention to the definition of fromAddr,
	 * recvmsg() will overwrite sock_fd,
	 * since kernel will copy 32 bytes to userspace.
	 *
	 * If you just use sizeof(fromAddr), it works fine.
	 * */
	msg.msg_namelen = sizeof(fromAddr) + 16;
	/* msg.msg_namelen = sizeof(fromAddr); */
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_iov->iov_base = recvBuffer;
	msg.msg_iov->iov_len = 128;
	msg.msg_control = 0;
	msg.msg_controllen = 0;
	msg.msg_flags = 0;

	while (1) {
		printf("old socket fd=%d\n", sock_fd);
		if (recvmsg(sock_fd, &msg, 0) == -1) {
			perror("recvmsg() error\n");
			close(sock_fd);
			exit(1);
		}
		printf("server received data from client:%s\n", recvBuffer);
		printf("msg.msg_namelen=%d\n", msg.msg_namelen);
		printf("new socket fd=%d\n", sock_fd);
		strcat(recvBuffer, "--data from server");
		if (sendmsg(sock_fd, &msg, 0) == -1) {
			perror("sendmsg()\n");
			close(sock_fd);
			exit(1);
		}
	}

	close(sock_fd);
	return 0;
}

Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 01:01:44 -07:00
Dan Carpenter 5b3e7e6cb5 openvswitch: potential NULL deref in sample()
If there is no OVS_SAMPLE_ATTR_ACTIONS set then "acts_list" is NULL and
it leads to a NULL dereference when we call nla_len(acts_list).  This
is a static checker fix, not something I have seen in testing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 00:59:54 -07:00
Eric Dumazet 563d34d057 tcp: dont drop MTU reduction indications
ICMP messages generated in output path if frame length is bigger than
mtu are actually lost because socket is owned by user (doing the xmit)

One example is the ipgre_tunnel_xmit() calling
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

We had a similar case fixed in commit a34a101e1e (ipv6: disable GSO on
sockets hitting dst_allfrag).

Problem of such fix is that it relied on retransmit timers, so short tcp
sessions paid a too big latency increase price.

This patch uses the tcp_release_cb() infrastructure so that MTU
reduction messages (ICMP messages) are not lost, and no extra delay
is added in TCP transmits.

Reported-by: Maciej Żenczykowski <maze@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 00:58:46 -07:00
Julian Anastasov 9a0a9502cb tcp: avoid oops in tcp_metrics and reset tcpm_stamp
In tcp_tw_remember_stamp we incorrectly checked tw
instead of tm, it can lead to oops if the cached entry is
not found.

	tcpm_stamp was not updated in tcpm_check_stamp when
tcpm_suck_dst was called, move the update into tcpm_suck_dst,
so that we do not call it infinitely on every next cache hit
after TCP_METRICS_TIMEOUT.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 00:57:12 -07:00
Jesper Juhl 818810472b net: Fix references to out-of-scope variables in put_cmsg_compat()
In net/compat.c::put_cmsg_compat() we may assign 'data' the address of
either the 'ctv' or 'cts' local variables inside the 'if
(!COMPAT_USE_64BIT_TIME)' branch.

Those variables go out of scope at the end of the 'if' statement, so
when we use 'data' further down in 'copy_to_user(CMSG_COMPAT_DATA(cm),
data, cmlen - sizeof(struct compat_cmsghdr))' there's no telling what
it may be refering to - not good.

Fix the problem by simply giving 'ctv' and 'cts' function scope.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 17:50:49 -07:00
David S. Miller 5e9965c15b Merge branch 'kill_rtcache'
The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.

The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.

What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables.  The former of which is
controllable by external entitites.

Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~%10.

The general flow of this patch series is that first the routing cache
is removed.  We build a completely new rtable entry every lookup
request.

Next we make some simplifications due to the fact that removing the
routing cache causes several members of struct rtable to become no
longer necessary.

Then we need to make some amends such that we can legally cache
pre-constructed routes in the FIB nexthops.  Firstly, we need to
invalidate routes which are hit with nexthop exceptions.  Secondly we
have to change the semantics of rt->rt_gateway such that zero means
that the destination is on-link and non-zero otherwise.

Now that the preparations are ready, we start caching precomputed
routes in the FIB nexthops.  Output and input routes need different
kinds of care when determining if we can legally do such caching or
not.  The details are in the commit log messages for those changes.

The patch series then winds down with some more struct rtable
simplifications and other tidy ups that remove unnecessary overhead.

On a SPARC-T3 output route lookups are ~876 cycles.  Input route
lookups are ~1169 cycles with rpfilter disabled, and about ~1468
cycles with rpfilter enabled.

These measurements were taken with the kbench_mod test module in the
net_test_tools GIT tree:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

That GIT tree also includes a udpflood tester tool and stresses
route lookups on packet output.

For example, on the same SPARC-T3 system we can run:

	time ./udpflood -l 10000000 10.2.2.11

with routing cache:
real    1m21.955s       user    0m6.530s        sys     1m15.390s

without routing cache:
real    1m31.678s       user    0m6.520s        sys     1m25.140s

Performance undoubtedly can easily be improved further.

For example fib_table_lookup() performs a lot of excessive
computations with all the masking and shifting, some of it
conditionalized to deal with edge cases.

Also, Eric's no-ref optimization for input route lookups can be
re-instated for the FIB nexthop caching code path.  I would be really
pleased if someone would work on that.

In fact anyone suitable motivated can just fire up perf on the loading
of the test net_test_tools benchmark kernel module.  I spend much of
my time going:

bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
bash# perf report

Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
Hutchings, and others.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 17:04:15 -07:00
Al Viro 6120d3dbb1 get rid of ->scm_work_list
recursion in __scm_destroy() will be cut by delaying final fput()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:00 +04:00
John Fastabend 406a3c638c net: netprio_cgroup: rework update socket logic
Instead of updating the sk_cgrp_prioidx struct field on every send
this only updates the field when a task is moved via cgroup
infrastructure.

This allows sockets that may be used by a kernel worker thread
to be managed. For example in the iscsi case today a user can
put iscsid in a netprio cgroup and control traffic will be sent
with the correct sk_cgrp_prioidx value set but as soon as data
is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
is updated with the kernel worker threads value which is the
default case.

It seems more correct to only update the field when the user
explicitly sets it via control group infrastructure. This allows
the users to manage sockets that may be used with other threads.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:44:01 -07:00
Michael S. Tsirkin dcc0fb782b skbuff: export skb_copy_ubufs
Export skb_copy_ubufs so that modules can orphan frags.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:39:33 -07:00
Michael S. Tsirkin 1080e512d4 net: orphan frags on receive
zero copy packets are normally sent to the outside
network, but bridging, tun etc might loop them
back to host networking stack. If this happens
destructors will never be called, so orphan
the frags immediately on receive.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:39:33 -07:00
Michael S. Tsirkin 70008aa50e skbuff: convert to skb_orphan_frags
Reduce code duplication a bit using the new helper.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:39:33 -07:00
Mark A. Greer 1d69c2b343 rtnl: Add #ifdef CONFIG_RPS around num_rx_queues reference
Commit 76ff5cc919
(rtnl: allow to specify number of rx and tx queues
on device creation) added a reference to the net_device
structure's 'num_rx_queues' member in

	net/core/rtnetlink.c:rtnl_fill_ifinfo()

However, the definition for 'num_rx_queues' is surrounded
by an '#ifdef CONFIG_RPS' while the new reference to it is
not.  This causes a compile error when CONFIG_RPS is not
defined.

Fix the compile error by surrounding the new reference to
'num_rx_queues' by an '#ifdef CONFIG_RPS'.

CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Mark A. Greer <mgreer@animalcreek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:28:11 -07:00
Neil Horman 5aa93bcf66 sctp: Implement quick failover draft from tsvwg
I've seen several attempts recently made to do quick failover of sctp transports
by reducing various retransmit timers and counters.  While its possible to
implement a faster failover on multihomed sctp associations, its not
particularly robust, in that it can lead to unneeded retransmits, as well as
false connection failures due to intermittent latency on a network.

Instead, lets implement the new ietf quick failover draft found here:
http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

This will let the sctp stack identify transports that have had a small number of
errors, and avoid using them quickly until their reliability can be
re-established.  I've tested this out on two virt guests connected via multiple
isolated virt networks and believe its in compliance with the above draft and
works well.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: linux-sctp@vger.kernel.org
CC: joe@perches.com
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:13:46 -07:00
Kevin Groeneveld e3906486f6 net: fix race condition in several drivers when reading stats
Fix race condition in several network drivers when reading stats on 32bit
UP architectures.  These drivers update their stats in a BH context and
therefore should use u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh
instead of u64_stats_fetch_begin/u64_stats_fetch_retry when reading the
stats.

Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:12:32 -07:00
Eric Dumazet 0980e56e50 ipv4: tcp: set unicast_sock uc_ttl to -1
Set unicast_sock uc_ttl to -1 so that we select the right ttl,
instead of sending packets with a 0 ttl.

Bug added in commit be9f4a44e7 (ipv4: tcp: remove per net tcp_sock)

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:06:21 -07:00
David S. Miller c073cfc89f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:

====================
A few bug fixes and small enhancements for net-next/3.6.
 ...
Ansis Atteka (1):
      openvswitch: Do not send notification if ovs_vport_set_options() failed

Ben Pfaff (1):
      openvswitch: Check gso_type for correct sk_buff in queue_gso_packets().

Jesse Gross (2):
      openvswitch: Enable retrieval of TCP flags from IPv6 traffic.
      openvswitch: Reset upper layer protocol info on internal devices.

Leo Alterman (1):
      openvswitch: Fix typo in documentation.

Pravin B Shelar (1):
      openvswitch: Check currect return value from skb_gso_segment()

Raju Subramanian (1):
      openvswitch: Replace Nicira Networks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 16:16:34 -07:00
Ben Pfaff a1b5d0dd28 openvswitch: Check gso_type for correct sk_buff in queue_gso_packets().
At the point where it was used, skb_shinfo(skb)->gso_type referred to a
post-GSO sk_buff.  Thus, it would always be 0.  We want to know the pre-GSO
gso_type, so we need to obtain it before segmenting.

Before this change, the kernel would pass inconsistent data to userspace:
packets for UDP fragments with nonzero offset would be passed along with
flow keys that indicate a zero offset (that is, the flow key for "later"
fragments claimed to be "first" fragments).  This inconsistency tended
to confuse Open vSwitch userspace, causing it to log messages about
"failed to flow_del" the flows with "later" fragments.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2012-07-20 14:47:54 -07:00
Pravin B Shelar 92e5dfc34c openvswitch: Check currect return value from skb_gso_segment()
Fix return check typo.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2012-07-20 14:46:29 -07:00
David S. Miller 2860583fe8 ipv4: Kill rt->fi
It's not really needed.

We only grabbed a reference to the fib_info for the sake of fib_info
local metrics.

However, fib_info objects are freed using RCU, as are therefore their
private metrics (if any).

We would have triggered a route cache flush if we eliminated a
reference to a fib_info object in the routing tables.

Therefore, any existing cached routes will first check and see that
they have been invalidated before an errant reference to these
metric values would occur.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:40:07 -07:00
David S. Miller 9917e1e876 ipv4: Turn rt->rt_route_iif into rt->rt_is_input.
That is this value's only use, as a boolean to indicate whether
a route is an input route or not.

So implement it that way, using a u16 gap present in the struct
already.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:40:02 -07:00
David S. Miller 4fd551d7be ipv4: Kill rt->rt_oif
Never actually used.

It was being set on output routes to the original OIF specified in the
flow key used for the lookup.

Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of
the flowi4_oif and flowi4_iif values, thanks to feedback from Julian
Anastasov.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:38:34 -07:00
David S. Miller 93ac53410a ipv4: Dirty less cache lines in route caching paths.
Don't bother incrementing dst->__use and setting dst->lastuse,
they are completely pointless and just slow things down.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:55 -07:00
David S. Miller ba3f7f04ef ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:54 -07:00
David S. Miller d2d68ba9fe ipv4: Cache input routes in fib_info nexthops.
Caching input routes is slightly simpler than output routes, since we
don't need to be concerned with nexthop exceptions.  (locally
destined, and routed packets, never trigger PMTU events or redirects
that will be processed by us).

However, we have to elide caching for the DIRECTSRC and non-zero itag
cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:40 -07:00
David S. Miller f2bb4bedf3 ipv4: Cache output routes in fib_info nexthops.
If we have an output route that lacks nexthop exceptions, we can cache
it in the FIB info nexthop.

Such routes will have DST_HOST cleared because such routes refer to a
family of destinations, rather than just one.

The sequence of the handling of exceptions during route lookup is
adjusted to make the logic work properly.

Before we allocate the route, we lookup the exception.

Then we know if we will cache this route or not, and therefore whether
DST_HOST should be set on the allocated route.

Then we use DST_HOST to key off whether we should store the resulting
route, during rt_set_nexthop(), in the FIB nexthop cache.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:36:16 -07:00
David S. Miller ceb3320610 ipv4: Kill routes during PMTU/redirect updates.
Mark them obsolete so there will be a re-lookup to fetch the
FIB nexthop exception info.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:31:22 -07:00
David S. Miller f5b0a87436 net: Document dst->obsolete better.
Add a big comment explaining how the field works, and use defines
instead of magic constants for the values assigned to it.

Suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:31:21 -07:00
David S. Miller f8126f1d51 ipv4: Adjust semantics of rt->rt_gateway.
In order to allow prefixed routes, we have to adjust how rt_gateway
is set and interpreted.

The new interpretation is:

1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

2) rt_gateway != 0, destination requires a nexthop gateway

Abstract the fetching of the proper nexthop value using a new
inline helper, rt_nexthop(), as suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
2012-07-20 13:31:20 -07:00
David S. Miller f1ce3062c5 ipv4: Remove 'rt_dst' from 'struct rtable'
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:31:19 -07:00
David Miller b48698895d ipv4: Remove 'rt_mark' from 'struct rtable'
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:31:18 -07:00
David Miller d6c0a4f609 ipv4: Kill 'rt_src' from 'struct rtable'
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:31:00 -07:00
David Miller 1a00fee4ff ipv4: Remove rt_key_{src,dst,tos} from struct rtable.
They are always used in contexts where they can be reconstituted,
or where the finally resolved rt->rt_{src,dst} is semantically
equivalent.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:30:59 -07:00
David Miller 38a424e465 ipv4: Kill ip_route_input_noref().
The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:30:59 -07:00
David S. Miller 89aef8921b ipv4: Delete routing cache.
The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.

The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.

What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables.  The former of which is
controllable by external entitites.

Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~%10.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 13:30:27 -07:00
Mikulas Patocka b09e786bd1 tun: fix a crash bug and a memory leak
This patch fixes a crash
tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel ->
sock_release -> iput(SOCK_INODE(sock))
introduced by commit 1ab5ecb90c

The problem is that this socket is embedded in struct tun_struct, it has
no inode, iput is called on invalid inode, which modifies invalid memory
and optionally causes a crash.

sock_release also decrements sockets_in_use, this causes a bug that
"sockets: used" field in /proc/*/net/sockstat keeps on decreasing when
creating and closing tun devices.

This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
sock_release to not free the inode and not decrement sockets_in_use,
fixing both memory corruption and sockets_in_use underflow.

It should be backported to 3.3 an 3.4 stabke.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 11:21:06 -07:00
Julian Anastasov 521f549097 ipv4: show pmtu in route list
Override the metrics with rt_pmtu

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 11:16:49 -07:00
Jiri Pirko 76ff5cc919 rtnl: allow to specify number of rx and tx queues on device creation
This patch introduces IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES by
which userspace can set number of rx and/or tx queues to be allocated
for newly created netdevice.
This overrides ops->get_num_[tr]x_queues()

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 11:07:00 -07:00
Jiri Pirko d40156aa5e rtnl: allow to specify different num for rx and tx queue count
Also cut out unused function parameters and possible err in return
value.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 11:06:59 -07:00
Eric Dumazet 6f458dfb40 tcp: improve latencies of timer triggered events
Modern TCP stack highly depends on tcp_write_timer() having a small
latency, but current implementation doesn't exactly meet the
expectations.

When a timer fires but finds the socket is owned by the user, it rearms
itself for an additional delay hoping next run will be more
successful.

tcp_write_timer() for example uses a 50ms delay for next try, and it
defeats many attempts to get predictable TCP behavior in term of
latencies.

Use the recently introduced tcp_release_cb(), so that the user owning
the socket will call various handlers right before socket release.

This will permit us to post a followup patch to address the
tcp_tso_should_defer() syndrome (some deferred packets have to wait
RTO timer to be transmitted, while cwnd should allow us to send them
sooner)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 10:59:41 -07:00
Eric Dumazet 9dc274151a tcp: fix ABC in tcp_slow_start()
When/if sysctl_tcp_abc > 1, we expect to increase cwnd by 2 if the
received ACK acknowledges more than 2*MSS bytes, in tcp_slow_start()

Problem is this RFC 3465 statement is not correctly coded, as
the while () loop increases snd_cwnd one by one.

Add a new variable to avoid this off-by one error.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 10:59:41 -07:00
Eric Dumazet 5815d5e7aa tcp: use hash_32() in tcp_metrics
Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not
guaranteed to be a power of two.

Uses hash_32() instead of custom hash.

tcpmhash_entries should be an unsigned int.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 10:59:41 -07:00
Vijay Subramanian 67b95bd78f tcp: Return bool instead of int where appropriate
Applied to a set of static inline functions in tcp_input.c

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 10:59:41 -07:00
John W. Linville 90b90f60c4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-07-20 12:30:48 -04:00
Linus Torvalds 85efc72a02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull last minute Ceph fixes from Sage Weil:
 "The important one fixes a bug in the socket failure handling behavior
  that was turned up in some recent failure injection testing.  The
  other two are minor bug fixes."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: endian bug in rbd_req_cb()
  rbd: Fix ceph_snap_context size calculation
  libceph: fix messenger retry
2012-07-19 16:11:28 -07:00
Julian Anastasov f31fd38382 ipv4: Fix again the time difference calculation
Fix again the diff value in rt_bind_exception
after collision of two latest patches, my original commit
actually fixed the same problem.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 13:01:44 -07:00
David S. Miller abaa72d7fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
2012-07-19 11:17:30 -07:00
Yuchung Cheng 67da22d23f net-tcp: Fast Open client - cookie-less mode
In trusted networks, e.g., intranet, data-center, the client does not
need to use Fast Open cookie to mitigate DoS attacks. In cookie-less
mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless
of cookie availability.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 11:02:03 -07:00
Yuchung Cheng aab4874355 net-tcp: Fast Open client - detecting SYN-data drops
On paths with firewalls dropping SYN with data or experimental TCP options,
Fast Open connections will have experience SYN timeout and bad performance.
The solution is to track such incidents in the cookie cache and disables
Fast Open temporarily.

Since only the original SYN includes data and/or Fast Open option, the
SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect
such drops. If a path has recurring Fast Open SYN drops, Fast Open is
disabled for 2^(recurring_losses) minutes starting from four minutes up to
roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but
it behaves as connect() then write().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 11:02:03 -07:00
Yuchung Cheng cf60af03ca net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)
sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.

For blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed like connect() call. It
returns similar errno like connect() if the TCP handshake fails.

For non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if cookie is available. If cookie
is not available, it transmits a data-less SYN packet with Fast Open
cookie request option and returns -EINPROGRESS like connect().

Using MSG_FASTOPEN on connecting or connected socket will result in
simlar errno like repeating connect() calls. Therefore the application
should only use this flag on new sockets.

The buffer size of sendmsg() is independent of the MSS of the connection.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 11:02:03 -07:00
Yuchung Cheng 8e4178c1c7 net-tcp: Fast Open client - receiving SYN-ACK
On receiving the SYN-ACK after SYN-data, the client needs to
a) update the cached MSS and cookie (if included in SYN-ACK)
b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of
   the handshake.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 11:02:03 -07:00
Yuchung Cheng 783237e8da net-tcp: Fast Open client - sending SYN-data
This patch implements sending SYN-data in tcp_connect(). The data is
from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
type of the SYN. If the cookie is not cached (len==0), the host sends
data-less SYN with Fast Open cookie request option to solicit a cookie
from the remote. If cookie is not available (len > 0), the host sends
a SYN-data with Fast Open cookie option. If cookie length is negative,
  the SYN will not include any Fast Open option (for fall back operations).

To deal with middleboxes that may drop SYN with data or experimental TCP
option, the SYN-data is only sent once. SYN retransmits do not include
data or Fast Open options. The connection will fall back to regular TCP
handshake.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 11:02:03 -07:00
Yuchung Cheng 1fe4c481ba net-tcp: Fast Open client - cookie cache
With help from Eric Dumazet, add Fast Open metrics in tcp metrics cache.
The basic ones are MSS and the cookies. Later patch will cache more to
handle unfriendly middleboxes.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:55:36 -07:00
Yuchung Cheng 2100c8d2d9 net-tcp: Fast Open base
This patch impelements the common code for both the client and server.

1. TCP Fast Open option processing. Since Fast Open does not have an
   option number assigned by IANA yet, it shares the experiment option
   code 254 by implementing draft-ietf-tcpm-experimental-options
   with a 16 bits magic number 0xF989. This enables global experiments
   without clashing the scarce(2) experimental options available for TCP.

   When the draft status becomes standard (maybe), the client should
   switch to the new option number assigned while the server supports
   both numbers for transistion.

2. The new sysctl tcp_fastopen

3. A place holder init function

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:55:36 -07:00
stephen hemminger 83bd1b793e ipx: move peII functions
The Ethernet II wrapper is only used by IPX protocol, may have once
been used by Appletalk but not currently. Therefore it makes sense to
move it to the IPX dust bin and drop the exports.

Build tested only.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:48:00 -07:00
Eric Dumazet be9f4a44e7 ipv4: tcp: remove per net tcp_sock
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.

This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.

To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)

Additional features :

1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.

2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()

3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, sysctl_wmem_default value was
copied at boot time, so any further change would not affect tcp_sock
limit)

[1] These packets are only generated when no socket was matched for
the incoming packet.

Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:35:30 -07:00
Julian Anastasov aee06da672 ipv4: use seqlock for nh_exceptions
Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.

v2: after suggestions from Eric Dumazet:
* get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
* continue daddr search in rt_bind_exception

v3:
* remove the daddr check before seqlock in rt_bind_exception
* restart lookup in rt_bind_exception on detected seqlock change,
as suggested by David Miller

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 10:30:14 -07:00
John W. Linville 3e497e0215 Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next 2012-07-19 12:35:00 -04:00
David S. Miller 7fed84f622 ipv4: Fix time difference calculation in rt_bind_exception().
Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 08:46:59 -07:00
Julian Anastasov 0cc535a299 ipv4: fix address selection in fib_compute_spec_dst
ip_options_compile can be called for forwarded packets,
make sure the specific-destionation address is a local one as
specified in RFC 1812, 4.2.2.2 Addresses in Options

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 08:30:49 -07:00
Julian Anastasov 6255e5ead0 ipv4: optimize fib_compute_spec_dst call in ip_options_echo
Move fib_compute_spec_dst at the only place where it
is needed.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 08:30:49 -07:00
Rustad, Mark D 734b65417b net: Statically initialize init_net.dev_base_head
This change eliminates an initialization-order hazard most
recently seen when netprio_cgroup is built into the kernel.

With thanks to Eric Dumazet for catching a bug.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-18 13:32:27 -07:00
John W. Linville 0cd06647b7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next 2012-07-18 14:53:10 -04:00
Eric Dumazet ddbe503203 ipv6: add ipv6_addr_hash() helper
Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.

Use it in sunrpc and use more bit shuffling, using hash_32().

Use it in net/ipv6/addrconf.c, using hash_32() as well.

As a cleanup, use it in net/ipv4/tcp_metrics.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-18 11:28:46 -07:00