Commit graph

564 commits

Author SHA1 Message Date
Avinash Repaka b1fb67fa50 RDS: IB: Initialize max_items based on underlying device attributes
Use max_1m_mrs/max_8k_mrs while setting max_items, as the former
variables are set based on the underlying device attributes.

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 21:16:33 -07:00
Avinash Repaka 9dff9936f0 RDS: IB: Limit the scope of has_fr/has_fmr variables
This patch fixes the scope of has_fr and has_fmr variables as they are
needed only in rds_ib_add_one().

Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-05 21:16:33 -07:00
Håkon Bugge 126f760ca9 rds: Fix incorrect statistics counting
In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock and has incremented the send_gen,
it is considered a race and we yield. The code incrementing the
s_send_lock_queue_raced statistics counter did not count this event
correctly.

This commit counts the race condition correctly.

Changes from v1:
- Removed check for *someone_on_xmit()*
- Fixed incorrect indentation

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-07 20:07:13 -07:00
Håkon Bugge f530f39f5f rds: Fix non-atomic operation on shared flag variable
The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-05 14:49:49 -07:00
Bhumika Goyal 8209432a5e RDS: make rhashtable_params const
Make this const as it is either used during a copy operation or passed
to a const argument of the function rhltable_init

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28 11:30:02 -07:00
David S. Miller 3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Håkon Bugge 05bfd7dbb5 rds: Reintroduce statistics counting
In commit 7e3f2952ee ("rds: don't let RDS shutdown a connection
while senders are present"), refilling the receive queue was removed
from rds_ib_recv(), along with the increment of
s_ib_rx_refill_from_thread.

Commit 73ce4317bf ("RDS: make sure we post recv buffers")
re-introduces filling the receive queue from rds_ib_recv(), but does
not add the statistics counter. rds_ib_recv() was later renamed to
rds_ib_recv_path().

This commit reintroduces the statistics counting of
s_ib_rx_refill_from_thread and s_ib_rx_refill_from_cq.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 21:03:47 -07:00
Sowmini Varadhan 840df162b3 rds: reduce memory footprint for RDS when transport is RDMA
RDS over IB does not use multipath RDS, so the array
of additional rds_conn_path structures is always superfluous
in this case. Reduce the memory footprint of the rds module
by making this a dynamic allocation predicated on whether
the transport is mp_capable.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Efrain Galaviz <efrain.galaviz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:27:59 -07:00
David S. Miller 7a68ada6ec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-07-21 03:38:43 +01:00
Håkon Bugge e623a48ee4 rds: Make sure updates to cp_send_gen can be observed
cp->cp_send_gen is treated as a normal variable, although it may be
used by different threads.

This is fixed by using {READ,WRITE}_ONCE when it is incremented and
READ_ONCE when it is read outside the {acquire,release}_in_xmit
protection.

Normative reference from the Linux-Kernel Memory Model:

    Loads from and stores to shared (but non-atomic) variables should
    be protected with the READ_ONCE(), WRITE_ONCE(), and
    ACCESS_ONCE().

Clause 5.1.2.4/25 in the C standard is also relevant.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-20 15:33:01 -07:00
Sowmini Varadhan aed20a53a7 rds: cancel send/recv work before queuing connection shutdown
We could end up executing rds_conn_shutdown before the rds_recv_worker
thread, then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
sock_release and set sock->sk to null, which may interleave in bad
ways with rds_recv_worker, e.g., it could result in:

"BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
    [ffff881769f6fd70] release_sock at ffffffff815f337b
    [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
    [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
    [ffff881769f6fde0] process_one_work at ffffffff810a14c1
    [ffff881769f6fe40] worker_thread at ffffffff810a1940
    [ffff881769f6fec0] kthread at ffffffff810a6b1e

Also, do not enqueue any new shutdown workq items when the connection is
shutting down (this may happen for rds-tcp in softirq mode, if a FIN
or CLOSE is received while the modules is in the middle of an unload)

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 19:07:35 -07:00
Sowmini Varadhan 0933a578cd rds: tcp: use sock_create_lite() to create the accept socket
There are two problems with calling sock_create_kern() from
rds_tcp_accept_one()
1. it sets up a new_sock->sk that is wasteful, because this ->sk
   is going to get replaced by inet_accept() in the subsequent ->accept()
2. The new_sock->sk is a leaked reference in sock_graft() which
   expects to find a null parent->sk

Avoid these problems by calling sock_create_lite().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08 11:16:16 +01:00
Reshetova, Elena 6c5a1c4a2a net, rds: convert rds_message.m_refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:18 +01:00
Reshetova, Elena 803ea85053 net, rds: convert rds_mr.r_refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:17 +01:00
Reshetova, Elena b7f0292094 net, rds: convert rds_incoming.i_refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:17 +01:00
Reshetova, Elena 50d61ff789 net, rds: convert rds_ib_device.refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-04 22:35:17 +01:00
Reshetova, Elena 14afee4b60 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Sowmini Varadhan c14b036681 rds: tcp: set linger to 1 when unloading a rds-tcp
If we are unloading the rds_tcp module, we can set linger to 1
and drop pending packets to accelerate reconnect. The peer will
end up resetting the connection based on new generation numbers
of the new incarnation, so hanging on to unsent TCP packets via
linger is mostly pointless in this case.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:34:04 -04:00
Sowmini Varadhan 69b92b5b74 rds: tcp: send handshake ping-probe from passive endpoint
The RDS handshake ping probe added by commit 5916e2c155
("RDS: TCP: Enable multipath RDS for TCP") is sent from rds_sendmsg()
before the first data packet is sent to a peer. If the conversation
is not bidirectional  (i.e., one side is always passive and never
invokes rds_sendmsg()) and the passive side restarts its rds_tcp
module, a new HS ping probe needs to be sent, so that the number
of paths can be re-established.

This patch achieves that by sending a HS ping probe from
rds_tcp_accept_one() when c_npaths is 0 (i.e., we have not done
a handshake probe with this peer yet).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:34:04 -04:00
Sowmini Varadhan 10beea7d74 rds: tcp: Set linger when rejecting an incoming conn in rds_tcp_accept_one
Each time we get an incoming SYN to the RDS_TCP_PORT, the TCP
layer accepts the connection and then the rds_tcp_accept_one()
callback is invoked to process the incoming connection.

rds_tcp_accept_one() may reject the incoming syn for a number of
reasons, e.g., commit 1a0e100fb2 ("RDS: TCP: Force every connection
to be initiated by numerically smaller IP address"), or because
we are getting spammed by a malicious node that is triggering
a flood of connection attempts to RDS_TCP_PORT. If the incoming
syn is rejected, no data would have been sent on the TCP socket,
and we do not need to be in TIME_WAIT state, so we set linger on
the TCP socket before closing, thereby closing the socket efficiently
with a RST.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:45:15 -04:00
Sowmini Varadhan 00354de577 rds: tcp: various endian-ness fixes
Found when testing between sparc and x86 machines on different
subnets, so the address comparison patterns hit the corner cases and
brought out some bugs fixed by this patch.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:45:15 -04:00
Sowmini Varadhan 41500c3e2a rds: tcp: remove cp_outgoing
After commit 1a0e100fb2 ("RDS: TCP: Force every connection to be
initiated by numerically smaller IP address") we no longer need
the logic associated with cp_outgoing, so clean up usage of this
field.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Imanti Mendez <imanti.mendez@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 12:45:14 -04:00
Linus Torvalds 8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Linus Torvalds 5b13475a5e Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

  There's more iov_iter work in my local tree, but I'd prefer to push
  the stuff that had been in -next first"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  iov_iter: don't revert iov buffer if csum error
  generic_file_read_iter(): make use of iov_iter_revert()
  generic_file_direct_write(): make use of iov_iter_revert()
  orangefs: use iov_iter_revert()
  sctp: switch to copy_from_iter_full()
  net/9p: switch to copy_from_iter_full()
  switch memcpy_from_msg() to copy_from_iter_full()
  rds: make use of iov_iter_revert()
2017-05-02 11:18:50 -07:00
Al Viro eea86b637a Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', 'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess 2017-04-26 12:06:59 -04:00
Al Viro dc88e3b4c8 rds: make use of iov_iter_revert() 2017-04-21 13:57:11 -04:00
Al Viro e73a67f7cd don't open-code kernel_setsockopt()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-06 02:09:23 -04:00
Sowmini Varadhan 087d975353 rds: tcp: canonical connection order for all paths with index > 0
The rds_connect_worker() has a bug in the check that enforces the
canonical connection order described in the comments of
rds_tcp_state_change(). The intention is to make sure that all
the multipath connections are always initiated by the smaller IP
address via rds_start_mprds. To achieve this, rds_connection_worker
should check that cp_index > 0.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-02 19:41:00 -07:00
Sowmini Varadhan e97656d03c rds: tcp: allow progress of rds_conn_shutdown if the rds_connection is marked ERROR by an intervening FIN
rds_conn_shutdown() runs in workq context, and marks the rds_connection
as DISCONNECTING before quiescing Tx/Rx paths. However, after all I/O
has quiesced, we may still find the rds_connection state to be
RDS_CONN_ERROR if an intervening FIN was processed in softirq context.

This is not a fatal error: rds_conn_shutdown() should continue the
shutdown, and there is no need to log noisy messages about this event.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-02 19:41:00 -07:00
David S. Miller 101c431492 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmgenet.c
	net/core/sock.c

Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 11:59:10 -07:00
Zhu Yanjun 569f41d187 rds: ib: unmap the scatter/gather list when error
When some errors occur, the scatter/gather list mapped to DMA addresses
should be handled.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Zhu Yanjun edd08f96db rds: ib: add the static type to the function
The function rds_ib_map_fmr is used only in the ib_fmr.c
file. As such, the static type is added to limit it in this file.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Zhu Yanjun ea69c88364 rds: ib: remove redundant ib_dealloc_fmr
The function ib_dealloc_fmr will never be called. As such, it should
be removed.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
Zhu Yanjun b418c5276a rds: ib: drop unnecessary rdma_reject
When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 23:20:05 -07:00
David Howells cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Zhu Yanjun 3b12f73a5c rds: ib: add error handle
In the function rds_ib_setup_qp, the error handle is missing. When some
error occurs, it is possible that memory leak occurs. As such, error
handle is added.

Cc: Joe Jin <joe.jin@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Guanglei Li <guanglei.li@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 13:09:18 -08:00
Sowmini Varadhan b21dd4506b rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
Commit a93d01f577 ("RDS: TCP: avoid bad page reference in
rds_tcp_listen_data_ready") added the function
rds_tcp_listen_sock_def_readable()  to handle the case when a
partially set-up acceptor socket drops into rds_tcp_listen_data_ready().
However, if the listen socket (rtn->rds_tcp_listen_sock) is itself going
through a tear-down via rds_tcp_listen_stop(), the (*ready)() will be
null and we would hit a panic  of the form
  BUG: unable to handle kernel NULL pointer dereference at   (null)
  IP:           (null)
   :
  ? rds_tcp_listen_data_ready+0x59/0xb0 [rds_tcp]
  tcp_data_queue+0x39d/0x5b0
  tcp_rcv_established+0x2e5/0x660
  tcp_v4_do_rcv+0x122/0x220
  tcp_v4_rcv+0x8b7/0x980
    :
In the above case, it is not fatal to encounter a NULL value for
ready- we should just drop the packet and let the flush of the
acceptor thread finish gracefully.

In general, the tear-down sequence for listen() and accept() socket
that is ensured by this commit is:
     rtn->rds_tcp_listen_sock = NULL; /* prevent any new accepts */
     In rds_tcp_listen_stop():
         serialize with, and prevent, further callbacks using lock_sock()
         flush rds_wq
         flush acceptor workq
         sock_release(listen socket)

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:09:59 -08:00
Sowmini Varadhan 16c09b1c76 rds: tcp: Reorder initialization sequence in rds_tcp_init to avoid races
Order of initialization in rds_tcp_init needs to be done so
that resources are set up and destroyed in the correct synchronization
sequence with both the data path, as well as netns create/destroy
path. Specifically,

- we must call register_pernet_subsys and get the rds_tcp_netid
  before calling register_netdevice_notifier, otherwise we risk
  the sequence
    1. register_netdevice_notifier sets up netdev notifier callback
    2. rds_tcp_dev_event -> rds_tcp_kill_sock uses netid 0, and finds
       the wrong rtn, resulting in a panic with string that is of the form:

  BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
  IP: rds_tcp_kill_sock+0x3a/0x1d0 [rds_tcp]
         :

- the rds_tcp_incoming_slab kmem_cache must be initialized before the
  datapath starts up. The latter can happen any time after the
  pernet_subsys registration of rds_tcp_net_ops, whose -> init
  function sets up the listen socket. If the rds_tcp_incoming_slab has
  not been set up at that time, a panic of the form below may be
  encountered

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
  IP: kmem_cache_alloc+0x90/0x1c0
     :
  rds_tcp_data_recv+0x1e7/0x370 [rds_tcp]
  tcp_read_sock+0x96/0x1c0
  rds_tcp_recv_path+0x65/0x80 [rds_tcp]
     :

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:09:59 -08:00
Sowmini Varadhan 8edc3affc0 rds: tcp: Take explicit refcounts on struct net
It is incorrect for the rds_connection to piggyback on the
sock_net() refcount for the netns because this gives rise to
a chicken-and-egg problem during rds_conn_destroy. Instead explicitly
take a ref on the net, and hold the netns down till the connection
tear-down is complete.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 14:09:59 -08:00
Linus Torvalds 8d70eeb84a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix double-free in batman-adv, from Sven Eckelmann.

 2) Fix packet stats for fast-RX path, from Joannes Berg.

 3) Netfilter's ip_route_me_harder() doesn't handle request sockets
    properly, fix from Florian Westphal.

 4) Fix sendmsg deadlock in rxrpc, from David Howells.

 5) Add missing RCU locking to transport hashtable scan, from Xin Long.

 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.

 7) Fix race in NAPI handling between poll handlers and busy polling,
    from Eric Dumazet.

 8) TX path in vxlan and geneve need proper RCU locking, from Jakub
    Kicinski.

 9) SYN processing in DCCP and TCP need to disable BH, from Eric
    Dumazet.

10) Properly handle net_enable_timestamp() being invoked from IRQ
    context, also from Eric Dumazet.

11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.

12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
    Melo.

13) Fix use-after-free in netvsc driver, from Dexuan Cui.

14) Fix max MTU setting in bonding driver, from WANG Cong.

15) xen-netback hash table can be allocated from softirq context, so use
    GFP_ATOMIC. From Anoob Soman.

16) Fix MAC address change bug in bgmac driver, from Hari Vyas.

17) strparser needs to destroy strp_wq on module exit, from WANG Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
  strparser: destroy workqueue on module exit
  sfc: fix IPID endianness in TSOv2
  sfc: avoid max() in array size
  rds: remove unnecessary returned value check
  rxrpc: Fix potential NULL-pointer exception
  nfp: correct DMA direction in XDP DMA sync
  nfp: don't tell FW about the reserved buffer space
  net: ethernet: bgmac: mac address change bug
  net: ethernet: bgmac: init sequence bug
  xen-netback: don't vfree() queues under spinlock
  xen-netback: keep a local pointer for vif in backend_disconnect()
  netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
  netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
  netfilter: nf_conntrack_sip: fix wrong memory initialisation
  can: flexcan: fix typo in comment
  can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
  can: gs_usb: fix coding style
  can: gs_usb: Don't use stack memory for USB transfers
  ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
  ixgbe: update the rss key on h/w, when ethtool ask for it
  ...
2017-03-04 17:31:39 -08:00
Zhu Yanjun a8d63a53b3 rds: remove unnecessary returned value check
The function rds_trans_register always returns 0. As such, it is not
necessary to check the returned value.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-03 09:51:22 -08:00
Linus Torvalds 69fd110eb6 Merge branch 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs sendmsg updates from Al Viro:
 "More sendmsg work.

  This is a fairly separate isolated stuff (there's a continuation
  around lustre, but that one was too late to soak in -next), thus the
  separate pull request"

* 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ncpfs: switch to sock_sendmsg()
  ncpfs: don't mess with manually advancing iovec on send
  ncpfs: sendmsg does *not* bugger iovec these days
  ceph_tcp_sendpage(): use ITER_BVEC sendmsg
  afs_send_pages(): use ITER_BVEC
  rds: remove dead code
  ceph: switch to sock_recvmsg()
  usbip_recv(): switch to sock_recvmsg()
  iscsi_target: deal with short writes on the tx side
  [nbd] pass iov_iter to nbd_xmit()
  [nbd] switch sock_xmit() to sock_{send,recv}msg()
  [drbd] use sock_sendmsg()
2017-03-02 15:16:38 -08:00
Zhu Yanjun 4f7bfb3982 rds: ib: add the static type to the variables
The variables rds_ib_mr_1m_pool_size and rds_ib_mr_8k_pool_size
are used only in the ib.c file. As such, the static type is
added to limit them in this file.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 09:50:58 -08:00
Linus Torvalds c2eca00fec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Don't save TIPC header values before the header has been validated,
    from Jon Paul Maloy.

 2) Fix memory leak in RDS, from Zhu Yanjun.

 3) We miss to initialize the UID in the flow key in some paths, from
    Julian Anastasov.

 4) Fix latent TOS masking bug in the routing cache removal from years
    ago, also from Julian.

 5) We forget to set the sockaddr port in sctp_copy_local_addr_list(),
    fix from Xin Long.

 6) Missing module ref count drop in packet scheduler actions, from
    Roman Mashak.

 7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu.

 8) Fix use after free which happens because L2TP's ipv4 support returns
    non-zero values from it's backlog_rcv function which ipv4 interprets
    as protocol values. Fix from Paul Hüber.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
  qed: Don't use attention PTT for configuring BW
  qed: Fix race with multiple VFs
  l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
  xfrm: provide correct dst in xfrm_neigh_lookup
  rhashtable: Fix RCU dereference annotation in rht_bucket_nested
  rhashtable: Fix use before NULL check in bucket_table_free
  net sched actions: do not overwrite status of action creation.
  rxrpc: Kernel calls get stuck in recvmsg
  net sched actions: decrement module reference count after table flush.
  lib: Allow compile-testing of parman
  ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt
  sctp: set sin_port for addr param when checking duplicate address
  net/mlx4_en: fix overflow in mlx4_en_init_timestamp()
  netfilter: nft_set_bitmap: incorrect bitmap size
  net: s2io: fix typo argumnet argument
  net: vxge: fix typo argumnet argument
  netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.
  ipv4: mask tos for input route
  ipv4: add missing initialization for flowi4_uid
  lib: fix spelling mistake: "actualy" -> "actually"
  ...
2017-02-28 10:00:39 -08:00
Linus Torvalds ac1820fb28 This is a tree wide change and has been kept separate for that reason.
Bart Van Assche noted that the ib DMA mapping code was significantly
 similar enough to the core DMA mapping code that with a few changes
 it was possible to remove the IB DMA mapping code entirely and
 switch the RDMA stack to use the core DMA mapping code.  This resulted
 in a nice set of cleanups, but touched the entire tree.  This branch
 will be submitted separately to Linus at the end of the merge window
 as per normal practice for tree wide changes like this.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYo06oAAoJELgmozMOVy/d9Z8QALedWHdu98St1L0u2c8sxnR9
 2zo/4sF5Vb9u7FpmdIX32L4SQ9s9KhPE8Qp8NtZLf9v10zlDebIRJDpXknXtKooV
 CAXxX4sxBXV27/UrhbZEfXiPrmm6ccJFyIfRnMU6NlMqh2AtAsRa5AC2/RMp8oUD
 Med97PFiF0o6TD22/UH1VFbRpX1zjaKyqm7a3as5sJfzNA+UGIZAQ7Euz8000DKZ
 xCgVLTEwS0FmOujtBkCst7xa9TjuqR1HLOB4DdGvAhP6BHdz2yamM7Qmh9NN+NEX
 0BtjsuXomtn6j6AszGC+bpipCZh3NUigcwoFAARXCYFHibBvo4DPdFeGsraFgXdy
 1+KyR8CCeQG3Aly5Vwr264RFPGkGpwMj8PsBlXgQVtrlg4rriaCzOJNmIIbfdADw
 ftqhxBOzReZw77aH2s+9p2ILRfcAmPqhynLvFGFo9LBvsik8LVso7YgZN0xGxwcI
 IjI/XGC8UskPVsIZBIYA6sl2bYzgOjtBIHiXjRrPlW3uhduIXLrvKFfLPP/5XLAG
 ehLXK+J0bfsyY9ClmlNS8oH/WdLhXAyy/KNmnj5bRRm9qg6BRJR3bsOBhZJODuoC
 XgEXFfF6/7roNESWxowff7pK0rTkRg/m/Pa4VQpeO+6NWHE7kgZhL6kyIp5nKcwS
 3e7mgpcwC+3XfA/6vU3F
 =e0Si
 -----END PGP SIGNATURE-----

Merge tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma DMA mapping updates from Doug Ledford:
 "Drop IB DMA mapping code and use core DMA code instead.

  Bart Van Assche noted that the ib DMA mapping code was significantly
  similar enough to the core DMA mapping code that with a few changes it
  was possible to remove the IB DMA mapping code entirely and switch the
  RDMA stack to use the core DMA mapping code.

  This resulted in a nice set of cleanups, but touched the entire tree
  and has been kept separate for that reason."

* tag 'for-next-dma_ops' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits)
  IB/rxe, IB/rdmavt: Use dma_virt_ops instead of duplicating it
  IB/core: Remove ib_device.dma_device
  nvme-rdma: Switch from dma_device to dev.parent
  RDS: net: Switch from dma_device to dev.parent
  IB/srpt: Modify a debug statement
  IB/srp: Switch from dma_device to dev.parent
  IB/iser: Switch from dma_device to dev.parent
  IB/IPoIB: Switch from dma_device to dev.parent
  IB/rxe: Switch from dma_device to dev.parent
  IB/vmw_pvrdma: Switch from dma_device to dev.parent
  IB/usnic: Switch from dma_device to dev.parent
  IB/qib: Switch from dma_device to dev.parent
  IB/qedr: Switch from dma_device to dev.parent
  IB/ocrdma: Switch from dma_device to dev.parent
  IB/nes: Remove a superfluous assignment statement
  IB/mthca: Switch from dma_device to dev.parent
  IB/mlx5: Switch from dma_device to dev.parent
  IB/mlx4: Switch from dma_device to dev.parent
  IB/i40iw: Remove a superfluous assignment statement
  IB/hns: Switch from dma_device to dev.parent
  ...
2017-02-25 13:45:43 -08:00
Zhu Yanjun 3b5923f079 rds: fix memory leak error
When the function register_netdevice_notifier fails, the memory
allocated by kmem_cache_create should be freed by the function
kmem_cache_destroy.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 14:13:14 -05:00
Wu Fengguang 77cc7aee09 RDS: IB: fix ifnullfree.cocci warnings
net/rds/ib.c:115:2-7: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

 NULL check before some freeing functions is not needed.

 Based on checkpatch warning
 "kfree(NULL) is safe this check is probably not required"
 and kfreeaddr.cocci by Julia Lawall.

Generated by: scripts/coccinelle/free/ifnullfree.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 11:11:33 -05:00
Zhu Yanjun d2c58294f5 rds:Remove unnecessary ib_ring unalloc
In the function rds_ib_xmit_atomic, ib_ring is not allocated
successfully. As such, it is not necessary to unalloc it.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 15:19:51 -05:00
Bart Van Assche 5f68dcaff5 RDS: net: Switch from dma_device to dev.parent
Prepare for removal of ib_device.dma_device.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24 12:26:17 -05:00
Bart Van Assche d43dbacfc0 IB/core: Change the type of an ib_dma_alloc_coherent() argument
Change the type of the dma_handle argument from u64 * to dma_addr_t *.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24 12:23:35 -05:00