Commit graph

66 commits

Author SHA1 Message Date
David Howells cd5892c756 rxrpc: Create an address for sendmsg() to bind unbound socket with
Create an address for sendmsg() to bind unbound socket with rather than
using a completely blank address otherwise the transport socket creation
will fail because it will try to use address family 0.

We use the address family specified in the protocol argument when the
AF_RXRPC socket was created and SOCK_DGRAM as the default.  For anything
else, bind() must be used.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells cbd00891de rxrpc: Adjust the call ref tracepoint to show kernel API refs
Adjust the call ref tracepoint to show references held on a call by the
kernel API separately as much as possible and add an additional trace to at
the allocation point from the preallocation buffer for an incoming call.

Note that this doesn't show the allocation of a client call for the kernel
separately at the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:30 +01:00
David Howells 248f219cb8 rxrpc: Rewrite the data and ack handling code
Rewrite the data and ack handling code such that:

 (1) Parsing of received ACK and ABORT packets and the distribution and the
     filing of DATA packets happens entirely within the data_ready context
     called from the UDP socket.  This allows us to process and discard ACK
     and ABORT packets much more quickly (they're no longer stashed on a
     queue for a background thread to process).

 (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim().  We instead
     keep track of the offset and length of the content of each packet in
     the sk_buff metadata.  This means we don't do any allocation in the
     receive path.

 (3) Jumbo DATA packet parsing is now done in data_ready context.  Rather
     than cloning the packet once for each subpacket and pulling/trimming
     it, we file the packet multiple times with an annotation for each
     indicating which subpacket is there.  From that we can directly
     calculate the offset and length.

 (4) A call's receive queue can be accessed without taking locks (memory
     barriers do have to be used, though).

 (5) Incoming calls are set up from preallocated resources and immediately
     made live.  They can than have packets queued upon them and ACKs
     generated.  If insufficient resources exist, DATA packet #1 is given a
     BUSY reply and other DATA packets are discarded).

 (6) sk_buffs no longer take a ref on their parent call.

To make this work, the following changes are made:

 (1) Each call's receive buffer is now a circular buffer of sk_buff
     pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
     between the call and the socket.  This permits each sk_buff to be in
     the buffer multiple times.  The receive buffer is reused for the
     transmit buffer.

 (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
     to the data buffer.  Transmission phase annotations indicate whether a
     buffered packet has been ACK'd or not and whether it needs
     retransmission.

     Receive phase annotations indicate whether a slot holds a whole packet
     or a jumbo subpacket and, if the latter, which subpacket.  They also
     note whether the packet has been decrypted in place.

 (3) DATA packet window tracking is much simplified.  Each phase has just
     two numbers representing the window (rx_hard_ack/rx_top and
     tx_hard_ack/tx_top).

     The hard_ack number is the sequence number before base of the window,
     representing the last packet the other side says it has consumed.
     hard_ack starts from 0 and the first packet is sequence number 1.

     The top number is the sequence number of the highest-numbered packet
     residing in the buffer.  Packets between hard_ack+1 and top are
     soft-ACK'd to indicate they've been received, but not yet consumed.

     Four macros, before(), before_eq(), after() and after_eq() are added
     to compare sequence numbers within the window.  This allows for the
     top of the window to wrap when the hard-ack sequence number gets close
     to the limit.

     Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
     to indicate when rx_top and tx_top point at the packets with the
     LAST_PACKET bit set, indicating the end of the phase.

 (4) Calls are queued on the socket 'receive queue' rather than packets.
     This means that we don't need have to invent dummy packets to queue to
     indicate abnormal/terminal states and we don't have to keep metadata
     packets (such as ABORTs) around

 (5) The offset and length of a (sub)packet's content are now passed to
     the verify_packet security op.  This is currently expected to decrypt
     the packet in place and validate it.

     However, there's now nowhere to store the revised offset and length of
     the actual data within the decrypted blob (there may be a header and
     padding to skip) because an sk_buff may represent multiple packets, so
     a locate_data security op is added to retrieve these details from the
     sk_buff content when needed.

 (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
     individually secured and needs to be individually decrypted.  The code
     to do this is broken out into rxrpc_recvmsg_data() and shared with the
     kernel API.  It now iterates over the call's receive buffer rather
     than walking the socket receive queue.

Additional changes:

 (1) The timers are condensed to a single timer that is set for the soonest
     of three timeouts (delayed ACK generation, DATA retransmission and
     call lifespan).

 (2) Transmission of ACK and ABORT packets is effected immediately from
     process-context socket ops/kernel API calls that cause them instead of
     them being punted off to a background work item.  The data_ready
     handler still has to defer to the background, though.

 (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
     filesystem can shut down the socket and flush its own work items
     before closing the socket to deal with any in-progress service calls.

Future additional changes that will need to be considered:

 (1) Make sure that a call doesn't hog the front of the queue by receiving
     data from the network as fast as userspace is consuming it to the
     exclusion of other calls.

 (2) Transmit delayed ACKs from within recvmsg() when we've consumed
     sufficiently more packets to avoid the background work item needing to
     run.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells 00e907127e rxrpc: Preallocate peers, conns and calls for incoming service requests
Make it possible for the data_ready handler called from the UDP transport
socket to completely instantiate an rxrpc_call structure and make it
immediately live by preallocating all the memory it might need.  The idea
is to cut out the background thread usage as much as possible.

[Note that the preallocated structs are not actually used in this patch -
 that will be done in a future patch.]

If insufficient resources are available in the preallocation buffers, it
will be possible to discard the DATA packet in the data_ready handler or
schedule a BUSY packet without the need to schedule an attempt at
allocation in a background thread.

To this end:

 (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a
     maximum number each of the listen backlog size.  The backlog size is
     limited to a maxmimum of 32.  Only this many of each can be in the
     preallocation buffer.

 (2) For userspace sockets, the preallocation is charged initially by
     listen() and will be recharged by accepting or rejecting pending
     new incoming calls.

 (3) For kernel services {,re,dis}charging of the preallocation buffers is
     handled manually.  Two notifier callbacks have to be provided before
     kernel_listen() is invoked:

     (a) An indication that a new call has been instantiated.  This can be
     	 used to trigger background recharging.

     (b) An indication that a call is being discarded.  This is used when
     	 the socket is being released.

     A function, rxrpc_kernel_charge_accept() is called by the kernel
     service to preallocate a single call.  It should be passed the user ID
     to be used for that call and a callback to associate the rxrpc call
     with the kernel service's side of the ID.

 (4) Discard the preallocation when the socket is closed.

 (5) Temporarily bump the refcount on the call allocated in
     rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
     preallocation ref on service calls unconditionally.  This will no
     longer be necessary once the preallocation is used.

Note that this does not yet control the number of active service calls on a
client - that will come in a later patch.

A future development would be to provide a setsockopt() call that allows a
userspace server to manually charge the preallocation buffer.  This would
allow user call IDs to be provided in advance and the awkward manual accept
stage to be bypassed.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells de8d6c7401 rxrpc: Convert rxrpc_local::services to an hlist
Convert the rxrpc_local::services list to an hlist so that it can be
accessed under RCU conditions more readily.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:11 +01:00
David Howells 8d94aa381d rxrpc: Calls shouldn't hold socket refs
rxrpc calls shouldn't hold refs on the sock struct.  This was done so that
the socket wouldn't go away whilst the call was in progress, such that the
call could reach the socket's queues.

However, we can mark the socket as requiring an RCU release and rely on the
RCU read lock.

To make this work, we do:

 (1) rxrpc_release_call() removes the call's call user ID.  This is now
     only called from socket operations and not from the call processor:

	rxrpc_accept_call() / rxrpc_kernel_accept_call()
	rxrpc_reject_call() / rxrpc_kernel_reject_call()
	rxrpc_kernel_end_call()
	rxrpc_release_calls_on_socket()
	rxrpc_recvmsg()

     Though it is also called in the cleanup path of
     rxrpc_accept_incoming_call() before we assign a user ID.

 (2) Pass the socket pointer into rxrpc_release_call() rather than getting
     it from the call so that we can get rid of uninitialised calls.

 (3) Fix call processor queueing to pass a ref to the work queue and to
     release that ref at the end of the processor function (or to pass it
     back to the work queue if we have to requeue).

 (4) Skip out of the call processor function asap if the call is complete
     and don't requeue it if the call is complete.

 (5) Clean up the call immediately that the refcount reaches 0 rather than
     trying to defer it.  Actual deallocation is deferred to RCU, however.

 (6) Don't hold socket refs for allocated calls.

 (7) Use the RCU read lock when queueing a message on a socket and treat
     the call's socket pointer according to RCU rules and check it for
     NULL.

     We also need to use the RCU read lock when viewing a call through
     procfs.

 (8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call()
     if this hasn't been done yet so that we can then disconnect the call.
     Once the call is disconnected, it won't have any access to the
     connection struct and the UDP socket for the call work processor to be
     able to send the ACK.  Terminal retransmission will be handled by the
     connection processor.

 (9) Release all calls immediately on the closing of a socket rather than
     trying to defer this.  Incomplete calls will be aborted.

The call refcount model is much simplified.  Refs are held on the call by:

 (1) A socket's user ID tree.

 (2) A socket's incoming call secureq and acceptq.

 (3) A kernel service that has a call in progress.

 (4) A queued call work processor.  We have to take care to put any call
     that we failed to queue.

 (5) sk_buffs on a socket's receive queue.  A future patch will get rid of
     this.

Whilst we're at it, we can do:

 (1) Get rid of the RXRPC_CALL_EV_RELEASE event.  Release is now done
     entirely from the socket routines and never from the call's processor.

 (2) Get rid of the RXRPC_CALL_DEAD state.  Calls now end in the
     RXRPC_CALL_COMPLETE state.

 (3) Get rid of the rxrpc_call::destroyer work item.  Calls are now torn
     down when their refcount reaches 0 and then handed over to RCU for
     final cleanup.

 (4) Get rid of the rxrpc_call::deadspan timer.  Calls are cleaned up
     immediately they're finished with and don't hang around.
     Post-completion retransmission is handled by the connection processor
     once the call is disconnected.

 (5) Get rid of the dead call expiry setting as there's no longer a timer
     to set.

 (6) rxrpc_destroy_all_calls() can just check that the call list is empty.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:33:20 +01:00
David Howells fff72429c2 rxrpc: Improve the call tracking tracepoint
Improve the call tracking tracepoint by showing more differentiation
between some of the put and get events, including:

  (1) Getting and putting refs for the socket call user ID tree.

  (2) Getting and putting refs for queueing and failing to queue the call
      processor work item.

Note that these aren't necessarily used in this patch, but will be taken
advantage of in future patches.

An enum is added for the event subtype numbers rather than coding them
directly as decimal numbers and a table of 3-letter strings is provided
rather than a sequence of ?: operators.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:30:22 +01:00
David Howells 5f2d9c4438 rxrpc: Randomise epoch and starting client conn ID values
Create a random epoch value rather than a time-based one on startup and set
the top bit to indicate that this is the case.

Also create a random starting client connection ID value.  This will be
incremented from here as new client connections are created.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
David Howells d001648ec7 rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook the indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

     int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent.  It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them.  They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
    	rxrpc_kernel_get_abort_code()
    	rxrpc_kernel_get_error_number()
    	rxrpc_kernel_free_skb()
    	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.  To process the queue internally, a temporary function,
temp_deliver_data() has been added.  This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:43:27 -07:00
David Howells 4de48af663 rxrpc: Pass struct socket * to more rxrpc kernel interface functions
Pass struct socket * to more rxrpc kernel interface functions.  They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.

I have left:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

unmodified as they're all about to be removed (and, in any case, don't
touch the socket).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells df844fd46b rxrpc: Use a tracepoint for skb accounting debugging
Use a tracepoint to log various skb accounting points to help in debugging
refcounting errors.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-23 15:27:24 +01:00
Wei Yongjun 8addc0440b rxrpc: Fix error handling in af_rxrpc_init()
security initialized after alloc workqueue, so we should exit security
before destroy workqueue in the error handing.

Fixes: 648af7fca1 ("rxrpc: Absorb the rxkad security module")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-12 11:07:38 -07:00
David Howells dee46364ce rxrpc: Add RCU destruction for connections and calls
Add RCU destruction for connections and calls as the RCU lookup from the
transport socket data_ready handler is going to come along shortly.

Whilst we're at it, move the cleanup workqueue flushing and RCU barrierage
into the destruction code for the objects that need it (locals and
connections) and add the extra RCU barrier required for connection cleanup.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells eb9b9d2275 rxrpc: Check that the client conns cache is empty before module removal
Check that the client conns cache is empty before module removal and bug if
not, listing any offending connections that are still present.  Unfortunately,
if there are connections still around, then the transport socket is still
unexpectedly open and active, so we can't just unallocate the connections.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-07-06 10:43:51 +01:00
David Howells aa390bbe21 rxrpc: Kill off the rxrpc_transport struct
The rxrpc_transport struct is now redundant, given that the rxrpc_peer
struct is now per peer port rather than per peer host, so get rid of it.

Service connection lists are transferred to the rxrpc_peer struct, as is
the conn_lock.  Previous patches moved the client connection handling out
of the rxrpc_transport struct and discarded the connection bundling code.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 14:00:23 +01:00
David Howells 999b69f892 rxrpc: Kill the client connection bundle concept
Kill off the concept of maintaining a bundle of connections to a particular
target service to increase the number of call slots available for any
beyond four for that service (there are four call slots per connection).

This will make cleaning up the connection handling code easier and
facilitate removal of the rxrpc_transport struct.  Bundling can be
reintroduced later if necessary.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:20:55 +01:00
David Howells 5627cc8b96 rxrpc: Provide more refcount helper functions
Provide refcount helper functions for connections so that the code doesn't
touch local or connection usage counts directly.

Also make it such that local and peer put functions can take a NULL
pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells f4552c2d24 rxrpc: Validate the net address given to rxrpc_kernel_begin_call()
Validate the net address given to rxrpc_kernel_begin_call() before using
it.

Whilst this should be mostly unnecessary for in-kernel users, it does clear
the tail of the address struct in case we want to hash or compare the whole
thing.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:17:51 +01:00
David Howells 4a3388c803 rxrpc: Use IDR to allocate client conn IDs on a machine-wide basis
Use the IDR facility to allocate client connection IDs on a machine-wide
basis so that each client connection has a unique identifier.  When the
connection ID space wraps, we advance the epoch by 1, thereby effectively
having a 62-bit ID space.  The IDR facility is then used to look up client
connections during incoming packet routing instead of using an rbtree
rooted on the transport.

This change allows for the removal of the transport in the future and also
means that client connections can be looked up directly in the data-ready
handler by connection ID.

The ID management code is placed in a new file, conn-client.c, to which all
the client connection-specific code will eventually move.

Note that the IDR tree gets very expensive on memory if the connection IDs
are widely scattered throughout the number space, so we shall need to
retire connections that have, say, an ID more than four times the maximum
number of client conns away from the current allocation point to try and
keep the IDs concentrated.  We will also need to retire connections from an
old epoch.

Also note that, for the moment, a pointer to the transport has to be passed
through into the ID allocation function so that we can take a BH lock to
prevent a locking issue against in-BH lookup of client connections.  This
will go away later when RCU is used for server connections also.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:02 +01:00
David Howells cc8feb8edd rxrpc: Fix exclusive connection handling
"Exclusive connections" are meant to be used for a single client call and
then scrapped.  The idea is to limit the use of the negotiated security
context.  The current code, however, isn't doing this: it is instead
restricting the socket to a single virtual connection and doing all the
calls over that.

This is changed such that the socket no longer maintains a special virtual
connection over which it will do all the calls, but rather gets a new one
each time a new exclusive call is made.

Further, using a socket option for this is a poor choice.  It should be
done on sendmsg with a control message marker instead so that calls can be
marked exclusive individually.  To that end, add RXRPC_EXCLUSIVE_CALL
which, if passed to sendmsg() as a control message element, will cause the
call to be done on an single-use connection.

The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
programs using the setsockopt() will appear to work the same.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:10:00 +01:00
David Howells 19ffa01c9c rxrpc: Use structs to hold connection params and protocol info
Define and use a structure to hold connection parameters.  This makes it
easier to pass multiple connection parameters around.

Define and use a structure to hold protocol information used to hash a
connection for lookup on incoming packet.  Most of these fields will be
disposed of eventually, including the duplicate local pointer.

Whilst we're at it rename "proto" to "family" when referring to a protocol
family.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:09:59 +01:00
Dan Carpenter 0e4699e4a3 rxrpc: checking for IS_ERR() instead of NULL
rxrpc_lookup_peer_rcu() and rxrpc_lookup_peer() return NULL on error, never
error pointers, so IS_ERR() can't be used.

Fix three callers of those functions.

Fixes: be6e6707f6 ('rxrpc: Rework peer object handling to use hash table and RCU')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-22 09:09:58 +01:00
David Howells 4f95dd78a7 rxrpc: Rework local endpoint management
Rework the local RxRPC endpoint management.

Local endpoint objects are maintained in a flat list as before.  This
should be okay as there shouldn't be more than one per open AF_RXRPC socket
(there can be fewer as local endpoints can be shared if their local service
ID is 0 and they share the same local transport parameters).

Changes:

 (1) Local endpoints may now only be shared if they have local service ID 0
     (ie. they're not being used for listening).

     This prevents a scenario where process A is listening of the Cache
     Manager port and process B contacts a fileserver - which may then
     attempt to send CM requests back to B.  But if A and B are sharing a
     local endpoint, A will get the CM requests meant for B.

 (2) We use a mutex to handle lookups and don't provide RCU-only lookups
     since we only expect to access the list when opening a socket or
     destroying an endpoint.

     The local endpoint object is pointed to by the transport socket's
     sk_user_data for the life of the transport socket - allowing us to
     refer to it directly from the sk_data_ready and sk_error_report
     callbacks.

 (3) atomic_inc_not_zero() now exists and can be used to only share a local
     endpoint if the last reference hasn't yet gone.

 (4) We can remove rxrpc_local_lock - a spinlock that had to be taken with
     BH processing disabled given that we assume sk_user_data won't change
     under us.

 (5) The transport socket is shut down before we clear the sk_user_data
     pointer so that we can be sure that the transport socket's callbacks
     won't be invoked once the RCU destruction is scheduled.

 (6) Local endpoints have a work item that handles both destruction and
     event processing.  The means that destruction doesn't then need to
     wait for event processing.  The event queues can then be cleared after
     the transport socket is shut down.

 (7) Local endpoints are no longer available for resurrection beyond the
     life of the sockets that had them open.  As soon as their last ref
     goes, they are scheduled for destruction and may not have their usage
     count moved from 0.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 15:38:17 +01:00
David Howells be6e6707f6 rxrpc: Rework peer object handling to use hash table and RCU
Rework peer object handling to use a hash table instead of a flat list and
to use RCU.  Peer objects are no longer destroyed by passing them to a
workqueue to process, but rather are just passed to the RCU garbage
collector as kfree'able objects.

The hash function uses the local endpoint plus all the components of the
remote address, except for the RxRPC service ID.  Peers thus represent a
UDP port on the remote machine as contacted by a UDP port on this machine.

The RCU read lock is used to handle non-creating lookups so that they can
be called from bottom half context in the sk_error_report handler without
having to lock the hash table against modification.
rxrpc_lookup_peer_rcu() *does* take a reference on the peer object as in
the future, this will be passed to a work item for error distribution in
the error_report path and this function will cease being used in the
data_ready path.

Creating lookups are done under spinlock rather than mutex as they might be
set up due to an external stimulus if the local endpoint is a server.

Captured network error messages (ICMP) are handled with respect to this
struct and MTU size and RTT are cached here.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:12:33 +01:00
David Howells 0e119b41b7 rxrpc: Limit the listening backlog
Limit the socket incoming call backlog queue size so that a remote client
can't pump in sufficient new calls that the server runs out of memory.  Note
that this is partially theoretical at the moment since whilst the number of
calls is limited, the number of packets trying to set up new calls is not.
This will be addressed in a later patch.

If the caller of listen() specifies a backlog INT_MAX, then they get the
current maximum; anything else greater than max_backlog or anything
negative incurs EINVAL.

The limit on the maximum queue size can be set by:

	echo N >/proc/sys/net/rxrpc/max_backlog

where 4<=N<=32.

Further, set the default backlog to 0, requiring listen() to be called
before we start actually queueing new calls.  Whilst this kind of is a
change in the UAPI, the caller can't actually *accept* new calls anyway
unless they've first called listen() to put the socket into the LISTENING
state - thus the aforementioned new calls would otherwise just sit there,
eating up kernel memory.  (Note that sockets that don't have a non-zero
service ID bound don't get incoming calls anyway.)

Given that the default backlog is now 0, make the AFS filesystem call
kernel_listen() to set the maximum backlog for itself.

Possible improvements include:

 (1) Trimming a too-large backlog to max_backlog when listen is called.

 (2) Trimming the backlog value whenever the value is used so that changes
     to max_backlog are applied to an open socket automatically.  Note that
     the AFS filesystem opens one socket and keeps it open for extended
     periods, so would miss out on changes to max_backlog.

 (3) Having a separate setting for the AFS filesystem.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:14:47 -07:00
David Howells 2341e07757 rxrpc: Simplify connect() implementation and simplify sendmsg() op
Simplify the RxRPC connect() implementation.  It will just note the
destination address it is given, and if a sendmsg() comes along with no
address, this will be assigned as the address.  No transport struct will be
held internally, which will allow us to remove this later.

Simplify sendmsg() also.  Whilst a call is active, userspace refers to it
by a private unique user ID specified in a control message.  When sendmsg()
sees a user ID that doesn't map to an extant call, it creates a new call
for that user ID and attempts to add it.  If, when we try to add it, the
user ID is now registered, we now reject the message with -EEXIST.  We
should never see this situation unless two threads are racing, trying to
create a call with the same ID - which would be an error.

It also isn't required to provide sendmsg() with an address - provided the
control message data holds a user ID that maps to a currently active call.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 23:30:12 -07:00
Joe Perches 9b6d53985f rxrpc: Use pr_<level> and pr_fmt, reduce object size a few KB
Use the more common kernel logging style and reduce object size.

The logging message prefix changes from a mixture of
"RxRPC:" and "RXRPC:" to "af_rxrpc: ".

$ size net/rxrpc/built-in.o*
   text	   data	    bss	    dec	    hex	filename
  64172	   1972	   8304	  74448	  122d0	net/rxrpc/built-in.o.new
  67512	   1972	   8304	  77788	  12fdc	net/rxrpc/built-in.o.old

Miscellanea:

o Consolidate the ASSERT macros to use a single pr_err call with
  decimal and hexadecimal output and a stringified #OP argument

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-03 19:41:31 -04:00
David Howells 648af7fca1 rxrpc: Absorb the rxkad security module
Absorb the rxkad security module into the af_rxrpc module so that there's
only one module file.  This avoids a circular dependency whereby rxkad pins
af_rxrpc and cached connections pin rxkad but can't be manually evicted
(they will expire eventually and cease pinning).

With this change, af_rxrpc can just be unloaded, despite having cached
connections.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:41 -04:00
David Howells dad8aff754 rxrpc: Replace all unsigned with unsigned int
Replace all "unsigned" types with "unsigned int" types.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 15:14:57 -04:00
David Howells ab802ee0ab rxrpc: Clear the unused part of a sockaddr_rxrpc for memcmp() use
Clear the unused part of a sockaddr_rxrpc structs so that memcmp() can be
used to compare them.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:59:49 +00:00
David Howells b4f1342f91 rxrpc: Adjust some whitespace and comments
Remove some excess whitespace, insert some missing spaces and adjust a
couple of comments.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:56:19 +00:00
David Howells e33b3d97bc rxrpc: The protocol family should be set to PF_RXRPC not PF_UNIX
Fix the protocol family set in the proto_ops for rxrpc to be PF_RXRPC not
PF_UNIX.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:54:27 +00:00
David Howells 0d12f8a402 rxrpc: Keep the skb private record of the Rx header in host byte order
Currently, a copy of the Rx packet header is copied into the the sk_buff
private data so that we can advance the pointer into the buffer,
potentially discarding the original.  At the moment, this copy is held in
network byte order, but this means we're doing a lot of unnecessary
translations.

The reasons it was done this way are that we need the values in network
byte order occasionally and we can use the copy, slightly modified, as part
of an iov array when sending an ack or an abort packet.

However, it seems more reasonable on review that it would be better kept in
host byte order and that we make up a new header when we want to send
another packet.

To this end, rename the original header struct to rxrpc_wire_header (with
BE fields) and institute a variant called rxrpc_host_header that has host
order fields.  Change the struct in the sk_buff private data into an
rxrpc_host_header and translate the values when filling it in.

This further allows us to keep values kept in various structures in host
byte order rather than network byte order and allows removal of some fields
that are byteswapped duplicates.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:53:46 +00:00
Herbert Xu 1ce0bf50ae net: Generalise wq_has_sleeper helper
The memory barrier in the helper wq_has_sleeper is needed by just
about every user of waitqueue_active.  This patch generalises it
by making it take a wait_queue_head_t directly.  The existing
helper is renamed to skwq_has_sleeper.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 14:47:33 -05:00
David Howells 146aa8b145 KEYS: Merge the type-specific data with the payload data
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
cc: ecryptfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-devel@lists.sourceforge.net
2015-10-21 15:18:36 +01:00
Eric W. Biederman 11aa9c28b4 net: Pass kern from net_proto_family.create to sk_alloc
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:50:17 -04:00
Ying Xue 1b78414047 net: Remove iocb argument from sendmsg and recvmsg
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 13:06:31 -05:00
David Howells 5873c0834f af_rxrpc: Add sysctls for configuring RxRPC parameters
Add sysctls for configuring RxRPC protocol handling, specifically controls on
delays before ack generation, the delay before resending a packet, the maximum
lifetime of a call and the expiration times of calls, connections and
transports that haven't been recently used.

More info added in Documentation/networking/rxrpc.txt.

Signed-off-by: David Howells <dhowells@redhat.com>
2014-02-26 17:25:06 +00:00
Gao feng ece31ffd53 net: proc: change proc_net_remove to remove_proc_entry
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Gao feng d4beaa66ad net: proc: change proc_net_fops_create to proc_create
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
YOSHIFUJI Hideaki / 吉藤英明 ce6654cfc1 rxrpc: Use FIELD_SIZEOF() in af_rxrpc_init().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-09 23:38:24 -08:00
Eric Dumazet 95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Tejun Heo e1fcc7e2a7 rxrpc: rxrpc_workqueue isn't used during memory reclaim
rxrpc_workqueue isn't depended upon while reclaiming memory.  Convert
to alloc_workqueue() without WQ_MEM_RECLAIM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 09:25:11 -08:00
Eric Dumazet 4381548237 net: sock_def_readable() and friends RCU conversion
sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
need two atomic operations (and associated dirtying) per incoming
packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one of such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect RCU grace period when freeing a "struct socket_wq"

4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
socket_wq"

5) Change sk_sleep() function to use new sk->sk_wq instead of
sk->sk_sleep

6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
a rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to :
  - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
  - Use wq_has_sleeper() to eventually wakeup tasks.
  - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use rcu protection as well.

9) Exceptions :
  macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
instead of dynamically allocated ones. They dont need rcu freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-01 15:00:15 -07:00
Eric Dumazet aa39514516 net: sk_sleep() helper
Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 16:37:13 -07:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Octavian Purdila 09ad9bc752 net: use net_eq to compare nets
Generated with the following semantic patch

@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)

@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)

applied over {include,net,drivers/net}.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25 15:14:13 -08:00
Eric Paris 3f378b6844 net: pass kern to net_proto_family create function
The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace.  This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-05 22:18:14 -08:00
Stephen Hemminger ec1b4cf74c net: mark net_proto_ops as const
All usages of structure net_proto_ops should be declared const.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-07 01:10:46 -07:00
David S. Miller b7058842c9 net: Make setsockopt() optlen be unsigned.
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.

Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-30 16:12:20 -07:00