alistair23-linux/net
John Fastabend 4f738adba3 bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACESS to the user. Returning SK_PASS will allow the msg to
be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.

In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.

The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
		&obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:38 +01:00
..
6lowpan
9p virtio: bugfixes 2018-02-15 14:29:27 -08:00
802
8021q net: Convert /proc creating and destroying pernet_operations 2018-02-27 11:01:35 -05:00
appletalk net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
atm net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ax25 net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
bluetooth Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next 2018-02-15 15:43:49 -05:00
bpf bpf: fix null pointer deref in bpf_prog_test_run_xdp 2018-02-01 07:43:56 -08:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
caif net: Convert caif_net_ops 2018-03-05 10:48:27 -05:00
can net: Convert cangw_pernet_ops 2018-03-05 10:48:27 -05:00
ceph libceph, ceph: avoid memory leak when specifying same option several times 2018-02-26 16:19:30 +01:00
core bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data 2018-03-19 21:14:38 +01:00
dcb
dccp net: Convert dccp_v6_ops 2018-03-05 10:48:28 -05:00
decnet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
dns_resolver afs: Support the AFS dynamic root 2018-02-06 14:43:37 +00:00
dsa dsa: Pass the port to get_sset_count() 2018-03-04 13:34:18 -05:00
ethernet
hsr
ieee802154 Merge branch 'ieee802154-for-davem-2018-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next 2018-02-27 11:14:57 -05:00
ife
ipv4 net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG 2018-03-19 21:14:38 +01:00
ipv6 ip6mr: remove synchronize_rcu() in favor of SOCK_RCU_FREE 2018-03-07 18:13:41 -05:00
iucv net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
kcm net: Convert simple pernet_operations 2018-02-27 11:01:35 -05:00
key net: Convert /proc creating and destroying pernet_operations 2018-02-27 11:01:35 -05:00
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
l3mdev
lapb
llc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
mac802154
mpls net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len 2018-03-04 17:49:17 -05:00
ncsi net/ncsi: Add generic netlink family 2018-03-05 10:43:37 -05:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
netlabel
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
netrom net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
nsh
openvswitch openvswitch: Remove padding from packet before L3+ conntrack processing 2018-02-01 09:46:22 -05:00
packet net: Convert packet_net_ops 2018-02-13 10:36:08 -05:00
phonet net: Convert /proc creating and destroying pernet_operations 2018-02-27 11:01:35 -05:00
psample
qrtr Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
rds rds: use list structure to track information for zerocopy completion notification 2018-03-07 18:05:57 -05:00
rfkill vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
rose net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
rxrpc rxrpc: Fix send in rxrpc_send_data_packet() 2018-02-22 15:37:47 -05:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
sctp sctp: add support for snd flag SCTP_SENDALL process in sendmsg 2018-03-07 10:55:29 -05:00
smc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
strparser
sunrpc net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
switchdev
tipc tipc: bcast: use true and false for boolean values 2018-03-07 12:18:00 -05:00
tls net: generalize sk_alloc_sg to work with scatterlist rings 2018-03-19 21:14:38 +01:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
vmw_vsock net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
wimax
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
x25 x25: use %*ph to print small buffer 2018-02-20 13:51:47 -05:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-06 01:20:46 -05:00
compat.c
Kconfig Staging/IIO patches for 4.16-rc1 2018-02-01 09:51:57 -08:00
Makefile
socket.c socket: skip checking sk_err for recvmmsg(MSG_ERRQUEUE) 2018-03-01 21:27:49 -05:00
sysctl_net.c net: Convert sysctl_pernet_ops 2018-02-13 10:36:05 -05:00