remarkable-linux/include/linux/filter.h
Alexei Starovoitov bd4cf0ed33 net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:

1. Fall-through jumps:

  BPF jump instructions are forced to go either 'true' or 'false'
  branch which causes branch-miss penalty. The new BPF jump
  instructions have only one branch and fall-through otherwise,
  which fits the CPU branch predictor logic better. `perf stat`
  shows drastic difference for branch-misses between the old and
  new code.

2. Jump-threaded implementation of interpreter vs switch
   statement:

  Instead of single table-jump at the top of 'switch' statement,
  gcc will now generate multiple table-jump instructions, which
  helps CPU branch predictor logic.

Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.

We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.

The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.

In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):

  - Number of registers increase from 2 to 10
  - Register width increases from 32-bit to 64-bit
  - Conditional jt/jf targets replaced with jt/fall-through
  - Adds signed > and >= insns
  - 16 4-byte stack slots for register spill-fill replaced
    with up to 512 bytes of multi-use stack space
  - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions
  - Adds arithmetic right shift and endianness conversion insns
  - Adds atomic_add insn
  - Old tax/txa insns are replaced with 'mov dst,src' insn

Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:

--x86_64--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF      90       101        192       202
new BPF      31        71         47        97
old BPF jit  12        34         17        44
new BPF jit TBD

--i386--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     107       136        227       252
new BPF      40       119         69       172

--arm32--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     202       300        475       540
new BPF     180       270        330       470
old BPF jit  26       182         37       202
new BPF jit TBD

Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.

While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.

Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...

  rc = seccomp_load(ctx);

  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

'short filter' has 2 rules
'large filter' has 200 rules

'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
               difference on arm32

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.58 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter
old BPF: 8.6 seconds
    73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
    10.70%    bench  libc-2.15.so       [.] syscall
     5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
     1.97%    bench  [kernel.kallsyms]  [k] system_call
new BPF: 5.7 seconds
    66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
    16.75%    bench  libc-2.15.so       [.] syscall
     3.31%    bench  [kernel.kallsyms]  [k] system_call
     2.88%    bench  [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:09 -04:00

230 lines
5.5 KiB
C

/*
* Linux Socket Filter Data Structures
*/
#ifndef __LINUX_FILTER_H__
#define __LINUX_FILTER_H__
#include <linux/atomic.h>
#include <linux/compat.h>
#include <linux/workqueue.h>
#include <uapi/linux/filter.h>
/* Internally used and optimized filter representation with extended
* instruction set based on top of classic BPF.
*/
/* instruction classes */
#define BPF_ALU64 0x07 /* alu mode in double word width */
/* ld/ldx fields */
#define BPF_DW 0x18 /* double word */
#define BPF_XADD 0xc0 /* exclusive add */
/* alu/jmp fields */
#define BPF_MOV 0xb0 /* mov reg to reg */
#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
/* change endianness of a register */
#define BPF_END 0xd0 /* flags for endianness conversion: */
#define BPF_TO_LE 0x00 /* convert to little-endian */
#define BPF_TO_BE 0x08 /* convert to big-endian */
#define BPF_FROM_LE BPF_TO_LE
#define BPF_FROM_BE BPF_TO_BE
#define BPF_JNE 0x50 /* jump != */
#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
#define BPF_CALL 0x80 /* function call */
#define BPF_EXIT 0x90 /* function return */
/* BPF has 10 general purpose 64-bit registers and stack frame. */
#define MAX_BPF_REG 11
/* BPF program can access up to 512 bytes of stack space. */
#define MAX_BPF_STACK 512
/* Arg1, context and stack frame pointer register positions. */
#define ARG1_REG 1
#define CTX_REG 6
#define FP_REG 10
struct sock_filter_int {
__u8 code; /* opcode */
__u8 a_reg:4; /* dest register */
__u8 x_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};
#ifdef CONFIG_COMPAT
/* A struct sock_filter is architecture independent. */
struct compat_sock_fprog {
u16 len;
compat_uptr_t filter; /* struct sock_filter * */
};
#endif
struct sock_fprog_kern {
u16 len;
struct sock_filter *filter;
};
struct sk_buff;
struct sock;
struct seccomp_data;
struct sk_filter {
atomic_t refcnt;
u32 jited:1, /* Is our filter JIT'ed? */
len:31; /* Number of filter blocks */
struct sock_fprog_kern *orig_prog; /* Original BPF program */
struct rcu_head rcu;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter_int *filter);
union {
struct sock_filter insns[0];
struct sock_filter_int insnsi[0];
struct work_struct work;
};
};
static inline unsigned int sk_filter_size(unsigned int proglen)
{
return max(sizeof(struct sk_filter),
offsetof(struct sk_filter, insns[proglen]));
}
#define sk_filter_proglen(fprog) \
(fprog->len * sizeof(fprog->filter[0]))
#define SK_RUN_FILTER(filter, ctx) \
(*filter->bpf_func)(ctx, filter->insnsi)
int sk_filter(struct sock *sk, struct sk_buff *skb);
u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
const struct sock_filter_int *insni);
u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
const struct sock_filter_int *insni);
int sk_convert_filter(struct sock_filter *prog, int len,
struct sock_filter_int *new_prog, int *new_len);
int sk_unattached_filter_create(struct sk_filter **pfp,
struct sock_fprog *fprog);
void sk_unattached_filter_destroy(struct sk_filter *fp);
int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
int sk_detach_filter(struct sock *sk);
int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
unsigned int len);
void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
void sk_filter_charge(struct sock *sk, struct sk_filter *fp);
void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
#ifdef CONFIG_BPF_JIT
#include <stdarg.h>
#include <linux/linkage.h>
#include <linux/printk.h>
void bpf_jit_compile(struct sk_filter *fp);
void bpf_jit_free(struct sk_filter *fp);
static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
u32 pass, void *image)
{
pr_err("flen=%u proglen=%u pass=%u image=%pK\n",
flen, proglen, pass, image);
if (image)
print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
16, 1, image, proglen, false);
}
#else
#include <linux/slab.h>
static inline void bpf_jit_compile(struct sk_filter *fp)
{
}
static inline void bpf_jit_free(struct sk_filter *fp)
{
kfree(fp);
}
#endif
static inline int bpf_tell_extensions(void)
{
return SKF_AD_MAX;
}
enum {
BPF_S_RET_K = 1,
BPF_S_RET_A,
BPF_S_ALU_ADD_K,
BPF_S_ALU_ADD_X,
BPF_S_ALU_SUB_K,
BPF_S_ALU_SUB_X,
BPF_S_ALU_MUL_K,
BPF_S_ALU_MUL_X,
BPF_S_ALU_DIV_X,
BPF_S_ALU_MOD_K,
BPF_S_ALU_MOD_X,
BPF_S_ALU_AND_K,
BPF_S_ALU_AND_X,
BPF_S_ALU_OR_K,
BPF_S_ALU_OR_X,
BPF_S_ALU_XOR_K,
BPF_S_ALU_XOR_X,
BPF_S_ALU_LSH_K,
BPF_S_ALU_LSH_X,
BPF_S_ALU_RSH_K,
BPF_S_ALU_RSH_X,
BPF_S_ALU_NEG,
BPF_S_LD_W_ABS,
BPF_S_LD_H_ABS,
BPF_S_LD_B_ABS,
BPF_S_LD_W_LEN,
BPF_S_LD_W_IND,
BPF_S_LD_H_IND,
BPF_S_LD_B_IND,
BPF_S_LD_IMM,
BPF_S_LDX_W_LEN,
BPF_S_LDX_B_MSH,
BPF_S_LDX_IMM,
BPF_S_MISC_TAX,
BPF_S_MISC_TXA,
BPF_S_ALU_DIV_K,
BPF_S_LD_MEM,
BPF_S_LDX_MEM,
BPF_S_ST,
BPF_S_STX,
BPF_S_JMP_JA,
BPF_S_JMP_JEQ_K,
BPF_S_JMP_JEQ_X,
BPF_S_JMP_JGE_K,
BPF_S_JMP_JGE_X,
BPF_S_JMP_JGT_K,
BPF_S_JMP_JGT_X,
BPF_S_JMP_JSET_K,
BPF_S_JMP_JSET_X,
/* Ancillary data */
BPF_S_ANC_PROTOCOL,
BPF_S_ANC_PKTTYPE,
BPF_S_ANC_IFINDEX,
BPF_S_ANC_NLATTR,
BPF_S_ANC_NLATTR_NEST,
BPF_S_ANC_MARK,
BPF_S_ANC_QUEUE,
BPF_S_ANC_HATYPE,
BPF_S_ANC_RXHASH,
BPF_S_ANC_CPU,
BPF_S_ANC_ALU_XOR_X,
BPF_S_ANC_SECCOMP_LD_W,
BPF_S_ANC_VLAN_TAG,
BPF_S_ANC_VLAN_TAG_PRESENT,
BPF_S_ANC_PAY_OFFSET,
};
#endif /* __LINUX_FILTER_H__ */