Commit graph

65 commits

Author SHA1 Message Date
Alexei Starovoitov 7ae457c1e5 net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into
struct sk_filter {
	atomic_t        refcnt;
	struct rcu_head rcu;
	struct bpf_prog *prog;
};
and
struct bpf_prog {
        u32                     jited:1,
                                len:31;
        struct sock_fprog_kern  *orig_prog;
        unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                            const struct bpf_insn *filter);
        union {
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[0];
                struct work_struct      work;
        };
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases

split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:

sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:03:58 -07:00
Alexei Starovoitov 8fb575ca39 net: filter: rename sk_convert_filter() -> bpf_convert_filter()
to indicate that this function is converting classic BPF into eBPF
and not related to sockets

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Alexei Starovoitov 4df95ff488 net: filter: rename sk_chk_filter() -> bpf_check_classic()
trivial rename to indicate that this functions performs classic BPF checking

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Alexei Starovoitov 009937e78a net: filter: rename sk_filter_proglen -> bpf_classic_proglen
trivial rename to better match semantics of macro

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Alexei Starovoitov 278571baca net: filter: simplify socket charging
attaching bpf program to a socket involves multiple socket memory arithmetic,
since size of 'sk_filter' is changing when classic BPF is converted to eBPF.
Also common path of program creation has to deal with two ways of freeing
the memory.

Simplify the code by delaying socket charging until program is ready and
its size is known

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:37 -07:00
Alexei Starovoitov 2695fb552c net: filter: rename 'struct sock_filter_int' into 'struct bpf_insn'
eBPF is used by socket filtering, seccomp and soon by tracing and
exposed to userspace, therefore 'sock_filter_int' name is not accurate.
Rename it to 'bpf_insn'

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-24 23:27:17 -07:00
Eric Dumazet ec31a05c4d net: filter: sk_chk_filter() no longer mangles filter
Add const attribute to filter argument to make clear it is no
longer modified.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-13 23:27:41 -07:00
Zi Shen Lim 9f12fbe603 net: filter: move load_pointer() into filter.h
load_pointer() is already a static inline function.
Let's move it into filter.h so BPF JIT implementations can reuse this
function.

Since we're exporting this function, let's also rename it to
bpf_load_pointer() for clarity.

Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-08 14:20:00 -07:00
Alexei Starovoitov e430f34ee5 net: filter: cleanup A/X name usage
The macro 'A' used in internal BPF interpreter:
 #define A regs[insn->a_reg]
was easily confused with the name of classic BPF register 'A', since
'A' would mean two different things depending on context.

This patch is trying to clean up the naming and clarify its usage in the
following way:

- A and X are names of two classic BPF registers

- BPF_REG_A denotes internal BPF register R0 used to map classic register A
  in internal BPF programs generated from classic

- BPF_REG_X denotes internal BPF register R7 used to map classic register X
  in internal BPF programs generated from classic

- internal BPF instruction format:
struct sock_filter_int {
        __u8    code;           /* opcode */
        __u8    dst_reg:4;      /* dest register */
        __u8    src_reg:4;      /* source register */
        __s16   off;            /* signed offset */
        __s32   imm;            /* signed immediate constant */
};

- BPF_X/BPF_K is 1 bit used to encode source operand of instruction
In classic:
  BPF_X - means use register X as source operand
  BPF_K - means use 32-bit immediate as source operand
In internal:
  BPF_X - means use 'src_reg' register as source operand
  BPF_K - means use 32-bit immediate as source operand

Suggested-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 00:13:16 -07:00
Daniel Borkmann f8f6d679aa net: filter: improve filter block macros
Commit 9739eef13c ("net: filter: make BPF conversion more readable")
started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
macros from classic BPF.

However, quite some statements in the filter conversion functions
remained in the old style which gives a mixture of block macros and
non block macros in the code. This patch makes the block macros itself
more readable by using explicit member initialization, and converts
the remaining ones where possible to remain in a more consistent state.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-01 22:16:58 -07:00
Daniel Borkmann 3480593131 net: filter: get rid of BPF_S_* enum
This patch finally allows us to get rid of the BPF_S_* enum.
Currently, the code performs unnecessary encode and decode
workarounds in seccomp and filter migration itself when a filter
is being attached in order to overcome BPF_S_* encoding which
is not used anymore by the new interpreter resp. JIT compilers.

Keeping it around would mean that also in future we would need
to extend and maintain this enum and related encoders/decoders.
We can get rid of all that and save us these operations during
filter attaching. Naturally, also JIT compilers need to be updated
by this.

Before JIT conversion is being done, each compiler checks if A
is being loaded at startup to obtain information if it needs to
emit instructions to clear A first. Since BPF extensions are a
subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
for extensions can be removed at that point. To ease and minimalize
code changes in the classic JITs, we have introduced bpf_anc_helper().

Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
unfortunately didn't have access, but changes are analogous to
the rest.

Joint work with Alexei Starovoitov.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Chema Gonzalez <chemag@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-01 22:16:58 -07:00
Daniel Borkmann b1fcd35cf5 net: filter: let unattached filters use sock_fprog_kern
The sk_unattached_filter_create() API is used by BPF filters that
are not directly attached or related to sockets, and are used in
team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
internal managment of obtaining filter blocks and thus already
have them in kernel memory and set up before calling into
sk_unattached_filter_create(). As a result, due to __user annotation
in sock_fprog, sparse triggers false positives (incorrect type in
assignment [different address space]) when filters are set up before
passing them to sk_unattached_filter_create(). Therefore, let
sk_unattached_filter_create() API use sock_fprog_kern to overcome
this issue.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-23 16:48:05 -04:00
Daniel Borkmann 8556ce79d5 net: filter: remove DL macro
Lets get rid of this macro. After commit 5bcfedf06f ("net: filter:
simplify label names from jump-table"), labels have become more
readable due to omission of BPF_ prefix but at the same time more
generic, so that things like `git grep -n` would not find them. As
a middle path, lets get rid of the DL macro as it's not strictly
needed and would otherwise just hide the full name.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-23 16:48:05 -04:00
Alexei Starovoitov 5fe821a9de net: filter: cleanup invocation of internal BPF
Kernel API for classic BPF socket filters is:

sk_unattached_filter_create() - validate classic BPF, convert, JIT
SK_RUN_FILTER() - run it
sk_unattached_filter_destroy() - destroy socket filter

Cleanup internal BPF kernel API as following:

sk_filter_select_runtime() - final step of internal BPF creation.
  Try to JIT internal BPF program, if JIT is not available select interpreter
SK_RUN_FILTER() - run it
sk_filter_free() - free internal BPF program

Disallow direct calls to BPF interpreter. Execution of the BPF program should
be done with SK_RUN_FILTER() macro.

Example of internal BPF create, run, destroy:

  struct sk_filter *fp;

  fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
  memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
  fp->len = prog_len;

  sk_filter_select_runtime(fp);

  SK_RUN_FILTER(fp, ctx);

  sk_filter_free(fp);

Sockets, seccomp, testsuite, tracing are using different ways to populate
sk_filter, so first steps of program creation are not common.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-21 17:07:17 -04:00
Alexei Starovoitov 622582786c net: filter: x86: internal BPF JIT
Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.

Performance:

1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
  No performance difference is observed for filters that were JIT-able before

Example assembler code for BPF filter "tcpdump port 22"

original BPF -> old JIT:            original BPF -> internal BPF -> new JIT:
   0:   push   %rbp                      0:     push   %rbp
   1:   mov    %rsp,%rbp                 1:     mov    %rsp,%rbp
   4:   sub    $0x60,%rsp                4:     sub    $0x228,%rsp
   8:   mov    %rbx,-0x8(%rbp)           b:     mov    %rbx,-0x228(%rbp) // prologue
                                        12:     mov    %r13,-0x220(%rbp)
                                        19:     mov    %r14,-0x218(%rbp)
                                        20:     mov    %r15,-0x210(%rbp)
                                        27:     xor    %eax,%eax         // clear A
   c:   xor    %ebx,%ebx                29:     xor    %r13,%r13         // clear X
   e:   mov    0x68(%rdi),%r9d          2c:     mov    0x68(%rdi),%r9d
  12:   sub    0x6c(%rdi),%r9d          30:     sub    0x6c(%rdi),%r9d
  16:   mov    0xd8(%rdi),%r8           34:     mov    0xd8(%rdi),%r10
                                        3b:     mov    %rdi,%rbx
  1d:   mov    $0xc,%esi                3e:     mov    $0xc,%esi
  22:   callq  0xffffffffe1021e15       43:     callq  0xffffffffe102bd75
  27:   cmp    $0x86dd,%eax             48:     cmp    $0x86dd,%rax
  2c:   jne    0x0000000000000069       4f:     jne    0x000000000000009a
  2e:   mov    $0x14,%esi               51:     mov    $0x14,%esi
  33:   callq  0xffffffffe1021e31       56:     callq  0xffffffffe102bd91
  38:   cmp    $0x84,%eax               5b:     cmp    $0x84,%rax
  3d:   je     0x0000000000000049       62:     je     0x0000000000000074
  3f:   cmp    $0x6,%eax                64:     cmp    $0x6,%rax
  42:   je     0x0000000000000049       68:     je     0x0000000000000074
  44:   cmp    $0x11,%eax               6a:     cmp    $0x11,%rax
  47:   jne    0x00000000000000c6       6e:     jne    0x0000000000000117
  49:   mov    $0x36,%esi               74:     mov    $0x36,%esi
  4e:   callq  0xffffffffe1021e15       79:     callq  0xffffffffe102bd75
  53:   cmp    $0x16,%eax               7e:     cmp    $0x16,%rax
  56:   je     0x00000000000000bf       82:     je     0x0000000000000110
  58:   mov    $0x38,%esi               88:     mov    $0x38,%esi
  5d:   callq  0xffffffffe1021e15       8d:     callq  0xffffffffe102bd75
  62:   cmp    $0x16,%eax               92:     cmp    $0x16,%rax
  65:   je     0x00000000000000bf       96:     je     0x0000000000000110
  67:   jmp    0x00000000000000c6       98:     jmp    0x0000000000000117
  69:   cmp    $0x800,%eax              9a:     cmp    $0x800,%rax
  6e:   jne    0x00000000000000c6       a1:     jne    0x0000000000000117
  70:   mov    $0x17,%esi               a3:     mov    $0x17,%esi
  75:   callq  0xffffffffe1021e31       a8:     callq  0xffffffffe102bd91
  7a:   cmp    $0x84,%eax               ad:     cmp    $0x84,%rax
  7f:   je     0x000000000000008b       b4:     je     0x00000000000000c2
  81:   cmp    $0x6,%eax                b6:     cmp    $0x6,%rax
  84:   je     0x000000000000008b       ba:     je     0x00000000000000c2
  86:   cmp    $0x11,%eax               bc:     cmp    $0x11,%rax
  89:   jne    0x00000000000000c6       c0:     jne    0x0000000000000117
  8b:   mov    $0x14,%esi               c2:     mov    $0x14,%esi
  90:   callq  0xffffffffe1021e15       c7:     callq  0xffffffffe102bd75
  95:   test   $0x1fff,%ax              cc:     test   $0x1fff,%rax
  99:   jne    0x00000000000000c6       d3:     jne    0x0000000000000117
                                        d5:     mov    %rax,%r14
  9b:   mov    $0xe,%esi                d8:     mov    $0xe,%esi
  a0:   callq  0xffffffffe1021e44       dd:     callq  0xffffffffe102bd91 // MSH
                                        e2:     and    $0xf,%eax
                                        e5:     shl    $0x2,%eax
                                        e8:     mov    %rax,%r13
                                        eb:     mov    %r14,%rax
                                        ee:     mov    %r13,%rsi
  a5:   lea    0xe(%rbx),%esi           f1:     add    $0xe,%esi
  a8:   callq  0xffffffffe1021e0d       f4:     callq  0xffffffffe102bd6d
  ad:   cmp    $0x16,%eax               f9:     cmp    $0x16,%rax
  b0:   je     0x00000000000000bf       fd:     je     0x0000000000000110
                                        ff:     mov    %r13,%rsi
  b2:   lea    0x10(%rbx),%esi         102:     add    $0x10,%esi
  b5:   callq  0xffffffffe1021e0d      105:     callq  0xffffffffe102bd6d
  ba:   cmp    $0x16,%eax              10a:     cmp    $0x16,%rax
  bd:   jne    0x00000000000000c6      10e:     jne    0x0000000000000117
  bf:   mov    $0xffff,%eax            110:     mov    $0xffff,%eax
  c4:   jmp    0x00000000000000c8      115:     jmp    0x000000000000011c
  c6:   xor    %eax,%eax               117:     mov    $0x0,%eax
  c8:   mov    -0x8(%rbp),%rbx         11c:     mov    -0x228(%rbp),%rbx // epilogue
  cc:   leaveq                         123:     mov    -0x220(%rbp),%r13
  cd:   retq                           12a:     mov    -0x218(%rbp),%r14
                                       131:     mov    -0x210(%rbp),%r15
                                       138:     leaveq
                                       139:     retq

On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.

The difference in generated assembler is due to the following:

Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.

New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.

Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.

2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
  extensions. New JIT supports all BPF extensions.
  Performance of such filters improves 2-4 times depending on a filter.
  The longer the filter the higher performance gain.
  Synthetic benchmarks with many ancillary loads see 20x speedup
  which seems to be the maximum gain from JIT

Notes:

. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
  and can be used to see generated assembler

. there are two jit_compile() functions and code flow for classic filters is:
  sk_attach_filter() - load classic BPF
  bpf_jit_compile() - try to JIT from classic BPF
  sk_convert_filter() - convert classic to internal
  bpf_int_jit_compile() - JIT from internal BPF

  seccomp and tracing filters will just call bpf_int_jit_compile()

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-15 16:31:30 -04:00
Alexei Starovoitov 9739eef13c net: filter: make BPF conversion more readable
Introduce BPF helper macros to define instructions
(similar to old BPF_STMT/BPF_JUMP macros)

Use them while converting classic BPF to internal
and in BPF testsuite later.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-12 00:23:55 -04:00
Daniel Borkmann 30743837dd net: filter: make register naming more comprehensible
The current code is a bit hard to parse on which registers can be used,
how they are mapped and all play together. It makes much more sense to
define this a bit more clearly so that the code is a bit more intuitive.
This patch cleans this up, and makes naming a bit more consistent among
the code. This also allows for moving some of the defines into the header
file. Clearing of A and X registers in __sk_run_filter() do not get a
particular register name assigned as they have not an 'official' function,
but rather just result from the concrete initial mapping of old BPF
programs. Since for BPF helper functions for BPF_CALL we already use
small letters, so be consistent here as well. No functional changes.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-04 19:46:31 -04:00
Daniel Borkmann 5bcfedf06f net: filter: simplify label names from jump-table
This patch simplifies label naming for the BPF jump-table.
When we define labels via DL(), we just concatenate/textify
the combination of instruction opcode which consists of the
class, subclass, word size, target register and so on. Each
time we leave BPF_ prefix intact, so that e.g. the preprocessor
generates a label BPF_ALU_BPF_ADD_BPF_X for DL(BPF_ALU, BPF_ADD,
BPF_X) whereas a label name of ALU_ADD_X is much more easy
to grasp. Pure cleanup only.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-04 19:46:31 -04:00
Chema Gonzalez 4cd3675ebf filter: added BPF random opcode
Added a new ancillary load (bpf call in eBPF parlance) that produces
a 32-bit random number. We are implementing it as an ancillary load
(instead of an ISA opcode) because (a) it is simpler, (b) allows easy
JITing, and (c) seems more in line with generic ISAs that do not have
"get a random number" as a instruction, but as an OS call.

The main use for this ancillary load is to perform random packet sampling.

Signed-off-by: Chema Gonzalez <chema@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-22 21:27:57 -04:00
Daniel Borkmann 8c482cdc35 net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W
While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
been wrongly decoded by commit a8fc927780 ("sk-filter: Add ability to
get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.

In practice, this should not have much side-effect though, as such
conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
operation PR_GET_SECCOMP will only return the current seccomp mode, but
not the filter itself. Since the transition to the new BPF infrastructure,
it's also not used anymore, so we can simply remove this as it's
unreachable.

Fixes: a8fc927780 ("sk-filter: Add ability to get socket filter program (v2)")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-14 16:26:47 -04:00
Alexei Starovoitov bd4cf0ed33 net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:

1. Fall-through jumps:

  BPF jump instructions are forced to go either 'true' or 'false'
  branch which causes branch-miss penalty. The new BPF jump
  instructions have only one branch and fall-through otherwise,
  which fits the CPU branch predictor logic better. `perf stat`
  shows drastic difference for branch-misses between the old and
  new code.

2. Jump-threaded implementation of interpreter vs switch
   statement:

  Instead of single table-jump at the top of 'switch' statement,
  gcc will now generate multiple table-jump instructions, which
  helps CPU branch predictor logic.

Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.

We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.

The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.

In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):

  - Number of registers increase from 2 to 10
  - Register width increases from 32-bit to 64-bit
  - Conditional jt/jf targets replaced with jt/fall-through
  - Adds signed > and >= insns
  - 16 4-byte stack slots for register spill-fill replaced
    with up to 512 bytes of multi-use stack space
  - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions
  - Adds arithmetic right shift and endianness conversion insns
  - Adds atomic_add insn
  - Old tax/txa insns are replaced with 'mov dst,src' insn

Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:

--x86_64--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF      90       101        192       202
new BPF      31        71         47        97
old BPF jit  12        34         17        44
new BPF jit TBD

--i386--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     107       136        227       252
new BPF      40       119         69       172

--arm32--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     202       300        475       540
new BPF     180       270        330       470
old BPF jit  26       182         37       202
new BPF jit TBD

Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.

While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.

Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:

05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

  seccomp_rule_add_exact(ctx,...
  seccomp_rule_add_exact(ctx,...

  rc = seccomp_load(ctx);

  for (i = 0; i < 10000000; i++)
     syscall(199, 100);

'short filter' has 2 rules
'large filter' has 200 rules

'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
               difference on arm32

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.58 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter
old BPF: 8.6 seconds
    73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
    10.70%    bench  libc-2.15.so       [.] syscall
     5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
     1.97%    bench  [kernel.kallsyms]  [k] system_call
new BPF: 5.7 seconds
    66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
    16.75%    bench  libc-2.15.so       [.] syscall
     3.31%    bench  [kernel.kallsyms]  [k] system_call
     2.88%    bench  [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:09 -04:00
Daniel Borkmann fbc907f0b1 net: filter: move filter accounting to filter core
This patch basically does two things, i) removes the extern keyword
from the include/linux/filter.h file to be more consistent with the
rest of Joe's changes, and ii) moves filter accounting into the filter
core framework.

Filter accounting mainly done through sk_filter_{un,}charge() take
care of the case when sockets are being cloned through sk_clone_lock()
so that removal of the filter on one socket won't result in eviction
as it's still referenced by the other.

These functions actually belong to net/core/filter.c and not
include/net/sock.h as we want to keep all that in a central place.
It's also not in fast-path so uninlining them is fine and even allows
us to get rd of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward
declaration.

Joint work with Alexei Starovoitov.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:09 -04:00
Daniel Borkmann a3ea269b8b net: filter: keep original BPF program around
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.

The reason for that use case resides in commit a8fc927780 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.

Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.

This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.

Joint work with Alexei Starovoitov.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:09 -04:00
Daniel Borkmann f8bbbfc3b9 net: filter: add jited flag to indicate jit compiled filters
This patch adds a jited flag into sk_filter struct in order to indicate
whether a filter is currently jited or not. The size of sk_filter is
not being expanded as the 32 bit 'len' member allows upper bits to be
reused since a filter can currently only grow as large as BPF_MAXINSNS.

Therefore, there's enough room also for other in future needed flags to
reuse 'len' field if necessary. The jited flag also allows for having
alternative interpreter functions running as currently, we can only
detect jit compiled filters by testing fp->bpf_func to not equal the
address of sk_run_filter().

Joint work with Alexei Starovoitov.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:08 -04:00
Daniel Borkmann 3769229931 net: filter: let bpf_tell_extensions return SKF_AD_MAX
Michal Sekletar added in commit ea02f9411d ("net: introduce
SO_BPF_EXTENSIONS") a facility where user space can enquire
the BPF ancillary instruction set, which is imho a step into
the right direction for letting user space high-level to BPF
optimizers make an informed decision for possibly using these
extensions.

The original rationale was to return through a getsockopt(2)
a bitfield of which instructions are supported and which
are not, as of right now, we just return 0 to indicate a
base support for SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET.
Limitations of this approach are that this API which we need
to maintain for a long time can only support a maximum of 32
extensions, and needs to be additionally maintained/updated
when each new extension that comes in.

I thought about this a bit more and what we can do here to
overcome this is to just return SKF_AD_MAX. Since we never
remove any extension since we cannot break user space and
always linearly increase SKF_AD_MAX on each newly added
extension, user space can make a decision on what extensions
are supported in the whole set of extensions and which aren't,
by just checking which of them from the whole set have an
offset < SKF_AD_MAX of the underlying kernel.

Since SKF_AD_MAX must be updated each time we add new ones,
we don't need to introduce an additional enum and got
maintenance for free. At some point in time when
SO_BPF_EXTENSIONS becomes ubiquitous for most kernels, then
an application can simply make use of this and easily be run
on newer or older underlying kernels without needing to be
recompiled, of course. Since that is for 3.14, it's not too
late to do this change.

Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Michal Sekletar <msekleta@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21 18:57:43 -08:00
Michal Sekletar ea02f9411d net: introduce SO_BPF_EXTENSIONS
For user space packet capturing libraries such as libpcap, there's
currently only one way to check which BPF extensions are supported
by the kernel, that is, commit aa1113d9f8 ("net: filter: return
-EINVAL if BPF_S_ANC* operation is not supported"). For querying all
extensions at once this might be rather inconvenient.

Therefore, this patch introduces a new option which can be used as
an argument for getsockopt(), and allows one to obtain information
about which BPF extensions are supported by the current kernel.

As David Miller suggests, we do not need to define any bits right
now and status quo can just return 0 in order to state that this
versions supports SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET. Later
additions to BPF extensions need to add their bits to the
bpf_tell_extensions() function, as documented in the comment.

Signed-off-by: Michal Sekletar <msekleta@redhat.com>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18 19:08:58 -08:00
Alexei Starovoitov d45ed4a4e3 net: fix unsafe set_memory_rw from softirq
on x86 system with net.core.bpf_jit_enable = 1

sudo tcpdump -i eth1 'tcp port 22'

causes the warning:
[   56.766097]  Possible unsafe locking scenario:
[   56.766097]
[   56.780146]        CPU0
[   56.786807]        ----
[   56.793188]   lock(&(&vb->lock)->rlock);
[   56.799593]   <Interrupt>
[   56.805889]     lock(&(&vb->lock)->rlock);
[   56.812266]
[   56.812266]  *** DEADLOCK ***
[   56.812266]
[   56.830670] 1 lock held by ksoftirqd/1/13:
[   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
[   56.849757]
[   56.849757] stack backtrace:
[   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
[   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
[   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
[   56.923006] Call Trace:
[   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
[   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
[   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
[   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
[   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
[   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
[   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
[   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
[   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
[   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
[   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
[   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
[   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
[   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
[   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
[   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
[   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
[   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
[   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
[   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
[   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
[   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70

cannot reuse jited filter memory, since it's readonly,
so use original bpf insns memory to hold work_struct

defer kfree of sk_filter until jit completed freeing

tested on x86_64 and i386

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07 15:16:45 -04:00
David S. Miller d98cae64e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/Kconfig
	drivers/net/xen-netback/netback.c
	net/batman-adv/bat_iv_ogm.c
	net/wireless/nl80211.c

The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.

The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().

Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.

The nl80211 conflict is a little more involved.  In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported.  Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.

However, the dump handlers to not use this logic.  Instead they have
to explicitly do the locking.  There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so.  So I fixed that whilst handling the overlapping changes.

To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 16:49:39 -07:00
Nicolas Dichtel ed13998c31 sock_diag: fix filter code sent to userspace
Filters need to be translated to real BPF code for userland, like SO_GETFILTER.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 22:23:32 -07:00
Eric Dumazet 164954454a filter: do not output bpf image address for security reason
Do not leak starting address of BPF JIT code for non root users,
as it might help intruders to perform an attack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-19 23:56:41 -07:00
Xi Wang 20074f357d filter: fix va_list build error
This patch fixes the following build error.

In file included from include/linux/filter.h:52:0,
                 from arch/arm/net/bpf_jit_32.c:14:
include/linux/printk.h:54:2: error: unknown type name ‘va_list’
include/linux/printk.h:105:21: error: unknown type name ‘va_list’
include/linux/printk.h:108:30: error: unknown type name ‘va_list’

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-01 16:28:48 -04:00
Chen Gang a691ce7fe4 include/linux: printk is needed in filter.h when CONFIG_BPF_JIT is defined
for make V=1 EXTRA_CFLAGS=-W ARCH=arm allmodconfig
    printk is need when CONFIG_BPF_JIT is defined
    or it will report pr_err and print_hex_dump are implicit declaration

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-29 23:31:34 -04:00
Daniel Borkmann 79617801ea filter: bpf_jit_comp: refactor and unify BPF JIT image dump output
If bpf_jit_enable > 1, then we dump the emitted JIT compiled image
after creation. Currently, only SPARC and PowerPC has similar output
as in the reference implementation on x86_64. Make a small helper
function in order to reduce duplicated code and make the dump output
uniform across architectures x86_64, SPARC, PPC, ARM (e.g. on ARM
flen, pass and proglen are currently not shown, but would be
interesting to know as well), also for future BPF JIT implementations
on other archs.

Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Matt Evans <matt@ozlabs.org>
Cc: Eric Dumazet <eric.dumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 17:25:56 -04:00
Daniel Borkmann 3e5289d5e3 filter: add ANC_PAY_OFFSET instruction for loading payload start offset
It is very useful to do dynamic truncation of packets. In particular,
we're interested to push the necessary header bytes to the user space and
cut off user payload that should probably not be transferred for some reasons
(e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET,
we can load it into the accumulator, and return it. E.g. in bpfc syntax ...

        ld #poff        ; { 0x20, 0, 0, 0xfffff034 },
        ret a           ; { 0x16, 0, 0, 0x00000000 },

... as a filter will accomplish this without having to do a big hackery in
a BPF filter itself. Follow-up JIT implementations are welcome.

Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 13:15:45 -04:00
Pavel Emelyanov a8fc927780 sk-filter: Add ability to get socket filter program (v2)
The SO_ATTACH_FILTER option is set only. I propose to add the get
ability by using SO_ATTACH_FILTER in getsockopt. To be less
irritating to eyes the SO_GET_FILTER alias to it is declared. This
ability is required by checkpoint-restore project to be able to
save full state of a socket.

There are two issues with getting filter back.

First, kernel modifies the sock_filter->code on filter load, thus in
order to return the filter element back to user we have to decode it
into user-visible constants. Fortunately the modification in question
is interconvertible.

Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
speed up the run-time division by doing kernel_k = reciprocal(user_k).
Bad news is that different user_k may result in same kernel_k, so we
can't get the original user_k back. Good news is that we don't have
to do it. What we need to is calculate a user2_k so, that

  reciprocal(user2_k) == reciprocal(user_k) == kernel_k

i.e. if it's re-loaded back the compiled again value will be exactly
the same as it was. That said, the user2_k can be calculated like this

  user2_k = reciprocal(kernel_k)

with an exception, that if kernel_k == 0, then user2_k == 1.

The optlen argument is treated like this -- when zero, kernel returns
the amount of sock_fprog elements in filter, otherwise it should be
large enough for the sock_fprog array.

changes since v1:
* Declared SO_GET_FILTER in all arch headers
* Added decode of vlan-tag codes

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-01 11:17:15 -04:00
Eric Dumazet f3335031b9 net: filter: add vlan tag access
BPF filters lack ability to access skb->vlan_tci

This patch adds two new ancillary accessors :

SKF_AD_VLAN_TAG         (44) mapped to vlan_tx_tag_get(skb)

SKF_AD_VLAN_TAG_PRESENT (48) mapped to vlan_tx_tag_present(skb)

This allows libpcap/tcpdump to use a kernel filter instead of
having to fallback to accept all packets, then filter them in
user space.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Ani Sinha <ani@aristanetworks.com>
Suggested-by: Daniel Borkmann <danborkmann@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-31 14:00:15 -04:00
David Howells 607ca46e97 UAPI: (Scripted) Disintegrate include/linux
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
2012-10-13 10:46:48 +01:00
Daniel Borkmann 9e49e88958 filter: add XOR instruction for use with X/K
SKF_AD_ALU_XOR_X has been added a while ago, but as an 'ancillary'
operation that is invoked through a negative offset in K within BPF
load operations. Since BPF_MOD has recently been added, BPF_XOR should
also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X
might not be an option since this is exposed to user space.

Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:49:21 -04:00
Eric Dumazet b6069a9570 filter: add MOD operation
Add a new ALU opcode, to compute a modulus.

Commit ffe06c17af used an ancillary to implement XOR_X,
but here we reserve one of the available ALU opcode to implement both
MOD_X and MOD_K

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: George Bakos <gbakos@alpinista.org>
Cc: Jay Schulist <jschlst@samba.org>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 15:44:56 -04:00
Linus Torvalds cb60e3e65c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "New notable features:
   - The seccomp work from Will Drewry
   - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
   - Longer security labels for Smack from Casey Schaufler
   - Additional ptrace restriction modes for Yama by Kees Cook"

Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
  apparmor: fix long path failure due to disconnected path
  apparmor: fix profile lookup for unconfined
  ima: fix filename hint to reflect script interpreter name
  KEYS: Don't check for NULL key pointer in key_validate()
  Smack: allow for significantly longer Smack labels v4
  gfp flags for security_inode_alloc()?
  Smack: recursive tramsmute
  Yama: replace capable() with ns_capable()
  TOMOYO: Accept manager programs which do not start with / .
  KEYS: Add invalidation support
  KEYS: Do LRU discard in full keyrings
  KEYS: Permit in-place link replacement in keyring list
  KEYS: Perform RCU synchronisation on keys prior to key destruction
  KEYS: Announce key type (un)registration
  KEYS: Reorganise keys Makefile
  KEYS: Move the key config into security/keys/Kconfig
  KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
  Yama: remove an unused variable
  samples/seccomp: fix dependencies on arch macros
  Yama: add additional ptrace scopes
  ...
2012-05-21 20:27:36 -07:00
Will Drewry 0c5fe1b422 net/compat.c,linux/filter.h: share compat_sock_fprog
Any other users of bpf_*_filter that take a struct sock_fprog from
userspace will need to be able to also accept a compat_sock_fprog
if the arch supports compat calls.  This change allows the existing
compat_sock_fprog be shared.

Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Paris <eparis@redhat.com>

v18: tasered by the apostrophe police
v14: rebase/nochanges
v13: rebase on to 88ebdda615
v12: rebase on to linux-next
v11: introduction
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-14 11:13:19 +10:00
Will Drewry 46b325c7eb sk_run_filter: add BPF_S_ANC_SECCOMP_LD_W
Introduces a new BPF ancillary instruction that all LD calls will be
mapped through when skb_run_filter() is being used for seccomp BPF.  The
rewriting will be done using a secondary chk_filter function that is run
after skb_chk_filter.

The code change is guarded by CONFIG_SECCOMP_FILTER which is added,
along with the seccomp_bpf_load() function later in this series.

This is based on http://lkml.org/lkml/2012/3/2/141

Suggested-by: Indan Zupancic <indan@nul.nu>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Eric Paris <eparis@redhat.com>

v18: rebase
...
v15: include seccomp.h explicitly for when seccomp_bpf_load exists.
v14: First cut using a single additional instruction
... v13: made bpf functions generic.
Signed-off-by: James Morris <james.l.morris@oracle.com>
2012-04-14 11:13:19 +10:00
Jiri Pirko ffe06c17af filter: add XOR operation
Add XOR instruction fo BPF machine. Needed for computing packet hashes.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03 18:36:20 -04:00
Jiri Pirko 302d663740 filter: Allow to create sk-unattached filters
Today, BPF filters are bind to sockets. Since BPF machine becomes handy
for other purposes, this patch allows to create unattached filter.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03 18:36:20 -04:00
Dan Carpenter 4f25af2782 filter: use unsigned int to silence static checker warning
This is just a cleanup.

My testing version of Smatch warns about this:
net/core/filter.c +380 check_load_and_stores(6)
	warn: check 'flen' for negative values

flen comes from the user.  We try to clamp the values here between 1
and BPF_MAXINSNS but the clamp doesn't work because it could be
negative.  This is a bug, but it's not exploitable.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 19:35:51 -04:00
Arun Sharma 60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Heiko Carstens 792d4b5cb1 net: filter: move forward declarations to avoid compile warnings
Get rid of this compile warning:

In file included from arch/s390/kernel/compat_linux.c:37:0:
include/linux/filter.h:139:23: warning: 'struct sk_buff' declared inside parameter list

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 21:01:22 -04:00
Eric Dumazet 0a14842f5a net: filter: Just In Time compiler for x86-64
In order to speedup packet filtering, here is an implementation of a
JIT compiler for x86_64

It is disabled by default, and must be enabled by the admin.

echo 1 >/proc/sys/net/core/bpf_jit_enable

It uses module_alloc() and module_free() to get memory in the 2GB text
kernel range since we call helpers functions from the generated code.

EAX : BPF A accumulator
EBX : BPF X accumulator
RDI : pointer to skb   (first argument given to JIT function)
RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
r9d : skb->len - skb->data_len (headlen)
r8  : skb->data

To get a trace of generated code, use :

echo 2 >/proc/sys/net/core/bpf_jit_enable

Example of generated code :

# tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24

flen=18 proglen=147 pass=3 image=ffffffffa00b5000
JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
JIT code: ffffffffa00b5090: c0 c9 c3

BPF program is 144 bytes long, so native program is almost same size ;)

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 8
(002) ld       [26]
(003) and      #0xffffff00
(004) jeq      #0xc0a81400      jt 16   jf 5
(005) ld       [30]
(006) and      #0xffffff00
(007) jeq      #0xc0a81400      jt 16   jf 17
(008) jeq      #0x806           jt 10   jf 9
(009) jeq      #0x8035          jt 10   jf 17
(010) ld       [28]
(011) and      #0xffffff00
(012) jeq      #0xc0a81400      jt 16   jf 13
(013) ld       [38]
(014) and      #0xffffff00
(015) jeq      #0xc0a81400      jt 16   jf 17
(016) ret      #65535
(017) ret      #0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-27 23:05:08 -07:00
Eric Dumazet 62ab081213 filter: constify sk_run_filter()
sk_run_filter() doesnt write on skb, change its prototype to reflect
this.

Fix two af_packet comments.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 10:30:34 -08:00
Eric Dumazet da2033c282 filter: add SKF_AD_RXHASH and SKF_AD_CPU
Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism,
to be able to build advanced filters.

This can help spreading packets on several sockets with a fast
selection, after RPS dispatch to N cpus for example, or to catch a
percentage of flows in one queue.

tcpdump -s 500 "cpu = 1" :

[0] ld CPU
[1] jeq #1  jt 2  jf 3
[2] ret #500
[3] ret #0

# take 12.5 % of flows (average)
tcpdump -s 1000 "rxhash & 7 = 2" :

[0] ld RXHASH
[1] and #7
[2] jeq #2  jt 3  jf 4
[3] ret #1000
[4] ret #0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rui <wirelesser@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-06 12:59:05 -08:00