Merge branch 'filter-next'

Daniel Borkmann says:

====================
BPF updates

We sat down and have heavily reworked the whole previous patchset
from v10 [1] to address all comments/concerns. This patchset therefore
*replaces* the internal BPF interpreter with the new layout as
discussed in [1], and migrates some exotic callers to properly use the
BPF API for a transparent upgrade. All other callers that already use
the BPF API in a way it should be used, need no further changes to run
the new internals. We also removed the sysctl knob entirely, and do not
expose any structure to userland, so that implementation details only
reside in kernel space. Since we are replacing the interpreter we had
to migrate seccomp in one patch along with the interpreter to not break
anything. When attaching a new filter, the flow can be described as
follows: i) test if the jit compiler is enabled and can compile the user
BPF, ii) if so, then go for it, iii) if not, then transparently migrate
the filter into the new representation, and run it in the interpreter.
Also, we have scratched the jit flag from the len attribute and made it
the initial patch in this series, as Pablo suggested in the last
feedback, thanks. For details, please refer to the patches themselves.

We did extensive testing of BPF and seccomp on the new interpreter
itself and also on the user ABIs and could not find any issues; new
performance numbers as posted in patch 8 are also still the same.

Please find more details in the patches themselves.

For all the previous history from v1 to v10, see [1]. We have decided
to drop the v11 as we have pedantically reworked the set, but of course,
included all previous feedback.

v3 -> v4:
 - Applied feedback from Dave regarding swap insns
 - Rebased on net-next
v2 -> v3:
 - Rebased to latest net-next (i.e. w/ rxhash->hash rename)
 - Fixed patch 8/9 commit message/doc as suggested by Dave
 - Rest is unchanged
v1 -> v2:
 - Rebased to latest net-next
 - Added static to ptp_filter as suggested by Dave
 - Fixed a typo in patch 8's commit message
 - Rest unchanged

Thanks !

  [1] http://thread.gmane.org/gmane.linux.kernel/1665858
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller 2014-03-31 00:45:49 -04:00
commit 9109e17f7c
20 changed files with 1657 additions and 535 deletions


@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>:
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.
BPF kernel internals
--------------------
Internally, for the kernel interpreter, a different BPF instruction set
format is being used, with underlying principles similar to the BPF described
in the previous paragraphs. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that better performance can be achieved (more details later).

It is designed to be JITed with a one-to-one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized BPF code through
a BPF backend that performs almost as fast as natively compiled code.

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile them into BPF with an
optional GCC/LLVM backend, so that it can just-in-time map to modern 64-bit
CPUs with minimal performance overhead over two steps, that is,
C -> BPF -> native code.

Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the extended interpreter. For in-kernel handlers, this all works
transparently by using sk_unattached_filter_create() for setting up the
filter, resp. sk_unattached_filter_destroy() for destroying it. The macro
SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
run the filter. 'filter' is a pointer to struct sk_filter that we got from
sk_unattached_filter_create(), and 'ctx' is the given context (e.g. skb
pointer). All constraints and restrictions from sk_chk_filter() apply before
a conversion to the new layout is being done behind the scenes!

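As a rough illustration of the dispatch described above, the following
userspace sketch models how a macro shaped like SK_RUN_FILTER(filter, ctx)
can call through a function pointer stored in the filter object. Everything
prefixed with toy_/TOY_ is a hypothetical stand-in invented for this example
(including the one-instruction "interpreter"); none of it is the kernel's
actual definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct sk_buff and struct sock_filter_int. */
struct toy_ctx  { uint32_t len; };
struct toy_insn { int32_t imm; };

/* Models the idea of struct sk_filter: a run function plus its program. */
struct toy_filter {
	unsigned int jited;	/* 0: interpreted, 1: native code */
	uint32_t (*bpf_func)(const struct toy_ctx *ctx,
			     const struct toy_insn *insns);
	struct toy_insn insnsi[1];
};

/* Same shape as SK_RUN_FILTER(filter, ctx): call whatever was installed. */
#define TOY_RUN_FILTER(filter, ctx) \
	(*(filter)->bpf_func)((ctx), (filter)->insnsi)

/* Toy "interpreter": accept imm bytes, capped at the packet length. */
static uint32_t toy_interpret(const struct toy_ctx *ctx,
			      const struct toy_insn *insns)
{
	uint32_t want = (uint32_t)insns[0].imm;

	return want < ctx->len ? want : ctx->len;
}

/* Set up a filter that runs in the toy interpreter (not JITed). */
static void toy_filter_init(struct toy_filter *fp, int32_t snaplen)
{
	assert(fp != NULL);
	fp->jited = 0;
	fp->bpf_func = toy_interpret;
	fp->insnsi[0].imm = snaplen;
}
```

A caller in the style of the PPP hunks would drop a packet when
TOY_RUN_FILTER(&filter, &pkt) evaluates to 0. The caller never needs to know
whether bpf_func points at an interpreter or at JITed code.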
Currently, for JITing, the user BPF format is being used and current BPF JIT
compilers are reused whenever possible. In other words, we do not (yet!)
perform a JIT compilation in the new layout; however, future work will
successively migrate traditional JIT compilers into the new instruction
format as well, so that they will profit from the very same benefits. Thus,
when speaking about JIT in the following, a JIT compiler (TBD) for the new
instruction format is meant in this context.

Some core changes of the new internal format:

- Number of registers increases from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to be 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass arguments to functions via registers,
  the number of args from a BPF program to an in-kernel function is restricted
  to 5, and one register is used to accept the return value from an in-kernel
  function. Natively, x86_64 passes the first 6 arguments in registers,
  aarch64/sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6
  callee saved registers, and aarch64/sparcv9/mips64 have 11 or more callee
  saved registers.

  Therefore, the BPF calling convention is defined as:

    * R0      - return value from in-kernel function
    * R1 - R5 - arguments from BPF program to in-kernel function
    * R6 - R9 - callee saved registers that in-kernel function will preserve
    * R10     - read-only frame pointer to access stack

  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and the BPF calling convention maps directly to ABIs used by the kernel
  on 64-bit architectures.

  On 32-bit architectures, a JIT may map programs that use only 32-bit
  arithmetic and may let more complex programs be interpreted.

  R0 - R5 are scratch registers and a BPF program needs to spill/fill them if
  necessary across calls. Note that there is only one BPF program (== one BPF
  main routine) and it cannot call other BPF functions; it can only call
  predefined in-kernel functions.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit if they are being written to.
  That behavior maps directly to x86_64 and arm64 subregister definition, but
  makes other JITs more difficult.

  32-bit architectures run 64-bit internal BPF programs via the interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  the native instruction set and let the rest be interpreted.

  Operation is 64-bit because, on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
  32-bit BPF registers would otherwise require defining a register-pair ABI;
  a direct BPF register to HW register mapping would then not be possible, and
  the JIT would need to do combine/split/move operations for every register in
  and out of the function, which is complex, bug prone and slow. Another
  reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as "if (cond) jump_true;
  else jump_false;", they are being replaced with alternative constructs like
  "if (cond) jump_true; /* else fall-through */".

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  After a kernel function call, R1 - R5 are reset to unreadable and R0 has a
  return type of the function. Since R6 - R9 are callee saved, their state is
  preserved across the call.

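The register rules in the list above can be made concrete with a small
userspace model. This is only a sketch under stated assumptions: the toy_
helpers, the poison value used to stand in for "unreadable", and the
five-argument helper signature are all invented for illustration and do not
come from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_REGS 11			/* R0..R9 plus frame pointer R10 */
#define TOY_POISON   0xdeadbeefULL	/* stands in for "unreadable" */

/* Hypothetical in-kernel helper: up to five arguments, one return value. */
typedef uint64_t (*toy_helper_t)(uint64_t, uint64_t, uint64_t,
				 uint64_t, uint64_t);

/* 64-bit ALU add: operates on the full register width. */
static void toy_add64(uint64_t *regs, int dst, int src)
{
	regs[dst] += regs[src];
}

/* 32-bit ALU add: result goes to the lower subregister, upper half is 0. */
static void toy_add32(uint64_t *regs, int dst, int src)
{
	regs[dst] = (uint32_t)(regs[dst] + regs[src]);
}

/* Call: R1-R5 carry arguments, R0 receives the result. */
static void toy_call(uint64_t *regs, toy_helper_t fn)
{
	int i;

	assert(regs != NULL && fn != NULL);
	regs[0] = fn(regs[1], regs[2], regs[3], regs[4], regs[5]);

	/* R1-R5 are scratch: the program must not rely on them afterwards. */
	for (i = 1; i <= 5; i++)
		regs[i] = TOY_POISON;
	/* R6-R9 (callee saved) and R10 (frame pointer) are left untouched. */
}

/* Example helper used to exercise toy_call(). */
static uint64_t toy_sum5(uint64_t a, uint64_t b, uint64_t c,
			 uint64_t d, uint64_t e)
{
	return a + b + c + d + e;
}
```

With R1 = 0xffffffff and R2 = 1, toy_add32() leaves 0 in R1 while toy_add64()
leaves 0x100000000, which is the zero-extension behavior of the 32-bit
subregisters. A value parked in R6 before toy_call() is still there afterwards.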
Also in the new design, BPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Both original BPF and the new format use two-operand instructions,
which helps to do a one-to-one mapping between BPF insn and x86 insn during
JIT.

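The jt/fall-through change listed earlier can be sketched as a lowering of
one classic "A == k" jump. This is one plausible lowering, not necessarily
what the conversion code in this series does: all toy_ names are hypothetical,
and jump targets are kept as absolute classic instruction indices, so the
address remapping a real converter needs is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_op { TOY_JEQ, TOY_JNE, TOY_JA };

/* Classic form: if (A == k) goto next + jt; else goto next + jf; */
struct toy_classic_jmp {
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* New form: one target only, otherwise execution falls through. */
struct toy_new_jmp {
	enum toy_op op;
	uint32_t    imm;
	int         target;	/* absolute classic index, see lead-in */
};

/* Lower the classic JEQ at index i; returns how many insns were emitted. */
static int toy_convert_jeq(int i, const struct toy_classic_jmp *c,
			   struct toy_new_jmp *out)
{
	int t, f;

	assert(c != NULL && out != NULL);
	t = i + 1 + c->jt;
	f = i + 1 + c->jf;

	if (c->jf == 0) {
		/* False branch already is the next insn: keep the test. */
		out[0] = (struct toy_new_jmp){ TOY_JEQ, c->k, t };
		return 1;
	}
	if (c->jt == 0) {
		/* True branch is the next insn: invert the test instead. */
		out[0] = (struct toy_new_jmp){ TOY_JNE, c->k, f };
		return 1;
	}
	/* General case: conditional jump followed by an unconditional one. */
	out[0] = (struct toy_new_jmp){ TOY_JEQ, c->k, t };
	out[1] = (struct toy_new_jmp){ TOY_JA, 0, f };
	return 2;
}
```

Only the general case costs an extra instruction; when either branch is
already the next instruction, the single-target form is as compact as the
classic one.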
The input context pointer for invoking the interpreter function is generic;
its content is defined by a specific use case. For seccomp, register R1 points
to seccomp_data; for converted BPF filters, R1 points to a skb.

A program that is translated internally consists of the following elements:

  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32

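The two layouts above can be written down as C structs to check that both
encodings occupy 64 bits. This is a userspace sketch with hypothetical toy_
names and <stdint.h> types standing in for the kernel's __u8/__s16/__s32; it
also assumes a compiler such as GCC or Clang that accepts uint8_t bit-fields.
The opcode values are the BPF_ALU64 and BPF_MOV constants from the header
hunk in this commit.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BPF_ALU64 0x07	/* BPF_ALU64 from the header hunk */
#define TOY_BPF_MOV   0xb0	/* BPF_MOV from the header hunk */

/* Classic layout: op:16, jt:8, jf:8, k:32 */
struct toy_classic_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* New layout: op:8, a_reg:4, x_reg:4, off:16, imm:32 */
struct toy_new_insn {
	uint8_t code;		/* opcode */
	uint8_t a_reg:4;	/* dest register */
	uint8_t x_reg:4;	/* source register */
	int16_t off;		/* signed offset */
	int32_t imm;		/* signed immediate constant */
};

/* Build one new-format instruction; register numbers must fit in 4 bits. */
static struct toy_new_insn toy_make_insn(uint8_t code, unsigned int a_reg,
					 unsigned int x_reg, int16_t off,
					 int32_t imm)
{
	struct toy_new_insn insn;

	assert(a_reg < 16 && x_reg < 16);
	insn.code = code;
	insn.a_reg = a_reg;
	insn.x_reg = x_reg;
	insn.off = off;
	insn.imm = imm;
	return insn;
}
```

Because both structs are the same size, the conversion trades the two 8-bit
jump fields for two 4-bit register fields plus a 16-bit signed offset without
growing the instruction.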
Just like the original BPF, the new format runs within a controlled
environment, is deterministic and the kernel can easily prove that. The safety
of the program can be determined in two steps: the first step does a
depth-first search to disallow loops and performs other CFG validation; the
second step starts from the first insn and descends all possible paths. It
simulates execution of every insn and observes the state change of registers
and stack.

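The first of the two steps described above, a depth-first search that rejects
loops, can be sketched in userspace as follows. This is a minimal sketch under
stated assumptions, not the kernel's checker: the toy_ instruction kinds, the
64-instruction cap and the recursive walk are all invented for illustration,
and the second step (simulating register and stack state) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_INSNS 64

enum toy_kind { TOY_ALU, TOY_JMP_COND, TOY_JMP_ALWAYS, TOY_EXIT };

struct toy_insn {
	enum toy_kind kind;
	int off;		/* jump offset, relative to the next insn */
};

enum { TOY_WHITE = 0, TOY_GREY, TOY_BLACK };

/* Returns 0 if everything reachable from insn i is loop-free, else -1. */
static int toy_visit(const struct toy_insn *prog, int len, int i,
		     unsigned char *color)
{
	int succ[2], n = 0, k;

	if (i < 0 || i >= len)
		return -1;	/* jump or fall-through leaves the program */
	if (color[i] == TOY_GREY)
		return -1;	/* back edge to an insn on the current path */
	if (color[i] == TOY_BLACK)
		return 0;	/* already fully explored */

	color[i] = TOY_GREY;
	switch (prog[i].kind) {
	case TOY_ALU:
		succ[n++] = i + 1;
		break;
	case TOY_JMP_COND:
		succ[n++] = i + 1 + prog[i].off;	/* jump taken */
		succ[n++] = i + 1;			/* fall-through */
		break;
	case TOY_JMP_ALWAYS:
		succ[n++] = i + 1 + prog[i].off;
		break;
	case TOY_EXIT:
		break;
	}
	for (k = 0; k < n; k++)
		if (toy_visit(prog, len, succ[k], color) < 0)
			return -1;
	color[i] = TOY_BLACK;
	return 0;
}

/* Entry point: validate the control flow graph starting at insn 0. */
static int toy_check_cfg(const struct toy_insn *prog, int len)
{
	unsigned char color[TOY_MAX_INSNS] = { 0 };

	if (prog == NULL || len <= 0 || len > TOY_MAX_INSNS)
		return -1;
	return toy_visit(prog, len, 0, color);
}
```

A program is accepted only if every path from the first instruction reaches
an exit without revisiting an instruction that is still on the current path.
A backward jump that closes a cycle, or a path that runs off the end of the
program, makes toy_check_cfg() return -1.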
Misc
----
@ -561,3 +685,4 @@ the underlying architecture.
Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
Alexei Starovoitov <ast@plumgrid.com>


@ -925,6 +925,7 @@ void bpf_jit_compile(struct sk_filter *fp)
bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
fp->bpf_func = (void *)ctx.target;
fp->jited = 1;
out:
kfree(ctx.offsets);
return;
@ -932,7 +933,7 @@ out:
void bpf_jit_free(struct sk_filter *fp)
{
if (fp->bpf_func != sk_run_filter)
if (fp->jited)
module_free(NULL, fp->bpf_func);
kfree(fp);
}


@ -689,6 +689,7 @@ void bpf_jit_compile(struct sk_filter *fp)
((u64 *)image)[0] = (u64)code_base;
((u64 *)image)[1] = local_paca->kernel_toc;
fp->bpf_func = (void *)image;
fp->jited = 1;
}
out:
kfree(addrs);
@ -697,7 +698,7 @@ out:
void bpf_jit_free(struct sk_filter *fp)
{
if (fp->bpf_func != sk_run_filter)
if (fp->jited)
module_free(NULL, fp->bpf_func);
kfree(fp);
}


@ -877,6 +877,7 @@ void bpf_jit_compile(struct sk_filter *fp)
if (jit.start) {
set_memory_ro((unsigned long)header, header->pages);
fp->bpf_func = (void *) jit.start;
fp->jited = 1;
}
out:
kfree(addrs);
@ -887,10 +888,12 @@ void bpf_jit_free(struct sk_filter *fp)
unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
struct bpf_binary_header *header = (void *)addr;
if (fp->bpf_func == sk_run_filter)
if (!fp->jited)
goto free_filter;
set_memory_rw(addr, header->pages);
module_free(NULL, header);
free_filter:
kfree(fp);
}


@ -809,6 +809,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
if (image) {
bpf_flush_icache(image, image + proglen);
fp->bpf_func = (void *)image;
fp->jited = 1;
}
out:
kfree(addrs);
@ -817,7 +818,7 @@ out:
void bpf_jit_free(struct sk_filter *fp)
{
if (fp->bpf_func != sk_run_filter)
if (fp->jited)
module_free(NULL, fp->bpf_func);
kfree(fp);
}


@ -772,6 +772,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf] - addrs[i];
bpf_flush_icache(header, image + proglen);
set_memory_ro((unsigned long)header, header->pages);
fp->bpf_func = (void *)image;
fp->jited = 1;
}
out:
kfree(addrs);
@ -791,7 +792,7 @@ static void bpf_jit_free_deferred(struct work_struct *work)
void bpf_jit_free(struct sk_filter *fp)
{
if (fp->bpf_func != sk_run_filter) {
if (fp->jited) {
INIT_WORK(&fp->work, bpf_jit_free_deferred);
schedule_work(&fp->work);
} else {


@ -378,10 +378,15 @@ isdn_ppp_release(int min, struct file *file)
is->slcomp = NULL;
#endif
#ifdef CONFIG_IPPP_FILTER
kfree(is->pass_filter);
is->pass_filter = NULL;
kfree(is->active_filter);
is->active_filter = NULL;
if (is->pass_filter) {
sk_unattached_filter_destroy(is->pass_filter);
is->pass_filter = NULL;
}
if (is->active_filter) {
sk_unattached_filter_destroy(is->active_filter);
is->active_filter = NULL;
}
#endif
/* TODO: if this was the previous master: link the stuff to the new master */
@ -629,25 +634,41 @@ isdn_ppp_ioctl(int min, struct file *file, unsigned int cmd, unsigned long arg)
#ifdef CONFIG_IPPP_FILTER
case PPPIOCSPASS:
{
struct sock_fprog fprog;
struct sock_filter *code;
int len = get_filter(argp, &code);
int err, len = get_filter(argp, &code);
if (len < 0)
return len;
kfree(is->pass_filter);
is->pass_filter = code;
is->pass_len = len;
break;
fprog.len = len;
fprog.filter = code;
if (is->pass_filter)
sk_unattached_filter_destroy(is->pass_filter);
err = sk_unattached_filter_create(&is->pass_filter, &fprog);
kfree(code);
return err;
}
case PPPIOCSACTIVE:
{
struct sock_fprog fprog;
struct sock_filter *code;
int len = get_filter(argp, &code);
int err, len = get_filter(argp, &code);
if (len < 0)
return len;
kfree(is->active_filter);
is->active_filter = code;
is->active_len = len;
break;
fprog.len = len;
fprog.filter = code;
if (is->active_filter)
sk_unattached_filter_destroy(is->active_filter);
err = sk_unattached_filter_create(&is->active_filter, &fprog);
kfree(code);
return err;
}
#endif /* CONFIG_IPPP_FILTER */
default:
@ -1147,14 +1168,14 @@ isdn_ppp_push_higher(isdn_net_dev *net_dev, isdn_net_local *lp, struct sk_buff *
}
if (is->pass_filter
&& sk_run_filter(skb, is->pass_filter) == 0) {
&& SK_RUN_FILTER(is->pass_filter, skb) == 0) {
if (is->debug & 0x2)
printk(KERN_DEBUG "IPPP: inbound frame filtered.\n");
kfree_skb(skb);
return;
}
if (!(is->active_filter
&& sk_run_filter(skb, is->active_filter) == 0)) {
&& SK_RUN_FILTER(is->active_filter, skb) == 0)) {
if (is->debug & 0x2)
printk(KERN_DEBUG "IPPP: link-active filter: resetting huptimer.\n");
lp->huptimer = 0;
@ -1293,14 +1314,14 @@ isdn_ppp_xmit(struct sk_buff *skb, struct net_device *netdev)
}
if (ipt->pass_filter
&& sk_run_filter(skb, ipt->pass_filter) == 0) {
&& SK_RUN_FILTER(ipt->pass_filter, skb) == 0) {
if (ipt->debug & 0x4)
printk(KERN_DEBUG "IPPP: outbound frame filtered.\n");
kfree_skb(skb);
goto unlock;
}
if (!(ipt->active_filter
&& sk_run_filter(skb, ipt->active_filter) == 0)) {
&& SK_RUN_FILTER(ipt->active_filter, skb) == 0)) {
if (ipt->debug & 0x4)
printk(KERN_DEBUG "IPPP: link-active filter: resetting huptimer.\n");
lp->huptimer = 0;
@ -1490,9 +1511,9 @@ int isdn_ppp_autodial_filter(struct sk_buff *skb, isdn_net_local *lp)
}
drop |= is->pass_filter
&& sk_run_filter(skb, is->pass_filter) == 0;
&& SK_RUN_FILTER(is->pass_filter, skb) == 0;
drop |= is->active_filter
&& sk_run_filter(skb, is->active_filter) == 0;
&& SK_RUN_FILTER(is->active_filter, skb) == 0;
skb_push(skb, IPPP_MAX_HEADER - 4);
return drop;


@ -120,10 +120,6 @@ static void pch_gbe_mdio_write(struct net_device *netdev, int addr, int reg,
int data);
static void pch_gbe_set_multi(struct net_device *netdev);
static struct sock_filter ptp_filter[] = {
PTP_FILTER
};
static int pch_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
{
u8 *data = skb->data;
@ -131,7 +127,7 @@ static int pch_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
u16 *hi, *id;
u32 lo;
if (sk_run_filter(skb, ptp_filter) == PTP_CLASS_NONE)
if (ptp_classify_raw(skb) == PTP_CLASS_NONE)
return 0;
offset = ETH_HLEN + IPV4_HLEN(data) + UDP_HLEN;
@ -2635,11 +2631,6 @@ static int pch_gbe_probe(struct pci_dev *pdev,
adapter->ptp_pdev = pci_get_bus_and_slot(adapter->pdev->bus->number,
PCI_DEVFN(12, 4));
if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
dev_err(&pdev->dev, "Bad ptp filter\n");
ret = -EINVAL;
goto err_free_netdev;
}
netdev->netdev_ops = &pch_gbe_netdev_ops;
netdev->watchdog_timeo = PCH_GBE_WATCHDOG_PERIOD;


@ -31,10 +31,6 @@
#ifdef CONFIG_TI_CPTS
static struct sock_filter ptp_filter[] = {
PTP_FILTER
};
#define cpts_read32(c, r) __raw_readl(&c->reg->r)
#define cpts_write32(c, v, r) __raw_writel(v, &c->reg->r)
@ -301,7 +297,7 @@ static u64 cpts_find_ts(struct cpts *cpts, struct sk_buff *skb, int ev_type)
u64 ns = 0;
struct cpts_event *event;
struct list_head *this, *next;
unsigned int class = sk_run_filter(skb, ptp_filter);
unsigned int class = ptp_classify_raw(skb);
unsigned long flags;
u16 seqid;
u8 mtype;
@ -372,10 +368,6 @@ int cpts_register(struct device *dev, struct cpts *cpts,
int err, i;
unsigned long flags;
if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
pr_err("cpts: bad ptp filter\n");
return -EINVAL;
}
cpts->info = cpts_info;
cpts->clock = ptp_clock_register(&cpts->info, dev);
if (IS_ERR(cpts->clock)) {


@ -256,10 +256,6 @@ static int ports_open;
static struct port *npe_port_tab[MAX_NPES];
static struct dma_pool *dma_pool;
static struct sock_filter ptp_filter[] = {
PTP_FILTER
};
static int ixp_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
{
u8 *data = skb->data;
@ -267,7 +263,7 @@ static int ixp_ptp_match(struct sk_buff *skb, u16 uid_hi, u32 uid_lo, u16 seqid)
u16 *hi, *id;
u32 lo;
if (sk_run_filter(skb, ptp_filter) != PTP_CLASS_V1_IPV4)
if (ptp_classify_raw(skb) != PTP_CLASS_V1_IPV4)
return 0;
offset = ETH_HLEN + IPV4_HLEN(data) + UDP_HLEN;
@ -1413,11 +1409,6 @@ static int eth_init_one(struct platform_device *pdev)
char phy_id[MII_BUS_ID_SIZE + 3];
int err;
if (ptp_filter_init(ptp_filter, ARRAY_SIZE(ptp_filter))) {
pr_err("ixp4xx_eth: bad ptp filter\n");
return -EINVAL;
}
if (!(dev = alloc_etherdev(sizeof(struct port))))
return -ENOMEM;


@ -143,9 +143,8 @@ struct ppp {
struct sk_buff_head mrq; /* MP: receive reconstruction queue */
#endif /* CONFIG_PPP_MULTILINK */
#ifdef CONFIG_PPP_FILTER
struct sock_filter *pass_filter; /* filter for packets to pass */
struct sock_filter *active_filter;/* filter for pkts to reset idle */
unsigned pass_len, active_len;
struct sk_filter *pass_filter; /* filter for packets to pass */
struct sk_filter *active_filter;/* filter for pkts to reset idle */
#endif /* CONFIG_PPP_FILTER */
struct net *ppp_net; /* the net we belong to */
struct ppp_link_stats stats64; /* 64 bit network stats */
@ -755,28 +754,42 @@ static long ppp_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PPPIOCSPASS:
{
struct sock_filter *code;
err = get_filter(argp, &code);
if (err >= 0) {
struct sock_fprog fprog = {
.len = err,
.filter = code,
};
ppp_lock(ppp);
kfree(ppp->pass_filter);
ppp->pass_filter = code;
ppp->pass_len = err;
if (ppp->pass_filter)
sk_unattached_filter_destroy(ppp->pass_filter);
err = sk_unattached_filter_create(&ppp->pass_filter,
&fprog);
kfree(code);
ppp_unlock(ppp);
err = 0;
}
break;
}
case PPPIOCSACTIVE:
{
struct sock_filter *code;
err = get_filter(argp, &code);
if (err >= 0) {
struct sock_fprog fprog = {
.len = err,
.filter = code,
};
ppp_lock(ppp);
kfree(ppp->active_filter);
ppp->active_filter = code;
ppp->active_len = err;
if (ppp->active_filter)
sk_unattached_filter_destroy(ppp->active_filter);
err = sk_unattached_filter_create(&ppp->active_filter,
&fprog);
kfree(code);
ppp_unlock(ppp);
err = 0;
}
break;
}
@ -1184,7 +1197,7 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
a four-byte PPP header on each packet */
*skb_push(skb, 2) = 1;
if (ppp->pass_filter &&
sk_run_filter(skb, ppp->pass_filter) == 0) {
SK_RUN_FILTER(ppp->pass_filter, skb) == 0) {
if (ppp->debug & 1)
netdev_printk(KERN_DEBUG, ppp->dev,
"PPP: outbound frame "
@ -1194,7 +1207,7 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
}
/* if this packet passes the active filter, record the time */
if (!(ppp->active_filter &&
sk_run_filter(skb, ppp->active_filter) == 0))
SK_RUN_FILTER(ppp->active_filter, skb) == 0))
ppp->last_xmit = jiffies;
skb_pull(skb, 2);
#else
@ -1818,7 +1831,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
*skb_push(skb, 2) = 0;
if (ppp->pass_filter &&
sk_run_filter(skb, ppp->pass_filter) == 0) {
SK_RUN_FILTER(ppp->pass_filter, skb) == 0) {
if (ppp->debug & 1)
netdev_printk(KERN_DEBUG, ppp->dev,
"PPP: inbound frame "
@ -1827,7 +1840,7 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
return;
}
if (!(ppp->active_filter &&
sk_run_filter(skb, ppp->active_filter) == 0))
SK_RUN_FILTER(ppp->active_filter, skb) == 0))
ppp->last_recv = jiffies;
__skb_pull(skb, 2);
} else
@ -2672,6 +2685,10 @@ ppp_create_interface(struct net *net, int unit, int *retp)
ppp->minseq = -1;
skb_queue_head_init(&ppp->mrq);
#endif /* CONFIG_PPP_MULTILINK */
#ifdef CONFIG_PPP_FILTER
ppp->pass_filter = NULL;
ppp->active_filter = NULL;
#endif /* CONFIG_PPP_FILTER */
/*
* drum roll: don't forget to set
@ -2802,10 +2819,15 @@ static void ppp_destroy_interface(struct ppp *ppp)
skb_queue_purge(&ppp->mrq);
#endif /* CONFIG_PPP_MULTILINK */
#ifdef CONFIG_PPP_FILTER
kfree(ppp->pass_filter);
ppp->pass_filter = NULL;
kfree(ppp->active_filter);
ppp->active_filter = NULL;
if (ppp->pass_filter) {
sk_unattached_filter_destroy(ppp->pass_filter);
ppp->pass_filter = NULL;
}
if (ppp->active_filter) {
sk_unattached_filter_destroy(ppp->active_filter);
ppp->active_filter = NULL;
}
#endif /* CONFIG_PPP_FILTER */
kfree_skb(ppp->xmit_pending);


@ -9,28 +9,81 @@
#include <linux/workqueue.h>
#include <uapi/linux/filter.h>
#ifdef CONFIG_COMPAT
/*
* A struct sock_filter is architecture independent.
/* Internally used and optimized filter representation with extended
* instruction set based on top of classic BPF.
*/
/* instruction classes */
#define BPF_ALU64 0x07 /* alu mode in double word width */
/* ld/ldx fields */
#define BPF_DW 0x18 /* double word */
#define BPF_XADD 0xc0 /* exclusive add */
/* alu/jmp fields */
#define BPF_MOV 0xb0 /* mov reg to reg */
#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
/* change endianness of a register */
#define BPF_END 0xd0 /* flags for endianness conversion: */
#define BPF_TO_LE 0x00 /* convert to little-endian */
#define BPF_TO_BE 0x08 /* convert to big-endian */
#define BPF_FROM_LE BPF_TO_LE
#define BPF_FROM_BE BPF_TO_BE
#define BPF_JNE 0x50 /* jump != */
#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
#define BPF_CALL 0x80 /* function call */
#define BPF_EXIT 0x90 /* function return */
/* BPF has 10 general purpose 64-bit registers and stack frame. */
#define MAX_BPF_REG 11
/* BPF program can access up to 512 bytes of stack space. */
#define MAX_BPF_STACK 512
/* Arg1, context and stack frame pointer register positions. */
#define ARG1_REG 1
#define CTX_REG 6
#define FP_REG 10
struct sock_filter_int {
__u8 code; /* opcode */
__u8 a_reg:4; /* dest register */
__u8 x_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};
#ifdef CONFIG_COMPAT
/* A struct sock_filter is architecture independent. */
struct compat_sock_fprog {
u16 len;
compat_uptr_t filter; /* struct sock_filter * */
compat_uptr_t filter; /* struct sock_filter * */
};
#endif
struct sock_fprog_kern {
u16 len;
struct sock_filter *filter;
};
struct sk_buff;
struct sock;
struct seccomp_data;
struct sk_filter
{
struct sk_filter {
atomic_t refcnt;
unsigned int len; /* Number of filter blocks */
u32 jited:1, /* Is our filter JIT'ed? */
len:31; /* Number of filter blocks */
struct sock_fprog_kern *orig_prog; /* Original BPF program */
struct rcu_head rcu;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter);
const struct sock_filter_int *filter);
union {
struct sock_filter insns[0];
struct sock_filter insns[0];
struct sock_filter_int insnsi[0];
struct work_struct work;
};
};
@ -41,25 +94,44 @@ static inline unsigned int sk_filter_size(unsigned int proglen)
offsetof(struct sk_filter, insns[proglen]));
}
extern int sk_filter(struct sock *sk, struct sk_buff *skb);
extern unsigned int sk_run_filter(const struct sk_buff *skb,
const struct sock_filter *filter);
extern int sk_unattached_filter_create(struct sk_filter **pfp,
struct sock_fprog *fprog);
extern void sk_unattached_filter_destroy(struct sk_filter *fp);
extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
extern int sk_detach_filter(struct sock *sk);
extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
#define sk_filter_proglen(fprog) \
(fprog->len * sizeof(fprog->filter[0]))
#define SK_RUN_FILTER(filter, ctx) \
(*filter->bpf_func)(ctx, filter->insnsi)
int sk_filter(struct sock *sk, struct sk_buff *skb);
u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
const struct sock_filter_int *insni);
u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
const struct sock_filter_int *insni);
int sk_convert_filter(struct sock_filter *prog, int len,
struct sock_filter_int *new_prog, int *new_len);
int sk_unattached_filter_create(struct sk_filter **pfp,
struct sock_fprog *fprog);
void sk_unattached_filter_destroy(struct sk_filter *fp);
int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
int sk_detach_filter(struct sock *sk);
int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
unsigned int len);
void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
void sk_filter_charge(struct sock *sk, struct sk_filter *fp);
void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
#ifdef CONFIG_BPF_JIT
#include <stdarg.h>
#include <linux/linkage.h>
#include <linux/printk.h>
extern void bpf_jit_compile(struct sk_filter *fp);
extern void bpf_jit_free(struct sk_filter *fp);
void bpf_jit_compile(struct sk_filter *fp);
void bpf_jit_free(struct sk_filter *fp);
static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
u32 pass, void *image)
@ -70,7 +142,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
16, 1, image, proglen, false);
}
#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
#else
#include <linux/slab.h>
static inline void bpf_jit_compile(struct sk_filter *fp)
@ -80,7 +151,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
{
kfree(fp);
}
#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
#endif
static inline int bpf_tell_extensions(void)


@ -180,9 +180,8 @@ struct ippp_struct {
struct slcompress *slcomp;
#endif
#ifdef CONFIG_IPPP_FILTER
struct sock_filter *pass_filter; /* filter for packets to pass */
struct sock_filter *active_filter; /* filter for pkts to reset idle */
unsigned pass_len, active_len;
struct sk_filter *pass_filter; /* filter for packets to pass */
struct sk_filter *active_filter; /* filter for pkts to reset idle */
#endif
unsigned long debug;
struct isdn_ppp_compressor *compressor,*decompressor;


@ -27,11 +27,7 @@
#include <linux/if_vlan.h>
#include <linux/ip.h>
#include <linux/filter.h>
#ifdef __KERNEL__
#include <linux/in.h>
#else
#include <netinet/in.h>
#endif
#define PTP_CLASS_NONE 0x00 /* not a PTP event message */
#define PTP_CLASS_V1 0x01 /* protocol version 1 */
@ -84,14 +80,6 @@
#define OP_RETA (BPF_RET | BPF_A)
#define OP_RETK (BPF_RET | BPF_K)
static inline int ptp_filter_init(struct sock_filter *f, int len)
{
if (OP_LDH == f[0].code)
return sk_chk_filter(f, len);
else
return 0;
}
#define PTP_FILTER \
{OP_LDH, 0, 0, OFF_ETYPE }, /* */ \
{OP_JEQ, 0, 12, ETH_P_IP }, /* f goto L20 */ \
@ -137,4 +125,6 @@ static inline int ptp_filter_init(struct sock_filter *f, int len)
{OP_RETA, 0, 0, 0 }, /* */ \
/*L6x*/ {OP_RETK, 0, 0, PTP_CLASS_NONE },
unsigned int ptp_classify_raw(const struct sk_buff *skb);
#endif


@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
#ifdef CONFIG_SECCOMP_FILTER
extern void put_seccomp_filter(struct task_struct *tsk);
extern void get_seccomp_filter(struct task_struct *tsk);
extern u32 seccomp_bpf_load(int off);
#else /* CONFIG_SECCOMP_FILTER */
static inline void put_seccomp_filter(struct task_struct *tsk)
{


@ -1621,33 +1621,6 @@ void sk_common_release(struct sock *sk);
/* Initialise core socket variables */
void sock_init_data(struct socket *sock, struct sock *sk);
void sk_filter_release_rcu(struct rcu_head *rcu);
/**
* sk_filter_release - release a socket filter
* @fp: filter to remove
*
* Remove a filter from a socket and release its resources.
*/
static inline void sk_filter_release(struct sk_filter *fp)
{
if (atomic_dec_and_test(&fp->refcnt))
call_rcu(&fp->rcu, sk_filter_release_rcu);
}
static inline void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
{
atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
sk_filter_release(fp);
}
static inline void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
{
atomic_inc(&fp->refcnt);
atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
}
/*
* Socket reference counting postulates.
*


@ -55,60 +55,33 @@ struct seccomp_filter {
atomic_t usage;
struct seccomp_filter *prev;
unsigned short len; /* Instruction count */
struct sock_filter insns[];
struct sock_filter_int insnsi[];
};
/* Limit any path through the tree to 256KB worth of instructions. */
#define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
/**
* get_u32 - returns a u32 offset into data
* @data: a unsigned 64 bit value
* @index: 0 or 1 to return the first or second 32-bits
*
* This inline exists to hide the length of unsigned long. If a 32-bit
* unsigned long is passed in, it will be extended and the top 32-bits will be
* 0. If it is a 64-bit unsigned long, then whatever data is resident will be
* properly returned.
*
/*
* Endianness is explicitly ignored and left for BPF program authors to manage
* as per the specific architecture.
*/
static inline u32 get_u32(u64 data, int index)
static void populate_seccomp_data(struct seccomp_data *sd)
{
return ((u32 *)&data)[index];
}
struct task_struct *task = current;
struct pt_regs *regs = task_pt_regs(task);
/* Helper for bpf_load below. */
#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
/**
* bpf_load: checks and returns a pointer to the requested offset
* @off: offset into struct seccomp_data to load from
*
* Returns the requested 32-bits of data.
* seccomp_check_filter() should assure that @off is 32-bit aligned
* and not out of bounds. Failure to do so is a BUG.
*/
u32 seccomp_bpf_load(int off)
{
struct pt_regs *regs = task_pt_regs(current);
if (off == BPF_DATA(nr))
return syscall_get_nr(current, regs);
if (off == BPF_DATA(arch))
return syscall_get_arch(current, regs);
if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
unsigned long value;
int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
int index = !!(off % sizeof(u64));
syscall_get_arguments(current, regs, arg, 1, &value);
return get_u32(value, index);
}
if (off == BPF_DATA(instruction_pointer))
return get_u32(KSTK_EIP(current), 0);
if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
return get_u32(KSTK_EIP(current), 1);
/* seccomp_check_filter should make this impossible. */
BUG();
sd->nr = syscall_get_nr(task, regs);
sd->arch = syscall_get_arch(task, regs);
/* Unroll syscall_get_args to help gcc on arm. */
syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);
syscall_get_arguments(task, regs, 1, 1, (unsigned long *) &sd->args[1]);
syscall_get_arguments(task, regs, 2, 1, (unsigned long *) &sd->args[2]);
syscall_get_arguments(task, regs, 3, 1, (unsigned long *) &sd->args[3]);
syscall_get_arguments(task, regs, 4, 1, (unsigned long *) &sd->args[4]);
syscall_get_arguments(task, regs, 5, 1, (unsigned long *) &sd->args[5]);
sd->instruction_pointer = KSTK_EIP(task);
}
/**
@ -133,17 +106,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
switch (code) {
case BPF_S_LD_W_ABS:
ftest->code = BPF_S_ANC_SECCOMP_LD_W;
ftest->code = BPF_LDX | BPF_W | BPF_ABS;
/* 32-bit aligned and not out of bounds. */
if (k >= sizeof(struct seccomp_data) || k & 3)
return -EINVAL;
continue;
case BPF_S_LD_W_LEN:
ftest->code = BPF_S_LD_IMM;
ftest->code = BPF_LD | BPF_IMM;
ftest->k = sizeof(struct seccomp_data);
continue;
case BPF_S_LDX_W_LEN:
ftest->code = BPF_S_LDX_IMM;
ftest->code = BPF_LDX | BPF_IMM;
ftest->k = sizeof(struct seccomp_data);
continue;
/* Explicitly include allowed calls. */
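
Note on the BPF_S_LD_W_ABS case above: the check rejects any load offset that is not 4-byte aligned or that starts at or beyond the end of struct seccomp_data. A minimal userspace sketch of that test follows. The struct seccomp_data_sketch type and the offset_ok() helper are hypothetical names used only for illustration; the field layout is assumed to mirror the UAPI struct seccomp_data.

```c
#include <stdint.h>

/*
 * Userspace stand-in for struct seccomp_data (assumed layout:
 * 4 + 4 + 8 + 6 * 8 = 64 bytes). Hypothetical name, not kernel code.
 */
struct seccomp_data_sketch {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/*
 * Hypothetical helper mirroring the condition in the hunk above:
 * returns 1 if k is an acceptable 32-bit load offset, 0 otherwise.
 */
static int offset_ok(uint32_t k)
{
	/* Out of bounds, or not 32-bit aligned. */
	if (k >= sizeof(struct seccomp_data_sketch) || (k & 3))
		return 0;
	return 1;
}
```

With this layout, offsets 0 and 60 pass, offset 2 fails the alignment test, and offset 64 fails the bounds test.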
@@ -185,6 +158,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
case BPF_S_JMP_JGT_X:
case BPF_S_JMP_JSET_K:
case BPF_S_JMP_JSET_X:
sk_decode_filter(ftest, ftest);
continue;
default:
return -EINVAL;
@@ -202,18 +176,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
struct seccomp_data sd;
u32 ret = SECCOMP_RET_ALLOW;
/* Ensure unexpected behavior doesn't result in failing open. */
if (WARN_ON(current->seccomp.filter == NULL))
return SECCOMP_RET_KILL;
populate_seccomp_data(&sd);
/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
for (f = current->seccomp.filter; f; f = f->prev) {
u32 cur_ret = sk_run_filter(NULL, f->insns);
u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
ret = cur_ret;
}
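
Note on the loop above: every attached filter runs, and the result whose action bits are numerically lowest wins; the low 16 data bits are masked out of the comparison. A minimal userspace sketch of that selection rule follows. The pick_result() helper is a hypothetical name, and the constants are assumed to match the seccomp UAPI header of this kernel era.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the seccomp UAPI header of this era. */
#define SECCOMP_RET_KILL   0x00000000U
#define SECCOMP_RET_TRAP   0x00030000U
#define SECCOMP_RET_ERRNO  0x00050000U
#define SECCOMP_RET_TRACE  0x7ff00000U
#define SECCOMP_RET_ALLOW  0x7fff0000U
#define SECCOMP_RET_ACTION 0x7fff0000U
#define SECCOMP_RET_DATA   0x0000ffffU

/*
 * Hypothetical helper: given the return values of all filters,
 * keep the one with the lowest action, ignoring the data bits.
 * Starts from ALLOW, as seccomp_run_filters() does.
 */
static uint32_t pick_result(const uint32_t *rets, size_t n)
{
	uint32_t ret = SECCOMP_RET_ALLOW;
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t cur = rets[i];

		/* Strict '<' means the first of two equal actions is kept. */
		if ((cur & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
			ret = cur;
	}
	return ret;
}
```

Because the comparison is strict, two filters returning SECCOMP_RET_ERRNO with different data values resolve to whichever ran first.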
@@ -231,6 +208,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
struct seccomp_filter *filter;
unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
unsigned long total_insns = fprog->len;
struct sock_filter *fp;
int new_len;
long ret;
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
@@ -252,28 +231,43 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
CAP_SYS_ADMIN) != 0)
return -EACCES;
/* Allocate a new seccomp_filter */
filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
GFP_KERNEL|__GFP_NOWARN);
if (!filter)
fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
if (!fp)
return -ENOMEM;
atomic_set(&filter->usage, 1);
filter->len = fprog->len;
/* Copy the instructions from fprog. */
ret = -EFAULT;
if (copy_from_user(filter->insns, fprog->filter, fp_size))
goto fail;
if (copy_from_user(fp, fprog->filter, fp_size))
goto free_prog;
/* Check and rewrite the fprog via the skb checker */
ret = sk_chk_filter(filter->insns, filter->len);
ret = sk_chk_filter(fp, fprog->len);
if (ret)
goto fail;
goto free_prog;
/* Check and rewrite the fprog for seccomp use */
ret = seccomp_check_filter(filter->insns, filter->len);
ret = seccomp_check_filter(fp, fprog->len);
if (ret)
goto fail;
goto free_prog;
/* Convert 'sock_filter' insns to 'sock_filter_int' insns */
ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
if (ret)
goto free_prog;
/* Allocate a new seccomp_filter */
filter = kzalloc(sizeof(struct seccomp_filter) +
sizeof(struct sock_filter_int) * new_len,
GFP_KERNEL|__GFP_NOWARN);
if (!filter)
goto free_prog;
ret = sk_convert_filter(fp, fprog->len, filter->insnsi, &new_len);
if (ret)
goto free_filter;
atomic_set(&filter->usage, 1);
filter->len = new_len;
/*
* If there is an existing filter, make it the prev and don't drop its
@@ -282,8 +276,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
filter->prev = current->seccomp.filter;
current->seccomp.filter = filter;
return 0;
fail:
free_filter:
kfree(filter);
free_prog:
kfree(fp);
return ret;
}
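
Note on seccomp_attach_filter() above: the new flow calls the converter twice, first with a NULL destination to learn the output length, then again into a buffer of that size, and unwinds through staged labels on failure. A minimal userspace sketch of that size-then-fill pattern follows. Everything here is hypothetical and for illustration only: struct old_insn, struct new_insn, convert(), attach(), and the toy rule that code 1 expands to two output instructions. It is not the kernel's sk_convert_filter(). As choices made for this sketch, it sets -ENOMEM before the second allocation and frees the staging copy on the success path.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct old_insn { unsigned short code; unsigned int k; };
struct new_insn { unsigned char code; int imm; };

/*
 * Hypothetical converter. With out == NULL it only counts the
 * output instructions; otherwise it also writes them.
 * Toy rule for this sketch: code 1 expands to two outputs.
 */
static int convert(const struct old_insn *in, int len,
		   struct new_insn *out, int *new_len)
{
	int i, c, n = 0;

	for (i = 0; i < len; i++) {
		int copies = (in[i].code == 1) ? 2 : 1;

		for (c = 0; c < copies; c++, n++) {
			if (out) {
				out[n].code = (unsigned char)in[i].code;
				out[n].imm = (int)in[i].k;
			}
		}
	}
	*new_len = n;
	return 0;
}

/* Hypothetical attach path: stage a copy, size, allocate, fill. */
static int attach(const struct old_insn *src, int len,
		  struct new_insn **result, int *result_len)
{
	struct old_insn *fp;
	struct new_insn *filter;
	int new_len, ret;

	fp = calloc((size_t)len, sizeof(*fp));
	if (!fp)
		return -ENOMEM;
	memcpy(fp, src, (size_t)len * sizeof(*fp));

	/* Pass 1: no destination, only compute new_len. */
	ret = convert(fp, len, NULL, &new_len);
	if (ret)
		goto free_prog;

	ret = -ENOMEM;
	filter = calloc((size_t)new_len, sizeof(*filter));
	if (!filter)
		goto free_prog;

	/* Pass 2: fill the correctly sized buffer. */
	ret = convert(fp, len, filter, &new_len);
	if (ret)
		goto free_filter;

	*result = filter;
	*result_len = new_len;
	free(fp);	/* the staging copy is no longer needed */
	return 0;

free_filter:
	free(filter);
free_prog:
	free(fp);
	return ret;
}
```

The caller owns the returned buffer and releases it with free().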

File diff suppressed because it is too large

@@ -52,9 +52,10 @@ EXPORT_SYMBOL_GPL(sock_diag_put_meminfo);
int sock_diag_put_filterinfo(struct user_namespace *user_ns, struct sock *sk,
struct sk_buff *skb, int attrtype)
{
struct nlattr *attr;
struct sock_fprog_kern *fprog;
struct sk_filter *filter;
unsigned int len;
struct nlattr *attr;
unsigned int flen;
int err = 0;
if (!ns_capable(user_ns, CAP_NET_ADMIN)) {
@@ -63,24 +64,20 @@ int sock_diag_put_filterinfo(struct user_namespace *user_ns, struct sock *sk,
}
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
len = filter ? filter->len * sizeof(struct sock_filter) : 0;
if (!filter)
goto out;
attr = nla_reserve(skb, attrtype, len);
fprog = filter->orig_prog;
flen = sk_filter_proglen(fprog);
attr = nla_reserve(skb, attrtype, flen);
if (attr == NULL) {
err = -EMSGSIZE;
goto out;
}
if (filter) {
struct sock_filter *fb = (struct sock_filter *)nla_data(attr);
int i;
for (i = 0; i < filter->len; i++, fb++)
sk_decode_filter(&filter->insns[i], fb);
}
memcpy(nla_data(attr), fprog->filter, flen);
out:
rcu_read_unlock();
return err;

@@ -23,16 +23,19 @@
#include <linux/skbuff.h>
#include <linux/export.h>
static struct sock_filter ptp_filter[] = {
PTP_FILTER
};
static struct sk_filter *ptp_insns __read_mostly;
unsigned int ptp_classify_raw(const struct sk_buff *skb)
{
return SK_RUN_FILTER(ptp_insns, skb);
}
EXPORT_SYMBOL_GPL(ptp_classify_raw);
static unsigned int classify(const struct sk_buff *skb)
{
if (likely(skb->dev &&
skb->dev->phydev &&
if (likely(skb->dev && skb->dev->phydev &&
skb->dev->phydev->drv))
return sk_run_filter(skb, ptp_filter);
return ptp_classify_raw(skb);
else
return PTP_CLASS_NONE;
}
@@ -60,11 +63,13 @@ void skb_clone_tx_timestamp(struct sk_buff *skb)
if (likely(phydev->drv->txtstamp)) {
if (!atomic_inc_not_zero(&sk->sk_refcnt))
return;
clone = skb_clone(skb, GFP_ATOMIC);
if (!clone) {
sock_put(sk);
return;
}
clone->sk = sk;
phydev->drv->txtstamp(phydev, clone, type);
}
@@ -89,12 +94,15 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
}
*skb_hwtstamps(skb) = *hwtstamps;
serr = SKB_EXT_ERR(skb);
memset(serr, 0, sizeof(*serr));
serr->ee.ee_errno = ENOMSG;
serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
skb->sk = NULL;
err = sock_queue_err_skb(sk, skb);
sock_put(sk);
if (err)
kfree_skb(skb);
@@ -135,5 +143,10 @@ EXPORT_SYMBOL_GPL(skb_defer_rx_timestamp);
void __init skb_timestamping_init(void)
{
BUG_ON(sk_chk_filter(ptp_filter, ARRAY_SIZE(ptp_filter)));
static struct sock_filter ptp_filter[] = { PTP_FILTER };
struct sock_fprog ptp_prog = {
.len = ARRAY_SIZE(ptp_filter), .filter = ptp_filter,
};
BUG_ON(sk_unattached_filter_create(&ptp_insns, &ptp_prog));
}