Commit graph

295 commits

Author SHA1 Message Date
Pablo Neira Ayuso ae2d708ed8 netfilter: conntrack: fix crash on timeout object removal
The object and module refcounts are updated for each conntrack template,
however, if we delete the iptables rules and we flush the timeout
database, we may end up with invalid references to timeout object that
are just gone.

Resolve this problem by setting the timeout reference to NULL when the
custom timeout entry is removed from our base. This patch requires some
RCU trickery to ensure safe pointer handling.

This handling is similar to what we already do with conntrack helpers,
the idea is to avoid bumping the timeout object reference counter from
the packet path to avoid the cost of atomic ops.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 17:04:34 +02:00
Eric W. Biederman a31f1adc09 netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple
As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.

Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:04 +02:00
David S. Miller 53cfd053e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Conflicts:
	include/net/netfilter/nf_conntrack.h

The conflict was an overlap between changing the type of the zone
argument to nf_ct_tmpl_alloc() whilst exporting nf_ct_tmpl_free.

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net, they are:

1) Oneliner to restore maps in nf_tables since we support addressing registers
   at 32 bits level.

2) Restore previous default behaviour in bridge netfilter when CONFIG_IPV6=n,
   oneliner from Bernhard Thaler.

3) Out of bound access in ipset hash:net* set types, reported by Dave Jones'
   KASan utility, patch from Jozsef Kadlecsik.

4) Fix ipset compilation with gcc 4.4.7 related to C99 initialization of
   unnamed unions, patch from Elad Raz.

5) Add a workaround to address inconsistent endianess in the res_id field of
   nfnetlink batch messages, reported by Florian Westphal.

6) Fix error paths of CT/synproxy since the conntrack template was moved to use
   kmalloc, patch from Daniel Borkmann.

All of them look good to me to reach 4.2, I can route this to -stable myself
too, just let me know what you prefer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-05 21:57:42 -07:00
Daniel Borkmann 62da98656b netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
Fengguang reported, that some randconfig generated the following linker
issue with nf_ct_zone_dflt object involved:

  [...]
  CC      init/version.o
  LD      init/built-in.o
  net/built-in.o: In function `ipv4_conntrack_defrag':
  nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
  net/built-in.o: In function `ipv6_defrag':
  nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
  make: *** [vmlinux] Error 1

Given that configurations exist where we have a built-in part, which is
accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
area when netfilter is configured in general.

Therefore, split the more generic parts into a common header under
include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
section that already holds parts related to CONFIG_NF_CONNTRACK in the
netfilter core. This fixes the issue on my side.

Fixes: 308ac9143e ("netfilter: nf_conntrack: push zone object into functions")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-02 16:32:56 -07:00
Daniel Borkmann 9cf94eab8b netfilter: conntrack: use nf_ct_tmpl_free in CT/synproxy error paths
Commit 0838aa7fcf ("netfilter: fix netns dependencies with conntrack
templates") migrated templates to the new allocator api, but forgot to
update error paths for them in CT and synproxy to use nf_ct_tmpl_free()
instead of nf_conntrack_free().

Due to that, memory is being freed into the wrong kmemcache, but also
we drop the per net reference count of ct objects causing an imbalance.

In Brad's case, this leads to a wrap-around of net->ct.count and thus
lets __nf_conntrack_alloc() refuse to create a new ct object:

  [   10.340913] xt_addrtype: ipv6 does not support BROADCAST matching
  [   10.810168] nf_conntrack: table full, dropping packet
  [   11.917416] r8169 0000:07:00.0 eth0: link up
  [   11.917438] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
  [   12.815902] nf_conntrack: table full, dropping packet
  [   15.688561] nf_conntrack: table full, dropping packet
  [   15.689365] nf_conntrack: table full, dropping packet
  [   15.690169] nf_conntrack: table full, dropping packet
  [   15.690967] nf_conntrack: table full, dropping packet
  [...]

With slab debugging, it also reports the wrong kmemcache (kmalloc-512 vs.
nf_conntrack_ffffffff81ce75c0) and reports poison overwrites, etc. Thus,
to fix the problem, export and use nf_ct_tmpl_free() instead.

Fixes: 0838aa7fcf ("netfilter: fix netns dependencies with conntrack templates")
Reported-by: Brad Jackson <bjackson0971@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-01 12:15:08 +02:00
Pablo Neira Ayuso 81bf1c64e7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflicts with conntrack template fixes.

Conflicts:
	net/netfilter/nf_conntrack_core.c
	net/netfilter/nf_synproxy_core.c
	net/netfilter/xt_CT.c

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-21 06:09:05 +02:00
Daniel Borkmann 5e8018fc61 netfilter: nf_conntrack: add efficient mark to zone mapping
This work adds the possibility of deriving the zone id from the skb->mark
field in a scalable manner. This allows for having only a single template
serving hundreds/thousands of different zones, for example, instead of the
need to have one match for each zone as an extra CT jump target.

Note that we'd need to have this information attached to the template as at
the time when we're trying to lookup a possible ct object, we already need
to know zone information for a possible match when going into
__nf_conntrack_find_get(). This work provides a minimal implementation for
a possible mapping.

In order to not add/expose an extra ct->status bit, the zone structure has
been extended to carry a flag for deriving the mark.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-18 01:24:05 +02:00
Daniel Borkmann deedb59039 netfilter: nf_conntrack: add direction support for zones
This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.

The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.

As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those, that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.

Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.

If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).

Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.

Below a minimal, simplified collision example (script in [2]) with
netperf sessions:

  +--- tenant-1 ---+   mark := 1
  |    netperf     |--+
  +----------------+  |                CT zone := mark [ORIGINAL]
   [ip,sport] := X   +--------------+  +--- gateway ---+
                     | mark routing |--|     SNAT      |-- ... +
                     +--------------+  +---------------+       |
  +--- tenant-2 ---+  |                                     ~~~|~~~
  |    netperf     |--+                +-----------+           |
  +----------------+   mark := 2       | netserver |------ ... +
   [ip,sport] := X                     +-----------+
                                        [ip,port] := Y
On the gateway netns, example:

  iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
  iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully

  iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
  iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark

conntrack dump from gateway netns:

  netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns

  tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
                           src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
                        src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
               [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1

  tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
                        src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
               [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2

Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.

I also did run various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.

  [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
  [2] https://paste.fedoraproject.org/242835/65657871/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-18 01:22:50 +02:00
Daniel Borkmann 308ac9143e netfilter: nf_conntrack: push zone object into functions
This patch replaces the zone id which is pushed down into functions
with the actual zone object. It's a bigger one-time change, but
needed for later on extending zones with a direction parameter, and
thus decoupling this additional information from all call-sites.

No functional changes in this patch.

The default zone becomes a global const object, namely nf_ct_zone_dflt
and will be returned directly in various cases, one being, when there's
f.e. no zoning support.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-11 12:29:01 +02:00
Joe Stringer f58e5aa7b8 netfilter: conntrack: Use flags in nf_ct_tmpl_alloc()
The flags were ignored for this function when it was introduced. Also
fix the style problem in kzalloc.

Fixes: 0838aa7fc (netfilter: fix netns dependencies with conntrack
templates)
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-05 10:56:43 +02:00
Pablo Neira Ayuso f0ad462189 netfilter: nf_conntrack: silence warning on falling back to vmalloc()
Since 88eab472ec ("netfilter: conntrack: adjust nf_conntrack_buckets default
value"), the hashtable can easily hit this warning. We got reports from users
that are getting this message in a quite spamming fashion, so better silence
this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2015-07-30 12:32:55 +02:00
Pablo Neira Ayuso 0838aa7fcf netfilter: fix netns dependencies with conntrack templates
Quoting Daniel Borkmann:

"When adding connection tracking template rules to a netns, f.e. to
configure netfilter zones, the kernel will endlessly busy-loop as soon
as we try to delete the given netns in case there's at least one
template present, which is problematic i.e. if there is such bravery that
the priviledged user inside the netns is assumed untrusted.

Minimal example:

  ip netns add foo
  ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
  ip netns del foo

What happens is that when nf_ct_iterate_cleanup() is being called from
nf_conntrack_cleanup_net_list() for a provided netns, we always end up
with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
don't get a soft-lockup as we still have a schedule() point, but the
serving CPU spins on 100% from that point onwards.

Since templates are normally allocated with nf_conntrack_alloc(), we
also bump net->ct.count. The issue why they are not yet nf_ct_put() is
because the per netns .exit() handler from x_tables (which would eventually
invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
called in the dependency chain at a *later* point in time than the per
netns .exit() handler for the connection tracker.

This is clearly a chicken'n'egg problem: after the connection tracker
.exit() handler, we've teared down all the connection tracking
infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
invoked at a later point in time during the netns cleanup, as that would
lead to a use-after-free. At the same time, we cannot make x_tables depend
on the connection tracker module, so that the xt_ct_tg_destroy() would
be invoked earlier in the cleanup chain."

Daniel confirms this has to do with the order in which modules are loaded or
having compiled nf_conntrack as modules while x_tables built-in. So we have no
guarantees regarding the order in which netns callbacks are executed.

Fix this by allocating the templates through kmalloc() from the respective
SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
is marked as unlikely since conntrack templates are rarely allocated and only
from the configuration plane path.

Note that templates are not kept in any list to avoid further dependencies with
nf_conntrack anymore, thus, the tmpl larval list is removed.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
2015-07-20 14:58:19 +02:00
David S. Miller 4e7a84b1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
netfilter updates for net-next

The following patchset contains netfilter updates for net-next, just a
bunch of cleanups and small enhancement to selectively flush conntracks
in ctnetlink, more specifically the patches are:

1) Rise default number of buckets in conntrack from 16384 to 65536 in
   systems with >= 4GBytes, patch from Marcelo Leitner.

2) Small refactor to save one level on indentation in xt_osf, from
   Joe Perches.

3) Remove unnecessary sizeof(char) in nf_log, from Fabian Frederick.

4) Another small cleanup to remove redundant variable in nfnetlink,
   from Duan Jiong.

5) Fix compilation warning in nfnetlink_cthelper on parisc, from
   Chen Gang.

6) Fix wrong format in debugging for ctseqadj, from Gao feng.

7) Selective conntrack flushing through the mark for ctnetlink, patch
   from Kristian Evensen.

8) Remove nf_ct_conntrack_flush_report() exported symbol now that is
   not required anymore after the selective flushing patch, again from
   Kristian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 01:50:25 -05:00
Kristian Evensen ae406bd057 netfilter: conntrack: Remove nf_ct_conntrack_flush_report
The only user of nf_ct_conntrack_flush_report() was ctnetlink_del_conntrack().
After adding support for flushing connections with a given mark, this function
is no longer called.

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-08 12:16:58 +01:00
Pablo Neira Ayuso 8ca3f5e974 netfilter: conntrack: fix race between confirmation and flush
Commit 5195c14c8b ("netfilter: conntrack: fix race in
__nf_conntrack_confirm against get_next_corpse") aimed to resolve the
race condition between the confirmation (packet path) and the flush
command (from control plane). However, it introduced a crash when
several packets race to add a new conntrack, which seems easier to
reproduce when nf_queue is in place.

Fix this race, in __nf_conntrack_confirm(), by removing the CT
from unconfirmed list before checking the DYING bit. In case
race occured, re-add the CT to the dying list

This patch also changes the verdict from NF_ACCEPT to NF_DROP when
we lose race. Basically, the confirmation happens for the first packet
that we see in a flow. If you just invoked conntrack -F once (which
should be the common case), then this is likely to be the first packet
of the flow (unless you already called flush anytime soon in the past).
This should be hard to trigger, but better drop this packet, otherwise
we leave things in inconsistent state since the destination will likely
reply to this packet, but it will find no conntrack, unless the origin
retransmits.

The change of the verdict has been discussed in:
https://www.marc.info/?l=linux-netdev&m=141588039530056&w=2

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-06 22:27:45 +01:00
Marcelo Leitner 88eab472ec netfilter: conntrack: adjust nf_conntrack_buckets default value
Manually bumping either nf_conntrack_buckets or nf_conntrack_max has
become a common task as our Linux servers tend to serve more and more
clients/applications, so let's adjust nf_conntrack_buckets this to a
more updated value.

Now for systems with more than 4GB of memory, nf_conntrack_buckets
becomes 65536 instead of 16384, resulting in nf_conntrack_max=256k
entries.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-23 14:20:10 +01:00
David S. Miller 244ebd9f8f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains netfilter updates for net-next. Basically,
enhancements for xt_recent, skip zeroing of timer in conntrack, fix
linking problem with recent redirect support for nf_tables, ipset
updates and a couple of cleanups. More specifically, they are:

1) Rise maximum number per IP address to be remembered in xt_recent
   while retaining backward compatibility, from Florian Westphal.

2) Skip zeroing timer area in nf_conn objects, also from Florian.

3) Inspect IPv4 and IPv6 traffic from the bridge to allow filtering using
   using meta l4proto and transport layer header, from Alvaro Neira.

4) Fix linking problems in the new redirect support when CONFIG_IPV6=n
   and IP6_NF_IPTABLES=n.

And ipset updates from Jozsef Kadlecsik:

5) Support updating element extensions when the set is full (fixes
   netfilter bugzilla id 880).

6) Fix set match with 32-bits userspace / 64-bits kernel.

7) Indicate explicitly when /0 networks are supported in ipset.

8) Simplify cidr handling for hash:*net* types.

9) Allocate the proper size of memory when /0 networks are supported.

10) Explicitly add padding elements to hash:net,net and hash:net,port,
    because the elements must be u32 sized for the used hash function.

Jozsef is also cooking ipset RCU conversion which should land soon if
they reach the merge window in time.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-05 20:56:46 -08:00
Florian Westphal c41884ce05 netfilter: conntrack: avoid zeroing timer
add a __nfct_init_offset annotation member to struct nf_conn to make
it clear which members are covered by the memset when the conntrack
is allocated.

This avoids zeroing timer_list and ct_net; both are already inited
explicitly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-11-27 12:41:06 +01:00
Pablo Neira 43612d7c04 Revert "netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse"
This reverts commit 5195c14c8b.

If the conntrack clashes with an existing one, it is left out of
the unconfirmed list, thus, crashing when dropping the packet and
releasing the conntrack since golden rule is that conntracks are
always placed in any of the existing lists for traceability reasons.

Reported-by: Daniel Borkmann <dborkman@redhat.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=88841
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-25 14:14:51 -05:00
bill bonaparte 5195c14c8b netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse
After removal of the central spinlock nf_conntrack_lock, in
commit 93bb0ceb75 ("netfilter: conntrack: remove central
spinlock nf_conntrack_lock"), it is possible to race against
get_next_corpse().

The race is against the get_next_corpse() cleanup on
the "unconfirmed" list (a per-cpu list with seperate locking),
which set the DYING bit.

Fix this race, in __nf_conntrack_confirm(), by removing the CT
from unconfirmed list before checking the DYING bit.  In case
race occured, re-add the CT to the dying list.

While at this, fix coding style of the comment that has been
updated.

Fixes: 93bb0ceb75 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Reported-by: bill bonaparte <programme110@gmail.com>
Signed-off-by: bill bonaparte <programme110@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-11-14 17:43:05 +01:00
Daniel Borkmann 8fc54f6891 net: use reciprocal_scale() helper
Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 12:21:21 -07:00
Eric Dumazet d2de875c6d net: use ktime_get_ns() and ktime_get_real_ns() helpers
ktime_get_ns() replaces ktime_to_ns(ktime_get())

ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 19:57:23 -07:00
Florian Westphal 9500507c61 netfilter: conntrack: remove timer from ecache extension
This brings the (per-conntrack) ecache extension back to 24 bytes in size
(was 152 byte on x86_64 with lockdep on).

When event delivery fails, re-delivery is attempted via work queue.

Redelivery is attempted at least every 0.1 seconds, but can happen
more frequently if userspace is not congested.

The nf_ct_release_dying_list() function is removed.
With this patch, ownership of the to-be-redelivered conntracks
(on-dying-list-with-DYING-bit not yet set) is with the work queue,
which will release the references once event is out.

Joint work with Pablo Neira Ayuso.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-06-25 19:15:38 +02:00
Peter Zijlstra 4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Andrey Vagin ee214d54bf netfilter: nf_conntrack: initialize net.ct.generation
[  251.920788] INFO: trying to register non-static key.
[  251.921386] the code is fine but needs lockdep annotation.
[  251.921386] turning off the locking correctness validator.
[  251.921386] CPU: 2 PID: 15715 Comm: socket_listen Not tainted 3.14.0+ #294
[  251.921386] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  251.921386]  0000000000000000 000000009d18c210 ffff880075f039b8 ffffffff816b7ecd
[  251.921386]  ffffffff822c3b10 ffff880075f039c8 ffffffff816b36f4 ffff880075f03aa0
[  251.921386]  ffffffff810c65ff ffffffff810c4a85 00000000fffffe01 ffffffffa0075172
[  251.921386] Call Trace:
[  251.921386]  [<ffffffff816b7ecd>] dump_stack+0x45/0x56
[  251.921386]  [<ffffffff816b36f4>] register_lock_class.part.24+0x38/0x3c
[  251.921386]  [<ffffffff810c65ff>] __lock_acquire+0x168f/0x1b40
[  251.921386]  [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
[  251.921386]  [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
[  251.921386]  [<ffffffff816c1215>] ? _raw_spin_unlock_bh+0x35/0x40
[  251.921386]  [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
[  251.921386]  [<ffffffff810c7272>] lock_acquire+0xa2/0x120
[  251.921386]  [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[  251.921386]  [<ffffffffa0055989>] __nf_conntrack_confirm+0x129/0x410 [nf_conntrack]
[  251.921386]  [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[  251.921386]  [<ffffffffa008ab90>] ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[  251.921386]  [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[  251.921386]  [<ffffffff815d8c5a>] nf_iterate+0xaa/0xc0
[  251.921386]  [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[  251.921386]  [<ffffffff815d8d14>] nf_hook_slow+0xa4/0x190
[  251.921386]  [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[  251.921386]  [<ffffffff815e98f2>] ip_output+0x92/0x100
[  251.921386]  [<ffffffff815e8df9>] ip_local_out+0x29/0x90
[  251.921386]  [<ffffffff815e9240>] ip_queue_xmit+0x170/0x4c0
[  251.921386]  [<ffffffff815e90d5>] ? ip_queue_xmit+0x5/0x4c0
[  251.921386]  [<ffffffff81601208>] tcp_transmit_skb+0x498/0x960
[  251.921386]  [<ffffffff81602d82>] tcp_connect+0x812/0x960
[  251.921386]  [<ffffffff810e3dc5>] ? ktime_get_real+0x25/0x70
[  251.921386]  [<ffffffff8159ea2a>] ? secure_tcp_sequence_number+0x6a/0xc0
[  251.921386]  [<ffffffff81606f57>] tcp_v4_connect+0x317/0x470
[  251.921386]  [<ffffffff8161f645>] __inet_stream_connect+0xb5/0x330
[  251.921386]  [<ffffffff8158dfc3>] ? lock_sock_nested+0x33/0xa0
[  251.921386]  [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
[  251.921386]  [<ffffffff81078885>] ? __local_bh_enable_ip+0x75/0xe0
[  251.921386]  [<ffffffff8161f8f8>] inet_stream_connect+0x38/0x50
[  251.921386]  [<ffffffff8158b157>] SYSC_connect+0xe7/0x120
[  251.921386]  [<ffffffff810e3789>] ? current_kernel_time+0x69/0xd0
[  251.921386]  [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
[  251.921386]  [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
[  251.921386]  [<ffffffff8158c36e>] SyS_connect+0xe/0x10
[  251.921386]  [<ffffffff816caf69>] system_call_fastpath+0x16/0x1b
[  312.014104] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=60003 jiffies, g=42359, c=42358, q=333)
[  312.015097] INFO: Stall ended before state dump start

Fixes: 93bb0ceb75 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-04-14 10:35:28 +02:00
Eric Dumazet d5d20912d3 netfilter: conntrack: Fix UP builds
ARRAY_SIZE(nf_conntrack_locks) is undefined if spinlock_t is an
empty structure. Replace it by CONNTRACK_LOCKS

Fixes: 93bb0ceb75 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-17 17:14:45 -04:00
Jesper Dangaard Brouer 93bb0ceb75 netfilter: conntrack: remove central spinlock nf_conntrack_lock
nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more core/threads).

Perf locking congestion is clear on base kernel:

-  72.56%  ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 25.33% init_conntrack
      + 24.86% nf_ct_delete_from_lists
      + 24.62% __nf_conntrack_confirm
      + 24.38% destroy_conntrack
      + 0.70% tcp_packet
+   2.21%  ksoftirqd/6  [kernel.kallsyms]    [k] fib_table_lookup
+   1.15%  ksoftirqd/6  [kernel.kallsyms]    [k] __slab_free
+   0.77%  ksoftirqd/6  [kernel.kallsyms]    [k] inet_getpeer
+   0.70%  ksoftirqd/6  [nf_conntrack]       [k] nf_ct_delete
+   0.55%  ksoftirqd/6  [ip_tables]          [k] ipt_do_table

This patch change conntrack locking and provides a huge performance
improvement.  SYN-flood attack tested on a 24-core E5-2695v2(ES) with
10Gbit/s ixgbe (with tool trafgen):

 Base kernel:   810.405 new conntrack/sec
 After patch: 2.233.876 new conntrack/sec

Notice other floods attack (SYN+ACK or ACK) can easily be deflected using:
 # iptables -A INPUT -m state --state INVALID -j DROP
 # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

Use an array of hashed spinlocks to protect insertions/deletions of
conntracks into the hash table. 1024 spinlocks seem to give good
results, at minimal cost (4KB memory). Due to lockdep max depth,
1024 becomes 8 if CONFIG_LOCKDEP=y

The hash resize is a bit tricky, because we need to take all locks in
the array. A seqcount_t is used to synchronize the hash table users
with the resizing process.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:41:13 +01:00
Jesper Dangaard Brouer ca7433df3a netfilter: conntrack: seperate expect locking from nf_conntrack_lock
Netfilter expectations are protected with the same lock as conntrack
entries (nf_conntrack_lock).  This patch split out expectations locking
to use it's own lock (nf_conntrack_expect_lock).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:41:01 +01:00
Jesper Dangaard Brouer e1b207dac1 netfilter: avoid race with exp->master ct
Preparation for disconnecting the nf_conntrack_lock from the
expectations code.  Once the nf_conntrack_lock is lifted, a race
condition is exposed.

The expectations master conntrack exp->master, can race with
delete operations, as the refcnt increment happens too late in
init_conntrack().  Race is against other CPUs invoking
->destroy() (destroy_conntrack()), or nf_ct_delete() (via timeout
or early_drop()).

Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
and checking if nf_ct_is_dying() (path via nf_ct_delete()).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:40:47 +01:00
Jesper Dangaard Brouer b7779d06f9 netfilter: conntrack: spinlock per cpu to protect special lists.
One spinlock per cpu to protect dying/unconfirmed/template special lists.
(These lists are now per cpu, a bit like the untracked ct)
Add a @cpu field to nf_conn, to make sure we hold the appropriate
spinlock at removal time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:40:38 +01:00
Jesper Dangaard Brouer b476b72a0f netfilter: trivial code cleanup and doc changes
Changes while reading through the netfilter code.

Added hint about how conntrack nf_conn refcnt is accessed.
And renamed repl_hash to reply_hash for readability

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:40:04 +01:00
Pablo Neira Ayuso e53376bef2 netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.

CPU1                                    CPU2
____nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
 if (!nf_ct_key_equal(h, tuple, zone))
  nf_ct_put(ct); (use = 0)
   destroy_conntrack()
                                        /* continue to work with CT */

After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G         C ---------------    2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>]  [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255]  [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255]  [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255]  [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255]  [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255]  [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255]  [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255]  [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255]  [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255]  [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255]  [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255]  [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255]  [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255]  [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255]  [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255]  [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255]  [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255]  [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
2014-02-05 17:46:06 +01:00
Andrey Vagin c6825c0976 netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Lets look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()

I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-02-05 13:16:18 +01:00
stephen hemminger dcd93ed4cd netfilter: nf_conntrack: remove dead code
The following code is not used in current upstream code.
Some of this seems to be old hooks, other might be used by some
out of tree module (which I don't care about breaking), and
the need_ipv4_conntrack was used by old NAT code but no longer
called.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-03 23:41:37 +01:00
Pablo Neira Ayuso 0c3c6c00c6 netfilter: nf_conntrack: decrement global counter after object release
nf_conntrack_free() decrements our counter (net->ct.count)
before releasing the conntrack object. That counter is used in the
nf_conntrack_cleanup_net_list path to check if it's time to
kmem_cache_destroy our cache of conntrack objects. I think we have
a race there that should be easier to trigger (although still hard)
with CONFIG_DEBUG_OBJECTS_FREE as object releases become slowier
according to the following splat:

[ 1136.321305] WARNING: CPU: 2 PID: 2483 at lib/debugobjects.c:260
debug_print_object+0x83/0xa0()
[ 1136.321311] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x20
...
[ 1136.321390] Call Trace:
[ 1136.321398]  [<ffffffff8160d4a2>] dump_stack+0x45/0x56
[ 1136.321405]  [<ffffffff810514e8>] warn_slowpath_common+0x78/0xa0
[ 1136.321410]  [<ffffffff81051557>] warn_slowpath_fmt+0x47/0x50
[ 1136.321414]  [<ffffffff812f8883>] debug_print_object+0x83/0xa0
[ 1136.321420]  [<ffffffff8106aa90>] ? execute_in_process_context+0x90/0x90
[ 1136.321424]  [<ffffffff812f99fb>] debug_check_no_obj_freed+0x20b/0x250
[ 1136.321429]  [<ffffffff8112e7f2>] ? kmem_cache_destroy+0x92/0x100
[ 1136.321433]  [<ffffffff8115d945>] kmem_cache_free+0x125/0x210
[ 1136.321436]  [<ffffffff8112e7f2>] kmem_cache_destroy+0x92/0x100
[ 1136.321443]  [<ffffffffa046b806>] nf_conntrack_cleanup_net_list+0x126/0x160 [nf_conntrack]
[ 1136.321449]  [<ffffffffa046c43d>] nf_conntrack_pernet_exit+0x6d/0x80 [nf_conntrack]
[ 1136.321453]  [<ffffffff81511cc3>] ops_exit_list.isra.3+0x53/0x60
[ 1136.321457]  [<ffffffff815124f0>] cleanup_net+0x100/0x1b0
[ 1136.321460]  [<ffffffff8106b31e>] process_one_work+0x18e/0x430
[ 1136.321463]  [<ffffffff8106bf49>] worker_thread+0x119/0x390
[ 1136.321467]  [<ffffffff8106be30>] ? manage_workers.isra.23+0x2a0/0x2a0
[ 1136.321470]  [<ffffffff8107210b>] kthread+0xbb/0xc0
[ 1136.321472]  [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321477]  [<ffffffff8161b8fc>] ret_from_fork+0x7c/0xb0
[ 1136.321479]  [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321481] ---[ end trace 25f53c192da70825 ]---

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-18 14:07:19 +01:00
Holger Eitzenberger f7b13e4330 netfilter: introduce nf_conn_acct structure
Encapsulate counters for both directions into nf_conn_acct. During
that process also consistently name pointers to the extend 'acct',
not 'counters'. This patch is a cleanup.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-11-03 21:48:49 +01:00
Patrick McHardy 48b1de4c11 netfilter: add SYNPROXY core/target
Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.

The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.

It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.

Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to now break PAWS, the timestamps are translated in
the direction server->client.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28 00:27:54 +02:00
Patrick McHardy 41d73ec053 netfilter: nf_conntrack: make sequence number adjustments usuable without NAT
Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a seperate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.

As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28 00:26:48 +02:00
Florian Westphal c655bc6896 netfilter: nf_conntrack: don't send destroy events from iterator
Let nf_ct_delete handle delivery of the DESTROY event.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-09 12:03:33 +02:00
Patrick McHardy 2d89c68ac7 netfilter: nf_nat: change sequence number adjustments to 32 bits
Using 16 bits is too small, when many adjustments happen the offsets might
overflow and break the connection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-07-31 19:54:51 +02:00
Florian Westphal 02982c27ba netfilter: nf_conntrack: remove duplicate code in ctnetlink
ctnetlink contains copy-paste code from death_by_timeout.  In order to
avoid changing both places in upcoming event delivery patch,
export death_by_timeout functionality and use it in the ctnetlink code.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-07-31 18:51:23 +02:00
Patrick McHardy 312a0c16c1 netfilter: nf_conntrack: constify sk_buff argument to nf_ct_attach()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-07-31 16:37:38 +02:00
Linus Torvalds 73287a43cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights (1721 non-merge commits, this has to be a record of some
  sort):

   1) Add 'random' mode to team driver, from Jiri Pirko and Eric
      Dumazet.

   2) Make it so that any driver that supports configuration of multiple
      MAC addresses can provide the forwarding database add and del
      calls by providing a default implementation and hooking that up if
      the driver doesn't have an explicit set of handlers.  From Vlad
      Yasevich.

   3) Support GSO segmentation over tunnels and other encapsulating
      devices such as VXLAN, from Pravin B Shelar.

   4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.

   5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
      Dukkipati.

   6) In the PHY layer, allow supporting wake-on-lan in situations where
      the PHY registers have to be written for it to be configured.

      Use it to support wake-on-lan in mv643xx_eth.

      From Michael Stapelberg.

   7) Significantly improve firewire IPV6 support, from YOSHIFUJI
      Hideaki.

   8) Allow multiple packets to be sent in a single transmission using
      network coding in batman-adv, from Martin Hundebøll.

   9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

  10) Generalize the VXLAN forwarding tables so that there is more
      flexibility in configurating various aspects of the endpoints.
      From David Stevens.

  11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
      from Dmitry Kravkov.

  12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
      Neira Ayuso.

  13) Start adding networking selftests.

  14) In situations of overload on the same AF_PACKET fanout socket, or
      per-cpu packet receive queue, minimize drop by distributing the
      load to other cpus/fanouts.  From Willem de Bruijn and Eric
      Dumazet.

  15) Add support for new payload offset BPF instruction, from Daniel
      Borkmann.

  16) Convert several drivers over to mdoule_platform_driver(), from
      Sachin Kamat.

  17) Provide a minimal BPF JIT image disassembler userspace tool, from
      Daniel Borkmann.

  18) Rewrite F-RTO implementation in TCP to match the final
      specification of it in RFC4138 and RFC5682.  From Yuchung Cheng.

  19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
      you like netlink, so I implemented netlink dumping of netlink
      sockets.") From Andrey Vagin.

  20) Remove ugly passing of rtnetlink attributes into rtnl_doit
      functions, from Thomas Graf.

  21) Allow userspace to be able to see if a configuration change occurs
      in the middle of an address or device list dump, from Nicolas
      Dichtel.

  22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
      Frederic Sowa.

  23) Increase accuracy of packet length used by packet scheduler, from
      Jason Wang.

  24) Beginning set of changes to make ipv4/ipv6 fragment handling more
      scalable and less susceptible to overload and locking contention,
      from Jesper Dangaard Brouer.

  25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
      instead.  From Hong Zhiguo.

  26) Optimize route usage in IPVS by avoiding reference counting where
      possible, from Julian Anastasov.

  27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

  28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
      Eitzenberger.

  29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
      nfnetlink_log, and nfnetlink_queue.  From Gao feng.

  30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

  31) Support several new r8169 chips, from Hayes Wang.

  32) Support tokenized interface identifiers in ipv6, from Daniel
      Borkmann.

  33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.

  34) Add 802.1ad vlan offload support, from Patrick McHardy.

  35) Support mmap() based netlink communication, also from Patrick
      McHardy.

  36) Support HW timestamping in mlx4 driver, from Amir Vadai.

  37) Rationalize AF_PACKET packet timestamping when transmitting, from
      Willem de Bruijn and Daniel Borkmann.

  38) Bring parity to what's provided by /proc/net/packet socket dumping
      and the info provided by netlink socket dumping of AF_PACKET
      sockets.  From Nicolas Dichtel.

  39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
      Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...
2013-05-01 14:08:52 -07:00
Akinobu Mita ca3d41a588 net/netfilter: rename random32() to prandom_u32()
Use preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:43 -07:00
David S. Miller 95a06161e6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
The following patchset contains a small batch of Netfilter
updates for your net-next tree, they are:

* Three patches that provide more accurate error reporting to
  user-space, instead of -EPERM, in IPv4/IPv6 netfilter re-routing
  code and NAT, from Patrick McHardy.

* Update copyright statements in Netfilter filters of
  Patrick McHardy, from himself.

* Add Kconfig dependency on the raw/mangle tables to the
  rpfilter, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 17:55:29 -04:00
Patrick McHardy ec464e5dc5 netfilter: rename netlink related "pid" variables to "portid"
Get rid of the confusing mix of pid and portid and use portid consistently
for all netlink related socket identities.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:58:36 -04:00
Patrick McHardy f229f6ce48 netfilter: add my copyright statements
Add copyright statements to all netfilter files which have had significant
changes done by myself in the past.

Some notes:

- nf_conntrack_ecache.c was incorrectly attributed to Rusty and Netfilter
  Core Team when it got split out of nf_conntrack_core.c. The copyrights
  even state a date which lies six years before it was written. It was
  written in 2005 by Harald and myself.

- net/ipv{4,6}/netfilter.c, net/netfitler/nf_queue.c were missing copyright
  statements. I've added the copyright statement from net/netfilter/core.c,
  where this code originated

- for nf_conntrack_proto_tcp.c I've also added Jozsef, since I didn't want
  it to give the wrong impression

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-18 20:27:55 +02:00
Vladimir Davydov dece40e848 netfilter: nf_conntrack: speed up module removal path if netns in use
The patch introduces nf_conntrack_cleanup_net_list(), which cleanups
nf_conntrack for a list of netns and calls synchronize_net() only once
for them all. This should reduce netns destruction time.

I've measured cleanup time for 1k dummy net ns. Here are the results:

 <without the patch>
 # modprobe nf_conntrack
 # time modprobe -r nf_conntrack

 real	0m10.337s
 user	0m0.000s
 sys	0m0.376s

 <with the patch>
 # modprobe nf_conntrack
 # time modprobe -r nf_conntrack

 real    0m5.661s
 user    0m0.000s
 sys     0m0.216s

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-03-19 17:08:31 +01:00
stephen hemminger 493763684f netfilter: nf_conntrack: add include to fix sparse warning
Include header file to pickup prototype of nf_nat_seq_adjust_hook

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-03-19 17:05:53 +01:00
Gao feng 04d8700179 netfilter: nf_ct_proto: move initialization out of pernet_operations
Move the global initial codes to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-01-23 12:56:33 +01:00