1
0
Fork 0
Commit Graph

433 Commits (a6ff1a2f1e91578860b37df9fd861ef7af207de4)

Author SHA1 Message Date
David S. Miller 4895c771c7 ipv4: Add FIB nexthop exceptions.
In a regime where we have subnetted route entries, we need a way to
store persistent storage about destination specific learned values
such as redirects and PMTU values.

This is implemented here via nexthop exceptions.

The initial implementation is a 2048 entry hash table with relaiming
starting at chain length 5.  A more sophisticated scheme can be
devised if that proves necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17 08:48:50 -07:00
David S. Miller 6700c2709c net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()
This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more.  In the
future the routes will be without destination address, source address,
etc. keying.  One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information.  This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here.  Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17 03:29:28 -07:00
David S. Miller 85b91b0339 ipv4: Don't store a rule pointer in fib_result.
We only use it to fetch the rule's tclassid, so just store the
tclassid there instead.

This also decreases the size of fib_result by a full 8 bytes on
64-bit.  On 32-bits it's a wash.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-13 08:21:29 -07:00
David S. Miller 99ee038d41 ipv4: Fix warnings in ip_do_redirect() for some configurations.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12 07:40:05 -07:00
David S. Miller b587ee3ba2 net: Add dummy dst_ops->redirect method where needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12 00:39:24 -07:00
David S. Miller 1f42539d25 ipv4: Kill ip_rt_redirect().
No longer needed, as the protocol handlers now all properly
propagate the redirect back into the routing code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 21:30:08 -07:00
David S. Miller b42597e2f3 ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 21:25:45 -07:00
David S. Miller e47a185b31 ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect.
All of the redirect acceptance policy is now contained within.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 20:55:47 -07:00
David S. Miller 94206125c4 ipv4: Rearrange arguments to ip_rt_redirect()
Pass in the SKB rather than just the IP addresses, so that policy
and other aspects can reside in ip_rt_redirect() rather then
icmp_redirect().

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 20:38:08 -07:00
David S. Miller d0da720f9f ipv4: Pull redirect instantiation out into a helper function.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 20:27:54 -07:00
David S. Miller f185071ddf ipv4: Remove inetpeer from routes.
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:18 -07:00
David S. Miller 312487313d ipv4: Calling ->cow_metrics() now is a bug.
Nothing every writes to ipv4 metrics any longer.

PMTU is stored in rt->rt_pmtu.

Dynamic TCP metrics are stored in a special TCP metrics cache,
completely outside of the routes.

Therefore ->cow_metrics() can simply nothing more than a WARN_ON
trigger so we can catch anyone who tries to add new writes to
ipv4 route metrics.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:17 -07:00
David S. Miller 2db2d67e4c ipv4: Kill dst_copy_metrics() call from ipv4_blackhole_route().
Blackhole routes have a COW metrics operation that returns NULL
always, therefore this dst_copy_metrics() call did absolutely
nothing.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:16 -07:00
David S. Miller 710ab6c031 ipv4: Enforce max MTU metric at route insertion time.
Rather than at every struct rtable creation.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:15 -07:00
David S. Miller 5943634fc5 ipv4: Maintain redirect and PMTU info in struct rtable again.
Maintaining this in the inetpeer entries was not the right way to do
this at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:14 -07:00
David S. Miller 87a50699cb rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo().
Nobody provides non-zero values any longer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:13 -07:00
David S. Miller 3e12939a2a inet: Kill FLOWI_FLAG_PRECOW_METRICS.
No longer needed.  TCP writes metrics, but now in it's own special
cache that does not dirty the route metrics.  Therefore there is no
longer any reason to pre-cow metrics in this way.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:12 -07:00
David S. Miller 1d861aa4b3 inet: Minimize use of cached route inetpeer.
Only use it in the absolutely required cases:

1) COW'ing metrics

2) ipv4 PMTU

3) ipv4 redirects

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:11 -07:00
David S. Miller 81166dd6fa tcp: Move timestamps from inetpeer to metrics cache.
With help from Lin Ming.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:08 -07:00
David S. Miller 794785bf12 net: Don't report route RTT metric value in cache dumps.
We don't maintain it dynamically any longer, so reporting it would
be extremely misleading.  Report zero instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 22:40:06 -07:00
David S. Miller f187bc6efb ipv4: No need to set generic neighbour pointer.
Nobody reads it any longer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 02:41:59 -07:00
David S. Miller f894cbf847 net: Add optional SKB arg to dst_ops->neigh_lookup().
Causes the handler to use the daddr in the ipv4/ipv6 header when
the route gateway is unspecified (local subnet).

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 01:04:01 -07:00
David S. Miller 3c521f2ba9 ipv4: Don't report neigh uptodate state in rtcache procfs.
Soon routes will not have a cached neigh attached, nor will we
be able to necessarily go directly to a neigh from an arbitrary
route.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 01:02:31 -07:00
David S. Miller a263b30936 ipv4: Make neigh lookups directly in output packet path.
Do not use the dst cached neigh, we'll be getting rid of that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 01:02:12 -07:00
David S. Miller 3085a4b7d3 ipv4: Remove extraneous assignment of dst->tclassid.
We already set it several lines above.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 22:17:39 -07:00
David S. Miller 9e56e3800e ipv4: Adjust in_dev handling in fib_validate_source()
Checking for in_dev being NULL is pointless.

In fact, all of our callers have in_dev precomputed already,
so just pass it in and remove the NULL checking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 18:54:02 -07:00
David S. Miller 41347dcdd8 ipv4: Kill rt->rt_spec_dst, no longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-28 04:05:27 -07:00
David S. Miller c10237e077 Revert "ipv4: tcp: dont cache unconfirmed intput dst"
This reverts commit c074da2810.

This change has several unwanted side effects:

1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
   thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
   cache at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 17:05:06 -07:00
Eric Dumazet c074da2810 ipv4: tcp: dont cache unconfirmed intput dst
DDOS synflood attacks hit badly IP route cache.

On typical machines, this cache is allowed to hold up to 8 Millions dst
entries, 256 bytes for each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to cleanup things.

Eventually route cache is disabled but machine is under fire and might
OOM and crash.

This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.

This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE)

SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-27 15:34:24 -07:00
David S. Miller 251da41301 ipv4: Cache ip_error() routes even when not forwarding.
And account for the fact that, when we are not forwarding, we should
bump statistic counters rather than emit an ICMP response.

RP-filter rejected lookups are still not cached.

Since -EHOSTUNREACH and -ENETUNREACH can now no longer be seen in
ip_rcv_finish(), remove those checks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26 16:27:09 -07:00
David S. Miller df67e6c9a6 ipv4: Remove unnecessary code from rt_check_expire().
IPv4 routing cache entries no longer use dst->expires, because the
metrics, PMTU, and redirect information are stored in the inetpeer
cache.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26 00:10:09 -07:00
Eric Dumazet 7586eceb0a ipv4: tcp: dont cache output dst for syncookies
Don't cache output dst for syncookies, as this adds pressure on IP route
cache and rcu subsystem for no gain.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-22 21:47:33 -07:00
David S. Miller 6fac262526 ipv4: Cap ADVMSS metric in the FIB rather than the routing cache.
It makes no sense to execute this limit test every time we create a
routing cache entry.

We can't simply error out on these things since we've silently
accepted and truncated them forever.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-17 19:47:34 -07:00
David S. Miller 3639339553 ipv4: Handle PMTU in all ICMP error handlers.
With ip_rt_frag_needed() removed, we have to explicitly update PMTU
information in every ICMP error handler.

Create two helper functions to facilitate this.

1) ipv4_sk_update_pmtu()

   This updates the PMTU when we have a socket context to
   work with.

2) ipv4_update_pmtu()

   Raw version, used when no socket context is available.  For this
   interface, we essentially just pass in explicit arguments for
   the flow identity information we would have extracted from the
   socket.

   And you'll notice that ipv4_sk_update_pmtu() is simply implemented
   in terms of ipv4_update_pmtu()

Note that __ip_route_output_key() is used, rather than something like
ip_route_output_flow() or ip_route_output_key().  This is because we
absolutely do not want to end up with a route that does IPSEC
encapsulation and the like.  Instead, we only want the route that
would get us to the node described by the outermost IP header.

Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-14 22:22:07 -07:00
Thomas Graf d0daebc3d6 ipv4: Add interface option to enable routing of 127.0.0.0/8
Routing of 127/8 is tradtionally forbidden, we consider
packets from that address block martian when routing and do
not process corresponding ARP requests.

This is a sane default but renders a huge address space
practically unuseable.

The RFC states that no address within the 127/8 block should
ever appear on any network anywhere but it does not forbid
the use of such addresses outside of the loopback device in
particular. For example to address a pool of virtual guests
behind a load balancer.

This patch adds a new interface option 'route_localnet'
enabling routing of the 127/8 address block and processing
of ARP requests on a specific interface.

Note that for the feature to work, the default local route
covering 127/8 dev lo needs to be removed.

Example:
  $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
  $ ip route del 127.0.0.0/8 dev lo table local
  $ ip addr add 127.1.0.1/16 dev eth0
  $ ip route flush cache

V2: Fix invalid check to auto flush cache (thanks davem)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-12 15:25:46 -07:00
David S. Miller 7b34ca2ac7 inet: Avoid potential NULL peer dereference.
We handle NULL in rt{,6}_set_peer but then our caller will try to pass
that NULL pointer into inet_putpeer() which isn't ready for it.

Fix this by moving the NULL check one level up, and then remove the
now unnecessary NULL check from inetpeer_ptr_set_peer().

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 04:13:57 -07:00
David S. Miller 8b96d22d7a inet: Use FIB table peer roots in routes.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:10:54 -07:00
David S. Miller b48c80ece9 inet: Add family scope inetpeer flushes.
This implementation can deal with having many inetpeer roots, which is
a necessary prerequisite for per-FIB table rooted peer tables.

Each family (AF_INET, AF_INET6) has a sequence number which we bump
when we get a family invalidation request.

Each peer lookup cheaply checks whether the flush sequence of the
root we are using is out of date, and if so flushes it and updates
the sequence number.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:09:10 -07:00
David S. Miller 46517008e1 ipv4: Kill ip_rt_frag_needed().
There is zero point to this function.

It's only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove.  If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:08:59 -07:00
David S. Miller 97bab73f98 inet: Hide route peer accesses behind helpers.
We encode the pointer(s) into an unsigned long with one state bit.

The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.

Later the peer roots will be per-FIB table, and this change works to
facilitate that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:08:47 -07:00
David S. Miller c0efc887dc inet: Pass inetpeer root into inet_getpeer*() interfaces.
Otherwise we reference potentially non-existing members when
ipv6 is disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 19:12:36 -07:00
David S. Miller 56a6b248eb inet: Consolidate inetpeer_invalidate_tree() interfaces.
We only need one interface for this operation, since we always know
which inetpeer root we want to flush.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 16:32:41 -07:00
David S. Miller c3426b4719 inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.c
Instead of net/ipv4/inetpeer.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 16:27:05 -07:00
David S. Miller fbfe95a42e inet: Create and use rt{,6}_get_peer_create().
There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 23:24:18 -07:00
Gao feng 54db0cc2ba inetpeer: add parameter net for inet_getpeer_v4,v6
add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.

and modify some places to provide net for inet_getpeer_v[4,6]

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Gao feng c8a627ed06 inetpeer: add namespace support for inetpeer
now inetpeer doesn't support namespace,the information will
be leaking across namespace.

this patch move the global vars v4_peers and v6_peers to
netns_ipv4 and netns_ipv6 as a field peers.

add struct pernet_operations inetpeer_ops to initial pernet
inetpeer data.

and change family_to_base and inet_getpeer to support namespace.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Tim Bird 31fe62b958 mm: add a low limit to alloc_large_system_hash
UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-24 00:28:21 -04:00
Eldad Zack 413c27d869 net/ipv4: replace simple_strtoul with kstrtoul
Replace simple_strtoul with kstrtoul in three similar occurrences, all setup
handlers:
* route.c: set_rhash_entries
* tcp.c: set_thash_entries
* udp.c: set_uhash_entries

Also check if the conversion failed.

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-20 04:06:17 -04:00
Joe Perches 91df42bedc net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug
Use the current debugging style and enable dynamic_debug.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16 01:01:03 -04:00
Joe Perches e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00