Commit graph

1017 commits

Author SHA1 Message Date
Jason Wang 16c0b164bd act_mirred: do not drop packets when fails to mirror it
We drop packet unconditionally when we fail to mirror it. This is not intended
in some cases. Consdier for kvm guest, we may mirror the traffic of the bridge
to a tap device used by a VM. When kernel fails to mirror the packet in
conditions such as when qemu crashes or stop polling the tap, it's hard for the
management software to detect such condition and clean the the mirroring
before. This would lead all packets to the bridge to be dropped and break the
netowrk of other virtual machines.

To solve the issue, the patch does not drop packets when kernel fails to mirror
it, and only drop the redirected packets.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-16 14:54:44 -07:00
Eric W. Biederman a6c6796c71 userns: Convert cls_flow to work with user namespaces enabled
The flow classifier can use uids and gids of the sockets that
are transmitting packets and do insert those uids and gids
into the packet classification calcuation.  I don't fully
understand the details but it appears that we can depend
on specific uids and gids when making traffic classification
decisions.

To work with user namespaces enabled map from kuids and kgids
into uids and gids in the initial user namespace giving raw
integer values the code can play with and depend on.

To avoid issues of userspace depending on uids and gids in
packet classifiers installed from other user namespaces
and getting confused deny all packet classifiers that
use uids or gids that are not comming from a netlink socket
in the initial user namespace.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:55:28 -07:00
Eric W. Biederman af4c6641f5 net sched: Pass the skb into change so it can access NETLINK_CB
cls_flow.c plays with uids and gids.  Unless I misread that
code it is possible for classifiers to depend on the specific uid and
gid values.  Therefore I need to know the user namespace of the
netlink socket that is installing the packet classifiers.  Pass
in the rtnetlink skb so I can access the NETLINK_CB of the passed
packet.  In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).

Pass in not the user namespace but the incomming rtnetlink skb into
the the classifier change routines as that is generally the more useful
parameter.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:55:28 -07:00
Amerigo Wang ee89bab14e net: move and rename netif_notify_peers()
I believe net/core/dev.c is a better place for netif_notify_peers(),
because other net event notify functions also stay in this file.

And rename it to netdev_notify_peers().

Cc: David S. Miller <davem@davemloft.net>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 14:28:23 -07:00
Paolo Valente be72f63b4c sched: add missing group change to qfq_change_class
[Resending again, as the text was corrupted by the email client]

To speed up operations, QFQ internally divides classes into
groups. Which group a class belongs to depends on the ratio between
the maximum packet length and the weight of the class. Unfortunately
the function qfq_change_class lacks the steps for changing the group
of a class when the ratio max_pkt_len/weight of the class changes.

For example, when the last of the following three commands is
executed, the group of class 1:1 is not correctly changed:

tc disc add dev XXX root handle 1: qfq
tc class add dev XXX parent 1: qfq classid 1:1 weight 1
tc class change dev XXX parent 1: classid 1:1 qfq weight 4

Not changing the group of a class does not affect the long-term
bandwidth guaranteed to the class, as the latter is independent of the
maximum packet length, and correctly changes (only) if the weight of
the class changes. In contrast, if the group of the class is not
updated, the class is still guaranteed the short-term bandwidth and
packet delay related to its old group, instead of the guarantees that
it should receive according to its new weight and/or maximum packet
length. This may also break service guarantees for other classes.
This patch adds the missing operations.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-08 16:02:05 -07:00
Hiroaki SHIMODA 47fd92f5a7 net_sched: act: Delete estimator in error path.
Some action modules free struct tcf_common in their error path
while estimator is still active. This results in est_timer()
dereference freed memory.
Add gen_kill_estimator() in ipt, pedit and simple action.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-06 13:30:01 -07:00
Hiroaki SHIMODA 696ecdc106 net_sched: gact: Fix potential panic in tcf_gact().
gact_rand array is accessed by gact->tcfg_ptype whose value
is assumed to less than MAX_RAND, but any range checks are
not performed.

So add a check in tcf_gact_init(). And in tcf_gact(), we can
reduce a branch.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-03 16:47:24 -07:00
David S. Miller 92101b3b2e ipv4: Prepare for change of rt->rt_iif encoding.
Use inet_iif() consistently, and for TCP record the input interface of
cached RX dst in inet sock.

rt->rt_iif is going to be encoded differently, so that we can
legitimately cache input routes in the FIB info more aggressively.

When the input interface is "use SKB device index" the rt->rt_iif will
be set to zero.

This forces us to move the TCP RX dst cache installation into the ipv4
specific code, and as well it should since doing the route caching for
ipv6 is pointless at the moment since it is not inspected in the ipv6
input paths yet.

Also, remove the unlikely on dst->obsolete, all ipv4 dsts have
obsolete set to a non-zero value to force invocation of the check
callback.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23 16:36:26 -07:00
David S. Miller abaa72d7fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
2012-07-19 11:17:30 -07:00
Eric Dumazet 5a308f40bf netem: refine early skb orphaning
netem does an early orphaning of skbs. Doing so breaks TCP Small Queue
or any mechanism relying on socket sk_wmem_alloc feedback.

Ideally, we should perform this orphaning after the rate module and
before the delay module, to mimic what happens on a real link :

skb orphaning is indeed normally done at TX completion, before the
transit on the link.

+-------+   +--------+  +---------------+  +-----------------+
+ Qdisc +---> Device +--> TX completion +--> links / hops    +->
+       +   +  xmit  +  + skb orphaning +  + propagation     +
+-------+   +--------+  +---------------+  +-----------------+
      < rate limiting >                  < delay, drops, reorders >

If netem is used without delay feature (drops, reorders, rate
limiting), then we should avoid early skb orphaning, to keep pressure
on sockets as long as packets are still in qdisc queue.

Ideally, netem should be refactored to implement delay module
as the last stage. Current algorithm merges the two phases
(rate limiting + delay) so its not correct.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Mark Gordon <msg@google.com>
Cc: Andreas Terzis <aterzis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-16 23:08:33 -07:00
Alan Cox 7ac2908e4b sch_sfb: Fix missing NULL check
Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44461

Signed-off-by: Alan Cox <alan@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12 08:33:18 -07:00
Florian Westphal 6d4fa852a0 net: sched: add ipset ematch
Can be used to match packets against netfilter ip sets created via ipset(8).
skb->sk_iif is used as 'incoming interface', skb->dev is 'outgoing interface'.

Since ipset is usually called from netfilter, the ematch
initializes a fake xt_action_param, pulls the ip header into the
linear area and also sets skb->data to the IP header (otherwise
matching Layer 4 set types doesn't work).

Tested-by: Mr Dash Four <mr.dash.four@googlemail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12 07:54:46 -07:00
David S. Miller 04c9f416e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/batman-adv/bridge_loop_avoidance.c
	net/batman-adv/bridge_loop_avoidance.h
	net/batman-adv/soft-interface.c
	net/mac80211/mlme.c

With merge help from Antonio Quartulli (batman-adv) and
Stephen Rothwell (drivers/net/usb/qmi_wwan.c).

The net/mac80211/mlme.c conflict seemed easy enough, accounting for a
conversion to some new tracing macros.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:56:33 -07:00
Eric Dumazet 960fb66e52 netem: add limitation to reordered packets
Fix two netem bugs :

1) When a frame was dropped by tfifo_enqueue(), drop counter
   was incremented twice.

2) When reordering is triggered, we enqueue a packet without
   checking queue limit. This can OOM pretty fast when this
   is repeated enough, since skbs are orphaned, no socket limit
   can help in this situation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mark Gordon <msg@google.com>
Cc: Andreas Terzis <aterzis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-09 00:01:49 -07:00
David S. Miller 8f961faef7 Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next 2012-07-07 16:29:29 -07:00
David S. Miller dbedbe6d56 sch_teql: Convert over to dev_neigh_lookup_skb().
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-05 01:09:06 -07:00
Rostislav Lisovy f057bbb6f9 net: em_canid: Ematch rule to match CAN frames according to their identifiers
This ematch makes it possible to classify CAN frames (AF_CAN) according
to their identifiers. This functionality can not be easily achieved with
existing classifiers, such as u32, because CAN identifier is always stored
in native endianness, whereas u32 expects Network byte order.

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2012-07-04 13:07:05 +02:00
David S. Miller 02ef22ca40 pkt_sched: sch_api: Move away from NLMSG_NEW().
And use nlmsg_data() while we're here too, as well as remove
a useless cast.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26 21:54:15 -07:00
David S. Miller 942b81653a pkt_sched: cls_api: Move away from NLMSG_NEW().
And use nlmsg_data() while we're here too, as well as remove
a useless cast.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26 21:54:15 -07:00
David S. Miller 8b00a53c63 pkt_sched: act_api: Move away from NLMSG_PUT().
Move away from NLMSG_NEW() as well.

And use nlmsg_data() while we're here too.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-26 21:39:32 -07:00
Al Viro d58367515f sch_atm.c: get rid of poinless extern
sockfd_lookup() is declared in linux/net.h, which is pulled by
linux/skbuff.h (and needed for a lot of other stuff in sch_atm.c
anyway).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-01 10:37:18 -04:00
Linus Torvalds 88d6ae8dc3 Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "cgroup file type addition / removal is updated so that file types are
  added and removed instead of individual files so that dynamic file
  type addition / removal can be implemented by cgroup and used by
  controllers.  blkio controller changes which will come through block
  tree are dependent on this.  Other changes include res_counter cleanup
  and disallowing kthread / PF_THREAD_BOUND threads to be attached to
  non-root cgroups.

  There's a reported bug with the file type addition / removal handling
  which can lead to oops on cgroup umount.  The issue is being looked
  into.  It shouldn't cause problems for most setups and isn't a
  security concern."

Fix up trivial conflict in Documentation/feature-removal-schedule.txt

* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  res_counter: Account max_usage when calling res_counter_charge_nofail()
  res_counter: Merge res_counter_charge and res_counter_charge_nofail
  cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
  cgroup: remove cgroup_subsys->populate()
  cgroup: get rid of populate for memcg
  cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
  cgroup: make css->refcnt clearing on cgroup removal optional
  cgroup: use negative bias on css->refcnt to block css_tryget()
  cgroup: implement cgroup_rm_cftypes()
  cgroup: introduce struct cfent
  cgroup: relocate __d_cgrp() and __d_cft()
  cgroup: remove cgroup_add_file[s]()
  cgroup: convert memcg controller to the new cftype interface
  memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
  cgroup: convert all non-memcg controllers to the new cftype interface
  cgroup: relocate cftype and cgroup_subsys definitions in controllers
  cgroup: merge cft_release_agent cftype array into the base files array
  cgroup: implement cgroup_add_cftypes() and friends
  cgroup: build list of all cgroups under a given cgroupfs_root
  cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
  ...
2012-05-22 17:40:19 -07:00
Eldad Zack 1de5a71c3e ipv6: correct the ipv6 option name - Pad0 to Pad1
The padding destination or hop-by-hop option is called Pad1 and not Pad0.

See RFC2460 (4.2) or the IANA ipv6-parameters registry:
http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-17 15:49:51 -04:00
Eric Dumazet 865ec5523d fq_codel: should use qdisc backlog as threshold
codel_should_drop() logic allows a packet being not dropped if queue
size is under max packet size.

In fq_codel, we have two possible backlogs : The qdisc global one, and
the flow local one.

The meaningful one for codel_should_drop() should be the global backlog,
not the per flow one, so that thin flows can have a non zero drop/mark
probability.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16 15:30:26 -04:00
Joe Perches e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00
Sasha Levin 669d67bf77 net: codel: fix build errors
Fix the following build error:

net/sched/sch_fq_codel.c: In function 'fq_codel_dump_stats':
net/sched/sch_fq_codel.c:464:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:464:3: warning: missing braces around initializer
net/sched/sch_fq_codel.c:464:3: warning: (near initialization for 'st.<anonymous>')
net/sched/sch_fq_codel.c:465:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:465:3: warning: excess elements in struct initializer
net/sched/sch_fq_codel.c:465:3: warning: (near initialization for 'st')
net/sched/sch_fq_codel.c:466:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:466:3: warning: excess elements in struct initializer
net/sched/sch_fq_codel.c:466:3: warning: (near initialization for 'st')
net/sched/sch_fq_codel.c:467:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:467:3: warning: excess elements in struct initializer
net/sched/sch_fq_codel.c:467:3: warning: (near initialization for 'st')
make[1]: *** [net/sched/sch_fq_codel.o] Error 1

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-14 17:57:58 -04:00
Geert Uytterhoeven ce5b4b9771 net/codel: Add missing #include <linux/prefetch.h>
m68k allmodconfig:

net/sched/sch_codel.c: In function ‘dequeue’:
net/sched/sch_codel.c:70: error: implicit declaration of function ‘prefetch’
make[1]: *** [net/sched/sch_codel.o] Error 1

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-14 17:57:58 -04:00
Eric Dumazet 4b549a2ef4 fq_codel: Fair Queue Codel AQM
Fair Queue Codel packet scheduler

Principles :

- Packets are classified (internal classifier or external) on flows.
- This is a Stochastic model (as we use a hash, several flows might
                              be hashed on same slot)
- Each flow has a CoDel managed queue.
- Flows are linked onto two (Round Robin) lists,
  so that new flows have priority on old ones.

- For a given flow, packets are not reordered (CoDel uses a FIFO)
- head drops only.
- ECN capability is on by default.
- Very low memory footprint (64 bytes per flow)

tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
                      [ target TIME ] [ interval TIME ] [ noecn ]
                      [ quantum BYTES ]

defaults : 1024 flows, 10240 packets limit, quantum : device MTU
           target : 5ms (CoDel default)
           interval : 100ms (CoDel default)

Impressive results on load :

class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
 Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
 rate 201691Kbit 28595pps backlog 0b 312p requeues 0
 lended: 33063109 borrowed: 0 giants: 0
 tokens: -912 ctokens: -912

class fq_codel 10:1735 parent 10:
 (dropped 1292, overlimits 0 requeues 0)
 backlog 15140b 10p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:4524 parent 10:
 (dropped 1291, overlimits 0 requeues 0)
 backlog 16654b 11p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:4e74 parent 10:
 (dropped 1290, overlimits 0 requeues 0)
 backlog 6056b 4p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
class fq_codel 10:628a parent 10:
 (dropped 1289, overlimits 0 requeues 0)
 backlog 7570b 5p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
class fq_codel 10:a4b3 parent 10:
 (dropped 302, overlimits 0 requeues 0)
 backlog 16654b 11p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:c3c2 parent 10:
 (dropped 1284, overlimits 0 requeues 0)
 backlog 13626b 9p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 5.9ms
class fq_codel 10:d331 parent 10:
 (dropped 299, overlimits 0 requeues 0)
 backlog 15140b 10p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 7.0ms
class fq_codel 10:d526 parent 10:
 (dropped 12160, overlimits 0 requeues 0)
 backlog 35870b 211p requeues 0
  deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
class fq_codel 10:e2c6 parent 10:
 (dropped 1288, overlimits 0 requeues 0)
 backlog 15140b 10p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:eab5 parent 10:
 (dropped 1285, overlimits 0 requeues 0)
 backlog 16654b 11p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 5.9ms
class fq_codel 10:f220 parent 10:
 (dropped 1289, overlimits 0 requeues 0)
 backlog 15140b 10p requeues 0
  deficit 1514 count 1 lastcount 1 ldelay 7.1ms

qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
 Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
 rate 201697Kbit 28602pps backlog 0b 260p requeues 71
qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
 Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
 rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
  maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
  new_flows_len 0 old_flows_len 11

PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms

10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms

Much better than SFQ because of priority given to new flows, and fast
path dirtying less cache lines.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-12 15:53:42 -04:00
Eric Dumazet 76e3cc126b codel: Controlled Delay AQM
An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.

http://queue.acm.org/detail.cfm?id=2209336

This AQM main input is no longer queue size in bytes or packets, but the
delay packets stay in (FIFO) queue.

As we don't have infinite memory, we still can drop packets in enqueue()
in case of massive load, but mean of CoDel is to drop packets in
dequeue(), using a control law based on two simple parameters :

target : target sojourn time (default 5ms)
interval : width of moving time window (default 100ms)

Based on initial work from Dave Taht.

Refactored to help future codel inclusion as a plugin for other linux
qdisc (FQ_CODEL, ...), like RED.

include/net/codel.h contains codel algorithm as close as possible than
Kathleen reference.

net/sched/sch_codel.c contains the linux qdisc specific glue.

Separate structures permit a memory efficient implementation of fq_codel
(to be sent as a separate work) : Each flow has its own struct
codel_vars.

timestamps are taken at enqueue() time with 1024 ns precision, allowing
a range of 2199 seconds in queue, and 100Gb links support. iproute2 uses
usec as base unit.

Selected packets are dropped, unless ECN is enabled and packets can get
ECN mark instead.

Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
tg3 drivers (BQL enabled).

Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
                          [ interval TIME ] [ ecn ]

qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
 Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
 rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
  count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
  maxpacket 1514 ecn_mark 84399 drop_overlimit 0

CoDel must be seen as a base module, and should be used keeping in mind
there is still a FIFO queue. So a typical setup will probably need a
hierarchy of several qdiscs and packet classifiers to be able to meet
whatever constraints a user might have.

One possible example would be to use fq_codel, which combines Fair
Queueing and CoDel, in replacement of sfq / sfq_red.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:35:02 -04:00
Eric Dumazet 2dd875ff31 net_sched: update bstats in dequeue()
Class bytes/packets stats can be misleading because they are updated in
enqueue() while packet might be dropped later.

We already fixed all qdiscs but sch_atm.

This patch makes the final cleanup.

class rate estimators can now match qdisc ones.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:33:01 -04:00
David S. Miller 0d6c4a2e46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/param.c
	drivers/net/wireless/iwlwifi/iwl-agn-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans.h

Resolved the iwlwifi conflict with mainline using 3-way diff posted
by John Linville and Stephen Rothwell.  In 'net' we added a bug
fix to make iwlwifi report a more accurate skb->truesize but this
conflicted with RX path changes that happened meanwhile in net-next.

In e1000e a conflict arose in the validation code for settings of
adapter->itr.  'net-next' had more sophisticated logic so that
logic was used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-07 23:35:40 -04:00
Eric Dumazet 1704575519 net: sched: factorize code (qdisc_drop())
Use qdisc_drop() helper where possible.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-04 11:50:05 -04:00
Eric Dumazet 116a0fc31c netem: fix possible skb leak
skb_checksum_help(skb) can return an error, we must free skb in this
case. qdisc_drop(skb, sch) can also be feeded with a NULL skb (if
skb_unshare() failed), so lets use this generic helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 13:40:48 -04:00
Eric Dumazet e4ae004b84 netem: add ECN capability
Add ECN (Explicit Congestion Notification) marking capability to netem

tc qdisc add dev eth0 root netem drop 0.5 ecn

Instead of dropping packets, try to ECN mark them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01 09:39:48 -04:00
David S. Miller f24001941c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")

The former moved around the sysctl register/unregister calls, the
later simply removed them.

With help from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 23:15:17 -04:00
David Ward 244b65dbfe net_sched: gred: Fix oops in gred_dump() in WRED mode
A parameter set exists for WRED mode, called wred_set, to hold the same
values for qavg and qidlestart across all VQs. The WRED mode values had
been previously held in the VQ for the default DP. After these values
were moved to wred_set, the VQ for the default DP was no longer created
automatically (so that it could be omitted on purpose, to have packets
in the default DP enqueued directly to the device without using RED).

However, gred_dump() was overlooked during that change; in WRED mode it
still reads qavg/qidlestart from the VQ for the default DP, which might
not even exist. As a result, this command sequence will cause an oops:

tc qdisc add dev $DEV handle $HANDLE parent $PARENT gred setup \
    DPs 3 default 2 grio
tc qdisc change dev $DEV handle $HANDLE gred DP 0 prio 8 $RED_OPTIONS
tc qdisc change dev $DEV handle $HANDLE gred DP 1 prio 8 $RED_OPTIONS

This fixes gred_dump() in WRED mode to use the values held in wred_set.

Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-16 23:51:07 -04:00
David S. Miller 1b34ec43c9 pkt_sched: Stop using NLA_PUT*().
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-01 18:11:37 -04:00
Tejun Heo 4baf6e3325 cgroup: convert all non-memcg controllers to the new cftype interface
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.

This is functionally identical transformation.  There shouldn't be any
visible behavior change.

memcg is rather special and will be converted separately.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
2012-04-01 12:09:55 -07:00
Tejun Heo 676f7c8f84 cgroup: relocate cftype and cgroup_subsys definitions in controllers
blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol
unnecessarily define cftype array and cgroup_subsys structures at the
top of the file, which is unconventional and necessiates forward
declaration of methods.

This patch relocates those below the definitions of the methods and
removes the forward declarations.  Note that forward declaration of
tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup().  This
will be removed soon by another patch.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:55 -07:00
Linus Torvalds 3b59bf0816 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking merge from David Miller:
 "1) Move ixgbe driver over to purely page based buffering on receive.
     From Alexander Duyck.

  2) Add receive packet steering support to e1000e, from Bruce Allan.

  3) Convert TCP MD5 support over to RCU, from Eric Dumazet.

  4) Reduce cpu usage in handling out-of-order TCP packets on modern
     systems, also from Eric Dumazet.

  5) Support the IP{,V6}_UNICAST_IF socket options, making the wine
     folks happy, from Erich Hoover.

  6) Support VLAN trunking from guests in hyperv driver, from Haiyang
     Zhang.

  7) Support byte-queue-limtis in r8169, from Igor Maravic.

  8) Outline code intended for IP_RECVTOS in IP_PKTOPTIONS existed but
     was never properly implemented, Jiri Benc fixed that.

  9) 64-bit statistics support in r8169 and 8139too, from Junchang Wang.

  10) Support kernel side dump filtering by ctmark in netfilter
      ctnetlink, from Pablo Neira Ayuso.

  11) Support byte-queue-limits in gianfar driver, from Paul Gortmaker.

  12) Add new peek socket options to assist with socket migration, from
      Pavel Emelyanov.

  13) Add sch_plug packet scheduler whose queue is controlled by
      userland daemons using explicit freeze and release commands.  From
      Shriram Rajagopalan.

  14) Fix FCOE checksum offload handling on transmit, from Yi Zou."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1846 commits)
  Fix pppol2tp getsockname()
  Remove printk from rds_sendmsg
  ipv6: fix incorrent ipv6 ipsec packet fragment
  cpsw: Hook up default ndo_change_mtu.
  net: qmi_wwan: fix build error due to cdc-wdm dependecy
  netdev: driver: ethernet: Add TI CPSW driver
  netdev: driver: ethernet: add cpsw address lookup engine support
  phy: add am79c874 PHY support
  mlx4_core: fix race on comm channel
  bonding: send igmp report for its master
  fs_enet: Add MPC5125 FEC support and PHY interface selection
  net: bpf_jit: fix BPF_S_LDX_B_MSH compilation
  net: update the usage of CHECKSUM_UNNECESSARY
  fcoe: use CHECKSUM_UNNECESSARY instead of CHECKSUM_PARTIAL on tx
  net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso
  ixgbe: Fix issues with SR-IOV loopback when flow control is disabled
  net/hyperv: Fix the code handling tx busy
  ixgbe: fix namespace issues when FCoE/DCB is not enabled
  rtlwifi: Remove unused ETH_ADDR_LEN defines
  igbvf: Use ETH_ALEN
  ...

Fix up fairly trivial conflicts in drivers/isdn/gigaset/interface.c and
drivers/net/usb/{Kconfig,qmi_wwan.c} as per David.
2012-03-20 21:04:47 -07:00
Linus Torvalds 0d9cabdcce Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Out of the 8 commits, one fixes a long-standing locking issue around
  tasklist walking and others are cleanups."

* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
  cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
  cgroup: remove cgroup_subsys argument from callbacks
  cgroup: remove extra calls to find_existing_css_set
  cgroup: replace tasklist_lock with rcu_read_lock
  cgroup: simplify double-check locking in cgroup_attach_proc
  cgroup: move struct cgroup_pidlist out from the header file
  cgroup: remove cgroup_attach_task_current_cg()
2012-03-20 18:11:21 -07:00
David S. Miller 4da0bd7365 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-03-18 23:29:41 -04:00
Eric Dumazet cc34eb672e sch_sfq: revert dont put new flow at the end of flows
This reverts commit d47a0ac7b6 (sch_sfq: dont put new flow at the end of
flows)

As Jesper found out, patch sounded great but has bad side effects.

In stress situation, pushing new flows in front of the queue can prevent
old flows doing any progress. Packets can stay in SFQ queue for
unlimited amount of time.

It's possible to add heuristics to limit this problem, but this would
add complexity outside of SFQ scope.

A more sensible answer to Dave Taht concerns (who reported the issued I
tried to solve in original commit) is probably to use a qdisc hierarchy
so that high prio packets dont enter a potentially crowded SFQ qdisc.

Reported-by: Jesper Dangaard Brouer <jdb@comx.dk>
Cc: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16 01:55:25 -07:00
David S. Miller ff4783ce78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/sfc/rx.c

Overlapping changes in drivers/net/ethernet/sfc/rx.c, one to change
the rx_buf->is_page boolean into a set of u16 flags, and another to
adjust how ->ip_summed is initialized.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-26 21:55:51 -05:00
Eric Dumazet cd961c2ca9 netem: fix dequeue
commit 50612537e9 (netem: fix classful handling) added two errors in
netem_dequeue()

1) After checking skb at the head of tfifo queue for time constraints,
   it dequeues tail skb, thus adding unwanted reordering.

2) qdisc stats are updated twice per packet
   (one when packet dequeued from tfifo, once when delivered)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-19 18:57:50 -05:00
Eric Dumazet 2132cf6437 net_sched: sch_plug: plug_qdisc_ops is static
net/sched/sch_plug.c:211:18: warning: symbol 'plug_qdisc_ops' was not
declared. Should it be static?

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-13 16:04:40 -05:00
David S. Miller 16bda13d90 net: Make qdisc_skb_cb upper size bound explicit.
Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
of other data structures.

This is intended to be used by IPoIB so that it can remember
addressing information stored at hard_header_ops->create() time that
it can fetch when the packet gets to the transmit routine.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-09 13:50:34 -05:00
Shriram Rajagopalan c3059be16c net/sched: sch_plug - Queue traffic until an explicit release command
The qdisc supports two operations - plug and unplug. When the
qdisc receives a plug command via netlink request, packets arriving
henceforth are buffered until a corresponding unplug command is received.
Depending on the type of unplug command, the queue can be unplugged
indefinitely or selectively.

This qdisc can be used to implement output buffering, an essential
functionality required for consistent recovery in checkpoint based
fault-tolerance systems. Output buffering enables speculative execution
by allowing generated network traffic to be rolled back. It is used to
provide network protection for Xen Guests in the Remus high availability
project, available as part of Xen.

This module is generic enough to be used by any other system that wishes
to add speculative execution and output buffering to its applications.

This module was originally available in the linux 2.6.32 PV-OPS tree,
used as dom0 for Xen.

For more information, please refer to http://nss.cs.ubc.ca/remus/
and http://wiki.xensource.com/xenwiki/Remus

Changes in V3:
  * Removed debug output (printk) on queue overflow
  * Added TCQ_PLUG_RELEASE_INDEFINITE - that allows the user to
    use this qdisc, for simple plug/unplug operations.
  * Use of packet counts instead of pointers to keep track of
    the buffers in the queue.

Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Signed-off-by: Brendan Cully <brendan@cs.ubc.ca>
[author of the code in the linux 2.6.32 pvops tree]
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-07 12:54:56 -05:00
David S. Miller a0417fa3a1 net: Make qdisc_skb_cb upper size bound explicit.
Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
of other data structures.

This is intended to be used by IPoIB so that it can remember
addressing information stored at hard_header_ops->create() time that
it can fetch when the packet gets to the transmit routine.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-06 15:14:37 -05:00
Li Zefan 761b3ef50e cgroup: remove cgroup_subsys argument from callbacks
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.

Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().

So we reduce a few lines of code, though the shrinking of object size
is minimal.

 16 files changed, 113 insertions(+), 162 deletions(-)

   text    data     bss     dec     hex filename
5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
5486170  656987 7039960 13183117         c9288d vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-02-02 09:20:22 -08:00
Vijay Subramanian a42b4799c6 netem: Fix off-by-one bug in reordering
With netem reordering, a gap of N is supposed to reorder every Nth packet with
given reorder probability.  However, the code currently skips N packets and
reorders every (N+1)th packet.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-22 15:08:44 -05:00
Eric Dumazet ddecf0f4db net_sched: sfq: add optional RED on top of SFQ
Adds an optional Random Early Detection on each SFQ flow queue.

Traditional SFQ limits count of packets, while RED permits to also
control number of bytes per flow, and adds ECN capability as well.

1) We dont handle the idle time management in this RED implementation,
since each 'new flow' begins with a null qavg. We really want to address
backlogged flows.

2) if headdrop is selected, we try to ecn mark first packet instead of
currently enqueued packet. This gives faster feedback for tcp flows
compared to traditional RED [ marking the last packet in queue ]

Example of use :

tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
	limit 3000 headdrop flows 512 divisor 16384 \
	redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn

qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
 ewma 6 min 8000b max 60000b probability 0.2 ecn
 prob_mark 0 prob_mark_head 4876 prob_drop 6131
 forced_mark 0 forced_mark_head 0 forced_drop 0
 Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
requeues 0)
 rate 99483Kbit 8219pps backlog 689392b 456p requeues 0

In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
flows, we can see number of packets CE marked is smaller than number of
drops (for non ECN flows)

If same test is run, without RED, we can check backlog is much bigger.

qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
 Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
 rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Tested-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Eric Dumazet eeca6688d6 net_sched: red: split red_parms into parms and vars
This patch splits the red_parms structure into two components.

One holding the RED 'constant' parameters, and one containing the
variables.

This permits a size reduction of GRED qdisc, and is a preliminary step
to add an optional RED unit to SFQ.

SFQRED will have a single red_parms structure shared by all flows, and a
private red_vars per flow.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05 14:01:21 -05:00
Eric Dumazet 18cb809850 net_sched: sfq: extend limits
SFQ as implemented in Linux is very limited, with at most 127 flows
and limit of 127 packets. [ So if 127 flows are active, we have one
packet per flow ]

This patch brings to SFQ following features to cope with modern needs.

- Ability to specify a smaller per flow limit of inflight packets.
    (default value being at 127 packets)

- Ability to have up to 65408 active flows (instead of 127)

- Ability to have head drops instead of tail drops
  (to drop old packets from a flow)

Example of use : No more than 20 packets per flow, max 8000 flows, max
20000 packets in SFQ qdisc, hash table of 65536 slots.

tc qdisc add ... sfq \
        flows 8000 \
        depth 20 \
        headdrop \
        limit 20000 \
	divisor 65536

Ram usage :

2 bytes per hash table entry (instead of previous 1 byte/entry)
32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much
better cache hit ratio.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05 14:01:21 -05:00
Hagen Paul Pfeifer eb10192447 net_sched: Bug in netem reordering
Not now, but it looks you are correct. q->qdisc is NULL until another
additional qdisc is attached (beside tfifo). See 50612537e9.
The following patch should work.

From: Hagen Paul Pfeifer <hagen@jauu.net>

netem: catch NULL pointer by updating the real qdisc statistic

Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05 13:27:39 -05:00
David S. Miller 117ff42fd4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-01-04 21:35:43 -05:00
Eric Dumazet 02a9098ede net_sched: sfq: always randomize hash perturbation
SFQ q->perturbation is used in sfq_hash() as an input to Jenkins hash.

We currently randomize this 32bit value only if a perturbation timer is
setup.

Its much better to always initialize it to defeat attackers, or else
they can predict very well what kind of packets they have to forge to
hit a particular flow.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04 14:12:48 -05:00
Eric Dumazet bd16a6cce2 net_sched: sfq: fix mem alloc error recovery
Since commit 817fb15dfd (net_sched: sfq: allow divisor to be a
parameter), we can leave perturbation timer armed if a memory allocation
error aborts sfq_init().

Memory containing active struct timer_list is freed and kernel can
crash.

Call sfq_destroy() from sfq_init() to properly dismantle qdisc.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04 14:12:48 -05:00
Eric Dumazet fa0f5aa743 net_sched: qdisc_alloc_handle() can be too slow
When trying to allocate ~32768 qdiscs using autohandle mechanism, we can
fill the space managed by kernel (handles in [8000-FFFF]:0000 range)

But O(N^2) qdisc_alloc_handle() loops 0x10000 times instead of 0x8000

time tc add qdisc add dev eth0 parent 10:7fff pfifo limit 10
RTNETLINK answers: Cannot allocate memory
real    1m54.826s
user    0m0.000s
sys     0m0.004s

INFO: rcu_sched_state detected stall on CPU 0 (t=60000 jiffies)

Half number of loops, and add a cond_resched() call.
We hold rtnl at this point.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03 13:03:20 -05:00
Eric Dumazet d32ae76f2b sch_qfq: accurate wsum handling
We can underestimate q->wsum in case of "tc class replace ... qfq"
and/or qdisc_create_dflt() error.

wsum is not really used in fast path, only at qfq qdisc/class setup,
to catch user error.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03 13:02:19 -05:00
Eric Dumazet 6bafcac323 sch_qfq: fix overflow in qfq_update_start()
grp->slot_shift is between 22 and 41, so using 32bit wide variables is
probably a typo.

This could explain QFQ hangs Dave reported to me, after 2^23 packets ?

(23 = 64 - 41)

Reported-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03 12:58:23 -05:00
Eric Dumazet d47a0ac7b6 sch_sfq: dont put new flow at the end of flows
SFQ enqueue algo puts a new flow _behind_ all pre-existing flows in the
circular list. In fact this is probably an old SFQ implementation bug.

100 Mbits = ~8333 full frames per second, or ~8 frames per ms.

With 50 flows, it means your "new flow" will have to wait 50 packets
being sent before its own packet. Thats the ~6ms.

We certainly can change SFQ to give a priority advantage to new flows,
so that next dequeued packet is taken from a new flow, not an old one.

Reported-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03 12:52:09 -05:00
Eric Dumazet 50612537e9 netem: fix classful handling
Commit 10f6dfcfde (Revert "sch_netem: Remove classful functionality")
reintroduced classful functionality to netem, but broke basic netem
behavior :

netem uses an t(ime)fifo queue, and store timestamps in skb->cb[]

If qdisc is changed, time constraints are not respected and other qdisc
can destroy skb->cb[] and block netem at dequeue time.

Fix this by always using internal tfifo, and optionally attach a child
qdisc to netem (or a tree of qdiscs)

Example of use :

DEV=eth3
tc qdisc del dev $DEV root
tc qdisc add dev $DEV root handle 30: est 1sec 8sec netem delay 20ms 10ms
tc qdisc add dev $DEV handle 40:0 parent 30:0 tbf \
	burst 20480 limit 20480 mtu 1514 rate 32000bps

qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms  10.0ms
 Sent 190792 bytes 413 pkt (dropped 0, overlimits 0 requeues 0)
 rate 18416bit 3pps backlog 0b 0p requeues 0
qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
 Sent 190792 bytes 413 pkt (dropped 6, overlimits 10 requeues 0)
 backlog 0b 5p requeues 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-30 17:12:23 -05:00
David S. Miller 7f8e3234c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-30 13:04:14 -05:00
Eric Dumazet b0460e4484 sch_tbf: report backlog information
Provide child qdisc backlog (byte count) information so that "tc -s
qdisc" can report it to user.

qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms  10.0ms
 Sent 948517 bytes 898 pkt (dropped 0, overlimits 0 requeues 1)
 rate 175056bit 16pps backlog 114b 1p requeues 1
qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
 Sent 948517 bytes 898 pkt (dropped 15, overlimits 611 requeues 0)
 backlog 18168b 12p requeues 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-29 15:07:21 -05:00
Eric Dumazet bb52c7acf8 netem: dont call vfree() under spinlock and BH disabled
commit 6373a9a286 (netem: use vmalloc for distribution table) added a
regression, since vfree() is called while holding a spinlock and BH
being disabled.

Fix this by doing the pointers swap in critical section, and freeing
after spinlock release.

Also add __GFP_NOWARN to the kmalloc() try, since we fallback to
vmalloc().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-24 16:08:50 -05:00
David S. Miller abb434cb05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bluetooth/l2cap_core.c

Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23 17:13:56 -05:00
stephen hemminger 2494654d48 netem: loss model API sizes
The new netem loss model is configured with nested netlink messages.
This code is being overly strict about sizes, and is easily confused
by padding (or possible future expansion). Also message
for gemodel is incorrect.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23 16:51:18 -05:00
Eric Dumazet f5a59b7332 sch_hfsc: report backlog information
Add backlog (byte count) information in hfsc classes and qdisc, so that
"tc -s" can report it to user, instead of 0 values :

qdisc hfsc 1: root refcnt 6 default 20
 Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0)
 rate 1492Kbit 126pps backlog 103226b 74p requeues 0
...
class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit
 Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 81822b 56p requeues 0
 period 23 work 49451576 bytes rtwork 13277552 bytes level 0
...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: John A. Sullivan III <jsullivan@opensourcedevel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23 16:51:18 -05:00
Thomas Graf 7838f2ce36 mqprio: Avoid panic if no options are provided
Userspace may not provide TCA_OPTIONS, in fact tc currently does
so not do so if no arguments are specified on the command line.
Return EINVAL instead of panicing.

Signed-off-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22 22:34:56 -05:00
Eric Dumazet 225d9b89c9 sch_sfq: rehash queues in perturb timer
A known Out Of Order (OOO) problem hurts SFQ when timer changes
perturbation value, since all new packets delivered to SFQ enqueue might
end on different slots than previous in-flight packets.

With round robin delivery, we can thus deliver packets in a different
order.

Since SFQ is limited to small amount of in-flight packets, we can rehash
packets so that this OOO problem is fixed.

This rehashing is performed only if internal flow classifier is in use.

We now store in skb->cb[] the "struct flow_keys" so that we dont call
skb_flow_dissect() again while rehashing.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-21 15:44:34 -05:00
Eric Dumazet 869aa41044 sch_gred: prefer GFP_KERNEL allocations
In control path, its better to use GFP_KERNEL allocations where
possible.

Before taking qdisc spinlock, we preallocate memory just in case we'll
need it in gred_change_vq()

This is a followup to commit 3f1e6d3fd3 (sch_gred: should not use
GFP_KERNEL while holding a spinlock)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-16 15:40:33 -05:00
David S. Miller b26e478f8f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fsl_pq_mdio.c
	net/batman-adv/translation-table.c
	net/ipv6/route.c
2011-12-16 02:11:14 -05:00
Eric Dumazet 3a53943b5a cls_flow: remove one dynamic array
Its better to use a predefined size for this small automatic variable.

Removes a sparse error as well :

net/sched/cls_flow.c:288:13: error: bad constant expression

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-14 13:34:55 -05:00
Hagen Paul Pfeifer 90b41a1cd4 netem: add cell concept to simulate special MAC behavior
This extension can be used to simulate special link layer
characteristics. Simulate because packet data is not modified, only the
calculation base is changed to delay a packet based on the original
packet size and artificial cell information.

packet_overhead can be used to simulate a link layer header compression
scheme (e.g. set packet_overhead to -20) or with a positive
packet_overhead value an additional MAC header can be simulated. It is
also possible to "replace" the 14 byte Ethernet header with something
else.

cell_size and cell_overhead can be used to simulate link layer schemes,
based on cells, like some TDMA schemes. Another application area are MAC
schemes using a link layer fragmentation with a (small) header each.
Cell size is the maximum amount of data bytes within one cell. Cell
overhead is an additional variable to change the per-cell-overhead
(e.g.  5 byte header per fragment).

Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
cell overhead 5 byte):

  tc qdisc add dev eth0 root netem rate 5kbit 20 100 5

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:44:48 -05:00
Eric Dumazet 3f1e6d3fd3 sch_gred: should not use GFP_KERNEL while holding a spinlock
gred_change_vq() is called under sch_tree_lock(sch).

This means a spinlock is held, and we are not allowed to sleep in this
context.

We might pre-allocate memory using GFP_KERNEL before taking spinlock,
but this is not suitable for stable material.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:08:54 -05:00
Eric Dumazet a73ed26bba sch_red: generalize accurate MAX_P support to RED/GRED/CHOKE
Now RED uses a Q0.32 number to store max_p (max probability), allow
RED/GRED/CHOKE to use/report full resolution at config/dump time.

Old tc binaries are non aware of new attributes, and still set/get Plog.

New tc binary set/get both Plog and max_p for backward compatibility,
they display "probability value" if they get max_p from new kernels.

# tc -d  qdisc show dev ...
...
qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
probability 0.09 Scell_log 15

Make sure we avoid potential divides by 0 in reciprocal_value(), if
(max_th - min_th) is big.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-09 13:46:15 -05:00
Eric Dumazet 8af2a218de sch_red: Adaptative RED AQM
Adaptative RED AQM for linux, based on paper from Sally FLoyd,
Ramakrishna Gummadi, and Scott Shenker, August 2001 :

http://icir.org/floyd/papers/adaptiveRed.pdf

Goal of Adaptative RED is to make max_p a dynamic value between 1% and
50% to reach the target average queue : (max_th - min_th) / 2

Every 500 ms:
 if (avg > target and max_p <= 0.5)
  increase max_p : max_p += alpha;
 else if (avg < target and max_p >= 0.01)
  decrease max_p : max_p *= beta;

target :[min_th + 0.4*(min_th - max_th),
          min_th + 0.6*(min_th - max_th)].
alpha : min(0.01, max_p / 4)
beta : 0.9
max_P is a Q0.32 fixed point number (unsigned, with 32 bits mantissa)

Changes against our RED implementation are :

max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
fixed point number, to allow full range described in Adatative paper.

To deliver a random number, we now use a reciprocal divide (thats really
a multiply), but this operation is done once per marked/droped packet
when in RED_BETWEEN_TRESH window, so added cost (compared to previous
AND operation) is near zero.

dump operation gives current max_p value in a new TCA_RED_MAX_P
attribute.

Example on a 10Mbit link :

tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
   limit 400000 min 30000 max 90000 avpkt 1000 \
   burst 55 ecn adaptative bandwidth 10Mbit

# tc -s -d qdisc show dev eth3
...
qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
adaptative ewma 5 max_p=0.113335 Scell_log 15
 Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
 rate 9749Kbit 831pps backlog 72056b 16p requeues 0
  marked 1357 early 35 pdrop 0 other 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-08 19:52:43 -05:00
David Miller 2721745501 net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
2011-12-05 15:20:19 -05:00
David S. Miller b3613118eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-02 13:49:21 -05:00
Eric Dumazet 1ee5fa1e99 sch_red: fix red_change
Le mercredi 30 novembre 2011 à 14:36 -0800, Stephen Hemminger a écrit :

> (Almost) nobody uses RED because they can't figure it out.
> According to Wikipedia, VJ says that:
>  "there are not one, but two bugs in classic RED."

RED is useful for high throughput routers, I doubt many linux machines
act as such devices.

I was considering adding Adaptative RED (Sally Floyd, Ramakrishna
Gummadi, Scott Shender), August 2001

In this version, maxp is dynamic (from 1% to 50%), and user only have to
setup min_th (target average queue size)
(max_th and wq (burst in linux RED) are automatically setup)

By the way it seems we have a small bug in red_change()

if (skb_queue_empty(&sch->q))
	red_end_of_idle_period(&q->parms);

First, if queue is empty, we should call
red_start_of_idle_period(&q->parms);

Second, since we dont use anymore sch->q, but q->qdisc, the test is
meaningless.

Oh well...

[PATCH] sch_red: fix red_change()

Now RED is classful, we must check q->qdisc->q.qlen, and if queue is empty,
we start an idle period, not end it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01 19:24:38 -05:00
Eric Dumazet fc33cc7242 netem: fix build error on 32bit arches
ERROR: "__udivdi3" [net/sched/sch_netem.ko] undefined!

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01 11:40:19 -05:00
Hagen Paul Pfeifer 7bc0f28c7a netem: rate extension
Currently netem is not in the ability to emulate channel bandwidth. Only static
delay (and optional random jitter) can be configured.

To emulate the channel rate the token bucket filter (sch_tbf) can be used.  But
TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot
be 0. Also the idea behind TBF is that the credit (token in buckets) fills if
no packet is transmitted. So that there is always a "positive" credit for new
packets. In real life this behavior contradicts the law of nature where
nothing can travel faster as speed of light. E.g.: on an emulated 1000 byte/s
link a small IPv4/TCP SYN packet with ~50 byte require ~0.05 seconds - not 0
seconds.

Netem is an excellent place to implement a rate limiting feature: static
delay is already implemented, tfifo already has time information and the
user can skip TBF configuration completely.

This patch implement rate feature which can be configured via tc. e.g:

	tc qdisc add dev eth0 root netem rate 10kbit

To emulate a link of 5000byte/s and add an additional static delay of 10ms:

	tc qdisc add dev eth0 root netem delay 10ms rate 5KBps

Note: similar to TBF the rate extension is bounded to the kernel timing
system. Depending on the architecture timer granularity, higher rates (e.g.
10mbit/s and higher) tend to transmission bursts. Also note: further queues
living in network adaptors; see ethtool(8).

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@drr.davemloft.net>
2011-11-30 23:18:35 -05:00
Eric Dumazet f7e57044ee sch_teql: fix lockdep splat
We need rcu_read_lock() protection before using dst_get_neighbour(), and
we must cache its value (pass it to __teql_resolve())

teql_master_xmit() is called under rcu_read_lock_bh() protection, its
not enough.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 17:10:49 -05:00
Eric Dumazet 2bcc34bb98 sch_choke: use skb_flow_dissect()
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 13:17:03 -05:00
Eric Dumazet 11fca931d3 sch_sfq: use skb_flow_dissect()
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 13:17:03 -05:00
Tom Herbert 7346649826 net: Add queue state xoff flag for stack
Create separate queue state flags so that either the stack or drivers
can turn on XOFF.  Added a set of functions used in the stack to determine
if a queue is really stopped (either by stack or driver)

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29 12:46:19 -05:00
Eric Dumazet a00bd469b6 sch_sfb: use skb_flow_dissect()
Current SFB double hashing is not fulfilling SFB theory, if two flows
share same rxhash value.

Using skb_flow_dissect() permits to really have better hash dispersion,
and get tunnelling support as well.

Double hashing point was mentioned by Florian Westphal

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-28 19:09:28 -05:00
Eric Dumazet 6bd2a9af17 cls_flow: use skb_flow_dissect()
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.

This lack of tunnelling support was mentioned by Dan Siemon.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-28 19:09:28 -05:00
david decotigny ccf5ff69fb net: new counter for tx_timeout errors in sysfs
This adds the /sys/class/net/DEV/queues/Q/tx_timeout attribute
containing the total number of timeout events on the given queue. It
is always available with CONFIG_SYSFS, independently of
CONFIG_RPS/XPS.

Credits to Stephen Hemminger for a preliminary version of this patch.

Tested:
  without CONFIG_SYSFS (compilation only)
  with sysfs and without CONFIG_RPS & CONFIG_XPS
  with sysfs and without CONFIG_RPS
  with sysfs and without CONFIG_XPS
  with defaults

Signed-off-by: David Decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16 23:14:02 -05:00
Eric Dumazet 9ecd04bc04 sch_choke: use skb_header_pointer()
Remove the assumption that skb_get_rxhash() makes IP header and ports
linear, and use skb_header_pointer() instead in choke_match_flow()

This permits __skb_get_rxhash() to use skb_header_pointer() eventually.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-08 16:41:31 -05:00
Paul Gortmaker bc3b2d7fb9 net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:30 -04:00
Paul Gortmaker 3a9a231d97 net: Fix files explicitly needing to include module.h
With calls to modular infrastructure, these files really
needs the full module.h header.  Call it out so some of the
cleanups of implicit and unrequired includes elsewhere can be
cleaned up.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:28 -04:00
Eric Dumazet 859c20123a net_sched: cls_flow: use skb_header_pointer()
Dan Siemon would like to add tunnelling support to cls_flow

This preliminary patch introduces use of skb_header_pointer() to help
this task, while avoiding skb head reallocation because of deep packet
inspection.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 18:40:14 -04:00
David S. Miller 8decf86879 Merge branch 'master' of github.com:davem330/net
Conflicts:
	MAINTAINERS
	drivers/net/Kconfig
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/wireless/iwlwifi/iwl-pci.c
	drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
	drivers/net/wireless/rt2x00/rt2800usb.c
	drivers/net/wireless/wl12xx/main.c
2011-09-22 03:23:13 -04:00
Igor Maravić 27e95a8c67 pkt_sched: cls_rsvp.h was outdated
File cls_rsvp.h in /net/sched was outdated. I'm sending you patch for this
file.

[ tb[] array should be indexed by X not X-1 -DaveM ]

Signed-off-by: Igor Maravić <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 14:49:43 -04:00
Eric Dumazet 363437f40a net_sched: sfb: optimize enqueue on full queue
In case SFB queue is full (hard limit reached), there is no point
spending time to compute hash and maximum qlen/p_mark.

We instead just early drop packet.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-26 12:55:18 -04:00
Jamal Hadi Salim 8919bc13e8 net_sched: fix port mirror/redirect stats reporting
When a redirected or mirrored packet is dropped by the target
device we need to record statistics.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:10:20 -07:00
Florian Westphal 3557619f0f net_sched: prio: use qdisc_dequeue_peeked
commit 07bd8df5df
(sch_sfq: fix peek() implementation) changed sfq to use generic
peek helper.

This makes HFSC complain about a non-work-conserving child qdisc, if
prio with sfq child is used within hfsc:

hfsc peeks into prio qdisc, which will then peek into sfq.
returned skb is stashed in sch->gso_skb.

Next, hfsc tries to dequeue from prio, but prio will call sfq dequeue
directly, which may return NULL instead of previously peeked-at skb.

Have prio call qdisc_dequeue_peeked, so sfq->dequeue() is
not called in this case.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-09 21:52:11 -07:00
Eric Dumazet e1738bd9ce sch_sfq: fix sfq_enqueue()
commit 8efa885406 (sch_sfq: avoid giving spurious NET_XMIT_CN signals)
forgot to call qdisc_tree_decrease_qlen() to signal upper levels that a
packet (from another flow) was dropped, leading to various problems.

With help from Michal Soltys and Michal Pokrywka, who did a bisection.

Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=39372
Debian ref: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631945

Reported-by: Lucas Bocchi <lucas.bocchi@gmail.com>
Reported-and-bisected-by: Michal Pokrywka <wolfmoon@o2.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Michal Soltys <soltys@ziu.info>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 02:27:21 -07:00
David S. Miller 69cce1d140 net: Abstract dst->neighbour accesses behind helpers.
dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:35 -07:00
Krishna Kumar f0c50c7c9a Remove redundant variable/code in __qdisc_run
Remove redundant variable "work".

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-15 08:08:26 -07:00
Michał Mirosław e20e694073 net: remove SK_ROUTE_CAPS from meta ematch
Remove it, as it indirectly exposes netdev features. It's not used in
iproute2 (2.6.38) - is anything else using its interface?

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-14 14:45:59 -07:00
Eric Dumazet dc7f9f6e88 net: sched: constify tcf_proto and tc_action
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-06 02:52:16 -07:00
jamal d5b8aa1d24 net_sched: fix dequeuer fairness
Results on dummy device can be seen in my netconf 2011
slides. These results are for a 10Gige IXGBE intel
nic - on another i5 machine, very similar specs to
the one used in the netconf2011 results.
It turns out - this is a hell lot worse than dummy
and so this patch is even more beneficial for 10G.

Test setup:
----------

System under test sending packets out.
Additional box connected directly dropping packets.
Installed prio qdisc on the eth device and default
netdev default length of 1000 used as is.
The 3 prio bands each were set to 100 (didnt factor in
the results).

5 packet runs were made and the middle 3 picked.

results
-------

The "cpu" column indicates the which cpu the sample
was taken on,
The "Pkt runx" carries the number of packets a cpu
dequeued when forced to be in the "dequeuer" role.
The "avg" for each run is the number of times each
cpu should be a "dequeuer" if the system was fair.

3.0-rc4      (plain)
cpu         Pkt run1        Pkt run2        Pkt run3
================================================
cpu0        21853354        21598183        22199900
cpu1          431058          473476          393159
cpu2          481975          477529          458466
cpu3        23261406        23412299        22894315
avg         11506948        11490372        11486460

3.0-rc4 with patch and default weight 64
cpu 	     Pkt run1        Pkt run2        Pkt run3
================================================
cpu0        13205312        13109359        13132333
cpu1        10189914        10159127        10122270
cpu2        10213871        10124367        10168722
cpu3        13165760        13164767        13096705
avg         11693714        11639405        11630008

As you can see the system is still not perfect but
is a lot better than what it was before...

At the moment we use the old backlog weight, weight_p
which is 64 packets. It seems to be reasonably fine
with that value.
The system could be made more fair if we reduce the
weight_p (as per my presentation), but we are going
to affect the shared backlog weight. Unless deemed
necessary, I think the default value is fine. If not
we could add yet another knob.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-27 00:14:10 -07:00
Paul Gortmaker 56f8a75c17 ip: introduce ip_is_fragment helper inline function
There are enough instances of this:

    iph->frag_off & htons(IP_MF | IP_OFFSET)

that a helper function is probably warranted.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 20:33:34 -07:00
Alexey Dobriyan b7f080cfe2 net: remove mm.h inclusion from netdevice.h
Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually).

To prevent mm.h inclusion via other channels also extract "enum dma_data_direction"
definition into separate header. This tiny piece is what gluing netdevice.h with mm.h
via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h".
Removal of mm.h from scatterlist.h was tried and was found not feasible
on most archs, so the link was cutoff earlier.

Hope people are OK with tiny include file.

Note, that mm_types.h is still dragged in, but it is a separate story.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 19:17:20 -07:00
David S. Miller 9f6ec8d697 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
	drivers/net/wireless/rtlwifi/pci.c
	net/netfilter/ipvs/ip_vs_core.c
2011-06-20 22:29:08 -07:00
Greg Rose c7ac8679be rtnetlink: Compute and store minimum ifinfo dump size
The message size allocated for rtnl ifinfo dumps was limited to
a single page.  This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.

Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2011-06-09 20:38:07 -07:00
Alexey Dobriyan a6b7a40786 net: remove interrupt.h inclusion from netdevice.h
* remove interrupt.g inclusion from netdevice.h -- not needed
* fixup fallout, add interrupt.h and hardirq.h back where needed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-06 22:55:11 -07:00
David S. Miller 3019de124b net: Rework netdev_drivername() to avoid warning.
This interface uses a temporary buffer, but for no real reason.
And now can generate warnings like:

net/sched/sch_generic.c: In function dev_watchdog
net/sched/sch_generic.c:254:10: warning: unused variable drivername

Just return driver->name directly or "".

Reported-by: Connor Hansen <cmdkhh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-06 16:41:33 -07:00
Eric Dumazet 07bd8df5df sch_sfq: fix peek() implementation
Since commit eeaeb068f1 (sch_sfq: allow big packets and be fair),
sfq_peek() can return a different skb that would be normally dequeued by
sfq_dequeue() [ if current slot->allot is negative ]

Use generic qdisc_peek_dequeued() instead of custom implementation, to
get consistent result.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-25 17:55:32 -04:00
Eric Dumazet 8efa885406 sch_sfq: avoid giving spurious NET_XMIT_CN signals
While chasing a possible net_sched bug, I found that IP fragments have
litle chance to pass a congestioned SFQ qdisc :

- Say SFQ qdisc is full because one flow is non responsive.
- ip_fragment() wants to send two fragments belonging to an idle flow.
- sfq_enqueue() queues first packet, but see queue limit reached :
- sfq_enqueue() drops one packet from 'big consumer', and returns
NET_XMIT_CN.
- ip_fragment() cancel remaining fragments.

This patch restores fairness, making sure we return NET_XMIT_CN only if
we dropped a packet from the same flow.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-23 17:36:00 -04:00
Eric Dumazet 3137663dfb net: avoid synchronize_rcu() in dev_deactivate_many
dev_deactivate_many() issues one synchronize_rcu() call after qdiscs set
to noop_qdisc.

This call is here to make sure they are no outstanding qdisc-less
dev_queue_xmit calls before returning to caller.

But in dismantle phase, we dont have to wait, because we wont activate
again the device, and we are going to wait one rcu grace period later in
rollback_registered_many().

After this patch, device dismantle uses one synchronize_net() and one
rcu_barrier() call only, so we have a ~30% speedup and a smaller RTNL
latency.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>,
CC: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-22 21:01:20 -04:00
Linus Torvalds 06f4e926d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
  macvlan: fix panic if lowerdev in a bond
  tg3: Add braces around 5906 workaround.
  tg3: Fix NETIF_F_LOOPBACK error
  macvlan: remove one synchronize_rcu() call
  networking: NET_CLS_ROUTE4 depends on INET
  irda: Fix error propagation in ircomm_lmp_connect_response()
  irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
  irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
  be2net: Kill set but unused variable 'req' in lancer_fw_download()
  irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
  atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
  rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
  pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
  isdn: capi: Use pr_debug() instead of ifdefs.
  tg3: Update version to 3.119
  tg3: Apply rx_discards fix to 5719/5720
  ...

Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
2011-05-20 13:43:21 -07:00
Randy Dunlap 034cfe48d0 networking: NET_CLS_ROUTE4 depends on INET
IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects
IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning
is produced, so fix it by making NET_CLS_ROUTE4 depend on INET.

warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19 19:23:28 -04:00
David S. Miller f06cd54f55 pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
I checked the history and this has been like this since the
beginning of time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19 18:32:55 -04:00
Lai Jiangshan 75ef0368d1 net,act_police,rcu: remove rcu_barrier()
There is no callback of this module maybe queued
since we use kfree_rcu(), we can safely remove the rcu_barrier().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:50:54 -07:00
Lai Jiangshan 5957b1ac52 net,rcu: convert call_rcu(tcf_police_free_rcu) to kfree_rcu()
[PATCH 05/17] net,rcu: convert call_rcu(tcf_police_free_rcu) to kfree_rcu()

The rcu callback tcf_police_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(tcf_police_free_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:50:48 -07:00
Lai Jiangshan f5c8593c10 net,rcu: convert call_rcu(tcf_common_free_rcu) to kfree_rcu()
The rcu callback tcf_common_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(tcf_common_free_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:50:48 -07:00
Eric Dumazet b71d1d426d inet: constify ip headers and in6_addr
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-22 11:04:14 -07:00
David S. Miller 1c01a80cfe Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/smsc911x.c
2011-04-11 13:44:25 -07:00
stephen hemminger 0545a30377 pkt_sched: QFQ - quick fair queue scheduler
This is an implementation of the Quick Fair Queue scheduler developed
by Fabio Checconi. The same algorithm is already implemented in ipfw
in FreeBSD. Fabio had an earlier version developed on Linux, I just
cleaned it up.  Thanks to Eric Dumazet for testing this under load.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-04 11:10:24 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
David S. Miller 5e2b61f784 ipv4: Remove flowi from struct rtable.
The only necessary parts are the src/dst addresses, the
interface indexes, the TOS, and the mark.

The rest is unnecessary bloat, which amounts to nearly
50 bytes on 64-bit.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-04 21:55:31 -08:00
David S. Miller 0a0e9ae1bd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x.h
2011-03-03 21:27:42 -08:00
Eric Dumazet d276055c4e net_sched: reduce fifo qdisc size
Because of various alignements [SLUB / qdisc], we use 512 bytes of
memory for one {p|b}fifo qdisc, instead of 256 bytes on 64bit arches and
192 bytes on 32bit ones.

Move the "u32 limit" inside "struct Qdisc" (no impact on other qdiscs)

Change qdisc_alloc(), first trying a regular allocation before an
oversized one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03 11:10:02 -08:00
Hagen Paul Pfeifer 52bc97470e sched: protocol only needed when CONFIG_NET_CLS_ACT is enabled
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 14:00:23 -08:00
David S. Miller 78776d3f2b sch_netem: Need to include vmalloc.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:48:13 -08:00
Eric Dumazet 26f70e1202 sch_choke: add choke_skb_cb
Better document choke skb->cb[] use, like we did in netem and sfb

This adds a compile time check to make sure we dont exhaust skb->cb[]
space.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:57 -08:00
stephen hemminger 250a65f782 netem: update version and cleanup
Get rid of debug message that are not useful, and enable
the log messages in case of error.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:56 -08:00
stephen hemminger 661b79725f netem: revised correlated loss generator
This is a patch originated with Stefano Salsano and Fabio Ludovici.
It provides several alternative loss models for use with netem.
This patch adds two state machine based loss models.

See: http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:56 -08:00
stephen hemminger 10f6dfcfde Revert "sch_netem: Remove classful functionality"
Many users have wanted the old functionality that was lost
to be able to use pfifo as inner qdisc for netem. The reason that
netem could not be classful with the older API was because of the
limitations of the old dequeue/requeue interface; now that qdisc API has
a peek function, there is no longer a problem with using any
inner qdisc's.

This reverts commit 0220146411.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:55 -08:00
stephen hemminger df173bda26 netem: define NETEM_DIST_MAX
Rather than magic constant in code, expose the maximum size of
packet distribution table in API. In iproute2, q_netem defines
MAX_DIST as 16K already.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:54 -08:00
stephen hemminger 6373a9a286 netem: use vmalloc for distribution table
The netem probability table can be large (up to 64K bytes)
which may be too large to allocate in one contiguous chunk.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:54 -08:00
stephen hemminger 861d7f745f netem: cleanup dump code
Use nla_put_nested to update netlink attribute value.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-24 22:11:53 -08:00
stephen hemminger e0c563101a em_meta: fix sparse warning
gfp_t needs to be cast to integer.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:11:33 -08:00
stephen hemminger ea18fd950e mqprio: cleanups
* make qdisc_ops local
* add sparse annotation about expected unlock/unlock in dump_class_stats
* fix indentation

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:11:32 -08:00
Eric Dumazet e13e02a3c6 net_sched: SFB flow scheduler
This is the Stochastic Fair Blue scheduler, based on work from :

W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.

http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf

This implementation is based on work done by Juliusz Chroboczek

General SFB algorithm can be found in figure 14, page 15:

B[l][n] : L x N array of bins (L levels, N bins per level)
enqueue()
Calculate hash function values h{0}, h{1}, .. h{L-1}
Update bins at each level
for i = 0 to L - 1
   if (B[i][h{i}].qlen > bin_size)
      B[i][h{i}].p_mark += p_increment;
   else if (B[i][h{i}].qlen == 0)
      B[i][h{i}].p_mark -= p_decrement;
p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
if (p_min == 1.0)
    ratelimit();
else
    mark/drop with probabilty p_min;

I did the adaptation of Juliusz code to meet current kernel standards,
and various changes to address previous comments :

http://thread.gmane.org/gmane.linux.network/90225
http://thread.gmane.org/gmane.linux.network/90375

Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
we can use an external flow classifier if wanted.

tc qdisc add dev $DEV parent 1:11 handle 11:  \
        est 0.5sec 2sec sfb limit 128

tc filter add dev $DEV protocol ip parent 11: handle 3 \
        flow hash keys dst divisor 1024

Notes:

1) SFB default child qdisc is pfifo_fast. It can be changed by another
qdisc but a child qdisc MUST not drop a packet previously queued. This
is because SFB needs to handle a dequeued packet in order to maintain
its virtual queue states. pfifo_head_drop or CHOKe should not be used.

2) ECN is enabled by default, unlike RED/CHOKe/GRED

With help from Patrick McHardy & Andi Kleen

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:05:11 -08:00
stephen hemminger 86fce3ba1e cls_u32: fix sparse warnings
The variable _data is used in asm-generic to define sections
which causes sparse warnings, so just rename the variable.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 11:22:33 -08:00
Eric W. Biederman 5f04d5068a net: Fix more stale on-stack list_head objects.
From: Eric W. Biederman <ebiederm@xmission.com>

In the beginning with batching unreg_list was a list that was used only
once in the lifetime of a network device (I think).  Now we have calls
using the unreg_list that can happen multiple times in the life of a
network device like dev_deactivate and dev_close that are also using the
unreg_list.  In addition in unregister_netdevice_queue we also do a
list_move because for devices like veth pairs it is possible that
unregister_netdevice_queue will be called multiple times.

So I think the change below to fix dev_deactivate which Eric D. missed
will fix this problem.  Now to go test that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20 11:49:45 -08:00
Ben Hutchings ac7100ba93 sch_mqprio: Always set num_tc to 0 in mqprio_destroy()
All the cleanup code in mqprio_destroy() is currently conditional on
priv->qdiscs being non-null, but that condition should only apply to
the per-queue qdisc cleanup.  We should always set the number of
traffic classes back to 0 here.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2011-02-14 19:07:58 +00:00
David S. Miller cdfb74d4c2 sch_choke: Need linux/vmalloc.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02 23:06:31 -08:00
stephen hemminger 45e144339a sched: CHOKe flow scheduler
CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
packet scheduler based on the Random Exponential Drop (RED) algorithm.

The core idea is:
  For every packet arrival:
  	Calculate Qave
	if (Qave < minth)
	     Queue the new packet
	else
	     Select randomly a packet from the queue
	     if (both packets from same flow)
	     then Drop both the packets
	     else if (Qave > maxth)
	          Drop packet
	     else
	       	  Admit packet with proability p (same as RED)

See also:
  Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
   queue management scheme for approximating fair bandwidth allocation",
  Proceeding of INFOCOM'2000, March 2000.

Help from:
     Eric Dumazet <eric.dumazet@gmail.com>
     Patrick McHardy <kaber@trash.net>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02 20:52:42 -08:00
stephen hemminger 119b3d3869 sfq: deadlock in error path
The change to allow divisor to be a parameter (in 2.6.38-rc1)
 commit 817fb15dfd
introduced a possible deadlock caught by sparse.

The scheduler tree lock was left locked in the case of an incorrect
divisor value. Simplest fix is to move test outside of lock
which also solves problem of partial update.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02 20:51:20 -08:00
Eric Dumazet 144ce879b0 net_sched: sch_mqprio: dont leak kernel memory
mqprio_dump() should make sure all fields of struct tc_mqprio_qopt are
initialized.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-26 13:15:29 -08:00
David S. Miller 5bdc22a565 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/sched/sch_hfsc.c
	net/sched/sch_htb.c
	net/sched/sch_tbf.c
2011-01-24 14:09:35 -08:00
David S. Miller e92427b289 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2011-01-24 13:17:06 -08:00
Eric Dumazet 23624935e0 net_sched: TCQ_F_CAN_BYPASS generalization
Now qdisc stab is handled before TCQ_F_CAN_BYPASS test in
__dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to other qdiscs
than pfifo_fast : pfifo, bfifo, pfifo_head_drop and sfq

SFQ is special because it can have external classifiers, and in these
cases, we cannot bypass queue discipline (packet could be dropped by
classifier) without admin asking it, or further changes.

Its worth doing this, especially for SFQ, avoiding dirtying memory in
case no packets are already waiting in queue.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-21 16:26:09 -08:00
Eric Dumazet 9190b3b320 net_sched: accurate bytes/packets stats/rates
In commit 44b8288308 (net_sched: pfifo_head_drop problem), we fixed
a problem with pfifo_head drops that incorrectly decreased
sch->bstats.bytes and sch->bstats.packets

Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
previously enqueued packet, and bstats cannot be changed, so
bstats/rates are not accurate (over estimated)

This patch changes the qdisc_bstats updates to be done at dequeue() time
instead of enqueue() time. bstats counters no longer account for dropped
frames, and rates are more correct, since enqueue() bursts dont have
effect on dequeue() rate.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 23:31:33 -08:00