.TH CBQ 8 "8 December 2001" "iproute2" "Linux"
.SH NAME
CBQ \- Class Based Queueing
.SH SYNOPSIS
.B tc qdisc ... dev
dev
.B  ( parent
classid
.B | root) [ handle
major:
.B ] cbq avpkt
bytes
.B bandwidth
rate
.B [ cell
bytes
.B ] [ ewma
log
.B ] [ mpu
bytes
.B ]

.B tc class ... dev
dev
.B parent
major:[minor]
.B [ classid
major:minor
.B ] cbq allot
bytes
.B [ bandwidth
rate
.B ] [ rate
rate
.B ] prio
priority
.B [ weight
weight
.B ] [ minburst
packets
.B ] [ maxburst
packets
.B ] [ ewma
log
.B ] [ cell
bytes
.B ] avpkt
bytes
.B [ mpu
bytes
.B ] [ bounded isolated ] [ split
handle
.B & defmap
defmap
.B ] [ estimator
interval timeconstant
.B ]

.SH DESCRIPTION
Class Based Queueing is a classful qdisc that implements a rich
linksharing hierarchy of classes.  It contains shaping elements as
well as prioritizing capabilities.  Shaping is performed using link
idle time calculations based on the timing of dequeue events and
underlying link bandwidth.

.SH SHAPING ALGORITHM
Shaping is done using link idle time calculations, and actions taken if
these calculations deviate from set limits.

When shaping a 10mbit/s connection to 1mbit/s, the link will
be idle 90% of the time. If it isn't, it needs to be throttled so that it
IS idle 90% of the time.
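.P
As a worked illustration of that figure (the 1000 byte packet size is an
assumption chosen for round numbers, not a value CBQ prescribes):
.P
.nf
    packet size                  1000 bytes = 8000 bits
    time on the wire at 10mbit   8000 / 10,000,000 = 0.8 ms
    packet spacing at 1mbit      8000 /  1,000,000 = 8.0 ms
    idle time per packet         8.0 - 0.8 = 7.2 ms  (90% of 8.0 ms)
.fi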
.P
From the kernel's perspective, this is hard to measure, so CBQ instead
derives the idle time from the number of microseconds (in fact, jiffies)
that elapse between requests from the device driver for more data. Combined
with the knowledge of packet sizes, this is used to approximate how full or
empty the link is.

This is a rather roundabout approach and doesn't always arrive at proper
results. For example, what is the actual link speed of an interface
that is not really able to transmit the full 100mbit/s of data,
perhaps because of a badly implemented driver? A PCMCIA network card
will also never achieve 100mbit/s because of the way the bus is
designed - again, how do we calculate the idle time?

The physical link bandwidth may be ill-defined in the case of not-quite-real
network devices like PPP over Ethernet or PPTP over TCP/IP. The effective
bandwidth in that case is probably determined by the efficiency of pipes
to userspace - which is not defined.

During operations, the effective idletime is measured using an
exponentially weighted moving average (EWMA), which considers recent
packets to be exponentially more important than past ones. The Unix
loadaverage is calculated in the same way.

The calculated idle time is subtracted from the EWMA measured one;
the resulting number is called 'avgidle'. A perfectly loaded link has
an avgidle of zero: packets arrive exactly at the calculated
interval.
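.P
A rough sketch of that average, in the spirit of the Floyd and Jacobson
paper listed under SOURCES. The names idle and W are illustrative, and the
relation between W and the
.B ewma
parameter is stated here as an assumption about the implementation:
.P
.nf
    idle    = measured gap between packets - calculated gap
    avgidle = (1 - W) * avgidle + W * idle
    W       = 1 / 2^ewma           (ewma 5 would give W = 1/32)
.fi
.P
With a small W each new measurement moves avgidle only slightly, so short
bursts are smoothed out; with a large W avgidle follows the most recent
packets closely.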
.P
An overloaded link has a negative avgidle and if it gets too negative,
CBQ throttles and is then 'overlimit'.

Conversely, an idle link might amass a huge avgidle, which would then
allow infinite bandwidths after a few hours of silence. To prevent
this, avgidle is capped at
.BR maxidle .

If overlimit, in theory, the CBQ could throttle itself for exactly the
amount of time that was calculated to pass between packets, and then
pass one packet, and throttle again. Due to timer resolution constraints,
this may not be feasible; see the
.B minburst
parameter below.

.SH CLASSIFICATION
Within one CBQ instance many classes may exist. Each of these classes
contains another qdisc, by default
.BR tc-pfifo (8).

When enqueueing a packet, CBQ starts at the root and uses various methods to
determine which class should receive the data. If a verdict is reached, this
process is repeated for the recipient class which might have further
means of classifying traffic to its children, if any.

CBQ has the following methods available to classify a packet to any child
classes.
.TP
(i)
.B skb->priority class encoding.
Can be set from userspace by an application with the
.B SO_PRIORITY
setsockopt.
The
.B skb->priority class encoding
only applies if the skb->priority holds a major:minor handle of an existing
class within this qdisc.
.TP
(ii)
tc filters attached to the class.
.TP
(iii)
The defmap of a class, as set with the
.B split & defmap
parameters. The defmap may contain instructions for each possible Linux packet
priority.

.P
Each class also has a
.BR level .
Leaf nodes, attached to the bottom of the class hierarchy, have a level of 0.
.SH CLASSIFICATION ALGORITHM
Classification is a loop, which terminates when a leaf class is found. At any
point the loop may jump to the fallback algorithm.

The loop consists of the following steps:
.TP
(i)
If the packet is generated locally and has a valid classid encoded within its
.B skb->priority,
choose it and terminate.
.TP
(ii)
Consult the tc filters, if any, attached to this child. If these return
a class which is not a leaf class, restart the loop from the class returned.
If it is a leaf, choose it and terminate.
.TP
(iii)
If the tc filters did not return a class, but did return a classid,
try to find a class with that id within this qdisc.
Check if the found class is of a lower
.B level
than the current class. If so, and the returned class is not a leaf node,
restart the loop at the found class. If it is a leaf node, terminate.
If we found an upward reference to a higher level, enter the fallback
algorithm.
.TP
(iv)
If the tc filters did not return a class, nor a valid reference to one,
consider the minor number of the reference to be the priority. Retrieve
a class from the defmap of this class for the priority. If this did not
contain a class, consult the defmap of this class for the
.B BEST_EFFORT
class. If this is an upward reference, or no
.B BEST_EFFORT
class was defined,
enter the fallback algorithm. If a valid class was found, and it is not a
leaf node, restart the loop at this class. If it is a leaf, choose it and
terminate. If
neither the priority distilled from the classid, nor the
.B BEST_EFFORT
priority yielded a class, enter the fallback algorithm.
.P
The fallback algorithm resides outside of the loop and is as follows.
.TP
(i)
Consult the defmap of the class at which the jump to fallback occurred. If
the defmap contains a class for the
.B
priority
of the class (which is related to the TOS field), choose this class and
terminate.
.TP
(ii)
Consult the map for a class for the
.B BEST_EFFORT
priority. If found, choose it, and terminate.
.TP
(iii)
Choose the class at which the break out to the fallback algorithm occurred. Terminate.
.P
The packet is enqueued to the class which was chosen when either algorithm
terminated. It is therefore possible for a packet to be enqueued *not* at a
leaf node, but in the middle of the hierarchy.
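.P
As an illustration of the tc filter step, a filter can be attached so that
the loop has something to consult. This is only a sketch: the device name
.BR eth0 ,
the handles 1:0 and 1:3 and the matched port are assumptions, and class 1:3
is presumed to exist already.
.P
.nf
    # direct IP traffic with source port 80 to class 1:3
    tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 \e
        match ip sport 80 0xffff flowid 1:3
.fi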
.SH LINK SHARING ALGORITHM
When dequeuing for sending to the network device, CBQ decides which of its
classes will be allowed to send. It does so with a Weighted Round Robin process
in which each class with packets gets a chance to send in turn. The WRR process
starts by asking the highest priority classes (lowest numerically -
highest semantically) for packets, and will continue to do so until they
have no more data to offer, in which case the process repeats for lower
priorities.

.B CERTAINTY ENDS HERE, ANK PLEASE HELP

Classes are not allowed to send at length though - they can only dequeue a
configurable amount of data during each round.

If a class is about to go overlimit, and it is not
.B bounded
it will try to borrow avgidle from siblings that are not
.B isolated.
This process is repeated from the bottom upwards. If a class is unable
to borrow enough avgidle to send a packet, it is throttled and not asked
for a packet for enough time for the avgidle to increase above zero.

.B I REALLY NEED HELP FIGURING THIS OUT. REST OF DOCUMENT IS PRETTY CERTAIN
.B AGAIN.

.SH QDISC
The root qdisc of a CBQ class tree has the following parameters:
.TP
parent major:minor | root
This mandatory parameter determines the place of the CBQ instance, either at the
.B root
of an interface or within an existing class.
.TP
handle major:
Like all other qdiscs, the CBQ can be assigned a handle. Should consist only
of a major number, followed by a colon. Optional.
.TP
avpkt bytes
For calculations, the average packet size must be known. It is silently capped
at a minimum of 2/3 of the interface MTU. Mandatory.
.TP
bandwidth rate
To determine the idle time, CBQ must know the bandwidth of your underlying
physical interface, or parent qdisc. This is a vital parameter, more about it
later. Mandatory.
.TP
cell
The cell size determines the granularity of packet transmission time calculations. Has a sensible default.
.TP
mpu
A zero sized packet may still take time to transmit. This value is the lower
cap for packet transmission time calculations - packets smaller than this value
are still deemed to have this size. Defaults to zero.
.TP
ewma log
When CBQ needs to measure the average idle time, it does so using an
Exponentially Weighted Moving Average which smooths out measurements into
a moving average. The EWMA LOG determines how much smoothing occurs. Defaults
to 5. Lower values imply greater sensitivity. Must be between 0 and 31.
.P
A CBQ qdisc does not shape of its own accord. It only needs to know certain
parameters about the underlying link. Actual shaping is done in classes.
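.P
A minimal sketch of such a root qdisc, assuming a 100mbit/s interface named
.B eth0
(the device name, the handle and the values are illustrative, not
recommendations):
.P
.nf
    tc qdisc add dev eth0 root handle 1: cbq avpkt 1000 bandwidth 100mbit
.fi
.P
Only the two mandatory parameters are given here; cell, mpu and ewma are left
at their defaults.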
.SH CLASSES
Classes have a host of parameters to configure their operation.
.TP
parent major:minor
Place of this class within the hierarchy. If attached directly to a qdisc
and not to another class, minor can be omitted. Mandatory.
.TP
classid major:minor
Like qdiscs, classes can be named. The major number must be equal to the
major number of the qdisc to which it belongs. Optional, but needed if this
class is going to have children.
.TP
weight weight
When dequeuing to the interface, classes are tried for traffic in a
round-robin fashion. Classes with a higher configured rate will generally
have more traffic to offer during each round, so it makes sense to allow
them to dequeue more traffic. All weights under a class are normalized, so
only the ratios matter. Defaults to the configured rate, unless the priority
of this class is maximal, in which case it is set to 1.
.TP
allot bytes
Allot specifies how many bytes a qdisc can dequeue
during each round of the process. This parameter is weighted using the
renormalized class weight described above.
.TP
prio priority
In the round-robin process, classes with the lowest priority field are tried
for packets first. Mandatory.
.TP
rate rate
Maximum rate this class and all its children combined can send at. Mandatory.
.TP
bandwidth rate
This is different from the bandwidth specified when creating a CBQ qdisc. Only
used to determine maxidle and offtime, which are only calculated when
specifying maxburst or minburst. Mandatory if specifying maxburst or minburst.
.TP
maxburst
This number of packets is used to calculate maxidle so that when
avgidle is at maxidle, this number of average packets can be burst
before avgidle drops to 0. Set it higher to be more tolerant of
bursts. You can't set maxidle directly, only via this parameter.
.TP
minburst
As mentioned before, CBQ needs to throttle in case of
overlimit. The ideal solution is to do so for exactly the calculated
idle time, and pass 1 packet. However, Unix kernels generally have a
hard time scheduling events shorter than 10ms, so it is better to
throttle for a longer period, and then pass minburst packets in one
go, and then sleep minburst times longer.

The time to wait is called the offtime. Higher values of minburst lead
to more accurate shaping in the long term, but to bigger bursts at
millisecond timescales.
.TP
minidle
If avgidle is below 0, we are overlimit and need to wait until
avgidle will be big enough to send one packet. To prevent a sudden
burst from shutting down the link for a prolonged period of time,
avgidle is reset to minidle if it gets too low.

Minidle is specified in negative microseconds, so 10 means that
avgidle is capped at -10us.
.TP
bounded
Signifies that this class will not borrow bandwidth from its siblings.
.TP
isolated
Means that this class will not lend bandwidth to its siblings.
.TP
split major:minor & defmap bitmap[/bitmap]
If consulting filters attached to a class did not give a verdict,
CBQ can also classify based on the packet's priority. There are 16
priorities available, numbered from 0 to 15.

The defmap specifies which priorities this class wants to receive,
specified as a bitmap. The Least Significant Bit corresponds to priority
zero. The
.B split
parameter tells CBQ at which class the decision must be made, which should
be a (grand)parent of the class you are adding.

As an example, 'tc class add ... classid 10:1 cbq ... split 10:0 defmap c0'
configures class 10:0 to send packets with priorities 6 and 7 to 10:1.

The complementary configuration would then
be: 'tc class add ... classid 10:2 cbq ... split 10:0 defmap 3f'
which would send all packets with priorities 0, 1, 2, 3, 4 and 5 to 10:2.
.TP
estimator interval timeconstant
CBQ can measure how much bandwidth each class is using, which tc filters
can use to classify packets with. In order to determine the bandwidth
it uses a very simple estimator that measures once every
.B interval
microseconds how much traffic has passed. This again is an EWMA, for which
the time constant can be specified, also in microseconds. The
.B time constant
corresponds to the sluggishness of the measurement or, conversely, to the
sensitivity of the average to short bursts. Higher values mean less
sensitivity.
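.P
The class parameters might be combined as follows. This is a sketch only:
the device
.BR eth0 ,
the handles and all rates are assumptions, and it presumes that a CBQ qdisc
with handle 1: already exists on the device.
.P
.nf
    # a bounded 6mbit/s class directly under the qdisc
    tc class add dev eth0 parent 1: classid 1:1 cbq bandwidth 100mbit \e
        rate 6mbit weight 0.6mbit prio 8 allot 1514 cell 8 maxburst 20 \e
        avpkt 1000 bounded

    # a child class that asks 1:0 for priorities 6 and 7 (bitmap c0)
    tc class add dev eth0 parent 1:1 classid 1:2 cbq bandwidth 100mbit \e
        rate 1mbit weight 0.1mbit prio 3 allot 1514 cell 8 maxburst 20 \e
        avpkt 1000 split 1:0 defmap c0
.fi
.P
The defmap bitmap is the sum of 2^priority over the wanted priorities:
2^6 + 2^7 = 64 + 128 = 192, which is c0 in hexadecimal.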
.SH SOURCES
.TP
o
Sally Floyd and Van Jacobson, "Link-sharing and Resource
Management Models for Packet Networks",
IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995
.TP
o
Sally Floyd, "Notes on CBQ and Guaranteed Service", 1995
.TP
o
Sally Floyd, "Notes on Class-Based Queueing: Setting
Parameters", 1996
.TP
o
Sally Floyd and Michael Speer, "Experimental Results
for Class-Based Queueing", 1998, not published.

.SH SEE ALSO
.BR tc (8)

.SH AUTHOR
Alexey N. Kuznetsov, <kuznet@ms2.inr.ac.ru>. This manpage maintained by
bert hubert <ahu@ds9a.nl>