Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

pkt_sched: sch_qfq: remove a source of high packet delay/jitter

QFQ+ inherits from QFQ a design choice that may cause high packet
delay/jitter and severe short-term unfairness. Like QFQ, QFQ+ uses a
special quantity, the system virtual time, to track the service
provided by the ideal system it approximates. When a packet is
dequeued, this quantity must be incremented by the size of the packet,
divided by the sum of the weights of the aggregates waiting to be
served. Tracking this sum correctly is a non-trivial task, because, to
preserve tight service guarantees, the decrement of this sum must be
delayed in a special way [1]: this sum can be decremented only after
its value would also have decreased in the ideal system approximated
by QFQ+. For efficiency, QFQ+ keeps track only of the 'instantaneous'
weight sum, increased and decreased immediately as the weight of an
aggregate changes, and as an aggregate is created or destroyed (which,
in turn, happens as a consequence of some class being
created/destroyed/changed). However, to avoid the problems caused to
service guarantees by these immediate decreases, QFQ+ increments the
system virtual time using the maximum value allowed for the weight
sum, 2^10, in place of the dynamic, instantaneous value. The
instantaneous value of the weight sum is used only to check whether a
request for a weight increase or a class creation can be satisfied.
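
The virtual-time update described above can be sketched in fixed-point
arithmetic as follows. This is only an illustration: FRAC_BITS and
ONE_FP mirror the names used in sch_qfq.c, while SKETCH_MAX_WSUM (set to
the 2^10 quoted above), vtime_step() and vtime_step_max() are
hypothetical names introduced here.

```c
#include <stdint.h>

#define FRAC_BITS       30                        /* fixed point arithmetic */
#define ONE_FP          (UINT64_C(1) << FRAC_BITS)
#define SKETCH_MAX_WSUM 1024u                     /* assumption: the 2^10 quoted in the text */

/* Virtual-time increment for a packet of 'len' bytes: len / wsum in fixed point. */
static uint64_t vtime_step(uint32_t len, uint32_t wsum)
{
	return (uint64_t)len * (ONE_FP / wsum);   /* wsum must be non-zero */
}

/* Same increment when the maximum weight sum replaces the actual one. */
static uint64_t vtime_step_max(uint32_t len)
{
	return vtime_step(len, SKETCH_MAX_WSUM);
}
```

With a small actual weight sum, vtime_step() advances the virtual time
much faster than vtime_step_max() does for the same packet, which is the
gap discussed in the rest of this message.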

Unfortunately, the problems caused by this choice are worse than the
temporary degradation of the service guarantees that may occur, when a
class is changed or destroyed, if the instantaneous value of the
weight sum were used to update the system virtual time. In fact, the
fraction of the link bandwidth guaranteed by QFQ+ to each aggregate is
equal to the ratio between the weight of the aggregate and the sum of
the weights of the competing aggregates. The packet delay guaranteed
to the aggregate is instead inversely proportional to the guaranteed
bandwidth. By using the maximum possible value, and not the actual
value of the weight sum, QFQ+ provides each aggregate with the worst
possible service guarantees, and not with service guarantees related
to the actual set of competing aggregates. To see the consequences of
this fact, consider the following simple example.
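
As a rough worked illustration of the two proportionalities stated
above: the symbols below are introduced here only for illustration (w_i
is the weight of aggregate i, W the weight sum used by the scheduler,
phi_i the guaranteed bandwidth fraction, d_i the guaranteed delay), and
the numbers assume an aggregate of weight 10 in a set whose weights sum
to 20.

```latex
\phi_i = \frac{w_i}{W}, \qquad d_i \propto \frac{1}{\phi_i} = \frac{W}{w_i}

% Actual weight sum, W = 20:
\phi_i = \frac{10}{20} = 0.5

% Maximum weight sum, W = 2^{10}:
\phi_i = \frac{10}{1024} \approx 0.0098, \qquad
\frac{d_i^{\max}}{d_i^{\mathrm{actual}}} = \frac{1024}{20} = 51.2
```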

Suppose that only the following aggregates are backlogged, i.e., that
only the classes in the following aggregates have packets to transmit:
one aggregate with weight 10, say A, and ten aggregates with weight 1,
say B1, B2, ..., B10. In particular, suppose that these aggregates are
always backlogged. Given the weight distribution, the smoothest and
fairest service order would be:
A B1 A B2 A B3 A B4 A B5 A B6 A B7 A B8 A B9 A B10 A B1 A B2 ...

QFQ+ would provide exactly this optimal service if it used the actual
value for the weight sum instead of the maximum possible value, i.e.,
20 (10 + 10*1) instead of 2^10. In contrast, since QFQ+ uses the latter
value, it
serves aggregates as follows (easy to prove and to reproduce
experimentally):
A B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 A A A A A A A A A A B1 B2 ... B10 A A ...

By replacing 10 with N in the above example, and by increasing N, one
can increase at will the maximum packet delay and the jitter
experienced by the classes in aggregate A.
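
The two service orders in the example can be reproduced with a small
simulation. The sketch below is a simplified, exact WF2Q+-style
scheduler, not the QFQ+ code: it assumes equal-size packets, flows that
are always backlogged, and a weight sum that is passed in explicitly.
For the example above, the actual sum of the weights is 10 + 10*1 = 20.
All names (sim_service_order, sim_longest_wait_for_a, SIM_MAX_FLOWS) are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_FLOWS 64 /* arbitrary cap for this sketch */

/*
 * Flow 0 is A (weight n); flows 1..n are B1..Bn (weight 1). 'wsum' is
 * the weight sum used to advance the virtual time V. Timestamps are
 * scaled by n * wsum so that every increment is an exact integer.
 * Fills order[0..count-1] with the index of the flow served at each
 * step. Returns 0 on success, -1 on unsupported parameters.
 */
static int sim_service_order(unsigned n, unsigned wsum,
			     unsigned *order, size_t count)
{
	uint64_t S[SIM_MAX_FLOWS], F[SIM_MAX_FLOWS], step[SIM_MAX_FLOWS];
	uint64_t V = 0;
	unsigned nflows = n + 1, i;
	size_t k;

	if (n == 0 || wsum == 0 || nflows > SIM_MAX_FLOWS)
		return -1;

	for (i = 0; i < nflows; i++) {
		/* Per-packet timestamp increment, i.e. L / weight (scaled). */
		step[i] = (i == 0) ? wsum : (uint64_t)n * wsum;
		S[i] = 0;
		F[i] = step[i];
	}

	for (k = 0; k < count; k++) {
		uint64_t min_S = S[0];
		int best = -1;

		for (i = 1; i < nflows; i++)
			if (S[i] < min_S)
				min_S = S[i];
		/* If no flow is eligible, V jumps to the smallest start time. */
		if (V < min_S)
			V = min_S;

		/* Among eligible flows (S <= V), pick the smallest finish time. */
		for (i = 0; i < nflows; i++)
			if (S[i] <= V && (best < 0 || F[i] < F[best]))
				best = (int)i;

		order[k] = (unsigned)best;
		V += n;			/* L / wsum, scaled by n * wsum */
		S[best] = F[best];
		F[best] += step[best];
	}
	return 0;
}

/* Longest run of consecutive services that do not go to A (flow 0). */
static size_t sim_longest_wait_for_a(const unsigned *order, size_t count)
{
	size_t run = 0, longest = 0, k;

	for (k = 0; k < count; k++) {
		run = (order[k] == 0) ? 0 : run + 1;
		if (run > longest)
			longest = run;
	}
	return longest;
}
```

Calling sim_service_order(10, 20, ...) yields the alternating order
A B1 A B2 ..., while sim_service_order(10, 1024, ...) yields A, then
B1..B10, then ten consecutive services of A. Raising n lengthens the
longest wait for A in the second case accordingly.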

This patch addresses this issue by just using the above
'instantaneous' value of the weight sum, instead of the maximum
possible value, when updating the system virtual time. After the
instantaneous weight sum is decreased, QFQ+ may deviate from the ideal
service for a time interval on the order of the time to serve one
maximum-size packet for each backlogged class. The worst-case extent
of the deviation exhibited by QFQ+ during this time interval [1] is
basically the same as that of the deviation described above (but,
without this patch, QFQ+ suffers from such a deviation all the time).
Finally, this patch modifies the comment to the function
qfq_slot_insert, to make it consistent with the fact that the weight
sum used by QFQ+ can now be lower than the maximum possible value.
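
The bookkeeping this change relies on can be sketched as below: the
inverse of the instantaneous weight sum is cached whenever the sum
changes, and the cached value is used on each dequeue. The structure and
function names (sketch_sched, sketch_set_wsum, and so on) are
hypothetical; only wsum, iwsum, V, FRAC_BITS and ONE_FP mirror names
from sch_qfq.c. Guarding every update against a zero sum is a choice
made for this sketch.

```c
#include <stdint.h>

#define FRAC_BITS 30                          /* fixed point arithmetic */
#define ONE_FP    (UINT64_C(1) << FRAC_BITS)

struct sketch_sched {
	uint32_t wsum;   /* instantaneous weight sum */
	uint32_t iwsum;  /* cached ONE_FP / wsum */
	uint64_t V;      /* system virtual time */
};

/* Store a new weight sum and refresh the cached inverse. */
static void sketch_set_wsum(struct sketch_sched *q, uint32_t new_wsum)
{
	q->wsum = new_wsum;
	if (q->wsum != 0)                     /* keep the old inverse if the sum is 0 */
		q->iwsum = (uint32_t)(ONE_FP / q->wsum);
}

/* A class or aggregate of weight w is added. */
static void sketch_add_weight(struct sketch_sched *q, uint32_t w)
{
	sketch_set_wsum(q, q->wsum + w);
}

/* A class or aggregate of weight w goes away (w must not exceed wsum). */
static void sketch_remove_weight(struct sketch_sched *q, uint32_t w)
{
	sketch_set_wsum(q, q->wsum - w);
}

/* Advance the virtual time after dequeuing a packet of 'len' bytes. */
static void sketch_account_dequeue(struct sketch_sched *q, uint32_t len)
{
	q->V += (uint64_t)len * q->iwsum;
}
```

Caching the inverse keeps the per-packet path free of divisions, at the
cost of one division each time the weight sum changes.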

[1] P. Valente, "Extending WF2Q+ to support a dynamic traffic mix",
Proceedings of AAA-IDEA'05, June 2005.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Paolo Valente and committed by David S. Miller
87f40dd6 093b9c71

+56 -29
net/sched/sch_qfq.c
···
 
 #define FRAC_BITS 30 /* fixed point arithmetic */
 #define ONE_FP (1UL << FRAC_BITS)
-#define IWSUM (ONE_FP/QFQ_MAX_WSUM)
 
 #define QFQ_MTU_SHIFT 16 /* to support TSO/GSO */
 #define QFQ_MIN_LMAX 512 /* see qfq_slot_insert */
···
 	struct qfq_aggregate *in_serv_agg; /* Aggregate being served. */
 	u32 num_active_agg; /* Num. of active aggregates */
 	u32 wsum; /* weight sum */
+	u32 iwsum; /* inverse weight sum */
 
 	unsigned long bitmaps[QFQ_MAX_STATE]; /* Group bitmaps. */
 	struct qfq_group groups[QFQ_MAX_INDEX + 1]; /* The groups. */
···
 
 	q->wsum +=
 		(int) agg->class_weight * (new_num_classes - agg->num_classes);
+	q->iwsum = ONE_FP / q->wsum;
 
 	agg->num_classes = new_num_classes;
 }
···
 {
 	if (!hlist_unhashed(&agg->nonfull_next))
 		hlist_del_init(&agg->nonfull_next);
+	q->wsum -= agg->class_weight;
+	if (q->wsum != 0)
+		q->iwsum = ONE_FP / q->wsum;
+
 	if (q->in_serv_agg == agg)
 		q->in_serv_agg = qfq_choose_next_agg(q);
 	kfree(agg);
···
 	}
 }
 
-
 /*
- * The index of the slot in which the aggregate is to be inserted must
- * not be higher than QFQ_MAX_SLOTS-2. There is a '-2' and not a '-1'
- * because the start time of the group may be moved backward by one
- * slot after the aggregate has been inserted, and this would cause
- * non-empty slots to be right-shifted by one position.
+ * The index of the slot in which the input aggregate agg is to be
+ * inserted must not be higher than QFQ_MAX_SLOTS-2. There is a '-2'
+ * and not a '-1' because the start time of the group may be moved
+ * backward by one slot after the aggregate has been inserted, and
+ * this would cause non-empty slots to be right-shifted by one
+ * position.
  *
- * If the weight and lmax (max_pkt_size) of the classes do not change,
- * then QFQ+ does meet the above contraint according to the current
- * values of its parameters. In fact, if the weight and lmax of the
- * classes do not change, then, from the theory, QFQ+ guarantees that
- * the slot index is never higher than
- * 2 + QFQ_MAX_AGG_CLASSES * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) *
- * (QFQ_MAX_WEIGHT/QFQ_MAX_WSUM) = 2 + 8 * 128 * (1 / 64) = 18
+ * QFQ+ fully satisfies this bound to the slot index if the parameters
+ * of the classes are not changed dynamically, and if QFQ+ never
+ * happens to postpone the service of agg unjustly, i.e., it never
+ * happens that the aggregate becomes backlogged and eligible, or just
+ * eligible, while an aggregate with a higher approximated finish time
+ * is being served. In particular, in this case QFQ+ guarantees that
+ * the timestamps of agg are low enough that the slot index is never
+ * higher than 2. Unfortunately, QFQ+ cannot provide the same
+ * guarantee if it happens to unjustly postpone the service of agg, or
+ * if the parameters of some class are changed.
  *
- * When the weight of a class is increased or the lmax of the class is
- * decreased, a new aggregate with smaller slot size than the original
- * parent aggregate of the class may happen to be activated. The
- * activation of this aggregate should be properly delayed to when the
- * service of the class has finished in the ideal system tracked by
- * QFQ+. If the activation of the aggregate is not delayed to this
- * reference time instant, then this aggregate may be unjustly served
- * before other aggregates waiting for service. This may cause the
- * above bound to the slot index to be violated for some of these
- * unlucky aggregates.
+ * As for the first event, i.e., an out-of-order service, the
+ * upper bound to the slot index guaranteed by QFQ+ grows to
+ * 2 +
+ * QFQ_MAX_AGG_CLASSES * ((1<<QFQ_MTU_SHIFT)/QFQ_MIN_LMAX) *
+ * (current_max_weight/current_wsum) <= 2 + 8 * 128 * 1.
+ *
+ * The following function deals with this problem by backward-shifting
+ * the timestamps of agg, if needed, so as to guarantee that the slot
+ * index is never higher than QFQ_MAX_SLOTS-2. This backward-shift may
+ * cause the service of other aggregates to be postponed, yet the
+ * worst-case guarantees of these aggregates are not violated. In
+ * fact, in case of no out-of-order service, the timestamps of agg
+ * would have been even lower than they are after the backward shift,
+ * because QFQ+ would have guaranteed a maximum value equal to 2 for
+ * the slot index, and 2 < QFQ_MAX_SLOTS-2. Hence the aggregates whose
+ * service is postponed because of the backward-shift would have
+ * however waited for the service of agg before being served.
+ *
+ * The other event that may cause the slot index to be higher than 2
+ * for agg is a recent change of the parameters of some class. If the
+ * weight of a class is increased or the lmax (max_pkt_size) of the
+ * class is decreased, then a new aggregate with smaller slot size
+ * than the original parent aggregate of the class may happen to be
+ * activated. The activation of this aggregate should be properly
+ * delayed to when the service of the class has finished in the ideal
+ * system tracked by QFQ+. If the activation of the aggregate is not
+ * delayed to this reference time instant, then this aggregate may be
+ * unjustly served before other aggregates waiting for service. This
+ * may cause the above bound to the slot index to be violated for some
+ * of these unlucky aggregates.
  *
  * Instead of delaying the activation of the new aggregate, which is
- * quite complex, the following inaccurate but simple solution is used:
- * if the slot index is higher than QFQ_MAX_SLOTS-2, then the
- * timestamps of the aggregate are shifted backward so as to let the
- * slot index become equal to QFQ_MAX_SLOTS-2.
+ * quite complex, the above-discussed capping of the slot index is
+ * used to handle also the consequences of a change of the parameters
+ * of a class.
  */
 static void qfq_slot_insert(struct qfq_group *grp, struct qfq_aggregate *agg,
 			    u64 roundedS)
···
 	else
 		in_serv_agg->budget -= len;
 
-	q->V += (u64)len * IWSUM;
+	q->V += (u64)len * q->iwsum;
 	pr_debug("qfq dequeue: len %u F %lld now %lld\n",
 		 len, (unsigned long long) in_serv_agg->F,
 		 (unsigned long long) q->V);
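
The capping of the slot index described in the rewritten comment can be
sketched as follows. The body of qfq_slot_insert is not part of this
diff, so this is an illustration of the backward shift under stated
assumptions and not the kernel function: SKETCH_MAX_SLOTS is an assumed
stand-in for QFQ_MAX_SLOTS (taken to be 32 here), and struct sketch_agg
and sketch_slot_index() are hypothetical names.

```c
#include <stdint.h>

#define SKETCH_MAX_SLOTS 32 /* assumption: stand-in for QFQ_MAX_SLOTS */

struct sketch_agg {
	uint64_t S; /* start timestamp */
	uint64_t F; /* finish timestamp */
};

/*
 * Compute the slot index of agg in a group whose first slot starts at
 * grp_S, where rounded_S is the start time of agg rounded down to a
 * slot boundary (rounded_S >= grp_S is assumed). If the index exceeds
 * SKETCH_MAX_SLOTS - 2, shift both timestamps of agg backward by the
 * excess so that the index becomes exactly SKETCH_MAX_SLOTS - 2.
 */
static unsigned sketch_slot_index(struct sketch_agg *agg, uint64_t grp_S,
				  uint64_t rounded_S, unsigned slot_shift)
{
	uint64_t slot = (rounded_S - grp_S) >> slot_shift;

	if (slot > SKETCH_MAX_SLOTS - 2) {
		uint64_t delta = rounded_S - grp_S -
			((uint64_t)(SKETCH_MAX_SLOTS - 2) << slot_shift);

		agg->S -= delta;
		agg->F -= delta;
		slot = SKETCH_MAX_SLOTS - 2;
	}
	return (unsigned)slot;
}
```

Since the bound quoted in the comment can now reach 2 + 8 * 128 * 1 =
1026, far above the capped index, the shift is what keeps the index in
range after an out-of-order service or a parameter change.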