Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

net: nexthop: Increase weight to u16

In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and an overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patch increase the next hop weight from u8 to u16.

Increasing the width of an integral type can be tricky, because while the
code still compiles, the types may not check out anymore, and numerical
errors come up. To prevent this, the conversion was done in two steps.
First the type was changed from u8 to a single-member structure, which
invalidated all uses of the field. This allowed going through them one by
one and auditing each for type correctness. Then the structure was replaced
with a vanilla u16 again. This should ensure that no place was missed.
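
The two-step conversion described above can be sketched in plain C as
follows. This is only an illustration of the technique, not the
intermediate code used during development: the names weight_wrap,
entry_step1, entry_step2 and sum_weights_step1 are hypothetical, and
uint16_t stands in for the kernel's u16.

```c
#include <stdint.h>

/* Step 1: wrap the widened value in a single-member struct. Any code that
 * still treats the field as a plain integer (entry.weight + 1, or
 * uint8_t w = entry.weight) no longer compiles and has to be revisited.
 */
struct weight_wrap {
	uint16_t v;
};

/* Hypothetical stand-in for an entry during the audit phase. */
struct entry_step1 {
	struct weight_wrap weight;
};

/* An audited use site: the value is accessed explicitly through .v and the
 * accumulator is wide enough for a sum of 16-bit weights.
 */
static inline uint32_t sum_weights_step1(const struct entry_step1 *e, int n)
{
	uint32_t total = 0;
	int i;

	for (i = 0; i < n; ++i)
		total += e[i].weight.v;
	return total;
}

/* Step 2: once every use has been audited, drop the wrapper again. */
struct entry_step2 {
	uint16_t weight;
};
```

The wrapper costs nothing at run time; its only job is to turn every
silent integer conversion into a compile error while the audit is in
progress.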

The UAPI for configuring nexthop group members is that an attribute
NHA_GROUP carries an array of struct nexthop_grp entries:

	struct nexthop_grp {
		__u32	id;	  /* nexthop id - must exist */
		__u8	weight;   /* weight of this nexthop */
		__u8	resvd1;
		__u16	resvd2;
	};

The field resvd1 is currently validated and required to be zero. We can
lift this requirement and carry high-order bits of the weight in the
reserved field:

	struct nexthop_grp {
		__u32	id;	  /* nexthop id - must exist */
		__u8	weight;   /* weight of this nexthop */
		__u8	weight_high;
		__u16	resvd2;
	};

Keeping the fields split this way was chosen in case an existing userspace
makes assumptions about the width of the weight field, and to sidestep any
endianness issues.

The weight field is currently encoded as the weight value minus one,
because weight of 0 is invalid. This same trick is impossible for the new
weight_high field, because zero must mean actual zero. With this in place:

- Old userspace is guaranteed to carry weight_high of 0, therefore
  configuring 8-bit weights as appropriate. When dumping nexthops with
  16-bit weight, it would only show the lower 8 bits. But configuring such
  nexthops implies existence of userspace aware of the extension in the
  first place.

- New userspace talking to an old kernel will work as long as it only
  attempts to configure 8-bit weights, where the high-order bits are zero.
  Old kernel will bounce attempts at configuring >8-bit weights.
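
A minimal sketch of the split encoding, to make the compatibility cases
above concrete. The struct grp_entry below is a local mirror of the UAPI
layout, and grp_entry_set_weight() and grp_entry_weight_legacy() are
hypothetical helpers written for this illustration; only the decode
arithmetic in grp_entry_weight() follows the nexthop_grp_weight() helper
added by this patch.

```c
#include <stdint.h>

/* Local mirror of the struct nexthop_grp layout (illustration only). */
struct grp_entry {
	uint32_t id;
	uint8_t  weight;	/* low 8 bits of (weight - 1) */
	uint8_t  weight_high;	/* high 8 bits of (weight - 1) */
	uint16_t resvd2;
};

/* Hypothetical userspace-side encoder. Weights 1..65535 are accepted; the
 * encoded value 0xffff would mean a weight of 0x10000, which the kernel
 * rejects. Returns 0 on success, -1 if the weight is out of range.
 */
static int grp_entry_set_weight(struct grp_entry *e, uint32_t weight)
{
	uint16_t enc;

	if (weight < 1 || weight > 65535)
		return -1;
	enc = (uint16_t)(weight - 1);
	e->weight = (uint8_t)(enc & 0xff);
	e->weight_high = (uint8_t)(enc >> 8);
	return 0;
}

/* Same arithmetic as nexthop_grp_weight() in the patch. */
static uint16_t grp_entry_weight(const struct grp_entry *e)
{
	return (uint16_t)(((e->weight_high << 8) | e->weight) + 1);
}

/* What a reader unaware of weight_high reports: the low 8 bits only. */
static uint16_t grp_entry_weight_legacy(const struct grp_entry *e)
{
	return (uint16_t)(e->weight + 1);
}
```

For a weight of 1000 the encoded value is 999 (0x3e7), so weight carries
0xe7 and weight_high carries 0x03. A dump tool that ignores weight_high
would decode this entry as 232, which is the "lower 8 bits" behaviour
described in the first bullet.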

Renaming reserved fields as they are allocated for some purpose is commonly
done in Linux. Whoever touches a reserved field is doing so at their own
risk. nexthop_grp::resvd1 in particular is currently used by at least
strace, however it carries its own copy of UAPI headers, and the conversion
should be trivial. A helper is provided for decoding the weight out of the
two fields. Forcing a conversion seems preferable to bending over backwards
and introducing anonymous unions or whatever.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

authored by Petr Machata and committed by Jakub Kicinski
b72a6a7a 75bab45e

+31 -17
+2 -2
include/net/nexthop.h
···
 struct nh_grp_entry {
 	struct nexthop *nh;
 	struct nh_grp_entry_stats __percpu *stats;
-	u8 weight;
+	u16 weight;
 
 	union {
 		struct {
···
 };
 
 struct nh_notifier_grp_entry_info {
-	u8 weight;
+	u16 weight;
 	struct nh_notifier_single_info nh;
 };
 
+6 -1
include/uapi/linux/nexthop.h
···
 struct nexthop_grp {
 	__u32	id;	  /* nexthop id - must exist */
 	__u8	weight;   /* weight of this nexthop */
-	__u8	resvd1;
+	__u8	weight_high;	/* high order bits of weight */
 	__u16	resvd2;
 };
+
+static inline __u16 nexthop_grp_weight(const struct nexthop_grp *entry)
+{
+	return ((entry->weight_high << 8) | entry->weight) + 1;
+}
 
 enum {
 	NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+23 -14
net/ipv4/nexthop.c
···
 	size_t len = nhg->num_nh * sizeof(*p);
 	struct nlattr *nla;
 	u16 group_type = 0;
+	u16 weight;
 	int i;
 
 	*resp_op_flags |= NHA_OP_FLAG_RESP_GRP_RESVD_0;
···
 
 	p = nla_data(nla);
 	for (i = 0; i < nhg->num_nh; ++i) {
+		weight = nhg->nh_entries[i].weight - 1;
+
 		*p++ = (struct nexthop_grp) {
 			.id = nhg->nh_entries[i].nh->id,
-			.weight = nhg->nh_entries[i].weight - 1,
+			.weight = weight,
+			.weight_high = weight >> 8,
 		};
 	}
 
···
 
 	nhg = nla_data(tb[NHA_GROUP]);
 	for (i = 0; i < len; ++i) {
-		if (nhg[i].resvd1 || nhg[i].resvd2) {
-			NL_SET_ERR_MSG(extack, "Reserved fields in nexthop_grp must be 0");
+		if (nhg[i].resvd2) {
+			NL_SET_ERR_MSG(extack, "Reserved field in nexthop_grp must be 0");
 			return -EINVAL;
 		}
-		if (nhg[i].weight > 254) {
+		if (nexthop_grp_weight(&nhg[i]) == 0) {
+			/* 0xffff got passed in, representing weight of 0x10000,
+			 * which is too heavy.
+			 */
 			NL_SET_ERR_MSG(extack, "Invalid value for weight");
 			return -EINVAL;
 		}
···
 static void nh_res_group_rebalance(struct nh_group *nhg,
 				   struct nh_res_table *res_table)
 {
-	int prev_upper_bound = 0;
-	int total = 0;
-	int w = 0;
+	u16 prev_upper_bound = 0;
+	u32 total = 0;
+	u32 w = 0;
 	int i;
 
 	INIT_LIST_HEAD(&res_table->uw_nh_entries);
···
 
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
-		int upper_bound;
+		u16 upper_bound;
+		u64 btw;
 
 		w += nhge->weight;
-		upper_bound = DIV_ROUND_CLOSEST(res_table->num_nh_buckets * w,
-						total);
+		btw = ((u64)res_table->num_nh_buckets) * w;
+		upper_bound = DIV_ROUND_CLOSEST_ULL(btw, total);
 		nhge->res.wants_buckets = upper_bound - prev_upper_bound;
 		prev_upper_bound = upper_bound;
 
···
 
 static void nh_hthr_group_rebalance(struct nh_group *nhg)
 {
-	int total = 0;
-	int w = 0;
+	u32 total = 0;
+	u32 w = 0;
 	int i;
 
 	for (i = 0; i < nhg->num_nh; ++i)
···
 
 	for (i = 0; i < nhg->num_nh; ++i) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
-		int upper_bound;
+		u32 upper_bound;
 
 		w += nhge->weight;
 		upper_bound = DIV_ROUND_CLOSEST_ULL((u64)w << 31, total) - 1;
···
 			goto out_no_nh;
 		}
 		nhg->nh_entries[i].nh = nhe;
-		nhg->nh_entries[i].weight = entry[i].weight + 1;
+		nhg->nh_entries[i].weight = nexthop_grp_weight(&entry[i]);
+
 		list_add(&nhg->nh_entries[i].nh_list, &nhe->grp_list);
 		nhg->nh_entries[i].nh_parent = nh;
 	}
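
The widened intermediates in nh_res_group_rebalance() can be illustrated
with a standalone userspace sketch. It assumes a 16-bit bucket count, as
the u16 upper_bound in the hunk above suggests. The names
div_round_closest_u64() and wants_buckets() are hypothetical stand-ins
written for this example; the former only approximates what
DIV_ROUND_CLOSEST_ULL() does for unsigned operands.

```c
#include <stdint.h>

/* Round-to-nearest unsigned division; d must be non-zero. */
static uint64_t div_round_closest_u64(uint64_t x, uint32_t d)
{
	return (x + d / 2) / d;
}

/* Compute how many buckets each entry wants, mirroring the loop in the
 * patch. weights[] must hold n >= 1 entries, each >= 1, so that the
 * total is never zero.
 */
static void wants_buckets(const uint16_t *weights, int n,
			  uint16_t num_buckets, uint16_t *wants)
{
	uint16_t prev_upper_bound = 0;
	uint32_t total = 0;
	uint32_t w = 0;
	int i;

	for (i = 0; i < n; ++i)
		total += weights[i];

	for (i = 0; i < n; ++i) {
		uint16_t upper_bound;
		uint64_t btw;

		w += weights[i];
		/* buckets * cumulative weight may exceed 32 bits */
		btw = (uint64_t)num_buckets * w;
		upper_bound = (uint16_t)div_round_closest_u64(btw, total);
		wants[i] = (uint16_t)(upper_bound - prev_upper_bound);
		prev_upper_bound = upper_bound;
	}
}
```

With two entries of weight 65535 and 65535 buckets, the last product is
65535 * 131070 = 8589672450, which does not fit in 32 bits; that is the
case the u64 intermediate covers. With weights 1000:999 and 4096 buckets
the sketch assigns 2049 and 2047 buckets respectively.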