Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

net: openvswitch: allow providing upcall pid for the 'execute' command

When a packet enters the OVS datapath and there is no flow to handle it,
the packet goes to userspace through a MISS upcall. With the per-CPU
upcall dispatch mechanism, we use the current CPU id to select the
Netlink PID on which to send this packet. This allows us to send
packets from the same traffic flow through the same handler.

The handler will process the packet, install the required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.

While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again. And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.

However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread. So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.

The process continues as long as there are new recirculations; each
time, the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port. In real setups the
number of recirculations can go up to 4 or 5, sometimes more.

There is always a chance of re-ordering packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet. So, there is a race window when the flow is already
installed and the second packet can match it and be forwarded to the
destination before the first packet is re-injected. But the fact that
packets are going through multiple upcalls handled by different
userspace threads makes the reordering noticeably more likely, because
we not only have a race between the kernel and a userspace handler
(which is hard to avoid), but also between multiple userspace handlers.

For example, assume that 10 packets get enqueued through a MISS upcall
to handler-1. Handler-1 starts processing them, installs the flow into
the kernel and begins re-injecting the packets, from where they go
through another MISS upcall to handler-2. Handler-2 installs the flow
into the kernel and starts re-injecting the packets. Meanwhile,
handler-1 is still re-injecting the last of the original 10 packets;
those now hit the flow installed by handler-2 and are forwarded without
going to handler-2 at all, while handler-2 is still re-injecting the
first of these 10 packets. Given multiple recirculations and misses,
these 10 packets may end up completely mixed up on the output from the
datapath.

Let's allow userspace to specify on which Netlink PID the packets
should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
This makes it possible to ensure that all the packets are processed
by the same handler thread in userspace, even when they are upcalled
multiple times in the process. Packets will remain in order since they
will be enqueued to the same socket and re-injected in the same order.
This doesn't eliminate the re-ordering described above, since we still
have a race between the kernel and the userspace thread, but it does
eliminate the races between multiple userspace threads.

Userspace knows the PID of the socket on which the original upcall is
received, so there is no need to send it up from the kernel.

The solution requires storing the value somewhere for the duration of
the packet processing. There are two potential places for this: our skb
extension or the per-CPU storage. It's not clear which is better, so we
just follow the currently used scheme of storing this kind of
information along with the skb. We still have a decent amount of space
in the cb.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250702155043.2331772-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Authored by Ilya Maximets, committed by Jakub Kicinski
59f44c9c 5c3f832d
+21 -3
+6
include/uapi/linux/openvswitch.h
@@ -186,6 +186,11 @@
  * %OVS_PACKET_ATTR_USERSPACE action specify the Maximum received fragment
  * size.
  * @OVS_PACKET_ATTR_HASH: Packet hash info (e.g. hash, sw_hash and l4_hash in skb).
+ * @OVS_PACKET_ATTR_UPCALL_PID: Netlink PID to use for upcalls while
+ * processing %OVS_PACKET_CMD_EXECUTE. Takes precedence over all other ways
+ * to determine the Netlink PID including %OVS_USERSPACE_ATTR_PID,
+ * %OVS_DP_ATTR_UPCALL_PID, %OVS_DP_ATTR_PER_CPU_PIDS and the
+ * %OVS_VPORT_ATTR_UPCALL_PID.
  *
  * These attributes follow the &struct ovs_header within the Generic Netlink
  * payload for %OVS_PACKET_* commands.
@@ -205,6 +210,7 @@
 	OVS_PACKET_ATTR_MRU,        /* Maximum received IP fragment size. */
 	OVS_PACKET_ATTR_LEN,        /* Packet size before truncation. */
 	OVS_PACKET_ATTR_HASH,       /* Packet hash. */
+	OVS_PACKET_ATTR_UPCALL_PID, /* u32 Netlink PID. */
 	__OVS_PACKET_ATTR_MAX
 };
+4 -2
net/openvswitch/actions.c
@@ -941,8 +941,10 @@
 			break;

 		case OVS_USERSPACE_ATTR_PID:
-			if (dp->user_features &
-			    OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
+			if (OVS_CB(skb)->upcall_pid)
+				upcall.portid = OVS_CB(skb)->upcall_pid;
+			else if (dp->user_features &
+				 OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
 				upcall.portid =
 					ovs_dp_get_upcall_portid(dp,
 						smp_processor_id());
+7 -1
net/openvswitch/datapath.c
@@ -267,7 +267,9 @@
 	memset(&upcall, 0, sizeof(upcall));
 	upcall.cmd = OVS_PACKET_CMD_MISS;

-	if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
+	if (OVS_CB(skb)->upcall_pid)
+		upcall.portid = OVS_CB(skb)->upcall_pid;
+	else if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
 		upcall.portid =
 			ovs_dp_get_upcall_portid(dp, smp_processor_id());
 	else
@@ -651,6 +653,9 @@
 				   !!(hash & OVS_PACKET_HASH_L4_BIT));
 	}

+	OVS_CB(packet)->upcall_pid =
+		nla_get_u32_default(a[OVS_PACKET_ATTR_UPCALL_PID], 0);
+
 	/* Build an sw_flow for sending this packet. */
 	flow = ovs_flow_alloc();
 	err = PTR_ERR(flow);
@@ -719,6 +724,7 @@
 	[OVS_PACKET_ATTR_PROBE] = { .type = NLA_FLAG },
 	[OVS_PACKET_ATTR_MRU] = { .type = NLA_U16 },
 	[OVS_PACKET_ATTR_HASH] = { .type = NLA_U64 },
+	[OVS_PACKET_ATTR_UPCALL_PID] = { .type = NLA_U32 },
 };

 static const struct genl_small_ops dp_packet_genl_ops[] = {
+3
net/openvswitch/datapath.h
@@ -121,6 +121,8 @@
  * @cutlen: The number of bytes from the packet end to be removed.
  * @probability: The sampling probability that was applied to this skb; 0 means
  * no sampling has occurred; U32_MAX means 100% probability.
+ * @upcall_pid: Netlink socket PID to use for sending this packet to userspace;
+ * 0 means "not set" and default per-CPU or per-vport dispatch should be used.
  */
 struct ovs_skb_cb {
 	struct vport		*input_vport;
@@ -128,5 +130,6 @@
 	u16			acts_origlen;
 	u32			cutlen;
 	u32			probability;
+	u32			upcall_pid;
 };
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
+1
net/openvswitch/vport.c
@@ -501,6 +501,7 @@
 	OVS_CB(skb)->mru = 0;
 	OVS_CB(skb)->cutlen = 0;
 	OVS_CB(skb)->probability = 0;
+	OVS_CB(skb)->upcall_pid = 0;
 	if (unlikely(dev_net(skb->dev) != ovs_dp_get_net(vport->dp))) {
 		u32 mark;