Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kcm: Add description in Documentation

Add kcm.txt to describe KCM and interfaces.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Authored by Tom Herbert and committed by David S. Miller
10016594 29152a34

+285
Documentation/networking/kcm.txt
Kernel Connection Multiplexor
-----------------------------

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  +-----------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
      |              |           |            |             |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets
-----------

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
-----------

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
--------------------

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.

Connected mode semantics
------------------------

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.

Socket types
------------

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor.
If there is a state change (disconnection) or other error on a TCP socket, an
error is posted on the TCP socket so that a POLLERR event happens and KCM
discontinues using the socket. When the application gets the error notification
for a TCP socket, it should unattach the socket from KCM and then handle the
error condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call:

  socket(AF_KCM, type, protocol)

  - type is either SOCK_DGRAM or SOCK_SEQPACKET
  - protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket.
This is accomplished by an ioctl on a KCM socket:

  /* From linux/kcm.h */
  struct kcm_clone {
          int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
          newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, e.g.:

  /* From linux/kcm.h */
  struct kcm_attach {
          int fd;
          int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:
  fd: file descriptor for the TCP socket being attached
  bpf_fd: file descriptor for the compiled BPF program that has been loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument:

  /* From linux/kcm.h */
  struct kcm_unattach {
          int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while.
Example use:

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is:

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
          return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.
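
The delineation rule used by the Thrift BPF program (the leading length
word plus 4 bytes for the length field itself) can be modeled in userspace,
which may help when unit testing a framing scheme before loading a BPF
program. The following is only a sketch under stated assumptions:
frame_length() is a hypothetical helper that is not part of KCM, and it
assumes the length prefix is a 4-byte big-endian word, matching the network
byte order in which load_word reads data.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the Thrift delineation rule:
 * total message length = 4-byte big-endian length prefix + 4.
 *
 * Returns 0 when fewer than 4 bytes are available, meaning the
 * message length cannot be determined yet.
 */
uint64_t frame_length(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	if (avail < 4)
		return 0;

	/* Assemble the length word in network (big-endian) byte order. */
	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	/* Widen before adding so a large prefix cannot wrap around. */
	return (uint64_t)payload + 4;
}
```

For a buffer holding the bytes 00 00 00 05 followed by a five byte payload,
frame_length() reports a nine byte message.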

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.
  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
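
The monitoring thread described above could be structured roughly as
follows. This is a minimal sketch rather than a definitive implementation:
the function names watch_tcp_errors(), next_failed_fd() and
retire_tcp_socket() are hypothetical, and the SIOCKCMUNATTACH ioctl that
should precede the close is left to the caller so that the sketch does not
depend on linux/kcm.h.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Add an attached TCP socket to the epoll set used for error monitoring. */
int watch_tcp_errors(int epfd, int tcpfd)
{
	struct epoll_event ev;

	memset(&ev, 0, sizeof(ev));
	/* epoll always reports EPOLLERR and EPOLLHUP; listed here for clarity. */
	ev.events = EPOLLERR | EPOLLHUP;
	ev.data.fd = tcpfd;

	return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
}

/*
 * Wait up to timeout_ms for an error or hangup on any watched socket.
 * Returns the affected file descriptor, or -1 on timeout or failure.
 */
int next_failed_fd(int epfd, int timeout_ms)
{
	struct epoll_event ev;

	if (epoll_wait(epfd, &ev, 1, timeout_ms) <= 0)
		return -1;

	return ev.data.fd;
}

/*
 * Read the pending socket error, stop watching the socket and close it.
 * The caller is assumed to have unattached the socket from KCM first.
 * Returns the SO_ERROR value (0 if none), or -1 if it cannot be read.
 */
int retire_tcp_socket(int epfd, int tcpfd)
{
	int err = 0;
	socklen_t len = sizeof(err);

	if (getsockopt(tcpfd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		err = -1;

	epoll_ctl(epfd, EPOLL_CTL_DEL, tcpfd, NULL);
	close(tcpfd);

	return err;
}
```

A monitoring thread would typically call next_failed_fd() in a loop,
unattach the returned socket from the multiplexor, and then pass it to
retire_tcp_socket() before deciding whether to reconnect.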

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
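
One way to read such statistics is the TCP_INFO socket option on Linux.
The sketch below is an illustration only: tcp_total_retrans() is a
hypothetical helper name, and the choice of the tcpi_total_retrans counter
as the health signal is an assumption, not something KCM prescribes.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Return the total number of retransmitted segments reported by the
 * kernel for a TCP socket, or -1 if the statistics cannot be read
 * (for example, when tcpfd is not a TCP socket).
 */
long tcp_total_retrans(int tcpfd)
{
	struct tcp_info info;
	socklen_t len = sizeof(info);

	memset(&info, 0, sizeof(info));

	if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
		return -1;

	return (long)info.tcpi_total_retrans;
}
```

An application could sample this value periodically for each attached TCP
socket and replace connections whose counter grows unusually fast.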