Kernel Connection Multiplexor
-----------------------------

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                |                |                |
      +-----------+    |                |     +----------+
                  |    |                |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                |   |             |             | |
      +---------+   |             |             | +-----------+
      |             |             |             |             |
+----------+  +----------+  +----------+  +----------+  +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
+----------+  +----------+  +----------+  +----------+  +----------+
      |             |             |             |             |
+----------+  +----------+  +----------+  +----------+  +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
+----------+  +----------+  +----------+  +----------+  +----------+

KCM sockets
-----------

The KCM sockets provide the user interface to the multiplexor. All the KCM
sockets bound to a multiplexor are considered to have equivalent function, and
I/O operations on different sockets may be done in parallel without the need
for synchronization between threads in userspace.

Multiplexor
-----------

The multiplexor provides the message steering.
In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
--------------------

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.

Connected mode semantics
------------------------

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages on the KCM socket.

Socket types
------------

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message.
Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO).
If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call:

  socket(AF_KCM, type, protocol)

  - type is either SOCK_DGRAM or SOCK_SEQPACKET
  - protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket:

  /* From linux/kcm.h */
  struct kcm_clone {
          int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  /* On success the kernel returns the new KCM socket in info.fd */
  if (!err)
          newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, e.g.:

  /* From linux/kcm.h */
  struct kcm_attach {
          int fd;
          int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:
  fd: file descriptor for the TCP socket being attached
  bpf_fd: file descriptor for the compiled BPF program that has been loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward.
An
"unattach" ioctl is done with the kcm_unattach structure as the argument:

  /* From linux/kcm.h */
  struct kcm_unattach {
          int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  /* fd of the TCP socket to unattach */
  info.fd = tcpfd;

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use:

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is:

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
          /* 32-bit length field at offset 0, plus the 4 bytes of the field */
          return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream.
KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an NxM configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.
  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback.
The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
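
Userspace sketch of message delineation
---------------------------------------

The framing rule applied by the Thrift BPF program shown earlier (a 32-bit
length word at offset 0, plus 4 bytes for the word itself) can be illustrated
in plain userspace C. The following is only a sketch to make the rule concrete;
it is not part of KCM or its API. The helper names load_be32, frame_len and
count_complete_frames are hypothetical, and the sketch assumes that load_word
reads the word in network (big-endian) byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian word (assumed equivalent of BPF load_word). */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: total length (header + payload) of the message
 * starting at buf, or 0 if the 4-byte length field is not yet available.
 */
static uint64_t frame_len(const unsigned char *buf, size_t avail)
{
	assert(buf != NULL);
	if (avail < 4)
		return 0;
	return (uint64_t)load_be32(buf) + 4;
}

/*
 * Hypothetical helper: count the complete messages at the start of a byte
 * stream. *consumed is set to the number of bytes those messages occupy;
 * any trailing partial message is left uncounted.
 */
static size_t count_complete_frames(const unsigned char *buf, size_t avail,
				    size_t *consumed)
{
	size_t off = 0, n = 0;

	for (;;) {
		uint64_t len = frame_len(buf + off, avail - off);

		/* Stop on a missing header or an incomplete payload. */
		if (len == 0 || len > avail - off)
			break;
		off += (size_t)len;
		n++;
	}
	*consumed = off;
	return n;
}
```

With KCM this parsing happens in the kernel, so an application reading from a
KCM socket receives whole messages and needs no loop of this kind. The sketch
only shows what "returns the length of the message" means for a
length-prefixed protocol.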