Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

hvsock: fix epollout hang from race condition

Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will
not return even when the hvsock socket is writable, under some race
condition. This can happen under the following sequence:
- fd = socket(hvsocket)
- fd_out = dup(fd)
- fd_in = dup(fd)
- start a writer thread that writes data to fd_out with a combination of
epoll_wait(fd_out, EPOLLOUT) and
- start a reader thread that reads data from fd_in with a combination of
epoll_wait(fd_in, EPOLLIN)
- On the host, there are two threads that are reading/writing data to the
hvsocket

stack:
hvs_stream_has_space
hvs_notify_poll_out
vsock_poll
sock_poll
ep_poll

Race condition:
check for epollout from ep_poll():
assume no writable space in the socket
hvs_stream_has_space() returns 0
check for epollin from ep_poll():
assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE)
hvs_stream_has_space() will clear the channel pending send size
host will not notify the guest because the pending send size has
been cleared and so the hvsocket will never mark the
socket writable

Now, the EPOLLOUT will never return even if the socket write buffer is
empty.

The fix is to set the pending size to the default size and never change it.
This way the host will always notify the guest whenever the writable space
is bigger than the pending size. The host is already optimized to *only*
notify the guest when the pending size threshold boundary is crossed and
not everytime.

This change also reduces the cpu usage somewhat since hv_stream_has_space()
is in the hotpath of send:
vsock_stream_sendmsg()->hv_stream_has_space()
Earlier hv_stream_has_space was setting/clearing the pending size on every
call.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

authored by

Sunil Muthuswamy and committed by
David S. Miller
cb359b60 76e21533

+8 -31
+8 -31
net/vmw_vsock/hyperv_transport.c
··· 211 211 set_channel_pending_send_size(chan, 212 212 HVS_PKT_LEN(HVS_SEND_BUF_SIZE)); 213 213 214 - /* See hvs_stream_has_space(): we must make sure the host has seen 215 - * the new pending send size, before we can re-check the writable 216 - * bytes. 217 - */ 218 - virt_mb(); 219 - } 220 - 221 - static void hvs_clear_channel_pending_send_size(struct vmbus_channel *chan) 222 - { 223 - set_channel_pending_send_size(chan, 0); 224 - 225 - /* Ditto */ 226 214 virt_mb(); 227 215 } 228 216 ··· 280 292 if (hvs_channel_readable(chan)) 281 293 sk->sk_data_ready(sk); 282 294 283 - /* See hvs_stream_has_space(): when we reach here, the writable bytes 284 - * may be already less than HVS_PKT_LEN(HVS_SEND_BUF_SIZE). 285 - */ 286 295 if (hv_get_bytes_to_write(&chan->outbound) > 0) 287 296 sk->sk_write_space(sk); 288 297 } ··· 379 394 380 395 set_per_channel_state(chan, conn_from_host ? new : sk); 381 396 vmbus_set_chn_rescind_callback(chan, hvs_close_connection); 397 + 398 + /* Set the pending send size to max packet size to always get 399 + * notifications from the host when there is enough writable space. 400 + * The host is optimized to send notifications only when the pending 401 + * size boundary is crossed, and not always. 402 + */ 403 + hvs_set_channel_pending_send_size(chan); 382 404 383 405 if (conn_from_host) { 384 406 new->sk_state = TCP_ESTABLISHED; ··· 680 688 static s64 hvs_stream_has_space(struct vsock_sock *vsk) 681 689 { 682 690 struct hvsock *hvs = vsk->trans; 683 - struct vmbus_channel *chan = hvs->chan; 684 - s64 ret; 685 691 686 - ret = hvs_channel_writable_bytes(chan); 687 - if (ret > 0) { 688 - hvs_clear_channel_pending_send_size(chan); 689 - } else { 690 - /* See hvs_channel_cb() */ 691 - hvs_set_channel_pending_send_size(chan); 692 - 693 - /* Re-check the writable bytes to avoid race */ 694 - ret = hvs_channel_writable_bytes(chan); 695 - if (ret > 0) 696 - hvs_clear_channel_pending_send_size(chan); 697 - } 698 - 699 - return ret; 692 + return hvs_channel_writable_bytes(hvs->chan); 700 693 } 701 694 702 695 static u64 hvs_stream_rcvhiwat(struct vsock_sock *vsk)