Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

KVM: X86: Implement ring-based dirty memory tracking

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue arises for live migration when the guest memory is
huge but the guest dirties pages slowly. In that case, each dirty
sync has to pull the whole dirty bitmap to userspace and analyse
every bit, even if the bitmap is mostly zeros.

The preferred data structure for the above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables the dirty ring for x86 only. However, it should be
easy to extend to other architectures as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

authored by Peter Xu and committed by Paolo Bonzini
fb04a1ed 28bd726a

+662 -3
+93
Documentation/virt/kvm/api.rst
···
 memory region. This ioctl returns the size of that region. See the
 KVM_RUN documentation for details.
 
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.29.
+
 
 4.6 KVM_SET_MEMORY_REGION
 -------------------------
···
 guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
 (0x40000001). Otherwise, a guest may use the paravirtual features
 regardless of what has actually been exposed through the CPUID leaf.
+
+
+8.29 KVM_CAP_DIRTY_LOG_RING
+---------------------------
+
+:Architectures: x86
+:Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+The dirty ring is available to userspace as an array of
+``struct kvm_dirty_gfn``. Each dirty entry is defined as::
+
+  struct kvm_dirty_gfn {
+          __u32 flags;
+          __u32 slot; /* as_id | slot_id */
+          __u64 offset;
+  };
+
+The following values are defined for the flags field to define the
+current state of the entry::
+
+  #define KVM_DIRTY_GFN_F_DIRTY BIT(0)
+  #define KVM_DIRTY_GFN_F_RESET BIT(1)
+  #define KVM_DIRTY_GFN_F_MASK 0x3
+
+Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
+ioctl to enable this capability for the new guest and set the size of
+the rings. Enabling the capability is only allowed before creating any
+vCPU, and the size of the ring must be a power of two. The larger the
+ring buffer, the less likely the ring is full and the VM is forced to
+exit to userspace. The optimal size depends on the workload, but it is
+recommended that it be at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+An entry in the ring buffer can be unused (flag bits ``00``),
+dirty (flag bits ``01``) or harvested (flag bits ``1X``). The
+state machine for the entry is as follows::
+
+          dirtied         harvested        reset
+     00 -----------> 01 -------------> 1X -------+
+      ^                                          |
+      |                                          |
+      +------------------------------------------+
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs. If the flags field has the DIRTY bit set (at this
+stage the RESET bit must be cleared), then it means this GFN is a dirty GFN.
+The userspace should harvest this GFN and mark the flags from state
+``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
+to show that this GFN is harvested and waiting for a reset), and move
+on to the next GFN. The userspace should continue to do this until the
+flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
+all the dirty GFNs that were available.
+
+It's not necessary for userspace to harvest all the dirty GFNs at once.
+However it must collect the dirty GFNs in sequence, i.e., the userspace
+program cannot skip one dirty GFN to collect the one next to it.
+
+After processing one or more entries in the ring buffer, userspace
+calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
+it, so that the kernel will reprotect those collected GFNs.
+Therefore, the ioctl must be called *before* reading the content of
+the dirty pages.
+
+The dirty ring can get full. When it happens, the KVM_RUN of the
+vcpu will return with exit reason KVM_EXIT_DIRTY_RING_FULL.
+
+The dirty ring interface has a major difference compared to the
+KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
+userspace, it's still possible that the kernel has not yet flushed the
+processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
+flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
+needs to kick the vcpu out of KVM_RUN using a signal. The resulting
+vmexit ensures that all dirty GFNs are flushed to the dirty rings.
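As a rough illustration of the harvesting protocol documented above, the following C sketch walks a ring from userspace. It is not part of this patch: `struct dirty_gfn` and the `GFN_F_*` flags are local stand-ins for the uapi definitions, `harvest_ring` and `fetch_index` are hypothetical names, and the GCC/Clang `__atomic` builtins are an assumed way to order flag accesses against the kernel's writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for struct kvm_dirty_gfn; real code would use <linux/kvm.h>. */
struct dirty_gfn {
	uint32_t flags;
	uint32_t slot;		/* as_id | slot_id */
	uint64_t offset;
};

#define GFN_F_DIRTY (1u << 0)	/* stand-in for KVM_DIRTY_GFN_F_DIRTY */
#define GFN_F_RESET (1u << 1)	/* stand-in for KVM_DIRTY_GFN_F_RESET */

/*
 * Hypothetical helper: collect consecutive dirty entries starting at
 * *fetch_index, mark each one harvested (01 -> 1X), and stop at the first
 * entry that is not in the dirty state. Returns the number collected.
 */
static size_t harvest_ring(struct dirty_gfn *ring, uint32_t ring_entries,
			   uint32_t *fetch_index,
			   uint64_t *out_offsets, size_t out_cap)
{
	size_t n = 0;

	/* The ring size is a power of two, so masking replaces modulo. */
	assert(ring_entries && (ring_entries & (ring_entries - 1)) == 0);

	while (n < out_cap) {
		struct dirty_gfn *e = &ring[*fetch_index & (ring_entries - 1)];
		uint32_t flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);

		/* Only state 01 (dirty, not yet harvested) may be collected. */
		if ((flags & (GFN_F_DIRTY | GFN_F_RESET)) != GFN_F_DIRTY)
			break;

		out_offsets[n++] = e->offset;
		/* Publish the harvested state after reading the entry. */
		__atomic_store_n(&e->flags, GFN_F_RESET, __ATOMIC_RELEASE);
		(*fetch_index)++;
	}
	return n;
}
```

After a call like this, a real VMM would issue the KVM_RESET_DIRTY_RINGS VM ioctl so that the kernel reprotects the collected pages. Entries are never skipped here, which matches the in-sequence requirement in the text.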
+3
arch/x86/include/asm/kvm_host.h
···
 	void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
 					   struct kvm_memory_slot *slot,
 					   gfn_t offset, unsigned long mask);
+	int (*cpu_dirty_log_size)(void);
 
 	/* pmu operations of sub-arch */
 	const struct kvm_pmu_ops *pmu_ops;
···
 
 #define GET_SMSTATE(type, buf, offset) \
 	(*(type *)((buf) + (offset) - 0x7e00))
+
+int kvm_cpu_dirty_log_size(void);
 
 #endif /* _ASM_X86_KVM_HOST_H */
+1
arch/x86/include/uapi/asm/kvm.h
···
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
+2 -1
arch/x86/kvm/Makefile
···
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
+8
arch/x86/kvm/mmu/mmu.c
···
 		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+int kvm_cpu_dirty_log_size(void)
+{
+	if (kvm_x86_ops.cpu_dirty_log_size)
+		return kvm_x86_ops.cpu_dirty_log_size();
+
+	return 0;
+}
+
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn)
 {
+1 -1
arch/x86/kvm/mmu/tdp_mmu.c
···
 	if ((!is_writable_pte(old_spte) || pfn_changed) &&
 	    is_writable_pte(new_spte)) {
 		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
-		mark_page_dirty_in_slot(slot, gfn);
+		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
 }
 
+7
arch/x86/kvm/vmx/vmx.c
···
 	return supported & BIT(bit);
 }
 
+static int vmx_cpu_dirty_log_size(void)
+{
+	return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_unsetup = hardware_unsetup,
 
···
 	.migrate_timers = vmx_migrate_timers,
 
 	.msr_filter_changed = vmx_msr_filter_changed,
+	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 };
 
 static __init int hardware_setup(void)
···
 		vmx_x86_ops.slot_disable_log_dirty = NULL;
 		vmx_x86_ops.flush_log_dirty = NULL;
 		vmx_x86_ops.enable_log_dirty_pt_masked = NULL;
+		vmx_x86_ops.cpu_dirty_log_size = NULL;
 	}
 
 	if (!cpu_has_vmx_preemption_timer())
+9
arch/x86/kvm/x86.c
···
 
 	bool req_immediate_exit = false;
 
+	/* Forbid vmenter if vcpu dirty ring is soft-full */
+	if (unlikely(vcpu->kvm->dirty_ring_size &&
+		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+		trace_kvm_dirty_ring_exit(vcpu);
+		r = 0;
+		goto out;
+	}
+
 	if (kvm_request_pending(vcpu)) {
 		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
+103
include/linux/kvm_dirty_ring.h
···
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+#include <linux/kvm.h>
+
+/**
+ * kvm_dirty_ring: KVM internal dirty ring structure
+ *
+ * @dirty_index: free running counter that points to the next slot in
+ *               dirty_ring->dirty_gfns, where a new dirty page should go
+ * @reset_index: free running counter that points to the next dirty page
+ *               in dirty_ring->dirty_gfns for which dirty trap needs to
+ *               be reenabled
+ * @size:        size of the compact list, dirty_ring->dirty_gfns
+ * @soft_limit:  when the number of dirty pages in the list reaches this
+ *               limit, vcpu that owns this ring should exit to userspace
+ *               to allow userspace to harvest all the dirty pages
+ * @dirty_gfns:  the array to keep the dirty gfns
+ * @index:       index of this dirty ring
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	int index;
+};
+
+#if (KVM_DIRTY_LOG_PAGE_OFFSET == 0)
+/*
+ * If KVM_DIRTY_LOG_PAGE_OFFSET not defined, kvm_dirty_ring.o should
+ * not be included as well, so define these nop functions for the arch.
+ */
+static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return 0;
+}
+
+static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+				       int index, u32 size)
+{
+	return 0;
+}
+
+static inline struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	return NULL;
+}
+
+static inline int kvm_dirty_ring_reset(struct kvm *kvm,
+				       struct kvm_dirty_ring *ring)
+{
+	return 0;
+}
+
+static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+				       u32 slot, u64 offset)
+{
+}
+
+static inline struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring,
+						   u32 offset)
+{
+	return NULL;
+}
+
+static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+}
+
+static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return true;
+}
+
+#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * returns =0: successfully pushed
+ *         <0: unable to push, need to wait
+ */
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
+
+#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+#endif /* KVM_DIRTY_RING_H */
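The header describes `dirty_index` and `reset_index` as free running counters. A small userspace sketch of that arithmetic may help: the counters are never wrapped to the ring size, occupancy is their unsigned difference, and an array position is obtained by masking. The type `ring_idx` and the helper names below are hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct kvm_dirty_ring. */
struct ring_idx {
	uint32_t dirty_index;	/* next entry to fill, never wrapped */
	uint32_t reset_index;	/* next entry to reset, never wrapped */
	uint32_t size;		/* number of entries, a power of two */
	uint32_t soft_limit;	/* size minus the reserved entries */
};

/* Entries in flight; unsigned subtraction stays correct across u32 overflow. */
static uint32_t ring_used(const struct ring_idx *r)
{
	return r->dirty_index - r->reset_index;
}

/* True once the vcpu should exit to userspace for harvesting. */
static bool ring_soft_full(const struct ring_idx *r)
{
	return ring_used(r) >= r->soft_limit;
}

/* Map a free-running counter to a position in the entry array. */
static uint32_t ring_pos(const struct ring_idx *r, uint32_t counter)
{
	assert(r->size && (r->size & (r->size - 1)) == 0);
	return counter & (r->size - 1);
}
```

Because the size is a power of two, the counters can overflow `u32` without any special handling: the difference and the masked position both remain valid.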
+13
include/linux/kvm_host.h
···
 #include <linux/kvm_types.h>
 
 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
 
 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
···
 	bool preempted;
 	bool ready;
 	struct kvm_vcpu_arch arch;
+	struct kvm_dirty_ring dirty_ring;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
···
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
+	u32 dirty_ring_size;
 };
 
 #define kvm_err(fmt, ...) \
···
 	vcpu->stat.signal_exits++;
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
+
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full. This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define KVM_DIRTY_RING_RSVD_ENTRIES 64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define KVM_DIRTY_RING_MAX_ENTRIES 65536
 
 #endif
+63
include/trace/events/kvm.h
···
 #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
 	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
 
+TRACE_EVENT(kvm_dirty_ring_push,
+	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
+	TP_ARGS(ring, slot, offset),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+		__field(u32, slot)
+		__field(u64, offset)
+	),
+
+	TP_fast_assign(
+		__entry->index = ring->index;
+		__entry->dirty_index = ring->dirty_index;
+		__entry->reset_index = ring->reset_index;
+		__entry->slot = slot;
+		__entry->offset = offset;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x "
+		  "slot %u offset 0x%llx (used %u)",
+		  __entry->index, __entry->dirty_index,
+		  __entry->reset_index, __entry->slot, __entry->offset,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_reset,
+	TP_PROTO(struct kvm_dirty_ring *ring),
+	TP_ARGS(ring),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+	),
+
+	TP_fast_assign(
+		__entry->index = ring->index;
+		__entry->dirty_index = ring->dirty_index;
+		__entry->reset_index = ring->reset_index;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
+		  __entry->index, __entry->dirty_index, __entry->reset_index,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_exit,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+	),
+
+	TP_printk("vcpu %d", __entry->vcpu_id)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
+53
include/uapi/linux/kvm.h
···
 #define KVM_EXIT_ARM_NISV 28
 #define KVM_EXIT_X86_RDMSR 29
 #define KVM_EXIT_X86_WRMSR 30
+#define KVM_EXIT_DIRTY_RING_FULL 31
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
···
 #define KVM_CAP_X86_MSR_FILTER 189
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
+#define KVM_CAP_DIRTY_LOG_RING 192
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
···
 /* Available with KVM_CAP_X86_MSR_FILTER */
 #define KVM_X86_SET_MSR_FILTER _IOW(KVMIO, 0xc6, struct kvm_msr_filter)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc7)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
···
 
 #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0)
 #define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1)
+
+/*
+ * Arch needs to define the macro after implementing the dirty ring
+ * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         harvested        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion after harvesting an entry. Also, it must not skip any
+ * dirty bits, so that dirty bits are always harvested in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY BIT(0)
+#define KVM_DIRTY_GFN_F_RESET BIT(1)
+#define KVM_DIRTY_GFN_F_MASK 0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn. The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+	__u32 flags;
+	__u32 slot;
+	__u64 offset;
+};
 
 #endif /* __LINUX_KVM_H */
+194
virt/kvm/dirty_ring.c
···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM dirty ring implementation
+ *
+ * Copyright 2019 Red Hat, Inc.
+ */
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+#include <trace/events/kvm.h>
+
+int __weak kvm_cpu_dirty_log_size(void)
+{
+	return 0;
+}
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
+}
+
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
+}
+
+static bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	WARN_ON_ONCE(vcpu->kvm != kvm);
+
+	return &vcpu->dirty_ring;
+}
+
+static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+
+	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
+{
+	ring->dirty_gfns = vmalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+	memset(ring->dirty_gfns, 0, size);
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	ring->index = index;
+
+	return 0;
+}
+
+static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = 0;
+}
+
+static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static inline bool kvm_dirty_gfn_invalid(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags == 0;
+}
+
+static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+	bool first_round = true;
+
+	/* This is only needed to make compilers happy */
+	cur_slot = cur_offset = mask = 0;
+
+	while (true) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+
+		if (!kvm_dirty_gfn_harvested(entry))
+			break;
+
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+
+		/* Update the flags to reflect that this GFN is reset */
+		kvm_dirty_gfn_set_invalid(entry);
+
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (!first_round && next_slot == cur_slot) {
+			s64 delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/* Backwards visit, careful about overflows! */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+		first_round = false;
+	}
+
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	trace_kvm_dirty_ring_reset(ring);
+
+	return count;
+}
+
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+	struct kvm_dirty_gfn *entry;
+
+	/* It should never get full */
+	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+
+	entry->slot = slot;
+	entry->offset = offset;
+	/*
+	 * Make sure the data is filled in before we publish this to
+	 * the userspace program. There's no paired kernel-side reader.
+	 */
+	smp_wmb();
+	kvm_dirty_gfn_set_dirtied(entry);
+	ring->dirty_index++;
+	trace_kvm_dirty_ring_push(ring, slot, offset);
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	vfree(ring->dirty_gfns);
+	ring->dirty_gfns = NULL;
+}
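The coalescing step in `kvm_dirty_ring_reset()` is easier to follow in isolation. The sketch below reimplements only that step in userspace C, under the assumption of a fixed 64-bit mask (the kernel uses `unsigned long` and `BITS_PER_LONG`); `coalesce_gfn` and `MASK_BITS` are hypothetical names. It tries to fold one more page offset into a `(base, mask)` batch and reports whether it succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MASK_BITS 64	/* assumed width; the kernel uses BITS_PER_LONG */

/*
 * Hypothetical helper: try to add page offset `next` to the batch described
 * by *base (lowest offset) and *mask (bit i set means page *base + i).
 * Returns false when `next` does not fit and the batch must be flushed.
 */
static bool coalesce_gfn(uint64_t *base, uint64_t *mask, uint64_t next)
{
	int64_t delta = (int64_t)(next - *base);

	assert(*mask != 0);	/* a batch always holds at least one page */

	/* Forward visit: the page lands inside the current window. */
	if (delta >= 0 && delta < MASK_BITS) {
		*mask |= UINT64_C(1) << delta;
		return true;
	}

	/*
	 * Backward visit: shift the window down, but only if no set bit
	 * would fall off the top of the mask.
	 */
	if (delta < 0 && delta > -MASK_BITS &&
	    ((*mask << -delta) >> -delta) == *mask) {
		*base = next;
		*mask = (*mask << -delta) | 1;
		return true;
	}

	return false;
}
```

When the helper returns false, the caller would flush the current batch (one `kvm_reset_dirty_gfn()` call in the patch) and start a new one with `base = next` and `mask = 1`.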
+112 -1
virt/kvm/kvm_main.c
···
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
 
+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12
 
···
 
 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	kvm_dirty_ring_free(&vcpu->dirty_ring);
 	kvm_arch_vcpu_destroy(vcpu);
 
 	/*
···
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
+		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (kvm->dirty_ring_size)
+			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
+					    slot, rel_gfn);
+		else
+			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
···
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
+static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
+{
+#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+	return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+#else
+	return false;
+#endif
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
···
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
···
 
 static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct kvm_vcpu *vcpu = file->private_data;
+	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
+	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
+	    ((vma->vm_flags & VM_EXEC) || !(vma->vm_flags & VM_SHARED)))
+		return -EINVAL;
+
 	vma->vm_ops = &kvm_vcpu_vm_ops;
 	return 0;
 }
···
 	if (r)
 		goto vcpu_free_run_page;
 
+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+					 id, kvm->dirty_ring_size);
+		if (r)
+			goto arch_vcpu_destroy;
+	}
+
 	mutex_lock(&kvm->lock);
 	if (kvm_get_vcpu_by_id(kvm, id)) {
 		r = -EEXIST;
···
 
 unlock_vcpu_destroy:
 	mutex_unlock(&kvm->lock);
+	kvm_dirty_ring_free(&vcpu->dirty_ring);
+arch_vcpu_destroy:
 	kvm_arch_vcpu_destroy(vcpu);
 vcpu_free_run_page:
 	free_page((unsigned long)vcpu->run);
···
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
+#else
+		return 0;
+#endif
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return -EINVAL;
+
+	/* the size should be power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* Should be bigger to keep the reserved entries, or a page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* We only allow it to set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* We don't allow to change this value after vcpu created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = 0;
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	return cleared;
 }
 
 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
···
 		kvm->max_halt_poll_ns = cap->args[0];
 		return 0;
 	}
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
···
 	}
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
+		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
 		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
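A VMM may want to pre-validate the ring size before calling KVM_ENABLE_CAP. The following sketch mirrors the size checks in `kvm_vm_ioctl_enable_dirty_log_ring()` as a pure function. It is an illustration only: `check_ring_bytes` is a hypothetical name, the reserved-entry count and page size are passed in as parameters because userspace cannot query the kernel's reserved count directly, and the two constants are copied from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define ENTRY_BYTES 16u		/* sizeof(struct kvm_dirty_gfn): 4 + 4 + 8 */
#define MAX_ENTRIES 65536u	/* KVM_DIRTY_RING_MAX_ENTRIES in this patch */

/*
 * Hypothetical helper: return 0 if `size` (in bytes) would pass the size
 * checks in the patch, or a negative errno value otherwise.
 */
static int check_ring_bytes(uint32_t size, uint32_t rsvd_entries,
			    uint32_t page_size)
{
	/* Keeps the multiplication below from overflowing 32 bits. */
	assert(rsvd_entries <= MAX_ENTRIES);

	/* The size must be a non-zero power of two. */
	if (!size || (size & (size - 1)))
		return -EINVAL;

	/* It must hold the reserved entries and cover at least one page. */
	if (size < rsvd_entries * ENTRY_BYTES || size < page_size)
		return -EINVAL;

	/* It must not exceed the maximum number of entries. */
	if (size > MAX_ENTRIES * ENTRY_BYTES)
		return -E2BIG;

	return 0;
}
```

With 16-byte entries, the 64 KiB size recommended in the documentation corresponds to 4096 entries, and the upper bound works out to 1 MiB per ring.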