KVM: X86: Implement ring-based dirty memory tracking


This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded: the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to userspace.

A similar issue arises for live migration when the guest memory is
huge but the guest dirties pages slowly. In that case, each dirty sync
has to pull the whole dirty bitmap to userspace and analyse every bit,
even if it is mostly zeros.

The preferred data structure for the above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables the dirty ring for x86 only. However, it should be
easy to extend to other architectures as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

authored by Peter Xu and committed by Paolo Bonzini 5 years ago fb04a1ed 28bd726a

+662 -3

14 changed files

Documentation/virt/kvm/api.rst
arch/x86/include/asm/kvm_host.h
arch/x86/include/uapi/asm/kvm.h
arch/x86/kvm/Makefile
arch/x86/kvm/mmu/mmu.c
arch/x86/kvm/mmu/tdp_mmu.c
arch/x86/kvm/vmx/vmx.c
arch/x86/kvm/x86.c
include/linux/kvm_dirty_ring.h
include/linux/kvm_host.h
include/trace/events/kvm.h
include/uapi/linux/kvm.h
virt/kvm/dirty_ring.c
virt/kvm/kvm_main.c

+93

Documentation/virt/kvm/api.rst

···
 memory region. This ioctl returns the size of that region. See the
 KVM_RUN documentation for details.

+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+

 4.6 KVM_SET_MEMORY_REGION
 -------------------------
···
 guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
 (0x40000001). Otherwise, a guest may use the paravirtual features
 regardless of what has actually been exposed through the CPUID leaf.
+
+
+8.29 KVM_CAP_DIRTY_LOG_RING
+---------------------------
+
+:Architectures: x86
+:Parameters: args[0] - size of the dirty log ring
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+The dirty ring is available to userspace as an array of
+``struct kvm_dirty_gfn``. Each dirty entry it's defined as::
+
+  struct kvm_dirty_gfn {
+          __u32 flags;
+          __u32 slot; /* as_id | slot_id */
+          __u64 offset;
+  };
+
+The following values are defined for the flags field to define the
+current state of the entry::
+
+  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
+  #define KVM_DIRTY_GFN_F_MASK            0x3
+
+Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
+ioctl to enable this capability for the new guest and set the size of
+the rings. Enabling the capability is only allowed before creating any
+vCPU, and the size of the ring must be a power of two. The larger the
+ring buffer, the less likely the ring is full and the VM is forced to
+exit to userspace. The optimal size depends on the workload, but it is
+recommended that it be at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+An entry in the ring buffer can be unused (flag bits ``00``),
+dirty (flag bits ``01``) or harvested (flag bits ``1X``). The
+state machine for the entry is as follows::
+
+       dirtied         harvested        reset
+  00 -----------> 01 -------------> 1X -------+
+   ^                                          |
+   |                                          |
+   +------------------------------------------+
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage
+the RESET bit must be cleared), then it means this GFN is a dirty GFN.
+The userspace should harvest this GFN and mark the flags from state
+``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
+to show that this GFN is harvested and waiting for a reset), and move
+on to the next GFN. The userspace should continue to do this until the
+flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
+all the dirty GFNs that were available.
+
+It's not necessary for userspace to harvest the all dirty GFNs at once.
+However it must collect the dirty GFNs in sequence, i.e., the userspace
+program cannot skip one dirty GFN to collect the one next to it.
+
+After processing one or more entries in the ring buffer, userspace
+calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
+it, so that the kernel will reprotect those collected GFNs.
+Therefore, the ioctl must be called *before* reading the content of
+the dirty pages.
+
+The dirty ring can get full. When it happens, the KVM_RUN of the
+vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.
+
+The dirty ring interface has a major difference comparing to the
+KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
+userspace, it's still possible that the kernel has not yet flushed the
+processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
+flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one
+needs to kick the vcpu out of KVM_RUN using a signal. The resulting
+vmexit ensures that all dirty GFNs are flushed to the dirty rings.
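The harvesting protocol described in the api.rst text above could look roughly like this in userspace C. This is a minimal sketch, not code from the patch: `harvest_dirty_ring`, its `fetch_index` cursor and the output array are hypothetical names, the struct and flag values are re-declared locally from the documentation so the block builds without kernel headers, and the `__atomic_*` builtins assume GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the layout quoted in the documentation hunk. */
struct kvm_dirty_gfn {
	uint32_t flags;
	uint32_t slot;		/* as_id | slot_id */
	uint64_t offset;
};

#define KVM_DIRTY_GFN_F_DIRTY	(1u << 0)
#define KVM_DIRTY_GFN_F_RESET	(1u << 1)

/*
 * Hypothetical helper: walk the ring in order from *fetch_index, copy
 * the offset of every entry in state 01 into out_offsets, and move the
 * entry to state 1X. Stops at the first entry that is not in state 01.
 * Returns the number of entries harvested.
 */
static uint32_t harvest_dirty_ring(struct kvm_dirty_gfn *ring,
				   uint32_t nentries, uint32_t *fetch_index,
				   uint64_t *out_offsets, uint32_t out_max)
{
	uint32_t count = 0;

	/* The ring size is documented to be a power of two. */
	assert(nentries && !(nentries & (nentries - 1)));

	while (count < out_max) {
		struct kvm_dirty_gfn *e = &ring[*fetch_index & (nentries - 1)];
		/* Acquire pairs with the kernel publishing slot/offset first. */
		uint32_t flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);

		if ((flags & (KVM_DIRTY_GFN_F_DIRTY | KVM_DIRTY_GFN_F_RESET)) !=
		    KVM_DIRTY_GFN_F_DIRTY)
			break;

		out_offsets[count++] = e->offset;
		/* 01 -> 1X: mark the entry as harvested, waiting for reset. */
		__atomic_store_n(&e->flags, KVM_DIRTY_GFN_F_RESET,
				 __ATOMIC_RELEASE);
		(*fetch_index)++;
	}

	return count;
}
```

A real VMM would follow a harvest pass with the KVM_RESET_DIRTY_RINGS VM ioctl so the kernel can re-protect the collected GFNs; that step needs a live VM file descriptor and is left out here. The exit reason that the uapi header in this commit defines for a full ring is KVM_EXIT_DIRTY_RING_FULL.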

arch/x86/include/asm/kvm_host.h

···
 	void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
 					   struct kvm_memory_slot *slot,
 					   gfn_t offset, unsigned long mask);
+	int (*cpu_dirty_log_size)(void);

 	/* pmu operations of sub-arch */
 	const struct kvm_pmu_ops *pmu_ops;
···

 #define GET_SMSTATE(type, buf, offset)		\
 	(*(type *)((buf) + (offset) - 0x7e00))
+
+int kvm_cpu_dirty_log_size(void);

 #endif /* _ASM_X86_KVM_HOST_H */

arch/x86/include/uapi/asm/kvm.h

···

 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64

 #define DE_VECTOR 0
 #define DB_VECTOR 1

+2 -1

arch/x86/kvm/Makefile

···
 KVM := ../../../virt/kvm

 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+				$(KVM)/dirty_ring.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o

 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \

arch/x86/kvm/mmu/mmu.c

···
 		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }

+int kvm_cpu_dirty_log_size(void)
+{
+	if (kvm_x86_ops.cpu_dirty_log_size)
+		return kvm_x86_ops.cpu_dirty_log_size();
+
+	return 0;
+}
+
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn)
 {

+1 -1

arch/x86/kvm/mmu/tdp_mmu.c

···
 	if ((!is_writable_pte(old_spte) || pfn_changed) &&
 	    is_writable_pte(new_spte)) {
 		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
-		mark_page_dirty_in_slot(slot, gfn);
+		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
 }


arch/x86/kvm/vmx/vmx.c

···
 	return supported & BIT(bit);
 }

+static int vmx_cpu_dirty_log_size(void)
+{
+	return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_unsetup = hardware_unsetup,

···
 	.migrate_timers = vmx_migrate_timers,

 	.msr_filter_changed = vmx_msr_filter_changed,
+	.cpu_dirty_log_size = vmx_cpu_dirty_log_size,
 };

 static __init int hardware_setup(void)
···
 		vmx_x86_ops.slot_disable_log_dirty = NULL;
 		vmx_x86_ops.flush_log_dirty = NULL;
 		vmx_x86_ops.enable_log_dirty_pt_masked = NULL;
+		vmx_x86_ops.cpu_dirty_log_size = NULL;
 	}

 	if (!cpu_has_vmx_preemption_timer())

arch/x86/kvm/x86.c

···

 	bool req_immediate_exit = false;

+	/* Forbid vmenter if vcpu dirty ring is soft-full */
+	if (unlikely(vcpu->kvm->dirty_ring_size &&
+		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+		trace_kvm_dirty_ring_exit(vcpu);
+		r = 0;
+		goto out;
+	}
+
 	if (kvm_request_pending(vcpu)) {
 		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {

+103

include/linux/kvm_dirty_ring.h

···
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+#include <linux/kvm.h>
+
+/**
+ * kvm_dirty_ring: KVM internal dirty ring structure
+ *
+ * @dirty_index: free running counter that points to the next slot in
+ *               dirty_ring->dirty_gfns, where a new dirty page should go
+ * @reset_index: free running counter that points to the next dirty page
+ *               in dirty_ring->dirty_gfns for which dirty trap needs to
+ *               be reenabled
+ * @size:        size of the compact list, dirty_ring->dirty_gfns
+ * @soft_limit:  when the number of dirty pages in the list reaches this
+ *               limit, vcpu that owns this ring should exit to userspace
+ *               to allow userspace to harvest all the dirty pages
+ * @dirty_gfns:  the array to keep the dirty gfns
+ * @index:       index of this dirty ring
+ */
+struct kvm_dirty_ring {
+	u32 dirty_index;
+	u32 reset_index;
+	u32 size;
+	u32 soft_limit;
+	struct kvm_dirty_gfn *dirty_gfns;
+	int index;
+};
+
+#if (KVM_DIRTY_LOG_PAGE_OFFSET == 0)
+/*
+ * If KVM_DIRTY_LOG_PAGE_OFFSET not defined, kvm_dirty_ring.o should
+ * not be included as well, so define these nop functions for the arch.
+ */
+static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return 0;
+}
+
+static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+				       int index, u32 size)
+{
+	return 0;
+}
+
+static inline struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	return NULL;
+}
+
+static inline int kvm_dirty_ring_reset(struct kvm *kvm,
+				       struct kvm_dirty_ring *ring)
+{
+	return 0;
+}
+
+static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+				       u32 slot, u64 offset)
+{
+}
+
+static inline struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring,
+						   u32 offset)
+{
+	return NULL;
+}
+
+static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+}
+
+static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return true;
+}
+
+#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * returns =0: successfully pushed
+ *         <0: unable to push, need to wait
+ */
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
+
+#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+#endif /* KVM_DIRTY_RING_H */
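The header describes `dirty_index` and `reset_index` as free-running counters. A small standalone sketch may help show why that works even when a 32-bit counter wraps: the number of used entries is an unsigned difference, and the array position is the counter masked by `size - 1`. The struct and helper names below (`ring_counters`, `ring_used`, `ring_soft_full`, `ring_slot`) are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct kvm_dirty_ring. */
struct ring_counters {
	uint32_t dirty_index;	/* next position the producer writes */
	uint32_t reset_index;	/* next position waiting for a reset */
	uint32_t size;		/* number of entries, a power of two */
	uint32_t soft_limit;	/* exit to userspace at this fill level */
};

/* Unsigned subtraction stays correct across a 2^32 wrap of the counters. */
static uint32_t ring_used(const struct ring_counters *r)
{
	return r->dirty_index - r->reset_index;
}

static bool ring_soft_full(const struct ring_counters *r)
{
	return ring_used(r) >= r->soft_limit;
}

/* Map a free-running counter onto an array position. */
static uint32_t ring_slot(const struct ring_counters *r, uint32_t index)
{
	assert(r->size && !(r->size & (r->size - 1)));
	return index & (r->size - 1);
}
```

Because the counters are never reduced modulo the ring size, a completely full ring (`used == size`) and an empty ring (`used == 0`) remain distinguishable without sacrificing an entry.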

+13

include/linux/kvm_host.h

···
 #include <linux/kvm_types.h>

 #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>

 #ifndef KVM_MAX_VCPU_ID
 #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
···
 	bool preempted;
 	bool ready;
 	struct kvm_vcpu_arch arch;
+	struct kvm_dirty_ring dirty_ring;
 };

 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
···
 	struct srcu_struct irq_srcu;
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
+	u32 dirty_ring_size;
 };

 #define kvm_err(fmt, ...) \
···
 	vcpu->stat.signal_exits++;
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
+
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu to the userspace to avoid dirty ring full. This
+ * value can be tuned to higher if e.g. PML is enabled on the host.
+ */
+#define KVM_DIRTY_RING_RSVD_ENTRIES 64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define KVM_DIRTY_RING_MAX_ENTRIES 65536

 #endif
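To make the effect of the reserved entries concrete, the soft limit that `kvm_dirty_ring_alloc()` derives in virt/kvm/dirty_ring.c can be worked through as plain arithmetic. The helper below is a hypothetical sketch: `dirty_ring_soft_limit` and `DIRTY_GFN_SIZE` are illustrative names, the 16-byte entry size follows from the `__u32 + __u32 + __u64` layout of `struct kvm_dirty_gfn`, and the CPU dirty-log size is passed in as a parameter rather than queried from hardware.

```c
#include <assert.h>
#include <stdint.h>

#define KVM_DIRTY_RING_RSVD_ENTRIES	64
#define DIRTY_GFN_SIZE			16u	/* u32 + u32 + u64 */

/*
 * Hypothetical helper mirroring the arithmetic in kvm_dirty_ring_alloc():
 * soft_limit = entries - (reserved entries + CPU dirty-log size).
 */
static uint32_t dirty_ring_soft_limit(uint32_t ring_bytes,
				      uint32_t cpu_dirty_log_size)
{
	uint32_t entries = ring_bytes / DIRTY_GFN_SIZE;
	uint32_t reserved = KVM_DIRTY_RING_RSVD_ENTRIES + cpu_dirty_log_size;

	/* The enable path rejects rings too small for the reservation. */
	assert(entries > reserved);
	return entries - reserved;
}
```

For the recommended 64 KiB ring (4096 entries), this gives a soft limit of 4032 entries when the CPU buffers no dirty GFNs, and 3520 if the CPU-side log holds 512 entries (the 512 figure is an assumption used for illustration, not a value shown in this diff).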

+63

include/trace/events/kvm.h

···
 #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
 	trace_kvm_halt_poll_ns(false, vcpu_id, new, old)

+TRACE_EVENT(kvm_dirty_ring_push,
+	TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
+	TP_ARGS(ring, slot, offset),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+		__field(u32, slot)
+		__field(u64, offset)
+	),
+
+	TP_fast_assign(
+		__entry->index = ring->index;
+		__entry->dirty_index = ring->dirty_index;
+		__entry->reset_index = ring->reset_index;
+		__entry->slot = slot;
+		__entry->offset = offset;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x "
+		  "slot %u offset 0x%llx (used %u)",
+		  __entry->index, __entry->dirty_index,
+		  __entry->reset_index, __entry->slot, __entry->offset,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_reset,
+	TP_PROTO(struct kvm_dirty_ring *ring),
+	TP_ARGS(ring),
+
+	TP_STRUCT__entry(
+		__field(int, index)
+		__field(u32, dirty_index)
+		__field(u32, reset_index)
+	),
+
+	TP_fast_assign(
+		__entry->index = ring->index;
+		__entry->dirty_index = ring->dirty_index;
+		__entry->reset_index = ring->reset_index;
+	),
+
+	TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
+		  __entry->index, __entry->dirty_index, __entry->reset_index,
+		  __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_exit,
+	TP_PROTO(struct kvm_vcpu *vcpu),
+	TP_ARGS(vcpu),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu->vcpu_id;
+	),
+
+	TP_printk("vcpu %d", __entry->vcpu_id)
+);
+
 #endif /* _TRACE_KVM_MAIN_H */

 /* This part must be outside protection */

+53

include/uapi/linux/kvm.h

···
 #define KVM_EXIT_ARM_NISV 28
 #define KVM_EXIT_X86_RDMSR 29
 #define KVM_EXIT_X86_WRMSR 30
+#define KVM_EXIT_DIRTY_RING_FULL 31

 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
···
 #define KVM_CAP_X86_MSR_FILTER 189
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
+#define KVM_CAP_DIRTY_LOG_RING 192

 #ifdef KVM_CAP_IRQ_ROUTING

···
 /* Available with KVM_CAP_X86_MSR_FILTER */
 #define KVM_X86_SET_MSR_FILTER _IOW(KVMIO, 0xc6, struct kvm_msr_filter)

+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS _IO(KVMIO, 0xc7)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
···

 #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (1 << 0)
 #define KVM_DIRTY_LOG_INITIALLY_SET (1 << 1)
+
+/*
+ * Arch needs to define the macro after implementing the dirty ring
+ * feature. KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         harvested        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion after harvesting an entry. Also, it must not skip any
+ * dirty bits, so that dirty bits are always harvested in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY BIT(0)
+#define KVM_DIRTY_GFN_F_RESET BIT(1)
+#define KVM_DIRTY_GFN_F_MASK 0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn. The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+	__u32 flags;
+	__u32 slot;
+	__u64 offset;
+};

 #endif /* __LINUX_KVM_H */
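The `slot` field of `struct kvm_dirty_gfn` carries two values. Elsewhere in this commit, `mark_page_dirty_in_slot()` builds it as `(as_id << 16) | id` and `kvm_reset_dirty_gfn()` splits it again. A userspace consumer would need the same decoding, sketched here with hypothetical helper names (`dirty_gfn_pack_slot`, `dirty_gfn_as_id`, `dirty_gfn_slot_id`):

```c
#include <assert.h>
#include <stdint.h>

/* Combine an address-space id and a memslot id the way the patch does. */
static uint32_t dirty_gfn_pack_slot(uint32_t as_id, uint32_t slot_id)
{
	/* Each half must fit in 16 bits for the encoding to round-trip. */
	assert(as_id <= 0xffff && slot_id <= 0xffff);
	return (as_id << 16) | slot_id;
}

/* Upper 16 bits: address-space id. */
static uint16_t dirty_gfn_as_id(uint32_t slot)
{
	return (uint16_t)(slot >> 16);
}

/* Lower 16 bits: memslot id within that address space. */
static uint16_t dirty_gfn_slot_id(uint32_t slot)
{
	return (uint16_t)slot;
}
```

The `offset` field is then the page offset relative to the start of that memslot (`rel_gfn` in `mark_page_dirty_in_slot()`), so the guest frame number is the memslot's base GFN plus `offset`.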

+194

virt/kvm/dirty_ring.c

···
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM dirty ring implementation
+ *
+ * Copyright 2019 Red Hat, Inc.
+ */
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+#include <trace/events/kvm.h>
+
+int __weak kvm_cpu_dirty_log_size(void)
+{
+	return 0;
+}
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
+}
+
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->soft_limit;
+}
+
+static bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+	return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	WARN_ON_ONCE(vcpu->kvm != kvm);
+
+	return &vcpu->dirty_ring;
+}
+
+static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+	struct kvm_memory_slot *memslot;
+	int as_id, id;
+
+	as_id = slot >> 16;
+	id = (u16)slot;
+
+	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+		return;
+
+	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+
+	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+		return;
+
+	spin_lock(&kvm->mmu_lock);
+	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+	spin_unlock(&kvm->mmu_lock);
+}
+
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
+{
+	ring->dirty_gfns = vmalloc(size);
+	if (!ring->dirty_gfns)
+		return -ENOMEM;
+	memset(ring->dirty_gfns, 0, size);
+
+	ring->size = size / sizeof(struct kvm_dirty_gfn);
+	ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+	ring->dirty_index = 0;
+	ring->reset_index = 0;
+	ring->index = index;
+
+	return 0;
+}
+
+static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = 0;
+}
+
+static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
+{
+	gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static inline bool kvm_dirty_gfn_invalid(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags == 0;
+}
+
+static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
+{
+	return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+	u32 cur_slot, next_slot;
+	u64 cur_offset, next_offset;
+	unsigned long mask;
+	int count = 0;
+	struct kvm_dirty_gfn *entry;
+	bool first_round = true;
+
+	/* This is only needed to make compilers happy */
+	cur_slot = cur_offset = mask = 0;
+
+	while (true) {
+		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+
+		if (!kvm_dirty_gfn_harvested(entry))
+			break;
+
+		next_slot = READ_ONCE(entry->slot);
+		next_offset = READ_ONCE(entry->offset);
+
+		/* Update the flags to reflect that this GFN is reset */
+		kvm_dirty_gfn_set_invalid(entry);
+
+		ring->reset_index++;
+		count++;
+		/*
+		 * Try to coalesce the reset operations when the guest is
+		 * scanning pages in the same slot.
+		 */
+		if (!first_round && next_slot == cur_slot) {
+			s64 delta = next_offset - cur_offset;
+
+			if (delta >= 0 && delta < BITS_PER_LONG) {
+				mask |= 1ull << delta;
+				continue;
+			}
+
+			/* Backwards visit, careful about overflows! */
+			if (delta > -BITS_PER_LONG && delta < 0 &&
+			    (mask << -delta >> -delta) == mask) {
+				cur_offset = next_offset;
+				mask = (mask << -delta) | 1;
+				continue;
+			}
+		}
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		cur_slot = next_slot;
+		cur_offset = next_offset;
+		mask = 1;
+		first_round = false;
+	}
+
+	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+	trace_kvm_dirty_ring_reset(ring);
+
+	return count;
+}
+
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+	struct kvm_dirty_gfn *entry;
+
+	/* It should never get full */
+	WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+	entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+
+	entry->slot = slot;
+	entry->offset = offset;
+	/*
+	 * Make sure the data is filled in before we publish this to
+	 * the userspace program. There's no paired kernel-side reader.
+	 */
+	smp_wmb();
+	kvm_dirty_gfn_set_dirtied(entry);
+	ring->dirty_index++;
+	trace_kvm_dirty_ring_push(ring, slot, offset);
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+{
+	return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+	vfree(ring->dirty_gfns);
+	ring->dirty_gfns = NULL;
+}
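The coalescing step in `kvm_dirty_ring_reset()` is easier to follow when isolated from the ring walk. The sketch below applies the same delta logic to a plain array of page offsets that are assumed to belong to one memslot, and emits `(base, mask)` batches. It is a simplified illustration rather than the kernel function: `struct reset_batch` and `coalesce_offsets` are hypothetical names, `BITS_PER_LONG` is fixed at 64, and a batch is only emitted once it holds at least one page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG 64

/* One re-protect request: bit i of mask covers page (base + i). */
struct reset_batch {
	uint64_t base;
	uint64_t mask;
};

/*
 * Fold offsets into as few (base, mask) batches as possible, keeping
 * the order in which they were harvested. 'out' must have room for n
 * batches. Returns the number of batches written.
 */
static size_t coalesce_offsets(const uint64_t *offsets, size_t n,
			       struct reset_batch *out)
{
	size_t nbatches = 0;
	uint64_t cur = 0, mask = 0;

	assert(offsets && out);

	for (size_t i = 0; i < n; i++) {
		uint64_t next = offsets[i];

		if (i > 0) {
			int64_t delta = (int64_t)(next - cur);

			/* Forward neighbour within the current window. */
			if (delta >= 0 && delta < BITS_PER_LONG) {
				mask |= 1ull << delta;
				continue;
			}

			/* Backward neighbour: rebase if no bit is lost. */
			if (delta > -BITS_PER_LONG && delta < 0 &&
			    (mask << -delta >> -delta) == mask) {
				cur = next;
				mask = (mask << -delta) | 1;
				continue;
			}

			/* Too far away: flush and start a new batch. */
			out[nbatches++] = (struct reset_batch){ cur, mask };
		}
		cur = next;
		mask = 1;
	}

	if (n)
		out[nbatches++] = (struct reset_batch){ cur, mask };

	return nbatches;
}
```

With offsets 10, 11, 13, 8, 200 the first four pages collapse into one batch based at page 8, and page 200 starts a second batch, so only two re-protect calls would be needed instead of five.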

+112 -1

virt/kvm/kvm_main.c

···
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>

+#include <linux/kvm_dirty_ring.h>
+
 /* Worst case buffer size needed for holding an integer. */
 #define ITOA_MAX_LEN 12

···

 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+	kvm_dirty_ring_free(&vcpu->dirty_ring);
 	kvm_arch_vcpu_destroy(vcpu);

 	/*
···
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
+		u32 slot = (memslot->as_id << 16) | memslot->id;

-		set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (kvm->dirty_ring_size)
+			kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
+					    slot, rel_gfn);
+		else
+			set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
···
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

+static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
+{
+#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+	return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+	    (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+	     kvm->dirty_ring_size / PAGE_SIZE);
+#else
+	return false;
+#endif
+}
+
 static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
 {
 	struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
···
 	else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
 #endif
+	else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
+		page = kvm_dirty_ring_get_page(
+		    &vcpu->dirty_ring,
+		    vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
 	else
 		return kvm_arch_vcpu_fault(vcpu, vmf);
 	get_page(page);
···

 static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct kvm_vcpu *vcpu = file->private_data;
+	unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
+	     kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
+	    ((vma->vm_flags & VM_EXEC) || !(vma->vm_flags & VM_SHARED)))
+		return -EINVAL;
+
 	vma->vm_ops = &kvm_vcpu_vm_ops;
 	return 0;
 }
···
 	if (r)
 		goto vcpu_free_run_page;

+	if (kvm->dirty_ring_size) {
+		r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+					 id, kvm->dirty_ring_size);
+		if (r)
+			goto arch_vcpu_destroy;
+	}
+
 	mutex_lock(&kvm->lock);
 	if (kvm_get_vcpu_by_id(kvm, id)) {
 		r = -EEXIST;
···

 unlock_vcpu_destroy:
 	mutex_unlock(&kvm->lock);
+	kvm_dirty_ring_free(&vcpu->dirty_ring);
+arch_vcpu_destroy:
 	kvm_arch_vcpu_destroy(vcpu);
 vcpu_free_run_page:
 	free_page((unsigned long)vcpu->run);
···
 #endif
 	case KVM_CAP_NR_MEMSLOTS:
 		return KVM_USER_MEM_SLOTS;
+	case KVM_CAP_DIRTY_LOG_RING:
+#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+		return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
+#else
+		return 0;
+#endif
 	default:
 		break;
 	}
 	return kvm_vm_ioctl_check_extension(kvm, arg);
+}
+
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+	int r;
+
+	if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+		return -EINVAL;
+
+	/* the size should be power of 2 */
+	if (!size || (size & (size - 1)))
+		return -EINVAL;
+
+	/* Should be bigger to keep the reserved entries, or a page */
+	if (size < kvm_dirty_ring_get_rsvd_entries() *
+	    sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+		return -EINVAL;
+
+	if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+	    sizeof(struct kvm_dirty_gfn))
+		return -E2BIG;
+
+	/* We only allow it to set once */
+	if (kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	if (kvm->created_vcpus) {
+		/* We don't allow to change this value after vcpu created */
+		r = -EINVAL;
+	} else {
+		kvm->dirty_ring_size = size;
+		r = 0;
+	}
+
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	int cleared = 0;
+
+	if (!kvm->dirty_ring_size)
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (cleared)
+		kvm_flush_remote_tlbs(kvm);
+
+	return cleared;
 }

 int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
···
 		kvm->max_halt_poll_ns = cap->args[0];
 		return 0;
 	}
+	case KVM_CAP_DIRTY_LOG_RING:
+		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
···
 	}
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
+		break;
+	case KVM_RESET_DIRTY_RINGS:
+		r = kvm_vm_ioctl_reset_dirty_pages(kvm);
 		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
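The size checks in `kvm_vm_ioctl_enable_dirty_log_ring()` can be mirrored in userspace to reject a bad ring size before issuing KVM_ENABLE_CAP. The function below is a hedged sketch of the stateless checks only: `check_dirty_ring_size` and `DIRTY_GFN_SIZE` are hypothetical names, the reserved-entry count and page size are passed in as parameters, and the kernel's two stateful conditions (the size can be set only once, and only before any vCPU is created) are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define KVM_DIRTY_RING_MAX_ENTRIES	65536u
#define DIRTY_GFN_SIZE			16u	/* u32 + u32 + u64 */

/*
 * Returns 0 if 'size' (in bytes) would pass the stateless checks of
 * the enable path, or a negative errno value matching the kernel's.
 */
static int check_dirty_ring_size(uint32_t size, uint32_t rsvd_entries,
				 uint32_t page_size)
{
	/* Page sizes are powers of two. */
	assert(page_size && !(page_size & (page_size - 1)));

	/* The size must be a non-zero power of two. */
	if (!size || (size & (size - 1)))
		return -EINVAL;

	/* It must hold the reserved entries and span at least one page. */
	if (size < rsvd_entries * DIRTY_GFN_SIZE || size < page_size)
		return -EINVAL;

	/* It must not exceed the per-ring maximum (1 MiB of entries). */
	if (size > KVM_DIRTY_RING_MAX_ENTRIES * DIRTY_GFN_SIZE)
		return -E2BIG;

	return 0;
}
```

The upper bound matches what KVM_CHECK_EXTENSION reports for KVM_CAP_DIRTY_LOG_RING in this commit (`KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn)`), so a caller could also query that value at run time instead of hard-coding it.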