Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

KVM: Support dirty ring in conjunction with bitmap

ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is
enabled. This conflicts with ring-based dirty page tracking, which always
requires a running VCPU context.

Introduce a new flavor of dirty ring that requires the use of both VCPU
dirty rings and a dirty bitmap. The expectation is that for non-VCPU
sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before migrating
the VM to the target.

Use an additional capability to advertise this behavior. The newly added
capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before
KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added
capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL.

Suggested-by: Marc Zyngier <maz@kernel.org>
Suggested-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com

Authored by Gavin Shan and committed by Marc Zyngier
86bdf3eb e8a18565

+112 -17
+27 -7
Documentation/virt/kvm/api.rst
···
 needs to kick the vcpu out of KVM_RUN using a signal. The resulting
 vmexit ensures that all dirty GFNs are flushed to the dirty rings.
 
-NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
-ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
-KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling
-KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
-machine will switch to ring-buffer dirty page tracking and further
-KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
-
 NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that
 should be exposed by weakly ordered architecture, in order to indicate
 the additional memory ordering requirements imposed on userspace when
···
 Architecture with TSO-like ordering (such as x86) are allowed to
 expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 to userspace.
+
+After enabling the dirty rings, the userspace needs to detect the
+capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the
+ring structures can be backed by per-slot bitmaps. With this capability
+advertised, it means the architecture can dirty guest pages without
+vcpu/ring context, so that some of the dirty information will still be
+maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
+can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL
+hasn't been enabled, or any memslot has been existing.
+
+Note that the bitmap here is only a backup of the ring structure. The
+use of the ring and bitmap combination is only beneficial if there is
+only a very small amount of memory that is dirtied out of vcpu/ring
+context. Otherwise, the stand-alone per-slot bitmap mechanism needs to
+be considered.
+
+To collect dirty bits in the backup bitmap, userspace can use the same
+KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all
+the generation of the dirty bits is done in a single pass. Collecting
+the dirty bitmap should be the very last thing that the VMM does before
+considering the state as complete. VMM needs to ensure that the dirty
+state is final and avoid missing dirty pages from another ioctl ordered
+after the bitmap collection.
+
+NOTE: One example of using the backup bitmap is saving arm64 vgic/its
+tables through KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on
+KVM device "kvm-arm-vgic-its" when dirty ring is enabled.
 
 8.30 KVM_CAP_XEN_HVM
 --------------------
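The enabling order described in the documentation hunk above can be sketched from the userspace side roughly as follows. This is a minimal sketch, not code from the commit: the helper names `enable_cap` and `setup_dirty_ring` are hypothetical, the fallback capability numbers are copied from the include/uapi/linux/kvm.h hunk of this commit for older headers, and it assumes the VM file descriptor accepts KVM_CHECK_EXTENSION and that no memslot has been created yet.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fallbacks for older uapi headers; values as listed in this commit. */
#ifndef KVM_CAP_DIRTY_LOG_RING_ACQ_REL
#define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
#endif
#ifndef KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
#endif

/* Hypothetical helper: enable one VM capability with flags == 0. */
static int enable_cap(int vm_fd, uint32_t cap_nr, uint64_t arg0)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = cap_nr;
	cap.args[0] = arg0;
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/*
 * Hypothetical helper: enable the dirty ring first, then the backup
 * bitmap if the kernel advertises it. Must run before any memslot is
 * created. Returns 1 for ring+bitmap, 0 for ring only, -1 on error.
 */
static int setup_dirty_ring(int vm_fd, uint32_t ring_bytes)
{
	int ret;

	/* The ring has to be enabled before the bitmap extension. */
	if (enable_cap(vm_fd, KVM_CAP_DIRTY_LOG_RING_ACQ_REL, ring_bytes) < 0)
		return -1;

	ret = ioctl(vm_fd, KVM_CHECK_EXTENSION,
		    (long)KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP);
	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;	/* no non-vCPU dirtying, ring alone suffices */

	if (enable_cap(vm_fd, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0) < 0)
		return -1;
	return 1;
}
```

A VMM would call `setup_dirty_ring()` right after KVM_CREATE_VM; a return value of 1 tells it to also harvest the per-slot bitmaps at the end of migration.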
+4 -1
Documentation/virt/kvm/devices/arm-vgic-its.rst
···
 
     KVM_DEV_ARM_ITS_SAVE_TABLES
       save the ITS table data into guest RAM, at the location provisioned
-      by the guest in corresponding registers/table entries.
+      by the guest in corresponding registers/table entries. Should userspace
+      require a form of dirty tracking to identify which pages are modified
+      by the saving process, it should use a bitmap even if using another
+      mechanism to track the memory dirtied by the vCPUs.
 
     The layout of the tables in guest memory defines an ABI. The entries
     are laid out in little endian format as described in the last paragraph.
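Once the ITS tables have been saved, the pages they touched show up only in the per-slot bitmap, so a VMM would fetch that bitmap as its final step. The following is a rough sketch under stated assumptions: `collect_backup_bitmap` is a hypothetical helper name, the caller is assumed to have already issued the ITS save command, and the buffer is assumed to hold one bit per page of the slot.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: copy the dirty bitmap of one memslot into
 * `bitmap`. `slot` encodes (address space id << 16) | slot id, as
 * decoded by the kernel side in this commit. Intended to run after
 * the ITS tables were saved and after all vCPUs have stopped.
 * Returns 0 on success, -1 on error (errno set by ioctl).
 */
static int collect_backup_bitmap(int vm_fd, uint32_t slot, void *bitmap)
{
	struct kvm_dirty_log log;

	memset(&log, 0, sizeof(log));
	log.slot = slot;
	log.dirty_bitmap = bitmap;
	return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}
```

Without KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP enabled on a ring-tracking VM, this call is expected to fail with ENXIO, which matches the checks in the virt/kvm/kvm_main.c part of this commit.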
+7
include/linux/kvm_dirty_ring.h
···
 	return 0;
 }
 
+static inline bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	return true;
+}
+
 static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
 				       int index, u32 size)
 {
···
 #else /* CONFIG_HAVE_KVM_DIRTY_RING */
 
 int kvm_cpu_dirty_log_size(void);
+bool kvm_use_dirty_bitmap(struct kvm *kvm);
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm);
 u32 kvm_dirty_ring_get_rsvd_entries(void);
 int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
 
+1
include/linux/kvm_host.h
···
 	pid_t userspace_pid;
 	unsigned int max_halt_poll_ns;
 	u32 dirty_ring_size;
+	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
 
+1
include/uapi/linux/kvm.h
···
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
+6
virt/kvm/Kconfig
···
 	bool
 	select HAVE_KVM_DIRTY_RING
 
+# Allow enabling both the dirty bitmap and dirty ring. Only architectures
+# that need to dirty memory outside of a vCPU context should select this.
+config NEED_KVM_DIRTY_RING_WITH_BITMAP
+	bool
+	depends on HAVE_KVM_DIRTY_RING
+
 config HAVE_KVM_EVENTFD
 	bool
 	select EVENTFD
+14
virt/kvm/dirty_ring.c
···
 	return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
 }
 
+bool kvm_use_dirty_bitmap(struct kvm *kvm)
+{
+	lockdep_assert_held(&kvm->slots_lock);
+
+	return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap;
+}
+
+#ifndef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
 {
 	return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
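The predicate added in kvm_use_dirty_bitmap() reduces to a small truth table, which the stand-alone model below spells out. This is an illustrative sketch only: `use_dirty_bitmap_model` is a hypothetical userspace function that mirrors the boolean expression from the hunk, takes the two fields as plain arguments, and omits the lockdep assertion.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the kernel predicate: a per-slot bitmap is in use when the
 * dirty ring is disabled (size 0), or when the ring is enabled and
 * the ring+bitmap capability has been turned on.
 */
static bool use_dirty_bitmap_model(uint32_t dirty_ring_size,
				   bool dirty_ring_with_bitmap)
{
	return !dirty_ring_size || dirty_ring_with_bitmap;
}
```

Only one combination yields false: ring enabled without the bitmap extension. That is the case in which KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG keep returning -ENXIO.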
+52 -9
virt/kvm/kvm_main.c
···
 			new->dirty_bitmap = NULL;
 		else if (old && old->dirty_bitmap)
 			new->dirty_bitmap = old->dirty_bitmap;
-		else if (!kvm->dirty_ring_size) {
+		else if (kvm_use_dirty_bitmap(kvm)) {
 			r = kvm_alloc_dirty_bitmap(new);
 			if (r)
 				return r;
···
 	unsigned long n;
 	unsigned long any = 0;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	*memslot = NULL;
···
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
···
 	unsigned long *dirty_bitmap_buffer;
 	bool flush;
 
-	/* Dirty ring tracking is exclusive to dirty log tracking */
-	if (kvm->dirty_ring_size)
+	/* Dirty ring tracking may be exclusive to dirty log tracking */
+	if (!kvm_use_dirty_bitmap(kvm))
 		return -ENXIO;
 
 	as_id = log->slot >> 16;
···
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
 #ifdef CONFIG_HAVE_KVM_DIRTY_RING
-	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
+	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
+		return;
+
+	if (WARN_ON_ONCE(!kvm_arch_allow_write_without_running_vcpu(kvm) && !vcpu))
 		return;
 #endif
 
···
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-		if (kvm->dirty_ring_size)
+		if (kvm->dirty_ring_size && vcpu)
 			kvm_dirty_ring_push(vcpu, slot, rel_gfn);
 		else
 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
···
 #else
 		return 0;
 #endif
+#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP:
+#endif
 	case KVM_CAP_BINARY_STATS_FD:
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 		return 1;
···
 	return -EINVAL;
 }
 
+static bool kvm_are_all_memslots_empty(struct kvm *kvm)
+{
+	int i;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		if (!kvm_memslots_empty(__kvm_memslots(kvm, i)))
+			return false;
+	}
+
+	return true;
+}
+
 static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 					   struct kvm_enable_cap *cap)
 {
···
 			return -EINVAL;
 
 		return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
+	case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: {
+		int r = -EINVAL;
+
+		if (!IS_ENABLED(CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP) ||
+		    !kvm->dirty_ring_size || cap->flags)
+			return r;
+
+		mutex_lock(&kvm->slots_lock);
+
+		/*
+		 * For simplicity, allow enabling ring+bitmap if and only if
+		 * there are no memslots, e.g. to ensure all memslots allocate
+		 * a bitmap after the capability is enabled.
+		 */
+		if (kvm_are_all_memslots_empty(kvm)) {
+			kvm->dirty_ring_with_bitmap = true;
+			r = 0;
+		}
+
+		mutex_unlock(&kvm->slots_lock);
+
+		return r;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
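The change to the page-dirtying path in this file decides, per dirtied page, whether the entry goes to the vCPU ring, to the memslot bitmap, or is dropped with a warning. The model below is a simplified sketch of that decision, not kernel code: `route_dirty_gfn` and the enum are hypothetical names, and it assumes CONFIG_HAVE_KVM_DIRTY_RING is set and that the memslot has dirty tracking enabled.

```c
#include <stdbool.h>
#include <stdint.h>

enum dirty_target {
	DIRTY_DROPPED,		/* a WARN_ON_ONCE() fired, nothing recorded */
	DIRTY_TO_RING,		/* pushed to the running vCPU's dirty ring */
	DIRTY_TO_BITMAP,	/* bit set in the memslot's dirty bitmap */
};

/*
 * have_vcpu:            a vCPU is currently running on this CPU
 * vcpu_same_vm:         that vCPU belongs to the VM being dirtied
 * arch_allows_no_vcpu:  result of the arch hook that permits writes
 *                       without a running vCPU (false by default)
 */
static enum dirty_target route_dirty_gfn(uint32_t dirty_ring_size,
					 bool have_vcpu, bool vcpu_same_vm,
					 bool arch_allows_no_vcpu)
{
	/* A running vCPU from a different VM is always a bug. */
	if (have_vcpu && !vcpu_same_vm)
		return DIRTY_DROPPED;

	/* No vCPU is only tolerated when the architecture opts in. */
	if (!arch_allows_no_vcpu && !have_vcpu)
		return DIRTY_DROPPED;

	/* The ring is used only when enabled and a vCPU is available. */
	if (dirty_ring_size && have_vcpu)
		return DIRTY_TO_RING;

	return DIRTY_TO_BITMAP;
}
```

With the ring enabled, the only inputs that reach the bitmap are writes without a running vCPU on an architecture that allows them, which is the VGIC/ITS case the commit message describes.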