Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kvm: introduce manual dirty log reprotect

There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:

1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked dirty.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).

This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate at a
64-page granularity rather than requiring a sync of the full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
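The two-phase flow this patch enables can be sketched from userspace as follows. This is an illustrative sketch, not QEMU code: the helper names are hypothetical, and the structures/ioctl numbers are mirrored locally from the uapi additions below (assuming a 64-bit Linux build) so the snippet is self-contained.

```c
/* Sketch of the manual-reprotect flow: fetch dirty bits without
 * write-protecting, copy the pages, then re-enable tracking only
 * for the harvested range via KVM_CLEAR_DIRTY_LOG. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define KVMIO 0xAE

struct kvm_dirty_log {			/* pre-existing uapi structure */
	uint32_t slot;
	uint32_t padding1;
	union {
		void *dirty_bitmap;
		uint64_t padding2;
	};
};

struct kvm_clear_dirty_log {		/* added by this patch */
	uint32_t slot;
	uint32_t num_pages;
	uint64_t first_page;
	union {
		void *dirty_bitmap;	/* one bit per page */
		uint64_t padding2;
	};
};

#define KVM_GET_DIRTY_LOG   _IOW(KVMIO, 0x42, struct kvm_dirty_log)
#define KVM_CLEAR_DIRTY_LOG _IOWR(KVMIO, 0xc0, struct kvm_clear_dirty_log)

/* first_page and num_pages must be multiples of 64 */
static uint64_t round_up_64(uint64_t pages)
{
	return (pages + 63) & ~63ULL;
}

/* Hypothetical helper; error handling elided.  Assumes the VM has
 * already enabled KVM_CAP_MANUAL_DIRTY_LOG_PROTECT. */
static int harvest_and_clear(int vm_fd, uint32_t slot,
			     void *bitmap, uint64_t npages)
{
	struct kvm_dirty_log get = { .slot = slot, .dirty_bitmap = bitmap };
	struct kvm_clear_dirty_log clr = {
		.slot = slot,
		.first_page = 0,
		.num_pages = round_up_64(npages),
		.dirty_bitmap = bitmap,
	};

	/* 1. With the capability enabled, this no longer write-protects
	 *    the returned pages. */
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get) < 0)
		return -1;
	/* 2. ... userspace copies the page contents here ... */
	/* 3. Re-enable dirty tracking just before using the content. */
	return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);
}
```

A real user could also clear in 64-page chunks across the slot, which is what keeps each mmu_lock hold short.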

+308 -19
+67
Documentation/virtual/kvm/api.txt
··· 305 305 They must be less than the value that KVM_CHECK_EXTENSION returns for 306 306 the KVM_CAP_MULTI_ADDRESS_SPACE capability. 307 307 308 + The bits in the dirty bitmap are cleared before the ioctl returns, unless 309 + KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is enabled. For more information, 310 + see the description of the capability. 308 311 309 312 4.9 KVM_SET_MEMORY_ALIAS 310 313 ··· 3761 3758 between coalesced mmio and pio except that coalesced pio records accesses 3762 3759 to I/O ports. 3763 3760 3761 + 4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl) 3762 + 3763 + Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 3764 + Architectures: x86 3765 + Type: vm ioctl 3766 + Parameters: struct kvm_clear_dirty_log (in) 3767 + Returns: 0 on success, -1 on error 3768 + 3769 + /* for KVM_CLEAR_DIRTY_LOG */ 3770 + struct kvm_clear_dirty_log { 3771 + __u32 slot; 3772 + __u32 num_pages; 3773 + __u64 first_page; 3774 + union { 3775 + void __user *dirty_bitmap; /* one bit per page */ 3776 + __u64 padding; 3777 + }; 3778 + }; 3779 + 3780 + The ioctl clears the dirty status of pages in a memory slot, according to 3781 + the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap 3782 + field. Bit 0 of the bitmap corresponds to page "first_page" in the 3783 + memory slot, and num_pages is the size in bits of the input bitmap. 3784 + Both first_page and num_pages must be a multiple of 64. For each bit 3785 + that is set in the input bitmap, the corresponding page is marked "clean" 3786 + in KVM's dirty bitmap, and dirty tracking is re-enabled for that page 3787 + (for example via write-protection, or by clearing the dirty bit in 3788 + a page table entry). 3789 + 3790 + If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 specify 3791 + the address space for which you want to clear the dirty bitmap. 3792 + They must be less than the value that KVM_CHECK_EXTENSION returns for 3793 + the KVM_CAP_MULTI_ADDRESS_SPACE capability. 
3794 + 3795 + This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 3796 + is enabled; for more information, see the description of the capability. 3797 + However, it can always be used as long as KVM_CHECK_EXTENSION confirms 3798 + that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is present. 3799 + 3800 + 3764 3801 5. The kvm_run structure 3765 3802 ------------------------ 3766 3803 ··· 4694 4651 4695 4652 * For the new DR6 bits, note that bit 16 is set iff the #DB exception 4696 4653 will clear DR6.RTM. 4654 + 4655 + 7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 4656 + 4657 + Architectures: all 4658 + Parameters: args[0] whether feature should be enabled or not 4659 + 4660 + With this capability enabled, KVM_GET_DIRTY_LOG will not automatically 4661 + clear and write-protect all pages that are returned as dirty. 4662 + Rather, userspace will have to do this operation separately using 4663 + KVM_CLEAR_DIRTY_LOG. 4664 + 4665 + At the cost of a slightly more complicated operation, this provides better 4666 + scalability and responsiveness for two reasons. First, the 4667 + KVM_CLEAR_DIRTY_LOG ioctl can operate at a 64-page granularity rather 4668 + than requiring a sync of the full memslot; this ensures that KVM does not 4669 + take spinlocks for an extended period of time. Second, in some cases a 4670 + large amount of time can pass between a call to KVM_GET_DIRTY_LOG and 4671 + userspace actually using the data in the page. Pages can be modified 4672 + during this time, which is inefficient for both the guest and userspace: 4673 + the guest will incur a higher penalty due to write protection faults, 4674 + while userspace can see false reports of dirty pages. Manual reprotection 4675 + helps reduce this time, improving guest performance and reducing the 4676 + number of dirty log false positives. 4677 + 4697 4678 4698 4679 8. Other capabilities. 4699 4680 ----------------------
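The documentation above requires first_page and num_pages to be multiples of 64 (one "unsigned long" worth of bitmap bits). A caller can pre-flight that rule before issuing the ioctl; a minimal sketch with a hypothetical helper name, mirroring the kernel-side check in kvm_clear_dirty_log_protect():

```c
#include <stdint.h>

/* Mirrors the kernel test "(log->first_page & 63) || (log->num_pages & 63)":
 * both fields must be 64-page aligned or the ioctl returns -EINVAL. */
static int clear_args_valid(uint64_t first_page, uint32_t num_pages)
{
	return !(first_page & 63) && !(num_pages & 63);
}
```

A caller would typically round the requested range outward to 64-page boundaries; bits beyond the memslot length are ignored by the kernel, so over-rounding is safe.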
+23
arch/mips/kvm/mips.c
··· 1023 1023 return r; 1024 1024 } 1025 1025 1026 + int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, struct kvm_clear_dirty_log *log) 1027 + { 1028 + struct kvm_memslots *slots; 1029 + struct kvm_memory_slot *memslot; 1030 + bool flush = false; 1031 + int r; 1032 + 1033 + mutex_lock(&kvm->slots_lock); 1034 + 1035 + r = kvm_clear_dirty_log_protect(kvm, log, &flush); 1036 + 1037 + if (flush) { 1038 + slots = kvm_memslots(kvm); 1039 + memslot = id_to_memslot(slots, log->slot); 1040 + 1041 + /* Let implementation handle TLB/GVA invalidation */ 1042 + kvm_mips_callbacks->flush_shadow_memslot(kvm, memslot); 1043 + } 1044 + 1045 + mutex_unlock(&kvm->slots_lock); 1046 + return r; 1047 + } 1048 + 1026 1049 long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) 1027 1050 { 1028 1051 long r;
+27
arch/x86/kvm/x86.c
··· 4418 4418 return r; 4419 4419 } 4420 4420 4421 + int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, struct kvm_clear_dirty_log *log) 4422 + { 4423 + bool flush = false; 4424 + int r; 4425 + 4426 + mutex_lock(&kvm->slots_lock); 4427 + 4428 + /* 4429 + * Flush potentially hardware-cached dirty pages to dirty_bitmap. 4430 + */ 4431 + if (kvm_x86_ops->flush_log_dirty) 4432 + kvm_x86_ops->flush_log_dirty(kvm); 4433 + 4434 + r = kvm_clear_dirty_log_protect(kvm, log, &flush); 4435 + 4436 + /* 4437 + * All the TLBs can be flushed out of mmu lock, see the comments in 4438 + * kvm_mmu_slot_remove_write_access(). 4439 + */ 4440 + lockdep_assert_held(&kvm->slots_lock); 4441 + if (flush) 4442 + kvm_flush_remote_tlbs(kvm); 4443 + 4444 + mutex_unlock(&kvm->slots_lock); 4445 + return r; 4446 + } 4447 + 4421 4448 int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, 4422 4449 bool line_status) 4423 4450 {
+5
include/linux/kvm_host.h
··· 449 449 #endif 450 450 long tlbs_dirty; 451 451 struct list_head devices; 452 + bool manual_dirty_log_protect; 452 453 struct dentry *debugfs_dentry; 453 454 struct kvm_stat_data **debugfs_stat_data; 454 455 struct srcu_struct srcu; ··· 755 754 756 755 int kvm_get_dirty_log_protect(struct kvm *kvm, 757 756 struct kvm_dirty_log *log, bool *flush); 757 + int kvm_clear_dirty_log_protect(struct kvm *kvm, 758 + struct kvm_clear_dirty_log *log, bool *flush); 758 759 759 760 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, 760 761 struct kvm_memory_slot *slot, ··· 765 762 766 763 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, 767 764 struct kvm_dirty_log *log); 765 + int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, 766 + struct kvm_clear_dirty_log *log); 768 767 769 768 int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, 770 769 bool line_status);
+15
include/uapi/linux/kvm.h
··· 492 492 }; 493 493 }; 494 494 495 + /* for KVM_CLEAR_DIRTY_LOG */ 496 + struct kvm_clear_dirty_log { 497 + __u32 slot; 498 + __u32 num_pages; 499 + __u64 first_page; 500 + union { 501 + void __user *dirty_bitmap; /* one bit per page */ 502 + __u64 padding2; 503 + }; 504 + }; 505 + 495 506 /* for KVM_SET_SIGNAL_MASK */ 496 507 struct kvm_signal_mask { 497 508 __u32 len; ··· 986 975 #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163 987 976 #define KVM_CAP_EXCEPTION_PAYLOAD 164 988 977 #define KVM_CAP_ARM_VM_IPA_SIZE 165 978 + #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166 989 979 990 980 #ifdef KVM_CAP_IRQ_ROUTING 991 981 ··· 1432 1420 /* Available with KVM_CAP_NESTED_STATE */ 1433 1421 #define KVM_GET_NESTED_STATE _IOWR(KVMIO, 0xbe, struct kvm_nested_state) 1434 1422 #define KVM_SET_NESTED_STATE _IOW(KVMIO, 0xbf, struct kvm_nested_state) 1423 + 1424 + /* Available with KVM_CAP_MANUAL_DIRTY_LOG_PROTECT */ 1425 + #define KVM_CLEAR_DIRTY_LOG _IOWR(KVMIO, 0xc0, struct kvm_clear_dirty_log) 1435 1426 1436 1427 /* Secure Encrypted Virtualization command */ 1437 1428 enum sev_cmd_id {
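For reference, on an LP64 build the new structure occupies 24 bytes, and _IOWR folds that size into the ioctl number 0xc0. A small self-contained sketch (uapi definitions mirrored locally; assumes 64-bit pointers) checking the resulting ABI constants:

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define KVMIO 0xAE

/* Local mirror of the uapi structure added above. */
struct kvm_clear_dirty_log {
	uint32_t slot;
	uint32_t num_pages;
	uint64_t first_page;
	union {
		void *dirty_bitmap;	/* one bit per page */
		uint64_t padding2;
	};
};

/* dir=_IOC_READ|_IOC_WRITE, type=KVMIO, nr=0xc0, size=sizeof(struct) */
#define KVM_CLEAR_DIRTY_LOG _IOWR(KVMIO, 0xc0, struct kvm_clear_dirty_log)
```

Because the ioctl number encodes the structure size, growing the structure later would change the command value, which is why the union pads dirty_bitmap to a fixed 8 bytes on all ABIs.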
+2
tools/testing/selftests/kvm/Makefile
··· 16 16 TEST_GEN_PROGS_x86_64 += x86_64/state_test 17 17 TEST_GEN_PROGS_x86_64 += x86_64/evmcs_test 18 18 TEST_GEN_PROGS_x86_64 += dirty_log_test 19 + TEST_GEN_PROGS_x86_64 += clear_dirty_log_test 19 20 20 21 TEST_GEN_PROGS_aarch64 += dirty_log_test 22 + TEST_GEN_PROGS_aarch64 += clear_dirty_log_test 21 23 22 24 TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M)) 23 25 LIBKVM += $(LIBKVM_$(UNAME_M))
+2
tools/testing/selftests/kvm/clear_dirty_log_test.c
··· 1 + #define USE_CLEAR_DIRTY_LOG 2 + #include "dirty_log_test.c"
+19
tools/testing/selftests/kvm/dirty_log_test.c
··· 275 275 276 276 vm = create_vm(mode, VCPU_ID, guest_num_pages, guest_code); 277 277 278 + #ifdef USE_CLEAR_DIRTY_LOG 279 + struct kvm_enable_cap cap = {}; 280 + 281 + cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT; 282 + cap.args[0] = 1; 283 + vm_enable_cap(vm, &cap); 284 + #endif 285 + 278 286 /* Add an extra memory slot for testing dirty logging */ 279 287 vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 280 288 guest_test_mem, ··· 324 316 /* Give the vcpu thread some time to dirty some pages */ 325 317 usleep(interval * 1000); 326 318 kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap); 319 + #ifdef USE_CLEAR_DIRTY_LOG 320 + kvm_vm_clear_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap, 0, 321 + DIV_ROUND_UP(host_num_pages, 64) * 64); 322 + #endif 327 323 vm_dirty_log_verify(bmap); 328 324 iteration++; 329 325 sync_global_to_guest(vm, iteration); ··· 403 391 bool top_offset = false; 404 392 unsigned int mode; 405 393 int opt, i; 394 + 395 + #ifdef USE_CLEAR_DIRTY_LOG 396 + if (!kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT)) { 397 + fprintf(stderr, "KVM_CLEAR_DIRTY_LOG not available, skipping tests\n"); 398 + exit(KSFT_SKIP); 399 + } 400 + #endif 406 401 407 402 while ((opt = getopt(argc, argv, "hi:I:o:tm:")) != -1) { 408 403 switch (opt) {
+2
tools/testing/selftests/kvm/include/kvm_util.h
··· 58 58 void kvm_vm_restart(struct kvm_vm *vmp, int perm); 59 59 void kvm_vm_release(struct kvm_vm *vmp); 60 60 void kvm_vm_get_dirty_log(struct kvm_vm *vm, int slot, void *log); 61 + void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log, 62 + uint64_t first_page, uint32_t num_pages); 61 63 62 64 int kvm_memcmp_hva_gva(void *hva, struct kvm_vm *vm, const vm_vaddr_t gva, 63 65 size_t len);
+13
tools/testing/selftests/kvm/lib/kvm_util.c
··· 231 231 strerror(-ret)); 232 232 } 233 233 234 + void kvm_vm_clear_dirty_log(struct kvm_vm *vm, int slot, void *log, 235 + uint64_t first_page, uint32_t num_pages) 236 + { 237 + struct kvm_clear_dirty_log args = { .dirty_bitmap = log, .slot = slot, 238 + .first_page = first_page, 239 + .num_pages = num_pages }; 240 + int ret; 241 + 242 + ret = ioctl(vm->fd, KVM_CLEAR_DIRTY_LOG, &args); 243 + TEST_ASSERT(ret == 0, "%s: KVM_CLEAR_DIRTY_LOG failed: %s", 244 + __func__, strerror(-ret)); 245 + } 246 + 234 247 /* 235 248 * Userspace Memory Region Find 236 249 *
+16
virt/kvm/arm/arm.c
··· 1219 1219 return r; 1220 1220 } 1221 1221 1222 + int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, struct kvm_clear_dirty_log *log) 1223 + { 1224 + bool flush = false; 1225 + int r; 1226 + 1227 + mutex_lock(&kvm->slots_lock); 1228 + 1229 + r = kvm_clear_dirty_log_protect(kvm, log, &flush); 1230 + 1231 + if (flush) 1232 + kvm_flush_remote_tlbs(kvm); 1233 + 1234 + mutex_unlock(&kvm->slots_lock); 1235 + return r; 1236 + } 1237 + 1222 1238 static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, 1223 1239 struct kvm_arm_device_addr *dev_addr) 1224 1240 {
+117 -19
virt/kvm/kvm_main.c
··· 1133 1133 #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT 1134 1134 /** 1135 1135 * kvm_get_dirty_log_protect - get a snapshot of dirty pages, and if any pages 1136 - * are dirty write protect them for next write. 1136 + * and reenable dirty page tracking for the corresponding pages. 1137 1137 * @kvm: pointer to kvm instance 1138 1138 * @log: slot id and address to which we copy the log 1139 1139 * @is_dirty: flag set if any page is dirty ··· 1176 1176 return -ENOENT; 1177 1177 1178 1178 n = kvm_dirty_bitmap_bytes(memslot); 1179 - 1180 - dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot); 1181 - memset(dirty_bitmap_buffer, 0, n); 1182 - 1183 - spin_lock(&kvm->mmu_lock); 1184 1179 *flush = false; 1185 - for (i = 0; i < n / sizeof(long); i++) { 1186 - unsigned long mask; 1187 - gfn_t offset; 1180 + if (kvm->manual_dirty_log_protect) { 1181 + /* 1182 + * Unlike kvm_get_dirty_log, we always return false in *flush, 1183 + * because no flush is needed until KVM_CLEAR_DIRTY_LOG. There 1184 + * is some code duplication between this function and 1185 + * kvm_get_dirty_log, but hopefully all architectures 1186 + * will transition to kvm_get_dirty_log_protect, at which point 1187 + * kvm_get_dirty_log can be eliminated. 
1188 + */ 1189 + dirty_bitmap_buffer = dirty_bitmap; 1190 + } else { 1191 + dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot); 1192 + memset(dirty_bitmap_buffer, 0, n); 1188 1193 1189 - if (!dirty_bitmap[i]) 1190 - continue; 1194 + spin_lock(&kvm->mmu_lock); 1195 + for (i = 0; i < n / sizeof(long); i++) { 1196 + unsigned long mask; 1197 + gfn_t offset; 1191 1198 1192 - *flush = true; 1199 + if (!dirty_bitmap[i]) 1200 + continue; 1193 1201 1194 - mask = xchg(&dirty_bitmap[i], 0); 1195 - dirty_bitmap_buffer[i] = mask; 1202 + *flush = true; 1203 + mask = xchg(&dirty_bitmap[i], 0); 1204 + dirty_bitmap_buffer[i] = mask; 1196 1205 1197 - if (mask) { 1198 - offset = i * BITS_PER_LONG; 1199 - kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, 1200 - offset, mask); 1206 + if (mask) { 1207 + offset = i * BITS_PER_LONG; 1208 + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, 1209 + offset, mask); 1210 + } 1201 1211 } 1212 + spin_unlock(&kvm->mmu_lock); 1202 1213 } 1203 1214 1204 - spin_unlock(&kvm->mmu_lock); 1205 1215 if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n)) 1206 1216 return -EFAULT; 1207 1217 return 0; 1208 1218 } 1209 1219 EXPORT_SYMBOL_GPL(kvm_get_dirty_log_protect); 1220 + 1221 + /** 1222 + * kvm_clear_dirty_log_protect - clear dirty bits in the bitmap 1223 + * and reenable dirty page tracking for the corresponding pages. 
1224 + * @kvm: pointer to kvm instance 1225 + * @log: slot id and address from which to fetch the bitmap of dirty pages 1226 + */ 1227 + int kvm_clear_dirty_log_protect(struct kvm *kvm, 1228 + struct kvm_clear_dirty_log *log, bool *flush) 1229 + { 1230 + struct kvm_memslots *slots; 1231 + struct kvm_memory_slot *memslot; 1232 + int as_id, id, n; 1233 + gfn_t offset; 1234 + unsigned long i; 1235 + unsigned long *dirty_bitmap; 1236 + unsigned long *dirty_bitmap_buffer; 1237 + 1238 + as_id = log->slot >> 16; 1239 + id = (u16)log->slot; 1240 + if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS) 1241 + return -EINVAL; 1242 + 1243 + if ((log->first_page & 63) || (log->num_pages & 63)) 1244 + return -EINVAL; 1245 + 1246 + slots = __kvm_memslots(kvm, as_id); 1247 + memslot = id_to_memslot(slots, id); 1248 + 1249 + dirty_bitmap = memslot->dirty_bitmap; 1250 + if (!dirty_bitmap) 1251 + return -ENOENT; 1252 + 1253 + n = kvm_dirty_bitmap_bytes(memslot); 1254 + *flush = false; 1255 + dirty_bitmap_buffer = kvm_second_dirty_bitmap(memslot); 1256 + if (copy_from_user(dirty_bitmap_buffer, log->dirty_bitmap, n)) 1257 + return -EFAULT; 1258 + 1259 + spin_lock(&kvm->mmu_lock); 1260 + for (offset = log->first_page, 1261 + i = offset / BITS_PER_LONG, n = log->num_pages / BITS_PER_LONG; n--; 1262 + i++, offset += BITS_PER_LONG) { 1263 + unsigned long mask = *dirty_bitmap_buffer++; 1264 + atomic_long_t *p = (atomic_long_t *) &dirty_bitmap[i]; 1265 + if (!mask) 1266 + continue; 1267 + 1268 + mask &= atomic_long_fetch_andnot(mask, p); 1269 + 1270 + /* 1271 + * mask contains the bits that really have been cleared. This 1272 + * never includes any bits beyond the length of the memslot (if 1273 + * the length is not aligned to 64 pages), therefore it is not 1274 + * a problem if userspace sets them in log->dirty_bitmap. 
1275 + */ 1276 + if (mask) { 1277 + *flush = true; 1278 + kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, 1279 + offset, mask); 1280 + } 1281 + } 1282 + spin_unlock(&kvm->mmu_lock); 1283 + 1284 + return 0; 1285 + } 1286 + EXPORT_SYMBOL_GPL(kvm_clear_dirty_log_protect); 1210 1287 #endif 1211 1288 1212 1289 bool kvm_largepages_enabled(void) ··· 3026 2949 case KVM_CAP_IOEVENTFD_ANY_LENGTH: 3027 2950 case KVM_CAP_CHECK_EXTENSION_VM: 3028 2951 case KVM_CAP_ENABLE_CAP_VM: 2952 + #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT 2953 + case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT: 2954 + #endif 3029 2955 return 1; 3030 2956 #ifdef CONFIG_KVM_MMIO 3031 2957 case KVM_CAP_COALESCED_MMIO: ··· 3062 2982 struct kvm_enable_cap *cap) 3063 2983 { 3064 2984 switch (cap->cap) { 2985 + #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT 2986 + case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT: 2987 + if (cap->flags || (cap->args[0] & ~1)) 2988 + return -EINVAL; 2989 + kvm->manual_dirty_log_protect = cap->args[0]; 2990 + return 0; 2991 + #endif 3065 2992 default: 3066 2993 return kvm_vm_ioctl_enable_cap(kvm, cap); 3067 2994 } ··· 3116 3029 r = kvm_vm_ioctl_get_dirty_log(kvm, &log); 3117 3030 break; 3118 3031 } 3032 + #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT 3033 + case KVM_CLEAR_DIRTY_LOG: { 3034 + struct kvm_clear_dirty_log log; 3035 + 3036 + r = -EFAULT; 3037 + if (copy_from_user(&log, argp, sizeof(log))) 3038 + goto out; 3039 + r = kvm_vm_ioctl_clear_dirty_log(kvm, &log); 3040 + break; 3041 + } 3042 + #endif 3119 3043 #ifdef CONFIG_KVM_MMIO 3120 3044 case KVM_REGISTER_COALESCED_MMIO: { 3121 3045 struct kvm_coalesced_mmio_zone zone;
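The heart of kvm_clear_dirty_log_protect() above is the fetch-andnot: only bits that were actually set in KVM's bitmap get cleared and reprotected, so stray bits in the user-supplied bitmap are harmless. A plain-C model of that word-level semantics (a non-atomic stand-in for the kernel's atomic_long_fetch_andnot; names are illustrative):

```c
#include <stdint.h>

/* 'dirty' models one word of KVM's dirty bitmap; 'mask' is the word
 * userspace asked to clear.  Returns the bits that really transitioned
 * from 1 to 0, i.e. the pages that must be re-write-protected. */
static uint64_t clear_word(uint64_t *dirty, uint64_t mask)
{
	uint64_t old = *dirty;		/* fetch ...                      */
	*dirty = old & ~mask;		/* ... andnot (non-atomic model)  */
	return mask & old;		/* bits actually cleared          */
}
```

In the kernel the atomic form is what allows the clear to race safely with vCPUs setting new dirty bits: a bit set by the guest after the fetch is preserved rather than lost.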