Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio updates from Rusty Russell:
"OK, this has the big virtio 1.0 implementation, as specified by OASIS.

On top of that is the major rework of lguest, to use PCI and virtio
1.0, to double-check the implementation.

Then come the inevitable fixes and cleanups from that work"

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits)
virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice.
virtio_net: unconditionally define struct virtio_net_hdr_v1.
tools/lguest: don't use legacy definitions for net device in example launcher.
virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined.
tools/lguest: use common error macros in the example launcher.
tools/lguest: give virtqueues names for better error messages
tools/lguest: more documentation and checking of virtio 1.0 compliance.
lguest: don't look in console features to find emerg_wr.
tools/lguest: don't start devices until DRIVER_OK status set.
tools/lguest: handle indirect partway through chain.
tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: rename virtio_pci_cfg_cap field to match spec.
tools/lguest: fix features_accepted logic in example launcher.
tools/lguest: handle device reset correctly in example launcher.
virtual: Documentation: simplify and generalize paravirt_ops.txt
lguest: remove NOTIFY call and eventfd facility.
lguest: remove NOTIFY facility from demonstration launcher.
lguest: use the PCI console device's emerg_wr for early boot messages.
lguest: always put console in PCI slot #1.
...

+3397 -1635
-137
Documentation/ia64/paravirt_ops.txt
···
- Paravirt_ops on IA64
- ====================
- 21 May 2008, Isaku Yamahata <yamahata@valinux.co.jp>
-
-
- Introduction
- ------------
- The aim of this documentation is to help with maintainability and/or to
- encourage people to use paravirt_ops/IA64.
-
- paravirt_ops (pv_ops in short) is a way for virtualization support of
- Linux kernel on x86. Several ways for virtualization support were
- proposed, paravirt_ops is the winner.
- On the other hand, now there are also several IA64 virtualization
- technologies like kvm/IA64, xen/IA64 and many other academic IA64
- hypervisors so that it is good to add generic virtualization
- infrastructure on Linux/IA64.
-
-
- What is paravirt_ops?
- ---------------------
- It has been developed on x86 as virtualization support via API, not ABI.
- It allows each hypervisor to override operations which are important for
- hypervisors at API level. And it allows a single kernel binary to run on
- all supported execution environments including native machine.
- Essentially paravirt_ops is a set of function pointers which represent
- operations corresponding to low level sensitive instructions and high
- level functionalities in various area. But one significant difference
- from usual function pointer table is that it allows optimization with
- binary patch. It is because some of these operations are very
- performance sensitive and indirect call overhead is not negligible.
- With binary patch, indirect C function call can be transformed into
- direct C function call or in-place execution to eliminate the overhead.
-
- Thus, operations of paravirt_ops are classified into three categories.
- - simple indirect call
-   These operations correspond to high level functionality so that the
-   overhead of indirect call isn't very important.
-
- - indirect call which allows optimization with binary patch
-   Usually these operations correspond to low level instructions. They
-   are called frequently and performance critical. So the overhead is
-   very important.
-
- - a set of macros for hand written assembly code
-   Hand written assembly codes (.S files) also need paravirtualization
-   because they include sensitive instructions or some of code paths in
-   them are very performance critical.
-
-
- The relation to the IA64 machine vector
- ---------------------------------------
- Linux/IA64 has the IA64 machine vector functionality which allows the
- kernel to switch implementations (e.g. initialization, ipi, dma api...)
- depending on executing platform.
- We can replace some implementations very easily defining a new machine
- vector. Thus another approach for virtualization support would be
- enhancing the machine vector functionality.
- But paravirt_ops approach was taken because
- - virtualization support needs wider support than machine vector does.
-   e.g. low level instruction paravirtualization. It must be
-   initialized very early before platform detection.
-
- - virtualization support needs more functionality like binary patch.
-   Probably the calling overhead might not be very large compared to the
-   emulation overhead of virtualization. However in the native case, the
-   overhead should be eliminated completely.
-   A single kernel binary should run on each environment including native,
-   and the overhead of paravirt_ops on native environment should be as
-   small as possible.
-
- - for full virtualization technology, e.g. KVM/IA64 or
-   Xen/IA64 HVM domain, the result would be
-   (the emulated platform machine vector. probably dig) + (pv_ops).
-   This means that the virtualization support layer should be under
-   the machine vector layer.
-
- Possibly it might be better to move some function pointers from
- paravirt_ops to machine vector. In fact, Xen domU case utilizes both
- pv_ops and machine vector.
-
-
- IA64 paravirt_ops
- -----------------
- In this section, the concrete paravirt_ops will be discussed.
- Because of the architecture difference between ia64 and x86, the
- resulting set of functions is very different from x86 pv_ops.
-
- - C function pointer tables
-   They are not very performance critical so that simple C indirect
-   function call is acceptable. The following structures are defined at
-   this moment. For details see linux/include/asm-ia64/paravirt.h
-   - struct pv_info
-     This structure describes the execution environment.
-   - struct pv_init_ops
-     This structure describes the various initialization hooks.
-   - struct pv_iosapic_ops
-     This structure describes hooks to iosapic operations.
-   - struct pv_irq_ops
-     This structure describes hooks to irq related operations
-   - struct pv_time_op
-     This structure describes hooks to steal time accounting.
-
- - a set of indirect calls which need optimization
-   Currently this class of functions correspond to a subset of IA64
-   intrinsics. At this moment the optimization with binary patch isn't
-   implemented yet.
-   struct pv_cpu_op is defined. For details see
-   linux/include/asm-ia64/paravirt_privop.h
-   Mostly they correspond to ia64 intrinsics 1-to-1.
-   Caveat: Now they are defined as C indirect function pointers, but in
-   order to support binary patch optimization, they will be changed
-   using GCC extended inline assembly code.
-
- - a set of macros for hand written assembly code (.S files)
-   For maintenance purpose, the taken approach for .S files is single
-   source code and compile multiple times with different macros definitions.
-   Each pv_ops instance must define those macros to compile.
-   The important thing here is that sensitive, but non-privileged
-   instructions must be paravirtualized and that some privileged
-   instructions also need paravirtualization for reasonable performance.
-   Developers who modify .S files must be aware of that. At this moment
-   an easy checker is implemented to detect paravirtualization breakage.
-   But it doesn't cover all the cases.
-
-   Sometimes this set of macros is called pv_cpu_asm_op. But there is no
-   corresponding structure in the source code.
-   Those macros mostly 1:1 correspond to a subset of privileged
-   instructions. See linux/include/asm-ia64/native/inst.h.
-   And some functions written in assembly also need to be overrided so
-   that each pv_ops instance have to define some macros. Again see
-   linux/include/asm-ia64/native/inst.h.
-
-
-   Those structures must be initialized very early before start_kernel.
-   Probably initialized in head.S using multi entry point or some other trick.
-   For native case implementation see linux/arch/ia64/kernel/paravirt.c.
+3
Documentation/virtual/00-INDEX
···

  00-INDEX
  	- this file.
+
+ paravirt_ops.txt
+ 	- Describes the Linux kernel pv_ops to support different hypervisors
  kvm/
  	- Kernel Virtual Machine.  See also http://linux-kvm.org
  uml/
+32
Documentation/virtual/paravirt_ops.txt
···
+ Paravirt_ops
+ ============
+
+ Linux provides support for different hypervisor virtualization technologies.
+ Historically different binary kernels would be required in order to support
+ different hypervisors, this restriction was removed with pv_ops.
+ Linux pv_ops is a virtualization API which enables support for different
+ hypervisors. It allows each hypervisor to override critical operations and
+ allows a single kernel binary to run on all supported execution environments
+ including native machine -- without any hypervisors.
+
+ pv_ops provides a set of function pointers which represent operations
+ corresponding to low level critical instructions and high level
+ functionalities in various areas. pv-ops allows for optimizations at run
+ time by enabling binary patching of the low-ops critical operations
+ at boot time.
+
+ pv_ops operations are classified into three categories:
+
+ - simple indirect call
+   These operations correspond to high level functionality where it is
+   known that the overhead of indirect call isn't very important.
+
+ - indirect call which allows optimization with binary patch
+   Usually these operations correspond to low level critical instructions. They
+   are called frequently and are performance critical. The overhead is
+   very important.
+
+ - a set of macros for hand written assembly code
+   Hand written assembly codes (.S files) also need paravirtualization
+   because they include sensitive instructions or some of code paths in
+   them are very performance critical.
+1 -1
MAINTAINERS
···
  M:	Rusty Russell <rusty@rustcorp.com.au>
  L:	virtualization@lists.linux-foundation.org
  S:	Supported
- F:	Documentation/ia64/paravirt_ops.txt
+ F:	Documentation/virtual/paravirt_ops.txt
  F:	arch/*/kernel/paravirt*
  F:	arch/*/include/asm/paravirt.h

-35
arch/mn10300/unit-asb2305/pci-iomap.c
···
- /* ASB2305 PCI I/O mapping handler
-  *
-  * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
-  * Written by David Howells (dhowells@redhat.com)
-  *
-  * This program is free software; you can redistribute it and/or
-  * modify it under the terms of the GNU General Public Licence
-  * as published by the Free Software Foundation; either version
-  * 2 of the Licence, or (at your option) any later version.
-  */
- #include <linux/pci.h>
- #include <linux/module.h>
-
- /*
-  * Create a virtual mapping cookie for a PCI BAR (memory or IO)
-  */
- void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
- {
- 	resource_size_t start = pci_resource_start(dev, bar);
- 	resource_size_t len = pci_resource_len(dev, bar);
- 	unsigned long flags = pci_resource_flags(dev, bar);
-
- 	if (!len || !start)
- 		return NULL;
-
- 	if ((flags & IORESOURCE_IO) || (flags & IORESOURCE_MEM)) {
- 		if (flags & IORESOURCE_CACHEABLE && !(flags & IORESOURCE_IO))
- 			return ioremap(start, len);
- 		else
- 			return ioremap_nocache(start, len);
- 	}
-
- 	return NULL;
- }
- EXPORT_SYMBOL(pci_iomap);
+1
arch/s390/include/asm/pci_io.h
···
  struct zpci_iomap_entry {
  	u32 fh;
  	u8 bar;
+ 	u16 count;
  };

  extern struct zpci_iomap_entry *zpci_iomap_start;
+27 -7
arch/s390/pci/pci.c
···
  }

  /* Create a virtual mapping cookie for a PCI BAR */
- void __iomem *pci_iomap(struct pci_dev *pdev, int bar, unsigned long max)
+ void __iomem *pci_iomap_range(struct pci_dev *pdev,
+ 			      int bar,
+ 			      unsigned long offset,
+ 			      unsigned long max)
  {
  	struct zpci_dev *zdev = get_zdev(pdev);
  	u64 addr;
···
  	idx = zdev->bars[bar].map_idx;
  	spin_lock(&zpci_iomap_lock);
- 	zpci_iomap_start[idx].fh = zdev->fh;
- 	zpci_iomap_start[idx].bar = bar;
+ 	if (zpci_iomap_start[idx].count++) {
+ 		BUG_ON(zpci_iomap_start[idx].fh != zdev->fh ||
+ 		       zpci_iomap_start[idx].bar != bar);
+ 	} else {
+ 		zpci_iomap_start[idx].fh = zdev->fh;
+ 		zpci_iomap_start[idx].bar = bar;
+ 	}
+ 	/* Detect overrun */
+ 	BUG_ON(!zpci_iomap_start[idx].count);
  	spin_unlock(&zpci_iomap_lock);

  	addr = ZPCI_IOMAP_ADDR_BASE | ((u64) idx << 48);
- 	return (void __iomem *) addr;
+ 	return (void __iomem *) addr + offset;
  }
- EXPORT_SYMBOL_GPL(pci_iomap);
+ EXPORT_SYMBOL_GPL(pci_iomap_range);
+
+ void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
+ {
+ 	return pci_iomap_range(dev, bar, 0, maxlen);
+ }
+ EXPORT_SYMBOL(pci_iomap);

  void pci_iounmap(struct pci_dev *pdev, void __iomem *addr)
  {
···
  	idx = (((__force u64) addr) & ~ZPCI_IOMAP_ADDR_BASE) >> 48;
  	spin_lock(&zpci_iomap_lock);
- 	zpci_iomap_start[idx].fh = 0;
- 	zpci_iomap_start[idx].bar = 0;
+ 	/* Detect underrun */
+ 	BUG_ON(!zpci_iomap_start[idx].count);
+ 	if (!--zpci_iomap_start[idx].count) {
+ 		zpci_iomap_start[idx].fh = 0;
+ 		zpci_iomap_start[idx].bar = 0;
+ 	}
  	spin_unlock(&zpci_iomap_lock);
  }
  EXPORT_SYMBOL_GPL(pci_iounmap);
-1
arch/x86/include/asm/lguest_hcall.h
···
  #define LHCALL_SET_PTE		14
  #define LHCALL_SET_PGD		15
  #define LHCALL_LOAD_TLS		16
- #define LHCALL_NOTIFY		17
  #define LHCALL_LOAD_GDT_ENTRY	18
  #define LHCALL_SEND_INTERRUPTS	19

+153 -20
arch/x86/lguest/boot.c
···
  #include <linux/virtio_console.h>
  #include <linux/pm.h>
  #include <linux/export.h>
+ #include <linux/pci.h>
+ #include <linux/virtio_pci.h>
+ #include <asm/acpi.h>
  #include <asm/apic.h>
  #include <asm/lguest.h>
  #include <asm/paravirt.h>
···
  #include <asm/stackprotector.h>
  #include <asm/reboot.h>		/* for struct machine_ops */
  #include <asm/kvm_para.h>
+ #include <asm/pci_x86.h>
+ #include <asm/pci-direct.h>

  /*G:010
   * Welcome to the Guest!
···
  	.irq_unmask	= enable_lguest_irq,
  };

+ static int lguest_enable_irq(struct pci_dev *dev)
+ {
+ 	u8 line = 0;
+
+ 	/* We literally use the PCI interrupt line as the irq number. */
+ 	pci_read_config_byte(dev, PCI_INTERRUPT_LINE, &line);
+ 	irq_set_chip_and_handler_name(line, &lguest_irq_controller,
+ 				      handle_level_irq, "level");
+ 	dev->irq = line;
+ 	return 0;
+ }
+
+ /* We don't do hotplug PCI, so this shouldn't be called. */
+ static void lguest_disable_irq(struct pci_dev *dev)
+ {
+ 	WARN_ON(1);
+ }
+
  /*
   * This sets up the Interrupt Descriptor Table (IDT) entry for each hardware
   * interrupt (except 128, which is used for system calls), and then tells the
···
  	return "LGUEST";
  }

+ /* Offset within PCI config space of BAR access capability. */
+ static int console_cfg_offset = 0;
+ static int console_access_cap;
+
+ /* Set up so that we access off in bar0 (on bus 0, device 1, function 0) */
+ static void set_cfg_window(u32 cfg_offset, u32 off)
+ {
+ 	write_pci_config_byte(0, 1, 0,
+ 			      cfg_offset + offsetof(struct virtio_pci_cap, bar),
+ 			      0);
+ 	write_pci_config(0, 1, 0,
+ 			 cfg_offset + offsetof(struct virtio_pci_cap, length),
+ 			 4);
+ 	write_pci_config(0, 1, 0,
+ 			 cfg_offset + offsetof(struct virtio_pci_cap, offset),
+ 			 off);
+ }
+
+ static void write_bar_via_cfg(u32 cfg_offset, u32 off, u32 val)
+ {
+ 	/*
+ 	 * We could set this up once, then leave it; nothing else in the
+ 	 * kernel should touch these registers.  But if it went wrong, that
+ 	 * would be a horrible bug to find.
+ 	 */
+ 	set_cfg_window(cfg_offset, off);
+ 	write_pci_config(0, 1, 0,
+ 			 cfg_offset + sizeof(struct virtio_pci_cap), val);
+ }
+
+ static void probe_pci_console(void)
+ {
+ 	u8 cap, common_cap = 0, device_cap = 0;
+ 	/* Offset within BAR0 */
+ 	u32 device_offset;
+ 	u32 device_len;
+
+ 	/* Avoid recursive printk into here. */
+ 	console_cfg_offset = -1;
+
+ 	if (!early_pci_allowed()) {
+ 		printk(KERN_ERR "lguest: early PCI access not allowed!\n");
+ 		return;
+ 	}
+
+ 	/* We expect a console PCI device at BUS0, slot 1. */
+ 	if (read_pci_config(0, 1, 0, 0) != 0x10431AF4) {
+ 		printk(KERN_ERR "lguest: PCI device is %#x!\n",
+ 		       read_pci_config(0, 1, 0, 0));
+ 		return;
+ 	}
+
+ 	/* Find the capabilities we need (must be in bar0) */
+ 	cap = read_pci_config_byte(0, 1, 0, PCI_CAPABILITY_LIST);
+ 	while (cap) {
+ 		u8 vndr = read_pci_config_byte(0, 1, 0, cap);
+ 		if (vndr == PCI_CAP_ID_VNDR) {
+ 			u8 type, bar;
+ 			u32 offset, length;
+
+ 			type = read_pci_config_byte(0, 1, 0,
+ 				cap + offsetof(struct virtio_pci_cap, cfg_type));
+ 			bar = read_pci_config_byte(0, 1, 0,
+ 				cap + offsetof(struct virtio_pci_cap, bar));
+ 			offset = read_pci_config(0, 1, 0,
+ 				cap + offsetof(struct virtio_pci_cap, offset));
+ 			length = read_pci_config(0, 1, 0,
+ 				cap + offsetof(struct virtio_pci_cap, length));
+
+ 			switch (type) {
+ 			case VIRTIO_PCI_CAP_DEVICE_CFG:
+ 				if (bar == 0) {
+ 					device_cap = cap;
+ 					device_offset = offset;
+ 					device_len = length;
+ 				}
+ 				break;
+ 			case VIRTIO_PCI_CAP_PCI_CFG:
+ 				console_access_cap = cap;
+ 				break;
+ 			}
+ 		}
+ 		cap = read_pci_config_byte(0, 1, 0, cap + PCI_CAP_LIST_NEXT);
+ 	}
+ 	if (!device_cap || !console_access_cap) {
+ 		printk(KERN_ERR "lguest: No caps (%u/%u/%u) in console!\n",
+ 		       common_cap, device_cap, console_access_cap);
+ 		return;
+ 	}
+
+ 	/*
+ 	 * Note that we can't check features, until we've set the DRIVER
+ 	 * status bit.  We don't want to do that until we have a real driver,
+ 	 * so we just check that the device-specific config has room for
+ 	 * emerg_wr.  If it doesn't support VIRTIO_CONSOLE_F_EMERG_WRITE
+ 	 * it should ignore the access.
+ 	 */
+ 	if (device_len < (offsetof(struct virtio_console_config, emerg_wr)
+ 			  + sizeof(u32))) {
+ 		printk(KERN_ERR "lguest: console missing emerg_wr field\n");
+ 		return;
+ 	}
+
+ 	console_cfg_offset = device_offset;
+ 	printk(KERN_INFO "lguest: Console via virtio-pci emerg_wr\n");
+ }
+
  /*
   * We will eventually use the virtio console device to produce console output,
- * but before that is set up we use LHCALL_NOTIFY on normal memory to produce
- * console output.
+ * but before that is set up we use the virtio PCI console's backdoor mmio
+ * access and the "emergency" write facility (which is legal even before the
+ * device is configured).
  */
  static __init int early_put_chars(u32 vtermno, const char *buf, int count)
  {
- 	char scratch[17];
- 	unsigned int len = count;
+ 	/* If we couldn't find PCI console, forget it. */
+ 	if (console_cfg_offset < 0)
+ 		return count;

- 	/* We use a nul-terminated string, so we make a copy.  Icky, huh? */
- 	if (len > sizeof(scratch) - 1)
- 		len = sizeof(scratch) - 1;
- 	scratch[len] = '\0';
- 	memcpy(scratch, buf, len);
- 	hcall(LHCALL_NOTIFY, __pa(scratch), 0, 0, 0);
+ 	if (unlikely(!console_cfg_offset)) {
+ 		probe_pci_console();
+ 		if (console_cfg_offset < 0)
+ 			return count;
+ 	}

- 	/* This routine returns the number of bytes actually written. */
- 	return len;
+ 	write_bar_via_cfg(console_access_cap,
+ 			  console_cfg_offset
+ 			  + offsetof(struct virtio_console_config, emerg_wr),
+ 			  buf[0]);
+ 	return 1;
  }

  /*
···
  	atomic_notifier_chain_register(&panic_notifier_list, &paniced);

  	/*
- 	 * The IDE code spends about 3 seconds probing for disks: if we reserve
- 	 * all the I/O ports up front it can't get them and so doesn't probe.
- 	 * Other device drivers are similar (but less severe).  This cuts the
- 	 * kernel boot time on my machine from 4.1 seconds to 0.45 seconds.
- 	 */
- 	paravirt_disable_iospace();
-
- 	/*
  	 * This is messy CPU setup stuff which the native boot code does before
  	 * start_kernel, so we have to do, too:
  	 */
···
  	/* Register our very early console. */
  	virtio_cons_early_init(early_put_chars);
+
+ 	/* Don't let ACPI try to control our PCI interrupts. */
+ 	disable_acpi();
+
+ 	/* We control them ourselves, by overriding these two hooks. */
+ 	pcibios_enable_irq = lguest_enable_irq;
+ 	pcibios_disable_irq = lguest_disable_irq;

  	/*
  	 * Last of all, we set the power management poweroff hook to point to
+8 -4
drivers/block/virtio_blk.c
···
  	char name[VQ_NAME_LEN];
  } ____cacheline_aligned_in_smp;

- struct virtio_blk
- {
+ struct virtio_blk {
  	struct virtio_device *vdev;

  	/* The disk structure for the kernel. */
···
  	struct virtio_blk_vq *vqs;
  };

- struct virtblk_req
- {
+ struct virtblk_req {
  	struct request *req;
  	struct virtio_blk_outhdr out_hdr;
  	struct virtio_scsi_inhdr in_hdr;
···
  	u32 v, blk_size, sg_elems, opt_io_size;
  	u16 min_io_size;
  	u8 physical_block_exp, alignment_offset;
+
+ 	if (!vdev->config->get) {
+ 		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+ 			__func__);
+ 		return -EINVAL;
+ 	}

  	err = ida_simple_get(&vd_index_ida, 0, minor_to_index(1 << MINORBITS),
  			     GFP_KERNEL);
+4 -1
drivers/char/virtio_console.c
···
  	bool multiport;
  	bool early = early_put_chars != NULL;

- 	if (!vdev->config->get) {
+ 	/* We only need a config space if features are offered */
+ 	if (!vdev->config->get &&
+ 	    (virtio_has_feature(vdev, VIRTIO_CONSOLE_F_SIZE)
+ 	     || virtio_has_feature(vdev, VIRTIO_CONSOLE_F_MULTIPORT))) {
  		dev_err(&vdev->dev, "%s failure: config access disabled\n",
  			__func__);
  		return -EINVAL;
-3
drivers/lguest/Makefile
···
- # Guest requires the device configuration and probing code.
- obj-$(CONFIG_LGUEST_GUEST) += lguest_device.o
-
  # Host requires the other files, which can be a module.
  obj-$(CONFIG_LGUEST)	+= lg.o
  lg-y = core.o hypercalls.o page_tables.o interrupts_and_traps.o \
+14 -15
drivers/lguest/core.c
···
   */
  int run_guest(struct lg_cpu *cpu, unsigned long __user *user)
  {
+ 	/* If the launcher asked for a register with LHREQ_GETREG */
+ 	if (cpu->reg_read) {
+ 		if (put_user(*cpu->reg_read, user))
+ 			return -EFAULT;
+ 		cpu->reg_read = NULL;
+ 		return sizeof(*cpu->reg_read);
+ 	}
+
  	/* We stop running once the Guest is dead. */
  	while (!cpu->lg->dead) {
  		unsigned int irq;
···
  		if (cpu->hcall)
  			do_hypercalls(cpu);

- 		/*
- 		 * It's possible the Guest did a NOTIFY hypercall to the
- 		 * Launcher.
- 		 */
- 		if (cpu->pending_notify) {
- 			/*
- 			 * Does it just needs to write to a registered
- 			 * eventfd (ie. the appropriate virtqueue thread)?
- 			 */
- 			if (!send_notify_to_eventfd(cpu)) {
- 				/* OK, we tell the main Launcher. */
- 				if (put_user(cpu->pending_notify, user))
- 					return -EFAULT;
- 				return sizeof(cpu->pending_notify);
- 			}
+ 		/* Do we have to tell the Launcher about a trap? */
+ 		if (cpu->pending.trap) {
+ 			if (copy_to_user(user, &cpu->pending,
+ 					 sizeof(cpu->pending)))
+ 				return -EFAULT;
+ 			return sizeof(cpu->pending);
  		}

  		/*
+2 -5
drivers/lguest/hypercalls.c
···
  		/* Similarly, this sets the halted flag for run_guest(). */
  		cpu->halted = 1;
  		break;
- 	case LHCALL_NOTIFY:
- 		cpu->pending_notify = args->arg1;
- 		break;
  	default:
  		/* It should be an architecture-specific hypercall. */
  		if (lguest_arch_do_hcall(cpu, args))
···
  		 * Stop doing hypercalls if they want to notify the Launcher:
  		 * it needs to service this first.
  		 */
- 		if (cpu->pending_notify)
+ 		if (cpu->pending.trap)
  			break;
  	}
  }
···
  	 * NOTIFY to the Launcher, we want to return now.  Otherwise we do
  	 * the hypercall.
  	 */
- 	if (!cpu->pending_notify) {
+ 	if (!cpu->pending.trap) {
  		do_hcall(cpu, cpu->hcall);
  		/*
  		 * Tricky point: we reset the hcall pointer to mark the
+12 -14
drivers/lguest/lg.h
···
  	/* Bitmap of what has changed: see CHANGED_* above. */
  	int changed;

- 	unsigned long pending_notify; /* pfn from LHCALL_NOTIFY */
+ 	/* Pending operation. */
+ 	struct lguest_pending pending;
+
+ 	unsigned long *reg_read; /* register from LHREQ_GETREG */

  	/* At end of a page shared mapped over lguest_pages in guest. */
  	unsigned long regs_page;
···
  	struct lg_cpu_arch arch;
  };

- struct lg_eventfd {
- 	unsigned long addr;
- 	struct eventfd_ctx *event;
- };
-
- struct lg_eventfd_map {
- 	unsigned int num;
- 	struct lg_eventfd map[];
- };
-
  /* The private info the thread maintains about the guest. */
  struct lguest {
  	struct lguest_data __user *lguest_data;
  	struct lg_cpu cpus[NR_CPUS];
  	unsigned int nr_cpus;

+ 	/* Valid guest memory pages must be < this. */
  	u32 pfn_limit;
+
+ 	/* Device memory is >= pfn_limit and < device_limit. */
+ 	u32 device_limit;

  	/*
  	 * This provides the offset to the base of guest-physical memory in the
···
  	unsigned int stack_pages;
  	u32 tsc_khz;
-
- 	struct lg_eventfd_map *eventfds;

  	/* Dead? */
  	const char *dead;
···
  void guest_set_pte(struct lg_cpu *cpu, unsigned long gpgdir,
  		   unsigned long vaddr, pte_t val);
  void map_switcher_in_guest(struct lg_cpu *cpu, struct lguest_pages *pages);
- bool demand_page(struct lg_cpu *cpu, unsigned long cr2, int errcode);
+ bool demand_page(struct lg_cpu *cpu, unsigned long cr2, int errcode,
+ 		 unsigned long *iomem);
  void pin_page(struct lg_cpu *cpu, unsigned long vaddr);
+ bool __guest_pa(struct lg_cpu *cpu, unsigned long vaddr, unsigned long *paddr);
  unsigned long guest_pa(struct lg_cpu *cpu, unsigned long vaddr);
  void page_table_guest_data_init(struct lg_cpu *cpu);
···
  int lguest_arch_init_hypercalls(struct lg_cpu *cpu);
  int lguest_arch_do_hcall(struct lg_cpu *cpu, struct hcall_args *args);
  void lguest_arch_setup_regs(struct lg_cpu *cpu, unsigned long start);
+ unsigned long *lguest_arch_regptr(struct lg_cpu *cpu, size_t reg_off, bool any);

  /* <arch>/switcher.S: */
  extern char start_switcher_text[], end_switcher_text[], switch_to_guest[];
-540
drivers/lguest/lguest_device.c
···
- /*P:050
-  * Lguest guests use a very simple method to describe devices.  It's a
-  * series of device descriptors contained just above the top of normal Guest
-  * memory.
-  *
-  * We use the standard "virtio" device infrastructure, which provides us with a
-  * console, a network and a block driver.  Each one expects some configuration
-  * information and a "virtqueue" or two to send and receive data.
- :*/
- #include <linux/init.h>
- #include <linux/bootmem.h>
- #include <linux/lguest_launcher.h>
- #include <linux/virtio.h>
- #include <linux/virtio_config.h>
- #include <linux/interrupt.h>
- #include <linux/virtio_ring.h>
- #include <linux/err.h>
- #include <linux/export.h>
- #include <linux/slab.h>
- #include <asm/io.h>
- #include <asm/paravirt.h>
- #include <asm/lguest_hcall.h>
-
- /* The pointer to our (page) of device descriptions. */
- static void *lguest_devices;
-
- /*
-  * For Guests, device memory can be used as normal memory, so we cast away the
-  * __iomem to quieten sparse.
-  */
- static inline void *lguest_map(unsigned long phys_addr, unsigned long pages)
- {
- 	return (__force void *)ioremap_cache(phys_addr, PAGE_SIZE*pages);
- }
-
- static inline void lguest_unmap(void *addr)
- {
- 	iounmap((__force void __iomem *)addr);
- }
-
- /*D:100
-  * Each lguest device is just a virtio device plus a pointer to its entry
-  * in the lguest_devices page.
-  */
- struct lguest_device {
- 	struct virtio_device vdev;
-
- 	/* The entry in the lguest_devices page for this device. */
- 	struct lguest_device_desc *desc;
- };
-
- /*
-  * Since the virtio infrastructure hands us a pointer to the virtio_device all
-  * the time, it helps to have a curt macro to get a pointer to the struct
-  * lguest_device it's enclosed in.
-  */
- #define to_lgdev(vd) container_of(vd, struct lguest_device, vdev)
-
- /*D:130
-  * Device configurations
-  *
-  * The configuration information for a device consists of one or more
-  * virtqueues, a feature bitmap, and some configuration bytes.  The
-  * configuration bytes don't really matter to us: the Launcher sets them up, and
-  * the driver will look at them during setup.
-  *
-  * A convenient routine to return the device's virtqueue config array:
-  * immediately after the descriptor.
-  */
- static struct lguest_vqconfig *lg_vq(const struct lguest_device_desc *desc)
- {
- 	return (void *)(desc + 1);
- }
-
- /* The features come immediately after the virtqueues. */
- static u8 *lg_features(const struct lguest_device_desc *desc)
- {
- 	return (void *)(lg_vq(desc) + desc->num_vq);
- }
-
- /* The config space comes after the two feature bitmasks. */
- static u8 *lg_config(const struct lguest_device_desc *desc)
- {
- 	return lg_features(desc) + desc->feature_len * 2;
- }
-
- /* The total size of the config page used by this device (incl. desc) */
- static unsigned desc_size(const struct lguest_device_desc *desc)
- {
- 	return sizeof(*desc)
- 		+ desc->num_vq * sizeof(struct lguest_vqconfig)
- 		+ desc->feature_len * 2
- 		+ desc->config_len;
- }
-
- /* This gets the device's feature bits. */
- static u64 lg_get_features(struct virtio_device *vdev)
- {
- 	unsigned int i;
- 	u32 features = 0;
- 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
- 	u8 *in_features = lg_features(desc);
-
- 	/* We do this the slow but generic way. */
- 	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
- 		if (in_features[i / 8] & (1 << (i % 8)))
- 			features |= (1 << i);
-
- 	return features;
- }
-
- /*
-  * To notify on reset or feature finalization, we (ab)use the NOTIFY
-  * hypercall, with the descriptor address of the device.
-  */
- static void status_notify(struct virtio_device *vdev)
- {
- 	unsigned long offset = (void *)to_lgdev(vdev)->desc - lguest_devices;
-
- 	hcall(LHCALL_NOTIFY, (max_pfn << PAGE_SHIFT) + offset, 0, 0, 0);
- }
-
- /*
-  * The virtio core takes the features the Host offers, and copies the ones
-  * supported by the driver into the vdev->features array.  Once that's all
-  * sorted out, this routine is called so we can tell the Host which features we
-  * understand and accept.
-  */
- static int lg_finalize_features(struct virtio_device *vdev)
- {
- 	unsigned int i, bits;
- 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
- 	/* Second half of bitmap is features we accept. */
- 	u8 *out_features = lg_features(desc) + desc->feature_len;
-
- 	/* Give virtio_ring a chance to accept features. */
- 	vring_transport_features(vdev);
-
- 	/* Make sure we don't have any features > 32 bits! */
- 	BUG_ON((u32)vdev->features != vdev->features);
-
- 	/*
- 	 * Since lguest is currently x86-only, we're little-endian.  That
- 	 * means we could just memcpy.  But it's not time critical, and in
- 	 * case someone copies this code, we do it the slow, obvious way.
- 	 */
- 	memset(out_features, 0, desc->feature_len);
- 	bits = min_t(unsigned, desc->feature_len, sizeof(vdev->features)) * 8;
- 	for (i = 0; i < bits; i++) {
- 		if (__virtio_test_bit(vdev, i))
- 			out_features[i / 8] |= (1 << (i % 8));
- 	}
-
- 	/* Tell Host we've finished with this device's feature negotiation */
- 	status_notify(vdev);
-
- 	return 0;
- }
-
- /* Once they've found a field, getting a copy of it is easy. */
- static void lg_get(struct virtio_device *vdev, unsigned int offset,
- 		   void *buf, unsigned len)
- {
- 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
-
- 	/* Check they didn't ask for more than the length of the config! */
- 	BUG_ON(offset + len > desc->config_len);
- 	memcpy(buf, lg_config(desc) + offset, len);
- }
-
- /* Setting the contents is also trivial. */
- static void lg_set(struct virtio_device *vdev, unsigned int offset,
- 		   const void *buf, unsigned len)
- {
- 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
-
- 	/* Check they didn't ask for more than the length of the config! */
- 	BUG_ON(offset + len > desc->config_len);
- 	memcpy(lg_config(desc) + offset, buf, len);
- }
-
- /*
-  * The operations to get and set the status word just access the status field
-  * of the device descriptor.
-  */
- static u8 lg_get_status(struct virtio_device *vdev)
- {
- 	return to_lgdev(vdev)->desc->status;
- }
-
- static void lg_set_status(struct virtio_device *vdev, u8 status)
- {
- 	BUG_ON(!status);
- 	to_lgdev(vdev)->desc->status = status;
-
- 	/* Tell Host immediately if we failed.
*/ 197 - if (status & VIRTIO_CONFIG_S_FAILED) 198 - status_notify(vdev); 199 - } 200 - 201 - static void lg_reset(struct virtio_device *vdev) 202 - { 203 - /* 0 status means "reset" */ 204 - to_lgdev(vdev)->desc->status = 0; 205 - status_notify(vdev); 206 - } 207 - 208 - /* 209 - * Virtqueues 210 - * 211 - * The other piece of infrastructure virtio needs is a "virtqueue": a way of 212 - * the Guest device registering buffers for the other side to read from or 213 - * write into (ie. send and receive buffers). Each device can have multiple 214 - * virtqueues: for example the console driver uses one queue for sending and 215 - * another for receiving. 216 - * 217 - * Fortunately for us, a very fast shared-memory-plus-descriptors virtqueue 218 - * already exists in virtio_ring.c. We just need to connect it up. 219 - * 220 - * We start with the information we need to keep about each virtqueue. 221 - */ 222 - 223 - /*D:140 This is the information we remember about each virtqueue. */ 224 - struct lguest_vq_info { 225 - /* A copy of the information contained in the device config. */ 226 - struct lguest_vqconfig config; 227 - 228 - /* The address where we mapped the virtio ring, so we can unmap it. */ 229 - void *pages; 230 - }; 231 - 232 - /* 233 - * When the virtio_ring code wants to prod the Host, it calls us here and we 234 - * make a hypercall. We hand the physical address of the virtqueue so the Host 235 - * knows which virtqueue we're talking about. 236 - */ 237 - static bool lg_notify(struct virtqueue *vq) 238 - { 239 - /* 240 - * We store our virtqueue information in the "priv" pointer of the 241 - * virtqueue structure. 242 - */ 243 - struct lguest_vq_info *lvq = vq->priv; 244 - 245 - hcall(LHCALL_NOTIFY, lvq->config.pfn << PAGE_SHIFT, 0, 0, 0); 246 - return true; 247 - } 248 - 249 - /* An extern declaration inside a C file is bad form. Don't do it. 
*/ 250 - extern int lguest_setup_irq(unsigned int irq); 251 - 252 - /* 253 - * This routine finds the Nth virtqueue described in the configuration of 254 - * this device and sets it up. 255 - * 256 - * This is kind of an ugly duckling. It'd be nicer to have a standard 257 - * representation of a virtqueue in the configuration space, but it seems that 258 - * everyone wants to do it differently. The KVM coders want the Guest to 259 - * allocate its own pages and tell the Host where they are, but for lguest it's 260 - * simpler for the Host to simply tell us where the pages are. 261 - */ 262 - static struct virtqueue *lg_find_vq(struct virtio_device *vdev, 263 - unsigned index, 264 - void (*callback)(struct virtqueue *vq), 265 - const char *name) 266 - { 267 - struct lguest_device *ldev = to_lgdev(vdev); 268 - struct lguest_vq_info *lvq; 269 - struct virtqueue *vq; 270 - int err; 271 - 272 - if (!name) 273 - return NULL; 274 - 275 - /* We must have this many virtqueues. */ 276 - if (index >= ldev->desc->num_vq) 277 - return ERR_PTR(-ENOENT); 278 - 279 - lvq = kmalloc(sizeof(*lvq), GFP_KERNEL); 280 - if (!lvq) 281 - return ERR_PTR(-ENOMEM); 282 - 283 - /* 284 - * Make a copy of the "struct lguest_vqconfig" entry, which sits after 285 - * the descriptor. We need a copy because the config space might not 286 - * be aligned correctly. 
287 - */ 288 - memcpy(&lvq->config, lg_vq(ldev->desc)+index, sizeof(lvq->config)); 289 - 290 - printk("Mapping virtqueue %i addr %lx\n", index, 291 - (unsigned long)lvq->config.pfn << PAGE_SHIFT); 292 - /* Figure out how many pages the ring will take, and map that memory */ 293 - lvq->pages = lguest_map((unsigned long)lvq->config.pfn << PAGE_SHIFT, 294 - DIV_ROUND_UP(vring_size(lvq->config.num, 295 - LGUEST_VRING_ALIGN), 296 - PAGE_SIZE)); 297 - if (!lvq->pages) { 298 - err = -ENOMEM; 299 - goto free_lvq; 300 - } 301 - 302 - /* 303 - * OK, tell virtio_ring.c to set up a virtqueue now we know its size 304 - * and we've got a pointer to its pages. Note that we set weak_barriers 305 - * to 'true': the host is just a(nother) SMP CPU, so we only need inter-cpu 306 - * barriers. 307 - */ 308 - vq = vring_new_virtqueue(index, lvq->config.num, LGUEST_VRING_ALIGN, vdev, 309 - true, lvq->pages, lg_notify, callback, name); 310 - if (!vq) { 311 - err = -ENOMEM; 312 - goto unmap; 313 - } 314 - 315 - /* Make sure the interrupt is allocated. */ 316 - err = lguest_setup_irq(lvq->config.irq); 317 - if (err) 318 - goto destroy_vring; 319 - 320 - /* 321 - * Tell the interrupt for this virtqueue to go to the virtio_ring 322 - * interrupt handler. 323 - * 324 - * FIXME: We used to have a flag for the Host to tell us we could use 325 - * the interrupt as a source of randomness: it'd be nice to have that 326 - * back. 327 - */ 328 - err = request_irq(lvq->config.irq, vring_interrupt, IRQF_SHARED, 329 - dev_name(&vdev->dev), vq); 330 - if (err) 331 - goto free_desc; 332 - 333 - /* 334 - * Last of all we hook up our "struct lguest_vq_info" to the 335 - * virtqueue's priv pointer.
336 - */ 337 - vq->priv = lvq; 338 - return vq; 339 - 340 - free_desc: 341 - irq_free_desc(lvq->config.irq); 342 - destroy_vring: 343 - vring_del_virtqueue(vq); 344 - unmap: 345 - lguest_unmap(lvq->pages); 346 - free_lvq: 347 - kfree(lvq); 348 - return ERR_PTR(err); 349 - } 350 - /*:*/ 351 - 352 - /* Cleaning up a virtqueue is easy */ 353 - static void lg_del_vq(struct virtqueue *vq) 354 - { 355 - struct lguest_vq_info *lvq = vq->priv; 356 - 357 - /* Release the interrupt */ 358 - free_irq(lvq->config.irq, vq); 359 - /* Tell virtio_ring.c to free the virtqueue. */ 360 - vring_del_virtqueue(vq); 361 - /* Unmap the pages containing the ring. */ 362 - lguest_unmap(lvq->pages); 363 - /* Free our own queue information. */ 364 - kfree(lvq); 365 - } 366 - 367 - static void lg_del_vqs(struct virtio_device *vdev) 368 - { 369 - struct virtqueue *vq, *n; 370 - 371 - list_for_each_entry_safe(vq, n, &vdev->vqs, list) 372 - lg_del_vq(vq); 373 - } 374 - 375 - static int lg_find_vqs(struct virtio_device *vdev, unsigned nvqs, 376 - struct virtqueue *vqs[], 377 - vq_callback_t *callbacks[], 378 - const char *names[]) 379 - { 380 - struct lguest_device *ldev = to_lgdev(vdev); 381 - int i; 382 - 383 - /* We must have this many virtqueues. */ 384 - if (nvqs > ldev->desc->num_vq) 385 - return -ENOENT; 386 - 387 - for (i = 0; i < nvqs; ++i) { 388 - vqs[i] = lg_find_vq(vdev, i, callbacks[i], names[i]); 389 - if (IS_ERR(vqs[i])) 390 - goto error; 391 - } 392 - return 0; 393 - 394 - error: 395 - lg_del_vqs(vdev); 396 - return PTR_ERR(vqs[i]); 397 - } 398 - 399 - static const char *lg_bus_name(struct virtio_device *vdev) 400 - { 401 - return ""; 402 - } 403 - 404 - /* The ops structure which hooks everything together. 
*/ 405 - static const struct virtio_config_ops lguest_config_ops = { 406 - .get_features = lg_get_features, 407 - .finalize_features = lg_finalize_features, 408 - .get = lg_get, 409 - .set = lg_set, 410 - .get_status = lg_get_status, 411 - .set_status = lg_set_status, 412 - .reset = lg_reset, 413 - .find_vqs = lg_find_vqs, 414 - .del_vqs = lg_del_vqs, 415 - .bus_name = lg_bus_name, 416 - }; 417 - 418 - /* 419 - * The root device for the lguest virtio devices. This makes them appear as 420 - * /sys/devices/lguest/0,1,2 not /sys/devices/0,1,2. 421 - */ 422 - static struct device *lguest_root; 423 - 424 - /*D:120 425 - * This is the core of the lguest bus: actually adding a new device. 426 - * It's a separate function because it's neater that way, and because an 427 - * earlier version of the code supported hotplug and unplug. They were removed 428 - * early on because they were never used. 429 - * 430 - * As Andrew Tridgell says, "Untested code is buggy code". 431 - * 432 - * It's worth reading this carefully: we start with a pointer to the new device 433 - * descriptor in the "lguest_devices" page, and the offset into the device 434 - * descriptor page so we can uniquely identify it if things go badly wrong. 435 - */ 436 - static void add_lguest_device(struct lguest_device_desc *d, 437 - unsigned int offset) 438 - { 439 - struct lguest_device *ldev; 440 - 441 - /* Start with zeroed memory; Linux's device layer counts on it. */ 442 - ldev = kzalloc(sizeof(*ldev), GFP_KERNEL); 443 - if (!ldev) { 444 - printk(KERN_EMERG "Cannot allocate lguest dev %u type %u\n", 445 - offset, d->type); 446 - return; 447 - } 448 - 449 - /* This device's parent is the lguest/ dir. */ 450 - ldev->vdev.dev.parent = lguest_root; 451 - /* 452 - * The device type comes straight from the descriptor. There's also a 453 - * device vendor field in the virtio_device struct, which we leave as 454 - * 0.
455 - */ 456 - ldev->vdev.id.device = d->type; 457 - /* 458 - * We have a simple set of routines for querying the device's 459 - * configuration information and setting its status. 460 - */ 461 - ldev->vdev.config = &lguest_config_ops; 462 - /* And we remember the device's descriptor for lguest_config_ops. */ 463 - ldev->desc = d; 464 - 465 - /* 466 - * register_virtio_device() sets up the generic fields for the struct 467 - * virtio_device and calls device_register(). This makes the bus 468 - * infrastructure look for a matching driver. 469 - */ 470 - if (register_virtio_device(&ldev->vdev) != 0) { 471 - printk(KERN_ERR "Failed to register lguest dev %u type %u\n", 472 - offset, d->type); 473 - kfree(ldev); 474 - } 475 - } 476 - 477 - /*D:110 478 - * scan_devices() simply iterates through the device page. The type 0 is 479 - * reserved to mean "end of devices". 480 - */ 481 - static void scan_devices(void) 482 - { 483 - unsigned int i; 484 - struct lguest_device_desc *d; 485 - 486 - /* We start at the page beginning, and skip over each entry. */ 487 - for (i = 0; i < PAGE_SIZE; i += desc_size(d)) { 488 - d = lguest_devices + i; 489 - 490 - /* Once we hit a zero, stop. */ 491 - if (d->type == 0) 492 - break; 493 - 494 - printk("Device at %i has size %u\n", i, desc_size(d)); 495 - add_lguest_device(d, i); 496 - } 497 - } 498 - 499 - /*D:105 500 - * Fairly early in boot, lguest_devices_init() is called to set up the 501 - * lguest device infrastructure. We check that we are a Guest by checking 502 - * pv_info.name: there are other ways of checking, but this seems most 503 - * obvious to me. 504 - * 505 - * So we can access the "struct lguest_device_desc"s easily, we map that memory 506 - * and store the pointer in the global "lguest_devices". Then we register a 507 - * root device from which all our devices will hang (this seems to be the 508 - * correct sysfs incantation). 
509 - * 510 - * Finally we call scan_devices() which adds all the devices found in the 511 - * lguest_devices page. 512 - */ 513 - static int __init lguest_devices_init(void) 514 - { 515 - if (strcmp(pv_info.name, "lguest") != 0) 516 - return 0; 517 - 518 - lguest_root = root_device_register("lguest"); 519 - if (IS_ERR(lguest_root)) 520 - panic("Could not register lguest root"); 521 - 522 - /* Devices are in a single page above top of "normal" mem */ 523 - lguest_devices = lguest_map(max_pfn<<PAGE_SHIFT, 1); 524 - 525 - scan_devices(); 526 - return 0; 527 - } 528 - /* We do this after core stuff, but before the drivers. */ 529 - postcore_initcall(lguest_devices_init); 530 - 531 - /*D:150 532 - * At this point in the journey we used to now wade through the lguest 533 - * devices themselves: net, block and console. Since they're all now virtio 534 - * devices rather than lguest-specific, I've decided to ignore them. Mostly, 535 - * they're kind of boring. But this does mean you'll never experience the 536 - * thrill of reading the forbidden love scene buried deep in the block driver. 537 - * 538 - * "make Launcher" beckons, where we answer questions like "Where do Guests 539 - * come from?", and "What do you do when someone asks for optimization?". 540 - */
+66 -171
drivers/lguest/lguest_user.c
··· 2 2 * launcher controls and communicates with the Guest. For example, 3 3 * the first write will tell us the Guest's memory layout and entry 4 4 * point. A read will run the Guest until something happens, such as 5 - * a signal or the Guest doing a NOTIFY out to the Launcher. There is 6 - * also a way for the Launcher to attach eventfds to particular NOTIFY 7 - * values instead of returning from the read() call. 5 + * a signal or the Guest accessing a device. 8 6 :*/ 9 7 #include <linux/uaccess.h> 10 8 #include <linux/miscdevice.h> 11 9 #include <linux/fs.h> 12 10 #include <linux/sched.h> 13 - #include <linux/eventfd.h> 14 11 #include <linux/file.h> 15 12 #include <linux/slab.h> 16 13 #include <linux/export.h> 17 14 #include "lg.h" 18 15 19 - /*L:056 20 - * Before we move on, let's jump ahead and look at what the kernel does when 21 - * it needs to look up the eventfds. That will complete our picture of how we 22 - * use RCU. 23 - * 24 - * The notification value is in cpu->pending_notify: we return true if it went 25 - * to an eventfd. 26 - */ 27 - bool send_notify_to_eventfd(struct lg_cpu *cpu) 28 - { 29 - unsigned int i; 30 - struct lg_eventfd_map *map; 31 - 32 - /* 33 - * This "rcu_read_lock()" helps track when someone is still looking at 34 - * the (RCU-using) eventfds array. It's not actually a lock at all; 35 - * indeed it's a noop in many configurations. (You didn't expect me to 36 - * explain all the RCU secrets here, did you?) 37 - */ 38 - rcu_read_lock(); 39 - /* 40 - * rcu_dereference is the counter-side of rcu_assign_pointer(); it 41 - * makes sure we don't access the memory pointed to by 42 - * cpu->lg->eventfds before cpu->lg->eventfds is set. Sounds crazy, 43 - * but Alpha allows this! 
Paul McKenney points out that a really 44 - * aggressive compiler could have the same effect: 45 - * http://lists.ozlabs.org/pipermail/lguest/2009-July/001560.html 46 - * 47 - * So play safe, use rcu_dereference to get the rcu-protected pointer: 48 - */ 49 - map = rcu_dereference(cpu->lg->eventfds); 50 - /* 51 - * Simple array search: even if they add an eventfd while we do this, 52 - * we'll continue to use the old array and just won't see the new one. 53 - */ 54 - for (i = 0; i < map->num; i++) { 55 - if (map->map[i].addr == cpu->pending_notify) { 56 - eventfd_signal(map->map[i].event, 1); 57 - cpu->pending_notify = 0; 58 - break; 59 - } 60 - } 61 - /* We're done with the rcu-protected variable cpu->lg->eventfds. */ 62 - rcu_read_unlock(); 63 - 64 - /* If we cleared the notification, it's because we found a match. */ 65 - return cpu->pending_notify == 0; 66 - } 67 - 68 - /*L:055 69 - * One of the more tricksy tricks in the Linux Kernel is a technique called 70 - * Read Copy Update. Since one point of lguest is to teach lguest journeyers 71 - * about kernel coding, I use it here. (In case you're curious, other purposes 72 - * include learning about virtualization and instilling a deep appreciation for 73 - * simplicity and puppies). 74 - * 75 - * We keep a simple array which maps LHCALL_NOTIFY values to eventfds, but we 76 - * add new eventfds without ever blocking readers from accessing the array. 77 - * The current Launcher only does this during boot, so that never happens. But 78 - * Read Copy Update is cool, and adding a lock risks damaging even more puppies 79 - * than this code does. 80 - * 81 - * We allocate a brand new one-larger array, copy the old one and add our new 82 - * element. Then we make the lg eventfd pointer point to the new array. 83 - * That's the easy part: now we need to free the old one, but we need to make 84 - * sure no slow CPU somewhere is still looking at it. 
That's what 85 - * synchronize_rcu does for us: waits until every CPU has indicated that it has 86 - * moved on to know it's no longer using the old one. 87 - * 88 - * If that's unclear, see http://en.wikipedia.org/wiki/Read-copy-update. 89 - */ 90 - static int add_eventfd(struct lguest *lg, unsigned long addr, int fd) 91 - { 92 - struct lg_eventfd_map *new, *old = lg->eventfds; 93 - 94 - /* 95 - * We don't allow notifications on value 0 anyway (pending_notify of 96 - * 0 means "nothing pending"). 97 - */ 98 - if (!addr) 99 - return -EINVAL; 100 - 101 - /* 102 - * Replace the old array with the new one, carefully: others can 103 - * be accessing it at the same time. 104 - */ 105 - new = kmalloc(sizeof(*new) + sizeof(new->map[0]) * (old->num + 1), 106 - GFP_KERNEL); 107 - if (!new) 108 - return -ENOMEM; 109 - 110 - /* First make identical copy. */ 111 - memcpy(new->map, old->map, sizeof(old->map[0]) * old->num); 112 - new->num = old->num; 113 - 114 - /* Now append new entry. */ 115 - new->map[new->num].addr = addr; 116 - new->map[new->num].event = eventfd_ctx_fdget(fd); 117 - if (IS_ERR(new->map[new->num].event)) { 118 - int err = PTR_ERR(new->map[new->num].event); 119 - kfree(new); 120 - return err; 121 - } 122 - new->num++; 123 - 124 - /* 125 - * Now put new one in place: rcu_assign_pointer() is a fancy way of 126 - * doing "lg->eventfds = new", but it uses memory barriers to make 127 - * absolutely sure that the contents of "new" written above is nailed 128 - * down before we actually do the assignment. 129 - * 130 - * We have to think about these kinds of things when we're operating on 131 - * live data without locks. 132 - */ 133 - rcu_assign_pointer(lg->eventfds, new); 134 - 135 - /* 136 - * We're not in a big hurry. Wait until no one's looking at old 137 - * version, then free it. 
138 - */ 139 - synchronize_rcu(); 140 - kfree(old); 141 - 142 - return 0; 143 - } 144 - 145 16 /*L:052 146 - * Receiving notifications from the Guest is usually done by attaching a 147 - * particular LHCALL_NOTIFY value to an event filedescriptor. The eventfd will 148 - * become readable when the Guest does an LHCALL_NOTIFY with that value. 149 - * 150 - * This is really convenient for processing each virtqueue in a separate 151 - * thread. 152 - */ 153 - static int attach_eventfd(struct lguest *lg, const unsigned long __user *input) 17 + The Launcher can get the registers, and also set some of them. 18 + */ 19 + static int getreg_setup(struct lg_cpu *cpu, const unsigned long __user *input) 154 20 { 155 - unsigned long addr, fd; 156 - int err; 21 + unsigned long which; 157 22 158 - if (get_user(addr, input) != 0) 23 + /* We re-use the ptrace structure to specify which register to read. */ 24 + if (get_user(which, input) != 0) 25 + return -EFAULT; 26 + 27 + /* 28 + * We set up the cpu register pointer, and their next read will 29 + * actually get the value (instead of running the guest). 30 + * 31 + * The last argument 'true' says we can access any register. 32 + */ 33 + cpu->reg_read = lguest_arch_regptr(cpu, which, true); 34 + if (!cpu->reg_read) 35 + return -ENOENT; 36 + 37 + /* And because this is a write() call, we return the length used. */ 38 + return sizeof(unsigned long) * 2; 39 + } 40 + 41 + static int setreg(struct lg_cpu *cpu, const unsigned long __user *input) 42 + { 43 + unsigned long which, value, *reg; 44 + 45 + /* We re-use the ptrace structure to specify which register to read. */ 46 + if (get_user(which, input) != 0) 159 47 return -EFAULT; 160 48 input++; 161 - if (get_user(fd, input) != 0) 49 + if (get_user(value, input) != 0) 162 50 return -EFAULT; 163 51 164 - /* 165 - * Just make sure two callers don't add eventfds at once. 
We really 166 - * only need to lock against callers adding to the same Guest, so using 167 - * the Big Lguest Lock is overkill. But this is setup, not a fast path. 168 - */ 169 - mutex_lock(&lguest_lock); 170 - err = add_eventfd(lg, addr, fd); 171 - mutex_unlock(&lguest_lock); 52 + /* The last argument 'false' means we can't access all registers. */ 53 + reg = lguest_arch_regptr(cpu, which, false); 54 + if (!reg) 55 + return -ENOENT; 172 56 173 - return err; 57 + *reg = value; 58 + 59 + /* And because this is a write() call, we return the length used. */ 60 + return sizeof(unsigned long) * 3; 174 61 } 175 62 176 63 /*L:050 ··· 78 191 * this interrupt. 79 192 */ 80 193 set_interrupt(cpu, irq); 194 + return 0; 195 + } 196 + 197 + /*L:053 198 + * Deliver a trap: this is used by the Launcher if it can't emulate 199 + * an instruction. 200 + */ 201 + static int trap(struct lg_cpu *cpu, const unsigned long __user *input) 202 + { 203 + unsigned long trapnum; 204 + 205 + if (get_user(trapnum, input) != 0) 206 + return -EFAULT; 207 + 208 + if (!deliver_trap(cpu, trapnum)) 209 + return -EINVAL; 210 + 81 211 return 0; 82 212 } 83 213 ··· 141 237 * If we returned from read() last time because the Guest sent I/O, 142 238 * clear the flag. 143 239 */ 144 - if (cpu->pending_notify) 145 - cpu->pending_notify = 0; 240 + if (cpu->pending.trap) 241 + cpu->pending.trap = 0; 146 242 147 243 /* Run the Guest until something interesting happens. */ 148 244 return run_guest(cpu, (unsigned long __user *)user); ··· 223 319 /* "struct lguest" contains all we (the Host) know about a Guest. 
*/ 224 320 struct lguest *lg; 225 321 int err; 226 - unsigned long args[3]; 322 + unsigned long args[4]; 227 323 228 324 /* 229 325 * We grab the Big Lguest lock, which protects against multiple ··· 247 343 goto unlock; 248 344 } 249 345 250 - lg->eventfds = kmalloc(sizeof(*lg->eventfds), GFP_KERNEL); 251 - if (!lg->eventfds) { 252 - err = -ENOMEM; 253 - goto free_lg; 254 - } 255 - lg->eventfds->num = 0; 256 - 257 346 /* Populate the easy fields of our "struct lguest" */ 258 347 lg->mem_base = (void __user *)args[0]; 259 348 lg->pfn_limit = args[1]; 349 + lg->device_limit = args[3]; 260 350 261 351 /* This is the first cpu (cpu 0) and it will start booting at args[2] */ 262 352 err = lg_cpu_start(&lg->cpus[0], 0, args[2]); 263 353 if (err) 264 - goto free_eventfds; 354 + goto free_lg; 265 355 266 356 /* 267 357 * Initialize the Guest's shadow page tables. This allocates ··· 276 378 free_regs: 277 379 /* FIXME: This should be in free_vcpu */ 278 380 free_page(lg->cpus[0].regs_page); 279 - free_eventfds: 280 - kfree(lg->eventfds); 281 381 free_lg: 282 382 kfree(lg); 283 383 unlock: ··· 328 432 return initialize(file, input); 329 433 case LHREQ_IRQ: 330 434 return user_send_irq(cpu, input); 331 - case LHREQ_EVENTFD: 332 - return attach_eventfd(lg, input); 435 + case LHREQ_GETREG: 436 + return getreg_setup(cpu, input); 437 + case LHREQ_SETREG: 438 + return setreg(cpu, input); 439 + case LHREQ_TRAP: 440 + return trap(cpu, input); 333 441 default: 334 442 return -EINVAL; 335 443 } ··· 377 477 */ 378 478 mmput(lg->cpus[i].mm); 379 479 } 380 - 381 - /* Release any eventfds they registered. */ 382 - for (i = 0; i < lg->eventfds->num; i++) 383 - eventfd_ctx_put(lg->eventfds->map[i].event); 384 - kfree(lg->eventfds); 385 480 386 481 /* 387 482 * If lg->dead doesn't contain an error code it will be NULL or a
+59 -16
drivers/lguest/page_tables.c
··· 250 250 } 251 251 /*:*/ 252 252 253 + static bool gpte_in_iomem(struct lg_cpu *cpu, pte_t gpte) 254 + { 255 + /* We don't handle large pages. */ 256 + if (pte_flags(gpte) & _PAGE_PSE) 257 + return false; 258 + 259 + return (pte_pfn(gpte) >= cpu->lg->pfn_limit 260 + && pte_pfn(gpte) < cpu->lg->device_limit); 261 + } 262 + 253 263 static bool check_gpte(struct lg_cpu *cpu, pte_t gpte) 254 264 { 255 265 if ((pte_flags(gpte) & _PAGE_PSE) || ··· 384 374 * 385 375 * If we fixed up the fault (ie. we mapped the address), this routine returns 386 376 * true. Otherwise, it was a real fault and we need to tell the Guest. 377 + * 378 + * There's a corner case: they're trying to access memory between 379 + * pfn_limit and device_limit, which is I/O memory. In this case, we 380 + * return false and set @iomem to the physical address, so that the 381 + * Launcher can handle the instruction manually. 387 382 */ 388 - bool demand_page(struct lg_cpu *cpu, unsigned long vaddr, int errcode) 383 + bool demand_page(struct lg_cpu *cpu, unsigned long vaddr, int errcode, 384 + unsigned long *iomem) 389 385 { 390 386 unsigned long gpte_ptr; 391 387 pte_t gpte; 392 388 pte_t *spte; 393 389 pmd_t gpmd; 394 390 pgd_t gpgd; 391 + 392 + *iomem = 0; 395 393 396 394 /* We never demand page the Switcher, so trying is a mistake. */ 397 395 if (vaddr >= switcher_addr) ··· 476 458 /* User access to a kernel-only page? (bit 3 == user access) */ 477 459 if ((errcode & 4) && !(pte_flags(gpte) & _PAGE_USER)) 478 460 return false; 461 + 462 + /* If they're accessing I/O memory, we expect a fault.
*/ 463 + if (gpte_in_iomem(cpu, gpte)) { 464 + *iomem = (pte_pfn(gpte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK); 465 + return false; 466 + } 479 467 480 468 /* 481 469 * Check that the Guest PTE flags are OK, and the page number is below ··· 577 553 */ 578 554 void pin_page(struct lg_cpu *cpu, unsigned long vaddr) 579 555 { 580 - if (!page_writable(cpu, vaddr) && !demand_page(cpu, vaddr, 2)) 556 + unsigned long iomem; 557 + 558 + if (!page_writable(cpu, vaddr) && !demand_page(cpu, vaddr, 2, &iomem)) 581 559 kill_guest(cpu, "bad stack page %#lx", vaddr); 582 560 } 583 561 /*:*/ ··· 673 647 /*:*/ 674 648 675 649 /* We walk down the guest page tables to get a guest-physical address */ 676 - unsigned long guest_pa(struct lg_cpu *cpu, unsigned long vaddr) 650 + bool __guest_pa(struct lg_cpu *cpu, unsigned long vaddr, unsigned long *paddr) 677 651 { 678 652 pgd_t gpgd; 679 653 pte_t gpte; ··· 682 656 #endif 683 657 684 658 /* Still not set up? Just map 1:1. */ 685 - if (unlikely(cpu->linear_pages)) 686 - return vaddr; 659 + if (unlikely(cpu->linear_pages)) { 660 + *paddr = vaddr; 661 + return true; 662 + } 687 663 688 664 /* First step: get the top-level Guest page table entry. */ 689 665 gpgd = lgread(cpu, gpgd_addr(cpu, vaddr), pgd_t); 690 666 /* Toplevel not present? We can't map it in. 
*/ 691 - if (!(pgd_flags(gpgd) & _PAGE_PRESENT)) { 692 - kill_guest(cpu, "Bad address %#lx", vaddr); 693 - return -1UL; 694 - } 667 + if (!(pgd_flags(gpgd) & _PAGE_PRESENT)) 668 + goto fail; 695 669 696 670 #ifdef CONFIG_X86_PAE 697 671 gpmd = lgread(cpu, gpmd_addr(gpgd, vaddr), pmd_t); 698 - if (!(pmd_flags(gpmd) & _PAGE_PRESENT)) { 699 - kill_guest(cpu, "Bad address %#lx", vaddr); 700 - return -1UL; 701 - } 672 + if (!(pmd_flags(gpmd) & _PAGE_PRESENT)) 673 + goto fail; 702 674 gpte = lgread(cpu, gpte_addr(cpu, gpmd, vaddr), pte_t); 703 675 #else 704 676 gpte = lgread(cpu, gpte_addr(cpu, gpgd, vaddr), pte_t); 705 677 #endif 706 678 if (!(pte_flags(gpte) & _PAGE_PRESENT)) 707 - kill_guest(cpu, "Bad address %#lx", vaddr); 679 + goto fail; 708 680 709 - return pte_pfn(gpte) * PAGE_SIZE | (vaddr & ~PAGE_MASK); 681 + *paddr = pte_pfn(gpte) * PAGE_SIZE | (vaddr & ~PAGE_MASK); 682 + return true; 683 + 684 + fail: 685 + *paddr = -1UL; 686 + return false; 687 + } 688 + 689 + /* 690 + * This is the version we normally use: kills the Guest if it uses a 691 + * bad address 692 + */ 693 + unsigned long guest_pa(struct lg_cpu *cpu, unsigned long vaddr) 694 + { 695 + unsigned long paddr; 696 + 697 + if (!__guest_pa(cpu, vaddr, &paddr)) 698 + kill_guest(cpu, "Bad address %#lx", vaddr); 699 + return paddr; 710 700 } 711 701 712 702 /* ··· 954 912 * now. This shaves 10% off a copy-on-write 955 913 * micro-benchmark. 956 914 */ 957 - if (pte_flags(gpte) & (_PAGE_DIRTY | _PAGE_ACCESSED)) { 915 + if ((pte_flags(gpte) & (_PAGE_DIRTY | _PAGE_ACCESSED)) 916 + && !gpte_in_iomem(cpu, gpte)) { 958 917 if (!check_gpte(cpu, gpte)) 959 918 return; 960 919 set_pte(spte,
+105 -89
drivers/lguest/x86/core.c
··· 182 182 } 183 183 /*:*/ 184 184 185 + unsigned long *lguest_arch_regptr(struct lg_cpu *cpu, size_t reg_off, bool any) 186 + { 187 + switch (reg_off) { 188 + case offsetof(struct pt_regs, bx): 189 + return &cpu->regs->ebx; 190 + case offsetof(struct pt_regs, cx): 191 + return &cpu->regs->ecx; 192 + case offsetof(struct pt_regs, dx): 193 + return &cpu->regs->edx; 194 + case offsetof(struct pt_regs, si): 195 + return &cpu->regs->esi; 196 + case offsetof(struct pt_regs, di): 197 + return &cpu->regs->edi; 198 + case offsetof(struct pt_regs, bp): 199 + return &cpu->regs->ebp; 200 + case offsetof(struct pt_regs, ax): 201 + return &cpu->regs->eax; 202 + case offsetof(struct pt_regs, ip): 203 + return &cpu->regs->eip; 204 + case offsetof(struct pt_regs, sp): 205 + return &cpu->regs->esp; 206 + } 207 + 208 + /* Launcher can read these, but we don't allow any setting. */ 209 + if (any) { 210 + switch (reg_off) { 211 + case offsetof(struct pt_regs, ds): 212 + return &cpu->regs->ds; 213 + case offsetof(struct pt_regs, es): 214 + return &cpu->regs->es; 215 + case offsetof(struct pt_regs, fs): 216 + return &cpu->regs->fs; 217 + case offsetof(struct pt_regs, gs): 218 + return &cpu->regs->gs; 219 + case offsetof(struct pt_regs, cs): 220 + return &cpu->regs->cs; 221 + case offsetof(struct pt_regs, flags): 222 + return &cpu->regs->eflags; 223 + case offsetof(struct pt_regs, ss): 224 + return &cpu->regs->ss; 225 + } 226 + } 227 + 228 + return NULL; 229 + } 230 + 185 231 /*M:002 186 232 * There are hooks in the scheduler which we can register to tell when we 187 233 * get kicked off the CPU (preempt_notifier_register()). This would allow us ··· 315 269 * usually attached to a PC. 316 270 * 317 271 * When the Guest uses one of these instructions, we get a trap (General 318 - * Protection Fault) and come here. We see if it's one of those troublesome 319 - * instructions and skip over it. We return true if we did. 272 + * Protection Fault) and come here. 
We queue this to be sent out to the 273 + * Launcher to handle. 320 274 */ 321 - static int emulate_insn(struct lg_cpu *cpu) 275 + 276 + /* 277 + * The eip contains the *virtual* address of the Guest's instruction: 278 + * we copy the instruction here so the Launcher doesn't have to walk 279 + * the page tables to decode it. We handle the case (eg. in a kernel 280 + * module) where the instruction is over two pages, and the pages are 281 + * virtually but not physically contiguous. 282 + * 283 + * The longest possible x86 instruction is 15 bytes, but we don't handle 284 + * anything that strange. 285 + */ 286 + static void copy_from_guest(struct lg_cpu *cpu, 287 + void *dst, unsigned long vaddr, size_t len) 322 288 { 323 - u8 insn; 324 - unsigned int insnlen = 0, in = 0, small_operand = 0; 325 - /* 326 - * The eip contains the *virtual* address of the Guest's instruction: 327 - * walk the Guest's page tables to find the "physical" address. 328 - */ 329 - unsigned long physaddr = guest_pa(cpu, cpu->regs->eip); 289 + size_t to_page_end = PAGE_SIZE - (vaddr % PAGE_SIZE); 290 + unsigned long paddr; 330 291 331 - /* 332 - * This must be the Guest kernel trying to do something, not userspace! 333 - * The bottom two bits of the CS segment register are the privilege 334 - * level. 335 - */ 336 - if ((cpu->regs->cs & 3) != GUEST_PL) 337 - return 0; 292 + BUG_ON(len > PAGE_SIZE); 338 293 339 - /* Decoding x86 instructions is icky. */ 340 - insn = lgread(cpu, physaddr, u8); 341 - 342 - /* 343 - * Around 2.6.33, the kernel started using an emulation for the 344 - * cmpxchg8b instruction in early boot on many configurations. This 345 - * code isn't paravirtualized, and it tries to disable interrupts. 346 - * Ignore it, which will Mostly Work. 347 - */ 348 - if (insn == 0xfa) { 349 - /* "cli", or Clear Interrupt Enable instruction. Skip it. */ 350 - cpu->regs->eip++; 351 - return 1; 294 + /* If it goes over a page, copy in two parts. 
*/ 295 + if (len > to_page_end) { 296 + /* But make sure the next page is mapped! */ 297 + if (__guest_pa(cpu, vaddr + to_page_end, &paddr)) 298 + copy_from_guest(cpu, dst + to_page_end, 299 + vaddr + to_page_end, 300 + len - to_page_end); 301 + else 302 + /* Otherwise fill with zeroes. */ 303 + memset(dst + to_page_end, 0, len - to_page_end); 304 + len = to_page_end; 352 305 } 353 306 354 - /* 355 - * 0x66 is an "operand prefix". It means a 16, not 32 bit in/out. 356 - */ 357 - if (insn == 0x66) { 358 - small_operand = 1; 359 - /* The instruction is 1 byte so far, read the next byte. */ 360 - insnlen = 1; 361 - insn = lgread(cpu, physaddr + insnlen, u8); 362 - } 307 + /* This will kill the guest if it isn't mapped, but that 308 + * shouldn't happen. */ 309 + __lgread(cpu, dst, guest_pa(cpu, vaddr), len); 310 + } 363 311 364 - /* 365 - * We can ignore the lower bit for the moment and decode the 4 opcodes 366 - * we need to emulate. 367 - */ 368 - switch (insn & 0xFE) { 369 - case 0xE4: /* in <next byte>,%al */ 370 - insnlen += 2; 371 - in = 1; 372 - break; 373 - case 0xEC: /* in (%dx),%al */ 374 - insnlen += 1; 375 - in = 1; 376 - break; 377 - case 0xE6: /* out %al,<next byte> */ 378 - insnlen += 2; 379 - break; 380 - case 0xEE: /* out %al,(%dx) */ 381 - insnlen += 1; 382 - break; 383 - default: 384 - /* OK, we don't know what this is, can't emulate. */ 385 - return 0; 386 - } 387 312 388 - /* 389 - * If it was an "IN" instruction, they expect the result to be read 390 - * into %eax, so we change %eax. We always return all-ones, which 391 - * traditionally means "there's nothing there". 392 - */ 393 - if (in) { 394 - /* Lower bit tells means it's a 32/16 bit access */ 395 - if (insn & 0x1) { 396 - if (small_operand) 397 - cpu->regs->eax |= 0xFFFF; 398 - else 399 - cpu->regs->eax = 0xFFFFFFFF; 400 - } else 401 - cpu->regs->eax |= 0xFF; 402 - } 403 - /* Finally, we've "done" the instruction, so move past it. */ 404 - cpu->regs->eip += insnlen; 405 - /* Success! 
*/ 406 - return 1; 313 + static void setup_emulate_insn(struct lg_cpu *cpu) 314 + { 315 + cpu->pending.trap = 13; 316 + copy_from_guest(cpu, cpu->pending.insn, cpu->regs->eip, 317 + sizeof(cpu->pending.insn)); 318 + } 319 + 320 + static void setup_iomem_insn(struct lg_cpu *cpu, unsigned long iomem_addr) 321 + { 322 + cpu->pending.trap = 14; 323 + cpu->pending.addr = iomem_addr; 324 + copy_from_guest(cpu, cpu->pending.insn, cpu->regs->eip, 325 + sizeof(cpu->pending.insn)); 407 326 } 408 327 409 328 /*H:050 Once we've re-enabled interrupts, we look at why the Guest exited. */ 410 329 void lguest_arch_handle_trap(struct lg_cpu *cpu) 411 330 { 331 + unsigned long iomem_addr; 332 + 412 333 switch (cpu->regs->trapnum) { 413 334 case 13: /* We've intercepted a General Protection Fault. */ 414 - /* 415 - * Check if this was one of those annoying IN or OUT 416 - * instructions which we need to emulate. If so, we just go 417 - * back into the Guest after we've done it. 418 - */ 335 + /* Hand to Launcher to emulate those pesky IN and OUT insns */ 419 336 if (cpu->regs->errcode == 0) { 420 - if (emulate_insn(cpu)) 421 - return; 337 + setup_emulate_insn(cpu); 338 + return; 422 339 } 423 340 break; 424 341 case 14: /* We've intercepted a Page Fault. */ ··· 396 387 * whether kernel or userspace code. 397 388 */ 398 389 if (demand_page(cpu, cpu->arch.last_pagefault, 399 - cpu->regs->errcode)) 390 + cpu->regs->errcode, &iomem_addr)) 400 391 return; 392 + 393 + /* Was this an access to memory mapped IO? */ 394 + if (iomem_addr) { 395 + /* Tell Launcher, let it handle it. */ 396 + setup_iomem_insn(cpu, iomem_addr); 397 + return; 398 + } 401 399 402 400 /* 403 401 * OK, it's really not there (or not OK): the Guest needs to
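The page-boundary logic in `copy_from_guest()` above can be sketched in user space over a flat buffer. Here `guest_mem`, `copy_split`, and the fixed `PAGE_SIZE` are illustrative stand-ins, not the kernel's implementation — the real code translates guest virtual addresses through page tables, which is why the two halves of a straddling read may land in discontiguous physical pages:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Toy flat "guest memory"; the real code walks guest page tables, so a
 * read that crosses a page boundary may hit two unrelated mappings. */
static unsigned char guest_mem[3 * PAGE_SIZE];

/* Copy len bytes starting at guest address vaddr, splitting the copy at
 * the page boundary the way copy_from_guest() above does. */
static void copy_split(void *dst, unsigned long vaddr, size_t len)
{
	size_t to_page_end = PAGE_SIZE - (vaddr % PAGE_SIZE);

	/* If the read spills into the next page, copy that tail first
	 * (the kernel version also handles the tail being unmapped by
	 * zero-filling it). */
	if (len > to_page_end) {
		memcpy((unsigned char *)dst + to_page_end,
		       guest_mem + vaddr + to_page_end,
		       len - to_page_end);
		len = to_page_end;
	}
	memcpy(dst, guest_mem + vaddr, len);
}
```

The 15-byte `pending.insn` buffer filled by `setup_emulate_insn()` is small enough that at most one boundary is ever crossed, which is why a single split suffices.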
+6
drivers/net/virtio_net.c
··· 1710 1710 struct virtnet_info *vi; 1711 1711 u16 max_queue_pairs; 1712 1712 1713 + if (!vdev->config->get) { 1714 + dev_err(&vdev->dev, "%s failure: config access disabled\n", 1715 + __func__); 1716 + return -EINVAL; 1717 + } 1718 + 1713 1719 if (!virtnet_validate_features(vdev)) 1714 1720 return -EINVAL; 1715 1721
+6
drivers/scsi/virtio_scsi.c
··· 950 950 u32 num_queues; 951 951 struct scsi_host_template *hostt; 952 952 953 + if (!vdev->config->get) { 954 + dev_err(&vdev->dev, "%s failure: config access disabled\n", 955 + __func__); 956 + return -EINVAL; 957 + } 958 + 953 959 /* We need to know how many queues before we allocate. */ 954 960 num_queues = virtscsi_config_get(vdev, num_queues) ? : 1; 955 961
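The same guard is added to virtio_net above, virtio_scsi here, and virtio_balloon below: drivers that must read device config space now fail probe early if the transport provides no `get` op. A minimal sketch of that pattern, using cut-down stand-in types (`cfg_ops`, `vdev`, `probe_guard` are hypothetical names, not the kernel's):

```c
#include <errno.h>
#include <stddef.h>

/* Cut-down stand-ins for struct virtio_config_ops / virtio_device. */
struct cfg_ops {
	void (*get)(void *dev, unsigned offset, void *buf, unsigned len);
};

struct vdev {
	const struct cfg_ops *config;
};

/* Example get op for a transport that does support config access. */
static void dummy_get(void *dev, unsigned offset, void *buf, unsigned len)
{
	(void)dev; (void)offset; (void)buf; (void)len;
}

/* The check each driver now performs at the top of probe(): a device
 * whose transport cannot expose config space is rejected before any
 * allocation happens. */
static int probe_guard(const struct vdev *vdev)
{
	if (!vdev->config->get)
		return -EINVAL;
	return 0;	/* config access available; probing may continue */
}
```

Doing the check first keeps the error path trivial: nothing has been allocated yet when `-EINVAL` is returned.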
+20 -4
drivers/virtio/Kconfig
··· 12 12 depends on PCI 13 13 select VIRTIO 14 14 ---help--- 15 - This drivers provides support for virtio based paravirtual device 15 + This driver provides support for virtio based paravirtual device 16 16 drivers over PCI. This requires that your VMM has appropriate PCI 17 17 virtio backends. Most QEMU based VMMs should support these devices 18 18 (like KVM or Xen). 19 19 20 - Currently, the ABI is not considered stable so there is no guarantee 21 - that this version of the driver will work with your VMM. 22 - 23 20 If unsure, say M. 21 + 22 + config VIRTIO_PCI_LEGACY 23 + bool "Support for legacy virtio draft 0.9.X and older devices" 24 + default y 25 + depends on VIRTIO_PCI 26 + ---help--- 27 + Virtio PCI Card 0.9.X Draft (circa 2014) and older device support. 28 + 29 + This option enables building a transitional driver, supporting 30 + both devices conforming to Virtio 1 specification, and legacy devices. 31 + If disabled, you get a slightly smaller, non-transitional driver, 32 + with no legacy compatibility. 33 + 34 + So look out into your driveway. Do you have a flying car? If 35 + so, you can happily disable this option and virtio will not 36 + break. Otherwise, leave it set. Unless you're testing what 37 + life will be like in The Future. 38 + 39 + If unsure, say Y. 24 40 25 41 config VIRTIO_BALLOON 26 42 tristate "Virtio balloon driver"
+2 -1
drivers/virtio/Makefile
··· 1 1 obj-$(CONFIG_VIRTIO) += virtio.o virtio_ring.o 2 2 obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o 3 3 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o 4 - virtio_pci-y := virtio_pci_legacy.o virtio_pci_common.o 4 + virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o 5 + virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o 5 6 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+4 -1
drivers/virtio/virtio.c
··· 236 236 if (err) 237 237 goto err; 238 238 239 - add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK); 239 + /* If probe didn't do it, mark device DRIVER_OK ourselves. */ 240 + if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK)) 241 + virtio_device_ready(dev); 242 + 240 243 if (drv->scan) 241 244 drv->scan(dev); 242 245
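This hunk implements the "don't set VIRTIO_CONFIG_S_DRIVER_OK twice" fix from the series: the core only marks the device ready if the driver's probe didn't already do so itself. The idempotent status update reduces to a simple bitmask check, sketched here with a toy status byte (`toy_dev`, `device_ready`, `finish_probe` are illustrative names; the real code goes through `config->get_status()` and `virtio_device_ready()`):

```c
#include <stdint.h>

#define VIRTIO_CONFIG_S_DRIVER_OK 4	/* status bit from the virtio spec */

/* Toy device holding a status byte in plain memory. */
struct toy_dev {
	uint8_t status;
};

static void device_ready(struct toy_dev *d)
{
	d->status |= VIRTIO_CONFIG_S_DRIVER_OK;
}

/* Mirror of the fixed logic above: only mark DRIVER_OK if probe()
 * hasn't already, so the bit is never set a second time. */
static void finish_probe(struct toy_dev *d)
{
	if (!(d->status & VIRTIO_CONFIG_S_DRIVER_OK))
		device_ready(d);
}
```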
+7 -2
drivers/virtio/virtio_balloon.c
··· 44 44 module_param(oom_pages, int, S_IRUSR | S_IWUSR); 45 45 MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); 46 46 47 - struct virtio_balloon 48 - { 47 + struct virtio_balloon { 49 48 struct virtio_device *vdev; 50 49 struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; 51 50 ··· 464 465 { 465 466 struct virtio_balloon *vb; 466 467 int err; 468 + 469 + if (!vdev->config->get) { 470 + dev_err(&vdev->dev, "%s failure: config access disabled\n", 471 + __func__); 472 + return -EINVAL; 473 + } 467 474 468 475 vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL); 469 476 if (!vb) {
+81 -50
drivers/virtio/virtio_mmio.c
··· 1 1 /* 2 2 * Virtio memory mapped device driver 3 3 * 4 - * Copyright 2011, ARM Ltd. 4 + * Copyright 2011-2014, ARM Ltd. 5 5 * 6 6 * This module allows virtio devices to be used over a virtual, memory mapped 7 7 * platform device. ··· 49 49 * virtio_mmio.device=1K@0x1001e000:74 50 50 * 51 51 * 52 - * 53 - * Registers layout (all 32-bit wide): 54 - * 55 - * offset d. name description 56 - * ------ -- ---------------- ----------------- 57 - * 58 - * 0x000 R MagicValue Magic value "virt" 59 - * 0x004 R Version Device version (current max. 1) 60 - * 0x008 R DeviceID Virtio device ID 61 - * 0x00c R VendorID Virtio vendor ID 62 - * 63 - * 0x010 R HostFeatures Features supported by the host 64 - * 0x014 W HostFeaturesSel Set of host features to access via HostFeatures 65 - * 66 - * 0x020 W GuestFeatures Features activated by the guest 67 - * 0x024 W GuestFeaturesSel Set of activated features to set via GuestFeatures 68 - * 0x028 W GuestPageSize Size of guest's memory page in bytes 69 - * 70 - * 0x030 W QueueSel Queue selector 71 - * 0x034 R QueueNumMax Maximum size of the currently selected queue 72 - * 0x038 W QueueNum Queue size for the currently selected queue 73 - * 0x03c W QueueAlign Used Ring alignment for the current queue 74 - * 0x040 RW QueuePFN PFN for the currently selected queue 75 - * 76 - * 0x050 W QueueNotify Queue notifier 77 - * 0x060 R InterruptStatus Interrupt status register 78 - * 0x064 W InterruptACK Interrupt acknowledge register 79 - * 0x070 RW Status Device status register 80 - * 81 - * 0x100+ RW Device-specific configuration space 82 52 * 83 53 * Based on Virtio PCI driver by Anthony Liguori, copyright IBM Corp. 
2007 84 54 * ··· 115 145 static u64 vm_get_features(struct virtio_device *vdev) 116 146 { 117 147 struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev); 148 + u64 features; 118 149 119 - /* TODO: Features > 32 bits */ 120 - writel(0, vm_dev->base + VIRTIO_MMIO_HOST_FEATURES_SEL); 150 + writel(1, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL); 151 + features = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES); 152 + features <<= 32; 121 153 122 - return readl(vm_dev->base + VIRTIO_MMIO_HOST_FEATURES); 154 + writel(0, vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES_SEL); 155 + features |= readl(vm_dev->base + VIRTIO_MMIO_DEVICE_FEATURES); 156 + 157 + return features; 123 158 } 124 159 125 160 static int vm_finalize_features(struct virtio_device *vdev) ··· 134 159 /* Give virtio_ring a chance to accept features. */ 135 160 vring_transport_features(vdev); 136 161 137 - /* Make sure we don't have any features > 32 bits! */ 138 - BUG_ON((u32)vdev->features != vdev->features); 162 + /* Make sure there is are no mixed devices */ 163 + if (vm_dev->version == 2 && 164 + !__virtio_test_bit(vdev, VIRTIO_F_VERSION_1)) { 165 + dev_err(&vdev->dev, "New virtio-mmio devices (version 2) must provide VIRTIO_F_VERSION_1 feature!\n"); 166 + return -EINVAL; 167 + } 139 168 140 - writel(0, vm_dev->base + VIRTIO_MMIO_GUEST_FEATURES_SEL); 141 - writel(vdev->features, vm_dev->base + VIRTIO_MMIO_GUEST_FEATURES); 169 + writel(1, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL); 170 + writel((u32)(vdev->features >> 32), 171 + vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES); 172 + 173 + writel(0, vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES_SEL); 174 + writel((u32)vdev->features, 175 + vm_dev->base + VIRTIO_MMIO_DRIVER_FEATURES); 142 176 143 177 return 0; 144 178 } ··· 259 275 260 276 /* Select and deactivate the queue */ 261 277 writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL); 262 - writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); 278 + if (vm_dev->version == 1) { 279 + writel(0, 
vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); 280 + } else { 281 + writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_READY); 282 + WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY)); 283 + } 263 284 264 285 size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN)); 265 286 free_pages_exact(info->queue, size); ··· 301 312 writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL); 302 313 303 314 /* Queue shouldn't already be set up. */ 304 - if (readl(vm_dev->base + VIRTIO_MMIO_QUEUE_PFN)) { 315 + if (readl(vm_dev->base + (vm_dev->version == 1 ? 316 + VIRTIO_MMIO_QUEUE_PFN : VIRTIO_MMIO_QUEUE_READY))) { 305 317 err = -ENOENT; 306 318 goto error_available; 307 319 } ··· 346 356 info->num /= 2; 347 357 } 348 358 349 - /* Activate the queue */ 350 - writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM); 351 - writel(VIRTIO_MMIO_VRING_ALIGN, 352 - vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN); 353 - writel(virt_to_phys(info->queue) >> PAGE_SHIFT, 354 - vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); 355 - 356 359 /* Create the vring */ 357 360 vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev, 358 361 true, info->queue, vm_notify, callback, name); 359 362 if (!vq) { 360 363 err = -ENOMEM; 361 364 goto error_new_virtqueue; 365 + } 366 + 367 + /* Activate the queue */ 368 + writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM); 369 + if (vm_dev->version == 1) { 370 + writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN); 371 + writel(virt_to_phys(info->queue) >> PAGE_SHIFT, 372 + vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); 373 + } else { 374 + u64 addr; 375 + 376 + addr = virt_to_phys(info->queue); 377 + writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_LOW); 378 + writel((u32)(addr >> 32), 379 + vm_dev->base + VIRTIO_MMIO_QUEUE_DESC_HIGH); 380 + 381 + addr = virt_to_phys(virtqueue_get_avail(vq)); 382 + writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_AVAIL_LOW); 383 + writel((u32)(addr >> 32), 384 + vm_dev->base + VIRTIO_MMIO_QUEUE_AVAIL_HIGH); 385 + 386 + 
addr = virt_to_phys(virtqueue_get_used(vq)); 387 + writel((u32)addr, vm_dev->base + VIRTIO_MMIO_QUEUE_USED_LOW); 388 + writel((u32)(addr >> 32), 389 + vm_dev->base + VIRTIO_MMIO_QUEUE_USED_HIGH); 390 + 391 + writel(1, vm_dev->base + VIRTIO_MMIO_QUEUE_READY); 362 392 } 363 393 364 394 vq->priv = info; ··· 391 381 return vq; 392 382 393 383 error_new_virtqueue: 394 - writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); 384 + if (vm_dev->version == 1) { 385 + writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN); 386 + } else { 387 + writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_READY); 388 + WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY)); 389 + } 395 390 free_pages_exact(info->queue, size); 396 391 error_alloc_pages: 397 392 kfree(info); ··· 491 476 492 477 /* Check device version */ 493 478 vm_dev->version = readl(vm_dev->base + VIRTIO_MMIO_VERSION); 494 - if (vm_dev->version != 1) { 479 + if (vm_dev->version < 1 || vm_dev->version > 2) { 495 480 dev_err(&pdev->dev, "Version %ld not supported!\n", 496 481 vm_dev->version); 497 482 return -ENXIO; 498 483 } 499 484 500 485 vm_dev->vdev.id.device = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_ID); 486 + if (vm_dev->vdev.id.device == 0) { 487 + /* 488 + * virtio-mmio device with an ID 0 is a (dummy) placeholder 489 + * with no function. End probing now with no error reported. 490 + */ 491 + return -ENODEV; 492 + } 501 493 vm_dev->vdev.id.vendor = readl(vm_dev->base + VIRTIO_MMIO_VENDOR_ID); 502 494 503 - writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_GUEST_PAGE_SIZE); 495 + /* Reject legacy-only IDs for version 2 devices */ 496 + if (vm_dev->version == 2 && 497 + virtio_device_is_legacy_only(vm_dev->vdev.id)) { 498 + dev_err(&pdev->dev, "Version 2 not supported for devices %u!\n", 499 + vm_dev->vdev.id.device); 500 + return -ENODEV; 501 + } 502 + 503 + if (vm_dev->version == 1) 504 + writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_GUEST_PAGE_SIZE); 504 505 505 506 platform_set_drvdata(pdev, vm_dev); 506 507
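The 64-bit feature negotiation above reads two 32-bit banks behind a select register (`DEVICE_FEATURES_SEL` then `DEVICE_FEATURES`). The access pattern of `vm_get_features()` can be sketched with plain variables standing in for the MMIO registers (`feature_bank`, `write_sel`, `read_features` are stand-ins, not real kernel accessors):

```c
#include <stdint.h>

/* Stand-ins for the two 32-bit feature banks a virtio-mmio version-2
 * device exposes behind DEVICE_FEATURES / DEVICE_FEATURES_SEL. */
static uint32_t feature_bank[2];
static uint32_t feature_sel;

/* Register emulation: writel(sel, base + ..._SEL) then readl(...). */
static void write_sel(uint32_t sel)
{
	feature_sel = sel;
}

static uint32_t read_features(void)
{
	return feature_bank[feature_sel];
}

/* Same shape as vm_get_features() above: select bank 1, read the high
 * 32 bits, shift, then select bank 0 and OR in the low 32 bits. */
static uint64_t get_features64(void)
{
	uint64_t features;

	write_sel(1);
	features = read_features();
	features <<= 32;

	write_sel(0);
	features |= read_features();

	return features;
}
```

The driver-features write path in `vm_finalize_features()` is the mirror image: select a bank, write one 32-bit half, repeat for the other bank.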
+92 -2
drivers/virtio/virtio_pci_common.c
··· 19 19 20 20 #include "virtio_pci_common.h" 21 21 22 + static bool force_legacy = false; 23 + 24 + #if IS_ENABLED(CONFIG_VIRTIO_PCI_LEGACY) 25 + module_param(force_legacy, bool, 0444); 26 + MODULE_PARM_DESC(force_legacy, 27 + "Force legacy mode for transitional virtio 1 devices"); 28 + #endif 29 + 22 30 /* wait for pending irq handlers */ 23 31 void vp_synchronize_vectors(struct virtio_device *vdev) 24 32 { ··· 472 464 473 465 MODULE_DEVICE_TABLE(pci, virtio_pci_id_table); 474 466 467 + static void virtio_pci_release_dev(struct device *_d) 468 + { 469 + struct virtio_device *vdev = dev_to_virtio(_d); 470 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 471 + 472 + /* As struct device is a kobject, it's not safe to 473 + * free the memory (including the reference counter itself) 474 + * until it's release callback. */ 475 + kfree(vp_dev); 476 + } 477 + 475 478 static int virtio_pci_probe(struct pci_dev *pci_dev, 476 479 const struct pci_device_id *id) 477 480 { 478 - return virtio_pci_legacy_probe(pci_dev, id); 481 + struct virtio_pci_device *vp_dev; 482 + int rc; 483 + 484 + /* allocate our structure and fill it out */ 485 + vp_dev = kzalloc(sizeof(struct virtio_pci_device), GFP_KERNEL); 486 + if (!vp_dev) 487 + return -ENOMEM; 488 + 489 + pci_set_drvdata(pci_dev, vp_dev); 490 + vp_dev->vdev.dev.parent = &pci_dev->dev; 491 + vp_dev->vdev.dev.release = virtio_pci_release_dev; 492 + vp_dev->pci_dev = pci_dev; 493 + INIT_LIST_HEAD(&vp_dev->virtqueues); 494 + spin_lock_init(&vp_dev->lock); 495 + 496 + /* Disable MSI/MSIX to bring device to a known good state. */ 497 + pci_msi_off(pci_dev); 498 + 499 + /* enable the device */ 500 + rc = pci_enable_device(pci_dev); 501 + if (rc) 502 + goto err_enable_device; 503 + 504 + rc = pci_request_regions(pci_dev, "virtio-pci"); 505 + if (rc) 506 + goto err_request_regions; 507 + 508 + if (force_legacy) { 509 + rc = virtio_pci_legacy_probe(vp_dev); 510 + /* Also try modern mode if we can't map BAR0 (no IO space). 
*/ 511 + if (rc == -ENODEV || rc == -ENOMEM) 512 + rc = virtio_pci_modern_probe(vp_dev); 513 + if (rc) 514 + goto err_probe; 515 + } else { 516 + rc = virtio_pci_modern_probe(vp_dev); 517 + if (rc == -ENODEV) 518 + rc = virtio_pci_legacy_probe(vp_dev); 519 + if (rc) 520 + goto err_probe; 521 + } 522 + 523 + pci_set_master(pci_dev); 524 + 525 + rc = register_virtio_device(&vp_dev->vdev); 526 + if (rc) 527 + goto err_register; 528 + 529 + return 0; 530 + 531 + err_register: 532 + if (vp_dev->ioaddr) 533 + virtio_pci_legacy_remove(vp_dev); 534 + else 535 + virtio_pci_modern_remove(vp_dev); 536 + err_probe: 537 + pci_release_regions(pci_dev); 538 + err_request_regions: 539 + pci_disable_device(pci_dev); 540 + err_enable_device: 541 + kfree(vp_dev); 542 + return rc; 479 543 } 480 544 481 545 static void virtio_pci_remove(struct pci_dev *pci_dev) 482 546 { 483 - virtio_pci_legacy_remove(pci_dev); 547 + struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev); 548 + 549 + unregister_virtio_device(&vp_dev->vdev); 550 + 551 + if (vp_dev->ioaddr) 552 + virtio_pci_legacy_remove(vp_dev); 553 + else 554 + virtio_pci_modern_remove(vp_dev); 555 + 556 + pci_release_regions(pci_dev); 557 + pci_disable_device(pci_dev); 484 558 } 485 559 486 560 static struct pci_driver virtio_pci_driver = {
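The backend-selection order in `virtio_pci_probe()` above — modern first by default, legacy first under `force_legacy`, with fallback on `-ENODEV` (and on `-ENOMEM` in the forced-legacy case, since a device without an IO BAR cannot map BAR0) — can be sketched as a small decision function. The probe stubs here are hypothetical stand-ins controlled by test variables, not the real `virtio_pci_modern_probe()`/`virtio_pci_legacy_probe()`:

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the two probe functions: each returns 0 on success or
 * a negative errno when the device doesn't speak that interface. */
static int modern_result, legacy_result;
static int try_modern(void) { return modern_result; }
static int try_legacy(void) { return legacy_result; }

/* The fallback order chosen in virtio_pci_probe() above. */
static int pick_backend(bool force_legacy)
{
	int rc;

	if (force_legacy) {
		rc = try_legacy();
		/* Also try modern if BAR0 couldn't be mapped (-ENOMEM). */
		if (rc == -ENODEV || rc == -ENOMEM)
			rc = try_modern();
	} else {
		rc = try_modern();
		if (rc == -ENODEV)
			rc = try_legacy();
	}
	return rc;
}
```

Transitional devices expose both interfaces, so either order can succeed; the module parameter only changes which one wins when both are available.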
+37 -6
drivers/virtio/virtio_pci_common.h
··· 53 53 struct virtio_device vdev; 54 54 struct pci_dev *pci_dev; 55 55 56 + /* In legacy mode, these two point to within ->legacy. */ 57 + /* Where to read and clear interrupt */ 58 + u8 __iomem *isr; 59 + 60 + /* Modern only fields */ 61 + /* The IO mapping for the PCI config space (non-legacy mode) */ 62 + struct virtio_pci_common_cfg __iomem *common; 63 + /* Device-specific data (non-legacy mode) */ 64 + void __iomem *device; 65 + /* Base of vq notifications (non-legacy mode). */ 66 + void __iomem *notify_base; 67 + 68 + /* So we can sanity-check accesses. */ 69 + size_t notify_len; 70 + size_t device_len; 71 + 72 + /* Capability for when we need to map notifications per-vq. */ 73 + int notify_map_cap; 74 + 75 + /* Multiply queue_notify_off by this value. (non-legacy mode). */ 76 + u32 notify_offset_multiplier; 77 + 78 + /* Legacy only field */ 56 79 /* the IO mapping for the PCI config space */ 57 80 void __iomem *ioaddr; 58 - 59 - /* the IO mapping for ISR operation */ 60 - void __iomem *isr; 61 81 62 82 /* a list of queues so we can dispatch IRQs */ 63 83 spinlock_t lock; ··· 147 127 */ 148 128 int vp_set_vq_affinity(struct virtqueue *vq, int cpu); 149 129 150 - int virtio_pci_legacy_probe(struct pci_dev *pci_dev, 151 - const struct pci_device_id *id); 152 - void virtio_pci_legacy_remove(struct pci_dev *pci_dev); 130 + #if IS_ENABLED(CONFIG_VIRTIO_PCI_LEGACY) 131 + int virtio_pci_legacy_probe(struct virtio_pci_device *); 132 + void virtio_pci_legacy_remove(struct virtio_pci_device *); 133 + #else 134 + static inline int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) 135 + { 136 + return -ENODEV; 137 + } 138 + static inline void virtio_pci_legacy_remove(struct virtio_pci_device *vp_dev) 139 + { 140 + } 141 + #endif 142 + int virtio_pci_modern_probe(struct virtio_pci_device *); 143 + void virtio_pci_modern_remove(struct virtio_pci_device *); 153 144 154 145 #endif
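The `#if IS_ENABLED(CONFIG_VIRTIO_PCI_LEGACY)` block above uses a common kernel idiom: when an option is compiled out, an inline stub returning `-ENODEV` replaces the real function, so callers like `virtio_pci_probe()` need no `#ifdef`s of their own. A minimal sketch of that idiom with a local toggle (`HAVE_LEGACY`, `legacy_probe`, and `struct pci_devlike` are hypothetical names standing in for the Kconfig symbol and the real API):

```c
#include <errno.h>

/* Toggle standing in for CONFIG_VIRTIO_PCI_LEGACY; flip to 1 to
 * "build in" the real implementation. */
#define HAVE_LEGACY 0

struct pci_devlike;	/* opaque stand-in for struct virtio_pci_device */

#if HAVE_LEGACY
int legacy_probe(struct pci_devlike *d);	/* real version elsewhere */
#else
/* Stub: callers keep a plain `rc = legacy_probe(...)` call with no
 * #ifdef, and simply observe -ENODEV when support is compiled out. */
static inline int legacy_probe(struct pci_devlike *d)
{
	(void)d;
	return -ENODEV;
}
#endif
```

Because `-ENODEV` is exactly the code the probe fallback logic treats as "this mode unavailable, try the other", the disabled configuration degrades gracefully into a modern-only driver.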
+8 -68
drivers/virtio/virtio_pci_legacy.c
··· 211 211 .set_vq_affinity = vp_set_vq_affinity, 212 212 }; 213 213 214 - static void virtio_pci_release_dev(struct device *_d) 215 - { 216 - struct virtio_device *vdev = dev_to_virtio(_d); 217 - struct virtio_pci_device *vp_dev = to_vp_device(vdev); 218 - 219 - /* As struct device is a kobject, it's not safe to 220 - * free the memory (including the reference counter itself) 221 - * until it's release callback. */ 222 - kfree(vp_dev); 223 - } 224 - 225 214 /* the PCI probing function */ 226 - int virtio_pci_legacy_probe(struct pci_dev *pci_dev, 227 - const struct pci_device_id *id) 215 + int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) 228 216 { 229 - struct virtio_pci_device *vp_dev; 230 - int err; 217 + struct pci_dev *pci_dev = vp_dev->pci_dev; 231 218 232 219 /* We only own devices >= 0x1000 and <= 0x103f: leave the rest. */ 233 220 if (pci_dev->device < 0x1000 || pci_dev->device > 0x103f) ··· 226 239 return -ENODEV; 227 240 } 228 241 229 - /* allocate our structure and fill it out */ 230 - vp_dev = kzalloc(sizeof(struct virtio_pci_device), GFP_KERNEL); 231 - if (vp_dev == NULL) 242 + vp_dev->ioaddr = pci_iomap(pci_dev, 0, 0); 243 + if (!vp_dev->ioaddr) 232 244 return -ENOMEM; 233 245 234 - vp_dev->vdev.dev.parent = &pci_dev->dev; 235 - vp_dev->vdev.dev.release = virtio_pci_release_dev; 236 - vp_dev->vdev.config = &virtio_pci_config_ops; 237 - vp_dev->pci_dev = pci_dev; 238 - INIT_LIST_HEAD(&vp_dev->virtqueues); 239 - spin_lock_init(&vp_dev->lock); 240 - 241 - /* Disable MSI/MSIX to bring device to a known good state. 
*/ 242 - pci_msi_off(pci_dev); 243 - 244 - /* enable the device */ 245 - err = pci_enable_device(pci_dev); 246 - if (err) 247 - goto out; 248 - 249 - err = pci_request_regions(pci_dev, "virtio-pci"); 250 - if (err) 251 - goto out_enable_device; 252 - 253 - vp_dev->ioaddr = pci_iomap(pci_dev, 0, 0); 254 - if (vp_dev->ioaddr == NULL) { 255 - err = -ENOMEM; 256 - goto out_req_regions; 257 - } 258 - 259 246 vp_dev->isr = vp_dev->ioaddr + VIRTIO_PCI_ISR; 260 - 261 - pci_set_drvdata(pci_dev, vp_dev); 262 - pci_set_master(pci_dev); 263 247 264 248 /* we use the subsystem vendor/device id as the virtio vendor/device 265 249 * id. this allows us to use the same PCI vendor/device id for all ··· 239 281 vp_dev->vdev.id.vendor = pci_dev->subsystem_vendor; 240 282 vp_dev->vdev.id.device = pci_dev->subsystem_device; 241 283 284 + vp_dev->vdev.config = &virtio_pci_config_ops; 285 + 242 286 vp_dev->config_vector = vp_config_vector; 243 287 vp_dev->setup_vq = setup_vq; 244 288 vp_dev->del_vq = del_vq; 245 289 246 - /* finally register the virtio device */ 247 - err = register_virtio_device(&vp_dev->vdev); 248 - if (err) 249 - goto out_set_drvdata; 250 - 251 290 return 0; 252 - 253 - out_set_drvdata: 254 - pci_iounmap(pci_dev, vp_dev->ioaddr); 255 - out_req_regions: 256 - pci_release_regions(pci_dev); 257 - out_enable_device: 258 - pci_disable_device(pci_dev); 259 - out: 260 - kfree(vp_dev); 261 - return err; 262 291 } 263 292 264 - void virtio_pci_legacy_remove(struct pci_dev *pci_dev) 293 + void virtio_pci_legacy_remove(struct virtio_pci_device *vp_dev) 265 294 { 266 - struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev); 295 + struct pci_dev *pci_dev = vp_dev->pci_dev; 267 296 268 - unregister_virtio_device(&vp_dev->vdev); 269 - 270 - vp_del_vqs(&vp_dev->vdev); 271 297 pci_iounmap(pci_dev, vp_dev->ioaddr); 272 - pci_release_regions(pci_dev); 273 - pci_disable_device(pci_dev); 274 298 }
+695
drivers/virtio/virtio_pci_modern.c
··· 1 + /* 2 + * Virtio PCI driver - modern (virtio 1.0) device support 3 + * 4 + * This module allows virtio devices to be used over a virtual PCI device. 5 + * This can be used with QEMU based VMMs like KVM or Xen. 6 + * 7 + * Copyright IBM Corp. 2007 8 + * Copyright Red Hat, Inc. 2014 9 + * 10 + * Authors: 11 + * Anthony Liguori <aliguori@us.ibm.com> 12 + * Rusty Russell <rusty@rustcorp.com.au> 13 + * Michael S. Tsirkin <mst@redhat.com> 14 + * 15 + * This work is licensed under the terms of the GNU GPL, version 2 or later. 16 + * See the COPYING file in the top-level directory. 17 + * 18 + */ 19 + 20 + #define VIRTIO_PCI_NO_LEGACY 21 + #include "virtio_pci_common.h" 22 + 23 + static void __iomem *map_capability(struct pci_dev *dev, int off, 24 + size_t minlen, 25 + u32 align, 26 + u32 start, u32 size, 27 + size_t *len) 28 + { 29 + u8 bar; 30 + u32 offset, length; 31 + void __iomem *p; 32 + 33 + pci_read_config_byte(dev, off + offsetof(struct virtio_pci_cap, 34 + bar), 35 + &bar); 36 + pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, offset), 37 + &offset); 38 + pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, length), 39 + &length); 40 + 41 + if (length <= start) { 42 + dev_err(&dev->dev, 43 + "virtio_pci: bad capability len %u (>%u expected)\n", 44 + length, start); 45 + return NULL; 46 + } 47 + 48 + if (length - start < minlen) { 49 + dev_err(&dev->dev, 50 + "virtio_pci: bad capability len %u (>=%zu expected)\n", 51 + length, minlen); 52 + return NULL; 53 + } 54 + 55 + length -= start; 56 + 57 + if (start + offset < offset) { 58 + dev_err(&dev->dev, 59 + "virtio_pci: map wrap-around %u+%u\n", 60 + start, offset); 61 + return NULL; 62 + } 63 + 64 + offset += start; 65 + 66 + if (offset & (align - 1)) { 67 + dev_err(&dev->dev, 68 + "virtio_pci: offset %u not aligned to %u\n", 69 + offset, align); 70 + return NULL; 71 + } 72 + 73 + if (length > size) 74 + length = size; 75 + 76 + if (len) 77 + *len = length; 78 + 79 + if (minlen + 
offset < minlen || 80 + minlen + offset > pci_resource_len(dev, bar)) { 81 + dev_err(&dev->dev, 82 + "virtio_pci: map virtio %zu@%u " 83 + "out of range on bar %i length %lu\n", 84 + minlen, offset, 85 + bar, (unsigned long)pci_resource_len(dev, bar)); 86 + return NULL; 87 + } 88 + 89 + p = pci_iomap_range(dev, bar, offset, length); 90 + if (!p) 91 + dev_err(&dev->dev, 92 + "virtio_pci: unable to map virtio %u@%u on bar %i\n", 93 + length, offset, bar); 94 + return p; 95 + } 96 + 97 + static void iowrite64_twopart(u64 val, __le32 __iomem *lo, __le32 __iomem *hi) 98 + { 99 + iowrite32((u32)val, lo); 100 + iowrite32(val >> 32, hi); 101 + } 102 + 103 + /* virtio config->get_features() implementation */ 104 + static u64 vp_get_features(struct virtio_device *vdev) 105 + { 106 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 107 + u64 features; 108 + 109 + iowrite32(0, &vp_dev->common->device_feature_select); 110 + features = ioread32(&vp_dev->common->device_feature); 111 + iowrite32(1, &vp_dev->common->device_feature_select); 112 + features |= ((u64)ioread32(&vp_dev->common->device_feature) << 32); 113 + 114 + return features; 115 + } 116 + 117 + /* virtio config->finalize_features() implementation */ 118 + static int vp_finalize_features(struct virtio_device *vdev) 119 + { 120 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 121 + 122 + /* Give virtio_ring a chance to accept features. 
*/ 123 + vring_transport_features(vdev); 124 + 125 + if (!__virtio_test_bit(vdev, VIRTIO_F_VERSION_1)) { 126 + dev_err(&vdev->dev, "virtio: device uses modern interface " 127 + "but does not have VIRTIO_F_VERSION_1\n"); 128 + return -EINVAL; 129 + } 130 + 131 + iowrite32(0, &vp_dev->common->guest_feature_select); 132 + iowrite32((u32)vdev->features, &vp_dev->common->guest_feature); 133 + iowrite32(1, &vp_dev->common->guest_feature_select); 134 + iowrite32(vdev->features >> 32, &vp_dev->common->guest_feature); 135 + 136 + return 0; 137 + } 138 + 139 + /* virtio config->get() implementation */ 140 + static void vp_get(struct virtio_device *vdev, unsigned offset, 141 + void *buf, unsigned len) 142 + { 143 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 144 + u8 b; 145 + __le16 w; 146 + __le32 l; 147 + 148 + BUG_ON(offset + len > vp_dev->device_len); 149 + 150 + switch (len) { 151 + case 1: 152 + b = ioread8(vp_dev->device + offset); 153 + memcpy(buf, &b, sizeof b); 154 + break; 155 + case 2: 156 + w = cpu_to_le16(ioread16(vp_dev->device + offset)); 157 + memcpy(buf, &w, sizeof w); 158 + break; 159 + case 4: 160 + l = cpu_to_le32(ioread32(vp_dev->device + offset)); 161 + memcpy(buf, &l, sizeof l); 162 + break; 163 + case 8: 164 + l = cpu_to_le32(ioread32(vp_dev->device + offset)); 165 + memcpy(buf, &l, sizeof l); 166 + l = cpu_to_le32(ioread32(vp_dev->device + offset + sizeof l)); 167 + memcpy(buf + sizeof l, &l, sizeof l); 168 + break; 169 + default: 170 + BUG(); 171 + } 172 + } 173 + 174 + /* the config->set() implementation. 
it's symmetric to the config->get() 175 + * implementation */ 176 + static void vp_set(struct virtio_device *vdev, unsigned offset, 177 + const void *buf, unsigned len) 178 + { 179 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 180 + u8 b; 181 + __le16 w; 182 + __le32 l; 183 + 184 + BUG_ON(offset + len > vp_dev->device_len); 185 + 186 + switch (len) { 187 + case 1: 188 + memcpy(&b, buf, sizeof b); 189 + iowrite8(b, vp_dev->device + offset); 190 + break; 191 + case 2: 192 + memcpy(&w, buf, sizeof w); 193 + iowrite16(le16_to_cpu(w), vp_dev->device + offset); 194 + break; 195 + case 4: 196 + memcpy(&l, buf, sizeof l); 197 + iowrite32(le32_to_cpu(l), vp_dev->device + offset); 198 + break; 199 + case 8: 200 + memcpy(&l, buf, sizeof l); 201 + iowrite32(le32_to_cpu(l), vp_dev->device + offset); 202 + memcpy(&l, buf + sizeof l, sizeof l); 203 + iowrite32(le32_to_cpu(l), vp_dev->device + offset + sizeof l); 204 + break; 205 + default: 206 + BUG(); 207 + } 208 + } 209 + 210 + static u32 vp_generation(struct virtio_device *vdev) 211 + { 212 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 213 + return ioread8(&vp_dev->common->config_generation); 214 + } 215 + 216 + /* config->{get,set}_status() implementations */ 217 + static u8 vp_get_status(struct virtio_device *vdev) 218 + { 219 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 220 + return ioread8(&vp_dev->common->device_status); 221 + } 222 + 223 + static void vp_set_status(struct virtio_device *vdev, u8 status) 224 + { 225 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 226 + /* We should never be setting status to 0. */ 227 + BUG_ON(status == 0); 228 + iowrite8(status, &vp_dev->common->device_status); 229 + } 230 + 231 + static void vp_reset(struct virtio_device *vdev) 232 + { 233 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 234 + /* 0 status means a reset. 
*/ 235 + iowrite8(0, &vp_dev->common->device_status); 236 + /* Flush out the status write, and flush in device writes, 237 + * including MSI-X interrupts, if any. */ 238 + ioread8(&vp_dev->common->device_status); 239 + /* Flush pending VQ/configuration callbacks. */ 240 + vp_synchronize_vectors(vdev); 241 + } 242 + 243 + static u16 vp_config_vector(struct virtio_pci_device *vp_dev, u16 vector) 244 + { 245 + /* Setup the vector used for configuration events */ 246 + iowrite16(vector, &vp_dev->common->msix_config); 247 + /* Verify we had enough resources to assign the vector */ 248 + /* Will also flush the write out to device */ 249 + return ioread16(&vp_dev->common->msix_config); 250 + } 251 + 252 + static size_t vring_pci_size(u16 num) 253 + { 254 + /* We only need a cacheline separation. */ 255 + return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); 256 + } 257 + 258 + static void *alloc_virtqueue_pages(int *num) 259 + { 260 + void *pages; 261 + 262 + /* TODO: allocate each queue chunk individually */ 263 + for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { 264 + pages = alloc_pages_exact(vring_pci_size(*num), 265 + GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN); 266 + if (pages) 267 + return pages; 268 + } 269 + 270 + if (!*num) 271 + return NULL; 272 + 273 + /* Try to get a single page. You are my only hope! 
*/ 274 + return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO); 275 + } 276 + 277 + static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, 278 + struct virtio_pci_vq_info *info, 279 + unsigned index, 280 + void (*callback)(struct virtqueue *vq), 281 + const char *name, 282 + u16 msix_vec) 283 + { 284 + struct virtio_pci_common_cfg __iomem *cfg = vp_dev->common; 285 + struct virtqueue *vq; 286 + u16 num, off; 287 + int err; 288 + 289 + if (index >= ioread16(&cfg->num_queues)) 290 + return ERR_PTR(-ENOENT); 291 + 292 + /* Select the queue we're interested in */ 293 + iowrite16(index, &cfg->queue_select); 294 + 295 + /* Check if queue is either not available or already active. */ 296 + num = ioread16(&cfg->queue_size); 297 + if (!num || ioread16(&cfg->queue_enable)) 298 + return ERR_PTR(-ENOENT); 299 + 300 + if (num & (num - 1)) { 301 + dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", num); 302 + return ERR_PTR(-EINVAL); 303 + } 304 + 305 + /* get offset of notification word for this vq */ 306 + off = ioread16(&cfg->queue_notify_off); 307 + 308 + info->num = num; 309 + info->msix_vector = msix_vec; 310 + 311 + info->queue = alloc_virtqueue_pages(&info->num); 312 + if (info->queue == NULL) 313 + return ERR_PTR(-ENOMEM); 314 + 315 + /* create the vring */ 316 + vq = vring_new_virtqueue(index, info->num, 317 + SMP_CACHE_BYTES, &vp_dev->vdev, 318 + true, info->queue, vp_notify, callback, name); 319 + if (!vq) { 320 + err = -ENOMEM; 321 + goto err_new_queue; 322 + } 323 + 324 + /* activate the queue */ 325 + iowrite16(num, &cfg->queue_size); 326 + iowrite64_twopart(virt_to_phys(info->queue), 327 + &cfg->queue_desc_lo, &cfg->queue_desc_hi); 328 + iowrite64_twopart(virt_to_phys(virtqueue_get_avail(vq)), 329 + &cfg->queue_avail_lo, &cfg->queue_avail_hi); 330 + iowrite64_twopart(virt_to_phys(virtqueue_get_used(vq)), 331 + &cfg->queue_used_lo, &cfg->queue_used_hi); 332 + 333 + if (vp_dev->notify_base) { 334 + /* offset should not wrap */ 335 + 
if ((u64)off * vp_dev->notify_offset_multiplier + 2 336 + > vp_dev->notify_len) { 337 + dev_warn(&vp_dev->pci_dev->dev, 338 + "bad notification offset %u (x %u) " 339 + "for queue %u > %zd", 340 + off, vp_dev->notify_offset_multiplier, 341 + index, vp_dev->notify_len); 342 + err = -EINVAL; 343 + goto err_map_notify; 344 + } 345 + vq->priv = (void __force *)vp_dev->notify_base + 346 + off * vp_dev->notify_offset_multiplier; 347 + } else { 348 + vq->priv = (void __force *)map_capability(vp_dev->pci_dev, 349 + vp_dev->notify_map_cap, 2, 2, 350 + off * vp_dev->notify_offset_multiplier, 2, 351 + NULL); 352 + } 353 + 354 + if (!vq->priv) { 355 + err = -ENOMEM; 356 + goto err_map_notify; 357 + } 358 + 359 + if (msix_vec != VIRTIO_MSI_NO_VECTOR) { 360 + iowrite16(msix_vec, &cfg->queue_msix_vector); 361 + msix_vec = ioread16(&cfg->queue_msix_vector); 362 + if (msix_vec == VIRTIO_MSI_NO_VECTOR) { 363 + err = -EBUSY; 364 + goto err_assign_vector; 365 + } 366 + } 367 + 368 + return vq; 369 + 370 + err_assign_vector: 371 + if (!vp_dev->notify_base) 372 + pci_iounmap(vp_dev->pci_dev, (void __iomem __force *)vq->priv); 373 + err_map_notify: 374 + vring_del_virtqueue(vq); 375 + err_new_queue: 376 + free_pages_exact(info->queue, vring_pci_size(info->num)); 377 + return ERR_PTR(err); 378 + } 379 + 380 + static int vp_modern_find_vqs(struct virtio_device *vdev, unsigned nvqs, 381 + struct virtqueue *vqs[], 382 + vq_callback_t *callbacks[], 383 + const char *names[]) 384 + { 385 + struct virtio_pci_device *vp_dev = to_vp_device(vdev); 386 + struct virtqueue *vq; 387 + int rc = vp_find_vqs(vdev, nvqs, vqs, callbacks, names); 388 + 389 + if (rc) 390 + return rc; 391 + 392 + /* Select and activate all queues. Has to be done last: once we do 393 + * this, there's no way to go back except reset. 
394 + */ 395 + list_for_each_entry(vq, &vdev->vqs, list) { 396 + iowrite16(vq->index, &vp_dev->common->queue_select); 397 + iowrite16(1, &vp_dev->common->queue_enable); 398 + } 399 + 400 + return 0; 401 + } 402 + 403 + static void del_vq(struct virtio_pci_vq_info *info) 404 + { 405 + struct virtqueue *vq = info->vq; 406 + struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); 407 + 408 + iowrite16(vq->index, &vp_dev->common->queue_select); 409 + 410 + if (vp_dev->msix_enabled) { 411 + iowrite16(VIRTIO_MSI_NO_VECTOR, 412 + &vp_dev->common->queue_msix_vector); 413 + /* Flush the write out to device */ 414 + ioread16(&vp_dev->common->queue_msix_vector); 415 + } 416 + 417 + if (!vp_dev->notify_base) 418 + pci_iounmap(vp_dev->pci_dev, (void __force __iomem *)vq->priv); 419 + 420 + vring_del_virtqueue(vq); 421 + 422 + free_pages_exact(info->queue, vring_pci_size(info->num)); 423 + } 424 + 425 + static const struct virtio_config_ops virtio_pci_config_nodev_ops = { 426 + .get = NULL, 427 + .set = NULL, 428 + .generation = vp_generation, 429 + .get_status = vp_get_status, 430 + .set_status = vp_set_status, 431 + .reset = vp_reset, 432 + .find_vqs = vp_modern_find_vqs, 433 + .del_vqs = vp_del_vqs, 434 + .get_features = vp_get_features, 435 + .finalize_features = vp_finalize_features, 436 + .bus_name = vp_bus_name, 437 + .set_vq_affinity = vp_set_vq_affinity, 438 + }; 439 + 440 + static const struct virtio_config_ops virtio_pci_config_ops = { 441 + .get = vp_get, 442 + .set = vp_set, 443 + .generation = vp_generation, 444 + .get_status = vp_get_status, 445 + .set_status = vp_set_status, 446 + .reset = vp_reset, 447 + .find_vqs = vp_modern_find_vqs, 448 + .del_vqs = vp_del_vqs, 449 + .get_features = vp_get_features, 450 + .finalize_features = vp_finalize_features, 451 + .bus_name = vp_bus_name, 452 + .set_vq_affinity = vp_set_vq_affinity, 453 + }; 454 + 455 + /** 456 + * virtio_pci_find_capability - walk capabilities to find device info. 
457 + * @dev: the pci device 458 + * @cfg_type: the VIRTIO_PCI_CAP_* value we seek 459 + * @ioresource_types: IORESOURCE_MEM and/or IORESOURCE_IO. 460 + * 461 + * Returns offset of the capability, or 0. 462 + */ 463 + static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type, 464 + u32 ioresource_types) 465 + { 466 + int pos; 467 + 468 + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); 469 + pos > 0; 470 + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) { 471 + u8 type, bar; 472 + pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap, 473 + cfg_type), 474 + &type); 475 + pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap, 476 + bar), 477 + &bar); 478 + 479 + /* Ignore structures with reserved BAR values */ 480 + if (bar > 0x5) 481 + continue; 482 + 483 + if (type == cfg_type) { 484 + if (pci_resource_len(dev, bar) && 485 + pci_resource_flags(dev, bar) & ioresource_types) 486 + return pos; 487 + } 488 + } 489 + return 0; 490 + } 491 + 492 + /* This is part of the ABI. Don't screw with it. */ 493 + static inline void check_offsets(void) 494 + { 495 + /* Note: disk space was harmed in compilation of this function. 
*/ 496 + BUILD_BUG_ON(VIRTIO_PCI_CAP_VNDR != 497 + offsetof(struct virtio_pci_cap, cap_vndr)); 498 + BUILD_BUG_ON(VIRTIO_PCI_CAP_NEXT != 499 + offsetof(struct virtio_pci_cap, cap_next)); 500 + BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN != 501 + offsetof(struct virtio_pci_cap, cap_len)); 502 + BUILD_BUG_ON(VIRTIO_PCI_CAP_CFG_TYPE != 503 + offsetof(struct virtio_pci_cap, cfg_type)); 504 + BUILD_BUG_ON(VIRTIO_PCI_CAP_BAR != 505 + offsetof(struct virtio_pci_cap, bar)); 506 + BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET != 507 + offsetof(struct virtio_pci_cap, offset)); 508 + BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH != 509 + offsetof(struct virtio_pci_cap, length)); 510 + BUILD_BUG_ON(VIRTIO_PCI_NOTIFY_CAP_MULT != 511 + offsetof(struct virtio_pci_notify_cap, 512 + notify_off_multiplier)); 513 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_DFSELECT != 514 + offsetof(struct virtio_pci_common_cfg, 515 + device_feature_select)); 516 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_DF != 517 + offsetof(struct virtio_pci_common_cfg, device_feature)); 518 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_GFSELECT != 519 + offsetof(struct virtio_pci_common_cfg, 520 + guest_feature_select)); 521 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_GF != 522 + offsetof(struct virtio_pci_common_cfg, guest_feature)); 523 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_MSIX != 524 + offsetof(struct virtio_pci_common_cfg, msix_config)); 525 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_NUMQ != 526 + offsetof(struct virtio_pci_common_cfg, num_queues)); 527 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_STATUS != 528 + offsetof(struct virtio_pci_common_cfg, device_status)); 529 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_CFGGENERATION != 530 + offsetof(struct virtio_pci_common_cfg, config_generation)); 531 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SELECT != 532 + offsetof(struct virtio_pci_common_cfg, queue_select)); 533 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SIZE != 534 + offsetof(struct virtio_pci_common_cfg, queue_size)); 535 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_MSIX != 536 + offsetof(struct virtio_pci_common_cfg, queue_msix_vector)); 537 
+ BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_ENABLE != 538 + offsetof(struct virtio_pci_common_cfg, queue_enable)); 539 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_NOFF != 540 + offsetof(struct virtio_pci_common_cfg, queue_notify_off)); 541 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCLO != 542 + offsetof(struct virtio_pci_common_cfg, queue_desc_lo)); 543 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCHI != 544 + offsetof(struct virtio_pci_common_cfg, queue_desc_hi)); 545 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILLO != 546 + offsetof(struct virtio_pci_common_cfg, queue_avail_lo)); 547 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILHI != 548 + offsetof(struct virtio_pci_common_cfg, queue_avail_hi)); 549 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDLO != 550 + offsetof(struct virtio_pci_common_cfg, queue_used_lo)); 551 + BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_USEDHI != 552 + offsetof(struct virtio_pci_common_cfg, queue_used_hi)); 553 + } 554 + 555 + /* the PCI probing function */ 556 + int virtio_pci_modern_probe(struct virtio_pci_device *vp_dev) 557 + { 558 + struct pci_dev *pci_dev = vp_dev->pci_dev; 559 + int err, common, isr, notify, device; 560 + u32 notify_length; 561 + u32 notify_offset; 562 + 563 + check_offsets(); 564 + 565 + /* We only own devices >= 0x1000 and <= 0x107f: leave the rest. */ 566 + if (pci_dev->device < 0x1000 || pci_dev->device > 0x107f) 567 + return -ENODEV; 568 + 569 + if (pci_dev->device < 0x1040) { 570 + /* Transitional devices: use the PCI subsystem device id as 571 + * virtio device id, same as legacy driver always did. 572 + */ 573 + vp_dev->vdev.id.device = pci_dev->subsystem_device; 574 + } else { 575 + /* Modern devices: simply use PCI device id, but start from 0x1040. */ 576 + vp_dev->vdev.id.device = pci_dev->device - 0x1040; 577 + } 578 + vp_dev->vdev.id.vendor = pci_dev->subsystem_vendor; 579 + 580 + if (virtio_device_is_legacy_only(vp_dev->vdev.id)) 581 + return -ENODEV; 582 + 583 + /* check for a common config: if not, use legacy mode (bar 0). 
*/ 584 + common = virtio_pci_find_capability(pci_dev, VIRTIO_PCI_CAP_COMMON_CFG, 585 + IORESOURCE_IO | IORESOURCE_MEM); 586 + if (!common) { 587 + dev_info(&pci_dev->dev, 588 + "virtio_pci: leaving for legacy driver\n"); 589 + return -ENODEV; 590 + } 591 + 592 + /* If common is there, these should be too... */ 593 + isr = virtio_pci_find_capability(pci_dev, VIRTIO_PCI_CAP_ISR_CFG, 594 + IORESOURCE_IO | IORESOURCE_MEM); 595 + notify = virtio_pci_find_capability(pci_dev, VIRTIO_PCI_CAP_NOTIFY_CFG, 596 + IORESOURCE_IO | IORESOURCE_MEM); 597 + if (!isr || !notify) { 598 + dev_err(&pci_dev->dev, 599 + "virtio_pci: missing capabilities %i/%i/%i\n", 600 + common, isr, notify); 601 + return -EINVAL; 602 + } 603 + 604 + /* Device capability is only mandatory for devices that have 605 + * device-specific configuration. 606 + */ 607 + device = virtio_pci_find_capability(pci_dev, VIRTIO_PCI_CAP_DEVICE_CFG, 608 + IORESOURCE_IO | IORESOURCE_MEM); 609 + 610 + err = -EINVAL; 611 + vp_dev->common = map_capability(pci_dev, common, 612 + sizeof(struct virtio_pci_common_cfg), 4, 613 + 0, sizeof(struct virtio_pci_common_cfg), 614 + NULL); 615 + if (!vp_dev->common) 616 + goto err_map_common; 617 + vp_dev->isr = map_capability(pci_dev, isr, sizeof(u8), 1, 618 + 0, 1, 619 + NULL); 620 + if (!vp_dev->isr) 621 + goto err_map_isr; 622 + 623 + /* Read notify_off_multiplier from config space. */ 624 + pci_read_config_dword(pci_dev, 625 + notify + offsetof(struct virtio_pci_notify_cap, 626 + notify_off_multiplier), 627 + &vp_dev->notify_offset_multiplier); 628 + /* Read notify length and offset from config space. */ 629 + pci_read_config_dword(pci_dev, 630 + notify + offsetof(struct virtio_pci_notify_cap, 631 + cap.length), 632 + &notify_length); 633 + 634 + pci_read_config_dword(pci_dev, 635 + notify + offsetof(struct virtio_pci_notify_cap, 636 + cap.offset), 637 + &notify_offset); 638 + 639 + /* We don't know how many VQs we'll map, ahead of time.
640 + * If notify length is small, map it all now. 641 + * Otherwise, map each VQ individually later. 642 + */ 643 + if ((u64)notify_length + (notify_offset % PAGE_SIZE) <= PAGE_SIZE) { 644 + vp_dev->notify_base = map_capability(pci_dev, notify, 2, 2, 645 + 0, notify_length, 646 + &vp_dev->notify_len); 647 + if (!vp_dev->notify_base) 648 + goto err_map_notify; 649 + } else { 650 + vp_dev->notify_map_cap = notify; 651 + } 652 + 653 + /* Again, we don't know how much we should map, but PAGE_SIZE 654 + * is more than enough for all existing devices. 655 + */ 656 + if (device) { 657 + vp_dev->device = map_capability(pci_dev, device, 0, 4, 658 + 0, PAGE_SIZE, 659 + &vp_dev->device_len); 660 + if (!vp_dev->device) 661 + goto err_map_device; 662 + 663 + vp_dev->vdev.config = &virtio_pci_config_ops; 664 + } else { 665 + vp_dev->vdev.config = &virtio_pci_config_nodev_ops; 666 + } 667 + 668 + vp_dev->config_vector = vp_config_vector; 669 + vp_dev->setup_vq = setup_vq; 670 + vp_dev->del_vq = del_vq; 671 + 672 + return 0; 673 + 674 + err_map_device: 675 + if (vp_dev->notify_base) 676 + pci_iounmap(pci_dev, vp_dev->notify_base); 677 + err_map_notify: 678 + pci_iounmap(pci_dev, vp_dev->isr); 679 + err_map_isr: 680 + pci_iounmap(pci_dev, vp_dev->common); 681 + err_map_common: 682 + return err; 683 + } 684 + 685 + void virtio_pci_modern_remove(struct virtio_pci_device *vp_dev) 686 + { 687 + struct pci_dev *pci_dev = vp_dev->pci_dev; 688 + 689 + if (vp_dev->device) 690 + pci_iounmap(pci_dev, vp_dev->device); 691 + if (vp_dev->notify_base) 692 + pci_iounmap(pci_dev, vp_dev->notify_base); 693 + pci_iounmap(pci_dev, vp_dev->isr); 694 + pci_iounmap(pci_dev, vp_dev->common); 695 + }
+4 -5
drivers/virtio/virtio_ring.c
··· 54 54 #define END_USE(vq) 55 55 #endif 56 56 57 - struct vring_virtqueue 58 - { 57 + struct vring_virtqueue { 59 58 struct virtqueue vq; 60 59 61 60 /* Actual memory layout for this queue */ ··· 244 245 vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, virtio16_to_cpu(_vq->vdev, vq->vring.avail->idx) + 1); 245 246 vq->num_added++; 246 247 248 + pr_debug("Added buffer head %i to %p\n", head, vq); 249 + END_USE(vq); 250 + 247 251 /* This is very unlikely, but theoretically possible. Kick 248 252 * just in case. */ 249 253 if (unlikely(vq->num_added == (1 << 16) - 1)) 250 254 virtqueue_kick(_vq); 251 - 252 - pr_debug("Added buffer head %i to %p\n", head, vq); 253 - END_USE(vq); 254 255 255 256 return 0; 256 257 }
+10
include/asm-generic/pci_iomap.h
··· 15 15 #ifdef CONFIG_PCI 16 16 /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */ 17 17 extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max); 18 + extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, 19 + unsigned long offset, 20 + unsigned long maxlen); 18 21 /* Create a virtual mapping cookie for a port on a given PCI device. 19 22 * Do not call this directly, it exists to make it easier for architectures 20 23 * to override */ ··· 30 27 31 28 #elif defined(CONFIG_GENERIC_PCI_IOMAP) 32 29 static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max) 30 + { 31 + return NULL; 32 + } 33 + 34 + static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, 35 + unsigned long offset, 36 + unsigned long maxlen) 33 37 { 34 38 return NULL; 35 39 }
+16 -45
include/linux/lguest_launcher.h
··· 8 8 * 9 9 * The Guest needs devices to do anything useful. Since we don't let it touch 10 10 * real devices (think of the damage it could do!) we provide virtual devices. 11 - * We could emulate a PCI bus with various devices on it, but that is a fairly 12 - * complex burden for the Host and suboptimal for the Guest, so we have our own 13 - * simple lguest bus and we use "virtio" drivers. These drivers need a set of 14 - * routines from us which will actually do the virtual I/O, but they handle all 15 - * the net/block/console stuff themselves. This means that if we want to add 16 - * a new device, we simply need to write a new virtio driver and create support 17 - * for it in the Launcher: this code won't need to change. 11 + * We emulate a PCI bus with virtio devices on it; we used to have our own 12 + * lguest bus which was far simpler, but this tests the virtio 1.0 standard. 18 13 * 19 14 * Virtio devices are also used by kvm, so we can simply reuse their optimized 20 15 * device drivers. And one day when everyone uses virtio, my plan will be 21 16 * complete. Bwahahahah! 22 - * 23 - * Devices are described by a simplified ID, a status byte, and some "config" 24 - * bytes which describe this device's configuration. This is placed by the 25 - * Launcher just above the top of physical memory: 26 17 */ 27 - struct lguest_device_desc { 28 - /* The device type: console, network, disk etc. Type 0 terminates. */ 29 - __u8 type; 30 - /* The number of virtqueues (first in config array) */ 31 - __u8 num_vq; 32 - /* 33 - * The number of bytes of feature bits. Multiply by 2: one for host 34 - * features and one for Guest acknowledgements. 35 - */ 36 - __u8 feature_len; 37 - /* The number of bytes of the config array after virtqueues. */ 38 - __u8 config_len; 39 - /* A status byte, written by the Guest. 
*/ 40 - __u8 status; 41 - __u8 config[0]; 42 - }; 43 - 44 - /*D:135 45 - * This is how we expect the device configuration field for a virtqueue 46 - * to be laid out in config space. 47 - */ 48 - struct lguest_vqconfig { 49 - /* The number of entries in the virtio_ring */ 50 - __u16 num; 51 - /* The interrupt we get when something happens. */ 52 - __u16 irq; 53 - /* The page number of the virtio ring for this device. */ 54 - __u32 pfn; 55 - }; 56 - /*:*/ 57 18 58 19 /* Write command first word is a request. */ 59 20 enum lguest_req ··· 23 62 LHREQ_GETDMA, /* No longer used */ 24 63 LHREQ_IRQ, /* + irq */ 25 64 LHREQ_BREAK, /* No longer used */ 26 - LHREQ_EVENTFD, /* + address, fd. */ 65 + LHREQ_EVENTFD, /* No longer used. */ 66 + LHREQ_GETREG, /* + offset within struct pt_regs (then read value). */ 67 + LHREQ_SETREG, /* + offset within struct pt_regs, value. */ 68 + LHREQ_TRAP, /* + trap number to deliver to guest. */ 27 69 }; 28 70 29 71 /* 30 - * The alignment to use between consumer and producer parts of vring. 31 - * x86 pagesize for historical reasons. 72 + * This is what read() of the lguest fd populates. trap == 73 + * LGUEST_TRAP_ENTRY for an LHCALL_NOTIFY (addr is the 74 + * argument), 14 for a page fault in the MMIO region (addr is 75 + * the trap address, insn is the instruction), or 13 for a GPF 76 + * (insn is the instruction). 32 77 */ 33 - #define LGUEST_VRING_ALIGN 4096 78 + struct lguest_pending { 79 + __u8 trap; 80 + __u8 insn[7]; 81 + __u32 addr; 82 + }; 34 83 #endif /* _LINUX_LGUEST_LAUNCHER */
+37 -7
include/linux/virtio_mmio.h
··· 51 51 /* Virtio vendor ID - Read Only */ 52 52 #define VIRTIO_MMIO_VENDOR_ID 0x00c 53 53 54 - /* Bitmask of the features supported by the host 54 + /* Bitmask of the features supported by the device (host) 55 55 * (32 bits per set) - Read Only */ 56 - #define VIRTIO_MMIO_HOST_FEATURES 0x010 56 + #define VIRTIO_MMIO_DEVICE_FEATURES 0x010 57 57 58 - /* Host features set selector - Write Only */ 59 - #define VIRTIO_MMIO_HOST_FEATURES_SEL 0x014 58 + /* Device (host) features set selector - Write Only */ 59 + #define VIRTIO_MMIO_DEVICE_FEATURES_SEL 0x014 60 60 61 - /* Bitmask of features activated by the guest 61 + /* Bitmask of features activated by the driver (guest) 62 62 * (32 bits per set) - Write Only */ 63 - #define VIRTIO_MMIO_GUEST_FEATURES 0x020 63 + #define VIRTIO_MMIO_DRIVER_FEATURES 0x020 64 64 65 65 /* Activated features set selector - Write Only */ 66 - #define VIRTIO_MMIO_GUEST_FEATURES_SEL 0x024 66 + #define VIRTIO_MMIO_DRIVER_FEATURES_SEL 0x024 67 + 68 + 69 + #ifndef VIRTIO_MMIO_NO_LEGACY /* LEGACY DEVICES ONLY! */ 67 70 68 71 /* Guest's memory page size in bytes - Write Only */ 69 72 #define VIRTIO_MMIO_GUEST_PAGE_SIZE 0x028 73 + 74 + #endif 75 + 70 76 71 77 /* Queue selector - Write Only */ 72 78 #define VIRTIO_MMIO_QUEUE_SEL 0x030 ··· 83 77 /* Queue size for the currently selected queue - Write Only */ 84 78 #define VIRTIO_MMIO_QUEUE_NUM 0x038 85 79 80 + 81 + #ifndef VIRTIO_MMIO_NO_LEGACY /* LEGACY DEVICES ONLY! 
*/ 82 + 86 83 /* Used Ring alignment for the currently selected queue - Write Only */ 87 84 #define VIRTIO_MMIO_QUEUE_ALIGN 0x03c 88 85 89 86 /* Guest's PFN for the currently selected queue - Read Write */ 90 87 #define VIRTIO_MMIO_QUEUE_PFN 0x040 88 + 89 + #endif 90 + 91 + 92 + /* Ready bit for the currently selected queue - Read Write */ 93 + #define VIRTIO_MMIO_QUEUE_READY 0x044 91 94 92 95 /* Queue notifier - Write Only */ 93 96 #define VIRTIO_MMIO_QUEUE_NOTIFY 0x050 ··· 109 94 110 95 /* Device status register - Read Write */ 111 96 #define VIRTIO_MMIO_STATUS 0x070 97 + 98 + /* Selected queue's Descriptor Table address, 64 bits in two halves */ 99 + #define VIRTIO_MMIO_QUEUE_DESC_LOW 0x080 100 + #define VIRTIO_MMIO_QUEUE_DESC_HIGH 0x084 101 + 102 + /* Selected queue's Available Ring address, 64 bits in two halves */ 103 + #define VIRTIO_MMIO_QUEUE_AVAIL_LOW 0x090 104 + #define VIRTIO_MMIO_QUEUE_AVAIL_HIGH 0x094 105 + 106 + /* Selected queue's Used Ring address, 64 bits in two halves */ 107 + #define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0 108 + #define VIRTIO_MMIO_QUEUE_USED_HIGH 0x0a4 109 + 110 + /* Configuration atomicity value */ 111 + #define VIRTIO_MMIO_CONFIG_GENERATION 0x0fc 112 112 113 113 /* The config space is defined by each driver as 114 114 * the per-driver configuration space - Read Write */
+1 -2
include/uapi/linux/virtio_balloon.h
··· 36 36 /* Size of a PFN in the balloon interface. */ 37 37 #define VIRTIO_BALLOON_PFN_SHIFT 12 38 38 39 - struct virtio_balloon_config 40 - { 39 + struct virtio_balloon_config { 41 40 /* Number of pages host wants Guest to give up. */ 42 41 __le32 num_pages; 43 42 /* Number of pages we've actually got in balloon. */
+13 -4
include/uapi/linux/virtio_blk.h
··· 31 31 #include <linux/virtio_types.h> 32 32 33 33 /* Feature bits */ 34 - #define VIRTIO_BLK_F_BARRIER 0 /* Does host support barriers? */ 35 34 #define VIRTIO_BLK_F_SIZE_MAX 1 /* Indicates maximum segment size */ 36 35 #define VIRTIO_BLK_F_SEG_MAX 2 /* Indicates maximum # of segments */ 37 36 #define VIRTIO_BLK_F_GEOMETRY 4 /* Legacy geometry available */ 38 37 #define VIRTIO_BLK_F_RO 5 /* Disk is read-only */ 39 38 #define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/ 40 - #define VIRTIO_BLK_F_SCSI 7 /* Supports scsi command passthru */ 41 - #define VIRTIO_BLK_F_WCE 9 /* Writeback mode enabled after reset */ 42 39 #define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */ 43 - #define VIRTIO_BLK_F_CONFIG_WCE 11 /* Writeback mode available in config */ 44 40 #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */ 45 41 42 + /* Legacy feature bits */ 43 + #ifndef VIRTIO_BLK_NO_LEGACY 44 + #define VIRTIO_BLK_F_BARRIER 0 /* Does host support barriers? */ 45 + #define VIRTIO_BLK_F_SCSI 7 /* Supports scsi command passthru */ 46 + #define VIRTIO_BLK_F_WCE 9 /* Writeback mode enabled after reset */ 47 + #define VIRTIO_BLK_F_CONFIG_WCE 11 /* Writeback mode available in config */ 46 48 #ifndef __KERNEL__ 47 49 /* Old (deprecated) name for VIRTIO_BLK_F_WCE. */ 48 50 #define VIRTIO_BLK_F_FLUSH VIRTIO_BLK_F_WCE 49 51 #endif 52 + #endif /* !VIRTIO_BLK_NO_LEGACY */ 50 53 51 54 #define VIRTIO_BLK_ID_BYTES 20 /* ID string length */ 52 55 ··· 103 100 #define VIRTIO_BLK_T_IN 0 104 101 #define VIRTIO_BLK_T_OUT 1 105 102 103 + #ifndef VIRTIO_BLK_NO_LEGACY 106 104 /* This bit says it's a scsi command, not an actual read or write. */ 107 105 #define VIRTIO_BLK_T_SCSI_CMD 2 106 + #endif /* VIRTIO_BLK_NO_LEGACY */ 108 107 109 108 /* Cache flush command */ 110 109 #define VIRTIO_BLK_T_FLUSH 4 ··· 114 109 /* Get device ID command */ 115 110 #define VIRTIO_BLK_T_GET_ID 8 116 111 112 + #ifndef VIRTIO_BLK_NO_LEGACY 117 113 /* Barrier before this op. 
*/ 118 114 #define VIRTIO_BLK_T_BARRIER 0x80000000 115 + #endif /* !VIRTIO_BLK_NO_LEGACY */ 119 116 120 117 /* This is the first element of the read scatter-gather list. */ 121 118 struct virtio_blk_outhdr { ··· 129 122 __virtio64 sector; 130 123 }; 131 124 125 + #ifndef VIRTIO_BLK_NO_LEGACY 132 126 struct virtio_scsi_inhdr { 133 127 __virtio32 errors; 134 128 __virtio32 data_len; 135 129 __virtio32 sense_len; 136 130 __virtio32 residual; 137 131 }; 132 + #endif /* !VIRTIO_BLK_NO_LEGACY */ 138 133 139 134 /* And this is the final byte of the write scatter-gather list. */ 140 135 #define VIRTIO_BLK_S_OK 0
+2
include/uapi/linux/virtio_config.h
··· 49 49 #define VIRTIO_TRANSPORT_F_START 28 50 50 #define VIRTIO_TRANSPORT_F_END 33 51 51 52 + #ifndef VIRTIO_CONFIG_NO_LEGACY 52 53 /* Do we get callbacks when the ring is completely used, even if we've 53 54 * suppressed them? */ 54 55 #define VIRTIO_F_NOTIFY_ON_EMPTY 24 55 56 56 57 /* Can the device handle any descriptor layout? */ 57 58 #define VIRTIO_F_ANY_LAYOUT 27 59 + #endif /* VIRTIO_CONFIG_NO_LEGACY */ 58 60 59 61 /* v1.0 compliant. */ 60 62 #define VIRTIO_F_VERSION_1 32
+33 -9
include/uapi/linux/virtio_net.h
··· 35 35 #define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */ 36 36 #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */ 37 37 #define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */ 38 - #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ 39 38 #define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */ 40 39 #define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */ 41 40 #define VIRTIO_NET_F_GUEST_ECN 9 /* Guest can handle TSO[6] w/ ECN in. */ ··· 55 56 * Steering */ 56 57 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ 57 58 59 + #ifndef VIRTIO_NET_NO_LEGACY 60 + #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ 61 + #endif /* VIRTIO_NET_NO_LEGACY */ 62 + 58 63 #define VIRTIO_NET_S_LINK_UP 1 /* Link is up */ 59 64 #define VIRTIO_NET_S_ANNOUNCE 2 /* Announcement is needed */ 60 65 ··· 74 71 __u16 max_virtqueue_pairs; 75 72 } __attribute__((packed)); 76 73 74 + /* 75 + * This header comes first in the scatter-gather list. If you don't 76 + * specify GSO or CSUM features, you can simply ignore the header. 77 + * 78 + * This is bitwise-equivalent to the legacy struct virtio_net_hdr_mrg_rxbuf, 79 + * only flattened. 
80 + */ 81 + struct virtio_net_hdr_v1 { 82 + #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */ 83 + #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */ 84 + __u8 flags; 85 + #define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */ 86 + #define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */ 87 + #define VIRTIO_NET_HDR_GSO_UDP 3 /* GSO frame, IPv4 UDP (UFO) */ 88 + #define VIRTIO_NET_HDR_GSO_TCPV6 4 /* GSO frame, IPv6 TCP */ 89 + #define VIRTIO_NET_HDR_GSO_ECN 0x80 /* TCP has ECN set */ 90 + __u8 gso_type; 91 + __virtio16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */ 92 + __virtio16 gso_size; /* Bytes to append to hdr_len per frame */ 93 + __virtio16 csum_start; /* Position to start checksumming from */ 94 + __virtio16 csum_offset; /* Offset after that to place checksum */ 95 + __virtio16 num_buffers; /* Number of merged rx buffers */ 96 + }; 97 + 98 + #ifndef VIRTIO_NET_NO_LEGACY 77 99 /* This header comes first in the scatter-gather list. 78 - * If VIRTIO_F_ANY_LAYOUT is not negotiated, it must 100 + * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must 79 101 * be the first element of the scatter-gather list. If you don't 80 102 * specify GSO or CSUM features, you can simply ignore the header. 
*/ 81 103 struct virtio_net_hdr { 82 - #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 // Use csum_start, csum_offset 83 - #define VIRTIO_NET_HDR_F_DATA_VALID 2 // Csum is valid 104 + /* See VIRTIO_NET_HDR_F_* */ 84 105 __u8 flags; 85 - #define VIRTIO_NET_HDR_GSO_NONE 0 // Not a GSO frame 86 - #define VIRTIO_NET_HDR_GSO_TCPV4 1 // GSO frame, IPv4 TCP (TSO) 87 - #define VIRTIO_NET_HDR_GSO_UDP 3 // GSO frame, IPv4 UDP (UFO) 88 - #define VIRTIO_NET_HDR_GSO_TCPV6 4 // GSO frame, IPv6 TCP 89 - #define VIRTIO_NET_HDR_GSO_ECN 0x80 // TCP has ECN set 106 + /* See VIRTIO_NET_HDR_GSO_* */ 90 107 __u8 gso_type; 91 108 __virtio16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */ 92 109 __virtio16 gso_size; /* Bytes to append to hdr_len per frame */ ··· 120 97 struct virtio_net_hdr hdr; 121 98 __virtio16 num_buffers; /* Number of merged rx buffers */ 122 99 }; 100 + #endif /* ...VIRTIO_NET_NO_LEGACY */ 123 101 124 102 /* 125 103 * Control virtqueue data structures
+92 -1
include/uapi/linux/virtio_pci.h
··· 39 39 #ifndef _LINUX_VIRTIO_PCI_H 40 40 #define _LINUX_VIRTIO_PCI_H 41 41 42 - #include <linux/virtio_config.h> 42 + #include <linux/types.h> 43 43 44 44 #ifndef VIRTIO_PCI_NO_LEGACY 45 45 ··· 98 98 #define VIRTIO_PCI_ISR_CONFIG 0x2 99 99 /* Vector value used to disable MSI for queue */ 100 100 #define VIRTIO_MSI_NO_VECTOR 0xffff 101 + 102 + #ifndef VIRTIO_PCI_NO_MODERN 103 + 104 + /* IDs for different capabilities. Must all exist. */ 105 + 106 + /* Common configuration */ 107 + #define VIRTIO_PCI_CAP_COMMON_CFG 1 108 + /* Notifications */ 109 + #define VIRTIO_PCI_CAP_NOTIFY_CFG 2 110 + /* ISR access */ 111 + #define VIRTIO_PCI_CAP_ISR_CFG 3 112 + /* Device specific configuration */ 113 + #define VIRTIO_PCI_CAP_DEVICE_CFG 4 114 + /* PCI configuration access */ 115 + #define VIRTIO_PCI_CAP_PCI_CFG 5 116 + 117 + /* This is the PCI capability header: */ 118 + struct virtio_pci_cap { 119 + __u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ 120 + __u8 cap_next; /* Generic PCI field: next ptr. */ 121 + __u8 cap_len; /* Generic PCI field: capability length */ 122 + __u8 cfg_type; /* Identifies the structure. */ 123 + __u8 bar; /* Where to find it. */ 124 + __u8 padding[3]; /* Pad to full dword. */ 125 + __le32 offset; /* Offset within bar. */ 126 + __le32 length; /* Length of the structure, in bytes. */ 127 + }; 128 + 129 + struct virtio_pci_notify_cap { 130 + struct virtio_pci_cap cap; 131 + __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ 132 + }; 133 + 134 + /* Fields in VIRTIO_PCI_CAP_COMMON_CFG: */ 135 + struct virtio_pci_common_cfg { 136 + /* About the whole device. 
*/ 137 + __le32 device_feature_select; /* read-write */ 138 + __le32 device_feature; /* read-only */ 139 + __le32 guest_feature_select; /* read-write */ 140 + __le32 guest_feature; /* read-write */ 141 + __le16 msix_config; /* read-write */ 142 + __le16 num_queues; /* read-only */ 143 + __u8 device_status; /* read-write */ 144 + __u8 config_generation; /* read-only */ 145 + 146 + /* About a specific virtqueue. */ 147 + __le16 queue_select; /* read-write */ 148 + __le16 queue_size; /* read-write, power of 2. */ 149 + __le16 queue_msix_vector; /* read-write */ 150 + __le16 queue_enable; /* read-write */ 151 + __le16 queue_notify_off; /* read-only */ 152 + __le32 queue_desc_lo; /* read-write */ 153 + __le32 queue_desc_hi; /* read-write */ 154 + __le32 queue_avail_lo; /* read-write */ 155 + __le32 queue_avail_hi; /* read-write */ 156 + __le32 queue_used_lo; /* read-write */ 157 + __le32 queue_used_hi; /* read-write */ 158 + }; 159 + 160 + /* Macro versions of offsets for the Old Timers! */ 161 + #define VIRTIO_PCI_CAP_VNDR 0 162 + #define VIRTIO_PCI_CAP_NEXT 1 163 + #define VIRTIO_PCI_CAP_LEN 2 164 + #define VIRTIO_PCI_CAP_CFG_TYPE 3 165 + #define VIRTIO_PCI_CAP_BAR 4 166 + #define VIRTIO_PCI_CAP_OFFSET 8 167 + #define VIRTIO_PCI_CAP_LENGTH 12 168 + 169 + #define VIRTIO_PCI_NOTIFY_CAP_MULT 16 170 + 171 + #define VIRTIO_PCI_COMMON_DFSELECT 0 172 + #define VIRTIO_PCI_COMMON_DF 4 173 + #define VIRTIO_PCI_COMMON_GFSELECT 8 174 + #define VIRTIO_PCI_COMMON_GF 12 175 + #define VIRTIO_PCI_COMMON_MSIX 16 176 + #define VIRTIO_PCI_COMMON_NUMQ 18 177 + #define VIRTIO_PCI_COMMON_STATUS 20 178 + #define VIRTIO_PCI_COMMON_CFGGENERATION 21 179 + #define VIRTIO_PCI_COMMON_Q_SELECT 22 180 + #define VIRTIO_PCI_COMMON_Q_SIZE 24 181 + #define VIRTIO_PCI_COMMON_Q_MSIX 26 182 + #define VIRTIO_PCI_COMMON_Q_ENABLE 28 183 + #define VIRTIO_PCI_COMMON_Q_NOFF 30 184 + #define VIRTIO_PCI_COMMON_Q_DESCLO 32 185 + #define VIRTIO_PCI_COMMON_Q_DESCHI 36 186 + #define VIRTIO_PCI_COMMON_Q_AVAILLO 40 187 
+ #define VIRTIO_PCI_COMMON_Q_AVAILHI 44 188 + #define VIRTIO_PCI_COMMON_Q_USEDLO 48 189 + #define VIRTIO_PCI_COMMON_Q_USEDHI 52 190 + 191 + #endif /* VIRTIO_PCI_NO_MODERN */ 101 192 102 193 #endif
+43 -18
lib/pci_iomap.c
··· 10 10 11 11 #ifdef CONFIG_PCI 12 12 /** 13 + * pci_iomap_range - create a virtual mapping cookie for a PCI BAR 14 + * @dev: PCI device that owns the BAR 15 + * @bar: BAR number 16 + * @offset: map memory at the given offset in BAR 17 + * @maxlen: max length of the memory to map 18 + * 19 + * Using this function you will get a __iomem address to your device BAR. 20 + * You can access it using ioread*() and iowrite*(). These functions hide 21 + * the details if this is a MMIO or PIO address space and will just do what 22 + * you expect from them in the correct way. 23 + * 24 + * @maxlen specifies the maximum length to map. If you want to get access to 25 + * the complete BAR from offset to the end, pass %0 here. 26 + * */ 27 + void __iomem *pci_iomap_range(struct pci_dev *dev, 28 + int bar, 29 + unsigned long offset, 30 + unsigned long maxlen) 31 + { 32 + resource_size_t start = pci_resource_start(dev, bar); 33 + resource_size_t len = pci_resource_len(dev, bar); 34 + unsigned long flags = pci_resource_flags(dev, bar); 35 + 36 + if (len <= offset || !start) 37 + return NULL; 38 + len -= offset; 39 + start += offset; 40 + if (maxlen && len > maxlen) 41 + len = maxlen; 42 + if (flags & IORESOURCE_IO) 43 + return __pci_ioport_map(dev, start, len); 44 + if (flags & IORESOURCE_MEM) { 45 + if (flags & IORESOURCE_CACHEABLE) 46 + return ioremap(start, len); 47 + return ioremap_nocache(start, len); 48 + } 49 + /* What? 
*/ 50 + return NULL; 51 + } 52 + EXPORT_SYMBOL(pci_iomap_range); 53 + 54 + /** 13 55 * pci_iomap - create a virtual mapping cookie for a PCI BAR 14 56 * @dev: PCI device that owns the BAR 15 57 * @bar: BAR number ··· 67 25 * */ 68 26 void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen) 69 27 { 70 - resource_size_t start = pci_resource_start(dev, bar); 71 - resource_size_t len = pci_resource_len(dev, bar); 72 - unsigned long flags = pci_resource_flags(dev, bar); 73 - 74 - if (!len || !start) 75 - return NULL; 76 - if (maxlen && len > maxlen) 77 - len = maxlen; 78 - if (flags & IORESOURCE_IO) 79 - return __pci_ioport_map(dev, start, len); 80 - if (flags & IORESOURCE_MEM) { 81 - if (flags & IORESOURCE_CACHEABLE) 82 - return ioremap(start, len); 83 - return ioremap_nocache(start, len); 84 - } 85 - /* What? */ 86 - return NULL; 28 + return pci_iomap_range(dev, bar, 0, maxlen); 87 29 } 88 - 89 30 EXPORT_SYMBOL(pci_iomap); 90 31 #endif /* CONFIG_PCI */
+6
net/9p/trans_virtio.c
··· 524 524 int err; 525 525 struct virtio_chan *chan; 526 526 527 + if (!vdev->config->get) { 528 + dev_err(&vdev->dev, "%s failure: config access disabled\n", 529 + __func__); 530 + return -EINVAL; 531 + } 532 + 527 533 chan = kmalloc(sizeof(struct virtio_chan), GFP_KERNEL); 528 534 if (!chan) { 529 535 pr_err("Failed to allocate virtio 9P channel\n");
+7 -1
tools/lguest/Makefile
··· 1 1 # This creates the demonstration utility "lguest" which runs a Linux guest. 2 - CFLAGS:=-m32 -Wall -Wmissing-declarations -Wmissing-prototypes -O3 -U_FORTIFY_SOURCE 2 + CFLAGS:=-m32 -Wall -Wmissing-declarations -Wmissing-prototypes -O3 -U_FORTIFY_SOURCE -Iinclude 3 3 4 4 all: lguest 5 + 6 + include/linux/virtio_types.h: ../../include/uapi/linux/virtio_types.h 7 + mkdir -p include/linux 2>&1 || true 8 + ln -sf ../../../../include/uapi/linux/virtio_types.h $@ 9 + 10 + lguest: include/linux/virtio_types.h 5 11 6 12 clean: 7 13 rm -f lguest
+1688 -350
tools/lguest/lguest.c
··· 41 41 #include <signal.h> 42 42 #include <pwd.h> 43 43 #include <grp.h> 44 + #include <sys/user.h> 45 + #include <linux/pci_regs.h> 44 46 45 47 #ifndef VIRTIO_F_ANY_LAYOUT 46 48 #define VIRTIO_F_ANY_LAYOUT 27 ··· 63 61 typedef uint8_t u8; 64 62 /*:*/ 65 63 66 - #include <linux/virtio_config.h> 67 - #include <linux/virtio_net.h> 68 - #include <linux/virtio_blk.h> 69 - #include <linux/virtio_console.h> 70 - #include <linux/virtio_rng.h> 64 + #define VIRTIO_CONFIG_NO_LEGACY 65 + #define VIRTIO_PCI_NO_LEGACY 66 + #define VIRTIO_BLK_NO_LEGACY 67 + #define VIRTIO_NET_NO_LEGACY 68 + 69 + /* Use in-kernel ones, which defines VIRTIO_F_VERSION_1 */ 70 + #include "../../include/uapi/linux/virtio_config.h" 71 + #include "../../include/uapi/linux/virtio_net.h" 72 + #include "../../include/uapi/linux/virtio_blk.h" 73 + #include "../../include/uapi/linux/virtio_console.h" 74 + #include "../../include/uapi/linux/virtio_rng.h" 71 75 #include <linux/virtio_ring.h> 76 + #include "../../include/uapi/linux/virtio_pci.h" 72 77 #include <asm/bootparam.h> 73 78 #include "../../include/linux/lguest_launcher.h" 74 79 ··· 100 91 /* The pointer to the start of guest memory. */ 101 92 static void *guest_base; 102 93 /* The maximum guest physical address allowed, and maximum possible. */ 103 - static unsigned long guest_limit, guest_max; 94 + static unsigned long guest_limit, guest_max, guest_mmio; 104 95 /* The /dev/lguest file descriptor. */ 105 96 static int lguest_fd; 106 97 107 98 /* a per-cpu variable indicating whose vcpu is currently running */ 108 99 static unsigned int __thread cpu_id; 100 + 101 + /* 5 bit device number in the PCI_CONFIG_ADDR => 32 only */ 102 + #define MAX_PCI_DEVICES 32 109 103 110 104 /* This is our list of devices. */ 111 105 struct device_list { ··· 118 106 /* Counter to print out convenient device numbers. */ 119 107 unsigned int device_num; 120 108 121 - /* The descriptor page for the devices. 
*/ 122 - u8 *descpage; 123 - 124 - /* A single linked list of devices. */ 125 - struct device *dev; 126 - /* And a pointer to the last device for easy append. */ 127 - struct device *lastdev; 109 + /* PCI devices. */ 110 + struct device *pci[MAX_PCI_DEVICES]; 128 111 }; 129 112 130 113 /* The list of Guest devices, based on command line arguments. */ 131 114 static struct device_list devices; 132 115 116 + struct virtio_pci_cfg_cap { 117 + struct virtio_pci_cap cap; 118 + u32 pci_cfg_data; /* Data for BAR access. */ 119 + }; 120 + 121 + struct virtio_pci_mmio { 122 + struct virtio_pci_common_cfg cfg; 123 + u16 notify; 124 + u8 isr; 125 + u8 padding; 126 + /* Device-specific configuration follows this. */ 127 + }; 128 + 129 + /* This is the layout (little-endian) of the PCI config space. */ 130 + struct pci_config { 131 + u16 vendor_id, device_id; 132 + u16 command, status; 133 + u8 revid, prog_if, subclass, class; 134 + u8 cacheline_size, lat_timer, header_type, bist; 135 + u32 bar[6]; 136 + u32 cardbus_cis_ptr; 137 + u16 subsystem_vendor_id, subsystem_device_id; 138 + u32 expansion_rom_addr; 139 + u8 capabilities, reserved1[3]; 140 + u32 reserved2; 141 + u8 irq_line, irq_pin, min_grant, max_latency; 142 + 143 + /* Now, this is the linked capability list. */ 144 + struct virtio_pci_cap common; 145 + struct virtio_pci_notify_cap notify; 146 + struct virtio_pci_cap isr; 147 + struct virtio_pci_cap device; 148 + struct virtio_pci_cfg_cap cfg_access; 149 + }; 150 + 133 151 /* The device structure describes a single device. */ 134 152 struct device { 135 - /* The linked-list pointer. */ 136 - struct device *next; 137 - 138 - /* The device's descriptor, as mapped into the Guest. */ 139 - struct lguest_device_desc *desc; 140 - 141 - /* We can't trust desc values once Guest has booted: we use these. */ 142 - unsigned int feature_len; 143 - unsigned int num_vq; 144 - 145 153 /* The name of this device, for --verbose. 
*/ 146 154 const char *name; 147 155 ··· 170 138 171 139 /* Is it operational */ 172 140 bool running; 141 + 142 + /* Has it written FEATURES_OK but not re-checked it? */ 143 + bool wrote_features_ok; 144 + 145 + /* PCI configuration */ 146 + union { 147 + struct pci_config config; 148 + u32 config_words[sizeof(struct pci_config) / sizeof(u32)]; 149 + }; 150 + 151 + /* Features we offer, and those accepted. */ 152 + u64 features, features_accepted; 153 + 154 + /* Device-specific config hangs off the end of this. */ 155 + struct virtio_pci_mmio *mmio; 156 + 157 + /* PCI MMIO resources (all in BAR0) */ 158 + size_t mmio_size; 159 + u32 mmio_addr; 173 160 174 161 /* Device-specific data. */ 175 162 void *priv; ··· 201 150 /* Which device owns me. */ 202 151 struct device *dev; 203 152 204 - /* The configuration for this queue. */ 205 - struct lguest_vqconfig config; 153 + /* Name for printing errors. */ 154 + const char *name; 206 155 207 156 /* The actual ring of buffers. */ 208 157 struct vring vring; 158 + 159 + /* The information about this virtqueue (we only use queue_size on) */ 160 + struct virtio_pci_common_cfg pci_config; 209 161 210 162 /* Last available index we saw. */ 211 163 u16 last_avail_idx; ··· 253 199 #define le32_to_cpu(v32) (v32) 254 200 #define le64_to_cpu(v64) (v64) 255 201 202 + /* 203 + * A real device would ignore weird/non-compliant driver behaviour. We 204 + * stop and flag it, to help debugging Linux problems. 205 + */ 206 + #define bad_driver(d, fmt, ...) \ 207 + errx(1, "%s: bad driver: " fmt, (d)->name, ## __VA_ARGS__) 208 + #define bad_driver_vq(vq, fmt, ...) \ 209 + errx(1, "%s vq %s: bad driver: " fmt, (vq)->dev->name, \ 210 + vq->name, ## __VA_ARGS__) 211 + 256 212 /* Is this iovec empty? */ 257 213 static bool iov_empty(const struct iovec iov[], unsigned int num_iov) 258 214 { ··· 275 211 } 276 212 277 213 /* Take len bytes from the front of this iovec. 
*/ 278 - static void iov_consume(struct iovec iov[], unsigned num_iov, 214 + static void iov_consume(struct device *d, 215 + struct iovec iov[], unsigned num_iov, 279 216 void *dest, unsigned len) 280 217 { 281 218 unsigned int i; ··· 294 229 len -= used; 295 230 } 296 231 if (len != 0) 297 - errx(1, "iovec too short!"); 298 - } 299 - 300 - /* The device virtqueue descriptors are followed by feature bitmasks. */ 301 - static u8 *get_feature_bits(struct device *dev) 302 - { 303 - return (u8 *)(dev->desc + 1) 304 - + dev->num_vq * sizeof(struct lguest_vqconfig); 232 + bad_driver(d, "iovec too short!"); 305 233 } 306 234 307 235 /*L:100 ··· 367 309 return addr + getpagesize(); 368 310 } 369 311 370 - /* Get some more pages for a device. */ 371 - static void *get_pages(unsigned int num) 312 + /* Get some bytes which won't be mapped into the guest. */ 313 + static unsigned long get_mmio_region(size_t size) 372 314 { 373 - void *addr = from_guest_phys(guest_limit); 315 + unsigned long addr = guest_mmio; 316 + size_t i; 374 317 375 - guest_limit += num * getpagesize(); 376 - if (guest_limit > guest_max) 377 - errx(1, "Not enough memory for devices"); 318 + if (!size) 319 + return addr; 320 + 321 + /* Size has to be a power of 2 (and multiple of 16) */ 322 + for (i = 1; i < size; i <<= 1); 323 + 324 + guest_mmio += i; 325 + 378 326 return addr; 379 327 } 380 328 ··· 611 547 { 612 548 unsigned long args[] = { LHREQ_INITIALIZE, 613 549 (unsigned long)guest_base, 614 - guest_limit / getpagesize(), start }; 615 - verbose("Guest: %p - %p (%#lx)\n", 616 - guest_base, guest_base + guest_limit, guest_limit); 550 + guest_limit / getpagesize(), start, 551 + (guest_mmio+getpagesize()-1) / getpagesize() }; 552 + verbose("Guest: %p - %p (%#lx, MMIO %#lx)\n", 553 + guest_base, guest_base + guest_limit, 554 + guest_limit, guest_mmio); 617 555 lguest_fd = open_or_die("/dev/lguest", O_RDWR); 618 556 if (write(lguest_fd, args, sizeof(args)) < 0) 619 557 err(1, "Writing to /dev/lguest"); ··· 
630 564 * we have a convenient routine which checks it and exits with an error message 631 565 * if something funny is going on: 632 566 */ 633 - static void *_check_pointer(unsigned long addr, unsigned int size, 567 + static void *_check_pointer(struct device *d, 568 + unsigned long addr, unsigned int size, 634 569 unsigned int line) 635 570 { 636 571 /* ··· 639 572 * or addr + size wraps around. 640 573 */ 641 574 if ((addr + size) > guest_limit || (addr + size) < addr) 642 - errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr); 575 + bad_driver(d, "%s:%i: Invalid address %#lx", 576 + __FILE__, line, addr); 643 577 /* 644 578 * We return a pointer for the caller's convenience, now we know it's 645 579 * safe to use. ··· 648 580 return from_guest_phys(addr); 649 581 } 650 582 /* A macro which transparently hands the line number to the real function. */ 651 - #define check_pointer(addr,size) _check_pointer(addr, size, __LINE__) 583 + #define check_pointer(d,addr,size) _check_pointer(d, addr, size, __LINE__) 652 584 653 585 /* 654 586 * Each buffer in the virtqueues is actually a chain of descriptors. This 655 587 * function returns the next descriptor in the chain, or vq->vring.num if we're 656 588 * at the end. 657 589 */ 658 - static unsigned next_desc(struct vring_desc *desc, 590 + static unsigned next_desc(struct device *d, struct vring_desc *desc, 659 591 unsigned int i, unsigned int max) 660 592 { 661 593 unsigned int next; ··· 670 602 wmb(); 671 603 672 604 if (next >= max) 673 - errx(1, "Desc next is %u", next); 605 + bad_driver(d, "Desc next is %u", next); 674 606 675 607 return next; 676 608 } ··· 681 613 */ 682 614 static void trigger_irq(struct virtqueue *vq) 683 615 { 684 - unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; 616 + unsigned long buf[] = { LHREQ_IRQ, vq->dev->config.irq_line }; 685 617 686 618 /* Don't inform them if nothing used. 
*/ 687 619 if (!vq->pending_used) 688 620 return; 689 621 vq->pending_used = 0; 690 622 691 - /* If they don't want an interrupt, don't send one... */ 623 + /* 624 + * 2.4.7.1: 625 + * 626 + * If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: 627 + * The driver MUST set flags to 0 or 1. 628 + */ 629 + if (vq->vring.avail->flags > 1) 630 + bad_driver_vq(vq, "avail->flags = %u\n", vq->vring.avail->flags); 631 + 632 + /* 633 + * 2.4.7.2: 634 + * 635 + * If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: 636 + * 637 + * - The device MUST ignore the used_event value. 638 + * - After the device writes a descriptor index into the used ring: 639 + * - If flags is 1, the device SHOULD NOT send an interrupt. 640 + * - If flags is 0, the device MUST send an interrupt. 641 + */ 692 642 if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) { 693 643 return; 694 644 } 695 645 646 + /* 647 + * 4.1.4.5.1: 648 + * 649 + * If MSI-X capability is disabled, the device MUST set the Queue 650 + * Interrupt bit in ISR status before sending a virtqueue notification 651 + * to the driver. 652 + */ 653 + vq->dev->mmio->isr = 0x1; 654 + 696 655 /* Send the Guest an interrupt tell them we used something up. */ 697 656 if (write(lguest_fd, buf, sizeof(buf)) != 0) 698 - err(1, "Triggering irq %i", vq->config.irq); 657 + err(1, "Triggering irq %i", vq->dev->config.irq_line); 699 658 } 700 659 701 660 /* ··· 740 645 unsigned int i, head, max; 741 646 struct vring_desc *desc; 742 647 u16 last_avail = lg_last_avail(vq); 648 + 649 + /* 650 + * 2.4.7.1: 651 + * 652 + * The driver MUST handle spurious interrupts from the device. 653 + * 654 + * That's why this is a while loop. 655 + */ 743 656 744 657 /* There's nothing available? */ 745 658 while (last_avail == vq->vring.avail->idx) { ··· 782 679 783 680 /* Check it isn't doing very strange things with descriptor numbers. 
*/ 784 681 if ((u16)(vq->vring.avail->idx - last_avail) > vq->vring.num) 785 - errx(1, "Guest moved used index from %u to %u", 786 - last_avail, vq->vring.avail->idx); 682 + bad_driver_vq(vq, "Guest moved used index from %u to %u", 683 + last_avail, vq->vring.avail->idx); 787 684 788 685 /* 789 686 * Make sure we read the descriptor number *after* we read the ring ··· 800 697 801 698 /* If their number is silly, that's a fatal mistake. */ 802 699 if (head >= vq->vring.num) 803 - errx(1, "Guest says index %u is available", head); 700 + bad_driver_vq(vq, "Guest says index %u is available", head); 804 701 805 702 /* When we start there are none of either input nor output. */ 806 703 *out_num = *in_num = 0; ··· 815 712 * that: no rmb() required. 816 713 */ 817 714 818 - /* 819 - * If this is an indirect entry, then this buffer contains a descriptor 820 - * table which we handle as if it's any normal descriptor chain. 821 - */ 822 - if (desc[i].flags & VRING_DESC_F_INDIRECT) { 823 - if (desc[i].len % sizeof(struct vring_desc)) 824 - errx(1, "Invalid size for indirect buffer table"); 825 - 826 - max = desc[i].len / sizeof(struct vring_desc); 827 - desc = check_pointer(desc[i].addr, desc[i].len); 828 - i = 0; 829 - } 830 - 831 715 do { 716 + /* 717 + * If this is an indirect entry, then this buffer contains a 718 + * descriptor table which we handle as if it's any normal 719 + * descriptor chain. 720 + */ 721 + if (desc[i].flags & VRING_DESC_F_INDIRECT) { 722 + /* 2.4.5.3.1: 723 + * 724 + * The driver MUST NOT set the VIRTQ_DESC_F_INDIRECT 725 + * flag unless the VIRTIO_F_INDIRECT_DESC feature was 726 + * negotiated. 727 + */ 728 + if (!(vq->dev->features_accepted & 729 + (1<<VIRTIO_RING_F_INDIRECT_DESC))) 730 + bad_driver_vq(vq, "vq indirect not negotiated"); 731 + 732 + /* 733 + * 2.4.5.3.1: 734 + * 735 + * The driver MUST NOT set the VIRTQ_DESC_F_INDIRECT 736 + * flag within an indirect descriptor (ie. only one 737 + * table per descriptor). 
738 + */ 739 + if (desc != vq->vring.desc) 740 + bad_driver_vq(vq, "Indirect within indirect"); 741 + 742 + /* 743 + * Proposed update VIRTIO-134 spells this out: 744 + * 745 + * A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT 746 + * and VIRTQ_DESC_F_NEXT in flags. 747 + */ 748 + if (desc[i].flags & VRING_DESC_F_NEXT) 749 + bad_driver_vq(vq, "indirect and next together"); 750 + 751 + if (desc[i].len % sizeof(struct vring_desc)) 752 + bad_driver_vq(vq, 753 + "Invalid size for indirect table"); 754 + /* 755 + * 2.4.5.3.2: 756 + * 757 + * The device MUST ignore the write-only flag 758 + * (flags&VIRTQ_DESC_F_WRITE) in the descriptor that 759 + * refers to an indirect table. 760 + * 761 + * We ignore it here: :) 762 + */ 763 + 764 + max = desc[i].len / sizeof(struct vring_desc); 765 + desc = check_pointer(vq->dev, desc[i].addr, desc[i].len); 766 + i = 0; 767 + 768 + /* 2.4.5.3.1: 769 + * 770 + * A driver MUST NOT create a descriptor chain longer 771 + * than the Queue Size of the device. 772 + */ 773 + if (max > vq->pci_config.queue_size) 774 + bad_driver_vq(vq, 775 + "indirect has too many entries"); 776 + } 777 + 832 778 /* Grab the first descriptor, and check it's OK. */ 833 779 iov[*out_num + *in_num].iov_len = desc[i].len; 834 780 iov[*out_num + *in_num].iov_base 835 - = check_pointer(desc[i].addr, desc[i].len); 781 + = check_pointer(vq->dev, desc[i].addr, desc[i].len); 836 782 /* If this is an input descriptor, increment that count. */ 837 783 if (desc[i].flags & VRING_DESC_F_WRITE) 838 784 (*in_num)++; ··· 891 739 * to come before any input descriptors. 892 740 */ 893 741 if (*in_num) 894 - errx(1, "Descriptor has out after in"); 742 + bad_driver_vq(vq, 743 + "Descriptor has out after in"); 895 744 (*out_num)++; 896 745 } 897 746 898 747 /* If we've got too many, that implies a descriptor loop. 
*/ 899 748 if (*out_num + *in_num > max) 900 - errx(1, "Looped descriptor"); 901 - } while ((i = next_desc(desc, i, max)) != max); 749 + bad_driver_vq(vq, "Looped descriptor"); 750 + } while ((i = next_desc(vq->dev, desc, i, max)) != max); 902 751 903 752 return head; 904 753 } ··· 956 803 /* Make sure there's a descriptor available. */ 957 804 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 958 805 if (out_num) 959 - errx(1, "Output buffers in console in queue?"); 806 + bad_driver_vq(vq, "Output buffers in console in queue?"); 960 807 961 808 /* Read into it. This is where we usually wait. */ 962 809 len = readv(STDIN_FILENO, iov, in_num); ··· 1009 856 /* We usually wait in here, for the Guest to give us something. */ 1010 857 head = wait_for_vq_desc(vq, iov, &out, &in); 1011 858 if (in) 1012 - errx(1, "Input buffers in console output queue?"); 859 + bad_driver_vq(vq, "Input buffers in console output queue?"); 1013 860 1014 861 /* writev can return a partial write, so we loop here. */ 1015 862 while (!iov_empty(iov, out)) { ··· 1018 865 warn("Write to stdout gave %i (%d)", len, errno); 1019 866 break; 1020 867 } 1021 - iov_consume(iov, out, NULL, len); 868 + iov_consume(vq->dev, iov, out, NULL, len); 1022 869 } 1023 870 1024 871 /* ··· 1047 894 /* We usually wait in here for the Guest to give us a packet. */ 1048 895 head = wait_for_vq_desc(vq, iov, &out, &in); 1049 896 if (in) 1050 - errx(1, "Input buffers in net output queue?"); 897 + bad_driver_vq(vq, "Input buffers in net output queue?"); 1051 898 /* 1052 899 * Send the whole thing through to /dev/net/tun. It expects the exact 1053 900 * same format: what a coincidence! 
··· 1095 942 */ 1096 943 head = wait_for_vq_desc(vq, iov, &out, &in); 1097 944 if (out) 1098 - errx(1, "Output buffers in net input queue?"); 945 + bad_driver_vq(vq, "Output buffers in net input queue?"); 1099 946 1100 947 /* 1101 948 * If it looks like we'll block reading from the tun device, send them ··· 1139 986 kill(0, SIGTERM); 1140 987 } 1141 988 989 + static void reset_vq_pci_config(struct virtqueue *vq) 990 + { 991 + vq->pci_config.queue_size = VIRTQUEUE_NUM; 992 + vq->pci_config.queue_enable = 0; 993 + } 994 + 1142 995 static void reset_device(struct device *dev) 1143 996 { 1144 997 struct virtqueue *vq; ··· 1152 993 verbose("Resetting device %s\n", dev->name); 1153 994 1154 995 /* Clear any features they've acked. */ 1155 - memset(get_feature_bits(dev) + dev->feature_len, 0, dev->feature_len); 996 + dev->features_accepted = 0; 1156 997 1157 998 /* We're going to be explicitly killing threads, so ignore them. */ 1158 999 signal(SIGCHLD, SIG_IGN); 1159 1000 1160 - /* Zero out the virtqueues, get rid of their threads */ 1001 + /* 1002 + * 4.1.4.3.1: 1003 + * 1004 + * The device MUST present a 0 in queue_enable on reset. 1005 + * 1006 + * This means we set it here, and reset the saved ones in every vq. 1007 + */ 1008 + dev->mmio->cfg.queue_enable = 0; 1009 + 1010 + /* Get rid of the virtqueue threads */ 1161 1011 for (vq = dev->vq; vq; vq = vq->next) { 1012 + vq->last_avail_idx = 0; 1013 + reset_vq_pci_config(vq); 1162 1014 if (vq->thread != (pid_t)-1) { 1163 1015 kill(vq->thread, SIGTERM); 1164 1016 waitpid(vq->thread, NULL, 0); 1165 1017 vq->thread = (pid_t)-1; 1166 1018 } 1167 - memset(vq->vring.desc, 0, 1168 - vring_size(vq->config.num, LGUEST_VRING_ALIGN)); 1169 - lg_last_avail(vq) = 0; 1170 1019 } 1171 1020 dev->running = false; 1021 + dev->wrote_features_ok = false; 1172 1022 1173 1023 /* Now we care if threads die. 
*/ 1174 1024 signal(SIGCHLD, (void *)kill_launcher); 1175 1025 } 1176 1026 1177 - /*L:216 1178 - * This actually creates the thread which services the virtqueue for a device. 1027 + static void cleanup_devices(void) 1028 + { 1029 + unsigned int i; 1030 + 1031 + for (i = 1; i < MAX_PCI_DEVICES; i++) { 1032 + struct device *d = devices.pci[i]; 1033 + if (!d) 1034 + continue; 1035 + reset_device(d); 1036 + } 1037 + 1038 + /* If we saved off the original terminal settings, restore them now. */ 1039 + if (orig_term.c_lflag & (ISIG|ICANON|ECHO)) 1040 + tcsetattr(STDIN_FILENO, TCSANOW, &orig_term); 1041 + } 1042 + 1043 + /*L:217 1044 + * We do PCI. This is mainly done to let us test the kernel virtio PCI 1045 + * code. 1179 1046 */ 1180 - static void create_thread(struct virtqueue *vq) 1047 + 1048 + /* Linux expects a PCI host bridge: ours is a dummy, and first on the bus. */ 1049 + static struct device pci_host_bridge; 1050 + 1051 + static void init_pci_host_bridge(void) 1052 + { 1053 + pci_host_bridge.name = "PCI Host Bridge"; 1054 + pci_host_bridge.config.class = 0x06; /* bridge */ 1055 + pci_host_bridge.config.subclass = 0; /* host bridge */ 1056 + devices.pci[0] = &pci_host_bridge; 1057 + } 1058 + 1059 + /* The IO ports used to read the PCI config space. */ 1060 + #define PCI_CONFIG_ADDR 0xCF8 1061 + #define PCI_CONFIG_DATA 0xCFC 1062 + 1063 + /* 1064 + * Not really portable, but does help readability: this is what the Guest 1065 + * writes to the PCI_CONFIG_ADDR IO port. 1066 + */ 1067 + union pci_config_addr { 1068 + struct { 1069 + unsigned mbz: 2; 1070 + unsigned offset: 6; 1071 + unsigned funcnum: 3; 1072 + unsigned devnum: 5; 1073 + unsigned busnum: 8; 1074 + unsigned reserved: 7; 1075 + unsigned enabled : 1; 1076 + } bits; 1077 + u32 val; 1078 + }; 1079 + 1080 + /* 1081 + * We cache what they wrote to the address port, so we know what they're 1082 + * talking about when they access the data port. 
1083 + */ 1084 + static union pci_config_addr pci_config_addr; 1085 + 1086 + static struct device *find_pci_device(unsigned int index) 1087 + { 1088 + return devices.pci[index]; 1089 + } 1090 + 1091 + /* PCI can do 1, 2 and 4 byte reads; we handle that here. */ 1092 + static void ioread(u16 off, u32 v, u32 mask, u32 *val) 1093 + { 1094 + assert(off < 4); 1095 + assert(mask == 0xFF || mask == 0xFFFF || mask == 0xFFFFFFFF); 1096 + *val = (v >> (off * 8)) & mask; 1097 + } 1098 + 1099 + /* PCI can do 1, 2 and 4 byte writes; we handle that here. */ 1100 + static void iowrite(u16 off, u32 v, u32 mask, u32 *dst) 1101 + { 1102 + assert(off < 4); 1103 + assert(mask == 0xFF || mask == 0xFFFF || mask == 0xFFFFFFFF); 1104 + *dst &= ~(mask << (off * 8)); 1105 + *dst |= (v & mask) << (off * 8); 1106 + } 1107 + 1108 + /* 1109 + * Where PCI_CONFIG_DATA accesses depends on the previous write to 1110 + * PCI_CONFIG_ADDR. 1111 + */ 1112 + static struct device *dev_and_reg(u32 *reg) 1113 + { 1114 + if (!pci_config_addr.bits.enabled) 1115 + return NULL; 1116 + 1117 + if (pci_config_addr.bits.funcnum != 0) 1118 + return NULL; 1119 + 1120 + if (pci_config_addr.bits.busnum != 0) 1121 + return NULL; 1122 + 1123 + if (pci_config_addr.bits.offset * 4 >= sizeof(struct pci_config)) 1124 + return NULL; 1125 + 1126 + *reg = pci_config_addr.bits.offset; 1127 + return find_pci_device(pci_config_addr.bits.devnum); 1128 + } 1129 + 1130 + /* 1131 + * We can get invalid combinations of values while they're writing, so we 1132 + * only fault if they try to write with some invalid bar/offset/length. 1133 + */ 1134 + static bool valid_bar_access(struct device *d, 1135 + struct virtio_pci_cfg_cap *cfg_access) 1136 + { 1137 + /* We only have 1 bar (BAR0) */ 1138 + if (cfg_access->cap.bar != 0) 1139 + return false; 1140 + 1141 + /* Check it's within BAR0. 
*/ 1142 + if (cfg_access->cap.offset >= d->mmio_size 1143 + || cfg_access->cap.offset + cfg_access->cap.length > d->mmio_size) 1144 + return false; 1145 + 1146 + /* Check length is 1, 2 or 4. */ 1147 + if (cfg_access->cap.length != 1 1148 + && cfg_access->cap.length != 2 1149 + && cfg_access->cap.length != 4) 1150 + return false; 1151 + 1152 + /* 1153 + * 4.1.4.7.2: 1154 + * 1155 + * The driver MUST NOT write a cap.offset which is not a multiple of 1156 + * cap.length (ie. all accesses MUST be aligned). 1157 + */ 1158 + if (cfg_access->cap.offset % cfg_access->cap.length != 0) 1159 + return false; 1160 + 1161 + /* Return pointer into word in BAR0. */ 1162 + return true; 1163 + } 1164 + 1165 + /* Is this accessing the PCI config address port?. */ 1166 + static bool is_pci_addr_port(u16 port) 1167 + { 1168 + return port >= PCI_CONFIG_ADDR && port < PCI_CONFIG_ADDR + 4; 1169 + } 1170 + 1171 + static bool pci_addr_iowrite(u16 port, u32 mask, u32 val) 1172 + { 1173 + iowrite(port - PCI_CONFIG_ADDR, val, mask, 1174 + &pci_config_addr.val); 1175 + verbose("PCI%s: %#x/%x: bus %u dev %u func %u reg %u\n", 1176 + pci_config_addr.bits.enabled ? "" : " DISABLED", 1177 + val, mask, 1178 + pci_config_addr.bits.busnum, 1179 + pci_config_addr.bits.devnum, 1180 + pci_config_addr.bits.funcnum, 1181 + pci_config_addr.bits.offset); 1182 + return true; 1183 + } 1184 + 1185 + static void pci_addr_ioread(u16 port, u32 mask, u32 *val) 1186 + { 1187 + ioread(port - PCI_CONFIG_ADDR, pci_config_addr.val, mask, val); 1188 + } 1189 + 1190 + /* Is this accessing the PCI config data port?. 
*/ 1191 + static bool is_pci_data_port(u16 port) 1192 + { 1193 + return port >= PCI_CONFIG_DATA && port < PCI_CONFIG_DATA + 4; 1194 + } 1195 + 1196 + static void emulate_mmio_write(struct device *d, u32 off, u32 val, u32 mask); 1197 + 1198 + static bool pci_data_iowrite(u16 port, u32 mask, u32 val) 1199 + { 1200 + u32 reg, portoff; 1201 + struct device *d = dev_and_reg(&reg); 1202 + 1203 + /* Complain if they don't belong to a device. */ 1204 + if (!d) 1205 + return false; 1206 + 1207 + /* They can do 1 byte writes, etc. */ 1208 + portoff = port - PCI_CONFIG_DATA; 1209 + 1210 + /* 1211 + * PCI uses a weird way to determine the BAR size: the OS 1212 + * writes all 1's, and sees which ones stick. 1213 + */ 1214 + if (&d->config_words[reg] == &d->config.bar[0]) { 1215 + int i; 1216 + 1217 + iowrite(portoff, val, mask, &d->config.bar[0]); 1218 + for (i = 0; (1 << i) < d->mmio_size; i++) 1219 + d->config.bar[0] &= ~(1 << i); 1220 + return true; 1221 + } else if ((&d->config_words[reg] > &d->config.bar[0] 1222 + && &d->config_words[reg] <= &d->config.bar[6]) 1223 + || &d->config_words[reg] == &d->config.expansion_rom_addr) { 1224 + /* Allow writing to any other BAR, or expansion ROM */ 1225 + iowrite(portoff, val, mask, &d->config_words[reg]); 1226 + return true; 1227 + /* We let them overide latency timer and cacheline size */ 1228 + } else if (&d->config_words[reg] == (void *)&d->config.cacheline_size) { 1229 + /* Only let them change the first two fields. */ 1230 + if (mask == 0xFFFFFFFF) 1231 + mask = 0xFFFF; 1232 + iowrite(portoff, val, mask, &d->config_words[reg]); 1233 + return true; 1234 + } else if (&d->config_words[reg] == (void *)&d->config.command 1235 + && mask == 0xFFFF) { 1236 + /* Ignore command writes. 
*/ 1237 + return true; 1238 + } else if (&d->config_words[reg] 1239 + == (void *)&d->config.cfg_access.cap.bar 1240 + || &d->config_words[reg] 1241 + == &d->config.cfg_access.cap.length 1242 + || &d->config_words[reg] 1243 + == &d->config.cfg_access.cap.offset) { 1244 + 1245 + /* 1246 + * The VIRTIO_PCI_CAP_PCI_CFG capability 1247 + * provides a backdoor to access the MMIO 1248 + * regions without mapping them. Weird, but 1249 + * useful. 1250 + */ 1251 + iowrite(portoff, val, mask, &d->config_words[reg]); 1252 + return true; 1253 + } else if (&d->config_words[reg] == &d->config.cfg_access.pci_cfg_data) { 1254 + u32 write_mask; 1255 + 1256 + /* 1257 + * 4.1.4.7.1: 1258 + * 1259 + * Upon detecting driver write access to pci_cfg_data, the 1260 + * device MUST execute a write access at offset cap.offset at 1261 + * BAR selected by cap.bar using the first cap.length bytes 1262 + * from pci_cfg_data. 1263 + */ 1264 + 1265 + /* Must be bar 0 */ 1266 + if (!valid_bar_access(d, &d->config.cfg_access)) 1267 + return false; 1268 + 1269 + iowrite(portoff, val, mask, &d->config.cfg_access.pci_cfg_data); 1270 + 1271 + /* 1272 + * Now emulate a write. The mask we use is set by 1273 + * len, *not* this write! 1274 + */ 1275 + write_mask = (1ULL<<(8*d->config.cfg_access.cap.length)) - 1; 1276 + verbose("Window writing %#x/%#x to bar %u, offset %u len %u\n", 1277 + d->config.cfg_access.pci_cfg_data, write_mask, 1278 + d->config.cfg_access.cap.bar, 1279 + d->config.cfg_access.cap.offset, 1280 + d->config.cfg_access.cap.length); 1281 + 1282 + emulate_mmio_write(d, d->config.cfg_access.cap.offset, 1283 + d->config.cfg_access.pci_cfg_data, 1284 + write_mask); 1285 + return true; 1286 + } 1287 + 1288 + /* 1289 + * 4.1.4.1: 1290 + * 1291 + * The driver MUST NOT write into any field of the capability 1292 + * structure, with the exception of those with cap_type 1293 + * VIRTIO_PCI_CAP_PCI_CFG... 
1294 + */ 1295 + return false; 1296 + } 1297 + 1298 + static u32 emulate_mmio_read(struct device *d, u32 off, u32 mask); 1299 + 1300 + static void pci_data_ioread(u16 port, u32 mask, u32 *val) 1301 + { 1302 + u32 reg; 1303 + struct device *d = dev_and_reg(&reg); 1304 + 1305 + if (!d) 1306 + return; 1307 + 1308 + /* Read through the PCI MMIO access window is special */ 1309 + if (&d->config_words[reg] == &d->config.cfg_access.pci_cfg_data) { 1310 + u32 read_mask; 1311 + 1312 + /* 1313 + * 4.1.4.7.1: 1314 + * 1315 + * Upon detecting driver read access to pci_cfg_data, the 1316 + * device MUST execute a read access of length cap.length at 1317 + * offset cap.offset at BAR selected by cap.bar and store the 1318 + * first cap.length bytes in pci_cfg_data. 1319 + */ 1320 + /* Must be bar 0 */ 1321 + if (!valid_bar_access(d, &d->config.cfg_access)) 1322 + bad_driver(d, 1323 + "Invalid cfg_access to bar%u, offset %u len %u", 1324 + d->config.cfg_access.cap.bar, 1325 + d->config.cfg_access.cap.offset, 1326 + d->config.cfg_access.cap.length); 1327 + 1328 + /* 1329 + * Read into the window. The mask we use is set by 1330 + * len, *not* this read! 1331 + */ 1332 + read_mask = (1ULL<<(8*d->config.cfg_access.cap.length))-1; 1333 + d->config.cfg_access.pci_cfg_data 1334 + = emulate_mmio_read(d, 1335 + d->config.cfg_access.cap.offset, 1336 + read_mask); 1337 + verbose("Window read %#x/%#x from bar %u, offset %u len %u\n", 1338 + d->config.cfg_access.pci_cfg_data, read_mask, 1339 + d->config.cfg_access.cap.bar, 1340 + d->config.cfg_access.cap.offset, 1341 + d->config.cfg_access.cap.length); 1342 + } 1343 + ioread(port - PCI_CONFIG_DATA, d->config_words[reg], mask, val); 1344 + } 1345 + 1346 + /*L:216 1347 + * This is where we emulate a handful of Guest instructions. It's ugly 1348 + * and we used to do it in the kernel but it grew over time. 
1349 + */ 1350 + 1351 + /* 1352 + * We use the ptrace syscall's pt_regs struct to talk about registers 1353 + * to lguest: these macros convert the names to the offsets. 1354 + */ 1355 + #define getreg(name) getreg_off(offsetof(struct user_regs_struct, name)) 1356 + #define setreg(name, val) \ 1357 + setreg_off(offsetof(struct user_regs_struct, name), (val)) 1358 + 1359 + static u32 getreg_off(size_t offset) 1360 + { 1361 + u32 r; 1362 + unsigned long args[] = { LHREQ_GETREG, offset }; 1363 + 1364 + if (pwrite(lguest_fd, args, sizeof(args), cpu_id) < 0) 1365 + err(1, "Getting register %u", offset); 1366 + if (pread(lguest_fd, &r, sizeof(r), cpu_id) != sizeof(r)) 1367 + err(1, "Reading register %u", offset); 1368 + 1369 + return r; 1370 + } 1371 + 1372 + static void setreg_off(size_t offset, u32 val) 1373 + { 1374 + unsigned long args[] = { LHREQ_SETREG, offset, val }; 1375 + 1376 + if (pwrite(lguest_fd, args, sizeof(args), cpu_id) < 0) 1377 + err(1, "Setting register %u", offset); 1378 + } 1379 + 1380 + /* Get register by instruction encoding */ 1381 + static u32 getreg_num(unsigned regnum, u32 mask) 1382 + { 1383 + /* 8 bit ops use regnums 4-7 for high parts of word */ 1384 + if (mask == 0xFF && (regnum & 0x4)) 1385 + return getreg_num(regnum & 0x3, 0xFFFF) >> 8; 1386 + 1387 + switch (regnum) { 1388 + case 0: return getreg(eax) & mask; 1389 + case 1: return getreg(ecx) & mask; 1390 + case 2: return getreg(edx) & mask; 1391 + case 3: return getreg(ebx) & mask; 1392 + case 4: return getreg(esp) & mask; 1393 + case 5: return getreg(ebp) & mask; 1394 + case 6: return getreg(esi) & mask; 1395 + case 7: return getreg(edi) & mask; 1396 + } 1397 + abort(); 1398 + } 1399 + 1400 + /* Set register by instruction encoding */ 1401 + static void setreg_num(unsigned regnum, u32 val, u32 mask) 1402 + { 1403 + /* Don't try to set bits out of range */ 1404 + assert(~(val & ~mask)); 1405 + 1406 + /* 8 bit ops use regnums 4-7 for high parts of word */ 1407 + if (mask == 0xFF && 
(regnum & 0x4)) { 1408 + /* Construct the 16 bits we want. */ 1409 + val = (val << 8) | getreg_num(regnum & 0x3, 0xFF); 1410 + setreg_num(regnum & 0x3, val, 0xFFFF); 1411 + return; 1412 + } 1413 + 1414 + switch (regnum) { 1415 + case 0: setreg(eax, val | (getreg(eax) & ~mask)); return; 1416 + case 1: setreg(ecx, val | (getreg(ecx) & ~mask)); return; 1417 + case 2: setreg(edx, val | (getreg(edx) & ~mask)); return; 1418 + case 3: setreg(ebx, val | (getreg(ebx) & ~mask)); return; 1419 + case 4: setreg(esp, val | (getreg(esp) & ~mask)); return; 1420 + case 5: setreg(ebp, val | (getreg(ebp) & ~mask)); return; 1421 + case 6: setreg(esi, val | (getreg(esi) & ~mask)); return; 1422 + case 7: setreg(edi, val | (getreg(edi) & ~mask)); return; 1423 + } 1424 + abort(); 1425 + } 1426 + 1427 + /* Get bytes of displacement appended to instruction, from r/m encoding */ 1428 + static u32 insn_displacement_len(u8 mod_reg_rm) 1429 + { 1430 + /* Switch on the mod bits */ 1431 + switch (mod_reg_rm >> 6) { 1432 + case 0: 1433 + /* If mod == 0, and r/m == 101, a four byte displacement follows */ 1434 + if ((mod_reg_rm & 0x7) == 0x5) 1435 + return 4; 1436 + /* Normally, mod == 0 means no literal displacement */ 1437 + return 0; 1438 + case 1: 1439 + /* One byte displacement */ 1440 + return 1; 1441 + case 2: 1442 + /* Four byte displacement */ 1443 + return 4; 1444 + case 3: 1445 + /* Register mode */ 1446 + return 0; 1447 + } 1448 + abort(); 1449 + } 1450 + 1451 + static void emulate_insn(const u8 insn[]) 1452 + { 1453 + unsigned long args[] = { LHREQ_TRAP, 13 }; 1454 + unsigned int insnlen = 0, in = 0, small_operand = 0, byte_access; 1455 + unsigned int eax, port, mask; 1456 + /* 1457 + * Default is to return all-ones on IO port reads, which traditionally 1458 + * means "there's nothing there". 1459 + */ 1460 + u32 val = 0xFFFFFFFF; 1461 + 1462 + /* 1463 + * This must be the Guest kernel trying to do something, not userspace! 
1464 + * The bottom two bits of the CS segment register are the privilege 1465 + * level. 1466 + */ 1467 + if ((getreg(xcs) & 3) != 0x1) 1468 + goto no_emulate; 1469 + 1470 + /* Decoding x86 instructions is icky. */ 1471 + 1472 + /* 1473 + * Around 2.6.33, the kernel started using an emulation for the 1474 + * cmpxchg8b instruction in early boot on many configurations. This 1475 + * code isn't paravirtualized, and it tries to disable interrupts. 1476 + * Ignore it, which will Mostly Work. 1477 + */ 1478 + if (insn[insnlen] == 0xfa) { 1479 + /* "cli", or Clear Interrupt Enable instruction. Skip it. */ 1480 + insnlen = 1; 1481 + goto skip_insn; 1482 + } 1483 + 1484 + /* 1485 + * 0x66 is an "operand prefix". It means a 16, not 32 bit in/out. 1486 + */ 1487 + if (insn[insnlen] == 0x66) { 1488 + small_operand = 1; 1489 + /* The instruction is 1 byte so far, read the next byte. */ 1490 + insnlen = 1; 1491 + } 1492 + 1493 + /* If the lower bit isn't set, it's a single byte access */ 1494 + byte_access = !(insn[insnlen] & 1); 1495 + 1496 + /* 1497 + * Now we can ignore the lower bit and decode the 4 opcodes 1498 + * we need to emulate. 1499 + */ 1500 + switch (insn[insnlen] & 0xFE) { 1501 + case 0xE4: /* in <next byte>,%al */ 1502 + port = insn[insnlen+1]; 1503 + insnlen += 2; 1504 + in = 1; 1505 + break; 1506 + case 0xEC: /* in (%dx),%al */ 1507 + port = getreg(edx) & 0xFFFF; 1508 + insnlen += 1; 1509 + in = 1; 1510 + break; 1511 + case 0xE6: /* out %al,<next byte> */ 1512 + port = insn[insnlen+1]; 1513 + insnlen += 2; 1514 + break; 1515 + case 0xEE: /* out %al,(%dx) */ 1516 + port = getreg(edx) & 0xFFFF; 1517 + insnlen += 1; 1518 + break; 1519 + default: 1520 + /* OK, we don't know what this is, can't emulate. 
*/ 1521 + goto no_emulate; 1522 + } 1523 + 1524 + /* Set a mask of the 1, 2 or 4 bytes, depending on size of IO */ 1525 + if (byte_access) 1526 + mask = 0xFF; 1527 + else if (small_operand) 1528 + mask = 0xFFFF; 1529 + else 1530 + mask = 0xFFFFFFFF; 1531 + 1532 + /* 1533 + * If it was an "IN" instruction, they expect the result to be read 1534 + * into %eax, so we change %eax. 1535 + */ 1536 + eax = getreg(eax); 1537 + 1538 + if (in) { 1539 + /* This is the PS/2 keyboard status; 1 means ready for output */ 1540 + if (port == 0x64) 1541 + val = 1; 1542 + else if (is_pci_addr_port(port)) 1543 + pci_addr_ioread(port, mask, &val); 1544 + else if (is_pci_data_port(port)) 1545 + pci_data_ioread(port, mask, &val); 1546 + 1547 + /* Clear the bits we're about to read */ 1548 + eax &= ~mask; 1549 + /* Copy bits in from val. */ 1550 + eax |= val & mask; 1551 + /* Now update the register. */ 1552 + setreg(eax, eax); 1553 + } else { 1554 + if (is_pci_addr_port(port)) { 1555 + if (!pci_addr_iowrite(port, mask, eax)) 1556 + goto bad_io; 1557 + } else if (is_pci_data_port(port)) { 1558 + if (!pci_data_iowrite(port, mask, eax)) 1559 + goto bad_io; 1560 + } 1561 + /* There are many other ports, eg. CMOS clock, serial 1562 + * and parallel ports, so we ignore them all. */ 1563 + } 1564 + 1565 + verbose("IO %s of %x to %u: %#08x\n", 1566 + in ? "IN" : "OUT", mask, port, eax); 1567 + skip_insn: 1568 + /* Finally, we've "done" the instruction, so move past it. */ 1569 + setreg(eip, getreg(eip) + insnlen); 1570 + return; 1571 + 1572 + bad_io: 1573 + warnx("Attempt to %s port %u (%#x mask)", 1574 + in ? "read from" : "write to", port, mask); 1575 + 1576 + no_emulate: 1577 + /* Inject trap into Guest. 
*/ 1578 + if (write(lguest_fd, args, sizeof(args)) < 0) 1579 + err(1, "Reinjecting trap 13 for fault at %#x", getreg(eip)); 1580 + } 1581 + 1582 + static struct device *find_mmio_region(unsigned long paddr, u32 *off) 1583 + { 1584 + unsigned int i; 1585 + 1586 + for (i = 1; i < MAX_PCI_DEVICES; i++) { 1587 + struct device *d = devices.pci[i]; 1588 + 1589 + if (!d) 1590 + continue; 1591 + if (paddr < d->mmio_addr) 1592 + continue; 1593 + if (paddr >= d->mmio_addr + d->mmio_size) 1594 + continue; 1595 + *off = paddr - d->mmio_addr; 1596 + return d; 1597 + } 1598 + return NULL; 1599 + } 1600 + 1601 + /* FIXME: Use vq array. */ 1602 + static struct virtqueue *vq_by_num(struct device *d, u32 num) 1603 + { 1604 + struct virtqueue *vq = d->vq; 1605 + 1606 + while (num-- && vq) 1607 + vq = vq->next; 1608 + 1609 + return vq; 1610 + } 1611 + 1612 + static void save_vq_config(const struct virtio_pci_common_cfg *cfg, 1613 + struct virtqueue *vq) 1614 + { 1615 + vq->pci_config = *cfg; 1616 + } 1617 + 1618 + static void restore_vq_config(struct virtio_pci_common_cfg *cfg, 1619 + struct virtqueue *vq) 1620 + { 1621 + /* Only restore the per-vq part */ 1622 + size_t off = offsetof(struct virtio_pci_common_cfg, queue_size); 1623 + 1624 + memcpy((void *)cfg + off, (void *)&vq->pci_config + off, 1625 + sizeof(*cfg) - off); 1626 + } 1627 + 1628 + /* 1629 + * 4.1.4.3.2: 1630 + * 1631 + * The driver MUST configure the other virtqueue fields before 1632 + * enabling the virtqueue with queue_enable. 1633 + * 1634 + * When they enable the virtqueue, we check that their setup is valid. 
1635 + */ 1636 + static void check_virtqueue(struct device *d, struct virtqueue *vq) 1637 + { 1638 + /* Because lguest is 32 bit, all the descriptor high bits must be 0 */ 1639 + if (vq->pci_config.queue_desc_hi 1640 + || vq->pci_config.queue_avail_hi 1641 + || vq->pci_config.queue_used_hi) 1642 + bad_driver_vq(vq, "invalid 64-bit queue address"); 1643 + 1644 + /* 1645 + * 2.4.1: 1646 + * 1647 + * The driver MUST ensure that the physical address of the first byte 1648 + * of each virtqueue part is a multiple of the specified alignment 1649 + * value in the above table. 1650 + */ 1651 + if (vq->pci_config.queue_desc_lo % 16 1652 + || vq->pci_config.queue_avail_lo % 2 1653 + || vq->pci_config.queue_used_lo % 4) 1654 + bad_driver_vq(vq, "invalid alignment in queue addresses"); 1655 + 1656 + /* Initialize the virtqueue and check they're all in range. */ 1657 + vq->vring.num = vq->pci_config.queue_size; 1658 + vq->vring.desc = check_pointer(vq->dev, 1659 + vq->pci_config.queue_desc_lo, 1660 + sizeof(*vq->vring.desc) * vq->vring.num); 1661 + vq->vring.avail = check_pointer(vq->dev, 1662 + vq->pci_config.queue_avail_lo, 1663 + sizeof(*vq->vring.avail) 1664 + + (sizeof(vq->vring.avail->ring[0]) 1665 + * vq->vring.num)); 1666 + vq->vring.used = check_pointer(vq->dev, 1667 + vq->pci_config.queue_used_lo, 1668 + sizeof(*vq->vring.used) 1669 + + (sizeof(vq->vring.used->ring[0]) 1670 + * vq->vring.num)); 1671 + 1672 + /* 1673 + * 2.4.9.1: 1674 + * 1675 + * The driver MUST initialize flags in the used ring to 0 1676 + * when allocating the used ring. 1677 + */ 1678 + if (vq->vring.used->flags != 0) 1679 + bad_driver_vq(vq, "invalid initial used.flags %#x", 1680 + vq->vring.used->flags); 1681 + } 1682 + 1683 + static void start_virtqueue(struct virtqueue *vq) 1181 1684 { 1182 1685 /* 1183 1686 * Create stack for thread. Since the stack grows downwards, we point 1184 1687 * the stack pointer to the end of this region. 
1185 1688 */ 1186 1689 char *stack = malloc(32768); 1187 - unsigned long args[] = { LHREQ_EVENTFD, 1188 - vq->config.pfn*getpagesize(), 0 }; 1189 1690 1190 1691 /* Create a zero-initialized eventfd. */ 1191 1692 vq->eventfd = eventfd(0, 0); 1192 1693 if (vq->eventfd < 0) 1193 1694 err(1, "Creating eventfd"); 1194 - args[2] = vq->eventfd; 1195 - 1196 - /* 1197 - * Attach an eventfd to this virtqueue: it will go off when the Guest 1198 - * does an LHCALL_NOTIFY for this vq. 1199 - */ 1200 - if (write(lguest_fd, &args, sizeof(args)) != 0) 1201 - err(1, "Attaching eventfd"); 1202 1695 1203 1696 /* 1204 1697 * CLONE_VM: because it has to access the Guest memory, and SIGCHLD so ··· 1859 1048 vq->thread = clone(do_thread, stack + 32768, CLONE_VM | SIGCHLD, vq); 1860 1049 if (vq->thread == (pid_t)-1) 1861 1050 err(1, "Creating clone"); 1862 - 1863 - /* We close our local copy now the child has it. */ 1864 - close(vq->eventfd); 1865 1051 } 1866 1052 1867 - static void start_device(struct device *dev) 1053 + static void start_virtqueues(struct device *d) 1868 1054 { 1869 - unsigned int i; 1870 1055 struct virtqueue *vq; 1871 1056 1872 - verbose("Device %s OK: offered", dev->name); 1873 - for (i = 0; i < dev->feature_len; i++) 1874 - verbose(" %02x", get_feature_bits(dev)[i]); 1875 - verbose(", accepted"); 1876 - for (i = 0; i < dev->feature_len; i++) 1877 - verbose(" %02x", get_feature_bits(dev) 1878 - [dev->feature_len+i]); 1879 - 1880 - for (vq = dev->vq; vq; vq = vq->next) { 1881 - if (vq->service) 1882 - create_thread(vq); 1883 - } 1884 - dev->running = true; 1885 - } 1886 - 1887 - static void cleanup_devices(void) 1888 - { 1889 - struct device *dev; 1890 - 1891 - for (dev = devices.dev; dev; dev = dev->next) 1892 - reset_device(dev); 1893 - 1894 - /* If we saved off the original terminal settings, restore them now. 
*/ 1895 - if (orig_term.c_lflag & (ISIG|ICANON|ECHO)) 1896 - tcsetattr(STDIN_FILENO, TCSANOW, &orig_term); 1897 - } 1898 - 1899 - /* When the Guest tells us they updated the status field, we handle it. */ 1900 - static void update_device_status(struct device *dev) 1901 - { 1902 - /* A zero status is a reset, otherwise it's a set of flags. */ 1903 - if (dev->desc->status == 0) 1904 - reset_device(dev); 1905 - else if (dev->desc->status & VIRTIO_CONFIG_S_FAILED) { 1906 - warnx("Device %s configuration FAILED", dev->name); 1907 - if (dev->running) 1908 - reset_device(dev); 1909 - } else { 1910 - if (dev->running) 1911 - err(1, "Device %s features finalized twice", dev->name); 1912 - start_device(dev); 1057 + for (vq = d->vq; vq; vq = vq->next) { 1058 + if (vq->pci_config.queue_enable) 1059 + start_virtqueue(vq); 1913 1060 } 1914 1061 } 1915 1062 1916 - /*L:215 1917 - * This is the generic routine we call when the Guest uses LHCALL_NOTIFY. In 1918 - * particular, it's used to notify us of device status changes during boot. 1919 - */ 1920 - static void handle_output(unsigned long addr) 1063 + static void emulate_mmio_write(struct device *d, u32 off, u32 val, u32 mask) 1921 1064 { 1922 - struct device *i; 1065 + struct virtqueue *vq; 1923 1066 1924 - /* Check each device. 
*/ 1925 - for (i = devices.dev; i; i = i->next) { 1926 - struct virtqueue *vq; 1067 + switch (off) { 1068 + case offsetof(struct virtio_pci_mmio, cfg.device_feature_select): 1069 + /* 1070 + * 4.1.4.3.1: 1071 + * 1072 + * The device MUST present the feature bits it is offering in 1073 + * device_feature, starting at bit device_feature_select ∗ 32 1074 + * for any device_feature_select written by the driver 1075 + */ 1076 + if (val == 0) 1077 + d->mmio->cfg.device_feature = d->features; 1078 + else if (val == 1) 1079 + d->mmio->cfg.device_feature = (d->features >> 32); 1080 + else 1081 + d->mmio->cfg.device_feature = 0; 1082 + goto feature_write_through32; 1083 + case offsetof(struct virtio_pci_mmio, cfg.guest_feature_select): 1084 + if (val > 1) 1085 + bad_driver(d, "Unexpected driver select %u", val); 1086 + goto feature_write_through32; 1087 + case offsetof(struct virtio_pci_mmio, cfg.guest_feature): 1088 + if (d->mmio->cfg.guest_feature_select == 0) { 1089 + d->features_accepted &= ~((u64)0xFFFFFFFF); 1090 + d->features_accepted |= val; 1091 + } else { 1092 + assert(d->mmio->cfg.guest_feature_select == 1); 1093 + d->features_accepted &= 0xFFFFFFFF; 1094 + d->features_accepted |= ((u64)val) << 32; 1095 + } 1096 + /* 1097 + * 2.2.1: 1098 + * 1099 + * The driver MUST NOT accept a feature which the device did 1100 + * not offer 1101 + */ 1102 + if (d->features_accepted & ~d->features) 1103 + bad_driver(d, "over-accepted features %#llx of %#llx", 1104 + d->features_accepted, d->features); 1105 + goto feature_write_through32; 1106 + case offsetof(struct virtio_pci_mmio, cfg.device_status): { 1107 + u8 prev; 1108 + 1109 + verbose("%s: device status -> %#x\n", d->name, val); 1110 + /* 1111 + * 4.1.4.3.1: 1112 + * 1113 + * The device MUST reset when 0 is written to device_status, 1114 + * and present a 0 in device_status once that is done. 
1115 + */ 1116 + if (val == 0) { 1117 + reset_device(d); 1118 + goto write_through8; 1119 + } 1120 + 1121 + /* 2.1.1: The driver MUST NOT clear a device status bit. */ 1122 + if (d->mmio->cfg.device_status & ~val) 1123 + bad_driver(d, "unset of device status bit %#x -> %#x", 1124 + d->mmio->cfg.device_status, val); 1927 1125 1928 1126 /* 1929 - * Notifications to device descriptors mean they updated the 1930 - * device status. 1127 + * 2.1.2: 1128 + * 1129 + * The device MUST NOT consume buffers or notify the driver 1130 + * before DRIVER_OK. 1931 1131 */ 1932 1132 if (val & VIRTIO_CONFIG_S_DRIVER_OK 1933 - if (from_guest_phys(addr) == i->desc) { 1934 - update_device_status(i); 1935 - return; 1936 - } 1132 + && !(d->mmio->cfg.device_status & VIRTIO_CONFIG_S_DRIVER_OK)) 1133 + start_virtqueues(d); 1936 1135 1937 - /* Devices should not be used before features are finalized. */ 1938 - for (vq = i->vq; vq; vq = vq->next) { 1939 - if (addr != vq->config.pfn*getpagesize()) 1940 - continue; 1941 - errx(1, "Notification on %s before setup!", i->name); 1136 + /* 1137 + * 3.1.1: 1138 + * 1139 + * The driver MUST follow this sequence to initialize a device: 1140 + * - Reset the device. 1141 + * - Set the ACKNOWLEDGE status bit: the guest OS has 1142 + * noticed the device. 1143 + * - Set the DRIVER status bit: the guest OS knows how 1144 + * to drive the device. 1145 + * - Read device feature bits, and write the subset 1146 + * of feature bits understood by the OS and driver 1147 + * to the device. During this step the driver MAY 1148 + * read (but MUST NOT write) the device-specific 1149 + * configuration fields to check that it can 1150 + * support the device before accepting it. 1151 + * - Set the FEATURES_OK status bit. The driver 1152 + * MUST not accept new feature bits after this 1153 + * step. 
1154 + * - Re-read device status to ensure the FEATURES_OK 1155 + * bit is still set: otherwise, the device does 1156 + * not support our subset of features and the 1157 + * device is unusable. 1158 + * - Perform device-specific setup, including 1159 + * discovery of virtqueues for the device, 1160 + * optional per-bus setup, reading and possibly 1161 + * writing the device’s virtio configuration 1162 + * space, and population of virtqueues. 1163 + * - Set the DRIVER_OK status bit. At this point the 1164 + * device is “live”. 1165 + */ 1166 + prev = 0; 1167 + switch (val & ~d->mmio->cfg.device_status) { 1168 + case VIRTIO_CONFIG_S_DRIVER_OK: 1169 + prev |= VIRTIO_CONFIG_S_FEATURES_OK; /* fall thru */ 1170 + case VIRTIO_CONFIG_S_FEATURES_OK: 1171 + prev |= VIRTIO_CONFIG_S_DRIVER; /* fall thru */ 1172 + case VIRTIO_CONFIG_S_DRIVER: 1173 + prev |= VIRTIO_CONFIG_S_ACKNOWLEDGE; /* fall thru */ 1174 + case VIRTIO_CONFIG_S_ACKNOWLEDGE: 1175 + break; 1176 + default: 1177 + bad_driver(d, "unknown device status bit %#x -> %#x", 1178 + d->mmio->cfg.device_status, val); 1942 1179 } 1180 + if (d->mmio->cfg.device_status != prev) 1181 + bad_driver(d, "unexpected status transition %#x -> %#x", 1182 + d->mmio->cfg.device_status, val); 1183 + 1184 + /* If they just wrote FEATURES_OK, we make sure they read */ 1185 + switch (val & ~d->mmio->cfg.device_status) { 1186 + case VIRTIO_CONFIG_S_FEATURES_OK: 1187 + d->wrote_features_ok = true; 1188 + break; 1189 + case VIRTIO_CONFIG_S_DRIVER_OK: 1190 + if (d->wrote_features_ok) 1191 + bad_driver(d, "did not re-read FEATURES_OK"); 1192 + break; 1193 + } 1194 + goto write_through8; 1195 + } 1196 + case offsetof(struct virtio_pci_mmio, cfg.queue_select): 1197 + vq = vq_by_num(d, val); 1198 + /* 1199 + * 4.1.4.3.1: 1200 + * 1201 + * The device MUST present a 0 in queue_size if the virtqueue 1202 + * corresponding to the current queue_select is unavailable. 
1203 + */ 1204 + if (!vq) { 1205 + d->mmio->cfg.queue_size = 0; 1206 + goto write_through16; 1207 + } 1208 + /* Save registers for old vq, if it was a valid vq */ 1209 + if (d->mmio->cfg.queue_size) 1210 + save_vq_config(&d->mmio->cfg, 1211 + vq_by_num(d, d->mmio->cfg.queue_select)); 1212 + /* Restore the registers for the queue they asked for */ 1213 + restore_vq_config(&d->mmio->cfg, vq); 1214 + goto write_through16; 1215 + case offsetof(struct virtio_pci_mmio, cfg.queue_size): 1216 + /* 1217 + * 4.1.4.3.2: 1218 + * 1219 + * The driver MUST NOT write a value which is not a power of 2 1220 + * to queue_size. 1221 + */ 1222 + if (val & (val-1)) 1223 + bad_driver(d, "invalid queue size %u", val); 1224 + if (d->mmio->cfg.queue_enable) 1225 + bad_driver(d, "changing queue size on live device"); 1226 + goto write_through16; 1227 + case offsetof(struct virtio_pci_mmio, cfg.queue_msix_vector): 1228 + bad_driver(d, "attempt to set MSIX vector to %u", val); 1229 + case offsetof(struct virtio_pci_mmio, cfg.queue_enable): { 1230 + struct virtqueue *vq = vq_by_num(d, d->mmio->cfg.queue_select); 1231 + 1232 + /* 1233 + * 4.1.4.3.2: 1234 + * 1235 + * The driver MUST NOT write a 0 to queue_enable. 1236 + */ 1237 + if (val != 1) 1238 + bad_driver(d, "setting queue_enable to %u", val); 1239 + 1240 + /* 1241 + * 3.1.1: 1242 + * 1243 + * 7. Perform device-specific setup, including discovery of 1244 + * virtqueues for the device, optional per-bus setup, 1245 + * reading and possibly writing the device’s virtio 1246 + * configuration space, and population of virtqueues. 1247 + * 8. Set the DRIVER_OK status bit. 1248 + * 1249 + * All our devices require all virtqueues to be enabled, so 1250 + * they should have done that before setting DRIVER_OK. 
1251 + */ 1252 + if (d->mmio->cfg.device_status & VIRTIO_CONFIG_S_DRIVER_OK) 1253 + bad_driver(d, "enabling vq after DRIVER_OK"); 1254 + 1255 + d->mmio->cfg.queue_enable = val; 1256 + save_vq_config(&d->mmio->cfg, vq); 1257 + check_virtqueue(d, vq); 1258 + goto write_through16; 1259 + } 1260 + case offsetof(struct virtio_pci_mmio, cfg.queue_notify_off): 1261 + bad_driver(d, "attempt to write to queue_notify_off"); 1262 + case offsetof(struct virtio_pci_mmio, cfg.queue_desc_lo): 1263 + case offsetof(struct virtio_pci_mmio, cfg.queue_desc_hi): 1264 + case offsetof(struct virtio_pci_mmio, cfg.queue_avail_lo): 1265 + case offsetof(struct virtio_pci_mmio, cfg.queue_avail_hi): 1266 + case offsetof(struct virtio_pci_mmio, cfg.queue_used_lo): 1267 + case offsetof(struct virtio_pci_mmio, cfg.queue_used_hi): 1268 + /* 1269 + * 4.1.4.3.2: 1270 + * 1271 + * The driver MUST configure the other virtqueue fields before 1272 + * enabling the virtqueue with queue_enable. 1273 + */ 1274 + if (d->mmio->cfg.queue_enable) 1275 + bad_driver(d, "changing queue on live device"); 1276 + 1277 + /* 1278 + * 3.1.1: 1279 + * 1280 + * The driver MUST follow this sequence to initialize a device: 1281 + *... 1282 + * 5. Set the FEATURES_OK status bit. The driver MUST not 1283 + * accept new feature bits after this step. 1284 + */ 1285 + if (!(d->mmio->cfg.device_status & VIRTIO_CONFIG_S_FEATURES_OK)) 1286 + bad_driver(d, "setting up vq before FEATURES_OK"); 1287 + 1288 + /* 1289 + * 6. Re-read device status to ensure the FEATURES_OK bit is 1290 + * still set... 
1291 + */ 1292 + if (d->wrote_features_ok) 1293 + bad_driver(d, "didn't re-read FEATURES_OK before setup"); 1294 + 1295 + goto write_through32; 1296 + case offsetof(struct virtio_pci_mmio, notify): 1297 + vq = vq_by_num(d, val); 1298 + if (!vq) 1299 + bad_driver(d, "Invalid vq notification on %u", val); 1300 + /* Notify the process handling this vq by adding 1 to eventfd */ 1301 + write(vq->eventfd, "\1\0\0\0\0\0\0\0", 8); 1302 + goto write_through16; 1303 + case offsetof(struct virtio_pci_mmio, isr): 1304 + bad_driver(d, "Unexpected write to isr"); 1305 + /* Weird corner case: write to emerg_wr of console */ 1306 + case sizeof(struct virtio_pci_mmio) 1307 + + offsetof(struct virtio_console_config, emerg_wr): 1308 + if (strcmp(d->name, "console") == 0) { 1309 + char c = val; 1310 + write(STDOUT_FILENO, &c, 1); 1311 + goto write_through32; 1312 + } 1313 + /* Fall through... */ 1314 + default: 1315 + /* 1316 + * 4.1.4.3.2: 1317 + * 1318 + * The driver MUST NOT write to device_feature, num_queues, 1319 + * config_generation or queue_notify_off. 1320 + */ 1321 + bad_driver(d, "Unexpected write to offset %u", off); 1322 + } 1323 + 1324 + feature_write_through32: 1325 + /* 1326 + * 3.1.1: 1327 + * 1328 + * The driver MUST follow this sequence to initialize a device: 1329 + *... 1330 + * - Set the DRIVER status bit: the guest OS knows how 1331 + * to drive the device. 1332 + * - Read device feature bits, and write the subset 1333 + * of feature bits understood by the OS and driver 1334 + * to the device. 1335 + *... 1336 + * - Set the FEATURES_OK status bit. The driver MUST not 1337 + * accept new feature bits after this step. 
1338 + */ 1339 + if (!(d->mmio->cfg.device_status & VIRTIO_CONFIG_S_DRIVER)) 1340 + bad_driver(d, "feature write before VIRTIO_CONFIG_S_DRIVER"); 1341 + if (d->mmio->cfg.device_status & VIRTIO_CONFIG_S_FEATURES_OK) 1342 + bad_driver(d, "feature write after VIRTIO_CONFIG_S_FEATURES_OK"); 1343 + 1344 + /* 1345 + * 4.1.3.1: 1346 + * 1347 + * The driver MUST access each field using the “natural” access 1348 + * method, i.e. 32-bit accesses for 32-bit fields, 16-bit accesses for 1349 + * 16-bit fields and 8-bit accesses for 8-bit fields. 1350 + */ 1351 + write_through32: 1352 + if (mask != 0xFFFFFFFF) { 1353 + bad_driver(d, "non-32-bit write to offset %u (%#x)", 1354 + off, getreg(eip)); 1355 + return; 1356 + } 1357 + memcpy((char *)d->mmio + off, &val, 4); 1358 + return; 1359 + 1360 + write_through16: 1361 + if (mask != 0xFFFF) 1362 + bad_driver(d, "non-16-bit write to offset %u (%#x)", 1363 + off, getreg(eip)); 1364 + memcpy((char *)d->mmio + off, &val, 2); 1365 + return; 1366 + 1367 + write_through8: 1368 + if (mask != 0xFF) 1369 + bad_driver(d, "non-8-bit write to offset %u (%#x)", 1370 + off, getreg(eip)); 1371 + memcpy((char *)d->mmio + off, &val, 1); 1372 + return; 1373 + } 1374 + 1375 + static u32 emulate_mmio_read(struct device *d, u32 off, u32 mask) 1376 + { 1377 + u8 isr; 1378 + u32 val = 0; 1379 + 1380 + switch (off) { 1381 + case offsetof(struct virtio_pci_mmio, cfg.device_feature_select): 1382 + case offsetof(struct virtio_pci_mmio, cfg.device_feature): 1383 + case offsetof(struct virtio_pci_mmio, cfg.guest_feature_select): 1384 + case offsetof(struct virtio_pci_mmio, cfg.guest_feature): 1385 + /* 1386 + * 3.1.1: 1387 + * 1388 + * The driver MUST follow this sequence to initialize a device: 1389 + *... 1390 + * - Set the DRIVER status bit: the guest OS knows how 1391 + * to drive the device. 1392 + * - Read device feature bits, and write the subset 1393 + * of feature bits understood by the OS and driver 1394 + * to the device. 
1395 + */ 1396 + if (!(d->mmio->cfg.device_status & VIRTIO_CONFIG_S_DRIVER)) 1397 + bad_driver(d, 1398 + "feature read before VIRTIO_CONFIG_S_DRIVER"); 1399 + goto read_through32; 1400 + case offsetof(struct virtio_pci_mmio, cfg.msix_config): 1401 + bad_driver(d, "read of msix_config"); 1402 + case offsetof(struct virtio_pci_mmio, cfg.num_queues): 1403 + goto read_through16; 1404 + case offsetof(struct virtio_pci_mmio, cfg.device_status): 1405 + /* As they did read, any write of FEATURES_OK is now fine. */ 1406 + d->wrote_features_ok = false; 1407 + goto read_through8; 1408 + case offsetof(struct virtio_pci_mmio, cfg.config_generation): 1409 + /* 1410 + * 4.1.4.3.1: 1411 + * 1412 + * The device MUST present a changed config_generation after 1413 + * the driver has read a device-specific configuration value 1414 + * which has changed since any part of the device-specific 1415 + * configuration was last read. 1416 + * 1417 + * This is simple: none of our devices change config, so this 1418 + * is always 0. 1419 + */ 1420 + goto read_through8; 1421 + case offsetof(struct virtio_pci_mmio, notify): 1422 + /* 1423 + * 3.1.1: 1424 + * 1425 + * The driver MUST NOT notify the device before setting 1426 + * DRIVER_OK. 1427 + */ 1428 + if (!(d->mmio->cfg.device_status & VIRTIO_CONFIG_S_DRIVER_OK)) 1429 + bad_driver(d, "notify before VIRTIO_CONFIG_S_DRIVER_OK"); 1430 + goto read_through16; 1431 + case offsetof(struct virtio_pci_mmio, isr): 1432 + if (mask != 0xFF) 1433 + bad_driver(d, "non-8-bit read from offset %u (%#x)", 1434 + off, getreg(eip)); 1435 + isr = d->mmio->isr; 1436 + /* 1437 + * 4.1.4.5.1: 1438 + * 1439 + * The device MUST reset ISR status to 0 on driver read. 
1440 + */ 1441 + d->mmio->isr = 0; 1442 + return isr; 1443 + case offsetof(struct virtio_pci_mmio, padding): 1444 + bad_driver(d, "read from padding (%#x)", getreg(eip)); 1445 + default: 1446 + /* Read from device config space, beware unaligned overflow */ 1447 + if (off > d->mmio_size - 4) 1448 + bad_driver(d, "read past end (%#x)", getreg(eip)); 1449 + 1450 + /* 1451 + * 3.1.1: 1452 + * The driver MUST follow this sequence to initialize a device: 1453 + *... 1454 + * 3. Set the DRIVER status bit: the guest OS knows how to 1455 + * drive the device. 1456 + * 4. Read device feature bits, and write the subset of 1457 + * feature bits understood by the OS and driver to the 1458 + * device. During this step the driver MAY read (but MUST NOT 1459 + * write) the device-specific configuration fields to check 1460 + * that it can support the device before accepting it. 1461 + */ 1462 + if (!(d->mmio->cfg.device_status & VIRTIO_CONFIG_S_DRIVER)) 1463 + bad_driver(d, 1464 + "config read before VIRTIO_CONFIG_S_DRIVER"); 1465 + 1466 + if (mask == 0xFFFFFFFF) 1467 + goto read_through32; 1468 + else if (mask == 0xFFFF) 1469 + goto read_through16; 1470 + else 1471 + goto read_through8; 1943 1472 } 1944 1473 1945 1474 /* 1946 - * Early console write is done using notify on a nul-terminated string 1947 - * in Guest memory. It's also great for hacking debugging messages 1948 - * into a Guest. 1475 + * 4.1.3.1: 1476 + * 1477 + * The driver MUST access each field using the “natural” access 1478 + * method, i.e. 32-bit accesses for 32-bit fields, 16-bit accesses for 1479 + * 16-bit fields and 8-bit accesses for 8-bit fields. 
1949 1480 */ 1950 - if (addr >= guest_limit) 1951 - errx(1, "Bad NOTIFY %#lx", addr); 1481 + read_through32: 1482 + if (mask != 0xFFFFFFFF) 1483 + bad_driver(d, "non-32-bit read from offset %u (%#x)", 1484 + off, getreg(eip)); 1485 + memcpy(&val, (char *)d->mmio + off, 4); 1486 + return val; 1952 1487 1953 - write(STDOUT_FILENO, from_guest_phys(addr), 1954 - strnlen(from_guest_phys(addr), guest_limit - addr)); 1488 + read_through16: 1489 + if (mask != 0xFFFF) 1490 + bad_driver(d, "non-16-bit read from offset %u (%#x)", 1491 + off, getreg(eip)); 1492 + memcpy(&val, (char *)d->mmio + off, 2); 1493 + return val; 1494 + 1495 + read_through8: 1496 + if (mask != 0xFF) 1497 + bad_driver(d, "non-8-bit read from offset %u (%#x)", 1498 + off, getreg(eip)); 1499 + memcpy(&val, (char *)d->mmio + off, 1); 1500 + return val; 1501 + } 1502 + 1503 + static void emulate_mmio(unsigned long paddr, const u8 *insn) 1504 + { 1505 + u32 val, off, mask = 0xFFFFFFFF, insnlen = 0; 1506 + struct device *d = find_mmio_region(paddr, &off); 1507 + unsigned long args[] = { LHREQ_TRAP, 14 }; 1508 + 1509 + if (!d) { 1510 + warnx("MMIO touching %#08lx (not a device)", paddr); 1511 + goto reinject; 1512 + } 1513 + 1514 + /* Prefix makes it a 16 bit op */ 1515 + if (insn[0] == 0x66) { 1516 + mask = 0xFFFF; 1517 + insnlen++; 1518 + } 1519 + 1520 + /* iowrite */ 1521 + if (insn[insnlen] == 0x89) { 1522 + /* Next byte is r/m byte: bits 3-5 are register. */ 1523 + val = getreg_num((insn[insnlen+1] >> 3) & 0x7, mask); 1524 + emulate_mmio_write(d, off, val, mask); 1525 + insnlen += 2 + insn_displacement_len(insn[insnlen+1]); 1526 + } else if (insn[insnlen] == 0x8b) { /* ioread */ 1527 + /* Next byte is r/m byte: bits 3-5 are register. 
*/ 1528 + val = emulate_mmio_read(d, off, mask); 1529 + setreg_num((insn[insnlen+1] >> 3) & 0x7, val, mask); 1530 + insnlen += 2 + insn_displacement_len(insn[insnlen+1]); 1531 + } else if (insn[0] == 0x88) { /* 8-bit iowrite */ 1532 + mask = 0xff; 1533 + /* Next byte is r/m byte: bits 3-5 are register. */ 1534 + val = getreg_num((insn[1] >> 3) & 0x7, mask); 1535 + emulate_mmio_write(d, off, val, mask); 1536 + insnlen = 2 + insn_displacement_len(insn[1]); 1537 + } else if (insn[0] == 0x8a) { /* 8-bit ioread */ 1538 + mask = 0xff; 1539 + val = emulate_mmio_read(d, off, mask); 1540 + setreg_num((insn[1] >> 3) & 0x7, val, mask); 1541 + insnlen = 2 + insn_displacement_len(insn[1]); 1542 + } else { 1543 + warnx("Unknown MMIO instruction touching %#08lx:" 1544 + " %02x %02x %02x %02x at %u", 1545 + paddr, insn[0], insn[1], insn[2], insn[3], getreg(eip)); 1546 + reinject: 1547 + /* Inject trap into Guest. */ 1548 + if (write(lguest_fd, args, sizeof(args)) < 0) 1549 + err(1, "Reinjecting trap 14 for fault at %#x", 1550 + getreg(eip)); 1551 + return; 1552 + } 1553 + 1554 + /* Finally, we've "done" the instruction, so move past it. */ 1555 + setreg(eip, getreg(eip) + insnlen); 1955 1556 } 1956 1557 1957 1558 /*L:190 ··· 2373 1150 * device" so the Launcher can keep track of it. We have common helper 2374 1151 * routines to allocate and manage them. 2375 1152 */ 2376 - 2377 - /* 2378 - * The layout of the device page is a "struct lguest_device_desc" followed by a 2379 - * number of virtqueue descriptors, then two sets of feature bits, then an 2380 - * array of configuration bytes. This routine returns the configuration 2381 - * pointer. 
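The decoder in emulate_mmio() above recognizes just the mov opcodes 0x88/0x89/0x8a/0x8b (plus the 0x66 operand-size prefix) and pulls the register operand out of bits 3-5 of the ModRM byte that follows the opcode. A sketch of those two decoding steps, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the ModRM decode used by emulate_mmio():
 * for mov opcodes 0x88/0x89/0x8a/0x8b, the byte after the opcode is
 * the ModRM byte, and bits 3-5 name the register operand. */
static unsigned modrm_reg(uint8_t modrm)
{
	return (modrm >> 3) & 0x7;
}

/* A 0x66 prefix shrinks the operand from 32 to 16 bits, which the
 * launcher represents as a mask applied to the register value. */
static uint32_t operand_mask(const uint8_t *insn)
{
	return insn[0] == 0x66 ? 0xFFFF : 0xFFFFFFFF;
}
```

Anything outside this tiny repertoire is reinjected into the Guest as trap 14, as the else branch shows.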
2382 - */ 2383 - static u8 *device_config(const struct device *dev) 1153 + static void add_pci_virtqueue(struct device *dev, 1154 + void (*service)(struct virtqueue *), 1155 + const char *name) 2384 1156 { 2385 - return (void *)(dev->desc + 1) 2386 - + dev->num_vq * sizeof(struct lguest_vqconfig) 2387 - + dev->feature_len * 2; 2388 - } 2389 - 2390 - /* 2391 - * This routine allocates a new "struct lguest_device_desc" from descriptor 2392 - * table page just above the Guest's normal memory. It returns a pointer to 2393 - * that descriptor. 2394 - */ 2395 - static struct lguest_device_desc *new_dev_desc(u16 type) 2396 - { 2397 - struct lguest_device_desc d = { .type = type }; 2398 - void *p; 2399 - 2400 - /* Figure out where the next device config is, based on the last one. */ 2401 - if (devices.lastdev) 2402 - p = device_config(devices.lastdev) 2403 - + devices.lastdev->desc->config_len; 2404 - else 2405 - p = devices.descpage; 2406 - 2407 - /* We only have one page for all the descriptors. */ 2408 - if (p + sizeof(d) > (void *)devices.descpage + getpagesize()) 2409 - errx(1, "Too many devices"); 2410 - 2411 - /* p might not be aligned, so we memcpy in. */ 2412 - return memcpy(p, &d, sizeof(d)); 2413 - } 2414 - 2415 - /* 2416 - * Each device descriptor is followed by the description of its virtqueues. We 2417 - * specify how many descriptors the virtqueue is to have. 2418 - */ 2419 - static void add_virtqueue(struct device *dev, unsigned int num_descs, 2420 - void (*service)(struct virtqueue *)) 2421 - { 2422 - unsigned int pages; 2423 1157 struct virtqueue **i, *vq = malloc(sizeof(*vq)); 2424 - void *p; 2425 - 2426 - /* First we need some memory for this virtqueue. 
*/ 2427 - pages = (vring_size(num_descs, LGUEST_VRING_ALIGN) + getpagesize() - 1) 2428 - / getpagesize(); 2429 - p = get_pages(pages); 2430 1158 2431 1159 /* Initialize the virtqueue */ 2432 1160 vq->next = NULL; 2433 1161 vq->last_avail_idx = 0; 2434 1162 vq->dev = dev; 1163 + vq->name = name; 2435 1164 2436 1165 /* 2437 1166 * This is the routine the service thread will run, and its Process ID ··· 2393 1218 vq->thread = (pid_t)-1; 2394 1219 2395 1220 /* Initialize the configuration. */ 2396 - vq->config.num = num_descs; 2397 - vq->config.irq = devices.next_irq++; 2398 - vq->config.pfn = to_guest_phys(p) / getpagesize(); 1221 + reset_vq_pci_config(vq); 1222 + vq->pci_config.queue_notify_off = 0; 2399 1223 2400 - /* Initialize the vring. */ 2401 - vring_init(&vq->vring, num_descs, p, LGUEST_VRING_ALIGN); 2402 - 2403 - /* 2404 - * Append virtqueue to this device's descriptor. We use 2405 - * device_config() to get the end of the device's current virtqueues; 2406 - * we check that we haven't added any config or feature information 2407 - * yet, otherwise we'd be overwriting them. 2408 - */ 2409 - assert(dev->desc->config_len == 0 && dev->desc->feature_len == 0); 2410 - memcpy(device_config(dev), &vq->config, sizeof(vq->config)); 2411 - dev->num_vq++; 2412 - dev->desc->num_vq++; 2413 - 2414 - verbose("Virtqueue page %#lx\n", to_guest_phys(p)); 1224 + /* Add one to the number of queues */ 1225 + vq->dev->mmio->cfg.num_queues++; 2415 1226 2416 1227 /* 2417 1228 * Add to tail of list, so dev->vq is first vq, dev->vq->next is ··· 2407 1246 *i = vq; 2408 1247 } 2409 1248 2410 - /* 2411 - * The first half of the feature bitmask is for us to advertise features. The 2412 - * second half is for the Guest to accept features. 
*/ 2414 - static void add_feature(struct device *dev, unsigned bit) 1249 + /* The Guest accesses the feature bits via the PCI common config MMIO region */ 1250 + static void add_pci_feature(struct device *dev, unsigned bit) 2415 1251 { 2416 - u8 *features = get_feature_bits(dev); 1252 + dev->features |= (1ULL << bit); 1253 + } 2417 1254 2418 - /* We can't extend the feature bits once we've added config bytes */ 2419 - if (dev->desc->feature_len <= bit / CHAR_BIT) { 2420 - assert(dev->desc->config_len == 0); 2421 - dev->feature_len = dev->desc->feature_len = (bit/CHAR_BIT) + 1; 2422 - } 1255 + /* For devices with no config. */ 1256 + static void no_device_config(struct device *dev) 1257 + { 1258 + dev->mmio_addr = get_mmio_region(dev->mmio_size); 2423 1259 2424 - features[bit / CHAR_BIT] |= (1 << (bit % CHAR_BIT)); 1260 + dev->config.bar[0] = dev->mmio_addr; 1261 + /* Bottom 4 bits must be zero */ 1262 + assert(!(dev->config.bar[0] & 0xF)); 1263 + } 1264 + 1265 + /* This puts the device config into BAR0 */ 1266 + static void set_device_config(struct device *dev, const void *conf, size_t len) 1267 + { 1268 + /* Set up BAR 0 */ 1269 + dev->mmio_size += len; 1270 + dev->mmio = realloc(dev->mmio, dev->mmio_size); 1271 + memcpy(dev->mmio + 1, conf, len); 1272 + 1273 + /* 1274 + * 4.1.4.6: 1275 + * 1276 + * The device MUST present at least one VIRTIO_PCI_CAP_DEVICE_CFG 1277 + * capability for any device type which has a device-specific 1278 + * configuration. 1279 + */ 1280 + /* Hook up device cfg */ 1281 + dev->config.cfg_access.cap.cap_next 1282 + = offsetof(struct pci_config, device); 1283 + 1284 + /* 1285 + * 4.1.4.6.1: 1286 + * 1287 + * The offset for the device-specific configuration MUST be 4-byte 1288 + * aligned. 1289 + */ 1290 + assert(dev->config.cfg_access.cap.cap_next % 4 == 0); 1291 + 1292 + /* Fix up device cfg field length. 
*/ 1293 + dev->config.device.length = len; 1294 + 1295 + /* The rest is the same as the no-config case */ 1296 + no_device_config(dev); 1297 + } 1298 + 1299 + static void init_cap(struct virtio_pci_cap *cap, size_t caplen, int type, 1300 + size_t bar_offset, size_t bar_bytes, u8 next) 1301 + { 1302 + cap->cap_vndr = PCI_CAP_ID_VNDR; 1303 + cap->cap_next = next; 1304 + cap->cap_len = caplen; 1305 + cap->cfg_type = type; 1306 + cap->bar = 0; 1307 + memset(cap->padding, 0, sizeof(cap->padding)); 1308 + cap->offset = bar_offset; 1309 + cap->length = bar_bytes; 2425 1310 } 2426 1311 2427 1312 /* 2428 - * This routine sets the configuration fields for an existing device's 2429 - * descriptor. It only works for the last device, but that's OK because that's 2430 - * how we use it. 1313 + * This sets up the pci_config structure, as defined in the virtio 1.0 1314 + * standard (and PCI standard). 2431 1315 */ 2432 - static void set_config(struct device *dev, unsigned len, const void *conf) 1316 + static void init_pci_config(struct pci_config *pci, u16 type, 1317 + u8 class, u8 subclass) 2433 1318 { 2434 - /* Check we haven't overflowed our single page. */ 2435 - if (device_config(dev) + len > devices.descpage + getpagesize()) 2436 - errx(1, "Too many devices"); 1319 + size_t bar_offset, bar_len; 2437 1320 2438 - /* Copy in the config information, and store the length. */ 2439 - memcpy(device_config(dev), conf, len); 2440 - dev->desc->config_len = len; 1321 + /* 1322 + * 4.1.4.4.1: 1323 + * 1324 + * The device MUST either present notify_off_multiplier as an even 1325 + * power of 2, or present notify_off_multiplier as 0. 1326 + * 1327 + * 2.1.2: 1328 + * 1329 + * The device MUST initialize device status to 0 upon reset. 1330 + */ 1331 + memset(pci, 0, sizeof(*pci)); 2441 1332 2442 - /* Size must fit in config_len field (8 bits)! 
*/ 2443 - assert(dev->desc->config_len == len); 1333 + /* 4.1.2.1: Devices MUST have the PCI Vendor ID 0x1AF4 */ 1334 + pci->vendor_id = 0x1AF4; 1335 + /* 4.1.2.1: ... PCI Device ID calculated by adding 0x1040 ... */ 1336 + pci->device_id = 0x1040 + type; 1337 + 1338 + /* 1339 + * PCI has specific codes for different types of devices. 1340 + * Linux doesn't care, but it's a good clue for people looking 1341 + * at the device. 1342 + */ 1343 + pci->class = class; 1344 + pci->subclass = subclass; 1345 + 1346 + /* 1347 + * 4.1.2.1: 1348 + * 1349 + * Non-transitional devices SHOULD have a PCI Revision ID of 1 or 1350 + * higher 1351 + */ 1352 + pci->revid = 1; 1353 + 1354 + /* 1355 + * 4.1.2.1: 1356 + * 1357 + * Non-transitional devices SHOULD have a PCI Subsystem Device ID of 1358 + * 0x40 or higher. 1359 + */ 1360 + pci->subsystem_device_id = 0x40; 1361 + 1362 + /* We use our dummy interrupt controller, and irq_line is the irq */ 1363 + pci->irq_line = devices.next_irq++; 1364 + pci->irq_pin = 0; 1365 + 1366 + /* Support for extended capabilities. */ 1367 + pci->status = (1 << 4); 1368 + 1369 + /* Link them in. */ 1370 + /* 1371 + * 4.1.4.3.1: 1372 + * 1373 + * The device MUST present at least one common configuration 1374 + * capability. 1375 + */ 1376 + pci->capabilities = offsetof(struct pci_config, common); 1377 + 1378 + /* 4.1.4.3.1 ... offset MUST be 4-byte aligned. */ 1379 + assert(pci->capabilities % 4 == 0); 1380 + 1381 + bar_offset = offsetof(struct virtio_pci_mmio, cfg); 1382 + bar_len = sizeof(((struct virtio_pci_mmio *)0)->cfg); 1383 + init_cap(&pci->common, sizeof(pci->common), VIRTIO_PCI_CAP_COMMON_CFG, 1384 + bar_offset, bar_len, 1385 + offsetof(struct pci_config, notify)); 1386 + 1387 + /* 1388 + * 4.1.4.4.1: 1389 + * 1390 + * The device MUST present at least one notification capability. 
1391 + */ 1392 + bar_offset += bar_len; 1393 + bar_len = sizeof(((struct virtio_pci_mmio *)0)->notify); 1394 + 1395 + /* 1396 + * 4.1.4.4.1: 1397 + * 1398 + * The cap.offset MUST be 2-byte aligned. 1399 + */ 1400 + assert(pci->common.cap_next % 2 == 0); 1401 + 1402 + /* FIXME: Use a non-zero notify_off, for per-queue notification? */ 1403 + /* 1404 + * 4.1.4.4.1: 1405 + * 1406 + * The value cap.length presented by the device MUST be at least 2 and 1407 + * MUST be large enough to support queue notification offsets for all 1408 + * supported queues in all possible configurations. 1409 + */ 1410 + assert(bar_len >= 2); 1411 + 1412 + init_cap(&pci->notify.cap, sizeof(pci->notify), 1413 + VIRTIO_PCI_CAP_NOTIFY_CFG, 1414 + bar_offset, bar_len, 1415 + offsetof(struct pci_config, isr)); 1416 + 1417 + bar_offset += bar_len; 1418 + bar_len = sizeof(((struct virtio_pci_mmio *)0)->isr); 1419 + /* 1420 + * 4.1.4.5.1: 1421 + * 1422 + * The device MUST present at least one VIRTIO_PCI_CAP_ISR_CFG 1423 + * capability. 1424 + */ 1425 + init_cap(&pci->isr, sizeof(pci->isr), 1426 + VIRTIO_PCI_CAP_ISR_CFG, 1427 + bar_offset, bar_len, 1428 + offsetof(struct pci_config, cfg_access)); 1429 + 1430 + /* 1431 + * 4.1.4.7.1: 1432 + * 1433 + * The device MUST present at least one VIRTIO_PCI_CAP_PCI_CFG 1434 + * capability. 1435 + */ 1436 + /* This doesn't have any presence in the BAR */ 1437 + init_cap(&pci->cfg_access.cap, sizeof(pci->cfg_access), 1438 + VIRTIO_PCI_CAP_PCI_CFG, 1439 + 0, 0, 0); 1440 + 1441 + bar_offset += bar_len + sizeof(((struct virtio_pci_mmio *)0)->padding); 1442 + assert(bar_offset == sizeof(struct virtio_pci_mmio)); 1443 + 1444 + /* 1445 + * This gets sewn in and length set in set_device_config(). 1446 + * Some devices don't have a device configuration interface, so 1447 + * we never expose this if we don't call set_device_config(). 
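init_pci_config() above strings the four mandatory capabilities (common, notify, ISR, cfg_access) into a singly linked list inside PCI config space: pci->capabilities holds the offset of the first one, and each cap_next holds the offset of the next, with 0 terminating the chain. A standalone sketch of walking such a chain — illustrative only, using the PCI rule that a capability's second byte is its next pointer:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the capability chain init_pci_config()
 * builds: each capability lives at a fixed offset in config space,
 * its byte 0 is the capability ID, its byte 1 is the offset of the
 * next capability, and an offset of 0 terminates the list. */
static unsigned count_caps(const uint8_t *cfg, uint8_t first)
{
	unsigned n = 0;
	uint8_t off;

	for (off = first; off; off = cfg[off + 1])
		n++;
	return n;
}
```

This is why the code can "sew in" the device-specific capability later: set_device_config() just points cfg_access's cap_next at it.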
1448 + */ 1449 + init_cap(&pci->device, sizeof(pci->device), VIRTIO_PCI_CAP_DEVICE_CFG, 1450 + bar_offset, 0, 0); 2444 1451 } 2445 1452 2446 1453 /* 2447 - * This routine does all the creation and setup of a new device, including 2448 - * calling new_dev_desc() to allocate the descriptor and device memory. We 2449 - * don't actually start the service threads until later. 1454 + * This routine does all the creation and setup of a new device, but we don't 1455 + * actually place the MMIO region until we know the size (if any) of the 1456 + * device-specific config. And we don't actually start the service threads 1457 + * until later. 2450 1458 * 2451 1459 * See what I mean about userspace being boring? 2452 1460 */ 2453 - static struct device *new_device(const char *name, u16 type) 1461 + static struct device *new_pci_device(const char *name, u16 type, 1462 + u8 class, u8 subclass) 2454 1463 { 2455 1464 struct device *dev = malloc(sizeof(*dev)); 2456 1465 2457 1466 /* Now we populate the fields one at a time. */ 2458 - dev->desc = new_dev_desc(type); 2459 1467 dev->name = name; 2460 1468 dev->vq = NULL; 2461 - dev->feature_len = 0; 2462 - dev->num_vq = 0; 2463 1469 dev->running = false; 2464 - dev->next = NULL; 1470 + dev->wrote_features_ok = false; 1471 + dev->mmio_size = sizeof(struct virtio_pci_mmio); 1472 + dev->mmio = calloc(1, dev->mmio_size); 1473 + dev->features = (u64)1 << VIRTIO_F_VERSION_1; 1474 + dev->features_accepted = 0; 2465 1475 2466 - /* 2467 - * Append to device list. Prepending to a single-linked list is 2468 - * easier, but the user expects the devices to be arranged on the bus 2469 - * in command-line order. The first network device on the command line 2470 - * is eth0, the first block device /dev/vda, etc. 
2471 - */ 2472 - if (devices.lastdev) 2473 - devices.lastdev->next = dev; 2474 - else 2475 - devices.dev = dev; 2476 - devices.lastdev = dev; 1476 + if (devices.device_num + 1 >= MAX_PCI_DEVICES) 1477 + errx(1, "Can only handle 31 PCI devices"); 1478 + 1479 + init_pci_config(&dev->config, type, class, subclass); 1480 + assert(!devices.pci[devices.device_num+1]); 1481 + devices.pci[++devices.device_num] = dev; 2477 1482 2478 1483 return dev; 2479 1484 } ··· 2651 1324 static void setup_console(void) 2652 1325 { 2653 1326 struct device *dev; 1327 + struct virtio_console_config conf; 2654 1328 2655 1329 /* If we can save the initial standard input settings... */ 2656 1330 if (tcgetattr(STDIN_FILENO, &orig_term) == 0) { ··· 2664 1336 tcsetattr(STDIN_FILENO, TCSANOW, &term); 2665 1337 } 2666 1338 2667 - dev = new_device("console", VIRTIO_ID_CONSOLE); 1339 + dev = new_pci_device("console", VIRTIO_ID_CONSOLE, 0x07, 0x00); 2668 1340 2669 1341 /* We store the console state in dev->priv, and initialize it. */ 2670 1342 dev->priv = malloc(sizeof(struct console_abort)); ··· 2676 1348 * stdin. When they put something in the output queue, we write it to 2677 1349 * stdout. 2678 1350 */ 2679 - add_virtqueue(dev, VIRTQUEUE_NUM, console_input); 2680 - add_virtqueue(dev, VIRTQUEUE_NUM, console_output); 1351 + add_pci_virtqueue(dev, console_input, "input"); 1352 + add_pci_virtqueue(dev, console_output, "output"); 2681 1353 2682 - verbose("device %u: console\n", ++devices.device_num); 1354 + /* We need a configuration area for the emerg_wr early writes. */ 1355 + add_pci_feature(dev, VIRTIO_CONSOLE_F_EMERG_WRITE); 1356 + set_device_config(dev, &conf, sizeof(conf)); 1357 + 1358 + verbose("device %u: console\n", devices.device_num); 2683 1359 } 2684 1360 /*:*/ 2685 1361 ··· 2781 1449 static int get_tun_device(char tapif[IFNAMSIZ]) 2782 1450 { 2783 1451 struct ifreq ifr; 1452 + int vnet_hdr_sz; 2784 1453 int netfd; 2785 1454 2786 1455 /* Start with this zeroed. Messy but sure. 
*/ ··· 2809 1476 */ 2810 1477 ioctl(netfd, TUNSETNOCSUM, 1); 2811 1478 1479 + /* 1480 + * In virtio before 1.0 (aka legacy virtio), we added a 16-bit 1481 + * field at the end of the network header iff 1482 + * VIRTIO_NET_F_MRG_RXBUF was negotiated. For virtio 1.0, 1483 + * that became the norm, but we need to tell the tun device 1484 + * about our expanded header (which is called 1485 + * virtio_net_hdr_mrg_rxbuf in the legacy system). 1486 + */ 1487 + vnet_hdr_sz = sizeof(struct virtio_net_hdr_v1); 1488 + if (ioctl(netfd, TUNSETVNETHDRSZ, &vnet_hdr_sz) != 0) 1489 + err(1, "Setting tun header size to %u", vnet_hdr_sz); 1490 + 2812 1491 memcpy(tapif, ifr.ifr_name, IFNAMSIZ); 2813 1492 return netfd; 2814 1493 } ··· 2844 1499 net_info->tunfd = get_tun_device(tapif); 2845 1500 2846 1501 /* First we create a new network device. */ 2847 - dev = new_device("net", VIRTIO_ID_NET); 1502 + dev = new_pci_device("net", VIRTIO_ID_NET, 0x02, 0x00); 2848 1503 dev->priv = net_info; 2849 1504 2850 1505 /* Network devices need a recv and a send queue, just like console. 
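The TUNSETVNETHDRSZ comment above is about a two-byte difference: legacy virtio made the trailing num_buffers field conditional on VIRTIO_NET_F_MRG_RXBUF, while virtio 1.0 always includes it. A sketch of the two layouts — field names here are illustrative, not the kernel's uapi headers, and the sizes assume the usual 2-byte alignment of u16:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the header-size change described above.
 * Field order follows the virtio spec; names are illustrative. */
struct legacy_net_hdr {		/* pre-1.0, without MRG_RXBUF */
	uint8_t  flags, gso_type;
	uint16_t hdr_len, gso_size, csum_start, csum_offset;
};

struct net_hdr_v1 {		/* virtio 1.0: num_buffers always present */
	uint8_t  flags, gso_type;
	uint16_t hdr_len, gso_size, csum_start, csum_offset;
	uint16_t num_buffers;
};
```

That size difference is exactly why the launcher must tell the tun device about the larger header before any packets flow.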
*/ 2851 - add_virtqueue(dev, VIRTQUEUE_NUM, net_input); 2852 - add_virtqueue(dev, VIRTQUEUE_NUM, net_output); 1506 + add_pci_virtqueue(dev, net_input, "rx"); 1507 + add_pci_virtqueue(dev, net_output, "tx"); 2853 1508 2854 1509 /* 2855 1510 * We need a socket to perform the magic network ioctls to bring up the ··· 2869 1524 p = strchr(arg, ':'); 2870 1525 if (p) { 2871 1526 str2mac(p+1, conf.mac); 2872 - add_feature(dev, VIRTIO_NET_F_MAC); 1527 + add_pci_feature(dev, VIRTIO_NET_F_MAC); 2873 1528 *p = '\0'; 2874 1529 } 2875 1530 ··· 2883 1538 configure_device(ipfd, tapif, ip); 2884 1539 2885 1540 /* Expect Guest to handle everything except UFO */ 2886 - add_feature(dev, VIRTIO_NET_F_CSUM); 2887 - add_feature(dev, VIRTIO_NET_F_GUEST_CSUM); 2888 - add_feature(dev, VIRTIO_NET_F_GUEST_TSO4); 2889 - add_feature(dev, VIRTIO_NET_F_GUEST_TSO6); 2890 - add_feature(dev, VIRTIO_NET_F_GUEST_ECN); 2891 - add_feature(dev, VIRTIO_NET_F_HOST_TSO4); 2892 - add_feature(dev, VIRTIO_NET_F_HOST_TSO6); 2893 - add_feature(dev, VIRTIO_NET_F_HOST_ECN); 1541 + add_pci_feature(dev, VIRTIO_NET_F_CSUM); 1542 + add_pci_feature(dev, VIRTIO_NET_F_GUEST_CSUM); 1543 + add_pci_feature(dev, VIRTIO_NET_F_GUEST_TSO4); 1544 + add_pci_feature(dev, VIRTIO_NET_F_GUEST_TSO6); 1545 + add_pci_feature(dev, VIRTIO_NET_F_GUEST_ECN); 1546 + add_pci_feature(dev, VIRTIO_NET_F_HOST_TSO4); 1547 + add_pci_feature(dev, VIRTIO_NET_F_HOST_TSO6); 1548 + add_pci_feature(dev, VIRTIO_NET_F_HOST_ECN); 2894 1549 /* We handle indirect ring entries */ 2895 - add_feature(dev, VIRTIO_RING_F_INDIRECT_DESC); 2896 - /* We're compliant with the damn spec. */ 2897 - add_feature(dev, VIRTIO_F_ANY_LAYOUT); 2898 - set_config(dev, sizeof(conf), &conf); 1550 + add_pci_feature(dev, VIRTIO_RING_F_INDIRECT_DESC); 1551 + set_device_config(dev, &conf, sizeof(conf)); 2899 1552 2900 1553 /* We don't need the socket any more; setup is done. 
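Each add_pci_feature() call above just sets one bit in the device's 64-bit feature word (the definition appears earlier in the diff: dev->features |= (1ULL << bit)). A trivial sketch of that accumulation; the 64-bit type matters because VIRTIO_F_VERSION_1 is bit 32:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of add_pci_feature()'s bit accumulation.  The u64 is
 * required: VIRTIO_F_VERSION_1 is bit 32, past a 32-bit word. */
static uint64_t offer_feature(uint64_t features, unsigned bit)
{
	return features | ((uint64_t)1 << bit);
}
```

The Guest later writes back the subset it accepts via the common config region, which lands in features_accepted.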
*/ 2901 1554 close(ipfd); 2902 - 2903 - devices.device_num++; 2904 1555 2905 1556 if (bridging) 2906 1557 verbose("device %u: tun %s attached to bridge: %s\n", ··· 2948 1607 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 2949 1608 2950 1609 /* Copy the output header from the front of the iov (adjusts iov) */ 2951 - iov_consume(iov, out_num, &out, sizeof(out)); 1610 + iov_consume(vq->dev, iov, out_num, &out, sizeof(out)); 2952 1611 2953 1612 /* Find and trim end of iov input array, for our status byte. */ 2954 1613 in = NULL; ··· 2960 1619 } 2961 1620 } 2962 1621 if (!in) 2963 - errx(1, "Bad virtblk cmd with no room for status"); 1622 + bad_driver_vq(vq, "Bad virtblk cmd with no room for status"); 2964 1623 2965 1624 /* 2966 1625 * For historical reasons, block operations are expressed in 512 byte ··· 2968 1627 */ 2969 1628 off = out.sector * 512; 2970 1629 2971 - /* 2972 - * In general the virtio block driver is allowed to try SCSI commands. 2973 - * It'd be nice if we supported eject, for example, but we don't. 2974 - */ 2975 - if (out.type & VIRTIO_BLK_T_SCSI_CMD) { 2976 - fprintf(stderr, "Scsi commands unsupported\n"); 2977 - *in = VIRTIO_BLK_S_UNSUPP; 2978 - wlen = sizeof(*in); 2979 - } else if (out.type & VIRTIO_BLK_T_OUT) { 1630 + if (out.type & VIRTIO_BLK_T_OUT) { 2980 1631 /* 2981 1632 * Write 2982 1633 * ··· 2990 1657 /* Trim it back to the correct length */ 2991 1658 ftruncate64(vblk->fd, vblk->len); 2992 1659 /* Die, bad Guest, die. */ 2993 - errx(1, "Write past end %llu+%u", off, ret); 1660 + bad_driver_vq(vq, "Write past end %llu+%u", off, ret); 2994 1661 } 2995 1662 2996 1663 wlen = sizeof(*in); ··· 3032 1699 struct vblk_info *vblk; 3033 1700 struct virtio_blk_config conf; 3034 1701 3035 - /* Creat the device. */ 3036 - dev = new_device("block", VIRTIO_ID_BLOCK); 1702 + /* Create the device. */ 1703 + dev = new_pci_device("block", VIRTIO_ID_BLOCK, 0x01, 0x80); 3037 1704 3038 1705 /* The device has one virtqueue, where the Guest places requests. 
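As the blk_request() comment above says, block requests address the disk in 512-byte sectors for historical reasons, regardless of the backing file's real block size; both the request offset (off = out.sector * 512) and the capacity reported to the Guest use that unit. A sketch of the two conversions, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Virtio block requests are expressed in 512-byte sectors, so the
 * launcher converts a sector number to a byte offset... */
static uint64_t sector_to_offset(uint64_t sector)
{
	return sector * 512;
}

/* ...and reports the backing file's length as a sector count. */
static uint64_t len_to_capacity(uint64_t file_len)
{
	return file_len / 512;
}
```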
*/ 3039 - add_virtqueue(dev, VIRTQUEUE_NUM, blk_request); 1706 + add_pci_virtqueue(dev, blk_request, "request"); 3040 1707 3041 1708 /* Allocate the room for our own bookkeeping */ 3042 1709 vblk = dev->priv = malloc(sizeof(*vblk)); ··· 3045 1712 vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE); 3046 1713 vblk->len = lseek64(vblk->fd, 0, SEEK_END); 3047 1714 3048 - /* We support FLUSH. */ 3049 - add_feature(dev, VIRTIO_BLK_F_FLUSH); 3050 - 3051 1715 /* Tell Guest how many sectors this device has. */ 3052 1716 conf.capacity = cpu_to_le64(vblk->len / 512); 3053 1717 ··· 3052 1722 * Tell Guest not to put in too many descriptors at once: two are used 3053 1723 * for the in and out elements. 3054 1724 */ 3055 - add_feature(dev, VIRTIO_BLK_F_SEG_MAX); 1725 + add_pci_feature(dev, VIRTIO_BLK_F_SEG_MAX); 3056 1726 conf.seg_max = cpu_to_le32(VIRTQUEUE_NUM - 2); 3057 1727 3058 - /* Don't try to put whole struct: we have 8 bit limit. */ 3059 - set_config(dev, offsetof(struct virtio_blk_config, geometry), &conf); 1728 + set_device_config(dev, &conf, sizeof(struct virtio_blk_config)); 3060 1729 3061 1730 verbose("device %u: virtblock %llu sectors\n", 3062 - ++devices.device_num, le64_to_cpu(conf.capacity)); 1731 + devices.device_num, le64_to_cpu(conf.capacity)); 3063 1732 } 3064 1733 3065 1734 /*L:211 3066 - * Our random number generator device reads from /dev/random into the Guest's 1735 + * Our random number generator device reads from /dev/urandom into the Guest's 3067 1736 * input buffers. The usual case is that the Guest doesn't want random numbers 3068 - * and so has no buffers although /dev/random is still readable, whereas 1737 + * and so has no buffers although /dev/urandom is still readable, whereas 3069 1738 * console is the reverse. 3070 1739 * 3071 1740 * The same logic applies, however. ··· 3083 1754 /* First we need a buffer from the Guest's virtqueue. 
*/ 3084 1755 head = wait_for_vq_desc(vq, iov, &out_num, &in_num); 3085 1756 if (out_num) 3086 - errx(1, "Output buffers in rng?"); 1757 + bad_driver_vq(vq, "Output buffers in rng?"); 3087 1758 3088 1759 /* 3089 1760 * Just like the console write, we loop to cover the whole iovec. ··· 3092 1763 while (!iov_empty(iov, in_num)) { 3093 1764 len = readv(rng_info->rfd, iov, in_num); 3094 1765 if (len <= 0) 3095 - err(1, "Read from /dev/random gave %i", len); 3096 - iov_consume(iov, in_num, NULL, len); 1766 + err(1, "Read from /dev/urandom gave %i", len); 1767 + iov_consume(vq->dev, iov, in_num, NULL, len); 3097 1768 totlen += len; 3098 1769 } 3099 1770 ··· 3109 1780 struct device *dev; 3110 1781 struct rng_info *rng_info = malloc(sizeof(*rng_info)); 3111 1782 3112 - /* Our device's privat info simply contains the /dev/random fd. */ 3113 - rng_info->rfd = open_or_die("/dev/random", O_RDONLY); 1783 + /* Our device's private info simply contains the /dev/urandom fd. */ 1784 + rng_info->rfd = open_or_die("/dev/urandom", O_RDONLY); 3114 1785 3115 1786 /* Create the new device. */ 3116 - dev = new_device("rng", VIRTIO_ID_RNG); 1787 + dev = new_pci_device("rng", VIRTIO_ID_RNG, 0xff, 0); 3117 1788 dev->priv = rng_info; 3118 1789 3119 1790 /* The device has one virtqueue, where the Guest places inbufs. */ 3120 - add_virtqueue(dev, VIRTQUEUE_NUM, rng_input); 1791 + add_pci_virtqueue(dev, rng_input, "input"); 3121 1792 3122 - verbose("device %u: rng\n", devices.device_num++); 1793 + /* We don't have any configuration space */ 1794 + no_device_config(dev); 1795 + 1796 + verbose("device %u: rng\n", devices.device_num); 3123 1797 } 3124 1798 /* That's the end of device setup. */ 3125 1799 ··· 3152 1820 static void __attribute__((noreturn)) run_guest(void) 3153 1821 { 3154 1822 for (;;) { 3155 - unsigned long notify_addr; 1823 + struct lguest_pending notify; 3156 1824 int readval; 3157 1825 3158 1826 /* We read from the /dev/lguest device to run the Guest. 
*/ 3159 - readval = pread(lguest_fd, &notify_addr, 3160 - sizeof(notify_addr), cpu_id); 3161 - 3162 - /* One unsigned long means the Guest did HCALL_NOTIFY */ 3163 - if (readval == sizeof(notify_addr)) { 3164 - verbose("Notify on address %#lx\n", notify_addr); 3165 - handle_output(notify_addr); 1827 + readval = pread(lguest_fd, &notify, sizeof(notify), cpu_id); 1828 + if (readval == sizeof(notify)) { 1829 + if (notify.trap == 13) { 1830 + verbose("Emulating instruction at %#x\n", 1831 + getreg(eip)); 1832 + emulate_insn(notify.insn); 1833 + } else if (notify.trap == 14) { 1834 + verbose("Emulating MMIO at %#x\n", 1835 + getreg(eip)); 1836 + emulate_mmio(notify.addr, notify.insn); 1837 + } else 1838 + errx(1, "Unknown trap %i addr %#08x\n", 1839 + notify.trap, notify.addr); 3166 1840 /* ENOENT means the Guest died. Reading tells us why. */ 3167 1841 } else if (errno == ENOENT) { 3168 1842 char reason[1024] = { 0 }; ··· 3231 1893 main_args = argv; 3232 1894 3233 1895 /* 3234 - * First we initialize the device list. We keep a pointer to the last 3235 - * device, and the next interrupt number to use for devices (1: 3236 - * remember that 0 is used by the timer). 1896 + * First we initialize the device list. We remember next interrupt 1897 + * number to use for devices (1: remember that 0 is used by the timer). 3237 1898 */ 3238 - devices.lastdev = NULL; 3239 1899 devices.next_irq = 1; 3240 1900 3241 1901 /* We're CPU 0. In fact, that's the only CPU possible right now. */ ··· 3257 1921 guest_base = map_zeroed_pages(mem / getpagesize() 3258 1922 + DEVICE_PAGES); 3259 1923 guest_limit = mem; 3260 - guest_max = mem + DEVICE_PAGES*getpagesize(); 3261 - devices.descpage = get_pages(1); 1924 + guest_max = guest_mmio = mem + DEVICE_PAGES*getpagesize(); 3262 1925 break; 3263 1926 } 3264 1927 } 1928 + 1929 + /* We always have a console device, and it's always device 1. 
*/ 1930 + setup_console(); 3265 1931 3266 1932 /* The options are fairly straight-forward */ 3267 1933 while ((c = getopt_long(argc, argv, "v", opts, NULL)) != EOF) { ··· 3305 1967 3306 1968 verbose("Guest base is at %p\n", guest_base); 3307 1969 3308 - /* We always have a console device */ 3309 - setup_console(); 1970 + /* Initialize the (fake) PCI host bridge device. */ 1971 + init_pci_host_bridge(); 3310 1972 3311 1973 /* Now we load the kernel */ 3312 1974 start = load_kernel(open_or_die(argv[optind+1], O_RDONLY));
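Tying the device-setup calls together: the class/subclass pairs passed to new_pci_device() throughout this diff are plain PCI class codes, while the virtio identity itself follows the 4.1.2.1 rules quoted earlier — vendor 0x1AF4 always, and for a non-transitional device an ID of 0x1040 plus the virtio device type (net = 1, block = 2, console = 3, rng = 4). A sketch of that ID rule:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the virtio 1.0 PCI identity rule init_pci_config()
 * applies: vendor is always 0x1AF4, and a non-transitional device
 * ID is 0x1040 plus the virtio device type. */
#define VIRTIO_PCI_VENDOR 0x1AF4

static uint16_t virtio_pci_device_id(uint16_t virtio_type)
{
	return 0x1040 + virtio_type;
}
```

So the console set up above appears to the Guest as PCI device 1AF4:1043 in slot #1, and the net/block/rng devices follow in command-line order.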