Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Pull rework-memory-attribute-aliasing into release branch

Tony Luck 1323523f 9ba89334

+363 -77
+208
Documentation/ia64/aliasing.txt
···
+                MEMORY ATTRIBUTE ALIASING ON IA-64
+
+                           Bjorn Helgaas
+                       <bjorn.helgaas@hp.com>
+                            May 4, 2006
+
+
+MEMORY ATTRIBUTES
+
+    Itanium supports several attributes for virtual memory references.
+    The attribute is part of the virtual translation, i.e., it is
+    contained in the TLB entry.  The ones of most interest to the Linux
+    kernel are:
+
+        WB              Write-back (cacheable)
+        UC              Uncacheable
+        WC              Write-coalescing
+
+    System memory typically uses the WB attribute.  The UC attribute is
+    used for memory-mapped I/O devices.  The WC attribute is uncacheable
+    like UC is, but writes may be delayed and combined to increase
+    performance for things like frame buffers.
+
+    The Itanium architecture requires that we avoid accessing the same
+    page with both a cacheable mapping and an uncacheable mapping[1].
+
+    The design of the chipset determines which attributes are supported
+    on which regions of the address space.  For example, some chipsets
+    support either WB or UC access to main memory, while others support
+    only WB access.
+
+MEMORY MAP
+
+    Platform firmware describes the physical memory map and the
+    supported attributes for each region.  At boot-time, the kernel uses
+    the EFI GetMemoryMap() interface.  ACPI can also describe memory
+    devices and the attributes they support, but Linux/ia64 currently
+    doesn't use this information.
+
+    The kernel uses the efi_memmap table returned from GetMemoryMap() to
+    learn the attributes supported by each region of physical address
+    space.  Unfortunately, this table does not completely describe the
+    address space because some machines omit some or all of the MMIO
+    regions from the map.
+
+    The kernel maintains another table, kern_memmap, which describes the
+    memory Linux is actually using and the attribute for each region.
+    This contains only system memory; it does not contain MMIO space.
+
+    The kern_memmap table typically contains only a subset of the system
+    memory described by the efi_memmap.  Linux/ia64 can't use all memory
+    in the system because of constraints imposed by the identity mapping
+    scheme.
+
+    The efi_memmap table is preserved unmodified because the original
+    boot-time information is required for kexec.
+
+KERNEL IDENTITY MAPPINGS
+
+    Linux/ia64 identity mappings are done with large pages, currently
+    either 16MB or 64MB, referred to as "granules."  Cacheable mappings
+    are speculative[2], so the processor can read any location in the
+    page at any time, independent of the programmer's intentions.  This
+    means that to avoid attribute aliasing, Linux can create a cacheable
+    identity mapping only when the entire granule supports cacheable
+    access.
+
+    Therefore, kern_memmap contains only full granule-sized regions that
+    can be referenced safely by an identity mapping.
+
+    Uncacheable mappings are not speculative, so the processor will
+    generate UC accesses only to locations explicitly referenced by
+    software.  This allows UC identity mappings to cover granules that
+    are only partially populated, or populated with a combination of UC
+    and WB regions.
+
+USER MAPPINGS
+
+    User mappings are typically done with 16K or 64K pages.  The smaller
+    page size allows more flexibility because only 16K or 64K has to be
+    homogeneous with respect to memory attributes.
+
+POTENTIAL ATTRIBUTE ALIASING CASES
+
+    There are several ways the kernel creates new mappings:
+
+    mmap of /dev/mem
+
+        This uses remap_pfn_range(), which creates user mappings.  These
+        mappings may be either WB or UC.  If the region being mapped
+        happens to be in kern_memmap, meaning that it may also be mapped
+        by a kernel identity mapping, the user mapping must use the same
+        attribute as the kernel mapping.
+
+        If the region is not in kern_memmap, the user mapping should use
+        an attribute reported as being supported in the EFI memory map.
+
+        Since the EFI memory map does not describe MMIO on some
+        machines, this should use an uncacheable mapping as a fallback.
+
+    mmap of /sys/class/pci_bus/.../legacy_mem
+
+        This is very similar to mmap of /dev/mem, except that legacy_mem
+        only allows mmap of the one megabyte "legacy MMIO" area for a
+        specific PCI bus.  Typically this is the first megabyte of
+        physical address space, but it may be different on machines with
+        several VGA devices.
+
+        "X" uses this to access VGA frame buffers.  Using legacy_mem
+        rather than /dev/mem allows multiple instances of X to talk to
+        different VGA cards.
+
+        The /dev/mem mmap constraints apply.
+
+        However, since this is for mapping legacy MMIO space, WB access
+        does not make sense.  This matters on machines without legacy
+        VGA support: these machines may have WB memory for the entire
+        first megabyte (or even the entire first granule).
+
+        On these machines, we could mmap legacy_mem as WB, which would
+        be safe in terms of attribute aliasing, but X has no way of
+        knowing that it is accessing regular memory, not a frame buffer,
+        so the kernel should fail the mmap rather than doing it with WB.
+
+    read/write of /dev/mem
+
+        This uses copy_from_user(), which implicitly uses a kernel
+        identity mapping.  This is obviously safe for things in
+        kern_memmap.
+
+        There may be corner cases of things that are not in kern_memmap,
+        but could be accessed this way.  For example, registers in MMIO
+        space are not in kern_memmap, but could be accessed with a UC
+        mapping.  This would not cause attribute aliasing.  But
+        registers typically can be accessed only with four-byte or
+        eight-byte accesses, and the copy_from_user() path doesn't allow
+        any control over the access size, so this would be dangerous.
+
+    ioremap()
+
+        This returns a kernel identity mapping for use inside the
+        kernel.
+
+        If the region is in kern_memmap, we should use the attribute
+        specified there.  Otherwise, if the EFI memory map reports that
+        the entire granule supports WB, we should use that (granules
+        that are partially reserved or occupied by firmware do not appear
+        in kern_memmap).  Otherwise, we should use a UC mapping.
+
+PAST PROBLEM CASES
+
+    mmap of various MMIO regions from /dev/mem by "X" on Intel platforms
+
+        The EFI memory map may not report these MMIO regions.
+
+        These must be allowed so that X will work.  This means that
+        when the EFI memory map is incomplete, every /dev/mem mmap must
+        succeed.  It may create either WB or UC user mappings, depending
+        on whether the region is in kern_memmap or the EFI memory map.
+
+    mmap of 0x0-0xA0000 /dev/mem by "hwinfo" on HP sx1000 with VGA enabled
+
+        See https://bugzilla.novell.com/show_bug.cgi?id=140858.
+
+        The EFI memory map reports the following attributes:
+            0x00000-0x9FFFF WB only
+            0xA0000-0xBFFFF UC only (VGA frame buffer)
+            0xC0000-0xFFFFF WB only
+
+        This mmap is done with user pages, not kernel identity mappings,
+        so it is safe to use WB mappings.
+
+        The kernel VGA driver may ioremap the VGA frame buffer at 0xA0000,
+        which will use a granule-sized UC mapping covering 0-0xFFFFF.  This
+        granule covers some WB-only memory, but since UC is non-speculative,
+        the processor will never generate an uncacheable reference to the
+        WB-only areas unless the driver explicitly touches them.
+
+    mmap of 0x0-0xFFFFF legacy_mem by "X"
+
+        If the EFI memory map reports this entire range as WB, there
+        is no VGA MMIO hole, and the mmap should fail or be done with
+        a WB mapping.
+
+        There's no easy way for X to determine whether the 0xA0000-0xBFFFF
+        region is a frame buffer or just memory, so I think it's best to
+        just fail this mmap request rather than using a WB mapping.  As
+        far as I know, there's no need to map legacy_mem with WB
+        mappings.
+
+        Otherwise, a UC mapping of the entire region is probably safe.
+        The VGA hole means the region will not be in kern_memmap.  The
+        HP sx1000 chipset doesn't support UC access to the memory surrounding
+        the VGA hole, but X doesn't need that area anyway and should not
+        reference it.
+
+    mmap of 0xA0000-0xBFFFF legacy_mem by "X" on HP sx1000 with VGA disabled
+
+        The EFI memory map reports the following attributes:
+            0x00000-0xFFFFF WB only (no VGA MMIO hole)
+
+        This is a special case of the previous case, and the mmap should
+        fail for the same reason as above.
+
+NOTES
+
+    [1] SDM rev 2.2, vol 2, sec 4.4.1.
+    [2] SDM rev 2.2, vol 2, sec 4.4.6.
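The homogeneity rule in aliasing.txt (a cacheable mapping is safe only if every byte it covers supports WB) can be illustrated with a small stand-alone model. The sketch below is not kernel code: `toy_map`, `toy_find()` and `toy_range_attr()` are hypothetical names, and the table simply hard-codes the HP sx1000 legacy area with VGA enabled as listed in the document. It walks a range region by region, much as the attribute helpers in this commit do, and returns 0 when the range has a hole or mixes attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Attribute bits for this sketch only (assumed values, not kernel headers). */
#define TOY_UC 0x1u
#define TOY_WB 0x8u

struct toy_region {
	uint64_t start;		/* first byte of the region */
	uint64_t end;		/* one past the last byte */
	unsigned attr;		/* attribute the region supports */
};

/* Hypothetical table: HP sx1000 legacy area with VGA enabled. */
static const struct toy_region toy_map[] = {
	{ 0x00000, 0xA0000,  TOY_WB },
	{ 0xA0000, 0xC0000,  TOY_UC },	/* VGA frame buffer */
	{ 0xC0000, 0x100000, TOY_WB },
};

/* Find the region containing addr, or NULL if the map has a hole there. */
static const struct toy_region *toy_find(uint64_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++)
		if (addr >= toy_map[i].start && addr < toy_map[i].end)
			return &toy_map[i];
	return NULL;
}

/*
 * Return the attribute shared by every byte of [addr, addr + size),
 * or 0 if the range crosses a hole or regions with different attributes.
 */
static unsigned toy_range_attr(uint64_t addr, uint64_t size)
{
	uint64_t end = addr + size;
	const struct toy_region *r = toy_find(addr);
	unsigned attr;

	assert(size > 0);
	if (!r)
		return 0;

	attr = r->attr;
	while (end > r->end) {
		r = toy_find(r->end);
		if (!r || r->attr != attr)
			return 0;
	}
	return attr;
}
```

In this model the "hwinfo" range 0x0-0x9FFFF is uniformly WB, while the full first megabyte is mixed, so a cacheable mapping covering all of it would alias the UC frame buffer.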
+106 -58
arch/ia64/kernel/efi.c
···
  * Copyright (C) 1999-2003 Hewlett-Packard Co.
  *	David Mosberger-Tang <davidm@hpl.hp.com>
  *	Stephane Eranian <eranian@hpl.hp.com>
+ * (c) Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *	Bjorn Helgaas <bjorn.helgaas@hp.com>
  *
  * All EFI Runtime Services are not implemented yet as EFI only
  * supports physical mode addressing on SoftSDV. This is to be fixed
···
 	return 0;
 }
 
+static struct kern_memdesc *
+kern_memory_descriptor (unsigned long phys_addr)
+{
+	struct kern_memdesc *md;
+
+	for (md = kern_memmap; md->start != ~0UL; md++) {
+		if (phys_addr - md->start < (md->num_pages << EFI_PAGE_SHIFT))
+			return md;
+	}
+	return 0;
+}
+
 static efi_memory_desc_t *
 efi_memory_descriptor (unsigned long phys_addr)
 {
···
 
 		if (phys_addr - md->phys_addr < (md->num_pages << EFI_PAGE_SHIFT))
 			return md;
-	}
-	return 0;
-}
-
-static int
-efi_memmap_has_mmio (void)
-{
-	void *efi_map_start, *efi_map_end, *p;
-	efi_memory_desc_t *md;
-	u64 efi_desc_size;
-
-	efi_map_start = __va(ia64_boot_param->efi_memmap);
-	efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
-	efi_desc_size = ia64_boot_param->efi_memdesc_size;
-
-	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
-		md = p;
-
-		if (md->type == EFI_MEMORY_MAPPED_IO)
-			return 1;
 	}
 	return 0;
 }
···
 }
 EXPORT_SYMBOL(efi_mem_attributes);
 
-/*
- * Determines whether the memory at phys_addr supports the desired
- * attribute (WB, UC, etc). If this returns 1, the caller can safely
- * access size bytes at phys_addr with the specified attribute.
- */
-int
-efi_mem_attribute_range (unsigned long phys_addr, unsigned long size, u64 attr)
+u64
+efi_mem_attribute (unsigned long phys_addr, unsigned long size)
 {
 	unsigned long end = phys_addr + size;
 	efi_memory_desc_t *md = efi_memory_descriptor(phys_addr);
+	u64 attr;
+
+	if (!md)
+		return 0;
 
 	/*
-	 * Some firmware doesn't report MMIO regions in the EFI memory
-	 * map. The Intel BigSur (a.k.a. HP i2000) has this problem.
-	 * On those platforms, we have to assume UC is valid everywhere.
+	 * EFI_MEMORY_RUNTIME is not a memory attribute; it just tells
+	 * the kernel that firmware needs this region mapped.
 	 */
-	if (!md || (md->attribute & attr) != attr) {
-		if (attr == EFI_MEMORY_UC && !efi_memmap_has_mmio())
-			return 1;
-		return 0;
-	}
-
+	attr = md->attribute & ~EFI_MEMORY_RUNTIME;
 	do {
 		unsigned long md_end = efi_md_end(md);
 
 		if (end <= md_end)
-			return 1;
+			return attr;
 
 		md = efi_memory_descriptor(md_end);
-		if (!md || (md->attribute & attr) != attr)
+		if (!md || (md->attribute & ~EFI_MEMORY_RUNTIME) != attr)
 			return 0;
 	} while (md);
 	return 0;
 }
 
-/*
- * For /dev/mem, we only allow read & write system calls to access
- * write-back memory, because read & write don't allow the user to
- * control access size.
- */
+u64
+kern_mem_attribute (unsigned long phys_addr, unsigned long size)
+{
+	unsigned long end = phys_addr + size;
+	struct kern_memdesc *md;
+	u64 attr;
+
+	/*
+	 * This is a hack for ioremap calls before we set up kern_memmap.
+	 * Maybe we should do efi_memmap_init() earlier instead.
+	 */
+	if (!kern_memmap) {
+		attr = efi_mem_attribute(phys_addr, size);
+		if (attr & EFI_MEMORY_WB)
+			return EFI_MEMORY_WB;
+		return 0;
+	}
+
+	md = kern_memory_descriptor(phys_addr);
+	if (!md)
+		return 0;
+
+	attr = md->attribute;
+	do {
+		unsigned long md_end = kmd_end(md);
+
+		if (end <= md_end)
+			return attr;
+
+		md = kern_memory_descriptor(md_end);
+		if (!md || md->attribute != attr)
+			return 0;
+	} while (md);
+	return 0;
+}
+EXPORT_SYMBOL(kern_mem_attribute);
+
 int
 valid_phys_addr_range (unsigned long phys_addr, unsigned long size)
 {
-	return efi_mem_attribute_range(phys_addr, size, EFI_MEMORY_WB);
+	u64 attr;
+
+	/*
+	 * /dev/mem reads and writes use copy_to_user(), which implicitly
+	 * uses a granule-sized kernel identity mapping. It's really
+	 * only safe to do this for regions in kern_memmap. For more
+	 * details, see Documentation/ia64/aliasing.txt.
+	 */
+	attr = kern_mem_attribute(phys_addr, size);
+	if (attr & EFI_MEMORY_WB || attr & EFI_MEMORY_UC)
+		return 1;
+	return 0;
 }
 
-/*
- * We allow mmap of anything in the EFI memory map that supports
- * either write-back or uncacheable access. For uncacheable regions,
- * the supported access sizes are system-dependent, and the user is
- * responsible for using the correct size.
- *
- * Note that this doesn't currently allow access to hot-added memory,
- * because that doesn't appear in the boot-time EFI memory map.
- */
 int
 valid_mmap_phys_addr_range (unsigned long phys_addr, unsigned long size)
 {
-	if (efi_mem_attribute_range(phys_addr, size, EFI_MEMORY_WB))
-		return 1;
+	/*
+	 * MMIO regions are often missing from the EFI memory map.
+	 * We must allow mmap of them for programs like X, so we
+	 * currently can't do any useful validation.
+	 */
+	return 1;
+}
 
-	if (efi_mem_attribute_range(phys_addr, size, EFI_MEMORY_UC))
-		return 1;
+pgprot_t
+phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size,
+		     pgprot_t vma_prot)
+{
+	unsigned long phys_addr = pfn << PAGE_SHIFT;
+	u64 attr;
 
-	return 0;
+	/*
+	 * For /dev/mem mmap, we use user mappings, but if the region is
+	 * in kern_memmap (and hence may be covered by a kernel mapping),
+	 * we must use the same attribute as the kernel mapping.
+	 */
+	attr = kern_mem_attribute(phys_addr, size);
+	if (attr & EFI_MEMORY_WB)
+		return pgprot_cacheable(vma_prot);
+	else if (attr & EFI_MEMORY_UC)
+		return pgprot_noncached(vma_prot);
+
+	/*
+	 * Some chipsets don't support UC access to memory. If
+	 * WB is supported, we prefer that.
+	 */
+	if (efi_mem_attribute(phys_addr, size) & EFI_MEMORY_WB)
+		return pgprot_cacheable(vma_prot);
+
+	return pgprot_noncached(vma_prot);
 }
 
 int __init
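Both descriptor lookups in efi.c test membership with a single unsigned comparison, `phys_addr - start < length`. A minimal sketch of why that works is below; `toy_in_range()` is a hypothetical helper written for illustration, and it assumes the region itself does not wrap around the end of the address space.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if addr lies in [start, start + len), else 0.
 *
 * If addr < start, the unsigned subtraction wraps around to a very large
 * value, which cannot be less than len, so one comparison checks both the
 * lower and the upper bound.
 */
static int toy_in_range(uint64_t addr, uint64_t start, uint64_t len)
{
	/* Assumption of this sketch: the region does not wrap past 2^64. */
	assert(start + len >= start);

	return addr - start < len;
}
```

The same idiom appears in the diff as `phys_addr - md->start < (md->num_pages << EFI_PAGE_SHIFT)`, where the right-hand side is the region length in bytes.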
+22 -5
arch/ia64/mm/ioremap.c
···
 #include <linux/module.h>
 #include <linux/efi.h>
 #include <asm/io.h>
+#include <asm/meminit.h>
 
 static inline void __iomem *
 __ioremap (unsigned long offset, unsigned long size)
···
 void __iomem *
 ioremap (unsigned long offset, unsigned long size)
 {
-	if (efi_mem_attribute_range(offset, size, EFI_MEMORY_WB))
-		return phys_to_virt(offset);
+	u64 attr;
+	unsigned long gran_base, gran_size;
 
-	if (efi_mem_attribute_range(offset, size, EFI_MEMORY_UC))
+	/*
+	 * For things in kern_memmap, we must use the same attribute
+	 * as the rest of the kernel. For more details, see
+	 * Documentation/ia64/aliasing.txt.
+	 */
+	attr = kern_mem_attribute(offset, size);
+	if (attr & EFI_MEMORY_WB)
+		return phys_to_virt(offset);
+	else if (attr & EFI_MEMORY_UC)
 		return __ioremap(offset, size);
 
 	/*
-	 * Someday this should check ACPI resources so we
-	 * can do the right thing for hot-plugged regions.
+	 * Some chipsets don't support UC access to memory. If
+	 * WB is supported for the whole granule, we prefer that.
 	 */
+	gran_base = GRANULEROUNDDOWN(offset);
+	gran_size = GRANULEROUNDUP(offset + size) - gran_base;
+	if (efi_mem_attribute(gran_base, gran_size) & EFI_MEMORY_WB)
+		return phys_to_virt(offset);
+
 	return __ioremap(offset, size);
 }
 EXPORT_SYMBOL(ioremap);
···
 void __iomem *
 ioremap_nocache (unsigned long offset, unsigned long size)
 {
+	if (kern_mem_attribute(offset, size) & EFI_MEMORY_WB)
+		return 0;
+
 	return __ioremap(offset, size);
 }
 EXPORT_SYMBOL(ioremap_nocache);
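The new ioremap() fallback widens the requested range to whole granules before asking whether WB is supported. The sketch below shows that arithmetic in isolation. The `TOY_` macros are stand-ins written for this example, not the kernel's GRANULEROUNDDOWN/GRANULEROUNDUP definitions, and the 16MB granule size is an assumption (aliasing.txt says granules are currently 16MB or 64MB).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 16MB granule; the real size depends on kernel configuration. */
#define TOY_GRANULE_SIZE	(1ULL << 24)
#define TOY_GRANULE_DOWN(n)	((n) & ~(TOY_GRANULE_SIZE - 1))
#define TOY_GRANULE_UP(n)	(((n) + TOY_GRANULE_SIZE - 1) & ~(TOY_GRANULE_SIZE - 1))

struct toy_span {
	uint64_t base;	/* start of the first granule touched */
	uint64_t size;	/* bytes covered by all granules touched */
};

/* Expand [offset, offset + size) to the granules that contain it. */
static struct toy_span toy_granule_span(uint64_t offset, uint64_t size)
{
	struct toy_span s;

	assert(size > 0);
	s.base = TOY_GRANULE_DOWN(offset);
	s.size = TOY_GRANULE_UP(offset + size) - s.base;
	return s;
}
```

With these assumptions, a request for the VGA frame buffer at 0xA0000 expands to the whole first 16MB granule, and a range straddling a granule boundary expands to two granules. That is why the WB check must cover more than the bytes the caller asked for.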
+15 -2
arch/ia64/pci/pci.c
···
 int
 pci_mmap_legacy_page_range(struct pci_bus *bus, struct vm_area_struct *vma)
 {
+	unsigned long size = vma->vm_end - vma->vm_start;
+	pgprot_t prot;
 	char *addr;
+
+	/*
+	 * Avoid attribute aliasing. See Documentation/ia64/aliasing.txt
+	 * for more details.
+	 */
+	if (!valid_mmap_phys_addr_range(vma->vm_pgoff << PAGE_SHIFT, size))
+		return -EINVAL;
+	prot = phys_mem_access_prot(NULL, vma->vm_pgoff, size,
+				    vma->vm_page_prot);
+	if (pgprot_val(prot) != pgprot_val(pgprot_noncached(vma->vm_page_prot)))
+		return -EINVAL;
 
 	addr = pci_get_legacy_mem(bus);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
 	vma->vm_pgoff += (unsigned long)addr >> PAGE_SHIFT;
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_page_prot = prot;
 	vma->vm_flags |= (VM_SHM | VM_RESERVED | VM_IO);
 
 	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			    vma->vm_end - vma->vm_start, vma->vm_page_prot))
+			    size, vma->vm_page_prot))
 		return -EAGAIN;
 
 	return 0;
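The legacy_mem change combines two decisions: pick the attribute the way phys_mem_access_prot() does, then refuse the mmap unless the result is uncacheable. The following stand-alone model restates that order of checks. All names (`toy_access_prot()`, `toy_legacy_mmap_check()`, the `TOY_` constants) are hypothetical, and the two attribute arguments stand in for the values the kernel would obtain from kern_mem_attribute() and efi_mem_attribute().

```c
#include <assert.h>
#include <errno.h>

/* Attribute bits for this sketch only (assumed values). */
#define TOY_UC 0x1u
#define TOY_WB 0x8u

enum toy_prot { TOY_PROT_WB, TOY_PROT_UC };

/*
 * kern_attr: attribute from the kernel's own table, 0 if not covered.
 * efi_attr:  attribute from the firmware map, 0 if not described.
 */
static enum toy_prot toy_access_prot(unsigned kern_attr, unsigned efi_attr)
{
	/* A kernel mapping may exist, so the user mapping must match it. */
	if (kern_attr & TOY_WB)
		return TOY_PROT_WB;
	if (kern_attr & TOY_UC)
		return TOY_PROT_UC;

	/* Not kernel memory: prefer WB if firmware says it is supported. */
	if (efi_attr & TOY_WB)
		return TOY_PROT_WB;

	/* Unknown or MMIO: fall back to uncacheable. */
	return TOY_PROT_UC;
}

/* legacy_mem is for MMIO, so anything that would be mapped WB is refused. */
static int toy_legacy_mmap_check(unsigned kern_attr, unsigned efi_attr)
{
	if (toy_access_prot(kern_attr, efi_attr) != TOY_PROT_UC)
		return -EINVAL;
	return 0;
}
```

Under this model, a machine whose first megabyte is plain WB memory fails the legacy_mem mmap, which matches the "VGA disabled" case described in aliasing.txt.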
+1
include/asm-ia64/io.h
···
 }
 
 #define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern u64 kern_mem_attribute (unsigned long phys_addr, unsigned long size);
 extern int valid_phys_addr_range (unsigned long addr, size_t count); /* efi.c */
 extern int valid_mmap_phys_addr_range (unsigned long addr, size_t count);
 
+10 -12
include/asm-ia64/pgtable.h
···
 #define pte_mkhuge(pte)		(__pte(pte_val(pte)))
 
 /*
- * Macro to a page protection value as "uncacheable". Note that "protection" is really a
- * misnomer here as the protection value contains the memory attribute bits, dirty bits,
- * and various other bits as well.
+ * Make page protection values cacheable, uncacheable, or write-
+ * combining. Note that "protection" is really a misnomer here as the
+ * protection value contains the memory attribute bits, dirty bits, and
+ * various other bits as well.
  */
+#define pgprot_cacheable(prot)		__pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_WB)
 #define pgprot_noncached(prot)		__pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_UC)
-
-/*
- * Macro to make mark a page protection value as "write-combining".
- * Note that "protection" is really a misnomer here as the protection
- * value contains the memory attribute bits, dirty bits, and various
- * other bits as well. Accesses through a write-combining translation
- * works bypasses the caches, but does allow for consecutive writes to
- * be combined into single (but larger) write transactions.
- */
 #define pgprot_writecombine(prot)	__pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_WC)
+
+struct file;
+extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
+				     unsigned long size, pgprot_t vma_prot);
+#define __HAVE_PHYS_MEM_ACCESS_PROT
 
 static inline unsigned long
 pgd_index (unsigned long address)
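The three pgprot_* macros share one pattern: clear the memory-attribute field, then OR in the new attribute, leaving every other bit of the protection value alone. A stand-alone sketch of that pattern follows. The `TOY_MA_*` values and the position of the field are assumptions chosen for illustration and should not be read as the real `_PAGE_MA_*` definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: a 3-bit memory-attribute field in bits 2..4. */
#define TOY_MA_MASK	(0x7u << 2)
#define TOY_MA_WB	(0x0u << 2)
#define TOY_MA_UC	(0x4u << 2)
#define TOY_MA_WC	(0x6u << 2)

/*
 * Replace only the memory-attribute field of prot with ma.
 * The mask is widened to 64 bits before inverting so that the upper
 * bits of prot survive the AND.
 */
static uint64_t toy_set_ma(uint64_t prot, uint64_t ma)
{
	assert((ma & ~(uint64_t)TOY_MA_MASK) == 0);

	return (prot & ~(uint64_t)TOY_MA_MASK) | ma;
}
```

Because the field is cleared first, converting a value from UC to WB and back is lossless for all bits outside the attribute field, which is what lets phys_mem_access_prot() start from the caller's `vma_prot`.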
+1
include/linux/efi.h
···
 extern u64 efi_get_iobase (void);
 extern u32 efi_mem_type (unsigned long phys_addr);
 extern u64 efi_mem_attributes (unsigned long phys_addr);
+extern u64 efi_mem_attribute (unsigned long phys_addr, unsigned long size);
 extern int efi_mem_attribute_range (unsigned long phys_addr, unsigned long size,
 				    u64 attr);
 extern int __init efi_uart_console_only (void);