Merge tag 'docs-5.8' of git://git.lwn.net/linux

+4 -1

.mailmap

···
152 152 Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
153 153 Leon Romanovsky <leon@kernel.org> <leon@leon.nu>
154 154 Leon Romanovsky <leon@kernel.org> <leonro@mellanox.com>
155 + Leonardo Bras <leobras.c@gmail.com> <leonardo@linux.ibm.com>
155 156 Leonid I Ananiev <leonid.i.ananiev@intel.com>
156 157 Linas Vepstas <linas@austin.ibm.com>
157 158 Linus Lüssing <linus.luessing@c0d3.blue> <linus.luessing@web.de>
···
235 234 Ralf Wildenhues <Ralf.Wildenhues@gmx.de>
236 235 Randy Dunlap <rdunlap@infradead.org> <rdunlap@xenotime.net>
237 236 Rémi Denis-Courmont <rdenis@simphalempin.com>
238 - Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
237 + Ricardo Ribalda <ribalda@kernel.org> <ricardo.ribalda@gmail.com>
238 + Ricardo Ribalda <ribalda@kernel.org> <ricardo@ribalda.com>
239 + Ricardo Ribalda <ribalda@kernel.org> Ricardo Ribalda Delgado <ribalda@kernel.org>
239 240 Ross Zwisler <zwisler@kernel.org> <ross.zwisler@linux.intel.com>
240 241 Rudolf Marek <R.Marek@sh.cvut.cz>
241 242 Rui Saraiva <rmps@joel.ist.utl.pt>

+4 -2

CREDITS

···
3104 3104 D: Generic Z8530 driver, AX.25 DAMA slave implementation
3105 3105 D: Several AX.25 hacks
3106 3106
3107 - N: Ricardo Ribalda Delgado
3108 - E: ricardo.ribalda@gmail.com
3107 + N: Ricardo Ribalda
3108 + E: ribalda@kernel.org
3109 3109 W: http://ribalda.com
3110 3110 D: PLX USB338x driver
3111 3111 D: PCA9634 driver
3112 3112 D: Option GTM671WFS
3113 3113 D: Fintek F81216A
3114 3114 D: AD5761 iio driver
3115 + D: TI DAC7612 driver
3116 + D: Sony IMX214 driver
3115 3117 D: Various kernel hacks
3116 3118 S: Qtechnology A/S
3117 3119 S: Valby Langgade 142

+1 -1

Documentation/ABI/stable/sysfs-devices-node

···
54 54 Contact: Linux Memory Management list <linux-mm@kvack.org>
55 55 Description:
56 56 Provides information about the node's distribution and memory
57 - utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.txt
57 + utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.rst
58 58
59 59 What: /sys/devices/system/node/nodeX/numastat
60 60 Date: October 2002

+1 -1

Documentation/ABI/testing/procfs-smaps_rollup

···
11 11 Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem
12 12 are not present in /proc/pid/smaps. These fields represent
13 13 the sum of the Pss field of each type (anon, file, shmem).
14 - For more details, see Documentation/filesystems/proc.txt
14 + For more details, see Documentation/filesystems/proc.rst
15 15 and the procfs man page.
16 16
17 17 Typical output looks like this:

Documentation/DMA-API-HOWTO.txt Documentation/core-api/dma-api-howto.rst

Documentation/DMA-API.txt Documentation/core-api/dma-api.rst

Documentation/DMA-ISA-LPC.txt Documentation/core-api/dma-isa-lpc.rst

Documentation/DMA-attributes.txt Documentation/core-api/dma-attributes.rst

Documentation/IPMI.txt Documentation/driver-api/ipmi.rst

Documentation/IRQ-affinity.txt Documentation/core-api/irq/irq-affinity.rst

-269

Documentation/IRQ-domain.txt

···
1 - ===============================================
2 - The irq_domain interrupt number mapping library
3 - ===============================================
4 -
5 - The current design of the Linux kernel uses a single large number
6 - space where each separate IRQ source is assigned a different number.
7 - This is simple when there is only one interrupt controller, but in
8 - systems with multiple interrupt controllers the kernel must ensure
9 - that each one gets assigned non-overlapping allocations of Linux
10 - IRQ numbers.
11 -
12 - The number of interrupt controllers registered as unique irqchips
13 - show a rising tendency: for example subdrivers of different kinds
14 - such as GPIO controllers avoid reimplementing identical callback
15 - mechanisms as the IRQ core system by modelling their interrupt
16 - handlers as irqchips, i.e. in effect cascading interrupt controllers.
17 -
18 - Here the interrupt number loose all kind of correspondence to
19 - hardware interrupt numbers: whereas in the past, IRQ numbers could
20 - be chosen so they matched the hardware IRQ line into the root
21 - interrupt controller (i.e. the component actually fireing the
22 - interrupt line to the CPU) nowadays this number is just a number.
23 -
24 - For this reason we need a mechanism to separate controller-local
25 - interrupt numbers, called hardware irq's, from Linux IRQ numbers.
26 -
27 - The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of
28 - irq numbers, but they don't provide any support for reverse mapping of
29 - the controller-local IRQ (hwirq) number into the Linux IRQ number
30 - space.
31 -
32 - The irq_domain library adds mapping between hwirq and IRQ numbers on
33 - top of the irq_alloc_desc*() API. An irq_domain to manage mapping is
34 - preferred over interrupt controller drivers open coding their own
35 - reverse mapping scheme.
36 -
37 - irq_domain also implements translation from an abstract irq_fwspec
38 - structure to hwirq numbers (Device Tree and ACPI GSI so far), and can
39 - be easily extended to support other IRQ topology data sources.
40 -
41 - irq_domain usage
42 - ================
43 -
44 - An interrupt controller driver creates and registers an irq_domain by
45 - calling one of the irq_domain_add_*() functions (each mapping method
46 - has a different allocator function, more on that later). The function
47 - will return a pointer to the irq_domain on success. The caller must
48 - provide the allocator function with an irq_domain_ops structure.
49 -
50 - In most cases, the irq_domain will begin empty without any mappings
51 - between hwirq and IRQ numbers. Mappings are added to the irq_domain
52 - by calling irq_create_mapping() which accepts the irq_domain and a
53 - hwirq number as arguments. If a mapping for the hwirq doesn't already
54 - exist then it will allocate a new Linux irq_desc, associate it with
55 - the hwirq, and call the .map() callback so the driver can perform any
56 - required hardware setup.
57 -
58 - When an interrupt is received, irq_find_mapping() function should
59 - be used to find the Linux IRQ number from the hwirq number.
60 -
61 - The irq_create_mapping() function must be called *atleast once*
62 - before any call to irq_find_mapping(), lest the descriptor will not
63 - be allocated.
64 -
65 - If the driver has the Linux IRQ number or the irq_data pointer, and
66 - needs to know the associated hwirq number (such as in the irq_chip
67 - callbacks) then it can be directly obtained from irq_data->hwirq.
68 -
69 - Types of irq_domain mappings
70 - ============================
71 -
72 - There are several mechanisms available for reverse mapping from hwirq
73 - to Linux irq, and each mechanism uses a different allocation function.
74 - Which reverse map type should be used depends on the use case. Each
75 - of the reverse map types are described below:
76 -
77 - Linear
78 - ------
79 -
80 - ::
81 -
82 - irq_domain_add_linear()
83 - irq_domain_create_linear()
84 -
85 - The linear reverse map maintains a fixed size table indexed by the
86 - hwirq number. When a hwirq is mapped, an irq_desc is allocated for
87 - the hwirq, and the IRQ number is stored in the table.
88 -
89 - The Linear map is a good choice when the maximum number of hwirqs is
90 - fixed and a relatively small number (~ < 256). The advantages of this
91 - map are fixed time lookup for IRQ numbers, and irq_descs are only
92 - allocated for in-use IRQs. The disadvantage is that the table must be
93 - as large as the largest possible hwirq number.
94 -
95 - irq_domain_add_linear() and irq_domain_create_linear() are functionally
96 - equivalent, except for the first argument is different - the former
97 - accepts an Open Firmware specific 'struct device_node', while the latter
98 - accepts a more general abstraction 'struct fwnode_handle'.
99 -
100 - The majority of drivers should use the linear map.
101 -
102 - Tree
103 - ----
104 -
105 - ::
106 -
107 - irq_domain_add_tree()
108 - irq_domain_create_tree()
109 -
110 - The irq_domain maintains a radix tree map from hwirq numbers to Linux
111 - IRQs. When an hwirq is mapped, an irq_desc is allocated and the
112 - hwirq is used as the lookup key for the radix tree.
113 -
114 - The tree map is a good choice if the hwirq number can be very large
115 - since it doesn't need to allocate a table as large as the largest
116 - hwirq number. The disadvantage is that hwirq to IRQ number lookup is
117 - dependent on how many entries are in the table.
118 -
119 - irq_domain_add_tree() and irq_domain_create_tree() are functionally
120 - equivalent, except for the first argument is different - the former
121 - accepts an Open Firmware specific 'struct device_node', while the latter
122 - accepts a more general abstraction 'struct fwnode_handle'.
123 -
124 - Very few drivers should need this mapping.
125 -
126 - No Map
127 - ------
128 -
129 - ::
130 -
131 - irq_domain_add_nomap()
132 -
133 - The No Map mapping is to be used when the hwirq number is
134 - programmable in the hardware. In this case it is best to program the
135 - Linux IRQ number into the hardware itself so that no mapping is
136 - required. Calling irq_create_direct_mapping() will allocate a Linux
137 - IRQ number and call the .map() callback so that driver can program the
138 - Linux IRQ number into the hardware.
139 -
140 - Most drivers cannot use this mapping.
141 -
142 - Legacy
143 - ------
144 -
145 - ::
146 -
147 - irq_domain_add_simple()
148 - irq_domain_add_legacy()
149 - irq_domain_add_legacy_isa()
150 -
151 - The Legacy mapping is a special case for drivers that already have a
152 - range of irq_descs allocated for the hwirqs. It is used when the
153 - driver cannot be immediately converted to use the linear mapping. For
154 - example, many embedded system board support files use a set of #defines
155 - for IRQ numbers that are passed to struct device registrations. In that
156 - case the Linux IRQ numbers cannot be dynamically assigned and the legacy
157 - mapping should be used.
158 -
159 - The legacy map assumes a contiguous range of IRQ numbers has already
160 - been allocated for the controller and that the IRQ number can be
161 - calculated by adding a fixed offset to the hwirq number, and
162 - visa-versa. The disadvantage is that it requires the interrupt
163 - controller to manage IRQ allocations and it requires an irq_desc to be
164 - allocated for every hwirq, even if it is unused.
165 -
166 - The legacy map should only be used if fixed IRQ mappings must be
167 - supported. For example, ISA controllers would use the legacy map for
168 - mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ
169 - numbers.
170 -
171 - Most users of legacy mappings should use irq_domain_add_simple() which
172 - will use a legacy domain only if an IRQ range is supplied by the
173 - system and will otherwise use a linear domain mapping. The semantics
174 - of this call are such that if an IRQ range is specified then
175 - descriptors will be allocated on-the-fly for it, and if no range is
176 - specified it will fall through to irq_domain_add_linear() which means
177 - *no* irq descriptors will be allocated.
178 -
179 - A typical use case for simple domains is where an irqchip provider
180 - is supporting both dynamic and static IRQ assignments.
181 -
182 - In order to avoid ending up in a situation where a linear domain is
183 - used and no descriptor gets allocated it is very important to make sure
184 - that the driver using the simple domain call irq_create_mapping()
185 - before any irq_find_mapping() since the latter will actually work
186 - for the static IRQ assignment case.
187 -
188 - Hierarchy IRQ domain
189 - --------------------
190 -
191 - On some architectures, there may be multiple interrupt controllers
192 - involved in delivering an interrupt from the device to the target CPU.
193 - Let's look at a typical interrupt delivering path on x86 platforms::
194 -
195 - Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
196 -
197 - There are three interrupt controllers involved:
198 -
199 - 1) IOAPIC controller
200 - 2) Interrupt remapping controller
201 - 3) Local APIC controller
202 -
203 - To support such a hardware topology and make software architecture match
204 - hardware architecture, an irq_domain data structure is built for each
205 - interrupt controller and those irq_domains are organized into hierarchy.
206 - When building irq_domain hierarchy, the irq_domain near to the device is
207 - child and the irq_domain near to CPU is parent. So a hierarchy structure
208 - as below will be built for the example above::
209 -
210 - CPU Vector irq_domain (root irq_domain to manage CPU vectors)
211 - ^
212 - |
213 - Interrupt Remapping irq_domain (manage irq_remapping entries)
214 - ^
215 - |
216 - IOAPIC irq_domain (manage IOAPIC delivery entries/pins)
217 -
218 - There are four major interfaces to use hierarchy irq_domain:
219 -
220 - 1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt
221 - controller related resources to deliver these interrupts.
222 - 2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller
223 - related resources associated with these interrupts.
224 - 3) irq_domain_activate_irq(): activate interrupt controller hardware to
225 - deliver the interrupt.
226 - 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware
227 - to stop delivering the interrupt.
228 -
229 - Following changes are needed to support hierarchy irq_domain:
230 -
231 - 1) a new field 'parent' is added to struct irq_domain; it's used to
232 - maintain irq_domain hierarchy information.
233 - 2) a new field 'parent_data' is added to struct irq_data; it's used to
234 - build hierarchy irq_data to match hierarchy irq_domains. The irq_data
235 - is used to store irq_domain pointer and hardware irq number.
236 - 3) new callbacks are added to struct irq_domain_ops to support hierarchy
237 - irq_domain operations.
238 -
239 - With support of hierarchy irq_domain and hierarchy irq_data ready, an
240 - irq_domain structure is built for each interrupt controller, and an
241 - irq_data structure is allocated for each irq_domain associated with an
242 - IRQ. Now we could go one step further to support stacked(hierarchy)
243 - irq_chip. That is, an irq_chip is associated with each irq_data along
244 - the hierarchy. A child irq_chip may implement a required action by
245 - itself or by cooperating with its parent irq_chip.
246 -
247 - With stacked irq_chip, interrupt controller driver only needs to deal
248 - with the hardware managed by itself and may ask for services from its
249 - parent irq_chip when needed. So we could achieve a much cleaner
250 - software architecture.
251 -
252 - For an interrupt controller driver to support hierarchy irq_domain, it
253 - needs to:
254 -
255 - 1) Implement irq_domain_ops.alloc and irq_domain_ops.free
256 - 2) Optionally implement irq_domain_ops.activate and
257 - irq_domain_ops.deactivate.
258 - 3) Optionally implement an irq_chip to manage the interrupt controller
259 - hardware.
260 - 4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap,
261 - they are unused with hierarchy irq_domain.
262 -
263 - Hierarchy irq_domain is in no way x86 specific, and is heavily used to
264 - support other architectures, such as ARM, ARM64 etc.
265 -
266 - === Debugging ===
267 -
268 - Most of the internals of the IRQ subsystem are exposed in debugfs by
269 - turning CONFIG_GENERIC_IRQ_DEBUGFS on.
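
Illustration (not part of the commit): the "Linear" reverse map described in the removed text above can be modelled in plain userspace C. This is only a sketch under stated assumptions. Every name here (`toy_domain`, `toy_domain_add_linear`, `toy_create_mapping`, `toy_find_mapping`, `toy_domain_remove`, `TOY_NO_IRQ`) is hypothetical and is not a kernel API. Handing out IRQ numbers from a simple counter stands in for the kernel's descriptor allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define TOY_NO_IRQ 0u /* assumption: 0 means "no mapping exists" */

struct toy_domain {
    unsigned int size;     /* number of hwirqs the table can index */
    unsigned int next_irq; /* next unused Linux-style IRQ number */
    unsigned int *revmap;  /* revmap[hwirq] -> IRQ number or TOY_NO_IRQ */
};

/* Create an empty domain: a fixed-size table with no mappings yet. */
struct toy_domain *toy_domain_add_linear(unsigned int size,
                                         unsigned int first_irq)
{
    struct toy_domain *d;

    if (size == 0 || first_irq == TOY_NO_IRQ)
        return NULL;
    d = malloc(sizeof(*d));
    if (!d)
        return NULL;
    d->revmap = calloc(size, sizeof(*d->revmap));
    if (!d->revmap) {
        free(d);
        return NULL;
    }
    d->size = size;
    d->next_irq = first_irq;
    return d;
}

/* Map a hwirq on first use; later calls return the existing mapping. */
unsigned int toy_create_mapping(struct toy_domain *d, unsigned int hwirq)
{
    assert(d != NULL);
    if (hwirq >= d->size)
        return TOY_NO_IRQ;
    if (d->revmap[hwirq] == TOY_NO_IRQ)
        d->revmap[hwirq] = d->next_irq++;
    return d->revmap[hwirq];
}

/* Constant-time lookup; never allocates, so it fails for unmapped hwirqs. */
unsigned int toy_find_mapping(const struct toy_domain *d, unsigned int hwirq)
{
    assert(d != NULL);
    return hwirq < d->size ? d->revmap[hwirq] : TOY_NO_IRQ;
}

/* Release the table and the domain itself. */
void toy_domain_remove(struct toy_domain *d)
{
    if (d) {
        free(d->revmap);
        free(d);
    }
}
```

In this model, a lookup with `toy_find_mapping()` before any `toy_create_mapping()` call for the same hwirq yields `TOY_NO_IRQ`, which mirrors the text's warning that the mapping must be created at least once before it is looked up. The table costs memory proportional to the largest hwirq, which is the trade-off the text attributes to the linear map.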

Documentation/IRQ.txt Documentation/core-api/irq/concepts.rst

+5 -1

Documentation/Makefile

···
98 98
99 99 pdfdocs: latexdocs
100 100 @$(srctree)/scripts/sphinx-pre-install --version-check
101 - $(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;)
101 + $(foreach var,$(SPHINXDIRS), \
102 + $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit; \
103 + mkdir -p $(BUILDDIR)/$(var)/pdf; \
104 + mv $(subst .tex,.pdf,$(wildcard $(BUILDDIR)/$(var)/latex/*.tex)) $(BUILDDIR)/$(var)/pdf/; \
105 + )
102 106
103 107 endif # HAVE_PDFLATEX
104 108

+19 -15

Documentation/PCI/boot-interrupts.rst

···
32 32 Spurious Interrupts. The IRQ will be disabled by the Linux kernel after it
33 33 reaches a specific count with the error "nobody cared". This disabled IRQ
34 34 now prevents valid usage by an existing interrupt which may happen to share
35 - the IRQ line.
35 + the IRQ line::
36 36
37 37 irq 19: nobody cared (try booting with the "irqpoll" option)
38 38 CPU: 0 PID: 2988 Comm: irq/34-nipalk Tainted: 4.14.87-rt49-02410-g4a640ec-dirty #1
39 39 Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880, BIOS 2.1.5f1 01/09/2020
40 40 Call Trace:
41 +
41 42 <IRQ>
42 43 ? dump_stack+0x46/0x5e
43 44 ? __report_bad_irq+0x2e/0xb0
···
86 85 The mitigations take the form of PCI quirks. The preference has been to
87 86 first identify and make use of a means to disable the routing to the PCH.
88 87 In such a case a quirk to disable boot interrupt generation can be
89 - added.[1]
88 + added. [1]_
90 89
91 - Intel® 6300ESB I/O Controller Hub
90 + Intel® 6300ESB I/O Controller Hub
92 91 Alternate Base Address Register:
93 92 BIE: Boot Interrupt Enable
94 - 0 = Boot interrupt is enabled.
95 - 1 = Boot interrupt is disabled.
96 93
97 - Intel® Sandy Bridge through Sky Lake based Xeon servers:
94 + == ===========================
95 + 0 Boot interrupt is enabled.
96 + 1 Boot interrupt is disabled.
97 + == ===========================
98 +
99 + Intel® Sandy Bridge through Sky Lake based Xeon servers:
98 100 Coherent Interface Protocol Interrupt Control
99 101 dis_intx_route2pch/dis_intx_route2ich/dis_intx_route2dmi2:
100 102 When this bit is set. Local INTx messages received from the
···
113 109 disabled, the Linux kernel will reroute the valid interrupt to its legacy
114 110 interrupt. This redirection of the handler will prevent the occurrence of
115 111 the spurious interrupt detection which would ordinarily disable the IRQ
116 - line due to excessive unhandled counts.[2]
112 + line due to excessive unhandled counts. [2]_
117 113
118 114 The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable (or
119 115 disable) the redirection of the interrupt handler to the PCH interrupt
120 116 line. The option can be overridden by either pci=ioapicreroute or
121 - pci=noioapicreroute.[3]
117 + pci=noioapicreroute. [3]_
122 118
123 119
124 120 More Documentation
···
131 127 Example of disabling of the boot interrupt
132 128 ------------------------------------------
133 129
134 - Intel® 6300ESB I/O Controller Hub (Document # 300641-004US)
130 + - Intel® 6300ESB I/O Controller Hub (Document # 300641-004US)
135 131 5.7.3 Boot Interrupt
136 132 https://www.intel.com/content/dam/doc/datasheet/6300esb-io-controller-hub-datasheet.pdf
137 133
138 - Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families
139 - Datasheet - Volume 2: Registers (Document # 330784-003)
134 + - Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families
135 + Datasheet - Volume 2: Registers (Document # 330784-003)
140 136 6.6.41 cipintrc Coherent Interface Protocol Interrupt Control
141 137 https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf
142 138
143 139 Example of handler rerouting
144 140 ----------------------------
145 141
146 - Intel® 6700PXH 64-bit PCI Hub (Document # 302628)
142 + - Intel® 6700PXH 64-bit PCI Hub (Document # 302628)
147 143 2.15.2 PCI Express Legacy INTx Support and Boot Interrupt
148 144 https://www.intel.com/content/dam/doc/datasheet/6700pxh-64-bit-pci-hub-datasheet.pdf
149 145
···
154 150 Sean V Kelley
155 151 sean.v.kelley@linux.intel.com
156 152
157 - [1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/
158 - [2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/
159 - [3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/
153 + .. [1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/
154 + .. [2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/
155 + .. [3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/

+1 -1

Documentation/admin-guide/acpi/ssdt-overlays.rst

···
63 63 ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords
64 64 AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
65 65
66 - [1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29
66 + [1] https://www.elinux.org/Minnowboard:MinnowMax#Low_Speed_Expansion_.28Top.29
67 67
68 68 The resulting AML code can then be loaded by the kernel using one of the methods
69 69 below.

+31 -22

Documentation/admin-guide/bug-hunting.rst

···
49 49
50 50 Despite being an **Oops** or some other sort of stack trace, the offended
51 51 line is usually required to identify and handle the bug. Along this chapter,
52 - we'll refer to "Oops" for all kinds of stack traces that need to be analized.
52 + we'll refer to "Oops" for all kinds of stack traces that need to be analyzed.
53 53
54 - .. note::
54 + If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the
55 + quality of the stack trace by using file:`scripts/decode_stacktrace.sh`.
55 56
56 - ``ksymoops`` is useless on 2.6 or upper. Please use the Oops in its original
57 - format (from ``dmesg``, etc). Ignore any references in this or other docs to
58 - "decoding the Oops" or "running it through ksymoops".
59 - If you post an Oops from 2.6+ that has been run through ``ksymoops``,
60 - people will just tell you to repost it.
57 + Modules linked in
58 + -----------------
59 +
60 + Modules that are tainted or are being loaded or unloaded are marked with
61 + "(...)", where the taint flags are described in
62 + file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is
63 + annotated with "+", and "being unloaded" is annotated with "-".
64 +
61 65
62 66 Where is the Oops message is located?
63 67 -------------------------------------
···
75 71 Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to
76 72 read the data from the kernel buffers and save it. Or you can
77 73 ``cat /proc/kmsg > file``, however you have to break in to stop the transfer,
78 - ``kmsg`` is a "never ending file".
74 + since ``kmsg`` is a "never ending file".
79 75
80 76 If the machine has crashed so badly that you cannot enter commands or
81 77 the disk is not available then you have three options:
···
85 81 planned for a crash. Alternatively, you can take a picture of
86 82 the screen with a digital camera - not nice, but better than
87 83 nothing. If the messages scroll off the top of the console, you
88 - may find that booting with a higher resolution (eg, ``vga=791``)
84 + may find that booting with a higher resolution (e.g., ``vga=791``)
89 85 will allow you to read more of the text. (Caveat: This needs ``vesafb``,
90 - so won't help for 'early' oopses)
86 + so won't help for 'early' oopses.)
91 87
92 88 (2) Boot with a serial console (see
93 89 :ref:`Documentation/admin-guide/serial-console.rst <serial_console>`),
···
108 104 gdb
109 105 ^^^
110 106
111 - The GNU debug (``gdb``) is the best way to figure out the exact file and line
107 + The GNU debugger (``gdb``) is the best way to figure out the exact file and line
112 108 number of the OOPS from the ``vmlinux`` file.
113 109
114 110 The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``.
···
169 165 [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
170 166 ...
171 167
172 - this shows the problem likely in the :jbd: module. You can load that module
168 + this shows the problem likely is in the :jbd: module. You can load that module
173 169 in gdb and list the relevant code::
174 170
175 171 $ gdb fs/jbd/jbd.ko
···
203 199 You need to be at the top level of the kernel tree for this to pick up
204 200 your C files.
205 201
206 - If you don't have access to the code you can also debug on some crash dumps
207 - e.g. crash dump output as shown by Dave Miller::
202 + If you don't have access to the source code you can still debug some crash
203 + dumps using the following method (example crash dump output as shown by
204 + Dave Miller)::
208 205
209 206 EIP is at +0x14/0x4c0
210 207 ...
···
235 230 mov 0x8(%ebp), %ebx ! %ebx = skb->sk
236 231 mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt
237 232
233 + file:`scripts/decodecode` can be used to automate most of this, depending
234 + on what CPU architecture is being debugged.
235 +
238 236 Reporting the bug
239 237 -----------------
240 238
···
249 241 the ``get_maintainer.pl`` script.
250 242
251 243 For example, if you find a bug at the gspca's sonixj.c file, you can get
252 - their maintainers with::
244 + its maintainers with::
253 245
254 246 $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c
255 247 Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%)
···
261 253
262 254 Please notice that it will point to:
263 255
264 - - The last developers that touched on the source code. On the above example,
265 - Tejun and Bhaktipriya (in this specific case, none really envolved on the
266 - development of this file);
256 + - The last developers that touched the source code (if this is done inside
257 + a git tree). On the above example, Tejun and Bhaktipriya (in this
258 + specific case, none really envolved on the development of this file);
267 259 - The driver maintainer (Hans Verkuil);
268 260 - The subsystem maintainer (Mauro Carvalho Chehab);
269 261 - The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
270 262 - the Linux Kernel mailing list (linux-kernel@vger.kernel.org).
271 263
272 264 Usually, the fastest way to have your bug fixed is to report it to mailing
273 - list used for the development of the code (linux-media ML) copying the driver maintainer (Hans).
265 + list used for the development of the code (linux-media ML) copying the
266 + driver maintainer (Hans).
274 267
275 268 If you are totally stumped as to whom to send the report, and
276 269 ``get_maintainer.pl`` didn't provide you anything useful, send it to
···
312 303 and forwarded to the kernel developers.
313 304
314 305 Two types of address resolution are performed by ``klogd``. The first is
315 - static translation and the second is dynamic translation. Static
316 - translation uses the System.map file in much the same manner that
317 - ksymoops does. In order to do static translation the ``klogd`` daemon
306 + static translation and the second is dynamic translation.
307 + Static translation uses the System.map file.
308 + In order to do static translation the ``klogd`` daemon
318 309 must be able to find a system map file at daemon initialization time.
319 310 See the klogd man page for information on how ``klogd`` searches for map
320 311 files.

+1 -1

Documentation/admin-guide/cpu-load.rst

···
105 105 ----------
106 106
107 107 - http://lkml.org/lkml/2007/2/12/6
108 - - Documentation/filesystems/proc.txt (1.8)
108 + - Documentation/filesystems/proc.rst (1.8)
109 109
110 110
111 111 Thanks

+1 -1

Documentation/admin-guide/hw-vuln/l1tf.rst

···
268 268 /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
269 269 available at:
270 270
271 - https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
271 + https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
272 272
273 273 .. _smt_control:
274 274

+33 -37

Documentation/admin-guide/init.rst

···
1 - Explaining the dreaded "No init found." boot hang message
1 + Explaining the "No working init found." boot hang message
2 2 =========================================================
3 + :Authors: Andreas Mohr <andi at lisas period de>
4 + Cristian Souza <cristianmsbr at gmail period com>
3 5
4 - OK, so you've got this pretty unintuitive message (currently located
5 - in init/main.c) and are wondering what the H*** went wrong.
6 - Some high-level reasons for failure (listed roughly in order of execution)
7 - to load the init binary are:
6 + This document provides some high-level reasons for failure
7 + (listed roughly in order of execution) to load the init binary.
8 8
9 - A) Unable to mount root FS
10 - B) init binary doesn't exist on rootfs
11 - C) broken console device
12 - D) binary exists but dependencies not available
13 - E) binary cannot be loaded
9 + 1) **Unable to mount root FS**: Set "debug" kernel parameter (in bootloader
10 + config file or CONFIG_CMDLINE) to get more detailed kernel messages.
14 11
15 - Detailed explanations:
12 + 2) **init binary doesn't exist on rootfs**: Make sure you have the correct
13 + root FS type (and ``root=`` kernel parameter points to the correct
14 + partition), required drivers such as storage hardware (such as SCSI or
15 + USB!) and filesystem (ext3, jffs2, etc.) are builtin (alternatively as
16 + modules, to be pre-loaded by an initrd).
16 17
17 - A) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE)
18 - to get more detailed kernel messages.
19 - B) make sure you have the correct root FS type
20 - (and ``root=`` kernel parameter points to the correct partition),
21 - required drivers such as storage hardware (such as SCSI or USB!)
22 - and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
23 - to be pre-loaded by an initrd)
24 - C) Possibly a conflict in ``console= setup`` --> initial console unavailable.
25 - E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
26 - missing interrupt-based configuration).
18 + 3) **Broken console device**: Possibly a conflict in ``console= setup``
19 + --> initial console unavailable. E.g. some serial consoles are unreliable
20 + due to serial IRQ issues (e.g. missing interrupt-based configuration).
27 21 Try using a different ``console= device`` or e.g. ``netconsole=``.
28 - D) e.g. required library dependencies of the init binary such as
29 - ``/lib/ld-linux.so.2`` missing or broken. Use
30 - ``readelf -d <INIT>|grep NEEDED`` to find out which libraries are required.
31 - E) make sure the binary's architecture matches your hardware.
32 - E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware.
33 - In case you tried loading a non-binary file here (shell script?),
34 - you should make sure that the script specifies an interpreter in its shebang
35 - header line (``#!/...``) that is fully working (including its library
36 - dependencies). And before tackling scripts, better first test a simple
37 - non-script binary such as ``/bin/sh`` and confirm its successful execution.
38 - To find out more, add code ``to init/main.c`` to display kernel_execve()s
39 - return values.
22 +
23 + 4) **Binary exists but dependencies not available**: E.g. required library
24 + dependencies of the init binary such as ``/lib/ld-linux.so.2`` missing or
25 + broken. Use ``readelf -d <INIT>|grep NEEDED`` to find out which libraries
26 + are required.
27 +
28 + 5) **Binary cannot be loaded**: Make sure the binary's architecture matches
29 + your hardware. E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM
30 + hardware. In case you tried loading a non-binary file here (shell script?),
31 + you should make sure that the script specifies an interpreter in its
32 + shebang header line (``#!/...``) that is fully working (including its
33 + library dependencies). And before tackling scripts, better first test a
34 + simple non-script binary such as ``/bin/sh`` and confirm its successful
35 + execution. To find out more, add code ``to init/main.c`` to display
36 + kernel_execve()s return values.
40 37
41 38 Please extend this explanation whenever you find new failure causes
42 39 (after all loading the init binary is a CRITICAL and hard transition step
43 - which needs to be made as painless as possible), then submit patch to LKML.
40 + which needs to be made as painless as possible), then submit a patch to LKML.
44 41 Further TODOs:
45 42
46 43 - Implement the various ``run_init_process()`` invocations via a struct array
47 44 which can then store the ``kernel_execve()`` result value and on failure
48 45 log it all by iterating over **all** results (very important usability fix).
49 - - try to make the implementation itself more helpful in general,
50 - e.g. by providing additional error messages at affected places.
46 + - Try to make the implementation itself more helpful in general, e.g. by
47 + providing additional error messages at affected places.
51 48
52 - Andreas Mohr <andi at lisas period de>

+1 -1

Documentation/admin-guide/kernel-parameters.txt

···
3336 3336 See Documentation/admin-guide/sysctl/vm.rst for details.
3337 3337
3338 3338 ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
3339 - See Documentation/debugging-via-ohci1394.txt for more
3339 + See Documentation/core-api/debugging-via-ohci1394.rst for more
3340 3340 info.
3341 3341
3342 3342 olpc_ec_timeout= [OLPC] ms delay when issuing EC commands

+1 -1

Documentation/admin-guide/kernel-per-CPU-kthreads.rst

···
10 10 References
11 11 ==========
12 12
13 - - Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
13 + - Documentation/core-api/irq/irq-affinity.rst: Binding interrupts to sets of CPUs.
14 14
15 15 - Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.
16 16

+105 -104

Documentation/admin-guide/mm/userfaultfd.rst

··· 12 12 memory page faults, something otherwise only the kernel code could do. 13 13 14 14 For example userfaults allows a proper and more optimal implementation 15 - of the PROT_NONE+SIGSEGV trick. 15 + of the ``PROT_NONE+SIGSEGV`` trick. 16 16 17 17 Design 18 18 ====== 19 19 20 - Userfaults are delivered and resolved through the userfaultfd syscall. 20 + Userfaults are delivered and resolved through the ``userfaultfd`` syscall. 21 21 22 - The userfaultfd (aside from registering and unregistering virtual 22 + The ``userfaultfd`` (aside from registering and unregistering virtual 23 23 memory ranges) provides two primary functionalities: 24 24 25 - 1) read/POLLIN protocol to notify a userland thread of the faults 25 + 1) ``read/POLLIN`` protocol to notify a userland thread of the faults 26 26 happening 27 27 28 - 2) various UFFDIO_* ioctls that can manage the virtual memory regions 29 - registered in the userfaultfd that allows userland to efficiently 28 + 2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions 29 + registered in the ``userfaultfd`` that allows userland to efficiently 30 30 resolve the userfaults it receives via 1) or to manage the virtual 31 31 memory in the background 32 32 33 33 The real advantage of userfaults if compared to regular virtual memory 34 34 management of mremap/mprotect is that the userfaults in all their 35 35 operations never involve heavyweight structures like vmas (in fact the 36 - userfaultfd runtime load never takes the mmap_sem for writing). 36 + ``userfaultfd`` runtime load never takes the mmap_sem for writing). 37 37 38 38 Vmas are not suitable for page- (or hugepage) granular fault tracking 39 39 when dealing with virtual address spaces that could span 40 40 Terabytes. Too many vmas would be needed for that. 
41 41 42 - The userfaultfd once opened by invoking the syscall, can also be 42 + The ``userfaultfd`` once opened by invoking the syscall, can also be 43 43 passed using unix domain sockets to a manager process, so the same 44 44 manager process could handle the userfaults of a multitude of 45 45 different processes without them being aware about what is going on 46 - (well of course unless they later try to use the userfaultfd 46 + (well of course unless they later try to use the ``userfaultfd`` 47 47 themselves on the same region the manager is already tracking, which 48 - is a corner case that would currently return -EBUSY). 48 + is a corner case that would currently return ``-EBUSY``). 49 49 50 50 API 51 51 === 52 52 53 - When first opened the userfaultfd must be enabled invoking the 54 - UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or 55 - a later API version) which will specify the read/POLLIN protocol 56 - userland intends to speak on the UFFD and the uffdio_api.features 57 - userland requires. The UFFDIO_API ioctl if successful (i.e. if the 58 - requested uffdio_api.api is spoken also by the running kernel and the 53 + When first opened the ``userfaultfd`` must be enabled invoking the 54 + ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or 55 + a later API version) which will specify the ``read/POLLIN`` protocol 56 + userland intends to speak on the ``UFFD`` and the ``uffdio_api.features`` 57 + userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the 58 + requested ``uffdio_api.api`` is spoken also by the running kernel and the 59 59 requested features are going to be enabled) will return into 60 - uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of 60 + ``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of 61 61 respectively all the available features of the read(2) protocol and 62 62 the generic ioctl available. 
63 63 64 - The uffdio_api.features bitmask returned by the UFFDIO_API ioctl 65 - defines what memory types are supported by the userfaultfd and what 64 + The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl 65 + defines what memory types are supported by the ``userfaultfd`` and what 66 66 events, except page fault notifications, may be generated. 67 67 68 - If the kernel supports registering userfaultfd ranges on hugetlbfs 69 - virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in 70 - uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be 71 - set if the kernel supports registering userfaultfd ranges on shared 72 - memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero 73 - MAP_SHARED, memfd_create, etc). 68 + If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs 69 + virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in 70 + ``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be 71 + set if the kernel supports registering ``userfaultfd`` ranges on shared 72 + memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``, 73 + ``MAP_SHARED``, ``memfd_create``, etc). 74 74 75 - The userland application that wants to use userfaultfd with hugetlbfs 75 + The userland application that wants to use ``userfaultfd`` with hugetlbfs 76 76 or shared memory need to set the corresponding flag in 77 - uffdio_api.features to enable those features. 77 + ``uffdio_api.features`` to enable those features. 78 78 79 79 If the userland desires to receive notifications for events other than 80 - page faults, it has to verify that uffdio_api.features has appropriate 81 - UFFD_FEATURE_EVENT_* bits set. These events are described in more 82 - detail below in "Non-cooperative userfaultfd" section. 80 + page faults, it has to verify that ``uffdio_api.features`` has appropriate 81 + ``UFFD_FEATURE_EVENT_*`` bits set. 
These events are described in more 82 + detail below in `Non-cooperative userfaultfd`_ section. 83 83 84 - Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should 85 - be invoked (if present in the returned uffdio_api.ioctls bitmask) to 86 - register a memory range in the userfaultfd by setting the 87 - uffdio_register structure accordingly. The uffdio_register.mode 84 + Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should 85 + be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to 86 + register a memory range in the ``userfaultfd`` by setting the 87 + uffdio_register structure accordingly. The ``uffdio_register.mode`` 88 88 bitmask will specify to the kernel which kind of faults to track for 89 - the range (UFFDIO_REGISTER_MODE_MISSING would track missing 90 - pages). The UFFDIO_REGISTER ioctl will return the 91 - uffdio_register.ioctls bitmask of ioctls that are suitable to resolve 89 + the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing 90 + pages). The ``UFFDIO_REGISTER`` ioctl will return the 91 + ``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve 92 92 userfaults on the range registered. Not all ioctls will necessarily be 93 93 supported for all memory types depending on the underlying virtual 94 94 memory backend (anonymous memory vs tmpfs vs real filebacked 95 95 mappings). 96 96 97 - Userland can use the uffdio_register.ioctls to manage the virtual 97 + Userland can use the ``uffdio_register.ioctls`` to manage the virtual 98 98 address space in the background (to add or potentially also remove 99 - memory from the userfaultfd registered range). This means a userfault 99 + memory from the ``userfaultfd`` registered range). This means a userfault 100 100 could be triggering just before userland maps in the background the 101 101 user-faulted page. 102 102 103 - The primary ioctl to resolve userfaults is UFFDIO_COPY. 
That 103 + The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That 104 104 atomically copies a page into the userfault registered range and wakes 105 - up the blocked userfaults (unless uffdio_copy.mode & 106 - UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to 107 - UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an 108 - half copied page since it'll keep userfaulting until the copy has 109 - finished. 105 + up the blocked userfaults 106 + (unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set). 107 + Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in 108 + guaranteeing that nothing can see an half copied page since it'll 109 + keep userfaulting until the copy has finished. 110 110 111 111 Notes: 112 112 113 - - If you requested UFFDIO_REGISTER_MODE_MISSING when registering then 113 + - If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then 114 114 you must provide some kind of page in your thread after reading from 115 - the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE. 115 + the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``. 116 116 The normal behavior of the OS automatically providing a zero page on 117 117 an annonymous mmaping is not in place. 118 118 ··· 122 122 123 123 - You get the address of the access that triggered the missing page 124 124 event out of a struct uffd_msg that you read in the thread from the 125 - uffd. You can supply as many pages as you want with UFFDIO_COPY or 126 - UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then 125 + uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or 126 + ``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then 127 127 the first of any of those IOCTLs wakes up the faulting thread. 128 128 129 - - Be sure to test for all errors including (pollfd[0].revents & 130 - POLLERR). This can happen, e.g. when ranges supplied were 131 - incorrect. 
129 + - Be sure to test for all errors including 130 + (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges 131 + supplied were incorrect. 132 132 133 133 Write Protect Notifications 134 134 --------------------------- ··· 136 136 This is equivalent to (but faster than) using mprotect and a SIGSEGV 137 137 signal handler. 138 138 139 - Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP. 140 - Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT, 141 - struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP 139 + Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``. 140 + Instead of using mprotect(2) you use 141 + ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)`` 142 + while ``mode = UFFDIO_WRITEPROTECT_MODE_WP`` 142 143 in the struct passed in. The range does not default to and does not 143 144 have to be identical to the range you registered with. You can write 144 145 protect as many ranges as you like (inside the registered range). 145 146 Then, in the thread reading from uffd the struct will have 146 - msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send 147 - ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again 148 - while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set. 149 - This wakes up the thread which will continue to run with writes. This 147 + ``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send 148 + ``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)`` 149 + again while ``pagefault.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP`` 150 + set. This wakes up the thread which will continue to run with writes. This 150 151 allows you to do the bookkeeping about the write in the uffd reading 151 152 thread before the ioctl. 
152 153 153 - If you registered with both UFFDIO_REGISTER_MODE_MISSING and 154 - UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in 154 + If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and 155 + ``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in 155 156 which you supply a page and undo write protect. Note that there is a 156 157 difference between writes into a WP area and into a !WP area. The 157 - former will have UFFD_PAGEFAULT_FLAG_WP set, the latter 158 - UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but 159 - you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was 158 + former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter 159 + ``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but 160 + you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was 160 161 used. 161 162 162 163 QEMU/KVM 163 164 ======== 164 165 165 - QEMU/KVM is using the userfaultfd syscall to implement postcopy live 166 + QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live 166 167 migration. Postcopy live migration is one form of memory 167 168 externalization consisting of a virtual machine running with part or 168 169 all of its memory residing on a different node in the cloud. The 169 - userfaultfd abstraction is generic enough that not a single line of 170 + ``userfaultfd`` abstraction is generic enough that not a single line of 170 171 KVM kernel code had to be modified in order to add postcopy live 171 172 migration to QEMU. 172 173 173 - Guest async page faults, FOLL_NOWAIT and all other GUP features work 174 + Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work 174 175 just fine in combination with userfaults. Userfaults trigger async 175 176 page faults in the guest scheduler so those guest processes that 176 177 aren't waiting for userfaults (i.e. 
network bound) can keep running in ··· 184 183 The implementation of postcopy live migration currently uses one 185 184 single bidirectional socket but in the future two different sockets 186 185 will be used (to reduce the latency of the userfaults to the minimum 187 - possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). 186 + possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``). 188 187 189 188 The QEMU in the source node writes all pages that it knows are missing 190 189 in the destination node, into the socket, and the migration thread of 191 - the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE 192 - ioctls on the userfaultfd in order to map the received pages into the 193 - guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). 190 + the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE`` 191 + ioctls on the ``userfaultfd`` in order to map the received pages into the 192 + guest (``UFFDIO_ZEROCOPY`` is used if the source page was a zero page). 194 193 195 194 A different postcopy thread in the destination node listens with 196 - poll() to the userfaultfd in parallel. When a POLLIN event is 195 + poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is 197 196 generated after a userfault triggers, the postcopy thread read() from 198 - the userfaultfd and receives the fault address (or -EAGAIN in case the 199 - userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run 197 + the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the 198 + userfault was already resolved and waken by a ``UFFDIO_COPY|ZEROPAGE`` run 200 199 by the parallel QEMU migration thread). 
201 200 202 201 After the QEMU postcopy thread (running in the destination node) gets ··· 207 206 (just the time to flush the tcp_wmem queue through the network) the 208 207 migration thread in the QEMU running in the destination node will 209 208 receive the page that triggered the userfault and it'll map it as 210 - usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it 209 + usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it 211 210 was spontaneously sent by the source or if it was an urgent page 212 211 requested through a userfault). 213 212 ··· 220 219 over it when receiving incoming userfaults. After sending each page of 221 220 course the bitmap is updated accordingly. It's also useful to avoid 222 221 sending the same page twice (in case the userfault is read by the 223 - postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration 222 + postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration 224 223 thread). 225 224 226 225 Non-cooperative userfaultfd 227 226 =========================== 228 227 229 - When the userfaultfd is monitored by an external manager, the manager 228 + When the ``userfaultfd`` is monitored by an external manager, the manager 230 229 must be able to track changes in the process virtual memory 231 230 layout. Userfaultfd can notify the manager about such changes using 232 231 the same read(2) protocol as for the page fault notifications. The 233 232 manager has to explicitly enable these events by setting appropriate 234 - bits in uffdio_api.features passed to UFFDIO_API ioctl: 233 + bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl: 235 234 236 - UFFD_FEATURE_EVENT_FORK 237 - enable userfaultfd hooks for fork(). When this feature is 238 - enabled, the userfaultfd context of the parent process is 235 + ``UFFD_FEATURE_EVENT_FORK`` 236 + enable ``userfaultfd`` hooks for fork(). 
When this feature is 237 + enabled, the ``userfaultfd`` context of the parent process is 239 238 duplicated into the newly created process. The manager 240 - receives UFFD_EVENT_FORK with file descriptor of the new 241 - userfaultfd context in the uffd_msg.fork. 239 + receives ``UFFD_EVENT_FORK`` with file descriptor of the new 240 + ``userfaultfd`` context in the ``uffd_msg.fork``. 242 241 243 - UFFD_FEATURE_EVENT_REMAP 242 + ``UFFD_FEATURE_EVENT_REMAP`` 244 243 enable notifications about mremap() calls. When the 245 244 non-cooperative process moves a virtual memory area to a 246 245 different location, the manager will receive 247 - UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and 246 + ``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and 248 247 new addresses of the area and its original length. 249 248 250 - UFFD_FEATURE_EVENT_REMOVE 249 + ``UFFD_FEATURE_EVENT_REMOVE`` 251 250 enable notifications about madvise(MADV_REMOVE) and 252 - madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will 253 - be generated upon these calls to madvise. The uffd_msg.remove 251 + madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will 252 + be generated upon these calls to madvise(). The ``uffd_msg.remove`` 254 253 will contain start and end addresses of the removed area. 255 254 256 - UFFD_FEATURE_EVENT_UNMAP 255 + ``UFFD_FEATURE_EVENT_UNMAP`` 257 256 enable notifications about memory unmapping. The manager will 258 - get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and 257 + get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and 259 258 end addresses of the unmapped area. 260 259 261 - Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP 260 + Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP`` 262 261 are pretty similar, they quite differ in the action expected from the 263 - userfaultfd manager. 
In the former case, the virtual memory is 262 + ``userfaultfd`` manager. In the former case, the virtual memory is 264 263 removed, but the area is not, the area remains monitored by the 265 - userfaultfd, and if a page fault occurs in that area it will be 264 + ``userfaultfd``, and if a page fault occurs in that area it will be 266 265 delivered to the manager. The proper resolution for such page fault is 267 266 to zeromap the faulting address. However, in the latter case, when an 268 267 area is unmapped, either explicitly (with munmap() system call), or 269 268 implicitly (e.g. during mremap()), the area is removed and in turn the 270 - userfaultfd context for such area disappears too and the manager will 269 + ``userfaultfd`` context for such area disappears too and the manager will 271 270 not get further userland page faults from the removed area. Still, the 272 271 notification is required in order to prevent manager from using 273 - UFFDIO_COPY on the unmapped area. 272 + ``UFFDIO_COPY`` on the unmapped area. 274 273 275 274 Unlike userland page faults which have to be synchronous and require 276 275 explicit or implicit wakeup, all the events are delivered 277 276 asynchronously and the non-cooperative process resumes execution as 278 - soon as manager executes read(). The userfaultfd manager should 279 - carefully synchronize calls to UFFDIO_COPY with the events 280 - processing. To aid the synchronization, the UFFDIO_COPY ioctl will 281 - return -ENOSPC when the monitored process exits at the time of 282 - UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed 283 - its virtual memory layout simultaneously with outstanding UFFDIO_COPY 277 + soon as manager executes read(). The ``userfaultfd`` manager should 278 + carefully synchronize calls to ``UFFDIO_COPY`` with the events 279 + processing. 
To aid the synchronization, the ``UFFDIO_COPY`` ioctl will 280 + return ``-ENOSPC`` when the monitored process exits at the time of 281 + ``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed 282 + its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY`` 284 283 operation. 285 284 286 285 The current asynchronous model of the event delivery is optimal for 287 - single threaded non-cooperative userfaultfd manager implementations. A 286 + single threaded non-cooperative ``userfaultfd`` manager implementations. A 288 287 synchronous event delivery model can be added later as a new 289 - userfaultfd feature to facilitate multithreading enhancements of the 290 - non cooperative manager, for example to allow UFFDIO_COPY ioctls to 288 + ``userfaultfd`` feature to facilitate multithreading enhancements of the 289 + non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to 291 290 run in parallel to the event reception. Single threaded 292 291 implementations should continue to use the current async event 293 292 delivery model instead.
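
The userfaultfd text above asks userland to verify that ``uffdio_api.features``, as returned by ``UFFDIO_API``, has the ``UFFD_FEATURE_EVENT_*`` bits it needs. The Python sketch below is not part of the patch and only illustrates that bitmask check. The bit positions are written as I recall them from the kernel's ``userfaultfd.h`` UAPI header and should be treated as assumptions to confirm against your own headers; the helper name `missing_event_features` is hypothetical.

```python
# Assumed bit positions (verify against include/uapi/linux/userfaultfd.h).
UFFD_FEATURE_EVENT_FORK = 1 << 1
UFFD_FEATURE_EVENT_REMAP = 1 << 2
UFFD_FEATURE_EVENT_REMOVE = 1 << 3
UFFD_FEATURE_EVENT_UNMAP = 1 << 6

EVENT_FEATURES = {
    "UFFD_FEATURE_EVENT_FORK": UFFD_FEATURE_EVENT_FORK,
    "UFFD_FEATURE_EVENT_REMAP": UFFD_FEATURE_EVENT_REMAP,
    "UFFD_FEATURE_EVENT_REMOVE": UFFD_FEATURE_EVENT_REMOVE,
    "UFFD_FEATURE_EVENT_UNMAP": UFFD_FEATURE_EVENT_UNMAP,
}


def missing_event_features(returned_features, wanted):
    """Return the names of wanted event bits absent from the returned bitmask."""
    return [
        name
        for name, bit in EVENT_FEATURES.items()
        if wanted & bit and not returned_features & bit
    ]
```

A manager that needs fork and unmap notifications would pass the bitmask it got back from the ioctl together with `UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_UNMAP`, and refuse to run in non-cooperative mode if the returned list is not empty.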

+1 -1

Documentation/admin-guide/nfs/nfsroot.rst

···
18 18 In order to use a diskless system, such as an X-terminal or printer server for
19 19 example, it is necessary for the root filesystem to be present on a non-disk
20 20 device. This may be an initramfs (see
21 - Documentation/filesystems/ramfs-rootfs-initramfs.txt), a ramdisk (see
21 + Documentation/filesystems/ramfs-rootfs-initramfs.rst), a ramdisk (see
22 22 Documentation/admin-guide/initrd.rst) or a filesystem mounted via NFS. The
23 23 following text describes on how to use NFS for the root filesystem. For the rest
24 24 of this text 'client' means the diskless system, and 'server' means the NFS

+28 -3

Documentation/admin-guide/numastat.rst

···
6 6
7 7 All units are pages. Hugepages have separate counters.
8 8
9 + The numa_hit, numa_miss and numa_foreign counters reflect how well processes
10 + are able to allocate memory from nodes they prefer. If they succeed, numa_hit
11 + is incremented on the preferred node, otherwise numa_foreign is incremented on
12 + the preferred node and numa_miss on the node where allocation succeeded.
13 +
14 + Usually preferred node is the one local to the CPU where the process executes,
15 + but restrictions such as mempolicies can change that, so there are also two
16 + counters based on CPU local node. local_node is similar to numa_hit and is
17 + incremented on allocation from a node by CPU on the same node. other_node is
18 + similar to numa_miss and is incremented on the node where allocation succeeds
19 + from a CPU from a different node. Note there is no counter analogical to
20 + numa_foreign.
21 +
22 + In more detail:
23 +
9 24 =============== ============================================================
10 25 numa_hit        A process wanted to allocate memory from this node,
11 26                 and succeeded.
···
29 14                 but ended up with memory from this node.
30 15
31 16 numa_foreign    A process wanted to allocate on this node,
32 -                 but ended up with memory from another one.
17 +                 but ended up with memory from another node.
33 18
34 - local_node      A process ran on this node and got memory from it.
19 + local_node      A process ran on this node's CPU,
20 +                 and got memory from this node.
35 21
36 - other_node      A process ran on this node and got memory from another node.
22 + other_node      A process ran on a different node's CPU
23 +                 and got memory from this node.
37 24
38 25 interleave_hit  Interleaving wanted to allocate from this node
39 26                 and succeeded.
···
45 28 (http://oss.sgi.com/projects/libnuma/). Note that it only works
46 29 well right now on machines with a small number of CPUs.
47 30
31 + Note that on systems with memoryless nodes (where a node has CPUs but no
32 + memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
33 + heavily. In the current kernel implementation, if a process prefers a
34 + memoryless node (i.e. because it is running on one of its local CPU), the
35 + implementation actually treats one of the nearest nodes with memory as the
36 + preferred node. As a result, such allocation will not increase the numa_foreign
37 + counter on the memoryless node, and will skew the numa_hit, numa_miss and
38 + numa_foreign statistics of the nearest node.
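
The counter rules added to numastat.rst above can be restated as a small model. The Python sketch below is not kernel code and is not part of the patch; it only mirrors the prose: which node gets which counter for a single page allocation, given the preferred node, the node that actually supplied the memory, and the node of the CPU the process runs on. The function name `account_allocation` and the `stats` layout are hypothetical.

```python
from collections import Counter, defaultdict


def account_allocation(stats, preferred, allocated, cpu_node):
    """Update per-node counters for one allocation, following the text above.

    stats maps a node id to a Counter of numastat-style counter names.
    """
    if allocated == preferred:
        # The process got memory from the node it preferred.
        stats[allocated]["numa_hit"] += 1
    else:
        # Miss is counted where the memory came from, foreign where it was wanted.
        stats[allocated]["numa_miss"] += 1
        stats[preferred]["numa_foreign"] += 1
    if allocated == cpu_node:
        # Memory came from the node local to the running CPU.
        stats[allocated]["local_node"] += 1
    else:
        # Memory came from this node, but the CPU belongs to another node.
        stats[allocated]["other_node"] += 1


stats = defaultdict(Counter)
# A process on node 0 prefers node 0 but ends up with memory from node 1.
account_allocation(stats, preferred=0, allocated=1, cpu_node=0)
```

After this single call, node 1 has `numa_miss` and `other_node` incremented and node 0 has `numa_foreign` incremented. The model deliberately has no counterpart to `numa_foreign` on the CPU-local side, matching the note in the text, and it ignores the memoryless-node special case described at the end of the hunk.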

+15 -13

Documentation/admin-guide/ras.rst

···
156 156 ECC memory
157 157 ----------
158 158
159 - As mentioned on the previous section, ECC memory has extra bits to be
160 - used for error correction. So, on 64 bit systems, a memory module
161 - has 64 bits of *data width*, and 74 bits of *total width*. So, there are
162 - 8 bits extra bits to be used for the error detection and correction
163 - mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.
159 + As mentioned in the previous section, ECC memory has extra bits to be
160 + used for error correction. In the above example, a memory module has
161 + 64 bits of *data width*, and 72 bits of *total width*. The extra 8
162 + bits which are used for the error detection and correction mechanisms
163 + are referred to as the *syndrome*\ [#f1]_\ [#f2]_.
164 164
165 165 So, when the cpu requests the memory controller to write a word with
166 166 *data width*, the memory controller calculates the *syndrome* in real time,
···
212 212 purposes.
213 213
214 214 When the subsystem was pushed upstream for the first time, on
215 - Kernel 2.6.16, for the first time, it was renamed to ``EDAC``.
215 + Kernel 2.6.16, it was renamed to ``EDAC``.
216 216
217 217 Purpose
218 218 -------
···
351 351 +------------+-----------+-----------+
352 352 |            | ``ch0``   | ``ch1``   |
353 353 +============+===========+===========+
354 - | ``csrow0`` | DIMM_A0   | DIMM_B0   |
355 - |            | rank0     | rank0     |
356 - +------------+    -      |    -      |
354 + |            |**DIMM_A0**|**DIMM_B0**|
355 + +------------+-----------+-----------+
356 + | ``csrow0`` | rank0     | rank0     |
357 + +------------+-----------+-----------+
357 358 | ``csrow1`` | rank1     | rank1     |
358 359 +------------+-----------+-----------+
359 - | ``csrow2`` | DIMM_A1   | DIMM_B1   |
360 - |            | rank0     | rank0     |
361 - +------------+    -      |    -      |
362 - | ``csrow3`` | rank1     | rank1     |
360 + |            |**DIMM_A1**|**DIMM_B1**|
361 + +------------+-----------+-----------+
362 + | ``csrow2`` | rank0     | rank0     |
363 + +------------+-----------+-----------+
364 + | ``csrow3`` | rank1     | rank1     |
363 365 +------------+-----------+-----------+
364 366
365 367 In the above example, there are 4 physical slots on the motherboard
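
The ECC memory hunk in this file corrects the *total width* from 74 to 72 bits, i.e. 64 data bits plus 8 extra bits. One common way to arrive at exactly 8 extra bits is a single-error-correcting, double-error-detecting (SECDED) Hamming code. Whether a given memory controller uses precisely this code is an assumption here, not something the patch states. The Python sketch below only works through that arithmetic; the function name `secded_check_bits` is hypothetical.

```python
def secded_check_bits(data_bits):
    """Number of check bits for a SECDED Hamming code over data_bits of data."""
    r = 0
    # Single-error correction needs the smallest r with 2**r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One additional overall parity bit adds double-error detection.
    return r + 1


data_width = 64
total_width = data_width + secded_check_bits(data_width)
```

Under this assumption, `total_width` comes out as 72, which agrees with the corrected figure in the hunk.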

+156 -1

Documentation/admin-guide/sysctl/kernel.rst

··· 102 102 :doc:`/x86/boot` for additional information. 103 103 104 104 105 + bpf_stats_enabled 106 + ================= 107 + 108 + Controls whether the kernel should collect statistics on BPF programs 109 + (total time spent running, number of times run...). Enabling 110 + statistics causes a slight reduction in performance on each program 111 + run. The statistics can be seen using ``bpftool``. 112 + 113 + = =================================== 114 + 0 Don't collect statistics (default). 115 + 1 Collect statistics. 116 + = =================================== 117 + 118 + 119 + cad_pid 120 + ======= 121 + 122 + This is the pid which will be signalled on reboot (notably, by 123 + Ctrl-Alt-Delete). Writing a value to this file which doesn't 124 + correspond to a running process will result in ``-ESRCH``. 125 + 126 + See also `ctrl-alt-del`_. 127 + 128 + 105 129 cap_last_cap 106 130 ============ 107 131 ··· 265 241 see the ``hostname(1)`` man page. 266 242 267 243 244 + firmware_config 245 + =============== 246 + 247 + See :doc:`/driver-api/firmware/fallback-mechanisms`. 248 + 249 + The entries in this directory allow the firmware loader helper 250 + fallback to be controlled: 251 + 252 + * ``force_sysfs_fallback``, when set to 1, forces the use of the 253 + fallback; 254 + * ``ignore_sysfs_fallback``, when set to 1, ignores any fallback. 255 + 256 + 257 + ftrace_dump_on_oops 258 + =================== 259 + 260 + Determines whether ``ftrace_dump()`` should be called on an oops (or 261 + kernel panic). This will output the contents of the ftrace buffers to 262 + the console. This is very useful for capturing traces that lead to 263 + crashes and outputting them to a serial console. 264 + 265 + = =================================================== 266 + 0 Disabled (default). 267 + 1 Dump buffers of all CPUs. 268 + 2 Dump the buffer of the CPU that triggered the oops. 
269 + = =================================================== 270 + 271 + 272 + ftrace_enabled, stack_tracer_enabled 273 + ==================================== 274 + 275 + See :doc:`/trace/ftrace`. 276 + 277 + 268 278 hardlockup_all_cpu_backtrace 269 279 ============================ 270 280 ··· 400 342 0 Do not report panic kmsg data. 401 343 1 Report the panic kmsg data. This is the default behavior. 402 344 = ========================================================= 345 + 346 + 347 + ignore-unaligned-usertrap 348 + ========================= 349 + 350 + On architectures where unaligned accesses cause traps, and where this 351 + feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``; 352 + currently, ``arc`` and ``ia64``), controls whether all unaligned traps 353 + are logged. 354 + 355 + = ============================================================= 356 + 0 Log all unaligned accesses. 357 + 1 Only warn the first time a process traps. This is the default 358 + setting. 359 + = ============================================================= 360 + 361 + See also `unaligned-trap`_ and `unaligned-dump-stack`_. On ``ia64``, 362 + this allows system administrators to override the 363 + ``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded. 403 364 404 365 405 366 kexec_load_disabled ··· 535 458 2) Toggle with non-default value will be set back to -1 by kernel after 536 459 successful IPC object allocation. If an IPC object allocation syscall 537 460 fails, it is undefined if the value remains unmodified or is reset to -1. 461 + 462 + 463 + ngroups_max 464 + =========== 465 + 466 + Maximum number of supplementary groups, _i.e._ the maximum size which 467 + ``setgroups`` will accept. Exports ``NGROUPS_MAX`` from the kernel. 468 + 469 + 538 470 539 471 nmi_watchdog 540 472 ============ ··· 963 877 pty 964 878 === 965 879 966 - See Documentation/filesystems/devpts.txt. 880 + See Documentation/filesystems/devpts.rst. 
967 881 968 882 969 883 randomize_va_space ··· 1259 1173 ``EINVAL`` error occurs. 1260 1174 1261 1175 1176 + traceoff_on_warning 1177 + =================== 1178 + 1179 + When set, disables tracing (see :doc:`/trace/ftrace`) when a 1180 + ``WARN()`` is hit. 1181 + 1182 + 1183 + tracepoint_printk 1184 + ================= 1185 + 1186 + When tracepoints are sent to printk() (enabled by the ``tp_printk`` 1187 + boot parameter), this entry provides runtime control:: 1188 + 1189 + echo 0 > /proc/sys/kernel/tracepoint_printk 1190 + 1191 + will stop tracepoints from being sent to printk(), and:: 1192 + 1193 + echo 1 > /proc/sys/kernel/tracepoint_printk 1194 + 1195 + will send them to printk() again. 1196 + 1197 + This only works if the kernel was booted with ``tp_printk`` enabled. 1198 + 1199 + See :doc:`/admin-guide/kernel-parameters` and 1200 + :doc:`/trace/boottime-trace`. 1201 + 1202 + 1203 + .. _unaligned-dump-stack: 1204 + 1205 + unaligned-dump-stack (ia64) 1206 + =========================== 1207 + 1208 + When logging unaligned accesses, controls whether the stack is 1209 + dumped. 1210 + 1211 + = =================================================== 1212 + 0 Do not dump the stack. This is the default setting. 1213 + 1 Dump the stack. 1214 + = =================================================== 1215 + 1216 + See also `ignore-unaligned-usertrap`_. 1217 + 1218 + 1219 + unaligned-trap 1220 + ============== 1221 + 1222 + On architectures where unaligned accesses cause traps, and where this 1223 + feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently, 1224 + ``arc`` and ``parisc``), controls whether unaligned traps are caught 1225 + and emulated (instead of failing). 1226 + 1227 + = ======================================================== 1228 + 0 Do not emulate unaligned accesses. 1229 + 1 Emulate unaligned accesses. This is the default setting. 
1230 + = ======================================================== 1231 + 1232 + See also `ignore-unaligned-usertrap`_. 1233 + 1234 + 1262 1235 unknown_nmi_panic 1263 1236 ================= 1264 1237 ··· 1327 1182 1328 1183 NMI switch that most IA32 servers have fires unknown NMI up, for 1329 1184 example. If a system hangs up, try pressing the NMI switch. 1185 + 1186 + 1187 + unprivileged_bpf_disabled 1188 + ========================= 1189 + 1190 + Writing 1 to this entry will disable unprivileged calls to ``bpf()``; 1191 + once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` will return 1192 + ``-EPERM``. 1193 + 1194 + Once set, this can't be cleared. 1330 1195 1331 1196 1332 1197 watchdog

+21 -21

Documentation/arm64/amu.rst

··· 24 24 Version 1 of the Activity Monitors architecture implements a counter group 25 25 of four fixed and architecturally defined 64-bit event counters. 26 26 27 - - CPU cycle counter: increments at the frequency of the CPU. 28 - - Constant counter: increments at the fixed frequency of the system 29 - clock. 30 - - Instructions retired: increments with every architecturally executed 31 - instruction. 32 - - Memory stall cycles: counts instruction dispatch stall cycles caused by 33 - misses in the last level cache within the clock domain. 27 + - CPU cycle counter: increments at the frequency of the CPU. 28 + - Constant counter: increments at the fixed frequency of the system 29 + clock. 30 + - Instructions retired: increments with every architecturally executed 31 + instruction. 32 + - Memory stall cycles: counts instruction dispatch stall cycles caused by 33 + misses in the last level cache within the clock domain. 34 34 35 35 When in WFI or WFE these counters do not increment. 36 36 ··· 59 59 Firmware (code running at higher exception levels, e.g. arm-tf) support is 60 60 needed to: 61 61 62 - - Enable access for lower exception levels (EL2 and EL1) to the AMU 63 - registers. 64 - - Enable the counters. If not enabled these will read as 0. 65 - - Save/restore the counters before/after the CPU is being put/brought up 66 - from the 'off' power state. 62 + - Enable access for lower exception levels (EL2 and EL1) to the AMU 63 + registers. 64 + - Enable the counters. If not enabled these will read as 0. 65 + - Save/restore the counters before/after the CPU is being put/brought up 66 + from the 'off' power state. 
67 67 68 68 When using kernels that have this feature enabled but boot with broken 69 69 firmware the user may experience panics or lockups when accessing the ··· 81 81 The fixed counters of AMUv1 are accessible though the following system 82 82 register definitions: 83 83 84 - - SYS_AMEVCNTR0_CORE_EL0 85 - - SYS_AMEVCNTR0_CONST_EL0 86 - - SYS_AMEVCNTR0_INST_RET_EL0 87 - - SYS_AMEVCNTR0_MEM_STALL_EL0 84 + - SYS_AMEVCNTR0_CORE_EL0 85 + - SYS_AMEVCNTR0_CONST_EL0 86 + - SYS_AMEVCNTR0_INST_RET_EL0 87 + - SYS_AMEVCNTR0_MEM_STALL_EL0 88 88 89 89 Auxiliary platform specific counters can be accessed using 90 90 SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15. ··· 97 97 98 98 Currently, access from userspace to the AMU registers is disabled due to: 99 99 100 - - Security reasons: they might expose information about code executed in 101 - secure mode. 102 - - Purpose: AMU counters are intended for system management use. 100 + - Security reasons: they might expose information about code executed in 101 + secure mode. 102 + - Purpose: AMU counters are intended for system management use. 103 103 104 104 Also, the presence of the feature is not visible to userspace. 105 105 ··· 110 110 Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM 111 111 guest side is disabled due to: 112 112 113 - - Security reasons: they might expose information about code executed 114 - by other guests or the host. 113 + - Security reasons: they might expose information about code executed 114 + by other guests or the host. 115 115 116 116 Any attempt to access the AMU registers will result in an UNDEFINED 117 117 exception being injected into the guest.

+22 -14

Documentation/arm64/booting.rst

··· 173 173 - Caches, MMUs 174 174 175 175 The MMU must be off. 176 + 176 177 The instruction cache may be on or off, and must not hold any stale 177 178 entries corresponding to the loaded kernel image. 179 + 178 180 The address range corresponding to the loaded kernel image must be 179 181 cleaned to the PoC. In the presence of a system cache or other 180 182 coherent masters with caches enabled, this will typically require ··· 241 239 - The DT or ACPI tables must describe a GICv2 interrupt controller. 242 240 243 241 For CPUs with pointer authentication functionality: 242 + 244 243 - If EL3 is present: 245 244 246 245 - SCR_EL3.APK (bit 16) must be initialised to 0b1 ··· 253 250 - HCR_EL2.API (bit 41) must be initialised to 0b1 254 251 255 252 For CPUs with Activity Monitors Unit v1 (AMUv1) extension present: 253 + 256 254 - If EL3 is present: 257 - CPTR_EL3.TAM (bit 30) must be initialised to 0b0 258 - CPTR_EL2.TAM (bit 30) must be initialised to 0b0 259 - AMCNTENSET0_EL0 must be initialised to 0b1111 260 - AMCNTENSET1_EL0 must be initialised to a platform specific value 261 - having 0b1 set for the corresponding bit for each of the auxiliary 262 - counters present. 255 + 256 + - CPTR_EL3.TAM (bit 30) must be initialised to 0b0 257 + - CPTR_EL2.TAM (bit 30) must be initialised to 0b0 258 + - AMCNTENSET0_EL0 must be initialised to 0b1111 259 + - AMCNTENSET1_EL0 must be initialised to a platform specific value 260 + having 0b1 set for the corresponding bit for each of the auxiliary 261 + counters present. 262 + 263 263 - If the kernel is entered at EL1: 264 - AMCNTENSET0_EL0 must be initialised to 0b1111 265 - AMCNTENSET1_EL0 must be initialised to a platform specific value 266 - having 0b1 set for the corresponding bit for each of the auxiliary 267 - counters present. 
264 + 265 + - AMCNTENSET0_EL0 must be initialised to 0b1111 266 + - AMCNTENSET1_EL0 must be initialised to a platform specific value 267 + having 0b1 set for the corresponding bit for each of the auxiliary 268 + counters present. 268 269 269 270 The requirements described above for CPU mode, caches, MMUs, architected 270 271 timers, coherency and system registers apply to all CPUs. All CPUs must ··· 312 305 Documentation/devicetree/bindings/arm/psci.yaml. 313 306 314 307 - Secondary CPU general-purpose register settings 315 - x0 = 0 (reserved for future use) 316 - x1 = 0 (reserved for future use) 317 - x2 = 0 (reserved for future use) 318 - x3 = 0 (reserved for future use) 308 + 309 + - x0 = 0 (reserved for future use) 310 + - x1 = 0 (reserved for future use) 311 + - x2 = 0 (reserved for future use) 312 + - x3 = 0 (reserved for future use)

-38

Documentation/conf.py

··· 388 388 # author, documentclass [howto, manual, or own class]). 389 389 # Sorted in alphabetical order 390 390 latex_documents = [ 391 - ('admin-guide/index', 'linux-user.tex', 'Linux Kernel User Documentation', 392 - 'The kernel development community', 'manual'), 393 - ('core-api/index', 'core-api.tex', 'The kernel core API manual', 394 - 'The kernel development community', 'manual'), 395 - ('crypto/index', 'crypto-api.tex', 'Linux Kernel Crypto API manual', 396 - 'The kernel development community', 'manual'), 397 - ('dev-tools/index', 'dev-tools.tex', 'Development tools for the Kernel', 398 - 'The kernel development community', 'manual'), 399 - ('doc-guide/index', 'kernel-doc-guide.tex', 'Linux Kernel Documentation Guide', 400 - 'The kernel development community', 'manual'), 401 - ('driver-api/index', 'driver-api.tex', 'The kernel driver API manual', 402 - 'The kernel development community', 'manual'), 403 - ('filesystems/index', 'filesystems.tex', 'Linux Filesystems API', 404 - 'The kernel development community', 'manual'), 405 - ('admin-guide/ext4', 'ext4-admin-guide.tex', 'ext4 Administration Guide', 406 - 'ext4 Community', 'manual'), 407 - ('filesystems/ext4/index', 'ext4-data-structures.tex', 408 - 'ext4 Data Structures and Algorithms', 'ext4 Community', 'manual'), 409 - ('gpu/index', 'gpu.tex', 'Linux GPU Driver Developer\'s Guide', 410 - 'The kernel development community', 'manual'), 411 - ('input/index', 'linux-input.tex', 'The Linux input driver subsystem', 412 - 'The kernel development community', 'manual'), 413 - ('kernel-hacking/index', 'kernel-hacking.tex', 'Unreliable Guide To Hacking The Linux Kernel', 414 - 'The kernel development community', 'manual'), 415 - ('media/index', 'media.tex', 'Linux Media Subsystem Documentation', 416 - 'The kernel development community', 'manual'), 417 - ('networking/index', 'networking.tex', 'Linux Networking Documentation', 418 - 'The kernel development community', 'manual'), 419 - ('process/index', 
'development-process.tex', 'Linux Kernel Development Documentation', 420 - 'The kernel development community', 'manual'), 421 - ('security/index', 'security.tex', 'The kernel security subsystem manual', 422 - 'The kernel development community', 'manual'), 423 - ('sh/index', 'sh.tex', 'SuperH architecture implementation manual', 424 - 'The kernel development community', 'manual'), 425 - ('sound/index', 'sound.tex', 'Linux Sound Subsystem Documentation', 426 - 'The kernel development community', 'manual'), 427 - ('userspace-api/index', 'userspace-api.tex', 'The Linux kernel user-space API guide', 428 - 'The kernel development community', 'manual'), 429 391 ] 430 392 431 393 # Add all other index files from Documentation/ subdirectories

+9

Documentation/core-api/index.rst

··· 18 18 19 19 kernel-api 20 20 workqueue 21 + printk-basics 21 22 printk-formats 22 23 symbol-namespaces 23 24 ··· 31 30 :maxdepth: 1 32 31 33 32 kobject 33 + kref 34 34 assoc_array 35 35 xarray 36 36 idr 37 37 circular-buffers 38 + rbtree 38 39 generic-radix-tree 39 40 packing 40 41 timekeeping ··· 53 50 54 51 atomic_ops 55 52 refcount-vs-atomic 53 + irq/index 56 54 local_ops 57 55 padata 58 56 ../RCU/index ··· 82 78 :maxdepth: 1 83 79 84 80 memory-allocation 81 + dma-api 82 + dma-api-howto 83 + dma-attributes 84 + dma-isa-lpc 85 85 mm-api 86 86 genalloc 87 87 pin_user_pages ··· 100 92 101 93 debug-objects 102 94 tracepoint 95 + debugging-via-ohci1394 103 96 104 97 Everything else 105 98 ===============

+11

Documentation/core-api/irq/index.rst

··· 1 + ==== 2 + IRQs 3 + ==== 4 + 5 + .. toctree:: 6 + :maxdepth: 1 7 + 8 + concepts 9 + irq-affinity 10 + irq-domain 11 + irqflags-tracing

+270

Documentation/core-api/irq/irq-domain.rst

··· 1 + =============================================== 2 + The irq_domain interrupt number mapping library 3 + =============================================== 4 + 5 + The current design of the Linux kernel uses a single large number 6 + space where each separate IRQ source is assigned a different number. 7 + This is simple when there is only one interrupt controller, but in 8 + systems with multiple interrupt controllers the kernel must ensure 9 + that each one gets assigned non-overlapping allocations of Linux 10 + IRQ numbers. 11 + 12 + The number of interrupt controllers registered as unique irqchips 13 + shows a rising tendency: for example subdrivers of different kinds 14 + such as GPIO controllers avoid reimplementing identical callback 15 + mechanisms as the IRQ core system by modelling their interrupt 16 + handlers as irqchips, i.e. in effect cascading interrupt controllers. 17 + 18 + Here the interrupt numbers lose all kind of correspondence to 19 + hardware interrupt numbers: whereas in the past, IRQ numbers could 20 + be chosen so they matched the hardware IRQ line into the root 21 + interrupt controller (i.e. the component actually firing the 22 + interrupt line to the CPU) nowadays this number is just a number. 23 + 24 + For this reason we need a mechanism to separate controller-local 25 + interrupt numbers, called hardware irq's, from Linux IRQ numbers. 26 + 27 + The irq_alloc_desc*() and irq_free_desc*() APIs provide allocation of 28 + irq numbers, but they don't provide any support for reverse mapping of 29 + the controller-local IRQ (hwirq) number into the Linux IRQ number 30 + space. 31 + 32 + The irq_domain library adds mapping between hwirq and IRQ numbers on 33 + top of the irq_alloc_desc*() API. An irq_domain to manage mapping is 34 + preferred over interrupt controller drivers open coding their own 35 + reverse mapping scheme. 
36 + 37 + irq_domain also implements translation from an abstract irq_fwspec 38 + structure to hwirq numbers (Device Tree and ACPI GSI so far), and can 39 + be easily extended to support other IRQ topology data sources. 40 + 41 + irq_domain usage 42 + ================ 43 + 44 + An interrupt controller driver creates and registers an irq_domain by 45 + calling one of the irq_domain_add_*() functions (each mapping method 46 + has a different allocator function, more on that later). The function 47 + will return a pointer to the irq_domain on success. The caller must 48 + provide the allocator function with an irq_domain_ops structure. 49 + 50 + In most cases, the irq_domain will begin empty without any mappings 51 + between hwirq and IRQ numbers. Mappings are added to the irq_domain 52 + by calling irq_create_mapping() which accepts the irq_domain and a 53 + hwirq number as arguments. If a mapping for the hwirq doesn't already 54 + exist then it will allocate a new Linux irq_desc, associate it with 55 + the hwirq, and call the .map() callback so the driver can perform any 56 + required hardware setup. 57 + 58 + When an interrupt is received, the irq_find_mapping() function should 59 + be used to find the Linux IRQ number from the hwirq number. 60 + 61 + The irq_create_mapping() function must be called *at least once* 62 + before any call to irq_find_mapping(), otherwise the descriptor will not 63 + be allocated. 64 + 65 + If the driver has the Linux IRQ number or the irq_data pointer, and 66 + needs to know the associated hwirq number (such as in the irq_chip 67 + callbacks) then it can be directly obtained from irq_data->hwirq. 68 + 69 + Types of irq_domain mappings 70 + ============================ 71 + 72 + There are several mechanisms available for reverse mapping from hwirq 73 + to Linux irq, and each mechanism uses a different allocation function. 74 + Which reverse map type should be used depends on the use case. 
Each 75 + of the reverse map types are described below: 76 + 77 + Linear 78 + ------ 79 + 80 + :: 81 + 82 + irq_domain_add_linear() 83 + irq_domain_create_linear() 84 + 85 + The linear reverse map maintains a fixed size table indexed by the 86 + hwirq number. When a hwirq is mapped, an irq_desc is allocated for 87 + the hwirq, and the IRQ number is stored in the table. 88 + 89 + The Linear map is a good choice when the maximum number of hwirqs is 90 + fixed and a relatively small number (~ < 256). The advantages of this 91 + map are fixed time lookup for IRQ numbers, and irq_descs are only 92 + allocated for in-use IRQs. The disadvantage is that the table must be 93 + as large as the largest possible hwirq number. 94 + 95 + irq_domain_add_linear() and irq_domain_create_linear() are functionally 96 + equivalent, except for the first argument is different - the former 97 + accepts an Open Firmware specific 'struct device_node', while the latter 98 + accepts a more general abstraction 'struct fwnode_handle'. 99 + 100 + The majority of drivers should use the linear map. 101 + 102 + Tree 103 + ---- 104 + 105 + :: 106 + 107 + irq_domain_add_tree() 108 + irq_domain_create_tree() 109 + 110 + The irq_domain maintains a radix tree map from hwirq numbers to Linux 111 + IRQs. When an hwirq is mapped, an irq_desc is allocated and the 112 + hwirq is used as the lookup key for the radix tree. 113 + 114 + The tree map is a good choice if the hwirq number can be very large 115 + since it doesn't need to allocate a table as large as the largest 116 + hwirq number. The disadvantage is that hwirq to IRQ number lookup is 117 + dependent on how many entries are in the table. 118 + 119 + irq_domain_add_tree() and irq_domain_create_tree() are functionally 120 + equivalent, except for the first argument is different - the former 121 + accepts an Open Firmware specific 'struct device_node', while the latter 122 + accepts a more general abstraction 'struct fwnode_handle'. 
123 + 124 + Very few drivers should need this mapping. 125 + 126 + No Map 127 + ------ 128 + 129 + :: 130 + 131 + irq_domain_add_nomap() 132 + 133 + The No Map mapping is to be used when the hwirq number is 134 + programmable in the hardware. In this case it is best to program the 135 + Linux IRQ number into the hardware itself so that no mapping is 136 + required. Calling irq_create_direct_mapping() will allocate a Linux 137 + IRQ number and call the .map() callback so that the driver can program the 138 + Linux IRQ number into the hardware. 139 + 140 + Most drivers cannot use this mapping. 141 + 142 + Legacy 143 + ------ 144 + 145 + :: 146 + 147 + irq_domain_add_simple() 148 + irq_domain_add_legacy() 149 + irq_domain_add_legacy_isa() 150 + 151 + The Legacy mapping is a special case for drivers that already have a 152 + range of irq_descs allocated for the hwirqs. It is used when the 153 + driver cannot be immediately converted to use the linear mapping. For 154 + example, many embedded system board support files use a set of #defines 155 + for IRQ numbers that are passed to struct device registrations. In that 156 + case the Linux IRQ numbers cannot be dynamically assigned and the legacy 157 + mapping should be used. 158 + 159 + The legacy map assumes a contiguous range of IRQ numbers has already 160 + been allocated for the controller and that the IRQ number can be 161 + calculated by adding a fixed offset to the hwirq number, and 162 + vice versa. The disadvantage is that it requires the interrupt 163 + controller to manage IRQ allocations and it requires an irq_desc to be 164 + allocated for every hwirq, even if it is unused. 165 + 166 + The legacy map should only be used if fixed IRQ mappings must be 167 + supported. For example, ISA controllers would use the legacy map for 168 + mapping Linux IRQs 0-15 so that existing ISA drivers get the correct IRQ 169 + numbers. 
170 + 171 + Most users of legacy mappings should use irq_domain_add_simple() which 172 + will use a legacy domain only if an IRQ range is supplied by the 173 + system and will otherwise use a linear domain mapping. The semantics 174 + of this call are such that if an IRQ range is specified then 175 + descriptors will be allocated on-the-fly for it, and if no range is 176 + specified it will fall through to irq_domain_add_linear() which means 177 + *no* irq descriptors will be allocated. 178 + 179 + A typical use case for simple domains is where an irqchip provider 180 + is supporting both dynamic and static IRQ assignments. 181 + 182 + In order to avoid ending up in a situation where a linear domain is 183 + used and no descriptor gets allocated it is very important to make sure 184 + that the driver using the simple domain call irq_create_mapping() 185 + before any irq_find_mapping() since the latter will actually work 186 + for the static IRQ assignment case. 187 + 188 + Hierarchy IRQ domain 189 + -------------------- 190 + 191 + On some architectures, there may be multiple interrupt controllers 192 + involved in delivering an interrupt from the device to the target CPU. 193 + Let's look at a typical interrupt delivering path on x86 platforms:: 194 + 195 + Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU 196 + 197 + There are three interrupt controllers involved: 198 + 199 + 1) IOAPIC controller 200 + 2) Interrupt remapping controller 201 + 3) Local APIC controller 202 + 203 + To support such a hardware topology and make software architecture match 204 + hardware architecture, an irq_domain data structure is built for each 205 + interrupt controller and those irq_domains are organized into hierarchy. 206 + When building irq_domain hierarchy, the irq_domain near to the device is 207 + child and the irq_domain near to CPU is parent. 
So a hierarchy structure 208 + as below will be built for the example above:: 209 + 210 + CPU Vector irq_domain (root irq_domain to manage CPU vectors) 211 + ^ 212 + | 213 + Interrupt Remapping irq_domain (manage irq_remapping entries) 214 + ^ 215 + | 216 + IOAPIC irq_domain (manage IOAPIC delivery entries/pins) 217 + 218 + There are four major interfaces to use hierarchy irq_domain: 219 + 220 + 1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt 221 + controller related resources to deliver these interrupts. 222 + 2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller 223 + related resources associated with these interrupts. 224 + 3) irq_domain_activate_irq(): activate interrupt controller hardware to 225 + deliver the interrupt. 226 + 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware 227 + to stop delivering the interrupt. 228 + 229 + Following changes are needed to support hierarchy irq_domain: 230 + 231 + 1) a new field 'parent' is added to struct irq_domain; it's used to 232 + maintain irq_domain hierarchy information. 233 + 2) a new field 'parent_data' is added to struct irq_data; it's used to 234 + build hierarchy irq_data to match hierarchy irq_domains. The irq_data 235 + is used to store irq_domain pointer and hardware irq number. 236 + 3) new callbacks are added to struct irq_domain_ops to support hierarchy 237 + irq_domain operations. 238 + 239 + With support of hierarchy irq_domain and hierarchy irq_data ready, an 240 + irq_domain structure is built for each interrupt controller, and an 241 + irq_data structure is allocated for each irq_domain associated with an 242 + IRQ. Now we could go one step further to support stacked(hierarchy) 243 + irq_chip. That is, an irq_chip is associated with each irq_data along 244 + the hierarchy. A child irq_chip may implement a required action by 245 + itself or by cooperating with its parent irq_chip. 
246 + 247 + With stacked irq_chip, interrupt controller driver only needs to deal 248 + with the hardware managed by itself and may ask for services from its 249 + parent irq_chip when needed. So we could achieve a much cleaner 250 + software architecture. 251 + 252 + For an interrupt controller driver to support hierarchy irq_domain, it 253 + needs to: 254 + 255 + 1) Implement irq_domain_ops.alloc and irq_domain_ops.free 256 + 2) Optionally implement irq_domain_ops.activate and 257 + irq_domain_ops.deactivate. 258 + 3) Optionally implement an irq_chip to manage the interrupt controller 259 + hardware. 260 + 4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap, 261 + they are unused with hierarchy irq_domain. 262 + 263 + Hierarchy irq_domain is in no way x86 specific, and is heavily used to 264 + support other architectures, such as ARM, ARM64 etc. 265 + 266 + Debugging 267 + ========= 268 + 269 + Most of the internals of the IRQ subsystem are exposed in debugfs by 270 + turning CONFIG_GENERIC_IRQ_DEBUGFS on.
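The linear reverse map described in irq-domain.rst above can be pictured with a small userspace model. This is only a sketch under stated assumptions: ``toy_domain``, ``toy_create_mapping()`` and ``toy_find_mapping()`` are hypothetical names invented for illustration, the table size and first IRQ number are arbitrary, and none of this is the kernel's implementation. It mirrors two points from the text: the create call allocates a Linux IRQ number on first use, and the find call is a plain table lookup that never allocates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a linear reverse map; not kernel API. */
#define TOY_NR_HWIRQS  8    /* assumed number of hwirqs on the controller */
#define TOY_FIRST_VIRQ 32   /* assumed first free Linux IRQ number */

struct toy_domain {
    unsigned int revmap[TOY_NR_HWIRQS]; /* hwirq -> Linux IRQ, 0 = unmapped */
    unsigned int next_virq;             /* next Linux IRQ number to hand out */
};

static void toy_domain_init(struct toy_domain *d)
{
    for (size_t i = 0; i < TOY_NR_HWIRQS; i++)
        d->revmap[i] = 0;
    d->next_virq = TOY_FIRST_VIRQ;
}

/* Plays the role of irq_create_mapping(): reuse or allocate; 0 on error. */
static unsigned int toy_create_mapping(struct toy_domain *d, unsigned int hwirq)
{
    if (hwirq >= TOY_NR_HWIRQS)
        return 0;
    if (d->revmap[hwirq] == 0)
        d->revmap[hwirq] = d->next_virq++;
    return d->revmap[hwirq];
}

/* Plays the role of irq_find_mapping(): fixed-time lookup, never allocates. */
static unsigned int toy_find_mapping(const struct toy_domain *d, unsigned int hwirq)
{
    if (hwirq >= TOY_NR_HWIRQS)
        return 0;
    return d->revmap[hwirq];
}

/* Usage: a lookup before any create call finds nothing. */
static void toy_domain_demo(void)
{
    struct toy_domain d;

    toy_domain_init(&d);
    assert(toy_find_mapping(&d, 3) == 0);
    unsigned int virq = toy_create_mapping(&d, 3);
    assert(virq == TOY_FIRST_VIRQ);
    assert(toy_find_mapping(&d, 3) == virq);
}
```

The table is sized by the largest possible hwirq, which is the disadvantage the text names for the linear map; a tree-based map would trade that memory for a slower lookup.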

+15 -13

Documentation/core-api/kobject.rst

··· 80 80 (such as assuming that the kobject is at the beginning of the structure) 81 81 and, instead, use the container_of() macro, found in ``<linux/kernel.h>``:: 82 82 83 - container_of(pointer, type, member) 83 + container_of(ptr, type, member) 84 84 85 85 where: 86 86 87 - * ``pointer`` is the pointer to the embedded kobject, 87 + * ``ptr`` is the pointer to the embedded kobject, 88 88 * ``type`` is the type of the containing structure, and 89 89 * ``member`` is the name of the structure field to which ``pointer`` points. 90 90 ··· 140 140 141 141 int kobject_rename(struct kobject *kobj, const char *new_name); 142 142 143 - kobject_rename does not perform any locking or have a solid notion of 143 + kobject_rename() does not perform any locking or have a solid notion of 144 144 what names are valid so the caller must provide their own sanity checking 145 145 and serialization. 146 146 ··· 210 210 If all that you want to use a kobject for is to provide a reference counter 211 211 for your structure, please use the struct kref instead; a kobject would be 212 212 overkill. For more information on how to use struct kref, please see the 213 - file Documentation/kref.txt in the Linux kernel source tree. 213 + file Documentation/core-api/kref.rst in the Linux kernel source tree. 214 214 215 215 216 216 Creating "simple" kobjects ··· 222 222 exception where a single kobject should be created. To create such an 223 223 entry, use the function:: 224 224 225 - struct kobject *kobject_create_and_add(char *name, struct kobject *parent); 225 + struct kobject *kobject_create_and_add(const char *name, struct kobject *parent); 226 226 227 227 This function will create a kobject and place it in sysfs in the location 228 228 underneath the specified parent kobject. 
To create simple attributes 229 229 associated with this kobject, use:: 230 230 231 - int sysfs_create_file(struct kobject *kobj, struct attribute *attr); 231 + int sysfs_create_file(struct kobject *kobj, const struct attribute *attr); 232 232 233 233 or:: 234 234 235 - int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); 235 + int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp); 236 236 237 237 Both types of attributes used here, with a kobject that has been created 238 238 with the kobject_create_and_add(), can be of type kobj_attribute, so no ··· 300 300 void (*release)(struct kobject *kobj); 301 301 const struct sysfs_ops *sysfs_ops; 302 302 struct attribute **default_attrs; 303 + const struct attribute_group **default_groups; 303 304 const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj); 304 305 const void *(*namespace)(struct kobject *kobj); 306 + void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid); 305 307 }; 306 308 307 309 This structure is used to describe a particular type of kobject (or, more ··· 354 352 kset use:: 355 353 356 354 struct kset *kset_create_and_add(const char *name, 357 - struct kset_uevent_ops *u, 358 - struct kobject *parent); 355 + const struct kset_uevent_ops *uevent_ops, 356 + struct kobject *parent_kobj); 359 357 360 358 When you are finished with the kset, call:: 361 359 362 - void kset_unregister(struct kset *kset); 360 + void kset_unregister(struct kset *k); 363 361 364 362 to destroy it. This removes the kset from sysfs and decrements its reference 365 363 count. When the reference count goes to zero, the kset will be released. 
··· 373 371 associated with it, it can use the struct kset_uevent_ops to handle it:: 374 372 375 373 struct kset_uevent_ops { 376 - int (*filter)(struct kset *kset, struct kobject *kobj); 377 - const char *(*name)(struct kset *kset, struct kobject *kobj); 378 - int (*uevent)(struct kset *kset, struct kobject *kobj, 374 + int (* const filter)(struct kset *kset, struct kobject *kobj); 375 + const char *(* const name)(struct kset *kset, struct kobject *kobj); 376 + int (* const uevent)(struct kset *kset, struct kobject *kobj, 379 377 struct kobj_uevent_env *env); 380 378 }; 381 379
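The ``container_of(ptr, type, member)`` signature shown in the kobject.rst hunk above can be tried outside the kernel with a simplified stand-in. The sketch below is an assumption-laden illustration: ``toy_container_of``, ``struct toy_kobject`` and ``struct toy_device`` are hypothetical names, and the macro omits the type checking that the kernel macro performs. It shows why code should not assume the embedded kobject sits at the start of the structure.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel macro (no type checking). */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct kobject. */
struct toy_kobject {
    int refcount;
};

/* Hypothetical containing structure; kobj is deliberately not first. */
struct toy_device {
    int id;
    struct toy_kobject kobj;
};

/* Recover the containing structure from a pointer to its embedded member. */
static struct toy_device *to_toy_device(struct toy_kobject *kobj)
{
    return toy_container_of(kobj, struct toy_device, kobj);
}

/* Usage: round-trip from the embedded member back to its container. */
static void toy_container_demo(void)
{
    struct toy_device dev = { .id = 7, .kobj = { .refcount = 1 } };

    assert(to_toy_device(&dev.kobj) == &dev);
    assert(to_toy_device(&dev.kobj)->id == 7);
}
```

Because the macro subtracts the member offset, the round trip works wherever the member is placed, which a plain pointer cast would not guarantee.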

+115

Documentation/core-api/printk-basics.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =========================== 4 + Message logging with printk 5 + =========================== 6 + 7 + printk() is one of the most widely known functions in the Linux kernel. It's the 8 + standard tool we have for printing messages and usually the most basic way of 9 + tracing and debugging. If you're familiar with printf(3) you can tell printk() 10 + is based on it, although it has some functional differences: 11 + 12 + - printk() messages can specify a log level. 13 + 14 + - the format string, while largely compatible with C99, doesn't follow the 15 + exact same specification. It has some extensions and a few limitations 16 + (no ``%n`` or floating point conversion specifiers). See :ref:`How to get 17 + printk format specifiers right <printk-specifiers>`. 18 + 19 + All printk() messages are printed to the kernel log buffer, which is a ring 20 + buffer exported to userspace through /dev/kmsg. The usual way to read it is 21 + using ``dmesg``. 22 + 23 + printk() is typically used like this:: 24 + 25 + printk(KERN_INFO "Message: %s\n", arg); 26 + 27 + where ``KERN_INFO`` is the log level (note that it's concatenated to the format 28 + string, the log level is not a separate argument). 
The available log levels are: 29 + 30 + +----------------+--------+-----------------------------------------------+ 31 + | Name | String | Alias function | 32 + +================+========+===============================================+ 33 + | KERN_EMERG | "0" | pr_emerg() | 34 + +----------------+--------+-----------------------------------------------+ 35 + | KERN_ALERT | "1" | pr_alert() | 36 + +----------------+--------+-----------------------------------------------+ 37 + | KERN_CRIT | "2" | pr_crit() | 38 + +----------------+--------+-----------------------------------------------+ 39 + | KERN_ERR | "3" | pr_err() | 40 + +----------------+--------+-----------------------------------------------+ 41 + | KERN_WARNING | "4" | pr_warn() | 42 + +----------------+--------+-----------------------------------------------+ 43 + | KERN_NOTICE | "5" | pr_notice() | 44 + +----------------+--------+-----------------------------------------------+ 45 + | KERN_INFO | "6" | pr_info() | 46 + +----------------+--------+-----------------------------------------------+ 47 + | KERN_DEBUG | "7" | pr_debug() and pr_devel() if DEBUG is defined | 48 + +----------------+--------+-----------------------------------------------+ 49 + | KERN_DEFAULT | "" | | 50 + +----------------+--------+-----------------------------------------------+ 51 + | KERN_CONT | "c" | pr_cont() | 52 + +----------------+--------+-----------------------------------------------+ 53 + 54 + 55 + The log level specifies the importance of a message. The kernel decides whether 56 + to show the message immediately (printing it to the current console) depending 57 + on its log level and the current *console_loglevel* (a kernel variable). If the 58 + message priority is higher (lower log level value) than the *console_loglevel* 59 + the message will be printed to the console. 60 + 61 + If the log level is omitted, the message is printed with ``KERN_DEFAULT`` 62 + level. 
63 + 64 + You can check the current *console_loglevel* with:: 65 + 66 + $ cat /proc/sys/kernel/printk 67 + 4 4 1 7 68 + 69 + The result shows the *current*, *default*, *minimum* and *boot-time-default* log 70 + levels. 71 + 72 + To change the current console_loglevel simply write the desired level to 73 + ``/proc/sys/kernel/printk``. For example, to print all messages to the console:: 74 + 75 + # echo 8 > /proc/sys/kernel/printk 76 + 77 + Another way, using ``dmesg``:: 78 + 79 + # dmesg -n 5 80 + 81 + sets the console_loglevel to print KERN_WARNING (4) or more severe messages to 82 + console. See ``dmesg(1)`` for more information. 83 + 84 + As an alternative to printk() you can use the ``pr_*()`` aliases for 85 + logging. This family of macros embeds the log level in the macro names. For 86 + example:: 87 + 88 + pr_info("Info message no. %d\n", msg_num); 89 + 90 + prints a ``KERN_INFO`` message. 91 + 92 + Besides being more concise than the equivalent printk() calls, they can use a 93 + common definition for the format string through the pr_fmt() macro. For 94 + instance, defining this at the top of a source file (before any ``#include`` 95 + directive):: 96 + 97 + #define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__ 98 + 99 + would prefix every pr_*() message in that file with the module and function name 100 + that originated the message. 101 + 102 + For debugging purposes there are also two conditionally-compiled macros: 103 + pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or 104 + also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined. 105 + 106 + 107 + Function reference 108 + ================== 109 + 110 + .. kernel-doc:: kernel/printk/printk.c 111 + :functions: printk 112 + 113 + .. kernel-doc:: include/linux/printk.h 114 + :functions: pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info 115 + pr_fmt pr_debug pr_devel pr_cont
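
Editor's illustration (not part of the commit): the new printk document explains that the log level is concatenated onto the format string and that a message reaches the console only when its level is numerically lower than *console_loglevel*. The userspace sketch below models just that rule. The `LVL_*` macros and both function names are hypothetical, and the model assumes the one-character prefixes from the "String" column of the table; the in-kernel `KERN_*` macros also carry a leading SOH byte, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for KERN_ERR, KERN_INFO and KERN_DEBUG.
 * The real kernel macros additionally start with an SOH byte. */
#define LVL_ERR   "3"
#define LVL_INFO  "6"
#define LVL_DEBUG "7"

/* Return the level encoded in the first character of msg, or
 * default_level when msg carries no level prefix. */
int msg_level(const char *msg, int default_level)
{
    assert(msg != NULL);
    if (msg[0] >= '0' && msg[0] <= '7')
        return msg[0] - '0';
    return default_level;
}

/* A message is shown on the console only when its level is
 * numerically lower (more severe) than console_loglevel. */
int shown_on_console(const char *msg, int console_loglevel,
                     int default_level)
{
    return msg_level(msg, default_level) < console_loglevel;
}
```

With the `4 4 1 7` settings quoted in the hunk, `shown_on_console(LVL_ERR "disk failure\n", 4, 4)` is true while an `LVL_INFO` message is not; raising the console level to 8 lets every level through, matching the `echo 8` example. Because this model drops the SOH byte, a message whose text happens to start with a digit would be misread as carrying a level, which is one reason the sketch should not be taken as the kernel's actual parser.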

+2

Documentation/core-api/printk-formats.rst

··· 2 2 How to get printk format specifiers right 3 3 ========================================= 4 4 5 + .. _printk-specifiers: 6 + 5 7 :Author: Randy Dunlap <rdunlap@infradead.org> 6 8 :Author: Andrew Murray <amurray@mpc-data.co.uk> 7 9

Documentation/debugging-via-ohci1394.txt Documentation/core-api/debugging-via-ohci1394.rst

Documentation/digsig.txt Documentation/security/digsig.rst

+1 -1

Documentation/doc-guide/maintainer-profile.rst

··· 6 6 The documentation "subsystem" is the central coordinating point for the 7 7 kernel's documentation and associated infrastructure. It covers the 8 8 hierarchy under Documentation/ (with the exception of 9 - Documentation/device-tree), various utilities under scripts/ and, at least 9 + Documentation/devicetree), various utilities under scripts/ and, at least 10 10 some of the time, LICENSES/. 11 11 12 12 It's worth noting, though, that the boundaries of this subsystem are rather

+2 -2

Documentation/driver-api/dma-buf.rst

··· 11 11 The three main components of this are: (1) dma-buf, representing a 12 12 sg_table and exposed to userspace as a file descriptor to allow passing 13 13 between devices, (2) fence, which provides a mechanism to signal when 14 - one device as finished access, and (3) reservation, which manages the 14 + one device has finished access, and (3) reservation, which manages the 15 15 shared or exclusive fence(s) associated with the buffer. 16 16 17 17 Shared DMA Buffers ··· 31 31 - implements and manages operations in :c:type:`struct dma_buf_ops 32 32 <dma_buf_ops>` for the buffer, 33 33 - allows other users to share the buffer by using dma_buf sharing APIs, 34 - - manages the details of buffer allocation, wrapped int a :c:type:`struct 34 + - manages the details of buffer allocation, wrapped in a :c:type:`struct 35 35 dma_buf <dma_buf>`, 36 36 - decides about the actual backing storage where this allocation happens, 37 37 - and takes care of any migration of scatterlist - for all (shared) users of

+2 -2

Documentation/driver-api/driver-model/device.rst

··· 50 50 51 51 Attributes of devices can be exported by a device driver through sysfs. 52 52 53 - Please see Documentation/filesystems/sysfs.txt for more information 53 + Please see Documentation/filesystems/sysfs.rst for more information 54 54 on how sysfs works. 55 55 56 - As explained in Documentation/kobject.txt, device attributes must be 56 + As explained in Documentation/core-api/kobject.rst, device attributes must be 57 57 created before the KOBJ_ADD uevent is generated. The only way to realize 58 58 that is by defining an attribute group. 59 59

+1 -1

Documentation/driver-api/driver-model/overview.rst

··· 121 121 122 122 More information about the sysfs directory layout can be found in 123 123 the other documents in this directory and in the file 124 - Documentation/filesystems/sysfs.txt. 124 + Documentation/filesystems/sysfs.rst.

+1

Documentation/driver-api/index.rst

··· 39 39 spi 40 40 i2c 41 41 ipmb 42 + ipmi 42 43 i3c/index 43 44 interconnect 44 45 devfreq

+2 -2

Documentation/driver-api/nvdimm/nvdimm.rst

··· 278 278 be contiguous in DPA-space. 279 279 280 280 This bus is provided by the kernel under the device 281 - /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and 282 - the nfit_test.ko module is loaded. This not only test LIBNVDIMM but the 281 + /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from 282 + tools/testing/nvdimm is loaded. This not only test LIBNVDIMM but the 283 283 acpi_nfit.ko driver as well. 284 284 285 285

+3

Documentation/driver-api/thermal/cpu-idle-cooling.rst

··· 1 + ================ 2 + CPU Idle Cooling 3 + ================ 1 4 2 5 Situation: 3 6 ----------

+1

Documentation/driver-api/thermal/index.rst

··· 8 8 :maxdepth: 1 9 9 10 10 cpu-cooling-api 11 + cpu-idle-cooling 11 12 sysfs-api 12 13 power_allocator 13 14

+1 -1

Documentation/features/core/eBPF-JIT/arch-support.txt

··· 23 23 | openrisc: | TODO | 24 24 | parisc: | TODO | 25 25 | powerpc: | ok | 26 - | riscv: | TODO | 26 + | riscv: | ok | 27 27 | s390: | ok | 28 28 | sh: | TODO | 29 29 | sparc: | ok |

+3 -3

Documentation/features/debug/KASAN/arch-support.txt

··· 22 22 | nios2: | TODO | 23 23 | openrisc: | TODO | 24 24 | parisc: | TODO | 25 - | powerpc: | TODO | 26 - | riscv: | TODO | 27 - | s390: | TODO | 25 + | powerpc: | ok | 26 + | riscv: | ok | 27 + | s390: | ok | 28 28 | sh: | TODO | 29 29 | sparc: | TODO | 30 30 | um: | TODO |

+1 -1

Documentation/features/debug/gcov-profile-all/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO |

+1 -1

Documentation/features/debug/kprobes-on-ftrace/arch-support.txt

··· 11 11 | arm: | TODO | 12 12 | arm64: | TODO | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO |

+2 -2

Documentation/features/debug/kprobes/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | ok | ··· 23 23 | openrisc: | TODO | 24 24 | parisc: | ok | 25 25 | powerpc: | ok | 26 - | riscv: | ok | 26 + | riscv: | TODO | 27 27 | s390: | ok | 28 28 | sh: | ok | 29 29 | sparc: | ok |

+1 -1

Documentation/features/debug/kretprobes/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | ok |

+1 -1

Documentation/features/debug/stackprotector/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO |

+1 -1

Documentation/features/debug/uprobes/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO |

+1 -1

Documentation/features/io/dma-contiguous/arch-support.txt

··· 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO | 18 18 | m68k: | TODO | 19 - | microblaze: | TODO | 19 + | microblaze: | ok | 20 20 | mips: | ok | 21 21 | nds32: | TODO | 22 22 | nios2: | TODO |

+1 -1

Documentation/features/locking/lockdep/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | ok | 17 17 | ia64: | TODO |

+2 -2

Documentation/features/perf/kprobes-event/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | ok | 17 17 | ia64: | TODO | ··· 21 21 | nds32: | ok | 22 22 | nios2: | TODO | 23 23 | openrisc: | TODO | 24 - | parisc: | TODO | 24 + | parisc: | ok | 25 25 | powerpc: | ok | 26 26 | riscv: | TODO | 27 27 | s390: | ok |

+2 -2

Documentation/features/perf/perf-regs/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO | ··· 23 23 | openrisc: | TODO | 24 24 | parisc: | TODO | 25 25 | powerpc: | ok | 26 - | riscv: | TODO | 26 + | riscv: | ok | 27 27 | s390: | ok | 28 28 | sh: | TODO | 29 29 | sparc: | TODO |

+2 -2

Documentation/features/perf/perf-stackdump/arch-support.txt

··· 11 11 | arm: | ok | 12 12 | arm64: | ok | 13 13 | c6x: | TODO | 14 - | csky: | TODO | 14 + | csky: | ok | 15 15 | h8300: | TODO | 16 16 | hexagon: | TODO | 17 17 | ia64: | TODO | ··· 23 23 | openrisc: | TODO | 24 24 | parisc: | TODO | 25 25 | powerpc: | ok | 26 - | riscv: | TODO | 26 + | riscv: | ok | 27 27 | s390: | ok | 28 28 | sh: | TODO | 29 29 | sparc: | TODO |

+1 -1

Documentation/features/seccomp/seccomp-filter/arch-support.txt

··· 23 23 | openrisc: | TODO | 24 24 | parisc: | ok | 25 25 | powerpc: | ok | 26 - | riscv: | TODO | 26 + | riscv: | ok | 27 27 | s390: | ok | 28 28 | sh: | TODO | 29 29 | sparc: | TODO |

+1 -1

Documentation/features/vm/huge-vmap/arch-support.txt

··· 22 22 | nios2: | TODO | 23 23 | openrisc: | TODO | 24 24 | parisc: | TODO | 25 - | powerpc: | TODO | 25 + | powerpc: | ok | 26 26 | riscv: | TODO | 27 27 | s390: | TODO | 28 28 | sh: | TODO |

+1 -1

Documentation/features/vm/pte_special/arch-support.txt

··· 17 17 | ia64: | TODO | 18 18 | m68k: | TODO | 19 19 | microblaze: | TODO | 20 - | mips: | TODO | 20 + | mips: | ok | 21 21 | nds32: | TODO | 22 22 | nios2: | TODO | 23 23 | openrisc: | TODO |

+1 -1

Documentation/filesystems/9p.rst

··· 192 192 http://plan9.bell-labs.com/plan9 193 193 194 194 For information on Plan 9 from User Space (Plan 9 applications and libraries 195 - ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 195 + ported to Linux/BSD/OSX/etc) check out https://9fans.github.io/plan9port/

+98

Documentation/filesystems/automount-support.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ================= 4 + Automount Support 5 + ================= 6 + 7 + 8 + Support is available for filesystems that wish to do automounting 9 + support (such as kAFS which can be found in fs/afs/ and NFS in 10 + fs/nfs/). This facility includes allowing in-kernel mounts to be 11 + performed and mountpoint degradation to be requested. The latter can 12 + also be requested by userspace. 13 + 14 + 15 + In-Kernel Automounting 16 + ====================== 17 + 18 + See section "Mount Traps" of Documentation/filesystems/autofs.rst 19 + 20 + Then from userspace, you can just do something like:: 21 + 22 + [root@andromeda root]# mount -t afs \#root.afs. /afs 23 + [root@andromeda root]# ls /afs 24 + asd cambridge cambridge.redhat.com grand.central.org 25 + [root@andromeda root]# ls /afs/cambridge 26 + afsdoc 27 + [root@andromeda root]# ls /afs/cambridge/afsdoc/ 28 + ChangeLog html LICENSE pdf RELNOTES-1.2.2 29 + 30 + And then if you look in the mountpoint catalogue, you'll see something like:: 31 + 32 + [root@andromeda root]# cat /proc/mounts 33 + ... 34 + #root.afs. /afs afs rw 0 0 35 + #root.cell. /afs/cambridge.redhat.com afs rw 0 0 36 + #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 37 + 38 + 39 + Automatic Mountpoint Expiry 40 + =========================== 41 + 42 + Automatic expiration of mountpoints is easy, provided you've mounted the 43 + mountpoint to be expired in the automounting procedure outlined separately. 44 + 45 + To do expiration, you need to follow these steps: 46 + 47 + (1) Create at least one list off which the vfsmounts to be expired can be 48 + hung. 49 + 50 + (2) When a new mountpoint is created in the ->d_automount method, add 51 + the mnt to the list using mnt_set_expiry():: 52 + 53 + mnt_set_expiry(newmnt, &afs_vfsmounts); 54 + 55 + (3) When you want mountpoints to be expired, call mark_mounts_for_expiry() 56 + with a pointer to this list. 
This will process the list, marking every 57 + vfsmount thereon for potential expiry on the next call. 58 + 59 + If a vfsmount was already flagged for expiry, and if its usage count is 1 60 + (it's only referenced by its parent vfsmount), then it will be deleted 61 + from the namespace and thrown away (effectively unmounted). 62 + 63 + It may prove simplest to simply call this at regular intervals, using 64 + some sort of timed event to drive it. 65 + 66 + The expiration flag is cleared by calls to mntput. This means that expiration 67 + will only happen on the second expiration request after the last time the 68 + mountpoint was accessed. 69 + 70 + If a mountpoint is moved, it gets removed from the expiration list. If a bind 71 + mount is made on an expirable mount, the new vfsmount will not be on the 72 + expiration list and will not expire. 73 + 74 + If a namespace is copied, all mountpoints contained therein will be copied, 75 + and the copies of those that are on an expiration list will be added to the 76 + same expiration list. 77 + 78 + 79 + Userspace Driven Expiry 80 + ======================= 81 + 82 + As an alternative, it is possible for userspace to request expiry of any 83 + mountpoint (though some will be rejected - the current process's idea of the 84 + rootfs for example). It does this by passing the MNT_EXPIRE flag to 85 + umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH. 86 + 87 + If the mountpoint in question is referenced by something other than 88 + umount() or its parent mountpoint, an EBUSY error will be returned and the 89 + mountpoint will not be marked for expiration or unmounted. 90 + 91 + If the mountpoint was not already marked for expiry at that time, an EAGAIN 92 + error will be given and it won't be unmounted. 93 + 94 + Otherwise if it was already marked and it wasn't referenced, unmounting will 95 + take place as usual. 
96 + 97 + Again, the expiration flag is cleared every time anything other than umount() 98 + looks at a mountpoint.

-93

Documentation/filesystems/automount-support.txt

··· 1 - Support is available for filesystems that wish to do automounting 2 - support (such as kAFS which can be found in fs/afs/ and NFS in 3 - fs/nfs/). This facility includes allowing in-kernel mounts to be 4 - performed and mountpoint degradation to be requested. The latter can 5 - also be requested by userspace. 6 - 7 - 8 - ====================== 9 - IN-KERNEL AUTOMOUNTING 10 - ====================== 11 - 12 - See section "Mount Traps" of Documentation/filesystems/autofs.rst 13 - 14 - Then from userspace, you can just do something like: 15 - 16 - [root@andromeda root]# mount -t afs \#root.afs. /afs 17 - [root@andromeda root]# ls /afs 18 - asd cambridge cambridge.redhat.com grand.central.org 19 - [root@andromeda root]# ls /afs/cambridge 20 - afsdoc 21 - [root@andromeda root]# ls /afs/cambridge/afsdoc/ 22 - ChangeLog html LICENSE pdf RELNOTES-1.2.2 23 - 24 - And then if you look in the mountpoint catalogue, you'll see something like: 25 - 26 - [root@andromeda root]# cat /proc/mounts 27 - ... 28 - #root.afs. /afs afs rw 0 0 29 - #root.cell. /afs/cambridge.redhat.com afs rw 0 0 30 - #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 31 - 32 - 33 - =========================== 34 - AUTOMATIC MOUNTPOINT EXPIRY 35 - =========================== 36 - 37 - Automatic expiration of mountpoints is easy, provided you've mounted the 38 - mountpoint to be expired in the automounting procedure outlined separately. 39 - 40 - To do expiration, you need to follow these steps: 41 - 42 - (1) Create at least one list off which the vfsmounts to be expired can be 43 - hung. 44 - 45 - (2) When a new mountpoint is created in the ->d_automount method, add 46 - the mnt to the list using mnt_set_expiry() 47 - mnt_set_expiry(newmnt, &afs_vfsmounts); 48 - 49 - (3) When you want mountpoints to be expired, call mark_mounts_for_expiry() 50 - with a pointer to this list. This will process the list, marking every 51 - vfsmount thereon for potential expiry on the next call. 
52 - 53 - If a vfsmount was already flagged for expiry, and if its usage count is 1 54 - (it's only referenced by its parent vfsmount), then it will be deleted 55 - from the namespace and thrown away (effectively unmounted). 56 - 57 - It may prove simplest to simply call this at regular intervals, using 58 - some sort of timed event to drive it. 59 - 60 - The expiration flag is cleared by calls to mntput. This means that expiration 61 - will only happen on the second expiration request after the last time the 62 - mountpoint was accessed. 63 - 64 - If a mountpoint is moved, it gets removed from the expiration list. If a bind 65 - mount is made on an expirable mount, the new vfsmount will not be on the 66 - expiration list and will not expire. 67 - 68 - If a namespace is copied, all mountpoints contained therein will be copied, 69 - and the copies of those that are on an expiration list will be added to the 70 - same expiration list. 71 - 72 - 73 - ======================= 74 - USERSPACE DRIVEN EXPIRY 75 - ======================= 76 - 77 - As an alternative, it is possible for userspace to request expiry of any 78 - mountpoint (though some will be rejected - the current process's idea of the 79 - rootfs for example). It does this by passing the MNT_EXPIRE flag to 80 - umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH. 81 - 82 - If the mountpoint in question is in referenced by something other than 83 - umount() or its parent mountpoint, an EBUSY error will be returned and the 84 - mountpoint will not be marked for expiration or unmounted. 85 - 86 - If the mountpoint was not already marked for expiry at that time, an EAGAIN 87 - error will be given and it won't be unmounted. 88 - 89 - Otherwise if it was already marked and it wasn't referenced, unmounting will 90 - take place as usual. 91 - 92 - Again, the expiration flag is cleared every time anything other than umount() 93 - looks at a mountpoint.

+727

Documentation/filesystems/caching/backend-api.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========================== 4 + FS-Cache Cache backend API 5 + ========================== 6 + 7 + The FS-Cache system provides an API by which actual caches can be supplied to 8 + FS-Cache for it to then serve out to network filesystems and other interested 9 + parties. 10 + 11 + This API is declared in <linux/fscache-cache.h>. 12 + 13 + 14 + Initialising and Registering a Cache 15 + ==================================== 16 + 17 + To start off, a cache definition must be initialised and registered for each 18 + cache the backend wants to make available. For instance, CacheFS does this in 19 + the fill_super() operation on mounting. 20 + 21 + The cache definition (struct fscache_cache) should be initialised by calling:: 22 + 23 + void fscache_init_cache(struct fscache_cache *cache, 24 + struct fscache_cache_ops *ops, 25 + const char *idfmt, 26 + ...); 27 + 28 + Where: 29 + 30 + * "cache" is a pointer to the cache definition; 31 + 32 + * "ops" is a pointer to the table of operations that the backend supports on 33 + this cache; and 34 + 35 + * "idfmt" is a format and printf-style arguments for constructing a label 36 + for the cache. 37 + 38 + 39 + The cache should then be registered with FS-Cache by passing a pointer to the 40 + previously initialised cache definition to:: 41 + 42 + int fscache_add_cache(struct fscache_cache *cache, 43 + struct fscache_object *fsdef, 44 + const char *tagname); 45 + 46 + Two extra arguments should also be supplied: 47 + 48 + * "fsdef" which should point to the object representation for the FS-Cache 49 + master index in this cache. Netfs primary index entries will be created 50 + here. FS-Cache keeps the caller's reference to the index object if 51 + successful and will release it upon withdrawal of the cache. 52 + 53 + * "tagname" which, if given, should be a text string naming this cache. If 54 + this is NULL, the identifier will be used instead. 
For CacheFS, the 55 + identifier is set to name the underlying block device and the tag can be 56 + supplied by mount. 57 + 58 + This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag 59 + is already in use. 0 will be returned on success. 60 + 61 + 62 + Unregistering a Cache 63 + ===================== 64 + 65 + A cache can be withdrawn from the system by calling this function with a 66 + pointer to the cache definition:: 67 + 68 + void fscache_withdraw_cache(struct fscache_cache *cache); 69 + 70 + In CacheFS's case, this is called by put_super(). 71 + 72 + 73 + Security 74 + ======== 75 + 76 + The cache methods are executed in one of two contexts: 77 + 78 + (1) that of the userspace process that issued the netfs operation that caused 79 + the cache method to be invoked, or 80 + 81 + (2) that of one of the processes in the FS-Cache thread pool. 82 + 83 + In either case, this may not be an appropriate context in which to access the 84 + cache. 85 + 86 + The calling process's fsuid, fsgid and SELinux security identities may need to 87 + be masqueraded for the duration of the cache driver's access to the cache. 88 + This is left to the cache to handle; FS-Cache makes no effort in this regard. 89 + 90 + 91 + Control and Statistics Presentation 92 + =================================== 93 + 94 + The cache may present data to the outside world through FS-Cache's interfaces 95 + in sysfs and procfs - the former for control and the latter for statistics. 96 + 97 + A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS 98 + is enabled. This is accessible through the kobject struct fscache_cache::kobj 99 + and is for use by the cache as it sees fit. 100 + 101 + 102 + Relevant Data Structures 103 + ======================== 104 + 105 + * Index/Data file FS-Cache representation cookie:: 106 + 107 + struct fscache_cookie { 108 + struct fscache_object_def *def; 109 + struct fscache_netfs *netfs; 110 + void *netfs_data; 111 + ... 
112 + }; 113 + 114 + The fields that might be of use to the backend describe the object 115 + definition, the netfs definition and the netfs's data for this cookie. 116 + The object definition contains functions supplied by the netfs for loading 117 + and matching index entries; these are required to provide some of the 118 + cache operations. 119 + 120 + 121 + * In-cache object representation:: 122 + 123 + struct fscache_object { 124 + int debug_id; 125 + enum { 126 + FSCACHE_OBJECT_RECYCLING, 127 + ... 128 + } state; 129 + spinlock_t lock; 130 + struct fscache_cache *cache; 131 + struct fscache_cookie *cookie; 132 + ... 133 + }; 134 + 135 + Structures of this type should be allocated by the cache backend and 136 + passed to FS-Cache when requested by the appropriate cache operation. In 137 + the case of CacheFS, they're embedded in CacheFS's internal object 138 + structures. 139 + 140 + The debug_id is a simple integer that can be used in debugging messages 141 + that refer to a particular object. In such a case it should be printed 142 + using "OBJ%x" to be consistent with FS-Cache. 143 + 144 + Each object contains a pointer to the cookie that represents the object it 145 + is backing. An object should be retired when put_object() is called if it is 146 + in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be 147 + initialised by calling fscache_object_init(object). 148 + 149 + 150 + * FS-Cache operation record:: 151 + 152 + struct fscache_operation { 153 + atomic_t usage; 154 + struct fscache_object *object; 155 + unsigned long flags; 156 + #define FSCACHE_OP_EXCLUSIVE 157 + void (*processor)(struct fscache_operation *op); 158 + void (*release)(struct fscache_operation *op); 159 + ... 160 + }; 161 + 162 + FS-Cache has a pool of threads that it uses to give CPU time to the 163 + various asynchronous operations that need to be done as part of driving 164 + the cache. These are represented by the above structure. 
The processor 165 + method is called to give the op CPU time, and the release method to get 166 + rid of it when its usage count reaches 0. 167 + 168 + An operation can be made exclusive upon an object by setting the 169 + appropriate flag before enqueuing it with fscache_enqueue_operation(). If 170 + an operation needs more processing time, it should be enqueued again. 171 + 172 + 173 + * FS-Cache retrieval operation record:: 174 + 175 + struct fscache_retrieval { 176 + struct fscache_operation op; 177 + struct address_space *mapping; 178 + struct list_head *to_do; 179 + ... 180 + }; 181 + 182 + A structure of this type is allocated by FS-Cache to record retrieval and 183 + allocation requests made by the netfs. This struct is then passed to the 184 + backend to do the operation. The backend may get extra refs to it by 185 + calling fscache_get_retrieval() and refs may be discarded by calling 186 + fscache_put_retrieval(). 187 + 188 + A retrieval operation can be used by the backend to do retrieval work. To 189 + do this, the retrieval->op.processor method pointer should be set 190 + appropriately by the backend and fscache_enqueue_retrieval() called to 191 + submit it to the thread pool. CacheFiles, for example, uses this to queue 192 + page examination when it detects PG_lock being cleared. 193 + 194 + The to_do field is an empty list available for the cache backend to use as 195 + it sees fit. 196 + 197 + 198 + * FS-Cache storage operation record:: 199 + 200 + struct fscache_storage { 201 + struct fscache_operation op; 202 + pgoff_t store_limit; 203 + ... 204 + }; 205 + 206 + A structure of this type is allocated by FS-Cache to record outstanding 207 + writes to be made. FS-Cache itself enqueues this operation and invokes 208 + the write_page() method on the object at appropriate times to effect 209 + storage. 
210 + 211 + 212 + Cache Operations 213 + ================ 214 + 215 + The cache backend provides FS-Cache with a table of operations that can be 216 + performed on the denizens of the cache. These are held in a structure of type: 217 + 218 + :: 219 + 220 + struct fscache_cache_ops 221 + 222 + * Name of cache provider [mandatory]:: 223 + 224 + const char *name 225 + 226 + This isn't strictly an operation, but should be pointed at a string naming 227 + the backend. 228 + 229 + 230 + * Allocate a new object [mandatory]:: 231 + 232 + struct fscache_object *(*alloc_object)(struct fscache_cache *cache, 233 + struct fscache_cookie *cookie) 234 + 235 + This method is used to allocate a cache object representation to back a 236 + cookie in a particular cache. fscache_object_init() should be called on 237 + the object to initialise it prior to returning. 238 + 239 + This function may also be used to parse the index key to be used for 240 + multiple lookup calls to turn it into a more convenient form. FS-Cache 241 + will call the lookup_complete() method to allow the cache to release the 242 + form once lookup is complete or aborted. 243 + 244 + 245 + * Look up and create object [mandatory]:: 246 + 247 + void (*lookup_object)(struct fscache_object *object) 248 + 249 + This method is used to look up an object, given that the object is already 250 + allocated and attached to the cookie. This should instantiate that object 251 + in the cache if it can. 252 + 253 + The method should call fscache_object_lookup_negative() as soon as 254 + possible if it determines the object doesn't exist in the cache. If the 255 + object is found to exist and the netfs indicates that it is valid then 256 + fscache_obtained_object() should be called once the object is in a 257 + position to have data stored in it. Similarly, fscache_obtained_object() 258 + should also be called once a non-present object has been created. 
259 + 260 + If a lookup error occurs, fscache_object_lookup_error() should be called 261 + to abort the lookup of that object. 262 + 263 + 264 + * Release lookup data [mandatory]:: 265 + 266 + void (*lookup_complete)(struct fscache_object *object) 267 + 268 + This method is called to ask the cache to release any resources it was 269 + using to perform a lookup. 270 + 271 + 272 + * Increment object refcount [mandatory]:: 273 + 274 + struct fscache_object *(*grab_object)(struct fscache_object *object) 275 + 276 + This method is called to increment the reference count on an object. It 277 + may fail (for instance if the cache is being withdrawn) by returning NULL. 278 + It should return the object pointer if successful. 279 + 280 + 281 + * Lock/Unlock object [mandatory]:: 282 + 283 + void (*lock_object)(struct fscache_object *object) 284 + void (*unlock_object)(struct fscache_object *object) 285 + 286 + These methods are used to exclusively lock an object. It must be possible 287 + to schedule with the lock held, so a spinlock isn't sufficient. 288 + 289 + 290 + * Pin/Unpin object [optional]:: 291 + 292 + int (*pin_object)(struct fscache_object *object) 293 + void (*unpin_object)(struct fscache_object *object) 294 + 295 + These methods are used to pin an object into the cache. Once pinned an 296 + object cannot be reclaimed to make space. Return -ENOSPC if there's not 297 + enough space in the cache to permit this. 298 + 299 + 300 + * Check coherency state of an object [mandatory]:: 301 + 302 + int (*check_consistency)(struct fscache_object *object) 303 + 304 + This method is called to have the cache check the saved auxiliary data of 305 + the object against the netfs's idea of the state. 0 should be returned 306 + if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS 307 + may also be returned. 
308 + 309 + * Update object [mandatory]:: 310 + 311 + int (*update_object)(struct fscache_object *object) 312 + 313 + This is called to update the index entry for the specified object. The 314 + new information should be in object->cookie->netfs_data. This can be 315 + obtained by calling object->cookie->def->get_aux()/get_attr(). 316 + 317 + 318 + * Invalidate data object [mandatory]:: 319 + 320 + int (*invalidate_object)(struct fscache_operation *op) 321 + 322 + This is called to invalidate a data object (as pointed to by op->object). 323 + All the data stored for this object should be discarded and an 324 + attr_changed operation should be performed. The caller will follow up 325 + with an object update operation. 326 + 327 + fscache_op_complete() must be called on op before returning. 328 + 329 + 330 + * Discard object [mandatory]:: 331 + 332 + void (*drop_object)(struct fscache_object *object) 333 + 334 + This method is called to indicate that an object has been unbound from its 335 + cookie, and that the cache should release the object's resources and 336 + retire it if it's in state FSCACHE_OBJECT_RECYCLING. 337 + 338 + This method should not attempt to release any references held by the 339 + caller. The caller will invoke the put_object() method as appropriate. 340 + 341 + 342 + * Release object reference [mandatory]:: 343 + 344 + void (*put_object)(struct fscache_object *object) 345 + 346 + This method is used to discard a reference to an object. The object may 347 + be freed when all the references to it are released. 348 + 349 + 350 + * Synchronise a cache [mandatory]:: 351 + 352 + void (*sync)(struct fscache_cache *cache) 353 + 354 + This is called to ask the backend to synchronise a cache with its backing 355 + device. 356 + 357 + 358 + * Dissociate a cache [mandatory]:: 359 + 360 + void (*dissociate_pages)(struct fscache_cache *cache) 361 + 362 + This is called to ask a cache to perform any page dissociations as part of 363 + cache withdrawal. 
364 + 365 + 366 + * Notification that the attributes on a netfs file changed [mandatory]:: 367 + 368 + int (*attr_changed)(struct fscache_object *object); 369 + 370 + This is called to indicate to the cache that certain attributes on a netfs 371 + file have changed (for example the maximum size a file may reach). The 372 + cache can read these from the netfs by calling the cookie's get_attr() 373 + method. 374 + 375 + The cache may use the file size information to reserve space on the cache. 376 + It should also call fscache_set_store_limit() to indicate to FS-Cache the 377 + highest byte it's willing to store for an object. 378 + 379 + This method may return a negative error code if an error occurred or the cache object cannot 380 + be expanded. In such a case, the object will be withdrawn from service. 381 + 382 + This operation is run asynchronously from FS-Cache's thread pool, and 383 + storage and retrieval operations from the netfs are excluded during the 384 + execution of this operation. 385 + 386 + 387 + * Reserve cache space for an object's data [optional]:: 388 + 389 + int (*reserve_space)(struct fscache_object *object, loff_t size); 390 + 391 + This is called to request that cache space be reserved to hold the data 392 + for an object and the metadata used to track it. A size of zero should be 393 + taken as a request to cancel a reservation. 394 + 395 + This should return 0 if successful, -ENOSPC if there isn't enough space 396 + available, or -ENOMEM or -EIO on other errors. 397 + 398 + The reservation may exceed the current size of the object, thus permitting 399 + future expansion. If the amount of space consumed by an object would 400 + exceed the reservation, it's permitted to refuse requests to allocate 401 + pages, but not required. An object may be pruned down to its reservation 402 + size if larger than that already. 
403 + 404 + 405 + * Request page be read from cache [mandatory]:: 406 + 407 + int (*read_or_alloc_page)(struct fscache_retrieval *op, 408 + struct page *page, 409 + gfp_t gfp) 410 + 411 + This is called to attempt to read a netfs page from the cache, or to 412 + reserve a backing block if not. FS-Cache will have done as much checking 413 + as it can before calling, but most of the work belongs to the backend. 414 + 415 + If there's no page in the cache, then -ENODATA should be returned if the 416 + backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it 417 + didn't. 418 + 419 + If there is suitable data in the cache, then a read operation should be 420 + queued and 0 returned. When the read finishes, fscache_end_io() should be 421 + called. 422 + 423 + The fscache_mark_pages_cached() should be called for the page if any cache 424 + metadata is retained. This will indicate to the netfs that the page needs 425 + explicit uncaching. This operation takes a pagevec, thus allowing several 426 + pages to be marked at once. 427 + 428 + The retrieval record pointed to by op should be retained for each page 429 + queued and released when I/O on the page has been formally ended. 430 + fscache_get/put_retrieval() are available for this purpose. 431 + 432 + The retrieval record may be used to get CPU time via the FS-Cache thread 433 + pool. If this is desired, the op->op.processor should be set to point to 434 + the appropriate processing routine, and fscache_enqueue_retrieval() should 435 + be called at an appropriate point to request CPU time. For instance, the 436 + retrieval routine could be enqueued upon the completion of a disk read. 437 + The to_do field in the retrieval record is provided to aid in this. 438 + 439 + If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS 440 + returned if possible or fscache_end_io() called with a suitable error 441 + code. 
442 + 443 + fscache_put_retrieval() should be called after a page or pages are dealt 444 + with. This will complete the operation when all pages are dealt with. 445 + 446 + 447 + * Request pages be read from cache [mandatory]:: 448 + 449 + int (*read_or_alloc_pages)(struct fscache_retrieval *op, 450 + struct list_head *pages, 451 + unsigned *nr_pages, 452 + gfp_t gfp) 453 + 454 + This is like the read_or_alloc_page() method, except it is handed a list 455 + of pages instead of one page. Any pages on which a read operation is 456 + started must be added to the page cache for the specified mapping and also 457 + to the LRU. Such pages must also be removed from the pages list and 458 + ``*nr_pages`` decremented per page. 459 + 460 + If there was an error such as -ENOMEM, then that should be returned; else 461 + if one or more pages couldn't be read or allocated, then -ENOBUFS should 462 + be returned; else if one or more pages couldn't be read, then -ENODATA 463 + should be returned. If all the pages are dispatched then 0 should be 464 + returned. 465 + 466 + 467 + * Request page be allocated in the cache [mandatory]:: 468 + 469 + int (*allocate_page)(struct fscache_retrieval *op, 470 + struct page *page, 471 + gfp_t gfp) 472 + 473 + This is like the read_or_alloc_page() method, except that it shouldn't 474 + read from the cache, even if there's data there that could be retrieved. 475 + It should, however, set up any internal metadata required such that 476 + the write_page() method can write to the cache. 477 + 478 + If there's no backing block available, then -ENOBUFS should be returned 479 + (or -ENOMEM if there were other problems). If a block is successfully 480 + allocated, then the netfs page should be marked and 0 returned. 
481 + 482 + 483 + * Request pages be allocated in the cache [mandatory]:: 484 + 485 + int (*allocate_pages)(struct fscache_retrieval *op, 486 + struct list_head *pages, 487 + unsigned *nr_pages, 488 + gfp_t gfp) 489 + 490 + This is a multiple-page version of the allocate_page() method. ``pages`` and 491 + ``nr_pages`` should be treated as for the read_or_alloc_pages() method. 492 + 493 + 494 + * Request page be written to cache [mandatory]:: 495 + 496 + int (*write_page)(struct fscache_storage *op, 497 + struct page *page); 498 + 499 + This is called to write from a page on which there was a previously 500 + successful read_or_alloc_page() call or similar. FS-Cache filters out 501 + pages that don't have mappings. 502 + 503 + This method is called asynchronously from the FS-Cache thread pool. It is 504 + not required to actually store anything, provided -ENODATA is then 505 + returned to the next read of this page. 506 + 507 + If an error occurred, then a negative error code should be returned, 508 + otherwise zero should be returned. FS-Cache will take appropriate action 509 + in response to an error, such as withdrawing this object. 510 + 511 + If this method returns success then FS-Cache will inform the netfs 512 + appropriately. 513 + 514 + 515 + * Discard retained per-page metadata [mandatory]:: 516 + 517 + void (*uncache_page)(struct fscache_object *object, struct page *page) 518 + 519 + This is called when a netfs page is being evicted from the pagecache. The 520 + cache backend should tear down any internal representation or tracking it 521 + maintains for this page. 522 + 523 + 524 + FS-Cache Utilities 525 + ================== 526 + 527 + FS-Cache provides some utilities that a cache backend may make use of: 528 + 529 + * Note occurrence of an I/O error in a cache:: 530 + 531 + void fscache_io_error(struct fscache_cache *cache) 532 + 533 + This tells FS-Cache that an I/O error occurred in the cache. 
After this 534 + has been called, only resource dissociation operations (object and page 535 + release) will be passed from the netfs to the cache backend for the 536 + specified cache. 537 + 538 + This does not actually withdraw the cache. That must be done separately. 539 + 540 + 541 + * Invoke the retrieval I/O completion function:: 542 + 543 + void fscache_end_io(struct fscache_retrieval *op, struct page *page, 544 + int error); 545 + 546 + This is called to note the end of an attempt to retrieve a page. The 547 + error value should be 0 if successful and a negative error code otherwise. 548 + 549 + 550 + * Record that one or more pages being retrieved or allocated have been dealt 551 + with:: 552 + 553 + void fscache_retrieval_complete(struct fscache_retrieval *op, 554 + int n_pages); 555 + 556 + This is called to record the fact that one or more pages have been dealt 557 + with and are no longer the concern of this operation. When the number of 558 + pages remaining in the operation reaches 0, the operation will be 559 + completed. 560 + 561 + 562 + * Record operation completion:: 563 + 564 + void fscache_op_complete(struct fscache_operation *op); 565 + 566 + This is called to record the completion of an operation. This deducts 567 + this operation from the parent object's run state, potentially permitting 568 + one or more pending operations to start running. 569 + 570 + 571 + * Set highest store limit:: 572 + 573 + void fscache_set_store_limit(struct fscache_object *object, 574 + loff_t i_size); 575 + 576 + This sets the limit FS-Cache imposes on the highest byte it's willing to 577 + try to store for a netfs. Any page over this limit is automatically 578 + rejected by fscache_read_or_alloc_page() and co with -ENOBUFS. 579 + 580 + 581 + * Mark pages as being cached:: 582 + 583 + void fscache_mark_pages_cached(struct fscache_retrieval *op, 584 + struct pagevec *pagevec); 585 + 586 + This marks a set of pages as being cached. 
After this has been called, 587 + the netfs must call fscache_uncache_page() to unmark the pages. 588 + 589 + 590 + * Perform coherency check on an object:: 591 + 592 + enum fscache_checkaux fscache_check_aux(struct fscache_object *object, 593 + const void *data, 594 + uint16_t datalen); 595 + 596 + This asks the netfs to perform a coherency check on an object that has 597 + just been looked up. The cookie attached to the object will determine the 598 + netfs to use. data and datalen should specify where the auxiliary data 599 + retrieved from the cache can be found. 600 + 601 + One of three values will be returned: 602 + 603 + FSCACHE_CHECKAUX_OKAY 604 + The coherency data indicates the object is valid as is. 605 + 606 + FSCACHE_CHECKAUX_NEEDS_UPDATE 607 + The coherency data needs updating, but otherwise the object is 608 + valid. 609 + 610 + FSCACHE_CHECKAUX_OBSOLETE 611 + The coherency data indicates that the object is obsolete and should 612 + be discarded. 613 + 614 + 615 + * Initialise a freshly allocated object:: 616 + 617 + void fscache_object_init(struct fscache_object *object); 618 + 619 + This initialises all the fields in an object representation. 620 + 621 + 622 + * Indicate the destruction of an object:: 623 + 624 + void fscache_object_destroyed(struct fscache_cache *cache); 625 + 626 + This must be called to inform FS-Cache that an object that belonged to a 627 + cache has been destroyed and deallocated. This will allow continuation 628 + of the cache withdrawal process when it is stopped pending destruction of 629 + all the objects. 630 + 631 + 632 + * Indicate negative lookup on an object:: 633 + 634 + void fscache_object_lookup_negative(struct fscache_object *object); 635 + 636 + This is called to indicate to FS-Cache that a lookup process for an object 637 + found a negative result. 
638 + 639 + This changes the state of an object to permit reads pending on lookup 640 + completion to go off and start fetching data from the netfs server as it's 641 + known at this point that there can't be any data in the cache. 642 + 643 + This may be called multiple times on an object. Only the first call is 644 + significant - all subsequent calls are ignored. 645 + 646 + 647 + * Indicate an object has been obtained:: 648 + 649 + void fscache_obtained_object(struct fscache_object *object); 650 + 651 + This is called to indicate to FS-Cache that a lookup process for an object 652 + produced a positive result, or that an object was created. This should 653 + only be called once for any particular object. 654 + 655 + This changes the state of an object to indicate: 656 + 657 + (1) if no call to fscache_object_lookup_negative() has been made on 658 + this object, that there may be data available, and that reads can 659 + now go and look for it; and 660 + 661 + (2) that writes may now proceed against this object. 662 + 663 + 664 + * Indicate that object lookup failed:: 665 + 666 + void fscache_object_lookup_error(struct fscache_object *object); 667 + 668 + This marks an object as having encountered a fatal error (usually EIO) 669 + and causes it to move into a state whereby it will be withdrawn as soon 670 + as possible. 671 + 672 + 673 + * Indicate that a stale object was found and discarded:: 674 + 675 + void fscache_object_retrying_stale(struct fscache_object *object); 676 + 677 + This is called to indicate that the lookup procedure found an object in 678 + the cache that the netfs decided was stale. The object has been 679 + discarded from the cache and the lookup will be performed again. 
680 + 681 + 682 + * Indicate that the caching backend killed an object:: 683 + 684 + void fscache_object_mark_killed(struct fscache_object *object, 685 + enum fscache_why_object_killed why); 686 + 687 + This is called to indicate that the cache backend preemptively killed an 688 + object. The why parameter should be set to indicate the reason: 689 + 690 + FSCACHE_OBJECT_IS_STALE 691 + - the object was stale and needs discarding. 692 + 693 + FSCACHE_OBJECT_NO_SPACE 694 + - there was insufficient cache space 695 + 696 + FSCACHE_OBJECT_WAS_RETIRED 697 + - the object was retired when relinquished. 698 + 699 + FSCACHE_OBJECT_WAS_CULLED 700 + - the object was culled to make space. 701 + 702 + 703 + * Get and release references on a retrieval record:: 704 + 705 + void fscache_get_retrieval(struct fscache_retrieval *op); 706 + void fscache_put_retrieval(struct fscache_retrieval *op); 707 + 708 + These two functions are used to retain a retrieval record while doing 709 + asynchronous data retrieval and block allocation. 710 + 711 + 712 + * Enqueue a retrieval record for processing:: 713 + 714 + void fscache_enqueue_retrieval(struct fscache_retrieval *op); 715 + 716 + This enqueues a retrieval record for processing by the FS-Cache thread 717 + pool. One of the threads in the pool will invoke the retrieval record's 718 + op->op.processor callback function. This function may be called from 719 + within the callback function. 720 + 721 + 722 + * List of object state names:: 723 + 724 + const char *fscache_object_states[]; 725 + 726 + For debugging purposes, this may be used to turn the state that an object 727 + is in into a text string for display purposes.
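The return-value precedence described for the read_or_alloc_pages() method (a hard error such as -ENOMEM first, then -ENOBUFS, then -ENODATA, then 0) can be sketched in plain userspace C. This is only an illustration under stated assumptions: `enum page_outcome` and `summarise_outcomes()` are hypothetical names invented here, not part of the FS-Cache API, and a real backend would derive the per-page outcomes from its own lookup and allocation paths.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-page result; not an FS-Cache type. */
enum page_outcome {
	PAGE_READ_QUEUED,	/* data was in the cache and a read was dispatched */
	PAGE_BLOCK_RESERVED,	/* no data, but a backing block was reserved */
	PAGE_NO_BLOCK,		/* page could be neither read nor allocated */
	PAGE_HARD_ERROR,	/* e.g. the backend ran out of memory */
};

/*
 * Fold the per-page outcomes into a single return value using the
 * precedence given in the text: -ENOMEM beats -ENOBUFS, which beats
 * -ENODATA, which beats 0.
 */
static int summarise_outcomes(const enum page_outcome *outcomes, size_t n)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		switch (outcomes[i]) {
		case PAGE_HARD_ERROR:
			/* Highest priority, so nothing later can override it. */
			return -ENOMEM;
		case PAGE_NO_BLOCK:
			ret = -ENOBUFS;
			break;
		case PAGE_BLOCK_RESERVED:
			/* Only report -ENODATA if nothing worse was seen. */
			if (ret == 0)
				ret = -ENODATA;
			break;
		case PAGE_READ_QUEUED:
			break;
		}
	}
	return ret;
}
```

A caller in this sketch would record one outcome per page while walking the list and return `summarise_outcomes(outcomes, n)` at the end; a batch containing both a reserved-only page and a page with no backing block therefore yields -ENOBUFS rather than -ENODATA.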

-726

Documentation/filesystems/caching/backend-api.txt

··· 1 - ========================== 2 - FS-CACHE CACHE BACKEND API 3 - ========================== 4 - 5 - The FS-Cache system provides an API by which actual caches can be supplied to 6 - FS-Cache for it to then serve out to network filesystems and other interested 7 - parties. 8 - 9 - This API is declared in <linux/fscache-cache.h>. 10 - 11 - 12 - ==================================== 13 - INITIALISING AND REGISTERING A CACHE 14 - ==================================== 15 - 16 - To start off, a cache definition must be initialised and registered for each 17 - cache the backend wants to make available. For instance, CacheFS does this in 18 - the fill_super() operation on mounting. 19 - 20 - The cache definition (struct fscache_cache) should be initialised by calling: 21 - 22 - void fscache_init_cache(struct fscache_cache *cache, 23 - struct fscache_cache_ops *ops, 24 - const char *idfmt, 25 - ...); 26 - 27 - Where: 28 - 29 - (*) "cache" is a pointer to the cache definition; 30 - 31 - (*) "ops" is a pointer to the table of operations that the backend supports on 32 - this cache; and 33 - 34 - (*) "idfmt" is a format and printf-style arguments for constructing a label 35 - for the cache. 36 - 37 - 38 - The cache should then be registered with FS-Cache by passing a pointer to the 39 - previously initialised cache definition to: 40 - 41 - int fscache_add_cache(struct fscache_cache *cache, 42 - struct fscache_object *fsdef, 43 - const char *tagname); 44 - 45 - Two extra arguments should also be supplied: 46 - 47 - (*) "fsdef" which should point to the object representation for the FS-Cache 48 - master index in this cache. Netfs primary index entries will be created 49 - here. FS-Cache keeps the caller's reference to the index object if 50 - successful and will release it upon withdrawal of the cache. 51 - 52 - (*) "tagname" which, if given, should be a text string naming this cache. If 53 - this is NULL, the identifier will be used instead. 
For CacheFS, the 54 - identifier is set to name the underlying block device and the tag can be 55 - supplied by mount. 56 - 57 - This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag 58 - is already in use. 0 will be returned on success. 59 - 60 - 61 - ===================== 62 - UNREGISTERING A CACHE 63 - ===================== 64 - 65 - A cache can be withdrawn from the system by calling this function with a 66 - pointer to the cache definition: 67 - 68 - void fscache_withdraw_cache(struct fscache_cache *cache); 69 - 70 - In CacheFS's case, this is called by put_super(). 71 - 72 - 73 - ======== 74 - SECURITY 75 - ======== 76 - 77 - The cache methods are executed one of two contexts: 78 - 79 - (1) that of the userspace process that issued the netfs operation that caused 80 - the cache method to be invoked, or 81 - 82 - (2) that of one of the processes in the FS-Cache thread pool. 83 - 84 - In either case, this may not be an appropriate context in which to access the 85 - cache. 86 - 87 - The calling process's fsuid, fsgid and SELinux security identities may need to 88 - be masqueraded for the duration of the cache driver's access to the cache. 89 - This is left to the cache to handle; FS-Cache makes no effort in this regard. 90 - 91 - 92 - =================================== 93 - CONTROL AND STATISTICS PRESENTATION 94 - =================================== 95 - 96 - The cache may present data to the outside world through FS-Cache's interfaces 97 - in sysfs and procfs - the former for control and the latter for statistics. 98 - 99 - A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS 100 - is enabled. This is accessible through the kobject struct fscache_cache::kobj 101 - and is for use by the cache as it sees fit. 
102 - 103 - 104 - ======================== 105 - RELEVANT DATA STRUCTURES 106 - ======================== 107 - 108 - (*) Index/Data file FS-Cache representation cookie: 109 - 110 - struct fscache_cookie { 111 - struct fscache_object_def *def; 112 - struct fscache_netfs *netfs; 113 - void *netfs_data; 114 - ... 115 - }; 116 - 117 - The fields that might be of use to the backend describe the object 118 - definition, the netfs definition and the netfs's data for this cookie. 119 - The object definition contain functions supplied by the netfs for loading 120 - and matching index entries; these are required to provide some of the 121 - cache operations. 122 - 123 - 124 - (*) In-cache object representation: 125 - 126 - struct fscache_object { 127 - int debug_id; 128 - enum { 129 - FSCACHE_OBJECT_RECYCLING, 130 - ... 131 - } state; 132 - spinlock_t lock 133 - struct fscache_cache *cache; 134 - struct fscache_cookie *cookie; 135 - ... 136 - }; 137 - 138 - Structures of this type should be allocated by the cache backend and 139 - passed to FS-Cache when requested by the appropriate cache operation. In 140 - the case of CacheFS, they're embedded in CacheFS's internal object 141 - structures. 142 - 143 - The debug_id is a simple integer that can be used in debugging messages 144 - that refer to a particular object. In such a case it should be printed 145 - using "OBJ%x" to be consistent with FS-Cache. 146 - 147 - Each object contains a pointer to the cookie that represents the object it 148 - is backing. An object should retired when put_object() is called if it is 149 - in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be 150 - initialised by calling fscache_object_init(object). 
151 - 152 - 153 - (*) FS-Cache operation record: 154 - 155 - struct fscache_operation { 156 - atomic_t usage; 157 - struct fscache_object *object; 158 - unsigned long flags; 159 - #define FSCACHE_OP_EXCLUSIVE 160 - void (*processor)(struct fscache_operation *op); 161 - void (*release)(struct fscache_operation *op); 162 - ... 163 - }; 164 - 165 - FS-Cache has a pool of threads that it uses to give CPU time to the 166 - various asynchronous operations that need to be done as part of driving 167 - the cache. These are represented by the above structure. The processor 168 - method is called to give the op CPU time, and the release method to get 169 - rid of it when its usage count reaches 0. 170 - 171 - An operation can be made exclusive upon an object by setting the 172 - appropriate flag before enqueuing it with fscache_enqueue_operation(). If 173 - an operation needs more processing time, it should be enqueued again. 174 - 175 - 176 - (*) FS-Cache retrieval operation record: 177 - 178 - struct fscache_retrieval { 179 - struct fscache_operation op; 180 - struct address_space *mapping; 181 - struct list_head *to_do; 182 - ... 183 - }; 184 - 185 - A structure of this type is allocated by FS-Cache to record retrieval and 186 - allocation requests made by the netfs. This struct is then passed to the 187 - backend to do the operation. The backend may get extra refs to it by 188 - calling fscache_get_retrieval() and refs may be discarded by calling 189 - fscache_put_retrieval(). 190 - 191 - A retrieval operation can be used by the backend to do retrieval work. To 192 - do this, the retrieval->op.processor method pointer should be set 193 - appropriately by the backend and fscache_enqueue_retrieval() called to 194 - submit it to the thread pool. CacheFiles, for example, uses this to queue 195 - page examination when it detects PG_lock being cleared. 196 - 197 - The to_do field is an empty list available for the cache backend to use as 198 - it sees fit. 
199 - 200 - 201 - (*) FS-Cache storage operation record: 202 - 203 - struct fscache_storage { 204 - struct fscache_operation op; 205 - pgoff_t store_limit; 206 - ... 207 - }; 208 - 209 - A structure of this type is allocated by FS-Cache to record outstanding 210 - writes to be made. FS-Cache itself enqueues this operation and invokes 211 - the write_page() method on the object at appropriate times to effect 212 - storage. 213 - 214 - 215 - ================ 216 - CACHE OPERATIONS 217 - ================ 218 - 219 - The cache backend provides FS-Cache with a table of operations that can be 220 - performed on the denizens of the cache. These are held in a structure of type: 221 - 222 - struct fscache_cache_ops 223 - 224 - (*) Name of cache provider [mandatory]: 225 - 226 - const char *name 227 - 228 - This isn't strictly an operation, but should be pointed at a string naming 229 - the backend. 230 - 231 - 232 - (*) Allocate a new object [mandatory]: 233 - 234 - struct fscache_object *(*alloc_object)(struct fscache_cache *cache, 235 - struct fscache_cookie *cookie) 236 - 237 - This method is used to allocate a cache object representation to back a 238 - cookie in a particular cache. fscache_object_init() should be called on 239 - the object to initialise it prior to returning. 240 - 241 - This function may also be used to parse the index key to be used for 242 - multiple lookup calls to turn it into a more convenient form. FS-Cache 243 - will call the lookup_complete() method to allow the cache to release the 244 - form once lookup is complete or aborted. 245 - 246 - 247 - (*) Look up and create object [mandatory]: 248 - 249 - void (*lookup_object)(struct fscache_object *object) 250 - 251 - This method is used to look up an object, given that the object is already 252 - allocated and attached to the cookie. This should instantiate that object 253 - in the cache if it can. 
254 - 255 - The method should call fscache_object_lookup_negative() as soon as 256 - possible if it determines the object doesn't exist in the cache. If the 257 - object is found to exist and the netfs indicates that it is valid then 258 - fscache_obtained_object() should be called once the object is in a 259 - position to have data stored in it. Similarly, fscache_obtained_object() 260 - should also be called once a non-present object has been created. 261 - 262 - If a lookup error occurs, fscache_object_lookup_error() should be called 263 - to abort the lookup of that object. 264 - 265 - 266 - (*) Release lookup data [mandatory]: 267 - 268 - void (*lookup_complete)(struct fscache_object *object) 269 - 270 - This method is called to ask the cache to release any resources it was 271 - using to perform a lookup. 272 - 273 - 274 - (*) Increment object refcount [mandatory]: 275 - 276 - struct fscache_object *(*grab_object)(struct fscache_object *object) 277 - 278 - This method is called to increment the reference count on an object. It 279 - may fail (for instance if the cache is being withdrawn) by returning NULL. 280 - It should return the object pointer if successful. 281 - 282 - 283 - (*) Lock/Unlock object [mandatory]: 284 - 285 - void (*lock_object)(struct fscache_object *object) 286 - void (*unlock_object)(struct fscache_object *object) 287 - 288 - These methods are used to exclusively lock an object. It must be possible 289 - to schedule with the lock held, so a spinlock isn't sufficient. 290 - 291 - 292 - (*) Pin/Unpin object [optional]: 293 - 294 - int (*pin_object)(struct fscache_object *object) 295 - void (*unpin_object)(struct fscache_object *object) 296 - 297 - These methods are used to pin an object into the cache. Once pinned an 298 - object cannot be reclaimed to make space. Return -ENOSPC if there's not 299 - enough space in the cache to permit this. 
300 - 301 - 302 - (*) Check coherency state of an object [mandatory]: 303 - 304 - int (*check_consistency)(struct fscache_object *object) 305 - 306 - This method is called to have the cache check the saved auxiliary data of 307 - the object against the netfs's idea of the state. 0 should be returned 308 - if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS 309 - may also be returned. 310 - 311 - (*) Update object [mandatory]: 312 - 313 - int (*update_object)(struct fscache_object *object) 314 - 315 - This is called to update the index entry for the specified object. The 316 - new information should be in object->cookie->netfs_data. This can be 317 - obtained by calling object->cookie->def->get_aux()/get_attr(). 318 - 319 - 320 - (*) Invalidate data object [mandatory]: 321 - 322 - int (*invalidate_object)(struct fscache_operation *op) 323 - 324 - This is called to invalidate a data object (as pointed to by op->object). 325 - All the data stored for this object should be discarded and an 326 - attr_changed operation should be performed. The caller will follow up 327 - with an object update operation. 328 - 329 - fscache_op_complete() must be called on op before returning. 330 - 331 - 332 - (*) Discard object [mandatory]: 333 - 334 - void (*drop_object)(struct fscache_object *object) 335 - 336 - This method is called to indicate that an object has been unbound from its 337 - cookie, and that the cache should release the object's resources and 338 - retire it if it's in state FSCACHE_OBJECT_RECYCLING. 339 - 340 - This method should not attempt to release any references held by the 341 - caller. The caller will invoke the put_object() method as appropriate. 342 - 343 - 344 - (*) Release object reference [mandatory]: 345 - 346 - void (*put_object)(struct fscache_object *object) 347 - 348 - This method is used to discard a reference to an object. The object may 349 - be freed when all the references to it are released. 
350 - 351 - 352 - (*) Synchronise a cache [mandatory]: 353 - 354 - void (*sync)(struct fscache_cache *cache) 355 - 356 - This is called to ask the backend to synchronise a cache with its backing 357 - device. 358 - 359 - 360 - (*) Dissociate a cache [mandatory]: 361 - 362 - void (*dissociate_pages)(struct fscache_cache *cache) 363 - 364 - This is called to ask a cache to perform any page dissociations as part of 365 - cache withdrawal. 366 - 367 - 368 - (*) Notification that the attributes on a netfs file changed [mandatory]: 369 - 370 - int (*attr_changed)(struct fscache_object *object); 371 - 372 - This is called to indicate to the cache that certain attributes on a netfs 373 - file have changed (for example the maximum size a file may reach). The 374 - cache can read these from the netfs by calling the cookie's get_attr() 375 - method. 376 - 377 - The cache may use the file size information to reserve space on the cache. 378 - It should also call fscache_set_store_limit() to indicate to FS-Cache the 379 - highest byte it's willing to store for an object. 380 - 381 - This method may return -ve if an error occurred or the cache object cannot 382 - be expanded. In such a case, the object will be withdrawn from service. 383 - 384 - This operation is run asynchronously from FS-Cache's thread pool, and 385 - storage and retrieval operations from the netfs are excluded during the 386 - execution of this operation. 387 - 388 - 389 - (*) Reserve cache space for an object's data [optional]: 390 - 391 - int (*reserve_space)(struct fscache_object *object, loff_t size); 392 - 393 - This is called to request that cache space be reserved to hold the data 394 - for an object and the metadata used to track it. Zero size should be 395 - taken as request to cancel a reservation. 396 - 397 - This should return 0 if successful, -ENOSPC if there isn't enough space 398 - available, or -ENOMEM or -EIO on other errors. 
399 - 400 - The reservation may exceed the current size of the object, thus permitting 401 - future expansion. If the amount of space consumed by an object would 402 - exceed the reservation, it's permitted to refuse requests to allocate 403 - pages, but not required. An object may be pruned down to its reservation 404 - size if larger than that already. 405 - 406 - 407 - (*) Request page be read from cache [mandatory]: 408 - 409 - int (*read_or_alloc_page)(struct fscache_retrieval *op, 410 - struct page *page, 411 - gfp_t gfp) 412 - 413 - This is called to attempt to read a netfs page from the cache, or to 414 - reserve a backing block if not. FS-Cache will have done as much checking 415 - as it can before calling, but most of the work belongs to the backend. 416 - 417 - If there's no page in the cache, then -ENODATA should be returned if the 418 - backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it 419 - didn't. 420 - 421 - If there is suitable data in the cache, then a read operation should be 422 - queued and 0 returned. When the read finishes, fscache_end_io() should be 423 - called. 424 - 425 - The fscache_mark_pages_cached() should be called for the page if any cache 426 - metadata is retained. This will indicate to the netfs that the page needs 427 - explicit uncaching. This operation takes a pagevec, thus allowing several 428 - pages to be marked at once. 429 - 430 - The retrieval record pointed to by op should be retained for each page 431 - queued and released when I/O on the page has been formally ended. 432 - fscache_get/put_retrieval() are available for this purpose. 433 - 434 - The retrieval record may be used to get CPU time via the FS-Cache thread 435 - pool. If this is desired, the op->op.processor should be set to point to 436 - the appropriate processing routine, and fscache_enqueue_retrieval() should 437 - be called at an appropriate point to request CPU time. 
For instance, the 438 - retrieval routine could be enqueued upon the completion of a disk read. 439 - The to_do field in the retrieval record is provided to aid in this. 440 - 441 - If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS 442 - returned if possible or fscache_end_io() called with a suitable error 443 - code. 444 - 445 - fscache_put_retrieval() should be called after a page or pages are dealt 446 - with. This will complete the operation when all pages are dealt with. 447 - 448 - 449 - (*) Request pages be read from cache [mandatory]: 450 - 451 - int (*read_or_alloc_pages)(struct fscache_retrieval *op, 452 - struct list_head *pages, 453 - unsigned *nr_pages, 454 - gfp_t gfp) 455 - 456 - This is like the read_or_alloc_page() method, except it is handed a list 457 - of pages instead of one page. Any pages on which a read operation is 458 - started must be added to the page cache for the specified mapping and also 459 - to the LRU. Such pages must also be removed from the pages list and 460 - *nr_pages decremented per page. 461 - 462 - If there was an error such as -ENOMEM, then that should be returned; else 463 - if one or more pages couldn't be read or allocated, then -ENOBUFS should 464 - be returned; else if one or more pages couldn't be read, then -ENODATA 465 - should be returned. If all the pages are dispatched then 0 should be 466 - returned. 467 - 468 - 469 - (*) Request page be allocated in the cache [mandatory]: 470 - 471 - int (*allocate_page)(struct fscache_retrieval *op, 472 - struct page *page, 473 - gfp_t gfp) 474 - 475 - This is like the read_or_alloc_page() method, except that it shouldn't 476 - read from the cache, even if there's data there that could be retrieved. 477 - It should, however, set up any internal metadata required such that 478 - the write_page() method can write to the cache. 
479 - 480 - If there's no backing block available, then -ENOBUFS should be returned 481 - (or -ENOMEM if there were other problems). If a block is successfully 482 - allocated, then the netfs page should be marked and 0 returned. 483 - 484 - 485 - (*) Request pages be allocated in the cache [mandatory]: 486 - 487 - int (*allocate_pages)(struct fscache_retrieval *op, 488 - struct list_head *pages, 489 - unsigned *nr_pages, 490 - gfp_t gfp) 491 - 492 - This is an multiple page version of the allocate_page() method. pages and 493 - nr_pages should be treated as for the read_or_alloc_pages() method. 494 - 495 - 496 - (*) Request page be written to cache [mandatory]: 497 - 498 - int (*write_page)(struct fscache_storage *op, 499 - struct page *page); 500 - 501 - This is called to write from a page on which there was a previously 502 - successful read_or_alloc_page() call or similar. FS-Cache filters out 503 - pages that don't have mappings. 504 - 505 - This method is called asynchronously from the FS-Cache thread pool. It is 506 - not required to actually store anything, provided -ENODATA is then 507 - returned to the next read of this page. 508 - 509 - If an error occurred, then a negative error code should be returned, 510 - otherwise zero should be returned. FS-Cache will take appropriate action 511 - in response to an error, such as withdrawing this object. 512 - 513 - If this method returns success then FS-Cache will inform the netfs 514 - appropriately. 515 - 516 - 517 - (*) Discard retained per-page metadata [mandatory]: 518 - 519 - void (*uncache_page)(struct fscache_object *object, struct page *page) 520 - 521 - This is called when a netfs page is being evicted from the pagecache. The 522 - cache backend should tear down any internal representation or tracking it 523 - maintains for this page. 
524 - 525 - 526 - ================== 527 - FS-CACHE UTILITIES 528 - ================== 529 - 530 - FS-Cache provides some utilities that a cache backend may make use of: 531 - 532 - (*) Note occurrence of an I/O error in a cache: 533 - 534 - void fscache_io_error(struct fscache_cache *cache) 535 - 536 - This tells FS-Cache that an I/O error occurred in the cache. After this 537 - has been called, only resource dissociation operations (object and page 538 - release) will be passed from the netfs to the cache backend for the 539 - specified cache. 540 - 541 - This does not actually withdraw the cache. That must be done separately. 542 - 543 - 544 - (*) Invoke the retrieval I/O completion function: 545 - 546 - void fscache_end_io(struct fscache_retrieval *op, struct page *page, 547 - int error); 548 - 549 - This is called to note the end of an attempt to retrieve a page. The 550 - error value should be 0 if successful and an error otherwise. 551 - 552 - 553 - (*) Record that one or more pages being retrieved or allocated have been dealt 554 - with: 555 - 556 - void fscache_retrieval_complete(struct fscache_retrieval *op, 557 - int n_pages); 558 - 559 - This is called to record the fact that one or more pages have been dealt 560 - with and are no longer the concern of this operation. When the number of 561 - pages remaining in the operation reaches 0, the operation will be 562 - completed. 563 - 564 - 565 - (*) Record operation completion: 566 - 567 - void fscache_op_complete(struct fscache_operation *op); 568 - 569 - This is called to record the completion of an operation. This deducts 570 - this operation from the parent object's run state, potentially permitting 571 - one or more pending operations to start running. 
572 - 573 - 574 - (*) Set highest store limit: 575 - 576 - void fscache_set_store_limit(struct fscache_object *object, 577 - loff_t i_size); 578 - 579 - This sets the limit FS-Cache imposes on the highest byte it's willing to 580 - try and store for a netfs. Any page over this limit is automatically 581 - rejected by fscache_read_alloc_page() and co with -ENOBUFS. 582 - 583 - 584 - (*) Mark pages as being cached: 585 - 586 - void fscache_mark_pages_cached(struct fscache_retrieval *op, 587 - struct pagevec *pagevec); 588 - 589 - This marks a set of pages as being cached. After this has been called, 590 - the netfs must call fscache_uncache_page() to unmark the pages. 591 - 592 - 593 - (*) Perform coherency check on an object: 594 - 595 - enum fscache_checkaux fscache_check_aux(struct fscache_object *object, 596 - const void *data, 597 - uint16_t datalen); 598 - 599 - This asks the netfs to perform a coherency check on an object that has 600 - just been looked up. The cookie attached to the object will determine the 601 - netfs to use. data and datalen should specify where the auxiliary data 602 - retrieved from the cache can be found. 603 - 604 - One of three values will be returned: 605 - 606 - (*) FSCACHE_CHECKAUX_OKAY 607 - 608 - The coherency data indicates the object is valid as is. 609 - 610 - (*) FSCACHE_CHECKAUX_NEEDS_UPDATE 611 - 612 - The coherency data needs updating, but otherwise the object is 613 - valid. 614 - 615 - (*) FSCACHE_CHECKAUX_OBSOLETE 616 - 617 - The coherency data indicates that the object is obsolete and should 618 - be discarded. 619 - 620 - 621 - (*) Initialise a freshly allocated object: 622 - 623 - void fscache_object_init(struct fscache_object *object); 624 - 625 - This initialises all the fields in an object representation. 
626 - 627 - 628 - (*) Indicate the destruction of an object: 629 - 630 - void fscache_object_destroyed(struct fscache_cache *cache); 631 - 632 - This must be called to inform FS-Cache that an object that belonged to a 633 - cache has been destroyed and deallocated. This will allow continuation 634 - of the cache withdrawal process when it is stopped pending destruction of 635 - all the objects. 636 - 637 - 638 - (*) Indicate negative lookup on an object: 639 - 640 - void fscache_object_lookup_negative(struct fscache_object *object); 641 - 642 - This is called to indicate to FS-Cache that a lookup process for an object 643 - found a negative result. 644 - 645 - This changes the state of an object to permit reads pending on lookup 646 - completion to go off and start fetching data from the netfs server as it's 647 - known at this point that there can't be any data in the cache. 648 - 649 - This may be called multiple times on an object. Only the first call is 650 - significant - all subsequent calls are ignored. 651 - 652 - 653 - (*) Indicate an object has been obtained: 654 - 655 - void fscache_obtained_object(struct fscache_object *object); 656 - 657 - This is called to indicate to FS-Cache that a lookup process for an object 658 - produced a positive result, or that an object was created. This should 659 - only be called once for any particular object. 660 - 661 - This changes the state of an object to indicate: 662 - 663 - (1) if no call to fscache_object_lookup_negative() has been made on 664 - this object, that there may be data available, and that reads can 665 - now go and look for it; and 666 - 667 - (2) that writes may now proceed against this object. 
668 - 669 - 670 - (*) Indicate that object lookup failed: 671 - 672 - void fscache_object_lookup_error(struct fscache_object *object); 673 - 674 - This marks an object as having encountered a fatal error (usually EIO) 675 - and causes it to move into a state whereby it will be withdrawn as soon 676 - as possible. 677 - 678 - 679 - (*) Indicate that a stale object was found and discarded: 680 - 681 - void fscache_object_retrying_stale(struct fscache_object *object); 682 - 683 - This is called to indicate that the lookup procedure found an object in 684 - the cache that the netfs decided was stale. The object has been 685 - discarded from the cache and the lookup will be performed again. 686 - 687 - 688 - (*) Indicate that the caching backend killed an object: 689 - 690 - void fscache_object_mark_killed(struct fscache_object *object, 691 - enum fscache_why_object_killed why); 692 - 693 - This is called to indicate that the cache backend preemptively killed an 694 - object. The why parameter should be set to indicate the reason: 695 - 696 - FSCACHE_OBJECT_IS_STALE - the object was stale and needs discarding. 697 - FSCACHE_OBJECT_NO_SPACE - there was insufficient cache space 698 - FSCACHE_OBJECT_WAS_RETIRED - the object was retired when relinquished. 699 - FSCACHE_OBJECT_WAS_CULLED - the object was culled to make space. 700 - 701 - 702 - (*) Get and release references on a retrieval record: 703 - 704 - void fscache_get_retrieval(struct fscache_retrieval *op); 705 - void fscache_put_retrieval(struct fscache_retrieval *op); 706 - 707 - These two functions are used to retain a retrieval record while doing 708 - asynchronous data retrieval and block allocation. 709 - 710 - 711 - (*) Enqueue a retrieval record for processing. 712 - 713 - void fscache_enqueue_retrieval(struct fscache_retrieval *op); 714 - 715 - This enqueues a retrieval record for processing by the FS-Cache thread 716 - pool. 
One of the threads in the pool will invoke the retrieval record's 717 - op->op.processor callback function. This function may be called from 718 - within the callback function. 719 - 720 - 721 - (*) List of object state names: 722 - 723 - const char *fscache_object_states[]; 724 - 725 - For debugging purposes, this may be used to turn the state that an object 726 - is in into a text string for display purposes.

+484

Documentation/filesystems/caching/cachefiles.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =============================================== 4 + CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM 5 + =============================================== 6 + 7 + .. Contents: 8 + 9 + (*) Overview. 10 + 11 + (*) Requirements. 12 + 13 + (*) Configuration. 14 + 15 + (*) Starting the cache. 16 + 17 + (*) Things to avoid. 18 + 19 + (*) Cache culling. 20 + 21 + (*) Cache structure. 22 + 23 + (*) Security model and SELinux. 24 + 25 + (*) A note on security. 26 + 27 + (*) Statistical information. 28 + 29 + (*) Debugging. 30 + 31 + 32 + 33 + Overview 34 + ======== 35 + 36 + CacheFiles is a caching backend that's meant to use as a cache a directory on 37 + an already mounted filesystem of a local type (such as Ext3). 38 + 39 + CacheFiles uses a userspace daemon to do some of the cache management - such as 40 + reaping stale nodes and culling. This is called cachefilesd and lives in 41 + /sbin. 42 + 43 + The filesystem and data integrity of the cache are only as good as those of the 44 + filesystem providing the backing services. Note that CacheFiles does not 45 + attempt to journal anything since the journalling interfaces of the various 46 + filesystems are very specific in nature. 47 + 48 + CacheFiles creates a misc character device - "/dev/cachefiles" - that is used 49 + to communicate with the daemon. Only one thing may have this open at once, 50 + and while it is open, a cache is at least partially in existence. The daemon 51 + opens this and sends commands down it to control the cache. 52 + 53 + CacheFiles is currently limited to a single cache. 54 + 55 + CacheFiles attempts to maintain at least a certain percentage of free space on 56 + the filesystem, shrinking the cache by culling the objects it contains to make 57 + space if necessary - see the "Cache Culling" section. 
This means it can be 58 + placed on the same medium as a live set of data, and will expand to make use of 59 + spare space and automatically contract when the set of data requires more 60 + space. 61 + 62 + 63 + 64 + Requirements 65 + ============ 66 + 67 + The use of CacheFiles and its daemon requires the following features to be 68 + available in the system and in the cache filesystem: 69 + 70 + - dnotify. 71 + 72 + - extended attributes (xattrs). 73 + 74 + - openat() and friends. 75 + 76 + - bmap() support on files in the filesystem (FIBMAP ioctl). 77 + 78 + - The use of bmap() to detect a partial page at the end of the file. 79 + 80 + It is strongly recommended that the "dir_index" option is enabled on Ext3 81 + filesystems being used as a cache. 82 + 83 + 84 + Configuration 85 + ============= 86 + 87 + The cache is configured by a script in /etc/cachefilesd.conf. These commands 88 + set up the cache ready for use. The following script commands are available: 89 + 90 + brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>% 91 + Configure the culling limits. Optional. See the section on culling. 92 + The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. 93 + 94 + The commands beginning with a 'b' are file space (block) limits, those 95 + beginning with an 'f' are file count limits. 96 + 97 + dir <path> 98 + Specify the directory containing the root of the cache. Mandatory. 99 + 100 + tag <name> 101 + Specify a tag to FS-Cache to use in distinguishing multiple caches. 102 + Optional. The default is "CacheFiles". 103 + 104 + debug <mask> 105 + Specify a numeric bitmask to control debugging in the kernel module. 106 + Optional. The default is zero (all off). 
The following values can be 107 + OR'd into the mask to collect various information: 108 + 109 + == ================================================= 110 + 1 Turn on trace of function entry (_enter() macros) 111 + 2 Turn on trace of function exit (_leave() macros) 112 + 4 Turn on trace of internal debug points (_debug()) 113 + == ================================================= 114 + 115 + This mask can also be set through sysfs, eg:: 116 + 117 + echo 5 >/sys/module/cachefiles/parameters/debug 118 + 119 + 120 + Starting the Cache 121 + ================== 122 + 123 + The cache is started by running the daemon. The daemon opens the cache device, 124 + configures the cache and tells it to begin caching. At that point the cache 125 + binds to fscache and the cache becomes live. 126 + 127 + The daemon is run as follows:: 128 + 129 + /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] 130 + 131 + The flags are: 132 + 133 + ``-d`` 134 + Increase the debugging level. This can be specified multiple times and 135 + is cumulative with itself. 136 + 137 + ``-s`` 138 + Send messages to stderr instead of syslog. 139 + 140 + ``-n`` 141 + Don't daemonise and go into background. 142 + 143 + ``-f <configfile>`` 144 + Use an alternative configuration file rather than the default one. 145 + 146 + 147 + Things to Avoid 148 + =============== 149 + 150 + Do not mount other things within the cache as this will cause problems. The 151 + kernel module contains its own very cut-down path walking facility that ignores 152 + mountpoints, but the daemon can't avoid them. 153 + 154 + Do not create, rename or unlink files and directories in the cache while the 155 + cache is active, as this may cause the state to become uncertain. 156 + 157 + Renaming files in the cache might make objects appear to be other objects (the 158 + filename is part of the lookup key). 
159 + 160 + Do not change or remove the extended attributes attached to cache files by the 161 + cache as this will cause the cache state management to get confused. 162 + 163 + Do not create files or directories in the cache, lest the cache get confused or 164 + serve incorrect data. 165 + 166 + Do not chmod files in the cache. The module creates things with minimal 167 + permissions to prevent random users being able to access them directly. 168 + 169 + 170 + Cache Culling 171 + ============= 172 + 173 + The cache may need culling occasionally to make space. This involves 174 + discarding objects from the cache that have been used less recently than 175 + anything else. Culling is based on the access time of data objects. Empty 176 + directories are culled if not in use. 177 + 178 + Cache culling is done on the basis of the percentage of blocks and the 179 + percentage of files available in the underlying filesystem. There are six 180 + "limits": 181 + 182 + brun, frun 183 + If the amount of free space and the number of available files in the cache 184 + rises above both these limits, then culling is turned off. 185 + 186 + bcull, fcull 187 + If the amount of available space or the number of available files in the 188 + cache falls below either of these limits, then culling is started. 189 + 190 + bstop, fstop 191 + If the amount of available space or the number of available files in the 192 + cache falls below either of these limits, then no further allocation of 193 + disk space or files is permitted until culling has raised things above 194 + these limits again. 195 + 196 + These must be configured thusly:: 197 + 198 + 0 <= bstop < bcull < brun < 100 199 + 0 <= fstop < fcull < frun < 100 200 + 201 + Note that these are percentages of available space and available files, and do 202 + _not_ appear as 100 minus the percentage displayed by the "df" program. 203 + 204 + The userspace daemon scans the cache to build up a table of cullable objects. 
205 + These are then culled in least recently used order. A new scan of the cache is 206 + started as soon as space is made in the table. Objects will be skipped if 207 + their atimes have changed or if the kernel module says it is still using them. 208 + 209 + 210 + Cache Structure 211 + =============== 212 + 213 + The CacheFiles module will create two directories in the directory it was 214 + given: 215 + 216 + * cache/ 217 + * graveyard/ 218 + 219 + The active cache objects all reside in the first directory. The CacheFiles 220 + kernel module moves any retired or culled objects that it can't simply unlink 221 + to the graveyard from which the daemon will actually delete them. 222 + 223 + The daemon uses dnotify to monitor the graveyard directory, and will delete 224 + anything that appears therein. 225 + 226 + 227 + The module represents index objects as directories with the filename "I..." or 228 + "J...". Note that the "cache/" directory is itself a special index. 229 + 230 + Data objects are represented as files if they have no children, or directories 231 + if they do. Their filenames all begin "D..." or "E...". If represented as a 232 + directory, data objects will have a file in the directory called "data" that 233 + actually holds the data. 234 + 235 + Special objects are similar to data objects, except their filenames begin 236 + "S..." or "T...". 237 + 238 + 239 + If an object has children, then it will be represented as a directory. 240 + Immediately in the representative directory are a collection of directories 241 + named for hash values of the child object keys with an '@' prepended. 
Into 242 + this directory, if possible, will be placed the representations of the child 243 + objects:: 244 + 245 + /INDEX /INDEX /INDEX /DATA FILES 246 + /=========/==========/=================================/================ 247 + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 248 + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry 249 + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry 250 + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry 251 + 252 + 253 + If the key is so long that it exceeds NAME_MAX with the decorations added on to 254 + it, then it will be cut into pieces, the first few of which will be used to 255 + make a nest of directories, and the last one of which will be the objects 256 + inside the last directory. The names of the intermediate directories will have 257 + '+' prepended:: 258 + 259 + J1223/@23/+xy...z/+kl...m/Epqr 260 + 261 + 262 + Note that keys are raw data, and not only may they exceed NAME_MAX in size, 263 + they may also contain things like '/' and NUL characters, and so they may not 264 + be suitable for turning directly into a filename. 265 + 266 + To handle this, CacheFiles will use a suitably printable filename directly and 267 + "base-64" encode ones that aren't directly suitable. The two versions of 268 + object filenames indicate the encoding: 269 + 270 + =============== =============== =============== 271 + OBJECT TYPE PRINTABLE ENCODED 272 + =============== =============== =============== 273 + Index "I..." "J..." 274 + Data "D..." "E..." 275 + Special "S..." "T..." 276 + =============== =============== =============== 277 + 278 + Intermediate directories are always "@" or "+" as appropriate. 279 + 280 + 281 + Each object in the cache has an extended attribute label that holds the object 282 + type ID (required to distinguish special objects) and the auxiliary data from 283 + the netfs. 
The latter is used to detect stale objects in the cache and update 284 + or retire them. 285 + 286 + 287 + Note that CacheFiles will erase from the cache any file it doesn't recognise or 288 + any file of an incorrect type (such as a FIFO file or a device file). 289 + 290 + 291 + Security Model and SELinux 292 + ========================== 293 + 294 + CacheFiles is implemented to deal properly with the LSM security features of 295 + the Linux kernel and the SELinux facility. 296 + 297 + One of the problems that CacheFiles faces is that it is generally acting on 298 + behalf of a process, and running in that process's context, and that includes a 299 + security context that is not appropriate for accessing the cache - either 300 + because the files in the cache are inaccessible to that process, or because if 301 + the process creates a file in the cache, that file may be inaccessible to other 302 + processes. 303 + 304 + The way CacheFiles works is to temporarily change the security context (fsuid, 305 + fsgid and actor security label) that the process acts as - without changing the 306 + security context of the process when it is the target of an operation performed by 307 + some other process (so signalling and suchlike still work correctly). 308 + 309 + 310 + When the CacheFiles module is asked to bind to its cache, it: 311 + 312 + (1) Finds the security label attached to the root cache directory and uses 313 + that as the security label with which it will create files. By default, 314 + this is:: 315 + 316 + cachefiles_var_t 317 + 318 + (2) Finds the security label of the process which issued the bind request 319 + (presumed to be the cachefilesd daemon), which by default will be:: 320 + 321 + cachefilesd_t 322 + 323 + and asks LSM to supply a security ID as which it should act given the 324 + daemon's label. 
By default, this will be:: 325 + 326 + cachefiles_kernel_t 327 + 328 + SELinux transitions the daemon's security ID to the module's security ID 329 + based on a rule of this form in the policy:: 330 + 331 + type_transition <daemon's-ID> kernel_t : process <module's-ID>; 332 + 333 + For instance:: 334 + 335 + type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; 336 + 337 + 338 + The module's security ID gives it permission to create, move and remove files 339 + and directories in the cache, to find and access directories and files in the 340 + cache, to set and access extended attributes on cache objects, and to read and 341 + write files in the cache. 342 + 343 + The daemon's security ID gives it only a very restricted set of permissions: it 344 + may scan directories, stat files and erase files and directories. It may 345 + not read or write files in the cache, and so it is precluded from accessing the 346 + data cached therein; nor is it permitted to create new files in the cache. 347 + 348 + 349 + There are policy source files available in: 350 + 351 + http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 352 + 353 + and later versions. In that tarball, see the files:: 354 + 355 + cachefilesd.te 356 + cachefilesd.fc 357 + cachefilesd.if 358 + 359 + They are built and installed directly by the RPM. 360 + 361 + If a non-RPM based system is being used, then copy the above files to their own 362 + directory and run:: 363 + 364 + make -f /usr/share/selinux/devel/Makefile 365 + semodule -i cachefilesd.pp 366 + 367 + You will need checkpolicy and selinux-policy-devel installed prior to the 368 + build. 369 + 370 + 371 + By default, the cache is located in /var/fscache, but if it is desirable that 372 + it should be elsewhere, then either the above policy files must be altered, or 373 + an auxiliary policy must be installed to label the alternate location of the 374 + cache. 
375 + 376 + For instructions on how to add an auxiliary policy to enable the cache to be 377 + located elsewhere when SELinux is in enforcing mode, please see:: 378 + 379 + /usr/share/doc/cachefilesd-*/move-cache.txt 380 + 381 + when the cachefilesd RPM is installed; alternatively, the document can be found 382 + in the sources. 383 + 384 + 385 + A Note on Security 386 + ================== 387 + 388 + CacheFiles makes use of the split security in the task_struct. It allocates 389 + its own task_security structure, and redirects current->cred to point to it 390 + when it acts on behalf of another process, in that process's context. 391 + 392 + The reason it does this is that it calls vfs_mkdir() and suchlike rather than 393 + bypassing security and calling inode ops directly. Therefore the VFS and LSM 394 + may deny CacheFiles access to the cache data because under some 395 + circumstances the caching code is running in the security context of whatever 396 + process issued the original syscall on the netfs. 397 + 398 + Furthermore, should CacheFiles create a file or directory, the security 399 + parameters with which that object is created (UID, GID, security label) would be 400 + derived from the process that issued the system call, thus potentially 401 + preventing other processes from accessing the cache - including CacheFiles's 402 + cache management daemon (cachefilesd). 403 + 404 + What is required is to temporarily override the security of the process that 405 + issued the system call. We can't, however, just do an in-place change of the 406 + security data as that affects the process as an object, not just as a subject. 407 + This means it may lose signals or ptrace events for example, and affects what 408 + the process looks like in /proc. 409 + 410 + So CacheFiles makes use of a logical split in the security between the 411 + objective security (task->real_cred) and the subjective security (task->cred). 
412 + The objective security holds the intrinsic security properties of a process and 413 + is never overridden. This is what appears in /proc, and is what is used when a 414 + process is the target of an operation by some other process (SIGKILL for 415 + example). 416 + 417 + The subjective security holds the active security properties of a process, and 418 + may be overridden. This is not seen externally, and is used when a process 419 + acts upon another object, for example SIGKILLing another process or opening a 420 + file. 421 + 422 + LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request 423 + for CacheFiles to run in a context of a specific security label, or to create 424 + files and directories with another security label. 425 + 426 + 427 + Statistical Information 428 + ======================= 429 + 430 + If FS-Cache is compiled with the following option enabled:: 431 + 432 + CONFIG_CACHEFILES_HISTOGRAM=y 433 + 434 + then it will gather certain statistics and display them through a proc file. 435 + 436 + /proc/fs/cachefiles/histogram 437 + 438 + :: 439 + 440 + cat /proc/fs/cachefiles/histogram 441 + JIFS SECS LOOKUPS MKDIRS CREATES 442 + ===== ===== ========= ========= ========= 443 + 444 + This shows, for each duration between 0 jiffies and HZ-1 jiffies, how many 445 + times each of a variety of tasks took that long to run. The 446 + columns are as follows: 447 + 448 + ======= ======================================================= 449 + COLUMN TIME MEASUREMENT 450 + ======= ======================================================= 451 + LOOKUPS Length of time to perform a lookup on the backing fs 452 + MKDIRS Length of time to perform a mkdir on the backing fs 453 + CREATES Length of time to perform a create on the backing fs 454 + ======= ======================================================= 455 + 456 + Each row shows the number of events that took a particular range of times. 457 + Each step is 1 jiffy in size. 
The JIFS column indicates the particular 458 + jiffy range covered, and the SECS field the equivalent number of seconds. 459 + 460 + 461 + Debugging 462 + ========= 463 + 464 + If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime 465 + debugging enabled by adjusting the value in:: 466 + 467 + /sys/module/cachefiles/parameters/debug 468 + 469 + This is a bitmask of debugging streams to enable: 470 + 471 + ======= ======= =============================== ======================= 472 + BIT VALUE STREAM POINT 473 + ======= ======= =============================== ======================= 474 + 0 1 General Function entry trace 475 + 1 2 Function exit trace 476 + 2 4 General 477 + ======= ======= =============================== ======================= 478 + 479 + The appropriate set of values should be OR'd together and the result written to 480 + the control file. For example:: 481 + 482 + echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug 483 + 484 + will turn on all function entry debugging.
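
The ordering rule on the six culling limits in the Cache Culling section (0 <= stop < cull < run < 100, for both the block and the file limits) can be checked mechanically before handing a configuration to the daemon. The following is a minimal sketch in Python, not anything shipped with cachefilesd: `parse_limits()` and `check_culling_limits()` are hypothetical helper names, and the one-command-per-line "<name> <N>%" layout is an assumption based on the command list in the Configuration section. The defaults are the 7% / 5% / 1% values quoted there.

```python
# Sketch only: parse_limits() and check_culling_limits() are hypothetical
# helpers for illustration, not part of cachefilesd or the kernel module.

# Defaults quoted in the Configuration section: 7% (run), 5% (cull), 1% (stop).
DEFAULTS = {"brun": 7, "bcull": 5, "bstop": 1,
            "frun": 7, "fcull": 5, "fstop": 1}


def parse_limits(conf_text):
    """Collect culling limits from config text, starting from the defaults.

    Assumes one command per line in the form "<name> <N>%". Lines for
    other commands (dir, tag, debug) are ignored here.
    """
    limits = dict(DEFAULTS)
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in DEFAULTS:
            limits[fields[0]] = int(fields[1].rstrip("%"))
    return limits


def check_culling_limits(limits):
    """Return True if 0 <= stop < cull < run < 100 holds for both sets."""
    for prefix in ("b", "f"):
        stop = limits[prefix + "stop"]
        cull = limits[prefix + "cull"]
        run = limits[prefix + "run"]
        if not (0 <= stop < cull < run < 100):
            return False
    return True


if __name__ == "__main__":
    # Sample text; only the block limits are overridden, so the file
    # limits keep their defaults.
    conf = "dir /var/fscache\nbrun 10%\nbcull 7%\nbstop 3%\n"
    limits = parse_limits(conf)
    print(limits, check_culling_limits(limits))
```

The check treats the block ('b') and file ('f') limits as two independent triples, which matches the two separate inequalities given in the text. A configuration that raises bcull above brun, for example, would be rejected by this sketch.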

-501

Documentation/filesystems/caching/cachefiles.txt

··· 1 - =============================================== 2 - CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM 3 - =============================================== 4 - 5 - Contents: 6 - 7 - (*) Overview. 8 - 9 - (*) Requirements. 10 - 11 - (*) Configuration. 12 - 13 - (*) Starting the cache. 14 - 15 - (*) Things to avoid. 16 - 17 - (*) Cache culling. 18 - 19 - (*) Cache structure. 20 - 21 - (*) Security model and SELinux. 22 - 23 - (*) A note on security. 24 - 25 - (*) Statistical information. 26 - 27 - (*) Debugging. 28 - 29 - 30 - ======== 31 - OVERVIEW 32 - ======== 33 - 34 - CacheFiles is a caching backend that's meant to use as a cache a directory on 35 - an already mounted filesystem of a local type (such as Ext3). 36 - 37 - CacheFiles uses a userspace daemon to do some of the cache management - such as 38 - reaping stale nodes and culling. This is called cachefilesd and lives in 39 - /sbin. 40 - 41 - The filesystem and data integrity of the cache are only as good as those of the 42 - filesystem providing the backing services. Note that CacheFiles does not 43 - attempt to journal anything since the journalling interfaces of the various 44 - filesystems are very specific in nature. 45 - 46 - CacheFiles creates a misc character device - "/dev/cachefiles" - that is used 47 - to communication with the daemon. Only one thing may have this open at once, 48 - and while it is open, a cache is at least partially in existence. The daemon 49 - opens this and sends commands down it to control the cache. 50 - 51 - CacheFiles is currently limited to a single cache. 52 - 53 - CacheFiles attempts to maintain at least a certain percentage of free space on 54 - the filesystem, shrinking the cache by culling the objects it contains to make 55 - space if necessary - see the "Cache Culling" section. 
This means it can be 56 - placed on the same medium as a live set of data, and will expand to make use of 57 - spare space and automatically contract when the set of data requires more 58 - space. 59 - 60 - 61 - ============ 62 - REQUIREMENTS 63 - ============ 64 - 65 - The use of CacheFiles and its daemon requires the following features to be 66 - available in the system and in the cache filesystem: 67 - 68 - - dnotify. 69 - 70 - - extended attributes (xattrs). 71 - 72 - - openat() and friends. 73 - 74 - - bmap() support on files in the filesystem (FIBMAP ioctl). 75 - 76 - - The use of bmap() to detect a partial page at the end of the file. 77 - 78 - It is strongly recommended that the "dir_index" option is enabled on Ext3 79 - filesystems being used as a cache. 80 - 81 - 82 - ============= 83 - CONFIGURATION 84 - ============= 85 - 86 - The cache is configured by a script in /etc/cachefilesd.conf. These commands 87 - set up cache ready for use. The following script commands are available: 88 - 89 - (*) brun <N>% 90 - (*) bcull <N>% 91 - (*) bstop <N>% 92 - (*) frun <N>% 93 - (*) fcull <N>% 94 - (*) fstop <N>% 95 - 96 - Configure the culling limits. Optional. See the section on culling 97 - The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. 98 - 99 - The commands beginning with a 'b' are file space (block) limits, those 100 - beginning with an 'f' are file count limits. 101 - 102 - (*) dir <path> 103 - 104 - Specify the directory containing the root of the cache. Mandatory. 105 - 106 - (*) tag <name> 107 - 108 - Specify a tag to FS-Cache to use in distinguishing multiple caches. 109 - Optional. The default is "CacheFiles". 110 - 111 - (*) debug <mask> 112 - 113 - Specify a numeric bitmask to control debugging in the kernel module. 114 - Optional. The default is zero (all off). 
The following values can be 115 - OR'd into the mask to collect various information: 116 - 117 - 1 Turn on trace of function entry (_enter() macros) 118 - 2 Turn on trace of function exit (_leave() macros) 119 - 4 Turn on trace of internal debug points (_debug()) 120 - 121 - This mask can also be set through sysfs, eg: 122 - 123 - echo 5 >/sys/module/cachefiles/parameters/debug 124 - 125 - 126 - ================== 127 - STARTING THE CACHE 128 - ================== 129 - 130 - The cache is started by running the daemon. The daemon opens the cache device, 131 - configures the cache and tells it to begin caching. At that point the cache 132 - binds to fscache and the cache becomes live. 133 - 134 - The daemon is run as follows: 135 - 136 - /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] 137 - 138 - The flags are: 139 - 140 - (*) -d 141 - 142 - Increase the debugging level. This can be specified multiple times and 143 - is cumulative with itself. 144 - 145 - (*) -s 146 - 147 - Send messages to stderr instead of syslog. 148 - 149 - (*) -n 150 - 151 - Don't daemonise and go into background. 152 - 153 - (*) -f <configfile> 154 - 155 - Use an alternative configuration file rather than the default one. 156 - 157 - 158 - =============== 159 - THINGS TO AVOID 160 - =============== 161 - 162 - Do not mount other things within the cache as this will cause problems. The 163 - kernel module contains its own very cut-down path walking facility that ignores 164 - mountpoints, but the daemon can't avoid them. 165 - 166 - Do not create, rename or unlink files and directories in the cache while the 167 - cache is active, as this may cause the state to become uncertain. 168 - 169 - Renaming files in the cache might make objects appear to be other objects (the 170 - filename is part of the lookup key). 171 - 172 - Do not change or remove the extended attributes attached to cache files by the 173 - cache as this will cause the cache state management to get confused. 
174 - 175 - Do not create files or directories in the cache, lest the cache get confused or 176 - serve incorrect data. 177 - 178 - Do not chmod files in the cache. The module creates things with minimal 179 - permissions to prevent random users being able to access them directly. 180 - 181 - 182 - ============= 183 - CACHE CULLING 184 - ============= 185 - 186 - The cache may need culling occasionally to make space. This involves 187 - discarding objects from the cache that have been used less recently than 188 - anything else. Culling is based on the access time of data objects. Empty 189 - directories are culled if not in use. 190 - 191 - Cache culling is done on the basis of the percentage of blocks and the 192 - percentage of files available in the underlying filesystem. There are six 193 - "limits": 194 - 195 - (*) brun 196 - (*) frun 197 - 198 - If the amount of free space and the number of available files in the cache 199 - rises above both these limits, then culling is turned off. 200 - 201 - (*) bcull 202 - (*) fcull 203 - 204 - If the amount of available space or the number of available files in the 205 - cache falls below either of these limits, then culling is started. 206 - 207 - (*) bstop 208 - (*) fstop 209 - 210 - If the amount of available space or the number of available files in the 211 - cache falls below either of these limits, then no further allocation of 212 - disk space or files is permitted until culling has raised things above 213 - these limits again. 214 - 215 - These must be configured thusly: 216 - 217 - 0 <= bstop < bcull < brun < 100 218 - 0 <= fstop < fcull < frun < 100 219 - 220 - Note that these are percentages of available space and available files, and do 221 - _not_ appear as 100 minus the percentage displayed by the "df" program. 222 - 223 - The userspace daemon scans the cache to build up a table of cullable objects. 224 - These are then culled in least recently used order. 
A new scan of the cache is 225 - started as soon as space is made in the table. Objects will be skipped if 226 - their atimes have changed or if the kernel module says it is still using them. 227 - 228 - 229 - =============== 230 - CACHE STRUCTURE 231 - =============== 232 - 233 - The CacheFiles module will create two directories in the directory it was 234 - given: 235 - 236 - (*) cache/ 237 - 238 - (*) graveyard/ 239 - 240 - The active cache objects all reside in the first directory. The CacheFiles 241 - kernel module moves any retired or culled objects that it can't simply unlink 242 - to the graveyard from which the daemon will actually delete them. 243 - 244 - The daemon uses dnotify to monitor the graveyard directory, and will delete 245 - anything that appears therein. 246 - 247 - 248 - The module represents index objects as directories with the filename "I..." or 249 - "J...". Note that the "cache/" directory is itself a special index. 250 - 251 - Data objects are represented as files if they have no children, or directories 252 - if they do. Their filenames all begin "D..." or "E...". If represented as a 253 - directory, data objects will have a file in the directory called "data" that 254 - actually holds the data. 255 - 256 - Special objects are similar to data objects, except their filenames begin 257 - "S..." or "T...". 258 - 259 - 260 - If an object has children, then it will be represented as a directory. 261 - Immediately in the representative directory are a collection of directories 262 - named for hash values of the child object keys with an '@' prepended. 
Into 263 - this directory, if possible, will be placed the representations of the child 264 - objects: 265 - 266 - INDEX INDEX INDEX DATA FILES 267 - ========= ========== ================================= ================ 268 - cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 269 - cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry 270 - cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry 271 - cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry 272 - 273 - 274 - If the key is so long that it exceeds NAME_MAX with the decorations added on to 275 - it, then it will be cut into pieces, the first few of which will be used to 276 - make a nest of directories, and the last one of which will be the objects 277 - inside the last directory. The names of the intermediate directories will have 278 - '+' prepended: 279 - 280 - J1223/@23/+xy...z/+kl...m/Epqr 281 - 282 - 283 - Note that keys are raw data, and not only may they exceed NAME_MAX in size, 284 - they may also contain things like '/' and NUL characters, and so they may not 285 - be suitable for turning directly into a filename. 286 - 287 - To handle this, CacheFiles will use a suitably printable filename directly and 288 - "base-64" encode ones that aren't directly suitable. The two versions of 289 - object filenames indicate the encoding: 290 - 291 - OBJECT TYPE PRINTABLE ENCODED 292 - =============== =============== =============== 293 - Index "I..." "J..." 294 - Data "D..." "E..." 295 - Special "S..." "T..." 296 - 297 - Intermediate directories are always "@" or "+" as appropriate. 298 - 299 - 300 - Each object in the cache has an extended attribute label that holds the object 301 - type ID (required to distinguish special objects) and the auxiliary data from 302 - the netfs. The latter is used to detect stale objects in the cache and update 303 - or retire them. 
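The filename prefixes described above ("I"/"J", "D"/"E", "S"/"T", plus "@" and "+" for intermediate directories) can be illustrated with a small classifier. This is only a sketch of the naming scheme as the text states it, not code from CacheFiles or cachefilesd; the function names and the "hash" / "key-fragment" labels are made up for illustration.

```python
# Prefix table copied from the naming scheme described in the text.
# The labels "hash" and "key-fragment" are illustrative, not kernel terms.
PREFIXES = {
    "I": ("index", "printable"),
    "J": ("index", "encoded"),
    "D": ("data", "printable"),
    "E": ("data", "encoded"),
    "S": ("special", "printable"),
    "T": ("special", "encoded"),
    "@": ("intermediate", "hash"),          # hash of the child object key
    "+": ("intermediate", "key-fragment"),  # piece of an over-long key
}


def classify(name):
    """Return (object type, name form) for one path component, or None."""
    if not name:
        return None
    return PREFIXES.get(name[0])


def classify_path(path):
    """Classify every non-empty component of a cache-relative path."""
    return [(part, classify(part)) for part in path.split("/") if part]


if __name__ == "__main__":
    # Path shape taken from the layout example in the text.
    for part, kind in classify_path("@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400"):
        print(part, kind)
```

A component that matches none of the prefixes (for instance the top-level "cache" directory name itself) is reported as None rather than guessed at.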
304 - 305 - 306 - Note that CacheFiles will erase from the cache any file it doesn't recognise or 307 - any file of an incorrect type (such as a FIFO file or a device file). 308 - 309 - 310 - ========================== 311 - SECURITY MODEL AND SELINUX 312 - ========================== 313 - 314 - CacheFiles is implemented to deal properly with the LSM security features of 315 - the Linux kernel and the SELinux facility. 316 - 317 - One of the problems that CacheFiles faces is that it is generally acting on 318 - behalf of a process, and running in that process's context, and that includes a 319 - security context that is not appropriate for accessing the cache - either 320 - because the files in the cache are inaccessible to that process, or because if 321 - the process creates a file in the cache, that file may be inaccessible to other 322 - processes. 323 - 324 - The way CacheFiles works is to temporarily change the security context (fsuid, 325 - fsgid and actor security label) that the process acts as - without changing the 326 - security context of the process when it is the target of an operation performed by 327 - some other process (so signalling and suchlike still work correctly). 328 - 329 - 330 - When the CacheFiles module is asked to bind to its cache, it: 331 - 332 - (1) Finds the security label attached to the root cache directory and uses 333 - that as the security label with which it will create files. By default, 334 - this is: 335 - 336 - cachefiles_var_t 337 - 338 - (2) Finds the security label of the process which issued the bind request 339 - (presumed to be the cachefilesd daemon), which by default will be: 340 - 341 - cachefilesd_t 342 - 343 - and asks LSM to supply a security ID as which it should act given the 344 - daemon's label. By default, this will be: 345 - 346 - cachefiles_kernel_t 347 - 348 - SELinux transitions the daemon's security ID to the module's security ID 349 - based on a rule of this form in the policy. 
350 - 351 - type_transition <daemon's-ID> kernel_t : process <module's-ID>; 352 - 353 - For instance: 354 - 355 - type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; 356 - 357 - 358 - The module's security ID gives it permission to create, move and remove files 359 - and directories in the cache, to find and access directories and files in the 360 - cache, to set and access extended attributes on cache objects, and to read and 361 - write files in the cache. 362 - 363 - The daemon's security ID gives it only a very restricted set of permissions: it 364 - may scan directories, stat files and erase files and directories. It may 365 - not read or write files in the cache, and so it is precluded from accessing the 366 - data cached therein; nor is it permitted to create new files in the cache. 367 - 368 - 369 - There are policy source files available in: 370 - 371 - http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 372 - 373 - and later versions. In that tarball, see the files: 374 - 375 - cachefilesd.te 376 - cachefilesd.fc 377 - cachefilesd.if 378 - 379 - They are built and installed directly by the RPM. 380 - 381 - If a non-RPM based system is being used, then copy the above files to their own 382 - directory and run: 383 - 384 - make -f /usr/share/selinux/devel/Makefile 385 - semodule -i cachefilesd.pp 386 - 387 - You will need checkpolicy and selinux-policy-devel installed prior to the 388 - build. 389 - 390 - 391 - By default, the cache is located in /var/fscache, but if it is desirable that 392 - it should be elsewhere, then either the above policy files must be altered, or 393 - an auxiliary policy must be installed to label the alternate location of the 394 - cache. 
395 - 396 - For instructions on how to add an auxiliary policy to enable the cache to be 397 - located elsewhere when SELinux is in enforcing mode, please see: 398 - 399 - /usr/share/doc/cachefilesd-*/move-cache.txt 400 - 401 - When the cachefilesd rpm is installed; alternatively, the document can be found 402 - in the sources. 403 - 404 - 405 - ================== 406 - A NOTE ON SECURITY 407 - ================== 408 - 409 - CacheFiles makes use of the split security in the task_struct. It allocates 410 - its own task_security structure, and redirects current->cred to point to it 411 - when it acts on behalf of another process, in that process's context. 412 - 413 - The reason it does this is that it calls vfs_mkdir() and suchlike rather than 414 - bypassing security and calling inode ops directly. Therefore the VFS and LSM 415 - may deny the CacheFiles access to the cache data because under some 416 - circumstances the caching code is running in the security context of whatever 417 - process issued the original syscall on the netfs. 418 - 419 - Furthermore, should CacheFiles create a file or directory, the security 420 - parameters with which that object is created (UID, GID, security label) would be 421 - derived from that process that issued the system call, thus potentially 422 - preventing other processes from accessing the cache - including CacheFiles's 423 - cache management daemon (cachefilesd). 424 - 425 - What is required is to temporarily override the security of the process that 426 - issued the system call. We can't, however, just do an in-place change of the 427 - security data as that affects the process as an object, not just as a subject. 428 - This means it may lose signals or ptrace events for example, and affects what 429 - the process looks like in /proc. 430 - 431 - So CacheFiles makes use of a logical split in the security between the 432 - objective security (task->real_cred) and the subjective security (task->cred). 
433 - The objective security holds the intrinsic security properties of a process and 434 - is never overridden. This is what appears in /proc, and is what is used when a 435 - process is the target of an operation by some other process (SIGKILL for 436 - example). 437 - 438 - The subjective security holds the active security properties of a process, and 439 - may be overridden. This is not seen externally, and is used when a process 440 - acts upon another object, for example SIGKILLing another process or opening a 441 - file. 442 - 443 - LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request 444 - for CacheFiles to run in a context of a specific security label, or to create 445 - files and directories with another security label. 446 - 447 - 448 - ======================= 449 - STATISTICAL INFORMATION 450 - ======================= 451 - 452 - If FS-Cache is compiled with the following option enabled: 453 - 454 - CONFIG_CACHEFILES_HISTOGRAM=y 455 - 456 - then it will gather certain statistics and display them through a proc file. 457 - 458 - (*) /proc/fs/cachefiles/histogram 459 - 460 - cat /proc/fs/cachefiles/histogram 461 - JIFS SECS LOOKUPS MKDIRS CREATES 462 - ===== ===== ========= ========= ========= 463 - 464 - This shows the breakdown of the number of times each amount of time 465 - between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The 466 - columns are as follows: 467 - 468 - COLUMN TIME MEASUREMENT 469 - ======= ======================================================= 470 - LOOKUPS Length of time to perform a lookup on the backing fs 471 - MKDIRS Length of time to perform a mkdir on the backing fs 472 - CREATES Length of time to perform a create on the backing fs 473 - 474 - Each row shows the number of events that took a particular range of times. 475 - Each step is 1 jiffy in size. The JIFS column indicates the particular 476 - jiffy range covered, and the SECS field the equivalent number of seconds. 
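The relationship between the JIFS and SECS columns of the histogram can be sketched as follows. This is an illustration only: the value of HZ depends on the kernel configuration, so the 250 used here is an assumption, and the function names are hypothetical rather than anything exported by the kernel.

```python
def jiffies_to_seconds(jiffies, hz):
    """Convert a jiffy count to seconds for a kernel tick rate of `hz`."""
    return jiffies / hz


def histogram_buckets(hz):
    """Return (JIFS, SECS) pairs for the buckets 0 .. HZ-1, one jiffy per step."""
    return [(j, jiffies_to_seconds(j, hz)) for j in range(hz)]


if __name__ == "__main__":
    HZ = 250  # assumed tick rate; check the running kernel's configuration
    for jifs, secs in histogram_buckets(HZ)[:4]:
        print(f"{jifs:5d} {secs:.3f}")
```

With one bucket per jiffy, the table has exactly HZ rows, and the last row covers events that took just under one second.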
477 - 478 - 479 - ========= 480 - DEBUGGING 481 - ========= 482 - 483 - If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime 484 - debugging enabled by adjusting the value in: 485 - 486 - /sys/module/cachefiles/parameters/debug 487 - 488 - This is a bitmask of debugging streams to enable: 489 - 490 - BIT VALUE STREAM POINT 491 - ======= ======= =============================== ======================= 492 - 0 1 General Function entry trace 493 - 1 2 Function exit trace 494 - 2 4 General 495 - 496 - The appropriate set of values should be OR'd together and the result written to 497 - the control file. For example: 498 - 499 - echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug 500 - 501 - will turn on all function entry debugging.
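The debug bitmask can also be composed programmatically. The sketch below uses only the three values listed in the table above (1, 2 and 4); the constant and function names are hypothetical, and writing to the control file is assumed to need root privileges and a kernel built with CONFIG_CACHEFILES_DEBUG.

```python
# Values taken from the debugging table above; names are illustrative only.
TRACE_ENTRY = 1  # function entry trace
TRACE_EXIT = 2   # function exit trace
TRACE_DEBUG = 4  # general / internal debug points

DEBUG_PARAM = "/sys/module/cachefiles/parameters/debug"


def debug_mask(*streams):
    """OR together the selected debugging stream values."""
    mask = 0
    for value in streams:
        mask |= value
    return mask


def write_debug_mask(mask, path=DEBUG_PARAM):
    """Write the mask to the control file (assumed to need root)."""
    with open(path, "w") as control:
        control.write(f"{mask}\n")


if __name__ == "__main__":
    # Entry trace plus debug points gives 5, as in "echo 5" earlier in the text.
    print(debug_mask(TRACE_ENTRY, TRACE_DEBUG))
```

Keeping mask construction separate from the write means the mask can be checked or logged before it is applied to the running module.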

+565

Documentation/filesystems/caching/fscache.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========================== 4 + General Filesystem Caching 5 + ========================== 6 + 7 + Overview 8 + ======== 9 + 10 + This facility is a general purpose cache for network filesystems, though it 11 + could be used for caching other things such as ISO9660 filesystems too. 12 + 13 + FS-Cache mediates between cache backends (such as CacheFS) and network 14 + filesystems:: 15 + 16 + +---------+ 17 + | | +--------------+ 18 + | NFS |--+ | | 19 + | | | +-->| CacheFS | 20 + +---------+ | +----------+ | | /dev/hda5 | 21 + | | | | +--------------+ 22 + +---------+ +-->| | | 23 + | | | |--+ 24 + | AFS |----->| FS-Cache | 25 + | | | |--+ 26 + +---------+ +-->| | | 27 + | | | | +--------------+ 28 + +---------+ | +----------+ | | | 29 + | | | +-->| CacheFiles | 30 + | ISOFS |--+ | /var/cache | 31 + | | +--------------+ 32 + +---------+ 33 + 34 + Or to look at it another way, FS-Cache is a module that provides a caching 35 + facility to a network filesystem such that the cache is transparent to the 36 + user:: 37 + 38 + +---------+ 39 + | | 40 + | Server | 41 + | | 42 + +---------+ 43 + | NETWORK 44 + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 45 + | 46 + | +----------+ 47 + V | | 48 + +---------+ | | 49 + | | | | 50 + | NFS |----->| FS-Cache | 51 + | | | |--+ 52 + +---------+ | | | +--------------+ +--------------+ 53 + | | | | | | | | 54 + V +----------+ +-->| CacheFiles |-->| Ext3 | 55 + +---------+ | /var/cache | | /dev/sda6 | 56 + | | +--------------+ +--------------+ 57 + | VFS | ^ ^ 58 + | | | | 59 + +---------+ +--------------+ | 60 + | KERNEL SPACE | | 61 + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ 62 + | USER SPACE | | 63 + V | | 64 + +---------+ +--------------+ 65 + | | | | 66 + | Process | | cachefilesd | 67 + | | | | 68 + +---------+ +--------------+ 69 + 70 + 71 + FS-Cache does not follow the idea of completely loading every netfs file 72 + opened 
in its entirety into a cache before permitting it to be accessed and 73 + then serving the pages out of that cache rather than the netfs inode because: 74 + 75 + (1) It must be practical to operate without a cache. 76 + 77 + (2) The size of any accessible file must not be limited to the size of the 78 + cache. 79 + 80 + (3) The combined size of all opened files (this includes mapped libraries) 81 + must not be limited to the size of the cache. 82 + 83 + (4) The user should not be forced to download an entire file just to do a 84 + one-off access of a small portion of it (such as might be done with the 85 + "file" program). 86 + 87 + It instead serves the cache out in PAGE_SIZE chunks as and when requested by 88 + the netfs('s) using it. 89 + 90 + 91 + FS-Cache provides the following facilities: 92 + 93 + (1) More than one cache can be used at once. Caches can be selected 94 + explicitly by use of tags. 95 + 96 + (2) Caches can be added / removed at any time. 97 + 98 + (3) The netfs is provided with an interface that allows either party to 99 + withdraw caching facilities from a file (required for (2)). 100 + 101 + (4) The interface to the netfs returns as few errors as possible, preferring 102 + rather to let the netfs remain oblivious. 103 + 104 + (5) Cookies are used to represent indices, files and other objects to the 105 + netfs. The simplest cookie is just a NULL pointer - indicating nothing 106 + cached there. 107 + 108 + (6) The netfs is allowed to propose - dynamically - any index hierarchy it 109 + desires, though it must be aware that the index search function is 110 + recursive, stack space is limited, and indices can only be children of 111 + indices. 112 + 113 + (7) Data I/O is done direct to and from the netfs's pages. The netfs 114 + indicates that page A is at index B of the data-file represented by cookie 115 + C, and that it should be read or written. 
The cache backend may or may 116 + not start I/O on that page, but if it does, a netfs callback will be 117 + invoked to indicate completion. The I/O may be either synchronous or 118 + asynchronous. 119 + 120 + (8) Cookies can be "retired" upon release. At this point FS-Cache will mark 121 + them as obsolete and the index hierarchy rooted at that point will get 122 + recycled. 123 + 124 + (9) The netfs provides a "match" function for index searches. In addition to 125 + saying whether a match was made or not, this can also specify that an 126 + entry should be updated or deleted. 127 + 128 + (10) As much as possible is done asynchronously. 129 + 130 + 131 + FS-Cache maintains a virtual indexing tree in which all indices, files, objects 132 + and pages are kept. Bits of this tree may actually reside in one or more 133 + caches:: 134 + 135 + FSDEF 136 + | 137 + +------------------------------------+ 138 + | | 139 + NFS AFS 140 + | | 141 + +--------------------------+ +-----------+ 142 + | | | | 143 + homedir mirror afs.org redhat.com 144 + | | | 145 + +------------+ +---------------+ +----------+ 146 + | | | | | | 147 + 00001 00002 00007 00125 vol00001 vol00002 148 + | | | | | 149 + +---+---+ +-----+ +---+ +------+------+ +-----+----+ 150 + | | | | | | | | | | | | | 151 + PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak 152 + | | 153 + PG0 +-------+ 154 + | | 155 + 00001 00003 156 + | 157 + +---+---+ 158 + | | | 159 + PG0 PG1 PG2 160 + 161 + In the example above, you can see two netfs's being backed: NFS and AFS. These 162 + have different index hierarchies: 163 + 164 + * The NFS primary index contains per-server indices. Each server index is 165 + indexed by NFS file handles to get data file objects. Each data file 166 + objects can have an array of pages, but may also have further child 167 + objects, such as extended attributes and directory entries. Extended 168 + attribute objects themselves have page-array contents. 
169 + 170 + * The AFS primary index contains per-cell indices. Each cell index contains 171 + per-logical-volume indices. Each volume index contains up to three 172 + indices for the read-write, read-only and backup mirrors of those volumes. 173 + Each of these contains vnode data file objects, each of which contains an 174 + array of pages. 175 + 176 + The very top index is the FS-Cache master index in which individual netfs's 177 + have entries. 178 + 179 + Any index object may reside in more than one cache, provided it only has index 180 + children. Any index with non-index object children will be assumed to only 181 + reside in one cache. 182 + 183 + 184 + The netfs API to FS-Cache can be found in: 185 + 186 + Documentation/filesystems/caching/netfs-api.rst 187 + 188 + The cache backend API to FS-Cache can be found in: 189 + 190 + Documentation/filesystems/caching/backend-api.rst 191 + 192 + A description of the internal representations and object state machine can be 193 + found in: 194 + 195 + Documentation/filesystems/caching/object.rst 196 + 197 + 198 + Statistical Information 199 + ======================= 200 + 201 + If FS-Cache is compiled with the following options enabled:: 202 + 203 + CONFIG_FSCACHE_STATS=y 204 + CONFIG_FSCACHE_HISTOGRAM=y 205 + 206 + then it will gather certain statistics and display them through a number of 207 + proc files. 
208 + 209 + /proc/fs/fscache/stats 210 + ---------------------- 211 + 212 + This shows counts of a number of events that can happen in FS-Cache: 213 + 214 + +--------------+-------+-------------------------------------------------------+ 215 + |CLASS |EVENT |MEANING | 216 + +==============+=======+=======================================================+ 217 + |Cookies |idx=N |Number of index cookies allocated | 218 + + +-------+-------------------------------------------------------+ 219 + | |dat=N |Number of data storage cookies allocated | 220 + + +-------+-------------------------------------------------------+ 221 + | |spc=N |Number of special cookies allocated | 222 + +--------------+-------+-------------------------------------------------------+ 223 + |Objects |alc=N |Number of objects allocated | 224 + + +-------+-------------------------------------------------------+ 225 + | |nal=N |Number of object allocation failures | 226 + + +-------+-------------------------------------------------------+ 227 + | |avl=N |Number of objects that reached the available state | 228 + + +-------+-------------------------------------------------------+ 229 + | |ded=N |Number of objects that reached the dead state | 230 + +--------------+-------+-------------------------------------------------------+ 231 + |ChkAux |non=N |Number of objects that didn't have a coherency check | 232 + + +-------+-------------------------------------------------------+ 233 + | |ok=N |Number of objects that passed a coherency check | 234 + + +-------+-------------------------------------------------------+ 235 + | |upd=N |Number of objects that needed a coherency data update | 236 + + +-------+-------------------------------------------------------+ 237 + | |obs=N |Number of objects that were declared obsolete | 238 + +--------------+-------+-------------------------------------------------------+ 239 + |Pages |mrk=N |Number of pages marked as being cached | 240 + | |unc=N |Number of uncache 
page requests seen | 241 + +--------------+-------+-------------------------------------------------------+ 242 + |Acquire |n=N |Number of acquire cookie requests seen | 243 + + +-------+-------------------------------------------------------+ 244 + | |nul=N |Number of acq reqs given a NULL parent | 245 + + +-------+-------------------------------------------------------+ 246 + | |noc=N |Number of acq reqs rejected due to no cache available | 247 + + +-------+-------------------------------------------------------+ 248 + | |ok=N |Number of acq reqs succeeded | 249 + + +-------+-------------------------------------------------------+ 250 + | |nbf=N |Number of acq reqs rejected due to error | 251 + + +-------+-------------------------------------------------------+ 252 + | |oom=N |Number of acq reqs failed on ENOMEM | 253 + +--------------+-------+-------------------------------------------------------+ 254 + |Lookups |n=N |Number of lookup calls made on cache backends | 255 + + +-------+-------------------------------------------------------+ 256 + | |neg=N |Number of negative lookups made | 257 + + +-------+-------------------------------------------------------+ 258 + | |pos=N |Number of positive lookups made | 259 + + +-------+-------------------------------------------------------+ 260 + | |crt=N |Number of objects created by lookup | 261 + + +-------+-------------------------------------------------------+ 262 + | |tmo=N |Number of lookups timed out and requeued | 263 + +--------------+-------+-------------------------------------------------------+ 264 + |Updates |n=N |Number of update cookie requests seen | 265 + + +-------+-------------------------------------------------------+ 266 + | |nul=N |Number of upd reqs given a NULL parent | 267 + + +-------+-------------------------------------------------------+ 268 + | |run=N |Number of upd reqs granted CPU time | 269 + +--------------+-------+-------------------------------------------------------+ 270 + 
|Relinqs |n=N |Number of relinquish cookie requests seen | 271 + + +-------+-------------------------------------------------------+ 272 + | |nul=N |Number of rlq reqs given a NULL parent | 273 + + +-------+-------------------------------------------------------+ 274 + | |wcr=N |Number of rlq reqs waited on completion of creation | 275 + +--------------+-------+-------------------------------------------------------+ 276 + |AttrChg |n=N |Number of attribute changed requests seen | 277 + + +-------+-------------------------------------------------------+ 278 + | |ok=N |Number of attr changed requests queued | 279 + + +-------+-------------------------------------------------------+ 280 + | |nbf=N |Number of attr changed rejected -ENOBUFS | 281 + + +-------+-------------------------------------------------------+ 282 + | |oom=N |Number of attr changed failed -ENOMEM | 283 + + +-------+-------------------------------------------------------+ 284 + | |run=N |Number of attr changed ops given CPU time | 285 + +--------------+-------+-------------------------------------------------------+ 286 + |Allocs |n=N |Number of allocation requests seen | 287 + + +-------+-------------------------------------------------------+ 288 + | |ok=N |Number of successful alloc reqs | 289 + + +-------+-------------------------------------------------------+ 290 + | |wt=N |Number of alloc reqs that waited on lookup completion | 291 + + +-------+-------------------------------------------------------+ 292 + | |nbf=N |Number of alloc reqs rejected -ENOBUFS | 293 + + +-------+-------------------------------------------------------+ 294 + | |int=N |Number of alloc reqs aborted -ERESTARTSYS | 295 + + +-------+-------------------------------------------------------+ 296 + | |ops=N |Number of alloc reqs submitted | 297 + + +-------+-------------------------------------------------------+ 298 + | |owt=N |Number of alloc reqs waited for CPU time | 299 + + 
+-------+-------------------------------------------------------+ 300 + | |abt=N |Number of alloc reqs aborted due to object death | 301 + +--------------+-------+-------------------------------------------------------+ 302 + |Retrvls |n=N |Number of retrieval (read) requests seen | 303 + + +-------+-------------------------------------------------------+ 304 + | |ok=N |Number of successful retr reqs | 305 + + +-------+-------------------------------------------------------+ 306 + | |wt=N |Number of retr reqs that waited on lookup completion | 307 + + +-------+-------------------------------------------------------+ 308 + | |nod=N |Number of retr reqs returned -ENODATA | 309 + + +-------+-------------------------------------------------------+ 310 + | |nbf=N |Number of retr reqs rejected -ENOBUFS | 311 + + +-------+-------------------------------------------------------+ 312 + | |int=N |Number of retr reqs aborted -ERESTARTSYS | 313 + + +-------+-------------------------------------------------------+ 314 + | |oom=N |Number of retr reqs failed -ENOMEM | 315 + + +-------+-------------------------------------------------------+ 316 + | |ops=N |Number of retr reqs submitted | 317 + + +-------+-------------------------------------------------------+ 318 + | |owt=N |Number of retr reqs waited for CPU time | 319 + + +-------+-------------------------------------------------------+ 320 + | |abt=N |Number of retr reqs aborted due to object death | 321 + +--------------+-------+-------------------------------------------------------+ 322 + |Stores |n=N |Number of storage (write) requests seen | 323 + + +-------+-------------------------------------------------------+ 324 + | |ok=N |Number of successful store reqs | 325 + + +-------+-------------------------------------------------------+ 326 + | |agn=N |Number of store reqs on a page already pending storage | 327 + + +-------+-------------------------------------------------------+ 328 + | |nbf=N |Number of store reqs 
rejected -ENOBUFS | 329 + + +-------+-------------------------------------------------------+ 330 + | |oom=N |Number of store reqs failed -ENOMEM | 331 + + +-------+-------------------------------------------------------+ 332 + | |ops=N |Number of store reqs submitted | 333 + + +-------+-------------------------------------------------------+ 334 + | |run=N |Number of store reqs granted CPU time | 335 + + +-------+-------------------------------------------------------+ 336 + | |pgs=N |Number of pages given store req processing time | 337 + + +-------+-------------------------------------------------------+ 338 + | |rxd=N |Number of store reqs deleted from tracking tree | 339 + + +-------+-------------------------------------------------------+ 340 + | |olm=N |Number of store reqs over store limit | 341 + +--------------+-------+-------------------------------------------------------+ 342 + |VmScan |nos=N |Number of release reqs against pages with no | 343 + | | |pending store | 344 + + +-------+-------------------------------------------------------+ 345 + | |gon=N |Number of release reqs against pages stored by | 346 + | | |time lock granted | 347 + + +-------+-------------------------------------------------------+ 348 + | |bsy=N |Number of release reqs ignored due to in-progress store| 349 + + +-------+-------------------------------------------------------+ 350 + | |can=N |Number of page stores cancelled due to release req | 351 + +--------------+-------+-------------------------------------------------------+ 352 + |Ops |pend=N |Number of times async ops added to pending queues | 353 + + +-------+-------------------------------------------------------+ 354 + | |run=N |Number of times async ops given CPU time | 355 + + +-------+-------------------------------------------------------+ 356 + | |enq=N |Number of times async ops queued for processing | 357 + + +-------+-------------------------------------------------------+ 358 + | |can=N |Number of async ops 
cancelled | 359 + + +-------+-------------------------------------------------------+ 360 + | |rej=N |Number of async ops rejected due to object | 361 + | | |lookup/create failure | 362 + + +-------+-------------------------------------------------------+ 363 + | |ini=N |Number of async ops initialised | 364 + + +-------+-------------------------------------------------------+ 365 + | |dfr=N |Number of async ops queued for deferred release | 366 + + +-------+-------------------------------------------------------+ 367 + | |rel=N |Number of async ops released | 368 + | | |(should equal ini=N when idle) | 369 + + +-------+-------------------------------------------------------+ 370 + | |gc=N |Number of deferred-release async ops garbage collected | 371 + +--------------+-------+-------------------------------------------------------+ 372 + |CacheOp |alo=N |Number of in-progress alloc_object() cache ops | 373 + + +-------+-------------------------------------------------------+ 374 + | |luo=N |Number of in-progress lookup_object() cache ops | 375 + + +-------+-------------------------------------------------------+ 376 + | |luc=N |Number of in-progress lookup_complete() cache ops | 377 + + +-------+-------------------------------------------------------+ 378 + | |gro=N |Number of in-progress grab_object() cache ops | 379 + + +-------+-------------------------------------------------------+ 380 + | |upo=N |Number of in-progress update_object() cache ops | 381 + + +-------+-------------------------------------------------------+ 382 + | |dro=N |Number of in-progress drop_object() cache ops | 383 + + +-------+-------------------------------------------------------+ 384 + | |pto=N |Number of in-progress put_object() cache ops | 385 + + +-------+-------------------------------------------------------+ 386 + | |syn=N |Number of in-progress sync_cache() cache ops | 387 + + +-------+-------------------------------------------------------+ 388 + | |atc=N |Number of in-progress 
attr_changed() cache ops | 389 + + +-------+-------------------------------------------------------+ 390 + | |rap=N |Number of in-progress read_or_alloc_page() cache ops | 391 + + +-------+-------------------------------------------------------+ 392 + | |ras=N |Number of in-progress read_or_alloc_pages() cache ops | 393 + + +-------+-------------------------------------------------------+ 394 + | |alp=N |Number of in-progress allocate_page() cache ops | 395 + + +-------+-------------------------------------------------------+ 396 + | |als=N |Number of in-progress allocate_pages() cache ops | 397 + + +-------+-------------------------------------------------------+ 398 + | |wrp=N |Number of in-progress write_page() cache ops | 399 + + +-------+-------------------------------------------------------+ 400 + | |ucp=N |Number of in-progress uncache_page() cache ops | 401 + + +-------+-------------------------------------------------------+ 402 + | |dsp=N |Number of in-progress dissociate_pages() cache ops | 403 + +--------------+-------+-------------------------------------------------------+ 404 + |CacheEv |nsp=N |Number of object lookups/creations rejected due to | 405 + | | |lack of space | 406 + + +-------+-------------------------------------------------------+ 407 + | |stl=N |Number of stale objects deleted | 408 + + +-------+-------------------------------------------------------+ 409 + | |rtr=N |Number of objects retired when relinquished | 410 + + +-------+-------------------------------------------------------+ 411 + | |cul=N |Number of objects culled | 412 + +--------------+-------+-------------------------------------------------------+ 413 + 414 + 415 + 416 + /proc/fs/fscache/histogram 417 + -------------------------- 418 + 419 + :: 420 + 421 + cat /proc/fs/fscache/histogram 422 + JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS 423 + ===== ===== ========= ========= ========= ========= ========= 424 + 425 + This shows the breakdown of the number of 
times each amount of time 426 + between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The 427 + columns are as follows: 428 + 429 + ========= ======================================================= 430 + COLUMN TIME MEASUREMENT 431 + ========= ======================================================= 432 + OBJ INST Length of time to instantiate an object 433 + OP RUNS Length of time a call to process an operation took 434 + OBJ RUNS Length of time a call to process an object event took 435 + RETRV DLY Time between an requesting a read and lookup completing 436 + RETRIEVLS Time between beginning and end of a retrieval 437 + ========= ======================================================= 438 + 439 + Each row shows the number of events that took a particular range of times. 440 + Each step is 1 jiffy in size. The JIFS column indicates the particular 441 + jiffy range covered, and the SECS field the equivalent number of seconds. 442 + 443 + 444 + 445 + Object List 446 + =========== 447 + 448 + If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a 449 + list of all the objects currently allocated and allow them to be viewed 450 + through:: 451 + 452 + /proc/fs/fscache/objects 453 + 454 + This will look something like:: 455 + 456 + [root@andromeda ~]# head /proc/fs/fscache/objects 457 + OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA 458 + ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================ 459 + 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a 460 + 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 
420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a 461 + 462 + where the first set of columns before the '|' describe the object: 463 + 464 + ======= =============================================================== 465 + COLUMN DESCRIPTION 466 + ======= =============================================================== 467 + OBJECT Object debugging ID (appears as OBJ%x in some debug messages) 468 + PARENT Debugging ID of parent object 469 + STAT Object state 470 + CHLDN Number of child objects of this object 471 + OPS Number of outstanding operations on this object 472 + OOP Number of outstanding child object management operations 473 + IPR 474 + EX Number of outstanding exclusive operations 475 + READS Number of outstanding read operations 476 + EM Object's event mask 477 + EV Events raised on this object 478 + F Object flags 479 + S Object work item busy state mask (1:pending 2:running) 480 + ======= =============================================================== 481 + 482 + and the second set of columns describe the object's cookie, if present: 483 + 484 + ================ ====================================================== 485 + COLUMN DESCRIPTION 486 + ================ ====================================================== 487 + NETFS_COOKIE_DEF Name of netfs cookie definition 488 + TY Cookie type (IX - index, DT - data, hex - special) 489 + FL Cookie flags 490 + NETFS_DATA Netfs private data stored in the cookie 491 + OBJECT_KEY Object key } 1 column, with separating comma 492 + AUX_DATA Object aux data } presence may be configured 493 + ================ ====================================================== 494 + 495 + The data shown may be filtered by attaching the a key to an appropriate keyring 496 + before viewing the file. 
Something like:: 497 + 498 + keyctl add user fscache:objlist <restrictions> @s 499 + 500 + where <restrictions> are a selection of the following letters: 501 + 502 + == ========================================================= 503 + K Show hexdump of object key (don't show if not given) 504 + A Show hexdump of object aux data (don't show if not given) 505 + == ========================================================= 506 + 507 + and the following paired letters: 508 + 509 + == ========================================================= 510 + C Show objects that have a cookie 511 + c Show objects that don't have a cookie 512 + B Show objects that are busy 513 + b Show objects that aren't busy 514 + W Show objects that have pending writes 515 + w Show objects that don't have pending writes 516 + R Show objects that have outstanding reads 517 + r Show objects that don't have outstanding reads 518 + S Show objects that have work queued 519 + s Show objects that don't have work queued 520 + == ========================================================= 521 + 522 + If neither side of a letter pair is given, then both are implied. For example: 523 + 524 + keyctl add user fscache:objlist KB @s 525 + 526 + shows objects that are busy, and lists their object keys, but does not dump 527 + their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is 528 + not implied. 529 + 530 + By default all objects and all fields will be shown. 
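The pairing rule for the restriction letters can be sketched in user-space C. This is an illustration only, not kernel code: the function name `objlist_expand()` and its fixed output order are assumptions made for the example. Only the letters themselves, and the rule that an unmentioned pair implies both of its sides, come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: expand an fscache:objlist restriction string into
 * the full set of letters that are in effect.
 *
 * K and A are only in effect when given explicitly.  For each letter pair,
 * giving neither side implies both; giving one side excludes the other.
 *
 * "out" must have room for 13 bytes (at most 12 letters plus the NUL).
 */
void objlist_expand(const char *restrictions, char *out)
{
    static const char singles[] = "KA";
    static const char pairs[][2] = {
        { 'C', 'c' }, { 'B', 'b' }, { 'W', 'w' }, { 'R', 'r' }, { 'S', 's' },
    };
    size_t n = 0;

    assert(restrictions != NULL && out != NULL);

    /* Hexdump selectors: shown only if requested. */
    for (size_t i = 0; singles[i] != '\0'; i++)
        if (strchr(restrictions, singles[i]) != NULL)
            out[n++] = singles[i];

    /* Paired selectors: an unmentioned pair implies both sides. */
    for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++) {
        int upper = strchr(restrictions, pairs[i][0]) != NULL;
        int lower = strchr(restrictions, pairs[i][1]) != NULL;

        if (upper || !lower)
            out[n++] = pairs[i][0];
        if (lower || !upper)
            out[n++] = pairs[i][1];
    }
    out[n] = '\0';
}
```

Expanding "KB" this way gives K, B and both sides of the C, W, R and S pairs, which matches the worked example in the text: 'b' is dropped because 'B' was given, and 'A' is absent because it was not requested.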
531 +
532 +
533 + Debugging
534 + =========
535 +
536 + If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
537 + debugging enabled by adjusting the value in::
538 +
539 +     /sys/module/fscache/parameters/debug
540 +
541 + This is a bitmask of debugging streams to enable:
542 +
543 +     ======= ======= =============================== =======================
544 +     BIT     VALUE   STREAM                          POINT
545 +     ======= ======= =============================== =======================
546 +     0       1       Cache management                Function entry trace
547 +     1       2                                       Function exit trace
548 +     2       4                                       General
549 +     3       8       Cookie management               Function entry trace
550 +     4       16                                      Function exit trace
551 +     5       32                                      General
552 +     6       64      Page handling                   Function entry trace
553 +     7       128                                     Function exit trace
554 +     8       256                                     General
555 +     9       512     Operation management            Function entry trace
556 +     10      1024                                    Function exit trace
557 +     11      2048                                    General
558 +     ======= ======= =============================== =======================
559 +
560 + The appropriate set of values should be OR'd together and the result written to
561 + the control file. For example::
562 +
563 +     echo $((1|8|64|512)) >/sys/module/fscache/parameters/debug
564 +
565 + will turn on all function entry debugging.
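The layout of the debug bitmask table is regular: each of the four streams owns three consecutive bits, in the order entry, exit, general. A minimal user-space sketch of that arithmetic follows. The enum and function names are hypothetical and exist only for this example; they are not kernel symbols, and the sketch only computes a value that could then be written to the control file.

```c
#include <assert.h>

/* Hypothetical names for the four streams, in table order. */
enum dbg_stream { DBG_CACHE, DBG_COOKIE, DBG_PAGE, DBG_OPERATION };

/* Hypothetical names for the three trace points within a stream. */
enum dbg_point { DBG_ENTRY, DBG_EXIT, DBG_GENERAL };

/* Value of one table row: bit number is stream * 3 + point. */
unsigned int dbg_bit(enum dbg_stream stream, enum dbg_point point)
{
    assert(stream <= DBG_OPERATION && point <= DBG_GENERAL);
    return 1u << ((unsigned int)stream * 3 + (unsigned int)point);
}

/* Mask selecting the same trace point in every stream. */
unsigned int dbg_mask_all(enum dbg_point point)
{
    unsigned int mask = 0;

    for (int s = DBG_CACHE; s <= DBG_OPERATION; s++)
        mask |= dbg_bit((enum dbg_stream)s, point);
    return mask;
}
```

Going by the table, `dbg_mask_all(DBG_ENTRY)` evaluates to 585, which is 1|8|64|512, the entry-trace bits of all four streams.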

-448

Documentation/filesystems/caching/fscache.txt

··· 1 - ========================== 2 - General Filesystem Caching 3 - ========================== 4 - 5 - ======== 6 - OVERVIEW 7 - ======== 8 - 9 - This facility is a general purpose cache for network filesystems, though it 10 - could be used for caching other things such as ISO9660 filesystems too. 11 - 12 - FS-Cache mediates between cache backends (such as CacheFS) and network 13 - filesystems: 14 - 15 - +---------+ 16 - | | +--------------+ 17 - | NFS |--+ | | 18 - | | | +-->| CacheFS | 19 - +---------+ | +----------+ | | /dev/hda5 | 20 - | | | | +--------------+ 21 - +---------+ +-->| | | 22 - | | | |--+ 23 - | AFS |----->| FS-Cache | 24 - | | | |--+ 25 - +---------+ +-->| | | 26 - | | | | +--------------+ 27 - +---------+ | +----------+ | | | 28 - | | | +-->| CacheFiles | 29 - | ISOFS |--+ | /var/cache | 30 - | | +--------------+ 31 - +---------+ 32 - 33 - Or to look at it another way, FS-Cache is a module that provides a caching 34 - facility to a network filesystem such that the cache is transparent to the 35 - user: 36 - 37 - +---------+ 38 - | | 39 - | Server | 40 - | | 41 - +---------+ 42 - | NETWORK 43 - ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 44 - | 45 - | +----------+ 46 - V | | 47 - +---------+ | | 48 - | | | | 49 - | NFS |----->| FS-Cache | 50 - | | | |--+ 51 - +---------+ | | | +--------------+ +--------------+ 52 - | | | | | | | | 53 - V +----------+ +-->| CacheFiles |-->| Ext3 | 54 - +---------+ | /var/cache | | /dev/sda6 | 55 - | | +--------------+ +--------------+ 56 - | VFS | ^ ^ 57 - | | | | 58 - +---------+ +--------------+ | 59 - | KERNEL SPACE | | 60 - ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ 61 - | USER SPACE | | 62 - V | | 63 - +---------+ +--------------+ 64 - | | | | 65 - | Process | | cachefilesd | 66 - | | | | 67 - +---------+ +--------------+ 68 - 69 - 70 - FS-Cache does not follow the idea of completely loading every netfs file 71 - opened in its entirety into a cache 
before permitting it to be accessed and 72 - then serving the pages out of that cache rather than the netfs inode because: 73 - 74 - (1) It must be practical to operate without a cache. 75 - 76 - (2) The size of any accessible file must not be limited to the size of the 77 - cache. 78 - 79 - (3) The combined size of all opened files (this includes mapped libraries) 80 - must not be limited to the size of the cache. 81 - 82 - (4) The user should not be forced to download an entire file just to do a 83 - one-off access of a small portion of it (such as might be done with the 84 - "file" program). 85 - 86 - It instead serves the cache out in PAGE_SIZE chunks as and when requested by 87 - the netfs('s) using it. 88 - 89 - 90 - FS-Cache provides the following facilities: 91 - 92 - (1) More than one cache can be used at once. Caches can be selected 93 - explicitly by use of tags. 94 - 95 - (2) Caches can be added / removed at any time. 96 - 97 - (3) The netfs is provided with an interface that allows either party to 98 - withdraw caching facilities from a file (required for (2)). 99 - 100 - (4) The interface to the netfs returns as few errors as possible, preferring 101 - rather to let the netfs remain oblivious. 102 - 103 - (5) Cookies are used to represent indices, files and other objects to the 104 - netfs. The simplest cookie is just a NULL pointer - indicating nothing 105 - cached there. 106 - 107 - (6) The netfs is allowed to propose - dynamically - any index hierarchy it 108 - desires, though it must be aware that the index search function is 109 - recursive, stack space is limited, and indices can only be children of 110 - indices. 111 - 112 - (7) Data I/O is done direct to and from the netfs's pages. The netfs 113 - indicates that page A is at index B of the data-file represented by cookie 114 - C, and that it should be read or written. 
The cache backend may or may 115 - not start I/O on that page, but if it does, a netfs callback will be 116 - invoked to indicate completion. The I/O may be either synchronous or 117 - asynchronous. 118 - 119 - (8) Cookies can be "retired" upon release. At this point FS-Cache will mark 120 - them as obsolete and the index hierarchy rooted at that point will get 121 - recycled. 122 - 123 - (9) The netfs provides a "match" function for index searches. In addition to 124 - saying whether a match was made or not, this can also specify that an 125 - entry should be updated or deleted. 126 - 127 - (10) As much as possible is done asynchronously. 128 - 129 - 130 - FS-Cache maintains a virtual indexing tree in which all indices, files, objects 131 - and pages are kept. Bits of this tree may actually reside in one or more 132 - caches. 133 - 134 - FSDEF 135 - | 136 - +------------------------------------+ 137 - | | 138 - NFS AFS 139 - | | 140 - +--------------------------+ +-----------+ 141 - | | | | 142 - homedir mirror afs.org redhat.com 143 - | | | 144 - +------------+ +---------------+ +----------+ 145 - | | | | | | 146 - 00001 00002 00007 00125 vol00001 vol00002 147 - | | | | | 148 - +---+---+ +-----+ +---+ +------+------+ +-----+----+ 149 - | | | | | | | | | | | | | 150 - PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak 151 - | | 152 - PG0 +-------+ 153 - | | 154 - 00001 00003 155 - | 156 - +---+---+ 157 - | | | 158 - PG0 PG1 PG2 159 - 160 - In the example above, you can see two netfs's being backed: NFS and AFS. These 161 - have different index hierarchies: 162 - 163 - (*) The NFS primary index contains per-server indices. Each server index is 164 - indexed by NFS file handles to get data file objects. Each data file 165 - objects can have an array of pages, but may also have further child 166 - objects, such as extended attributes and directory entries. Extended 167 - attribute objects themselves have page-array contents. 
168 - 169 - (*) The AFS primary index contains per-cell indices. Each cell index contains 170 - per-logical-volume indices. Each of volume index contains up to three 171 - indices for the read-write, read-only and backup mirrors of those volumes. 172 - Each of these contains vnode data file objects, each of which contains an 173 - array of pages. 174 - 175 - The very top index is the FS-Cache master index in which individual netfs's 176 - have entries. 177 - 178 - Any index object may reside in more than one cache, provided it only has index 179 - children. Any index with non-index object children will be assumed to only 180 - reside in one cache. 181 - 182 - 183 - The netfs API to FS-Cache can be found in: 184 - 185 - Documentation/filesystems/caching/netfs-api.txt 186 - 187 - The cache backend API to FS-Cache can be found in: 188 - 189 - Documentation/filesystems/caching/backend-api.txt 190 - 191 - A description of the internal representations and object state machine can be 192 - found in: 193 - 194 - Documentation/filesystems/caching/object.txt 195 - 196 - 197 - ======================= 198 - STATISTICAL INFORMATION 199 - ======================= 200 - 201 - If FS-Cache is compiled with the following options enabled: 202 - 203 - CONFIG_FSCACHE_STATS=y 204 - CONFIG_FSCACHE_HISTOGRAM=y 205 - 206 - then it will gather certain statistics and display them through a number of 207 - proc files. 
208 - 209 - (*) /proc/fs/fscache/stats 210 - 211 - This shows counts of a number of events that can happen in FS-Cache: 212 - 213 - CLASS EVENT MEANING 214 - ======= ======= ======================================================= 215 - Cookies idx=N Number of index cookies allocated 216 - dat=N Number of data storage cookies allocated 217 - spc=N Number of special cookies allocated 218 - Objects alc=N Number of objects allocated 219 - nal=N Number of object allocation failures 220 - avl=N Number of objects that reached the available state 221 - ded=N Number of objects that reached the dead state 222 - ChkAux non=N Number of objects that didn't have a coherency check 223 - ok=N Number of objects that passed a coherency check 224 - upd=N Number of objects that needed a coherency data update 225 - obs=N Number of objects that were declared obsolete 226 - Pages mrk=N Number of pages marked as being cached 227 - unc=N Number of uncache page requests seen 228 - Acquire n=N Number of acquire cookie requests seen 229 - nul=N Number of acq reqs given a NULL parent 230 - noc=N Number of acq reqs rejected due to no cache available 231 - ok=N Number of acq reqs succeeded 232 - nbf=N Number of acq reqs rejected due to error 233 - oom=N Number of acq reqs failed on ENOMEM 234 - Lookups n=N Number of lookup calls made on cache backends 235 - neg=N Number of negative lookups made 236 - pos=N Number of positive lookups made 237 - crt=N Number of objects created by lookup 238 - tmo=N Number of lookups timed out and requeued 239 - Updates n=N Number of update cookie requests seen 240 - nul=N Number of upd reqs given a NULL parent 241 - run=N Number of upd reqs granted CPU time 242 - Relinqs n=N Number of relinquish cookie requests seen 243 - nul=N Number of rlq reqs given a NULL parent 244 - wcr=N Number of rlq reqs waited on completion of creation 245 - AttrChg n=N Number of attribute changed requests seen 246 - ok=N Number of attr changed requests queued 247 - nbf=N Number of attr 
changed rejected -ENOBUFS 248 - oom=N Number of attr changed failed -ENOMEM 249 - run=N Number of attr changed ops given CPU time 250 - Allocs n=N Number of allocation requests seen 251 - ok=N Number of successful alloc reqs 252 - wt=N Number of alloc reqs that waited on lookup completion 253 - nbf=N Number of alloc reqs rejected -ENOBUFS 254 - int=N Number of alloc reqs aborted -ERESTARTSYS 255 - ops=N Number of alloc reqs submitted 256 - owt=N Number of alloc reqs waited for CPU time 257 - abt=N Number of alloc reqs aborted due to object death 258 - Retrvls n=N Number of retrieval (read) requests seen 259 - ok=N Number of successful retr reqs 260 - wt=N Number of retr reqs that waited on lookup completion 261 - nod=N Number of retr reqs returned -ENODATA 262 - nbf=N Number of retr reqs rejected -ENOBUFS 263 - int=N Number of retr reqs aborted -ERESTARTSYS 264 - oom=N Number of retr reqs failed -ENOMEM 265 - ops=N Number of retr reqs submitted 266 - owt=N Number of retr reqs waited for CPU time 267 - abt=N Number of retr reqs aborted due to object death 268 - Stores n=N Number of storage (write) requests seen 269 - ok=N Number of successful store reqs 270 - agn=N Number of store reqs on a page already pending storage 271 - nbf=N Number of store reqs rejected -ENOBUFS 272 - oom=N Number of store reqs failed -ENOMEM 273 - ops=N Number of store reqs submitted 274 - run=N Number of store reqs granted CPU time 275 - pgs=N Number of pages given store req processing time 276 - rxd=N Number of store reqs deleted from tracking tree 277 - olm=N Number of store reqs over store limit 278 - VmScan nos=N Number of release reqs against pages with no pending store 279 - gon=N Number of release reqs against pages stored by time lock granted 280 - bsy=N Number of release reqs ignored due to in-progress store 281 - can=N Number of page stores cancelled due to release req 282 - Ops pend=N Number of times async ops added to pending queues 283 - run=N Number of times async ops given 
CPU time 284 - enq=N Number of times async ops queued for processing 285 - can=N Number of async ops cancelled 286 - rej=N Number of async ops rejected due to object lookup/create failure 287 - ini=N Number of async ops initialised 288 - dfr=N Number of async ops queued for deferred release 289 - rel=N Number of async ops released (should equal ini=N when idle) 290 - gc=N Number of deferred-release async ops garbage collected 291 - CacheOp alo=N Number of in-progress alloc_object() cache ops 292 - luo=N Number of in-progress lookup_object() cache ops 293 - luc=N Number of in-progress lookup_complete() cache ops 294 - gro=N Number of in-progress grab_object() cache ops 295 - upo=N Number of in-progress update_object() cache ops 296 - dro=N Number of in-progress drop_object() cache ops 297 - pto=N Number of in-progress put_object() cache ops 298 - syn=N Number of in-progress sync_cache() cache ops 299 - atc=N Number of in-progress attr_changed() cache ops 300 - rap=N Number of in-progress read_or_alloc_page() cache ops 301 - ras=N Number of in-progress read_or_alloc_pages() cache ops 302 - alp=N Number of in-progress allocate_page() cache ops 303 - als=N Number of in-progress allocate_pages() cache ops 304 - wrp=N Number of in-progress write_page() cache ops 305 - ucp=N Number of in-progress uncache_page() cache ops 306 - dsp=N Number of in-progress dissociate_pages() cache ops 307 - CacheEv nsp=N Number of object lookups/creations rejected due to lack of space 308 - stl=N Number of stale objects deleted 309 - rtr=N Number of objects retired when relinquished 310 - cul=N Number of objects culled 311 - 312 - 313 - (*) /proc/fs/fscache/histogram 314 - 315 - cat /proc/fs/fscache/histogram 316 - JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS 317 - ===== ===== ========= ========= ========= ========= ========= 318 - 319 - This shows the breakdown of the number of times each amount of time 320 - between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. 
The 321 - columns are as follows: 322 - 323 - COLUMN TIME MEASUREMENT 324 - ======= ======================================================= 325 - OBJ INST Length of time to instantiate an object 326 - OP RUNS Length of time a call to process an operation took 327 - OBJ RUNS Length of time a call to process an object event took 328 - RETRV DLY Time between an requesting a read and lookup completing 329 - RETRIEVLS Time between beginning and end of a retrieval 330 - 331 - Each row shows the number of events that took a particular range of times. 332 - Each step is 1 jiffy in size. The JIFS column indicates the particular 333 - jiffy range covered, and the SECS field the equivalent number of seconds. 334 - 335 - 336 - =========== 337 - OBJECT LIST 338 - =========== 339 - 340 - If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a 341 - list of all the objects currently allocated and allow them to be viewed 342 - through: 343 - 344 - /proc/fs/fscache/objects 345 - 346 - This will look something like: 347 - 348 - [root@andromeda ~]# head /proc/fs/fscache/objects 349 - OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA 350 - ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================ 351 - 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a 352 - 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a 353 - 354 - where the first set of columns before the '|' describe the object: 355 - 356 - COLUMN DESCRIPTION 357 - ======= =============================================================== 358 - OBJECT 
Object debugging ID (appears as OBJ%x in some debug messages) 359 - PARENT Debugging ID of parent object 360 - STAT Object state 361 - CHLDN Number of child objects of this object 362 - OPS Number of outstanding operations on this object 363 - OOP Number of outstanding child object management operations 364 - IPR 365 - EX Number of outstanding exclusive operations 366 - READS Number of outstanding read operations 367 - EM Object's event mask 368 - EV Events raised on this object 369 - F Object flags 370 - S Object work item busy state mask (1:pending 2:running) 371 - 372 - and the second set of columns describe the object's cookie, if present: 373 - 374 - COLUMN DESCRIPTION 375 - =============== ======================================================= 376 - NETFS_COOKIE_DEF Name of netfs cookie definition 377 - TY Cookie type (IX - index, DT - data, hex - special) 378 - FL Cookie flags 379 - NETFS_DATA Netfs private data stored in the cookie 380 - OBJECT_KEY Object key } 1 column, with separating comma 381 - AUX_DATA Object aux data } presence may be configured 382 - 383 - The data shown may be filtered by attaching the a key to an appropriate keyring 384 - before viewing the file. 
Something like: 385 - 386 - keyctl add user fscache:objlist <restrictions> @s 387 - 388 - where <restrictions> are a selection of the following letters: 389 - 390 - K Show hexdump of object key (don't show if not given) 391 - A Show hexdump of object aux data (don't show if not given) 392 - 393 - and the following paired letters: 394 - 395 - C Show objects that have a cookie 396 - c Show objects that don't have a cookie 397 - B Show objects that are busy 398 - b Show objects that aren't busy 399 - W Show objects that have pending writes 400 - w Show objects that don't have pending writes 401 - R Show objects that have outstanding reads 402 - r Show objects that don't have outstanding reads 403 - S Show objects that have work queued 404 - s Show objects that don't have work queued 405 - 406 - If neither side of a letter pair is given, then both are implied. For example: 407 - 408 - keyctl add user fscache:objlist KB @s 409 - 410 - shows objects that are busy, and lists their object keys, but does not dump 411 - their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is 412 - not implied. 413 - 414 - By default all objects and all fields will be shown. 
415 - 416 - 417 - ========= 418 - DEBUGGING 419 - ========= 420 - 421 - If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime 422 - debugging enabled by adjusting the value in: 423 - 424 - /sys/module/fscache/parameters/debug 425 - 426 - This is a bitmask of debugging streams to enable: 427 - 428 - BIT VALUE STREAM POINT 429 - ======= ======= =============================== ======================= 430 - 0 1 Cache management Function entry trace 431 - 1 2 Function exit trace 432 - 2 4 General 433 - 3 8 Cookie management Function entry trace 434 - 4 16 Function exit trace 435 - 5 32 General 436 - 6 64 Page handling Function entry trace 437 - 7 128 Function exit trace 438 - 8 256 General 439 - 9 512 Operation management Function entry trace 440 - 10 1024 Function exit trace 441 - 11 2048 General 442 - 443 - The appropriate set of values should be OR'd together and the result written to 444 - the control file. For example: 445 - 446 - echo $((1|8|64)) >/sys/module/fscache/parameters/debug 447 - 448 - will turn on all function entry debugging.

+14

Documentation/filesystems/caching/index.rst

···
1 + .. SPDX-License-Identifier: GPL-2.0
2 +
3 + Filesystem Caching
4 + ==================
5 +
6 + .. toctree::
7 +    :maxdepth: 2
8 +
9 +    fscache
10 +    object
11 +    backend-api
12 +    cachefiles
13 +    netfs-api
14 +    operations

+896

Documentation/filesystems/caching/netfs-api.rst

···
1 + .. SPDX-License-Identifier: GPL-2.0
2 +
3 + ===============================
4 + FS-Cache Network Filesystem API
5 + ===============================
6 +
7 + There's an API by which a network filesystem can make use of the FS-Cache
8 + facilities. This is based around a number of principles:
9 +
10 +  (1) Caches can store a number of different object types. There are two main
11 +      object types: indices and files. The first is a special type used by
12 +      FS-Cache to make finding objects faster and to make retiring of groups of
13 +      objects easier.
14 +
15 +  (2) Every index, file or other object is represented by a cookie. This cookie
16 +      may or may not have anything associated with it, but the netfs doesn't
17 +      need to care.
18 +
19 +  (3) Barring the top-level index (one entry per cached netfs), the index
20 +      hierarchy for each netfs is structured according to the whim of the netfs.
21 +
22 + This API is declared in <linux/fscache.h>.
23 +
24 + .. This document contains the following sections:
25 +
26 +      (1) Network filesystem definition
27 +      (2) Index definition
28 +      (3) Object definition
29 +      (4) Network filesystem (un)registration
30 +      (5) Cache tag lookup
31 +      (6) Index registration
32 +      (7) Data file registration
33 +      (8) Miscellaneous object registration
34 +      (9) Setting the data file size
35 +     (10) Page alloc/read/write
36 +     (11) Page uncaching
37 +     (12) Index and data file consistency
38 +     (13) Cookie enablement
39 +     (14) Miscellaneous cookie operations
40 +     (15) Cookie unregistration
41 +     (16) Index invalidation
42 +     (17) Data file invalidation
43 +     (18) FS-Cache specific page flags.
44 +
45 +
46 + Network Filesystem Definition
47 + =============================
48 +
49 + FS-Cache needs a description of the network filesystem. This is specified
50 + using a record of the following structure::
51 +
52 +     struct fscache_netfs {
53 +         uint32_t                        version;
54 +         const char                      *name;
55 +         struct fscache_cookie           *primary_index;
56 +         ...
57 +     };
58 +
59 + The first two fields should be filled in before registration, and the third
60 + will be filled in by the registration function; any other fields should just be
61 + ignored and are for internal use only.
62 +
63 + The fields are:
64 +
65 +  (1) The name of the netfs (used as the key in the toplevel index).
66 +
67 +  (2) The version of the netfs (if the name matches but the version doesn't, the
68 +      entire in-cache hierarchy for this netfs will be scrapped and begun
69 +      afresh).
70 +
71 +  (3) The cookie representing the primary index will be allocated according to
72 +      another parameter passed into the registration function.
73 +
74 + For example, kAFS (linux/fs/afs/) uses the following definitions to describe
75 + itself::
76 +
77 +     struct fscache_netfs afs_cache_netfs = {
78 +         .version        = 0,
79 +         .name           = "afs",
80 +     };
81 +
82 +
83 + Index Definition
84 + ================
85 +
86 + Indices are used for two purposes:
87 +
88 +  (1) To aid the finding of a file based on a series of keys (such as AFS's
89 +      "cell", "volume ID", "vnode ID").
90 +
91 +  (2) To make it easier to discard a subset of all the files cached based around
92 +      a particular key - for instance to mirror the removal of an AFS volume.
93 +
94 + However, since it's unlikely that any two netfs's are going to want to define
95 + their index hierarchies in quite the same way, FS-Cache tries to impose as few
96 + restraints as possible on how an index is structured and where it is placed in
97 + the tree. The netfs can even mix indices and data files at the same level, but
98 + it's not recommended.
99 +
100 + Each index entry consists of a key of indeterminate length plus some auxiliary
101 + data, also of indeterminate length.
102 +
103 + There are some limits on indices:
104 +
105 +  (1) Any index containing non-index objects should be restricted to a single
106 +      cache. Any such objects created within an index will be created in the
107 +      first cache only.
The cache in which an index is created can be 108 + controlled by cache tags (see below). 109 + 110 + (2) The entry data must be atomically journallable, so it is limited to about 111 + 400 bytes at present. At least 400 bytes will be available. 112 + 113 + (3) The depth of the index tree should be judged with care as the search 114 + function is recursive. Too many layers will run the kernel out of stack. 115 + 116 + 117 + Object Definition 118 + ================= 119 + 120 + To define an object, a structure of the following type should be filled out:: 121 + 122 + struct fscache_cookie_def 123 + { 124 + uint8_t name[16]; 125 + uint8_t type; 126 + 127 + struct fscache_cache_tag *(*select_cache)( 128 + const void *parent_netfs_data, 129 + const void *cookie_netfs_data); 130 + 131 + enum fscache_checkaux (*check_aux)(void *cookie_netfs_data, 132 + const void *data, 133 + uint16_t datalen, 134 + loff_t object_size); 135 + 136 + void (*get_context)(void *cookie_netfs_data, void *context); 137 + 138 + void (*put_context)(void *cookie_netfs_data, void *context); 139 + 140 + void (*mark_pages_cached)(void *cookie_netfs_data, 141 + struct address_space *mapping, 142 + struct pagevec *cached_pvec); 143 + }; 144 + 145 + This has the following fields: 146 + 147 + (1) The type of the object [mandatory]. 148 + 149 + This is one of the following values: 150 + 151 + FSCACHE_COOKIE_TYPE_INDEX 152 + This defines an index, which is a special FS-Cache type. 153 + 154 + FSCACHE_COOKIE_TYPE_DATAFILE 155 + This defines an ordinary data file. 156 + 157 + Any other value between 2 and 255 158 + This defines an extraordinary object such as an XATTR. 159 + 160 + (2) The name of the object type (NUL terminated unless all 16 chars are used) 161 + [optional]. 162 + 163 + (3) A function to select the cache in which to store an index [optional]. 164 + 165 + This function is invoked when an index needs to be instantiated in a cache 166 + during the instantiation of a non-index object. 
Only the immediate index 167 + parent for the non-index object will be queried. Any indices above that 168 + in the hierarchy may be stored in multiple caches. This function does not 169 + need to be supplied for any non-index object or any index that will only 170 + have index children. 171 + 172 + If this function is not supplied or if it returns NULL then the first 173 + cache in the parent's list will be chosen, or failing that, the first 174 + cache in the master list. 175 + 176 + (4) A function to check the auxiliary data [optional]. 177 + 178 + This function will be called to check that a match found in the cache for 179 + this object is valid. For instance with AFS it could check the auxiliary 180 + data against the data version number returned by the server to determine 181 + whether the index entry in a cache is still valid. 182 + 183 + If this function is absent, it will be assumed that matching objects in a 184 + cache are always valid. 185 + 186 + The function is also passed the cache's idea of the object size and may 187 + use this to manage coherency also. 188 + 189 + If present, the function should return one of the following values: 190 + 191 + FSCACHE_CHECKAUX_OKAY 192 + - the entry is okay as is 193 + 194 + FSCACHE_CHECKAUX_NEEDS_UPDATE 195 + - the entry requires update 196 + 197 + FSCACHE_CHECKAUX_OBSOLETE 198 + - the entry should be deleted 199 + 200 + This function can also be used to extract data from the auxiliary data in 201 + the cache and copy it into the netfs's structures. 202 + 203 + (5) A pair of functions to manage contexts for the completion callback 204 + [optional]. 205 + 206 + The cache read/write functions are passed a context which is then passed 207 + to the I/O completion callback function. To ensure this context remains 208 + valid until after the I/O completion is called, two functions may be 209 + provided: one to get an extra reference on the context, and one to drop a 210 + reference to it. 
211 + 212 + If the context is not used or is a type of object that won't go out of 213 + scope, then these functions are not required. These functions are not 214 + required for indices as indices may not contain data. These functions may 215 + be called in interrupt context and so may not sleep. 216 + 217 + (6) A function to mark a page as retaining cache metadata [optional]. 218 + 219 + This is called by the cache to indicate that it is retaining in-memory 220 + information for this page and that the netfs should uncache the page when 221 + it has finished. This does not indicate whether there's data on the disk 222 + or not. Note that several pages at once may be presented for marking. 223 + 224 + The PG_fscache bit is set on the pages before this function would be 225 + called, so the function need not be provided if this is sufficient. 226 + 227 + This function is not required for indices as they're not permitted data. 228 + 229 + (7) A function to unmark all the pages retaining cache metadata [mandatory]. 230 + 231 + This is called by FS-Cache to indicate that a backing store is being 232 + unbound from a cookie and that all the marks on the pages should be 233 + cleared to prevent confusion. Note that the cache will have torn down all 234 + its tracking information so that the pages don't need to be explicitly 235 + uncached. 236 + 237 + This function is not required for indices as they're not permitted data. 238 + 239 + 240 + Network Filesystem (Un)registration 241 + =================================== 242 + 243 + The first step is to declare the network filesystem to the cache. This also 244 + involves specifying the layout of the primary index (for AFS, this would be the 245 + "cell" level). 246 + 247 + The registration function is:: 248 + 249 + int fscache_register_netfs(struct fscache_netfs *netfs); 250 + 251 + It just takes a pointer to the netfs definition. It returns 0 or an error as 252 + appropriate. 
253 + 254 + For kAFS, registration is done as follows:: 255 + 256 + ret = fscache_register_netfs(&afs_cache_netfs); 257 + 258 + The last step is, of course, unregistration:: 259 + 260 + void fscache_unregister_netfs(struct fscache_netfs *netfs); 261 + 262 + 263 + Cache Tag Lookup 264 + ================ 265 + 266 + FS-Cache permits the use of more than one cache. To permit particular index 267 + subtrees to be bound to particular caches, the second step is to look up cache 268 + representation tags. This step is optional; it can be left entirely up to 269 + FS-Cache as to which cache should be used. The problem with doing that is that 270 + FS-Cache will always pick the first cache that was registered. 271 + 272 + To get the representation for a named tag:: 273 + 274 + struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); 275 + 276 + This takes a text string as the name and returns a representation of a tag. It 277 + will never return an error. It may return a dummy tag, however, if it runs out 278 + of memory; this will inhibit caching with this tag. 279 + 280 + Any representation so obtained must be released by passing it to this function:: 281 + 282 + void fscache_release_cache_tag(struct fscache_cache_tag *tag); 283 + 284 + The tag will be retrieved by FS-Cache when it calls the object definition 285 + operation select_cache(). 286 + 287 + 288 + Index Registration 289 + ================== 290 + 291 + The third step is to inform FS-Cache about part of an index hierarchy that can 292 + be used to locate files. 
This is done by requesting a cookie for each index in 293 + the path to the file:: 294 + 295 + struct fscache_cookie * 296 + fscache_acquire_cookie(struct fscache_cookie *parent, 297 + const struct fscache_object_def *def, 298 + const void *index_key, 299 + size_t index_key_len, 300 + const void *aux_data, 301 + size_t aux_data_len, 302 + void *netfs_data, 303 + loff_t object_size, 304 + bool enable); 305 + 306 + This function creates an index entry in the index represented by parent, 307 + filling in the index entry by calling the operations pointed to by def. 308 + 309 + A unique key that represents the object within the parent must be pointed to by 310 + index_key and is of length index_key_len. 311 + 312 + An optional blob of auxiliary data that is to be stored within the cache can be 313 + pointed to with aux_data and should be of length aux_data_len. This would 314 + typically be used for storing coherency data. 315 + 316 + The netfs may pass an arbitrary value in netfs_data and this will be presented 317 + to it in the event of any calling back. This may also be used in tracing or 318 + logging of messages. 319 + 320 + The cache tracks the size of the data attached to an object and this is set to be 321 + object_size. For indices, this should be 0. This value will be passed to the 322 + ->check_aux() callback. 323 + 324 + Note that this function never returns an error - all errors are handled 325 + internally. It may, however, return NULL to indicate no cookie. It is quite 326 + acceptable to pass this token back to this function as the parent to another 327 + acquisition (or even to the relinquish cookie, read page and write page 328 + functions - see below). 329 + 330 + Note also that no indices are actually created in a cache until a non-index 331 + object needs to be created somewhere down the hierarchy. Furthermore, an index 332 + may be created in several different caches independently at different times.
333 + This is all handled transparently, and the netfs doesn't see any of it. 334 + 335 + A cookie will be created in the disabled state if enabled is false. A cookie 336 + must be enabled to do anything with it. A disabled cookie can be enabled by 337 + calling fscache_enable_cookie() (see below). 338 + 339 + For example, with AFS, a cell would be added to the primary index. This index 340 + entry would have a dependent inode containing volume mappings within this cell:: 341 + 342 + cell->cache = 343 + fscache_acquire_cookie(afs_cache_netfs.primary_index, 344 + &afs_cell_cache_index_def, 345 + cell->name, strlen(cell->name), 346 + NULL, 0, 347 + cell, 0, true); 348 + 349 + And then a particular volume could be added to that index by ID, creating 350 + another index for vnodes (AFS inode equivalents):: 351 + 352 + volume->cache = 353 + fscache_acquire_cookie(volume->cell->cache, 354 + &afs_volume_cache_index_def, 355 + &volume->vid, sizeof(volume->vid), 356 + NULL, 0, 357 + volume, 0, true); 358 + 359 + 360 + Data File Registration 361 + ====================== 362 + 363 + The fourth step is to request a data file be created in the cache. This is 364 + identical to index cookie acquisition. The only difference is that the type in 365 + the object definition should be something other than index type:: 366 + 367 + vnode->cache = 368 + fscache_acquire_cookie(volume->cache, 369 + &afs_vnode_cache_object_def, 370 + &key, sizeof(key), 371 + &aux, sizeof(aux), 372 + vnode, vnode->status.size, true); 373 + 374 + 375 + Miscellaneous Object Registration 376 + ================================= 377 + 378 + An optional step is to request an object of miscellaneous type be created in 379 + the cache. This is almost identical to index cookie acquisition. The only 380 + difference is that the type in the object definition should be something other 381 + than index type. 
While the parent object could be an index, it's more likely 382 + it would be some other type of object such as a data file:: 383 + 384 + xattr->cache = 385 + fscache_acquire_cookie(vnode->cache, 386 + &afs_xattr_cache_object_def, 387 + &xattr->name, strlen(xattr->name), 388 + NULL, 0, 389 + xattr, strlen(xattr->val), true); 390 + 391 + Miscellaneous objects might be used to store extended attributes or directory 392 + entries for example. 393 + 394 + 395 + Setting the Data File Size 396 + ========================== 397 + 398 + The fifth step is to set the physical attributes of the file, such as its size. 399 + This doesn't automatically reserve any space in the cache, but permits the 400 + cache to adjust its metadata for data tracking appropriately:: 401 + 402 + int fscache_attr_changed(struct fscache_cookie *cookie); 403 + 404 + The cache will return -ENOBUFS if there is no backing cache or if there is no 405 + space to allocate any extra metadata required in the cache. 406 + 407 + Note that attempts to read or write data pages in the cache over this size may 408 + be rebuffed with -ENOBUFS. 409 + 410 + This operation schedules an attribute adjustment to happen asynchronously at 411 + some point in the future, and as such, it may happen after the function returns 412 + to the caller. The attribute adjustment excludes read and write operations. 413 + 414 + 415 + Page alloc/read/write 416 + ===================== 417 + 418 + And the sixth step is to store and retrieve pages in the cache. There are 419 + three functions that are used to do this. 420 + 421 + Note: 422 + 423 + (1) A page should not be re-read or re-allocated without uncaching it first. 424 + 425 + (2) A read or allocated page must be uncached when the netfs page is released 426 + from the pagecache. 427 + 428 + (3) A page should only be written to the cache if previously read or allocated. 429 + 430 + This permits the cache to maintain its page tracking in proper order.
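Editor's sketch (not part of the patch): the three page rules above can be read as a two-state machine per page. The block below is a minimal user-space model of that reading, assuming hypothetical names throughout (`model_page`, `model_read_or_alloc()`, `model_write()`, `model_uncache()`); none of them exist in <linux/fscache.h>, and the `-EBUSY` and `-EINVAL` codes are model-only choices rather than values the real API is documented to return.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the three page rules; these names are
 * illustrative only and are not declared in <linux/fscache.h>.
 */
enum page_track_state {
	PAGE_UNTRACKED,	/* never read/allocated, or already uncached */
	PAGE_TRACKED,	/* read or allocated; the cache holds tracking state */
};

struct model_page {
	enum page_track_state state;
};

/* Rule 1: a tracked page must be uncached before being read or allocated again. */
static int model_read_or_alloc(struct model_page *pg)
{
	if (pg->state == PAGE_TRACKED)
		return -EBUSY;	/* model-only error code */
	pg->state = PAGE_TRACKED;
	return 0;
}

/* Rule 3: only a previously read or allocated page may be written. */
static int model_write(const struct model_page *pg)
{
	return pg->state == PAGE_TRACKED ? 0 : -EINVAL;	/* model-only error code */
}

/* Rule 2: the page is uncached when the netfs page leaves the pagecache. */
static void model_uncache(struct model_page *pg)
{
	pg->state = PAGE_UNTRACKED;
}

/* True if releasing the page would leave no tracking state behind. */
static bool model_can_release(const struct model_page *pg)
{
	return pg->state == PAGE_UNTRACKED;
}
```

In this model a page cycles untracked, tracked, untracked; a write is only legal in the tracked state, which is one way to picture what "proper order" means for the cache's page tracking.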
431 + 432 + 433 + PAGE READ 434 + --------- 435 + 436 + Firstly, the netfs should ask FS-Cache to examine the caches and read the 437 + contents cached for a particular page of a particular file if present, or else 438 + allocate space to store the contents if not:: 439 + 440 + typedef 441 + void (*fscache_rw_complete_t)(struct page *page, 442 + void *context, 443 + int error); 444 + 445 + int fscache_read_or_alloc_page(struct fscache_cookie *cookie, 446 + struct page *page, 447 + fscache_rw_complete_t end_io_func, 448 + void *context, 449 + gfp_t gfp); 450 + 451 + The cookie argument must specify a cookie for an object that isn't an index, 452 + the page specified will have the data loaded into it (and is also used to 453 + specify the page number), and the gfp argument is used to control how any 454 + memory allocations made are satisfied. 455 + 456 + If the cookie indicates the inode is not cached: 457 + 458 + (1) The function will return -ENOBUFS. 459 + 460 + Else if there's a copy of the page resident in the cache: 461 + 462 + (1) The mark_pages_cached() cookie operation will be called on that page. 463 + 464 + (2) The function will submit a request to read the data from the cache's 465 + backing device directly into the page specified. 466 + 467 + (3) The function will return 0. 468 + 469 + (4) When the read is complete, end_io_func() will be invoked with: 470 + 471 + * The netfs data supplied when the cookie was created. 472 + 473 + * The page descriptor. 474 + 475 + * The context argument passed to the above function. This will be 476 + maintained with the get_context/put_context functions mentioned above. 477 + 478 + * An argument that's 0 on success or negative for an error code. 479 + 480 + If an error occurs, it should be assumed that the page contains no usable 481 + data. fscache_readpages_cancel() may need to be called. 
482 + 483 + end_io_func() will be called in process context if the read results in 484 + an error, but it might be called in interrupt context if the read is 485 + successful. 486 + 487 + Otherwise, if there's not a copy available in cache, but the cache may be able 488 + to store the page: 489 + 490 + (1) The mark_pages_cached() cookie operation will be called on that page. 491 + 492 + (2) A block may be reserved in the cache and attached to the object at the 493 + appropriate place. 494 + 495 + (3) The function will return -ENODATA. 496 + 497 + This function may also return -ENOMEM or -EINTR, in which case it won't have 498 + read any data from the cache. 499 + 500 + 501 + Page Allocate 502 + ------------- 503 + 504 + Alternatively, if there's not expected to be any data in the cache for a page 505 + because the file has been extended, a block can simply be allocated instead:: 506 + 507 + int fscache_alloc_page(struct fscache_cookie *cookie, 508 + struct page *page, 509 + gfp_t gfp); 510 + 511 + This is similar to the fscache_read_or_alloc_page() function, except that it 512 + never reads from the cache. It will return 0 if a block has been allocated, 513 + rather than -ENODATA as the other would. One or the other must be performed 514 + before writing to the cache. 515 + 516 + The mark_pages_cached() cookie operation will be called on the page if 517 + successful.
518 + 519 + 520 + Page Write 521 + ---------- 522 + 523 + Secondly, if the netfs changes the contents of the page (either due to an 524 + initial download or if a user performs a write), then the page should be 525 + written back to the cache:: 526 + 527 + int fscache_write_page(struct fscache_cookie *cookie, 528 + struct page *page, 529 + loff_t object_size, 530 + gfp_t gfp); 531 + 532 + The cookie argument must specify a data file cookie, the page specified should 533 + contain the data to be written (and is also used to specify the page number), 534 + object_size is the revised size of the object and the gfp argument is used to 535 + control how any memory allocations made are satisfied. 536 + 537 + The page must have first been read or allocated successfully and must not have 538 + been uncached before writing is performed. 539 + 540 + If the cookie indicates the inode is not cached then: 541 + 542 + (1) The function will return -ENOBUFS. 543 + 544 + Else if space can be allocated in the cache to hold this page: 545 + 546 + (1) PG_fscache_write will be set on the page. 547 + 548 + (2) The function will submit a request to write the data to cache's backing 549 + device directly from the page specified. 550 + 551 + (3) The function will return 0. 552 + 553 + (4) When the write is complete PG_fscache_write is cleared on the page and 554 + anyone waiting for that bit will be woken up. 555 + 556 + Else if there's no space available in the cache, -ENOBUFS will be returned. It 557 + is also possible for the PG_fscache_write bit to be cleared when no write took 558 + place if unforeseen circumstances arose (such as a disk error). 559 + 560 + Writing takes place asynchronously. 
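Editor's sketch (not part of the patch): the return values documented for fscache_read_or_alloc_page() suggest a small dispatch in the netfs's read path. The block below is a user-space illustration only; the enum and `classify_read_result()` are hypothetical names, and what a netfs does in the error case is its own policy, which this text does not prescribe.

```c
#include <errno.h>

/* Hypothetical follow-up actions for a netfs; not part of the FS-Cache API. */
enum netfs_next_step {
	NETFS_WAIT_FOR_COMPLETION,	/* 0: read dispatched, end_io_func() will be invoked */
	NETFS_FETCH_THEN_WRITE,		/* -ENODATA: no data in the cache; fetch from the
					 * server, then write the page to the cache */
	NETFS_FETCH_NO_CACHE,		/* -ENOBUFS: object not cached; fetch from the server only */
	NETFS_HANDLE_ERROR,		/* e.g. -ENOMEM or -EINTR: nothing was read from the
					 * cache; the fallback is the netfs's choice */
};

/* Map a return value of the read-or-alloc call to the next step. */
static enum netfs_next_step classify_read_result(int ret)
{
	switch (ret) {
	case 0:
		return NETFS_WAIT_FOR_COMPLETION;
	case -ENODATA:
		return NETFS_FETCH_THEN_WRITE;
	case -ENOBUFS:
		return NETFS_FETCH_NO_CACHE;
	default:
		return NETFS_HANDLE_ERROR;
	}
}
```

Only the -ENODATA branch leads to a later fscache_write_page() call, since that is the case in which the page has been allocated in the cache but holds no data yet.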
561 + 562 + 563 + Multiple Page Read 564 + ------------------ 565 + 566 + A facility is provided to read several pages at once, as requested by the 567 + readpages() address space operation:: 568 + 569 + int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, 570 + struct address_space *mapping, 571 + struct list_head *pages, 572 + int *nr_pages, 573 + fscache_rw_complete_t end_io_func, 574 + void *context, 575 + gfp_t gfp); 576 + 577 + This works in a similar way to fscache_read_or_alloc_page(), except: 578 + 579 + (1) Any page it can retrieve data for is removed from pages and nr_pages and 580 + dispatched for reading to the disk. Reads of adjacent pages on disk may 581 + be merged for greater efficiency. 582 + 583 + (2) The mark_pages_cached() cookie operation will be called on several pages 584 + at once if they're being read or allocated. 585 + 586 + (3) If there was a general error, then that error will be returned. 587 + 588 + Else if some pages couldn't be allocated or read, then -ENOBUFS will be 589 + returned. 590 + 591 + Else if some pages couldn't be read but were allocated, then -ENODATA will 592 + be returned. 593 + 594 + Otherwise, if all pages had reads dispatched, then 0 will be returned, the 595 + list will be empty and ``*nr_pages`` will be 0. 596 + 597 + (4) end_io_func will be called once for each page being read as the reads 598 + complete. It will be called in process context if error != 0, but it may 599 + be called in interrupt context if there is no error. 600 + 601 + Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude 602 + some of the pages being read and some being allocated. Those pages will have 603 + been marked appropriately and will need uncaching.
604 + 605 + 606 + Cancellation of Unread Pages 607 + ---------------------------- 608 + 609 + If one or more pages are passed to fscache_read_or_alloc_pages() but not then 610 + read from the cache and also not read from the underlying filesystem then 611 + those pages will need to have any marks and reservations removed. This can be 612 + done by calling:: 613 + 614 + void fscache_readpages_cancel(struct fscache_cookie *cookie, 615 + struct list_head *pages); 616 + 617 + prior to returning to the caller. The cookie argument should be as passed to 618 + fscache_read_or_alloc_pages(). Every page in the pages list will be examined 619 + and any that have PG_fscache set will be uncached. 620 + 621 + 622 + Page Uncaching 623 + ============== 624 + 625 + To uncache a page, this function should be called:: 626 + 627 + void fscache_uncache_page(struct fscache_cookie *cookie, 628 + struct page *page); 629 + 630 + This function permits the cache to release any in-memory representation it 631 + might be holding for this netfs page. This function must be called once for 632 + each page on which the read or write page functions above have been called to 633 + make sure the cache's in-memory tracking information gets torn down. 634 + 635 + Note that pages can't be explicitly deleted from a data file. The whole 636 + data file must be retired (see the relinquish cookie function below). 637 + 638 + Furthermore, note that this does not cancel the asynchronous read or write 639 + operation started by the read/alloc and write functions, so the page 640 + invalidation functions must use:: 641 + 642 + bool fscache_check_page_write(struct fscache_cookie *cookie, 643 + struct page *page); 644 + 645 + to see if a page is being written to the cache, and:: 646 + 647 + void fscache_wait_on_page_write(struct fscache_cookie *cookie, 648 + struct page *page); 649 + 650 + to wait for it to finish if it is.
651 + 652 + 653 + When releasepage() is being implemented, a special FS-Cache function exists to 654 + manage the heuristics of coping with vmscan trying to eject pages, which may 655 + conflict with the cache trying to write pages to the cache (which may itself 656 + need to allocate memory):: 657 + 658 + bool fscache_maybe_release_page(struct fscache_cookie *cookie, 659 + struct page *page, 660 + gfp_t gfp); 661 + 662 + This takes the netfs cookie, and the page and gfp arguments as supplied to 663 + releasepage(). It will return false if the page cannot be released yet for 664 + some reason and if it returns true, the page has been uncached and can now be 665 + released. 666 + 667 + To make a page available for release, this function may wait for an outstanding 668 + storage request to complete, or it may attempt to cancel the storage request - 669 + in which case the page will not be stored in the cache this time. 670 + 671 + 672 + Bulk Image Page Uncache 673 + ----------------------- 674 + 675 + A convenience routine is provided to perform an uncache on all the pages 676 + attached to an inode. This assumes that the pages on the inode correspond on a 677 + 1:1 basis with the pages in the cache:: 678 + 679 + void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie, 680 + struct inode *inode); 681 + 682 + This takes the netfs cookie that the pages were cached with and the inode that 683 + the pages are attached to. This function will wait for pages to finish being 684 + written to the cache and for the cache to finish with the page generally. No 685 + error is returned. 
686 + 687 + 688 + Index and Data File consistency 689 + =============================== 690 + 691 + To find out whether auxiliary data for an object is up to date within the 692 + cache, the following function can be called:: 693 + 694 + int fscache_check_consistency(struct fscache_cookie *cookie, 695 + const void *aux_data); 696 + 697 + This will call back to the netfs to check whether the auxiliary data associated 698 + with a cookie is correct; if aux_data is non-NULL, it will update the auxiliary 699 + data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also 700 + return -ENOMEM or -ERESTARTSYS. 701 + 702 + To request an update of the index data for an index or other object, the 703 + following function should be called:: 704 + 705 + void fscache_update_cookie(struct fscache_cookie *cookie, 706 + const void *aux_data); 707 + 708 + This function will update the cookie's auxiliary data buffer from aux_data if 709 + that is non-NULL and then schedule this to be stored on disk. The update 710 + method in the parent index definition will be called to transfer the data. 711 + 712 + Note that partial updates may happen automatically at other times, such as when 713 + data blocks are added to a data file object. 714 + 715 + 716 + Cookie Enablement 717 + ================= 718 + 719 + Cookies exist in one of two states: enabled and disabled. If a cookie is 720 + disabled, it ignores all attempts to acquire child cookies; check, update or 721 + invalidate its state; allocate, read or write backing pages - though it is 722 + still possible to uncache pages and relinquish the cookie. 723 + 724 + The initial enablement state is set by fscache_acquire_cookie(), but the cookie 725 + can be enabled or disabled later.
To disable a cookie, call:: 726 + 727 + void fscache_disable_cookie(struct fscache_cookie *cookie, 728 + const void *aux_data, 729 + bool invalidate); 730 + 731 + If the cookie is not already disabled, this locks the cookie against other 732 + enable and disable ops, marks the cookie as being disabled, discards or 733 + invalidates any backing objects and waits for cessation of activity on any 734 + associated object before unlocking the cookie. 735 + 736 + All possible failures are handled internally. The caller should consider 737 + calling fscache_uncache_all_inode_pages() afterwards to make sure all page 738 + markings are cleared up. 739 + 740 + Cookies can be enabled or reenabled with:: 741 + 742 + void fscache_enable_cookie(struct fscache_cookie *cookie, 743 + const void *aux_data, 744 + loff_t object_size, 745 + bool (*can_enable)(void *data), 746 + void *data) 747 + 748 + If the cookie is not already enabled, this locks the cookie against other 749 + enable and disable ops, invokes can_enable() and, if the cookie is not an index 750 + cookie, will begin the procedure of acquiring backing objects. 751 + 752 + The optional can_enable() function is passed the data argument and returns a 753 + ruling as to whether or not enablement should actually be permitted to begin. 754 + 755 + All possible failures are handled internally. The cookie will only be marked 756 + as enabled if provisional backing objects are allocated. 757 + 758 + The object's data size is updated from object_size and is passed to the 759 + ->check_aux() function. 760 + 761 + In both cases, the cookie's auxiliary data buffer is updated from aux_data if 762 + that is non-NULL inside the enablement lock before proceeding. 
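Editor's sketch (not part of the patch): a can_enable() callback only has to match the `bool (*can_enable)(void *data)` shape shown above and return a ruling. The block below is a user-space illustration under one assumed policy, namely refusing enablement while the file is open for writing. `struct example_file_state` and `example_can_enable()` are hypothetical names; the text above does not say which policy any particular netfs applies.

```c
#include <stdbool.h>

/* Hypothetical per-file state kept by the netfs; not a kernel structure. */
struct example_file_state {
	int open_for_write;	/* number of writers currently holding the file open */
};

/*
 * Matches the bool (*can_enable)(void *data) shape. The data pointer is
 * whatever the caller passed as the final argument when enabling the cookie.
 */
static bool example_can_enable(void *data)
{
	const struct example_file_state *state = data;

	/* Assumed policy: permit enablement only while nobody is writing. */
	return state->open_for_write == 0;
}
```

A netfs following this pattern would pass `example_can_enable` and a pointer to its own state as the last two arguments of fscache_enable_cookie().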
763 + 764 + 765 + Miscellaneous Cookie operations 766 + =============================== 767 + 768 + There are a number of operations that can be used to control cookies: 769 + 770 + * Cookie pinning:: 771 + 772 + int fscache_pin_cookie(struct fscache_cookie *cookie); 773 + void fscache_unpin_cookie(struct fscache_cookie *cookie); 774 + 775 + These operations permit data cookies to be pinned into the cache and to 776 + have the pinning removed. They are not permitted on index cookies. 777 + 778 + The pinning function will return 0 if successful, -ENOBUFS if the cookie 779 + isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning, 780 + -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or 781 + -EIO if there's any other problem. 782 + 783 + * Data space reservation:: 784 + 785 + int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); 786 + 787 + This permits a netfs to request cache space be reserved to store up to the 788 + given amount of a file. It is permitted to ask for more than the current 789 + size of the file to allow for future file expansion. 790 + 791 + If size is given as zero then the reservation will be cancelled. 792 + 793 + The function will return 0 if successful, -ENOBUFS if the cookie isn't 794 + backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations, 795 + -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or 796 + -EIO if there's any other problem. 797 + 798 + Note that this doesn't pin an object in a cache; it can still be culled to 799 + make space if it's not in use.
800 + 801 + 802 + Cookie Unregistration 803 + ===================== 804 + 805 + To get rid of a cookie, this function should be called:: 806 + 807 + void fscache_relinquish_cookie(struct fscache_cookie *cookie, 808 + const void *aux_data, 809 + bool retire); 810 + 811 + If retire is non-zero, then the object will be marked for recycling, and all 812 + copies of it will be removed from all active caches in which it is present. 813 + Not only that but all child objects will also be retired. 814 + 815 + If retire is zero, then the object may be available again when next the 816 + acquisition function is called. Retirement here will overrule the pinning on a 817 + cookie. 818 + 819 + The cookie's auxiliary data will be updated from aux_data if that is non-NULL 820 + so that the cache can lazily update it on disk. 821 + 822 + One very important note - relinquish must NOT be called for a cookie unless all 823 + the cookies for "child" indices, objects and pages have been relinquished 824 + first. 825 + 826 + 827 + Index Invalidation 828 + ================== 829 + 830 + There is no direct way to invalidate an index subtree. To do this, the caller 831 + should relinquish and retire the cookie they have, and then acquire a new one. 832 + 833 + 834 + Data File Invalidation 835 + ====================== 836 + 837 + Sometimes it will be necessary to invalidate an object that contains data. 838 + Typically this will be necessary when the server tells the netfs of a foreign 839 + change - at which point the netfs has to throw away all the state it had for an 840 + inode and reload from the server. 841 + 842 + To indicate that a cache object should be invalidated, the following function 843 + can be called:: 844 + 845 + void fscache_invalidate(struct fscache_cookie *cookie); 846 + 847 + This can be called with spinlocks held as it defers the work to a thread pool. 848 + All extant storage, retrieval and attribute change ops at this point are 849 + cancelled and discarded. 
Some future operations will be rejected until the 850 + cache has had a chance to insert a barrier in the operations queue. After 851 + that, operations will be queued again behind the invalidation operation. 852 + 853 + The invalidation operation will perform an attribute change operation and an 854 + auxiliary data update operation as it is very likely these will have changed. 855 + 856 + Using the following function, the netfs can wait for the invalidation operation 857 + to have reached a point at which it can start submitting ordinary operations 858 + once again:: 859 + 860 + void fscache_wait_on_invalidate(struct fscache_cookie *cookie); 861 + 862 + 863 + FS-Cache Specific Page Flag 864 + =========================== 865 + 866 + FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is 867 + given the alternative name PG_fscache. 868 + 869 + PG_fscache is used to indicate that the page is known by the cache, and that 870 + the cache must be informed if the page is going to go away. It's an indication 871 + to the netfs that the cache has an interest in this page, where an interest may 872 + be a pointer to it, resources allocated or reserved for it, or I/O in progress 873 + upon it. 874 + 875 + The netfs can use this information in methods such as releasepage() to 876 + determine whether it needs to uncache a page or update it. 877 + 878 + Furthermore, if this bit is set, releasepage() and invalidatepage() operations 879 + will be called on a page to get rid of it, even if PG_private is not set. This 880 + allows caching to be attempted on a page before read_cache_pages() is called 881 + after fscache_read_or_alloc_pages(), as the former will try to release pages it 882 + was given under certain circumstances. 883 + 884 + This bit does not overlap with other flags such as PG_private. This means that FS-Cache 885 + can be used with a filesystem that uses the block buffering code.
886 + 887 + There are a number of operations defined on this flag:: 888 + 889 + int PageFsCache(struct page *page); 890 + void SetPageFsCache(struct page *page) 891 + void ClearPageFsCache(struct page *page) 892 + int TestSetPageFsCache(struct page *page) 893 + int TestClearPageFsCache(struct page *page) 894 + 895 + These functions are bit test, bit set, bit clear, bit test and set and bit 896 + test and clear operations on PG_fscache.
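The five flag operations listed above can be modelled in user space to make their return values concrete. This is only a sketch under stated assumptions: the one-field struct page, the bit positions and the helpers needs_release_callback() and fscache_flag_demo() are hypothetical stand-ins for illustration, and the kernel's own definitions are atomic page-flag operations that differ from this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct page: just a flags word. */
struct page {
	unsigned long flags;
};

/* Assumed bit positions, for illustration only. */
enum {
	PG_private   = 0,
	PG_private_2 = 1,
	PG_fscache   = PG_private_2,	/* alternative name for PG_private_2 */
};

/* Bit test. */
static inline int PageFsCache(struct page *page)
{
	return (int)((page->flags >> PG_fscache) & 1UL);
}

/* Bit set (non-atomic in this model). */
static inline void SetPageFsCache(struct page *page)
{
	page->flags |= 1UL << PG_fscache;
}

/* Bit clear (non-atomic in this model). */
static inline void ClearPageFsCache(struct page *page)
{
	page->flags &= ~(1UL << PG_fscache);
}

/* Bit test and set: returns the value the bit had before it was set. */
static inline int TestSetPageFsCache(struct page *page)
{
	int old = PageFsCache(page);

	SetPageFsCache(page);
	return old;
}

/* Bit test and clear: returns the value the bit had before it was cleared. */
static inline int TestClearPageFsCache(struct page *page)
{
	int old = PageFsCache(page);

	ClearPageFsCache(page);
	return old;
}

/*
 * Hypothetical helper: models the statement that releasepage() and
 * invalidatepage() are called when PG_fscache is set, even if PG_private
 * is not.
 */
static inline bool needs_release_callback(struct page *page)
{
	return (page->flags & (1UL << PG_private)) || PageFsCache(page);
}

/* Walk one page through the operations; returns 0 if every step holds. */
static inline int fscache_flag_demo(void)
{
	struct page pg = { .flags = 0 };

	assert(!PageFsCache(&pg));
	assert(TestSetPageFsCache(&pg) == 0);	/* bit was clear, now set */
	assert(needs_release_callback(&pg));	/* PG_private is still clear */
	assert(TestClearPageFsCache(&pg) == 1);	/* bit was set, now clear */
	assert(!needs_release_callback(&pg));
	return 0;
}
```

Because PG_fscache occupies its own bit in this model, setting or clearing it never disturbs PG_private, which mirrors the point that the flag does not overlap with PG_private.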

-910

Documentation/filesystems/caching/netfs-api.txt

··· 1 - =============================== 2 - FS-CACHE NETWORK FILESYSTEM API 3 - =============================== 4 - 5 - There's an API by which a network filesystem can make use of the FS-Cache 6 - facilities. This is based around a number of principles: 7 - 8 - (1) Caches can store a number of different object types. There are two main 9 - object types: indices and files. The first is a special type used by 10 - FS-Cache to make finding objects faster and to make retiring of groups of 11 - objects easier. 12 - 13 - (2) Every index, file or other object is represented by a cookie. This cookie 14 - may or may not have anything associated with it, but the netfs doesn't 15 - need to care. 16 - 17 - (3) Barring the top-level index (one entry per cached netfs), the index 18 - hierarchy for each netfs is structured according the whim of the netfs. 19 - 20 - This API is declared in <linux/fscache.h>. 21 - 22 - This document contains the following sections: 23 - 24 - (1) Network filesystem definition 25 - (2) Index definition 26 - (3) Object definition 27 - (4) Network filesystem (un)registration 28 - (5) Cache tag lookup 29 - (6) Index registration 30 - (7) Data file registration 31 - (8) Miscellaneous object registration 32 - (9) Setting the data file size 33 - (10) Page alloc/read/write 34 - (11) Page uncaching 35 - (12) Index and data file consistency 36 - (13) Cookie enablement 37 - (14) Miscellaneous cookie operations 38 - (15) Cookie unregistration 39 - (16) Index invalidation 40 - (17) Data file invalidation 41 - (18) FS-Cache specific page flags. 42 - 43 - 44 - ============================= 45 - NETWORK FILESYSTEM DEFINITION 46 - ============================= 47 - 48 - FS-Cache needs a description of the network filesystem. This is specified 49 - using a record of the following structure: 50 - 51 - struct fscache_netfs { 52 - uint32_t version; 53 - const char *name; 54 - struct fscache_cookie *primary_index; 55 - ... 
56 - }; 57 - 58 - This first two fields should be filled in before registration, and the third 59 - will be filled in by the registration function; any other fields should just be 60 - ignored and are for internal use only. 61 - 62 - The fields are: 63 - 64 - (1) The name of the netfs (used as the key in the toplevel index). 65 - 66 - (2) The version of the netfs (if the name matches but the version doesn't, the 67 - entire in-cache hierarchy for this netfs will be scrapped and begun 68 - afresh). 69 - 70 - (3) The cookie representing the primary index will be allocated according to 71 - another parameter passed into the registration function. 72 - 73 - For example, kAFS (linux/fs/afs/) uses the following definitions to describe 74 - itself: 75 - 76 - struct fscache_netfs afs_cache_netfs = { 77 - .version = 0, 78 - .name = "afs", 79 - }; 80 - 81 - 82 - ================ 83 - INDEX DEFINITION 84 - ================ 85 - 86 - Indices are used for two purposes: 87 - 88 - (1) To aid the finding of a file based on a series of keys (such as AFS's 89 - "cell", "volume ID", "vnode ID"). 90 - 91 - (2) To make it easier to discard a subset of all the files cached based around 92 - a particular key - for instance to mirror the removal of an AFS volume. 93 - 94 - However, since it's unlikely that any two netfs's are going to want to define 95 - their index hierarchies in quite the same way, FS-Cache tries to impose as few 96 - restraints as possible on how an index is structured and where it is placed in 97 - the tree. The netfs can even mix indices and data files at the same level, but 98 - it's not recommended. 99 - 100 - Each index entry consists of a key of indeterminate length plus some auxiliary 101 - data, also of indeterminate length. 102 - 103 - There are some limits on indices: 104 - 105 - (1) Any index containing non-index objects should be restricted to a single 106 - cache. Any such objects created within an index will be created in the 107 - first cache only. 
The cache in which an index is created can be 108 - controlled by cache tags (see below). 109 - 110 - (2) The entry data must be atomically journallable, so it is limited to about 111 - 400 bytes at present. At least 400 bytes will be available. 112 - 113 - (3) The depth of the index tree should be judged with care as the search 114 - function is recursive. Too many layers will run the kernel out of stack. 115 - 116 - 117 - ================= 118 - OBJECT DEFINITION 119 - ================= 120 - 121 - To define an object, a structure of the following type should be filled out: 122 - 123 - struct fscache_cookie_def 124 - { 125 - uint8_t name[16]; 126 - uint8_t type; 127 - 128 - struct fscache_cache_tag *(*select_cache)( 129 - const void *parent_netfs_data, 130 - const void *cookie_netfs_data); 131 - 132 - enum fscache_checkaux (*check_aux)(void *cookie_netfs_data, 133 - const void *data, 134 - uint16_t datalen, 135 - loff_t object_size); 136 - 137 - void (*get_context)(void *cookie_netfs_data, void *context); 138 - 139 - void (*put_context)(void *cookie_netfs_data, void *context); 140 - 141 - void (*mark_pages_cached)(void *cookie_netfs_data, 142 - struct address_space *mapping, 143 - struct pagevec *cached_pvec); 144 - }; 145 - 146 - This has the following fields: 147 - 148 - (1) The type of the object [mandatory]. 149 - 150 - This is one of the following values: 151 - 152 - (*) FSCACHE_COOKIE_TYPE_INDEX 153 - 154 - This defines an index, which is a special FS-Cache type. 155 - 156 - (*) FSCACHE_COOKIE_TYPE_DATAFILE 157 - 158 - This defines an ordinary data file. 159 - 160 - (*) Any other value between 2 and 255 161 - 162 - This defines an extraordinary object such as an XATTR. 163 - 164 - (2) The name of the object type (NUL terminated unless all 16 chars are used) 165 - [optional]. 166 - 167 - (3) A function to select the cache in which to store an index [optional]. 
168 - 169 - This function is invoked when an index needs to be instantiated in a cache 170 - during the instantiation of a non-index object. Only the immediate index 171 - parent for the non-index object will be queried. Any indices above that 172 - in the hierarchy may be stored in multiple caches. This function does not 173 - need to be supplied for any non-index object or any index that will only 174 - have index children. 175 - 176 - If this function is not supplied or if it returns NULL then the first 177 - cache in the parent's list will be chosen, or failing that, the first 178 - cache in the master list. 179 - 180 - (4) A function to check the auxiliary data [optional]. 181 - 182 - This function will be called to check that a match found in the cache for 183 - this object is valid. For instance with AFS it could check the auxiliary 184 - data against the data version number returned by the server to determine 185 - whether the index entry in a cache is still valid. 186 - 187 - If this function is absent, it will be assumed that matching objects in a 188 - cache are always valid. 189 - 190 - The function is also passed the cache's idea of the object size and may 191 - use this to manage coherency also. 192 - 193 - If present, the function should return one of the following values: 194 - 195 - (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is 196 - (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update 197 - (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted 198 - 199 - This function can also be used to extract data from the auxiliary data in 200 - the cache and copy it into the netfs's structures. 201 - 202 - (5) A pair of functions to manage contexts for the completion callback 203 - [optional]. 204 - 205 - The cache read/write functions are passed a context which is then passed 206 - to the I/O completion callback function. 
To ensure this context remains 207 - valid until after the I/O completion is called, two functions may be 208 - provided: one to get an extra reference on the context, and one to drop a 209 - reference to it. 210 - 211 - If the context is not used or is a type of object that won't go out of 212 - scope, then these functions are not required. These functions are not 213 - required for indices as indices may not contain data. These functions may 214 - be called in interrupt context and so may not sleep. 215 - 216 - (6) A function to mark a page as retaining cache metadata [optional]. 217 - 218 - This is called by the cache to indicate that it is retaining in-memory 219 - information for this page and that the netfs should uncache the page when 220 - it has finished. This does not indicate whether there's data on the disk 221 - or not. Note that several pages at once may be presented for marking. 222 - 223 - The PG_fscache bit is set on the pages before this function would be 224 - called, so the function need not be provided if this is sufficient. 225 - 226 - This function is not required for indices as they're not permitted data. 227 - 228 - (7) A function to unmark all the pages retaining cache metadata [mandatory]. 229 - 230 - This is called by FS-Cache to indicate that a backing store is being 231 - unbound from a cookie and that all the marks on the pages should be 232 - cleared to prevent confusion. Note that the cache will have torn down all 233 - its tracking information so that the pages don't need to be explicitly 234 - uncached. 235 - 236 - This function is not required for indices as they're not permitted data. 237 - 238 - 239 - =================================== 240 - NETWORK FILESYSTEM (UN)REGISTRATION 241 - =================================== 242 - 243 - The first step is to declare the network filesystem to the cache. This also 244 - involves specifying the layout of the primary index (for AFS, this would be the 245 - "cell" level). 
246 - 247 - The registration function is: 248 - 249 - int fscache_register_netfs(struct fscache_netfs *netfs); 250 - 251 - It just takes a pointer to the netfs definition. It returns 0 or an error as 252 - appropriate. 253 - 254 - For kAFS, registration is done as follows: 255 - 256 - ret = fscache_register_netfs(&afs_cache_netfs); 257 - 258 - The last step is, of course, unregistration: 259 - 260 - void fscache_unregister_netfs(struct fscache_netfs *netfs); 261 - 262 - 263 - ================ 264 - CACHE TAG LOOKUP 265 - ================ 266 - 267 - FS-Cache permits the use of more than one cache. To permit particular index 268 - subtrees to be bound to particular caches, the second step is to look up cache 269 - representation tags. This step is optional; it can be left entirely up to 270 - FS-Cache as to which cache should be used. The problem with doing that is that 271 - FS-Cache will always pick the first cache that was registered. 272 - 273 - To get the representation for a named tag: 274 - 275 - struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); 276 - 277 - This takes a text string as the name and returns a representation of a tag. It 278 - will never return an error. It may return a dummy tag, however, if it runs out 279 - of memory; this will inhibit caching with this tag. 280 - 281 - Any representation so obtained must be released by passing it to this function: 282 - 283 - void fscache_release_cache_tag(struct fscache_cache_tag *tag); 284 - 285 - The tag will be retrieved by FS-Cache when it calls the object definition 286 - operation select_cache(). 287 - 288 - 289 - ================== 290 - INDEX REGISTRATION 291 - ================== 292 - 293 - The third step is to inform FS-Cache about part of an index hierarchy that can 294 - be used to locate files. 
This is done by requesting a cookie for each index in 295 - the path to the file: 296 - 297 - struct fscache_cookie * 298 - fscache_acquire_cookie(struct fscache_cookie *parent, 299 - const struct fscache_object_def *def, 300 - const void *index_key, 301 - size_t index_key_len, 302 - const void *aux_data, 303 - size_t aux_data_len, 304 - void *netfs_data, 305 - loff_t object_size, 306 - bool enable); 307 - 308 - This function creates an index entry in the index represented by parent, 309 - filling in the index entry by calling the operations pointed to by def. 310 - 311 - A unique key that represents the object within the parent must be pointed to by 312 - index_key and is of length index_key_len. 313 - 314 - An optional blob of auxiliary data that is to be stored within the cache can be 315 - pointed to with aux_data and should be of length aux_data_len. This would 316 - typically be used for storing coherency data. 317 - 318 - The netfs may pass an arbitrary value in netfs_data and this will be presented 319 - to it in the event of any calling back. This may also be used in tracing or 320 - logging of messages. 321 - 322 - The cache tracks the size of the data attached to an object and this set to be 323 - object_size. For indices, this should be 0. This value will be passed to the 324 - ->check_aux() callback. 325 - 326 - Note that this function never returns an error - all errors are handled 327 - internally. It may, however, return NULL to indicate no cookie. It is quite 328 - acceptable to pass this token back to this function as the parent to another 329 - acquisition (or even to the relinquish cookie, read page and write page 330 - functions - see below). 331 - 332 - Note also that no indices are actually created in a cache until a non-index 333 - object needs to be created somewhere down the hierarchy. Furthermore, an index 334 - may be created in several different caches independently at different times. 
335 - This is all handled transparently, and the netfs doesn't see any of it. 336 - 337 - A cookie will be created in the disabled state if enabled is false. A cookie 338 - must be enabled to do anything with it. A disabled cookie can be enabled by 339 - calling fscache_enable_cookie() (see below). 340 - 341 - For example, with AFS, a cell would be added to the primary index. This index 342 - entry would have a dependent inode containing volume mappings within this cell: 343 - 344 - cell->cache = 345 - fscache_acquire_cookie(afs_cache_netfs.primary_index, 346 - &afs_cell_cache_index_def, 347 - cell->name, strlen(cell->name), 348 - NULL, 0, 349 - cell, 0, true); 350 - 351 - And then a particular volume could be added to that index by ID, creating 352 - another index for vnodes (AFS inode equivalents): 353 - 354 - volume->cache = 355 - fscache_acquire_cookie(volume->cell->cache, 356 - &afs_volume_cache_index_def, 357 - &volume->vid, sizeof(volume->vid), 358 - NULL, 0, 359 - volume, 0, true); 360 - 361 - 362 - ====================== 363 - DATA FILE REGISTRATION 364 - ====================== 365 - 366 - The fourth step is to request a data file be created in the cache. This is 367 - identical to index cookie acquisition. The only difference is that the type in 368 - the object definition should be something other than index type. 369 - 370 - vnode->cache = 371 - fscache_acquire_cookie(volume->cache, 372 - &afs_vnode_cache_object_def, 373 - &key, sizeof(key), 374 - &aux, sizeof(aux), 375 - vnode, vnode->status.size, true); 376 - 377 - 378 - ================================= 379 - MISCELLANEOUS OBJECT REGISTRATION 380 - ================================= 381 - 382 - An optional step is to request an object of miscellaneous type be created in 383 - the cache. This is almost identical to index cookie acquisition. The only 384 - difference is that the type in the object definition should be something other 385 - than index type. 
While the parent object could be an index, it's more likely 386 - it would be some other type of object such as a data file. 387 - 388 - xattr->cache = 389 - fscache_acquire_cookie(vnode->cache, 390 - &afs_xattr_cache_object_def, 391 - &xattr->name, strlen(xattr->name), 392 - NULL, 0, 393 - xattr, strlen(xattr->val), true); 394 - 395 - Miscellaneous objects might be used to store extended attributes or directory 396 - entries for example. 397 - 398 - 399 - ========================== 400 - SETTING THE DATA FILE SIZE 401 - ========================== 402 - 403 - The fifth step is to set the physical attributes of the file, such as its size. 404 - This doesn't automatically reserve any space in the cache, but permits the 405 - cache to adjust its metadata for data tracking appropriately: 406 - 407 - int fscache_attr_changed(struct fscache_cookie *cookie); 408 - 409 - The cache will return -ENOBUFS if there is no backing cache or if there is no 410 - space to allocate any extra metadata required in the cache. 411 - 412 - Note that attempts to read or write data pages in the cache over this size may 413 - be rebuffed with -ENOBUFS. 414 - 415 - This operation schedules an attribute adjustment to happen asynchronously at 416 - some point in the future, and as such, it may happen after the function returns 417 - to the caller. The attribute adjustment excludes read and write operations. 418 - 419 - 420 - ===================== 421 - PAGE ALLOC/READ/WRITE 422 - ===================== 423 - 424 - And the sixth step is to store and retrieve pages in the cache. There are 425 - three functions that are used to do this. 426 - 427 - Note: 428 - 429 - (1) A page should not be re-read or re-allocated without uncaching it first. 430 - 431 - (2) A read or allocated page must be uncached when the netfs page is released 432 - from the pagecache. 433 - 434 - (3) A page should only be written to the cache if previous read or allocated. 
435 - 436 - This permits the cache to maintain its page tracking in proper order. 437 - 438 - 439 - PAGE READ 440 - --------- 441 - 442 - Firstly, the netfs should ask FS-Cache to examine the caches and read the 443 - contents cached for a particular page of a particular file if present, or else 444 - allocate space to store the contents if not: 445 - 446 - typedef 447 - void (*fscache_rw_complete_t)(struct page *page, 448 - void *context, 449 - int error); 450 - 451 - int fscache_read_or_alloc_page(struct fscache_cookie *cookie, 452 - struct page *page, 453 - fscache_rw_complete_t end_io_func, 454 - void *context, 455 - gfp_t gfp); 456 - 457 - The cookie argument must specify a cookie for an object that isn't an index, 458 - the page specified will have the data loaded into it (and is also used to 459 - specify the page number), and the gfp argument is used to control how any 460 - memory allocations made are satisfied. 461 - 462 - If the cookie indicates the inode is not cached: 463 - 464 - (1) The function will return -ENOBUFS. 465 - 466 - Else if there's a copy of the page resident in the cache: 467 - 468 - (1) The mark_pages_cached() cookie operation will be called on that page. 469 - 470 - (2) The function will submit a request to read the data from the cache's 471 - backing device directly into the page specified. 472 - 473 - (3) The function will return 0. 474 - 475 - (4) When the read is complete, end_io_func() will be invoked with: 476 - 477 - (*) The netfs data supplied when the cookie was created. 478 - 479 - (*) The page descriptor. 480 - 481 - (*) The context argument passed to the above function. This will be 482 - maintained with the get_context/put_context functions mentioned above. 483 - 484 - (*) An argument that's 0 on success or negative for an error code. 485 - 486 - If an error occurs, it should be assumed that the page contains no usable 487 - data. fscache_readpages_cancel() may need to be called. 
488 - 489 - end_io_func() will be called in process context if the read is results in 490 - an error, but it might be called in interrupt context if the read is 491 - successful. 492 - 493 - Otherwise, if there's not a copy available in cache, but the cache may be able 494 - to store the page: 495 - 496 - (1) The mark_pages_cached() cookie operation will be called on that page. 497 - 498 - (2) A block may be reserved in the cache and attached to the object at the 499 - appropriate place. 500 - 501 - (3) The function will return -ENODATA. 502 - 503 - This function may also return -ENOMEM or -EINTR, in which case it won't have 504 - read any data from the cache. 505 - 506 - 507 - PAGE ALLOCATE 508 - ------------- 509 - 510 - Alternatively, if there's not expected to be any data in the cache for a page 511 - because the file has been extended, a block can simply be allocated instead: 512 - 513 - int fscache_alloc_page(struct fscache_cookie *cookie, 514 - struct page *page, 515 - gfp_t gfp); 516 - 517 - This is similar to the fscache_read_or_alloc_page() function, except that it 518 - never reads from the cache. It will return 0 if a block has been allocated, 519 - rather than -ENODATA as the other would. One or the other must be performed 520 - before writing to the cache. 521 - 522 - The mark_pages_cached() cookie operation will be called on the page if 523 - successful. 
524 - 525 - 526 - PAGE WRITE 527 - ---------- 528 - 529 - Secondly, if the netfs changes the contents of the page (either due to an 530 - initial download or if a user performs a write), then the page should be 531 - written back to the cache: 532 - 533 - int fscache_write_page(struct fscache_cookie *cookie, 534 - struct page *page, 535 - loff_t object_size, 536 - gfp_t gfp); 537 - 538 - The cookie argument must specify a data file cookie, the page specified should 539 - contain the data to be written (and is also used to specify the page number), 540 - object_size is the revised size of the object and the gfp argument is used to 541 - control how any memory allocations made are satisfied. 542 - 543 - The page must have first been read or allocated successfully and must not have 544 - been uncached before writing is performed. 545 - 546 - If the cookie indicates the inode is not cached then: 547 - 548 - (1) The function will return -ENOBUFS. 549 - 550 - Else if space can be allocated in the cache to hold this page: 551 - 552 - (1) PG_fscache_write will be set on the page. 553 - 554 - (2) The function will submit a request to write the data to cache's backing 555 - device directly from the page specified. 556 - 557 - (3) The function will return 0. 558 - 559 - (4) When the write is complete PG_fscache_write is cleared on the page and 560 - anyone waiting for that bit will be woken up. 561 - 562 - Else if there's no space available in the cache, -ENOBUFS will be returned. It 563 - is also possible for the PG_fscache_write bit to be cleared when no write took 564 - place if unforeseen circumstances arose (such as a disk error). 565 - 566 - Writing takes place asynchronously. 
567 - 568 - 569 - MULTIPLE PAGE READ 570 - ------------------ 571 - 572 - A facility is provided to read several pages at once, as requested by the 573 - readpages() address space operation: 574 - 575 - int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, 576 - struct address_space *mapping, 577 - struct list_head *pages, 578 - int *nr_pages, 579 - fscache_rw_complete_t end_io_func, 580 - void *context, 581 - gfp_t gfp); 582 - 583 - This works in a similar way to fscache_read_or_alloc_page(), except: 584 - 585 - (1) Any page it can retrieve data for is removed from pages and nr_pages and 586 - dispatched for reading to the disk. Reads of adjacent pages on disk may 587 - be merged for greater efficiency. 588 - 589 - (2) The mark_pages_cached() cookie operation will be called on several pages 590 - at once if they're being read or allocated. 591 - 592 - (3) If there was an general error, then that error will be returned. 593 - 594 - Else if some pages couldn't be allocated or read, then -ENOBUFS will be 595 - returned. 596 - 597 - Else if some pages couldn't be read but were allocated, then -ENODATA will 598 - be returned. 599 - 600 - Otherwise, if all pages had reads dispatched, then 0 will be returned, the 601 - list will be empty and *nr_pages will be 0. 602 - 603 - (4) end_io_func will be called once for each page being read as the reads 604 - complete. It will be called in process context if error != 0, but it may 605 - be called in interrupt context if there is no error. 606 - 607 - Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude 608 - some of the pages being read and some being allocated. Those pages will have 609 - been marked appropriately and will need uncaching. 
610 - 611 - 612 - CANCELLATION OF UNREAD PAGES 613 - ---------------------------- 614 - 615 - If one or more pages are passed to fscache_read_or_alloc_pages() but not then 616 - read from the cache and also not read from the underlying filesystem then 617 - those pages will need to have any marks and reservations removed. This can be 618 - done by calling: 619 - 620 - void fscache_readpages_cancel(struct fscache_cookie *cookie, 621 - struct list_head *pages); 622 - 623 - prior to returning to the caller. The cookie argument should be as passed to 624 - fscache_read_or_alloc_pages(). Every page in the pages list will be examined 625 - and any that have PG_fscache set will be uncached. 626 - 627 - 628 - ============== 629 - PAGE UNCACHING 630 - ============== 631 - 632 - To uncache a page, this function should be called: 633 - 634 - void fscache_uncache_page(struct fscache_cookie *cookie, 635 - struct page *page); 636 - 637 - This function permits the cache to release any in-memory representation it 638 - might be holding for this netfs page. This function must be called once for 639 - each page on which the read or write page functions above have been called to 640 - make sure the cache's in-memory tracking information gets torn down. 641 - 642 - Note that pages can't be explicitly deleted from the a data file. The whole 643 - data file must be retired (see the relinquish cookie function below). 644 - 645 - Furthermore, note that this does not cancel the asynchronous read or write 646 - operation started by the read/alloc and write functions, so the page 647 - invalidation functions must use: 648 - 649 - bool fscache_check_page_write(struct fscache_cookie *cookie, 650 - struct page *page); 651 - 652 - to see if a page is being written to the cache, and: 653 - 654 - void fscache_wait_on_page_write(struct fscache_cookie *cookie, 655 - struct page *page); 656 - 657 - to wait for it to finish if it is. 
658 - 659 - 660 - When releasepage() is being implemented, a special FS-Cache function exists to 661 - manage the heuristics of coping with vmscan trying to eject pages, which may 662 - conflict with the cache trying to write pages to the cache (which may itself 663 - need to allocate memory): 664 - 665 - bool fscache_maybe_release_page(struct fscache_cookie *cookie, 666 - struct page *page, 667 - gfp_t gfp); 668 - 669 - This takes the netfs cookie, and the page and gfp arguments as supplied to 670 - releasepage(). It will return false if the page cannot be released yet for 671 - some reason and if it returns true, the page has been uncached and can now be 672 - released. 673 - 674 - To make a page available for release, this function may wait for an outstanding 675 - storage request to complete, or it may attempt to cancel the storage request - 676 - in which case the page will not be stored in the cache this time. 677 - 678 - 679 - BULK INODE PAGE UNCACHE 680 - ----------------------- 681 - 682 - A convenience routine is provided to perform an uncache on all the pages 683 - attached to an inode. This assumes that the pages on the inode correspond on a 684 - 1:1 basis with the pages in the cache. 685 - 686 - void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie, 687 - struct inode *inode); 688 - 689 - This takes the netfs cookie that the pages were cached with and the inode that 690 - the pages are attached to. This function will wait for pages to finish being 691 - written to the cache and for the cache to finish with the page generally. No 692 - error is returned. 
693 - 694 - 695 - =============================== 696 - INDEX AND DATA FILE CONSISTENCY 697 - =============================== 698 - 699 - To find out whether auxiliary data for an object is up to data within the 700 - cache, the following function can be called: 701 - 702 - int fscache_check_consistency(struct fscache_cookie *cookie, 703 - const void *aux_data); 704 - 705 - This will call back to the netfs to check whether the auxiliary data associated 706 - with a cookie is correct; if aux_data is non-NULL, it will update the auxiliary 707 - data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also 708 - return -ENOMEM and -ERESTARTSYS. 709 - 710 - To request an update of the index data for an index or other object, the 711 - following function should be called: 712 - 713 - void fscache_update_cookie(struct fscache_cookie *cookie, 714 - const void *aux_data); 715 - 716 - This function will update the cookie's auxiliary data buffer from aux_data if 717 - that is non-NULL and then schedule this to be stored on disk. The update 718 - method in the parent index definition will be called to transfer the data. 719 - 720 - Note that partial updates may happen automatically at other times, such as when 721 - data blocks are added to a data file object. 722 - 723 - 724 - ================= 725 - COOKIE ENABLEMENT 726 - ================= 727 - 728 - Cookies exist in one of two states: enabled and disabled. If a cookie is 729 - disabled, it ignores all attempts to acquire child cookies; check, update or 730 - invalidate its state; allocate, read or write backing pages - though it is 731 - still possible to uncache pages and relinquish the cookie. 732 - 733 - The initial enablement state is set by fscache_acquire_cookie(), but the cookie 734 - can be enabled or disabled later. 
To disable a cookie, call: 735 - 736 - void fscache_disable_cookie(struct fscache_cookie *cookie, 737 - const void *aux_data, 738 - bool invalidate); 739 - 740 - If the cookie is not already disabled, this locks the cookie against other 741 - enable and disable ops, marks the cookie as being disabled, discards or 742 - invalidates any backing objects and waits for cessation of activity on any 743 - associated object before unlocking the cookie. 744 - 745 - All possible failures are handled internally. The caller should consider 746 - calling fscache_uncache_all_inode_pages() afterwards to make sure all page 747 - markings are cleared up. 748 - 749 - Cookies can be enabled or reenabled with: 750 - 751 - void fscache_enable_cookie(struct fscache_cookie *cookie, 752 - const void *aux_data, 753 - loff_t object_size, 754 - bool (*can_enable)(void *data), 755 - void *data) 756 - 757 - If the cookie is not already enabled, this locks the cookie against other 758 - enable and disable ops, invokes can_enable() and, if the cookie is not an index 759 - cookie, will begin the procedure of acquiring backing objects. 760 - 761 - The optional can_enable() function is passed the data argument and returns a 762 - ruling as to whether or not enablement should actually be permitted to begin. 763 - 764 - All possible failures are handled internally. The cookie will only be marked 765 - as enabled if provisional backing objects are allocated. 766 - 767 - The object's data size is updated from object_size and is passed to the 768 - ->check_aux() function. 769 - 770 - In both cases, the cookie's auxiliary data buffer is updated from aux_data if 771 - that is non-NULL inside the enablement lock before proceeding. 
772 - 773 - 774 - =============================== 775 - MISCELLANEOUS COOKIE OPERATIONS 776 - =============================== 777 - 778 - There are a number of operations that can be used to control cookies: 779 - 780 - (*) Cookie pinning: 781 - 782 - int fscache_pin_cookie(struct fscache_cookie *cookie); 783 - void fscache_unpin_cookie(struct fscache_cookie *cookie); 784 - 785 - These operations permit data cookies to be pinned into the cache and to 786 - have the pinning removed. They are not permitted on index cookies. 787 - 788 - The pinning function will return 0 if successful, -ENOBUFS in the cookie 789 - isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning, 790 - -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or 791 - -EIO if there's any other problem. 792 - 793 - (*) Data space reservation: 794 - 795 - int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); 796 - 797 - This permits a netfs to request cache space be reserved to store up to the 798 - given amount of a file. It is permitted to ask for more than the current 799 - size of the file to allow for future file expansion. 800 - 801 - If size is given as zero then the reservation will be cancelled. 802 - 803 - The function will return 0 if successful, -ENOBUFS in the cookie isn't 804 - backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations, 805 - -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or 806 - -EIO if there's any other problem. 807 - 808 - Note that this doesn't pin an object in a cache; it can still be culled to 809 - make space if it's not in use. 810 - 811 - 812 - ===================== 813 - COOKIE UNREGISTRATION 814 - ===================== 815 - 816 - To get rid of a cookie, this function should be called. 
817 - 818 - void fscache_relinquish_cookie(struct fscache_cookie *cookie, 819 - const void *aux_data, 820 - bool retire); 821 - 822 - If retire is non-zero, then the object will be marked for recycling, and all 823 - copies of it will be removed from all active caches in which it is present. 824 - Not only that but all child objects will also be retired. 825 - 826 - If retire is zero, then the object may be available again when next the 827 - acquisition function is called. Retirement here will overrule the pinning on a 828 - cookie. 829 - 830 - The cookie's auxiliary data will be updated from aux_data if that is non-NULL 831 - so that the cache can lazily update it on disk. 832 - 833 - One very important note - relinquish must NOT be called for a cookie unless all 834 - the cookies for "child" indices, objects and pages have been relinquished 835 - first. 836 - 837 - 838 - ================== 839 - INDEX INVALIDATION 840 - ================== 841 - 842 - There is no direct way to invalidate an index subtree. To do this, the caller 843 - should relinquish and retire the cookie they have, and then acquire a new one. 844 - 845 - 846 - ====================== 847 - DATA FILE INVALIDATION 848 - ====================== 849 - 850 - Sometimes it will be necessary to invalidate an object that contains data. 851 - Typically this will be necessary when the server tells the netfs of a foreign 852 - change - at which point the netfs has to throw away all the state it had for an 853 - inode and reload from the server. 854 - 855 - To indicate that a cache object should be invalidated, the following function 856 - can be called: 857 - 858 - void fscache_invalidate(struct fscache_cookie *cookie); 859 - 860 - This can be called with spinlocks held as it defers the work to a thread pool. 861 - All extant storage, retrieval and attribute change ops at this point are 862 - cancelled and discarded. 
Some future operations will be rejected until the 863 - cache has had a chance to insert a barrier in the operations queue. After 864 - that, operations will be queued again behind the invalidation operation. 865 - 866 - The invalidation operation will perform an attribute change operation and an 867 - auxiliary data update operation as it is very likely these will have changed. 868 - 869 - Using the following function, the netfs can wait for the invalidation operation 870 - to have reached a point at which it can start submitting ordinary operations 871 - once again: 872 - 873 - void fscache_wait_on_invalidate(struct fscache_cookie *cookie); 874 - 875 - 876 - =========================== 877 - FS-CACHE SPECIFIC PAGE FLAG 878 - =========================== 879 - 880 - FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is 881 - given the alternative name PG_fscache. 882 - 883 - PG_fscache is used to indicate that the page is known by the cache, and that 884 - the cache must be informed if the page is going to go away. It's an indication 885 - to the netfs that the cache has an interest in this page, where an interest may 886 - be a pointer to it, resources allocated or reserved for it, or I/O in progress 887 - upon it. 888 - 889 - The netfs can use this information in methods such as releasepage() to 890 - determine whether it needs to uncache a page or update it. 891 - 892 - Furthermore, if this bit is set, releasepage() and invalidatepage() operations 893 - will be called on a page to get rid of it, even if PG_private is not set. This 894 - allows caching to attempted on a page before read_cache_pages() to be called 895 - after fscache_read_or_alloc_pages() as the former will try and release pages it 896 - was given under certain circumstances. 897 - 898 - This bit does not overlap with such as PG_private. This means that FS-Cache 899 - can be used with a filesystem that uses the block buffering code. 
900 - 901 - There are a number of operations defined on this flag: 902 - 903 - int PageFsCache(struct page *page); 904 - void SetPageFsCache(struct page *page) 905 - void ClearPageFsCache(struct page *page) 906 - int TestSetPageFsCache(struct page *page) 907 - int TestClearPageFsCache(struct page *page) 908 - 909 - These functions are bit test, bit set, bit clear, bit test and set and bit 910 - test and clear operations on PG_fscache.
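Reviewer note: the five PG_fscache helpers listed in the removed text above follow the usual test / set / clear / test-and-set / test-and-clear pattern. The user-space sketch below illustrates what each variant returns. Every name in it (`toy_page`, `TOY_PG_FSCACHE`, the `toy_*` functions) is hypothetical, the bit number is arbitrary, and the operations are deliberately non-atomic, so it is a model of the semantics rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit number; the kernel aliases PG_fscache to PG_private_2. */
#define TOY_PG_FSCACHE 3

/* Hypothetical stand-in for struct page: only the flags word is modelled. */
struct toy_page {
	unsigned long flags;
};

/* Bit test: is the page known to the cache? */
bool toy_page_fscache(const struct toy_page *page)
{
	return (page->flags >> TOY_PG_FSCACHE) & 1UL;
}

/* Bit set. */
void toy_set_page_fscache(struct toy_page *page)
{
	page->flags |= 1UL << TOY_PG_FSCACHE;
}

/* Bit clear. */
void toy_clear_page_fscache(struct toy_page *page)
{
	page->flags &= ~(1UL << TOY_PG_FSCACHE);
}

/* Bit test and set: returns the value the bit had before it was set. */
bool toy_test_set_page_fscache(struct toy_page *page)
{
	bool old = toy_page_fscache(page);

	toy_set_page_fscache(page);
	return old;
}

/* Bit test and clear: returns the value the bit had before it was cleared. */
bool toy_test_clear_page_fscache(struct toy_page *page)
{
	bool old = toy_page_fscache(page);

	toy_clear_page_fscache(page);
	return old;
}
```

The test-and-modify variants return the previous value, which is what lets a caller tell whether it was the one that changed the flag. Other bits in the flags word are left untouched, matching the statement that PG_fscache does not overlap with PG_private.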

+313

Documentation/filesystems/caching/object.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ==================================================== 4 + In-Kernel Cache Object Representation and Management 5 + ==================================================== 6 + 7 + By: David Howells <dhowells@redhat.com> 8 + 9 + .. Contents: 10 + 11 + (*) Representation 12 + 13 + (*) Object management state machine. 14 + 15 + - Provision of cpu time. 16 + - Locking simplification. 17 + 18 + (*) The set of states. 19 + 20 + (*) The set of events. 21 + 22 + 23 + Representation 24 + ============== 25 + 26 + FS-Cache maintains an in-kernel representation of each object that a netfs is 27 + currently interested in. Such objects are represented by the fscache_cookie 28 + struct and are referred to as cookies. 29 + 30 + FS-Cache also maintains a separate in-kernel representation of the objects that 31 + a cache backend is currently actively caching. Such objects are represented by 32 + the fscache_object struct. The cache backends allocate these upon request, and 33 + are expected to embed them in their own representations. These are referred to 34 + as objects. 35 + 36 + There is a 1:N relationship between cookies and objects. A cookie may be 37 + represented by multiple objects - an index may exist in more than one cache - 38 + or even by no objects (it may not be cached). 39 + 40 + Furthermore, both cookies and objects are hierarchical. 
The two hierarchies 41 + correspond, but the cookies tree is a superset of the union of the object trees 42 + of multiple caches:: 43 + 44 + NETFS INDEX TREE : CACHE 1 : CACHE 2 45 + : : 46 + : +-----------+ : 47 + +----------->| IObject | : 48 + +-----------+ | : +-----------+ : 49 + | ICookie |-------+ : | : 50 + +-----------+ | : | : +-----------+ 51 + | +------------------------------>| IObject | 52 + | : | : +-----------+ 53 + | : V : | 54 + | : +-----------+ : | 55 + V +----------->| IObject | : | 56 + +-----------+ | : +-----------+ : | 57 + | ICookie |-------+ : | : V 58 + +-----------+ | : | : +-----------+ 59 + | +------------------------------>| IObject | 60 + +-----+-----+ : | : +-----------+ 61 + | | : | : | 62 + V | : V : | 63 + +-----------+ | : +-----------+ : | 64 + | ICookie |------------------------->| IObject | : | 65 + +-----------+ | : +-----------+ : | 66 + | V : | : V 67 + | +-----------+ : | : +-----------+ 68 + | | ICookie |-------------------------------->| IObject | 69 + | +-----------+ : | : +-----------+ 70 + V | : V : | 71 + +-----------+ | : +-----------+ : | 72 + | DCookie |------------------------->| DObject | : | 73 + +-----------+ | : +-----------+ : | 74 + | : : | 75 + +-------+-------+ : : | 76 + | | : : | 77 + V V : : V 78 + +-----------+ +-----------+ : : +-----------+ 79 + | DCookie | | DCookie |------------------------>| DObject | 80 + +-----------+ +-----------+ : : +-----------+ 81 + : : 82 + 83 + In the above illustration, ICookie and IObject represent indices and DCookie 84 + and DObject represent data storage objects. Indices may have representation in 85 + multiple caches, but currently, non-index objects may not. Objects of any type 86 + may also be entirely unrepresented. 87 + 88 + As far as the netfs API goes, the netfs is only actually permitted to see 89 + pointers to the cookies. The cookies themselves and any objects attached to 90 + those cookies are hidden from it. 
91 + 92 + 93 + Object Management State Machine 94 + =============================== 95 + 96 + Within FS-Cache, each active object is managed by its own individual state 97 + machine. The state for an object is kept in the fscache_object struct, in 98 + object->state. A cookie may point to a set of objects that are in different 99 + states. 100 + 101 + Each state has an action associated with it that is invoked when the machine 102 + wakes up in that state. There are four logical sets of states: 103 + 104 + (1) Preparation: states that wait for the parent objects to become ready. The 105 + representations are hierarchical, and it is expected that an object must 106 + be created or accessed with respect to its parent object. 107 + 108 + (2) Initialisation: states that perform lookups in the cache and validate 109 + what's found and that create on disk any missing metadata. 110 + 111 + (3) Normal running: states that allow netfs operations on objects to proceed 112 + and that update the state of objects. 113 + 114 + (4) Termination: states that detach objects from their netfs cookies, that 115 + delete objects from disk, that handle disk and system errors and that free 116 + up in-memory resources. 117 + 118 + 119 + In most cases, transitioning between states is in response to signalled events. 120 + When a state has finished processing, it will usually set the mask of events in 121 + which it is interested (object->event_mask) and relinquish the worker thread. 122 + Then when an event is raised (by calling fscache_raise_event()), if the event 123 + is not masked, the object will be queued for processing (by calling 124 + fscache_enqueue_object()). 125 + 126 + 127 + Provision of CPU Time 128 + --------------------- 129 + 130 + The work to be done by the various states was given CPU time by the threads of 131 + the slow work facility. 
This was used in preference to the workqueue facility 132 + because: 133 + 134 + (1) Threads may be completely occupied for very long periods of time by a 135 + particular work item. These state actions may be doing sequences of 136 + synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, 137 + getxattr, truncate, unlink, rmdir, rename). 138 + 139 + (2) Threads may do little actual work, but may rather spend a lot of time 140 + sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded 141 + workqueues don't necessarily have the right numbers of threads. 142 + 143 + 144 + Locking Simplification 145 + ---------------------- 146 + 147 + Because only one worker thread may be operating on any particular object's 148 + state machine at once, this simplifies the locking, particularly with respect 149 + to disconnecting the netfs's representation of a cache object (fscache_cookie) 150 + from the cache backend's representation (fscache_object) - which may be 151 + requested from either end. 152 + 153 + 154 + The Set of States 155 + ================= 156 + 157 + The object state machine has a set of states that it can be in. There are 158 + preparation states in which the object sets itself up and waits for its parent 159 + object to transit to a state that allows access to its children: 160 + 161 + (1) State FSCACHE_OBJECT_INIT. 162 + 163 + Initialise the object and wait for the parent object to become active. In 164 + the cache, it is expected that it will not be possible to look an object 165 + up from the parent object, until that parent object itself has been looked 166 + up. 167 + 168 + There are initialisation states in which the object sets itself up and accesses 169 + disk for the object metadata: 170 + 171 + (2) State FSCACHE_OBJECT_LOOKING_UP. 172 + 173 + Look up the object on disk, using the parent as a starting point. 
174 + FS-Cache expects the cache backend to probe the cache to see whether this 175 + object is represented there, and if it is, to see if it's valid (coherency 176 + management). 177 + 178 + The cache should call fscache_object_lookup_negative() to indicate lookup 179 + failure for whatever reason, and should call fscache_obtained_object() to 180 + indicate success. 181 + 182 + At the completion of lookup, FS-Cache will let the netfs go ahead with 183 + read operations, no matter whether the file is yet cached. If not yet 184 + cached, read operations will be immediately rejected with ENODATA until 185 + the first known page is uncached - as to that point there can be no data 186 + to be read out of the cache for that file that isn't currently also held 187 + in the pagecache. 188 + 189 + (3) State FSCACHE_OBJECT_CREATING. 190 + 191 + Create an object on disk, using the parent as a starting point. This 192 + happens if the lookup failed to find the object, or if the object's 193 + coherency data indicated what's on disk is out of date. In this state, 194 + FS-Cache expects the cache to create 195 + 196 + The cache should call fscache_obtained_object() if creation completes 197 + successfully, fscache_object_lookup_negative() otherwise. 198 + 199 + At the completion of creation, FS-Cache will start processing write 200 + operations the netfs has queued for an object. If creation failed, the 201 + write ops will be transparently discarded, and nothing recorded in the 202 + cache. 203 + 204 + There are some normal running states in which the object spends its time 205 + servicing netfs requests: 206 + 207 + (4) State FSCACHE_OBJECT_AVAILABLE. 208 + 209 + A transient state in which pending operations are started, child objects 210 + are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary 211 + lookup data is freed. 212 + 213 + (5) State FSCACHE_OBJECT_ACTIVE. 214 + 215 + The normal running state. 
In this state, requests the netfs makes will be 216 + passed on to the cache. 217 + 218 + (6) State FSCACHE_OBJECT_INVALIDATING. 219 + 220 + The object is undergoing invalidation. When the state comes here, it 221 + discards all pending read, write and attribute change operations as it is 222 + going to clear out the cache entirely and reinitialise it. It will then 223 + continue to the FSCACHE_OBJECT_UPDATING state. 224 + 225 + (7) State FSCACHE_OBJECT_UPDATING. 226 + 227 + The state machine comes here to update the object in the cache from the 228 + netfs's records. This involves updating the auxiliary data that is used 229 + to maintain coherency. 230 + 231 + And there are terminal states in which an object cleans itself up, deallocates 232 + memory and potentially deletes stuff from disk: 233 + 234 + (8) State FSCACHE_OBJECT_LC_DYING. 235 + 236 + The object comes here if it is dying because of a lookup or creation 237 + error. This would be due to a disk error or system error of some sort. 238 + Temporary data is cleaned up, and the parent is released. 239 + 240 + (9) State FSCACHE_OBJECT_DYING. 241 + 242 + The object comes here if it is dying due to an error, because its parent 243 + cookie has been relinquished by the netfs or because the cache is being 244 + withdrawn. 245 + 246 + Any child objects waiting on this one are given CPU time so that they too 247 + can destroy themselves. This object waits for all its children to go away 248 + before advancing to the next state. 249 + 250 + (10) State FSCACHE_OBJECT_ABORT_INIT. 251 + 252 + The object comes to this state if it was waiting on its parent in 253 + FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself 254 + so that the parent may proceed from the FSCACHE_OBJECT_DYING state. 255 + 256 + (11) State FSCACHE_OBJECT_RELEASING. 257 + (12) State FSCACHE_OBJECT_RECYCLING. 
258 + 259 + The object comes to one of these two states when dying once it is rid of 260 + all its children, if it is dying because the netfs relinquished its 261 + cookie. In the first state, the cached data is expected to persist, and 262 + in the second it will be deleted. 263 + 264 + (13) State FSCACHE_OBJECT_WITHDRAWING. 265 + 266 + The object transits to this state if the cache decides it wants to 267 + withdraw the object from service, perhaps to make space, but also due to 268 + error or just because the whole cache is being withdrawn. 269 + 270 + (14) State FSCACHE_OBJECT_DEAD. 271 + 272 + The object transits to this state when the in-memory object record is 273 + ready to be deleted. The object processor shouldn't ever see an object in 274 + this state. 275 + 276 + 277 + The Set of Events 278 + ----------------- 279 + 280 + There are a number of events that can be raised to an object state machine: 281 + 282 + FSCACHE_OBJECT_EV_UPDATE 283 + The netfs requested that an object be updated. The state machine will ask 284 + the cache backend to update the object, and the cache backend will ask the 285 + netfs for details of the change through its cookie definition ops. 286 + 287 + FSCACHE_OBJECT_EV_CLEARED 288 + This is signalled in two circumstances: 289 + 290 + (a) when an object's last child object is dropped and 291 + 292 + (b) when the last operation outstanding on an object is completed. 293 + 294 + This is used to proceed from the dying state. 295 + 296 + FSCACHE_OBJECT_EV_ERROR 297 + This is signalled when an I/O error occurs during the processing of some 298 + object. 299 + 300 + FSCACHE_OBJECT_EV_RELEASE, FSCACHE_OBJECT_EV_RETIRE 301 + These are signalled when the netfs relinquishes a cookie it was using. 302 + The event selected depends on whether the netfs asks for the backing 303 + object to be retired (deleted) or retained. 304 + 305 + FSCACHE_OBJECT_EV_WITHDRAW 306 + This is signalled when the cache backend wants to withdraw an object. 
307 + This means that the object will have to be detached from the netfs's 308 + cookie. 309 + 310 + Because the withdrawing and releasing/retiring events are all handled by the object 311 + state machine, it doesn't matter if there's a collision with both ends trying 312 + to sever the connection at the same time. The state machine can just pick 313 + which one it wants to honour, and that effects the other.
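Reviewer note: object.rst describes how a state sets object->event_mask when it finishes and how a raised event only queues the object if that event is not masked. A small user-space model of that gating is sketched here. It is an illustration under stated assumptions: `toy_object`, `toy_wait_for()`, `toy_raise_event()` and the `TOY_EV_*` values are hypothetical names, and the real logic sits behind fscache_raise_event() and fscache_enqueue_object(), which are not reproduced.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event numbers; the kernel's FSCACHE_OBJECT_EV_* values may differ. */
enum toy_event {
	TOY_EV_UPDATE,
	TOY_EV_CLEARED,
	TOY_EV_ERROR,
	TOY_EV_WITHDRAW,
	TOY_NR_EVENTS
};

/* Hypothetical stand-in for the event-related part of struct fscache_object. */
struct toy_object {
	unsigned long events;     /* events raised and not yet consumed */
	unsigned long event_mask; /* events the current state is interested in */
	bool queued;              /* true once handed to a worker thread */
};

/*
 * Called by a state that has finished processing: record which events
 * it wants to be woken for and give up the worker thread.
 */
void toy_wait_for(struct toy_object *obj, unsigned long mask)
{
	obj->event_mask = mask;
	obj->queued = false;
}

/*
 * Raise an event.  The event is always recorded, but the object is only
 * queued for processing if the current state asked for that event and the
 * object is not already queued.  Returns true if this call queued it.
 */
bool toy_raise_event(struct toy_object *obj, enum toy_event ev)
{
	unsigned long bit;

	assert(ev < TOY_NR_EVENTS);
	bit = 1UL << ev;
	obj->events |= bit;

	if ((obj->event_mask & bit) && !obj->queued) {
		obj->queued = true;
		return true;
	}
	return false;
}
```

Recording the event even when it is masked means a later state can still see it once it widens its mask. The single `queued` flag mirrors the rule that only one worker thread operates on an object's state machine at a time.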

-320

Documentation/filesystems/caching/object.txt

··· 1 - ==================================================== 2 - IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT 3 - ==================================================== 4 - 5 - By: David Howells <dhowells@redhat.com> 6 - 7 - Contents: 8 - 9 - (*) Representation 10 - 11 - (*) Object management state machine. 12 - 13 - - Provision of cpu time. 14 - - Locking simplification. 15 - 16 - (*) The set of states. 17 - 18 - (*) The set of events. 19 - 20 - 21 - ============== 22 - REPRESENTATION 23 - ============== 24 - 25 - FS-Cache maintains an in-kernel representation of each object that a netfs is 26 - currently interested in. Such objects are represented by the fscache_cookie 27 - struct and are referred to as cookies. 28 - 29 - FS-Cache also maintains a separate in-kernel representation of the objects that 30 - a cache backend is currently actively caching. Such objects are represented by 31 - the fscache_object struct. The cache backends allocate these upon request, and 32 - are expected to embed them in their own representations. These are referred to 33 - as objects. 34 - 35 - There is a 1:N relationship between cookies and objects. A cookie may be 36 - represented by multiple objects - an index may exist in more than one cache - 37 - or even by no objects (it may not be cached). 38 - 39 - Furthermore, both cookies and objects are hierarchical. 
The two hierarchies 40 - correspond, but the cookies tree is a superset of the union of the object trees 41 - of multiple caches: 42 - 43 - NETFS INDEX TREE : CACHE 1 : CACHE 2 44 - : : 45 - : +-----------+ : 46 - +----------->| IObject | : 47 - +-----------+ | : +-----------+ : 48 - | ICookie |-------+ : | : 49 - +-----------+ | : | : +-----------+ 50 - | +------------------------------>| IObject | 51 - | : | : +-----------+ 52 - | : V : | 53 - | : +-----------+ : | 54 - V +----------->| IObject | : | 55 - +-----------+ | : +-----------+ : | 56 - | ICookie |-------+ : | : V 57 - +-----------+ | : | : +-----------+ 58 - | +------------------------------>| IObject | 59 - +-----+-----+ : | : +-----------+ 60 - | | : | : | 61 - V | : V : | 62 - +-----------+ | : +-----------+ : | 63 - | ICookie |------------------------->| IObject | : | 64 - +-----------+ | : +-----------+ : | 65 - | V : | : V 66 - | +-----------+ : | : +-----------+ 67 - | | ICookie |-------------------------------->| IObject | 68 - | +-----------+ : | : +-----------+ 69 - V | : V : | 70 - +-----------+ | : +-----------+ : | 71 - | DCookie |------------------------->| DObject | : | 72 - +-----------+ | : +-----------+ : | 73 - | : : | 74 - +-------+-------+ : : | 75 - | | : : | 76 - V V : : V 77 - +-----------+ +-----------+ : : +-----------+ 78 - | DCookie | | DCookie |------------------------>| DObject | 79 - +-----------+ +-----------+ : : +-----------+ 80 - : : 81 - 82 - In the above illustration, ICookie and IObject represent indices and DCookie 83 - and DObject represent data storage objects. Indices may have representation in 84 - multiple caches, but currently, non-index objects may not. Objects of any type 85 - may also be entirely unrepresented. 86 - 87 - As far as the netfs API goes, the netfs is only actually permitted to see 88 - pointers to the cookies. The cookies themselves and any objects attached to 89 - those cookies are hidden from it. 
90 - 91 - 92 - =============================== 93 - OBJECT MANAGEMENT STATE MACHINE 94 - =============================== 95 - 96 - Within FS-Cache, each active object is managed by its own individual state 97 - machine. The state for an object is kept in the fscache_object struct, in 98 - object->state. A cookie may point to a set of objects that are in different 99 - states. 100 - 101 - Each state has an action associated with it that is invoked when the machine 102 - wakes up in that state. There are four logical sets of states: 103 - 104 - (1) Preparation: states that wait for the parent objects to become ready. The 105 - representations are hierarchical, and it is expected that an object must 106 - be created or accessed with respect to its parent object. 107 - 108 - (2) Initialisation: states that perform lookups in the cache and validate 109 - what's found and that create on disk any missing metadata. 110 - 111 - (3) Normal running: states that allow netfs operations on objects to proceed 112 - and that update the state of objects. 113 - 114 - (4) Termination: states that detach objects from their netfs cookies, that 115 - delete objects from disk, that handle disk and system errors and that free 116 - up in-memory resources. 117 - 118 - 119 - In most cases, transitioning between states is in response to signalled events. 120 - When a state has finished processing, it will usually set the mask of events in 121 - which it is interested (object->event_mask) and relinquish the worker thread. 122 - Then when an event is raised (by calling fscache_raise_event()), if the event 123 - is not masked, the object will be queued for processing (by calling 124 - fscache_enqueue_object()). 125 - 126 - 127 - PROVISION OF CPU TIME 128 - --------------------- 129 - 130 - The work to be done by the various states was given CPU time by the threads of 131 - the slow work facility. 
This was used in preference to the workqueue facility 132 - because: 133 - 134 - (1) Threads may be completely occupied for very long periods of time by a 135 - particular work item. These state actions may be doing sequences of 136 - synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, 137 - getxattr, truncate, unlink, rmdir, rename). 138 - 139 - (2) Threads may do little actual work, but may rather spend a lot of time 140 - sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded 141 - workqueues don't necessarily have the right numbers of threads. 142 - 143 - 144 - LOCKING SIMPLIFICATION 145 - ---------------------- 146 - 147 - Because only one worker thread may be operating on any particular object's 148 - state machine at once, this simplifies the locking, particularly with respect 149 - to disconnecting the netfs's representation of a cache object (fscache_cookie) 150 - from the cache backend's representation (fscache_object) - which may be 151 - requested from either end. 152 - 153 - 154 - ================= 155 - THE SET OF STATES 156 - ================= 157 - 158 - The object state machine has a set of states that it can be in. There are 159 - preparation states in which the object sets itself up and waits for its parent 160 - object to transit to a state that allows access to its children: 161 - 162 - (1) State FSCACHE_OBJECT_INIT. 163 - 164 - Initialise the object and wait for the parent object to become active. In 165 - the cache, it is expected that it will not be possible to look an object 166 - up from the parent object, until that parent object itself has been looked 167 - up. 168 - 169 - There are initialisation states in which the object sets itself up and accesses 170 - disk for the object metadata: 171 - 172 - (2) State FSCACHE_OBJECT_LOOKING_UP. 173 - 174 - Look up the object on disk, using the parent as a starting point. 
175 - FS-Cache expects the cache backend to probe the cache to see whether this 176 - object is represented there, and if it is, to see if it's valid (coherency 177 - management). 178 - 179 - The cache should call fscache_object_lookup_negative() to indicate lookup 180 - failure for whatever reason, and should call fscache_obtained_object() to 181 - indicate success. 182 - 183 - At the completion of lookup, FS-Cache will let the netfs go ahead with 184 - read operations, no matter whether the file is yet cached. If not yet 185 - cached, read operations will be immediately rejected with ENODATA until 186 - the first known page is uncached - as to that point there can be no data 187 - to be read out of the cache for that file that isn't currently also held 188 - in the pagecache. 189 - 190 - (3) State FSCACHE_OBJECT_CREATING. 191 - 192 - Create an object on disk, using the parent as a starting point. This 193 - happens if the lookup failed to find the object, or if the object's 194 - coherency data indicated what's on disk is out of date. In this state, 195 - FS-Cache expects the cache to create 196 - 197 - The cache should call fscache_obtained_object() if creation completes 198 - successfully, fscache_object_lookup_negative() otherwise. 199 - 200 - At the completion of creation, FS-Cache will start processing write 201 - operations the netfs has queued for an object. If creation failed, the 202 - write ops will be transparently discarded, and nothing recorded in the 203 - cache. 204 - 205 - There are some normal running states in which the object spends its time 206 - servicing netfs requests: 207 - 208 - (4) State FSCACHE_OBJECT_AVAILABLE. 209 - 210 - A transient state in which pending operations are started, child objects 211 - are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary 212 - lookup data is freed. 213 - 214 - (5) State FSCACHE_OBJECT_ACTIVE. 215 - 216 - The normal running state. 
In this state, requests the netfs makes will be 217 - passed on to the cache. 218 - 219 - (6) State FSCACHE_OBJECT_INVALIDATING. 220 - 221 - The object is undergoing invalidation. When the state comes here, it 222 - discards all pending read, write and attribute change operations as it is 223 - going to clear out the cache entirely and reinitialise it. It will then 224 - continue to the FSCACHE_OBJECT_UPDATING state. 225 - 226 - (7) State FSCACHE_OBJECT_UPDATING. 227 - 228 - The state machine comes here to update the object in the cache from the 229 - netfs's records. This involves updating the auxiliary data that is used 230 - to maintain coherency. 231 - 232 - And there are terminal states in which an object cleans itself up, deallocates 233 - memory and potentially deletes stuff from disk: 234 - 235 - (8) State FSCACHE_OBJECT_LC_DYING. 236 - 237 - The object comes here if it is dying because of a lookup or creation 238 - error. This would be due to a disk error or system error of some sort. 239 - Temporary data is cleaned up, and the parent is released. 240 - 241 - (9) State FSCACHE_OBJECT_DYING. 242 - 243 - The object comes here if it is dying due to an error, because its parent 244 - cookie has been relinquished by the netfs or because the cache is being 245 - withdrawn. 246 - 247 - Any child objects waiting on this one are given CPU time so that they too 248 - can destroy themselves. This object waits for all its children to go away 249 - before advancing to the next state. 250 - 251 - (10) State FSCACHE_OBJECT_ABORT_INIT. 252 - 253 - The object comes to this state if it was waiting on its parent in 254 - FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself 255 - so that the parent may proceed from the FSCACHE_OBJECT_DYING state. 256 - 257 - (11) State FSCACHE_OBJECT_RELEASING. 258 - (12) State FSCACHE_OBJECT_RECYCLING. 
259 - 260 - The object comes to one of these two states when dying once it is rid of 261 - all its children, if it is dying because the netfs relinquished its 262 - cookie. In the first state, the cached data is expected to persist, and 263 - in the second it will be deleted. 264 - 265 - (13) State FSCACHE_OBJECT_WITHDRAWING. 266 - 267 - The object transits to this state if the cache decides it wants to 268 - withdraw the object from service, perhaps to make space, but also due to 269 - error or just because the whole cache is being withdrawn. 270 - 271 - (14) State FSCACHE_OBJECT_DEAD. 272 - 273 - The object transits to this state when the in-memory object record is 274 - ready to be deleted. The object processor shouldn't ever see an object in 275 - this state. 276 - 277 - 278 - THE SET OF EVENTS 279 - ----------------- 280 - 281 - There are a number of events that can be raised to an object state machine: 282 - 283 - (*) FSCACHE_OBJECT_EV_UPDATE 284 - 285 - The netfs requested that an object be updated. The state machine will ask 286 - the cache backend to update the object, and the cache backend will ask the 287 - netfs for details of the change through its cookie definition ops. 288 - 289 - (*) FSCACHE_OBJECT_EV_CLEARED 290 - 291 - This is signalled in two circumstances: 292 - 293 - (a) when an object's last child object is dropped and 294 - 295 - (b) when the last operation outstanding on an object is completed. 296 - 297 - This is used to proceed from the dying state. 298 - 299 - (*) FSCACHE_OBJECT_EV_ERROR 300 - 301 - This is signalled when an I/O error occurs during the processing of some 302 - object. 303 - 304 - (*) FSCACHE_OBJECT_EV_RELEASE 305 - (*) FSCACHE_OBJECT_EV_RETIRE 306 - 307 - These are signalled when the netfs relinquishes a cookie it was using. 308 - The event selected depends on whether the netfs asks for the backing 309 - object to be retired (deleted) or retained. 
310 - 311 - (*) FSCACHE_OBJECT_EV_WITHDRAW 312 - 313 - This is signalled when the cache backend wants to withdraw an object. 314 - This means that the object will have to be detached from the netfs's 315 - cookie. 316 - 317 - Because the withdrawing releasing/retiring events are all handled by the object 318 - state machine, it doesn't matter if there's a collision with both ends trying 319 - to sever the connection at the same time. The state machine can just pick 320 - which one it wants to honour, and that effects the other.

+210

Documentation/filesystems/caching/operations.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ================================ 4 + Asynchronous Operations Handling 5 + ================================ 6 + 7 + By: David Howells <dhowells@redhat.com> 8 + 9 + .. Contents: 10 + 11 + (*) Overview. 12 + 13 + (*) Operation record initialisation. 14 + 15 + (*) Parameters. 16 + 17 + (*) Procedure. 18 + 19 + (*) Asynchronous callback. 20 + 21 + 22 + Overview 23 + ======== 24 + 25 + FS-Cache has an asynchronous operations handling facility that it uses for its 26 + data storage and retrieval routines. Its operations are represented by 27 + fscache_operation structs, though these are usually embedded into some other 28 + structure. 29 + 30 + This facility is available to and expected to be used by the cache backends, 31 + and FS-Cache will create operations and pass them off to the appropriate cache 32 + backend for completion. 33 + 34 + To make use of this facility, <linux/fscache-cache.h> should be #included. 35 + 36 + 37 + Operation Record Initialisation 38 + =============================== 39 + 40 + An operation is recorded in an fscache_operation struct:: 41 + 42 + struct fscache_operation { 43 + union { 44 + struct work_struct fast_work; 45 + struct slow_work slow_work; 46 + }; 47 + unsigned long flags; 48 + fscache_operation_processor_t processor; 49 + ... 50 + }; 51 + 52 + Someone wanting to issue an operation should allocate something with this 53 + struct embedded in it. They should initialise it by calling:: 54 + 55 + void fscache_operation_init(struct fscache_operation *op, 56 + fscache_operation_release_t release); 57 + 58 + with the operation to be initialised and the release function to use. 59 + 60 + The op->flags parameter should be set to indicate the CPU time provision and 61 + the exclusivity (see the Parameters section). 62 + 63 + The op->fast_work, op->slow_work and op->processor fields should be set as 64 + appropriate for the CPU time provision (see the Parameters section). 
65 + 66 + FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the 67 + operation and waited for afterwards. 68 + 69 + 70 + Parameters 71 + ========== 72 + 73 + There are a number of parameters that can be set in the operation record's flag 74 + parameter. There are three options for the provision of CPU time in these 75 + operations: 76 + 77 + (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread 78 + may decide it wants to handle an operation itself without deferring it to 79 + another thread. 80 + 81 + This is, for example, used in read operations for calling readpages() on 82 + the backing filesystem in CacheFiles. Although readpages() does an 83 + asynchronous data fetch, the determination of whether pages exist is done 84 + synchronously - and the netfs does not proceed until this has been 85 + determined. 86 + 87 + If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags 88 + before submitting the operation, and the operating thread must wait for it 89 + to be cleared before proceeding:: 90 + 91 + wait_on_bit(&op->flags, FSCACHE_OP_WAITING, 92 + TASK_UNINTERRUPTIBLE); 93 + 94 + 95 + (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it 96 + will be given to keventd to process. Such an operation is not permitted 97 + to sleep on I/O. 98 + 99 + This is, for example, used by CacheFiles to copy data from a backing fs 100 + page to a netfs page after the backing fs has read the page in. 101 + 102 + If this option is used, op->fast_work and op->processor must be 103 + initialised before submitting the operation:: 104 + 105 + INIT_WORK(&op->fast_work, do_some_work); 106 + 107 + 108 + (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it 109 + will be given to the slow work facility to process. Such an operation is 110 + permitted to sleep on I/O. 
111 + 112 + This is, for example, used by FS-Cache to handle background writes of 113 + pages that have just been fetched from a remote server. 114 + 115 + If this option is used, op->slow_work and op->processor must be 116 + initialised before submitting the operation:: 117 + 118 + fscache_operation_init_slow(op, processor) 119 + 120 + 121 + Furthermore, operations may be one of two types: 122 + 123 + (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in 124 + conjunction with any other operation on the object being operated upon. 125 + 126 + An example of this is the attribute change operation, in which the file 127 + being written to may need truncation. 128 + 129 + (2) Shareable. Operations of this type may be running simultaneously. It's 130 + up to the operation implementation to prevent interference between other 131 + operations running at the same time. 132 + 133 + 134 + Procedure 135 + ========= 136 + 137 + Operations are used through the following procedure: 138 + 139 + (1) The submitting thread must allocate the operation and initialise it 140 + itself. Normally this would be part of a more specific structure with the 141 + generic op embedded within. 142 + 143 + (2) The submitting thread must then submit the operation for processing using 144 + one of the following two functions:: 145 + 146 + int fscache_submit_op(struct fscache_object *object, 147 + struct fscache_operation *op); 148 + 149 + int fscache_submit_exclusive_op(struct fscache_object *object, 150 + struct fscache_operation *op); 151 + 152 + The first function should be used to submit non-exclusive ops and the 153 + second to submit exclusive ones. The caller must still set the 154 + FSCACHE_OP_EXCLUSIVE flag. 155 + 156 + If successful, both functions will assign the operation to the specified 157 + object and return 0. -ENOBUFS will be returned if the object specified is 158 + permanently unavailable. 
159 + 160 + The operation manager will defer operations on an object that is still 161 + undergoing lookup or creation. The operation will also be deferred if an 162 + operation of conflicting exclusivity is in progress on the object. 163 + 164 + If the operation is asynchronous, the manager will retain a reference to 165 + it, so the caller should put their reference to it by passing it to:: 166 + 167 + void fscache_put_operation(struct fscache_operation *op); 168 + 169 + (3) If the submitting thread wants to do the work itself, and has marked the 170 + operation with FSCACHE_OP_MYTHREAD, then it should monitor 171 + FSCACHE_OP_WAITING as described above and check the state of the object if 172 + necessary (the object might have died while the thread was waiting). 173 + 174 + When it has finished doing its processing, it should call 175 + fscache_op_complete() and fscache_put_operation() on it. 176 + 177 + (4) The operation holds an effective lock upon the object, preventing other 178 + exclusive ops conflicting until it is released. The operation can be 179 + enqueued for further immediate asynchronous processing by adjusting the 180 + CPU time provisioning option if necessary, eg:: 181 + 182 + op->flags &= ~FSCACHE_OP_TYPE; 183 + op->flags |= FSCACHE_OP_FAST; 184 + 185 + and calling:: 186 + 187 + void fscache_enqueue_operation(struct fscache_operation *op) 188 + 189 + This can be used to allow other things to have use of the worker thread 190 + pools. 191 + 192 + 193 + Asynchronous Callback 194 + ===================== 195 + 196 + When used in asynchronous mode, the worker thread pool will invoke the 197 + processor method with a pointer to the operation. This should then get at the 198 + container struct by using container_of():: 199 + 200 + static void fscache_write_op(struct fscache_operation *_op) 201 + { 202 + struct fscache_storage *op = 203 + container_of(_op, struct fscache_storage, op); 204 + ... 
205 + } 206 + 207 + The caller holds a reference on the operation, and will invoke 208 + fscache_put_operation() when the processor function returns. The processor 209 + function is at liberty to call fscache_enqueue_operation() or to take extra 210 + references.
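
The embedding and callback pattern described in operations.rst can be tried outside the kernel with a small mock. The sketch below is only an illustration under stated assumptions: every `mock_` and `MOCK_` name is hypothetical, the flag values are made up rather than taken from <linux/fscache-cache.h>, and `container_of()` is redefined locally with `offsetof()`. It shows an operation embedded in a larger record, the CPU time option being switched by clearing the type mask and OR-ing in the new type bit, and a processor function recovering its container.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's container_of(); an assumption for this sketch. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Illustrative flag layout only; these are NOT the kernel's values. */
#define MOCK_OP_TYPE      0x000fUL	/* mask covering the CPU time options */
#define MOCK_OP_FAST      0x0001UL
#define MOCK_OP_SLOW      0x0002UL
#define MOCK_OP_EXCLUSIVE 0x0010UL

/* Hypothetical generic operation record. */
struct mock_operation {
	unsigned long flags;
	void (*processor)(struct mock_operation *op);
};

/* Hypothetical specific record with the generic op embedded in it. */
struct mock_storage {
	int pages_written;
	struct mock_operation op;
};

/* Switch the CPU time option while leaving all other flag bits untouched. */
static void mock_set_type(struct mock_operation *op, unsigned long type)
{
	op->flags &= ~MOCK_OP_TYPE;	/* clear the old type */
	op->flags |= type;		/* set the new type bit itself */
}

/* Processor callback: gets the generic op, recovers the container. */
static void mock_write_op(struct mock_operation *_op)
{
	struct mock_storage *st = container_of(_op, struct mock_storage, op);

	st->pages_written++;
}

/* Drive one operation the way a worker thread would; returns the page count. */
static int mock_demo(void)
{
	struct mock_storage st = { 0 };

	st.op.flags = MOCK_OP_SLOW | MOCK_OP_EXCLUSIVE;
	st.op.processor = mock_write_op;
	mock_set_type(&st.op, MOCK_OP_FAST);

	assert(st.op.processor != NULL);
	st.op.processor(&st.op);
	return st.pages_written;
}
```

Calling `mock_demo()` from a `main()` runs the callback once. Keeping the type bits under a single mask is what lets the option be changed without disturbing the exclusivity bit.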

-213

Documentation/filesystems/caching/operations.txt

··· 1 - ================================ 2 - ASYNCHRONOUS OPERATIONS HANDLING 3 - ================================ 4 - 5 - By: David Howells <dhowells@redhat.com> 6 - 7 - Contents: 8 - 9 - (*) Overview. 10 - 11 - (*) Operation record initialisation. 12 - 13 - (*) Parameters. 14 - 15 - (*) Procedure. 16 - 17 - (*) Asynchronous callback. 18 - 19 - 20 - ======== 21 - OVERVIEW 22 - ======== 23 - 24 - FS-Cache has an asynchronous operations handling facility that it uses for its 25 - data storage and retrieval routines. Its operations are represented by 26 - fscache_operation structs, though these are usually embedded into some other 27 - structure. 28 - 29 - This facility is available to and expected to be be used by the cache backends, 30 - and FS-Cache will create operations and pass them off to the appropriate cache 31 - backend for completion. 32 - 33 - To make use of this facility, <linux/fscache-cache.h> should be #included. 34 - 35 - 36 - =============================== 37 - OPERATION RECORD INITIALISATION 38 - =============================== 39 - 40 - An operation is recorded in an fscache_operation struct: 41 - 42 - struct fscache_operation { 43 - union { 44 - struct work_struct fast_work; 45 - struct slow_work slow_work; 46 - }; 47 - unsigned long flags; 48 - fscache_operation_processor_t processor; 49 - ... 50 - }; 51 - 52 - Someone wanting to issue an operation should allocate something with this 53 - struct embedded in it. They should initialise it by calling: 54 - 55 - void fscache_operation_init(struct fscache_operation *op, 56 - fscache_operation_release_t release); 57 - 58 - with the operation to be initialised and the release function to use. 59 - 60 - The op->flags parameter should be set to indicate the CPU time provision and 61 - the exclusivity (see the Parameters section). 62 - 63 - The op->fast_work, op->slow_work and op->processor flags should be set as 64 - appropriate for the CPU time provision (see the Parameters section). 
65 - 66 - FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the 67 - operation and waited for afterwards. 68 - 69 - 70 - ========== 71 - PARAMETERS 72 - ========== 73 - 74 - There are a number of parameters that can be set in the operation record's flag 75 - parameter. There are three options for the provision of CPU time in these 76 - operations: 77 - 78 - (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread 79 - may decide it wants to handle an operation itself without deferring it to 80 - another thread. 81 - 82 - This is, for example, used in read operations for calling readpages() on 83 - the backing filesystem in CacheFiles. Although readpages() does an 84 - asynchronous data fetch, the determination of whether pages exist is done 85 - synchronously - and the netfs does not proceed until this has been 86 - determined. 87 - 88 - If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags 89 - before submitting the operation, and the operating thread must wait for it 90 - to be cleared before proceeding: 91 - 92 - wait_on_bit(&op->flags, FSCACHE_OP_WAITING, 93 - TASK_UNINTERRUPTIBLE); 94 - 95 - 96 - (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it 97 - will be given to keventd to process. Such an operation is not permitted 98 - to sleep on I/O. 99 - 100 - This is, for example, used by CacheFiles to copy data from a backing fs 101 - page to a netfs page after the backing fs has read the page in. 102 - 103 - If this option is used, op->fast_work and op->processor must be 104 - initialised before submitting the operation: 105 - 106 - INIT_WORK(&op->fast_work, do_some_work); 107 - 108 - 109 - (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it 110 - will be given to the slow work facility to process. Such an operation is 111 - permitted to sleep on I/O. 
112 - 113 - This is, for example, used by FS-Cache to handle background writes of 114 - pages that have just been fetched from a remote server. 115 - 116 - If this option is used, op->slow_work and op->processor must be 117 - initialised before submitting the operation: 118 - 119 - fscache_operation_init_slow(op, processor) 120 - 121 - 122 - Furthermore, operations may be one of two types: 123 - 124 - (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in 125 - conjunction with any other operation on the object being operated upon. 126 - 127 - An example of this is the attribute change operation, in which the file 128 - being written to may need truncation. 129 - 130 - (2) Shareable. Operations of this type may be running simultaneously. It's 131 - up to the operation implementation to prevent interference between other 132 - operations running at the same time. 133 - 134 - 135 - ========= 136 - PROCEDURE 137 - ========= 138 - 139 - Operations are used through the following procedure: 140 - 141 - (1) The submitting thread must allocate the operation and initialise it 142 - itself. Normally this would be part of a more specific structure with the 143 - generic op embedded within. 144 - 145 - (2) The submitting thread must then submit the operation for processing using 146 - one of the following two functions: 147 - 148 - int fscache_submit_op(struct fscache_object *object, 149 - struct fscache_operation *op); 150 - 151 - int fscache_submit_exclusive_op(struct fscache_object *object, 152 - struct fscache_operation *op); 153 - 154 - The first function should be used to submit non-exclusive ops and the 155 - second to submit exclusive ones. The caller must still set the 156 - FSCACHE_OP_EXCLUSIVE flag. 157 - 158 - If successful, both functions will assign the operation to the specified 159 - object and return 0. -ENOBUFS will be returned if the object specified is 160 - permanently unavailable. 
161 - 162 - The operation manager will defer operations on an object that is still 163 - undergoing lookup or creation. The operation will also be deferred if an 164 - operation of conflicting exclusivity is in progress on the object. 165 - 166 - If the operation is asynchronous, the manager will retain a reference to 167 - it, so the caller should put their reference to it by passing it to: 168 - 169 - void fscache_put_operation(struct fscache_operation *op); 170 - 171 - (3) If the submitting thread wants to do the work itself, and has marked the 172 - operation with FSCACHE_OP_MYTHREAD, then it should monitor 173 - FSCACHE_OP_WAITING as described above and check the state of the object if 174 - necessary (the object might have died while the thread was waiting). 175 - 176 - When it has finished doing its processing, it should call 177 - fscache_op_complete() and fscache_put_operation() on it. 178 - 179 - (4) The operation holds an effective lock upon the object, preventing other 180 - exclusive ops conflicting until it is released. The operation can be 181 - enqueued for further immediate asynchronous processing by adjusting the 182 - CPU time provisioning option if necessary, eg: 183 - 184 - op->flags &= ~FSCACHE_OP_TYPE; 185 - op->flags |= ~FSCACHE_OP_FAST; 186 - 187 - and calling: 188 - 189 - void fscache_enqueue_operation(struct fscache_operation *op) 190 - 191 - This can be used to allow other things to have use of the worker thread 192 - pools. 193 - 194 - 195 - ===================== 196 - ASYNCHRONOUS CALLBACK 197 - ===================== 198 - 199 - When used in asynchronous mode, the worker thread pool will invoke the 200 - processor method with a pointer to the operation. This should then get at the 201 - container struct by using container_of(): 202 - 203 - static void fscache_write_op(struct fscache_operation *_op) 204 - { 205 - struct fscache_storage *op = 206 - container_of(_op, struct fscache_storage, op); 207 - ... 
208 - } 209 - 210 - The caller holds a reference on the operation, and will invoke 211 - fscache_put_operation() when the processor function returns. The processor 212 - function is at liberty to call fscache_enqueue_operation() or to take extra 213 - references.

+105

Documentation/filesystems/cifs/cifsroot.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =========================================== 4 + Mounting root file system via SMB (cifs.ko) 5 + =========================================== 6 + 7 + Written 2019 by Paulo Alcantara <palcantara@suse.de> 8 + 9 + Written 2019 by Aurelien Aptel <aaptel@suse.com> 10 + 11 + The CONFIG_CIFS_ROOT option enables experimental root file system 12 + support over the SMB protocol via cifs.ko. 13 + 14 + It introduces a new kernel command-line option called 'cifsroot=' 15 + which will tell the kernel to mount the root file system over the 16 + network by utilizing SMB or CIFS protocol. 17 + 18 + In order to mount, the network stack will also need to be set up by 19 + using 'ip=' config option. For more details, see 20 + Documentation/admin-guide/nfs/nfsroot.rst. 21 + 22 + A CIFS root mount currently requires the use of SMB1+UNIX Extensions 23 + which is only supported by the Samba server. SMB1 is the older 24 + deprecated version of the protocol but it has been extended to support 25 + POSIX features (See [1]). The equivalent extensions for the newer 26 + recommended version of the protocol (SMB3) have not been fully 27 + implemented yet which means SMB3 doesn't support some required POSIX 28 + file system objects (e.g. block devices, pipes, sockets). 29 + 30 + As a result, a CIFS root will default to SMB1 for now but the version 31 + to use can nonetheless be changed via the 'vers=' mount option. This 32 + default will change once the SMB3 POSIX extensions are fully 33 + implemented. 
34 + 35 + Server configuration 36 + ==================== 37 + 38 + To enable SMB1+UNIX extensions you will need to set these global 39 + settings in Samba smb.conf:: 40 + 41 + [global] 42 + server min protocol = NT1 43 + unix extensions = yes # default 44 + 45 + Kernel command line 46 + =================== 47 + 48 + :: 49 + 50 + root=/dev/cifs 51 + 52 + This is just a virtual device that basically tells the kernel to mount 53 + the root file system via SMB protocol. 54 + 55 + :: 56 + 57 + cifsroot=//<server-ip>/<share>[,options] 58 + 59 + Enables the kernel to mount, via SMB, the root file system that is 60 + located at the <server-ip> and <share> specified in this option. 61 + 62 + The default mount options are set in fs/cifs/cifsroot.c. 63 + 64 + server-ip 65 + IPv4 address of the server. 66 + 67 + share 68 + Path to SMB share (rootfs). 69 + 70 + options 71 + Optional mount options. For more information, see mount.cifs(8). 72 + 73 + Examples 74 + ======== 75 + 76 + Export root file system as a Samba share in smb.conf file:: 77 + 78 + ... 79 + [linux] 80 + path = /path/to/rootfs 81 + read only = no 82 + guest ok = yes 83 + force user = root 84 + force group = root 85 + browseable = yes 86 + writeable = yes 87 + admin users = root 88 + public = yes 89 + create mask = 0777 90 + directory mask = 0777 91 + ... 92 + 93 + Restart smb service:: 94 + 95 + # systemctl restart smb 96 + 97 + Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and 98 + CONFIG_IP_PNP options enabled:: 99 + 100 + # qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \ 101 + -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \ 102 + -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3" 103 + 104 + 105 + 1: https://wiki.samba.org/index.php/UNIX_Extensions
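
To make the `cifsroot=//<server-ip>/<share>[,options]` syntax from cifsroot.rst concrete, here is a small userspace sketch that splits such a value into its three parts. It is not the parser in fs/cifs/cifsroot.c; `struct cifsroot_parts`, `parse_cifsroot()` and the buffer sizes are hypothetical names and limits chosen for illustration, and the sketch does not validate that the server part is an IPv4 address.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the pieces of a cifsroot= value. */
struct cifsroot_parts {
	char server[64];	/* text between "//" and the next "/" */
	char share[128];	/* text up to the first "," (or the end) */
	char options[256];	/* everything after the first ",", may be empty */
};

/*
 * Split "//<server-ip>/<share>[,options]" into its parts.
 * Returns 0 on success, -1 if the value is malformed or too long.
 */
static int parse_cifsroot(const char *arg, struct cifsroot_parts *out)
{
	const char *p, *slash, *comma;
	size_t n;

	if (strncmp(arg, "//", 2) != 0)
		return -1;

	/* Server: from after "//" up to the next "/". */
	p = arg + 2;
	slash = strchr(p, '/');
	if (!slash || slash == p)
		return -1;
	n = (size_t)(slash - p);
	if (n >= sizeof(out->server))
		return -1;
	memcpy(out->server, p, n);
	out->server[n] = '\0';

	/* Share: from after that "/" up to the first "," or end of string. */
	p = slash + 1;
	comma = strchr(p, ',');
	n = comma ? (size_t)(comma - p) : strlen(p);
	if (n == 0 || n >= sizeof(out->share))
		return -1;
	memcpy(out->share, p, n);
	out->share[n] = '\0';

	/* Options: optional, passed through unparsed. */
	out->options[0] = '\0';
	if (comma) {
		if (strlen(comma + 1) >= sizeof(out->options))
			return -1;
		strcpy(out->options, comma + 1);
	}
	return 0;
}
```

With the value from the QEMU example, `//10.0.2.2/linux,username=foo,password=bar`, this yields server `10.0.2.2`, share `linux` and options `username=foo,password=bar`. The options are left as one string because they are ordinary mount options, as described in mount.cifs(8).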

-97

Documentation/filesystems/cifs/cifsroot.txt

··· 1 - Mounting root file system via SMB (cifs.ko) 2 - =========================================== 3 - 4 - Written 2019 by Paulo Alcantara <palcantara@suse.de> 5 - Written 2019 by Aurelien Aptel <aaptel@suse.com> 6 - 7 - The CONFIG_CIFS_ROOT option enables experimental root file system 8 - support over the SMB protocol via cifs.ko. 9 - 10 - It introduces a new kernel command-line option called 'cifsroot=' 11 - which will tell the kernel to mount the root file system over the 12 - network by utilizing SMB or CIFS protocol. 13 - 14 - In order to mount, the network stack will also need to be set up by 15 - using 'ip=' config option. For more details, see 16 - Documentation/admin-guide/nfs/nfsroot.rst. 17 - 18 - A CIFS root mount currently requires the use of SMB1+UNIX Extensions 19 - which is only supported by the Samba server. SMB1 is the older 20 - deprecated version of the protocol but it has been extended to support 21 - POSIX features (See [1]). The equivalent extensions for the newer 22 - recommended version of the protocol (SMB3) have not been fully 23 - implemented yet which means SMB3 doesn't support some required POSIX 24 - file system objects (e.g. block devices, pipes, sockets). 25 - 26 - As a result, a CIFS root will default to SMB1 for now but the version 27 - to use can nonetheless be changed via the 'vers=' mount option. This 28 - default will change once the SMB3 POSIX extensions are fully 29 - implemented. 30 - 31 - Server configuration 32 - ==================== 33 - 34 - To enable SMB1+UNIX extensions you will need to set these global 35 - settings in Samba smb.conf: 36 - 37 - [global] 38 - server min protocol = NT1 39 - unix extension = yes # default 40 - 41 - Kernel command line 42 - =================== 43 - 44 - root=/dev/cifs 45 - 46 - This is just a virtual device that basically tells the kernel to mount 47 - the root file system via SMB protocol. 
48 - 49 - cifsroot=//<server-ip>/<share>[,options] 50 - 51 - Enables the kernel to mount the root file system via SMB that are 52 - located in the <server-ip> and <share> specified in this option. 53 - 54 - The default mount options are set in fs/cifs/cifsroot.c. 55 - 56 - server-ip 57 - IPv4 address of the server. 58 - 59 - share 60 - Path to SMB share (rootfs). 61 - 62 - options 63 - Optional mount options. For more information, see mount.cifs(8). 64 - 65 - Examples 66 - ======== 67 - 68 - Export root file system as a Samba share in smb.conf file. 69 - 70 - ... 71 - [linux] 72 - path = /path/to/rootfs 73 - read only = no 74 - guest ok = yes 75 - force user = root 76 - force group = root 77 - browseable = yes 78 - writeable = yes 79 - admin users = root 80 - public = yes 81 - create mask = 0777 82 - directory mask = 0777 83 - ... 84 - 85 - Restart smb service. 86 - 87 - # systemctl restart smb 88 - 89 - Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and 90 - CONFIG_IP_PNP options enabled. 91 - 92 - # qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \ 93 - -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \ 94 - -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3" 95 - 96 - 97 - 1: https://wiki.samba.org/index.php/UNIX_Extensions

+1670

Documentation/filesystems/coda.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =========================== 4 + Coda Kernel-Venus Interface 5 + =========================== 6 + 7 + .. Note:: 8 + 9 + This is one of the technical documents describing a component of 10 + Coda -- this document describes the client kernel-Venus interface. 11 + 12 + For more information: 13 + 14 + http://www.coda.cs.cmu.edu 15 + 16 + For user level software needed to run Coda: 17 + 18 + ftp://ftp.coda.cs.cmu.edu 19 + 20 + To run Coda you need to get a user level cache manager for the client, 21 + named Venus, as well as tools to manipulate ACLs, to log in, etc. The 22 + client needs to have the Coda filesystem selected in the kernel 23 + configuration. 24 + 25 + The server needs a user level server and at present does not depend on 26 + kernel support. 27 + 28 + The Venus kernel interface 29 + 30 + Peter J. Braam 31 + 32 + v1.0, Nov 9, 1997 33 + 34 + This document describes the communication between Venus and kernel 35 + level filesystem code needed for the operation of the Coda file sys- 36 + tem. This document version is meant to describe the current interface 37 + (version 1.0) as well as improvements we envisage. 38 + 39 + .. Table of Contents 40 + 41 + 1. Introduction 42 + 43 + 2. Servicing Coda filesystem calls 44 + 45 + 3. The message layer 46 + 47 + 3.1 Implementation details 48 + 49 + 4. The interface at the call level 50 + 51 + 4.1 Data structures shared by the kernel and Venus 52 + 4.2 The pioctl interface 53 + 4.3 root 54 + 4.4 lookup 55 + 4.5 getattr 56 + 4.6 setattr 57 + 4.7 access 58 + 4.8 create 59 + 4.9 mkdir 60 + 4.10 link 61 + 4.11 symlink 62 + 4.12 remove 63 + 4.13 rmdir 64 + 4.14 readlink 65 + 4.15 open 66 + 4.16 close 67 + 4.17 ioctl 68 + 4.18 rename 69 + 4.19 readdir 70 + 4.20 vget 71 + 4.21 fsync 72 + 4.22 inactive 73 + 4.23 rdwr 74 + 4.24 odymount 75 + 4.25 ody_lookup 76 + 4.26 ody_expand 77 + 4.27 prefetch 78 + 4.28 signal 79 + 80 + 5. 
The minicache and downcalls 81 + 82 + 5.1 INVALIDATE 83 + 5.2 FLUSH 84 + 5.3 PURGEUSER 85 + 5.4 ZAPFILE 86 + 5.5 ZAPDIR 87 + 5.6 ZAPVNODE 88 + 5.7 PURGEFID 89 + 5.8 REPLACE 90 + 91 + 6. Initialization and cleanup 92 + 93 + 6.1 Requirements 94 + 95 + 1. Introduction 96 + =============== 97 + 98 + A key component in the Coda Distributed File System is the cache 99 + manager, Venus. 100 + 101 + When processes on a Coda enabled system access files in the Coda 102 + filesystem, requests are directed at the filesystem layer in the 103 + operating system. The operating system will communicate with Venus to 104 + service the request for the process. Venus manages a persistent 105 + client cache and makes remote procedure calls to Coda file servers and 106 + related servers (such as authentication servers) to service these 107 + requests it receives from the operating system. When Venus has 108 + serviced a request it replies to the operating system with appropriate 109 + return codes, and other data related to the request. Optionally the 110 + kernel support for Coda may maintain a minicache of recently processed 111 + requests to limit the number of interactions with Venus. Venus 112 + possesses the facility to inform the kernel when elements from its 113 + minicache are no longer valid. 114 + 115 + This document describes precisely this communication between the 116 + kernel and Venus. The definitions of so called upcalls and downcalls 117 + will be given with the format of the data they handle. We shall also 118 + describe the semantic invariants resulting from the calls. 119 + 120 + Historically Coda was implemented in a BSD file system in Mach 2.6. 121 + The interface between the kernel and Venus is very similar to the BSD 122 + VFS interface. Similar functionality is provided, and the format of 123 + the parameters and returned data is very similar to the BSD VFS. 
This 124 + leads to an almost natural environment for implementing a kernel-level 125 + filesystem driver for Coda in a BSD system. However, other operating 126 + systems such as Linux and Windows 95 and NT have virtual filesystems 127 + with different interfaces. 128 + 129 + To implement Coda on these systems some reverse engineering of the 130 + Venus/Kernel protocol is necessary. Also it came to light that other 131 + systems could profit significantly from certain small optimizations 132 + and modifications to the protocol. To facilitate this work as well as 133 + to make future ports easier, communication between Venus and the 134 + kernel should be documented in great detail. This is the aim of this 135 + document. 136 + 137 + 2. Servicing Coda filesystem calls 138 + =================================== 139 + 140 + The service of a request for a Coda file system service originates in 141 + a process P which is accessing a Coda file. It makes a system call which 142 + traps to the OS kernel. Examples of such calls trapping to the kernel 143 + are ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``, 144 + ``rmdir``, ``chmod`` in a Unix context. Similar calls exist in the Win32 145 + environment, for example ``CreateFile``. 146 + 147 + Generally the operating system handles the request in a virtual 148 + filesystem (VFS) layer, which is named I/O Manager in NT and IFS 149 + manager in Windows 95. The VFS is responsible for partial processing 150 + of the request and for locating the specific filesystem(s) which will 151 + service parts of the request. Usually the information in the path 152 + assists in locating the correct FS drivers. Sometimes after extensive 153 + pre-processing, the VFS starts invoking exported routines in the FS 154 + driver. This is the point where the FS specific processing of the 155 + request starts, and here the Coda specific kernel code comes into 156 + play. 
157 + 158 + The FS layer for Coda must expose and implement several interfaces. 159 + First and foremost the VFS must be able to make all necessary calls to 160 + the Coda FS layer, so the Coda FS driver must expose the VFS interface 161 + as applicable in the operating system. These differ very significantly 162 + among operating systems, but share features such as facilities to 163 + read/write and create and remove objects. The Coda FS layer services 164 + such VFS requests by invoking one or more well defined services 165 + offered by the cache manager Venus. When the replies from Venus have 166 + come back to the FS driver, servicing of the VFS call continues and 167 + finishes with a reply to the kernel's VFS. Finally the VFS layer 168 + returns to the process. 169 + 170 + As a result of this design a basic interface exposed by the FS driver 171 + must allow Venus to manage message traffic. In particular Venus must 172 + be able to retrieve and place messages and to be notified of the 173 + arrival of a new message. The notification must be through a mechanism 174 + which does not block Venus since Venus must attend to other tasks even 175 + when no messages are waiting or being processed. 176 + 177 + **Interfaces of the Coda FS Driver** 178 + 179 + Furthermore the FS layer provides for a special path of communication 180 + between a user process and Venus, called the pioctl interface. The 181 + pioctl interface is used for Coda specific services, such as 182 + requesting detailed information about the persistent cache managed by 183 + Venus. Here the involvement of the kernel is minimal. It identifies 184 + the calling process and passes the information on to Venus. When 185 + Venus replies the response is passed back to the caller in unmodified 186 + form. 187 + 188 + Finally Venus allows the kernel FS driver to cache the results from 189 + certain services. This is done to avoid excessive context switches 190 + and results in an efficient system. 
However, Venus may acquire 191 + information, for example from the network which implies that cached 192 + information must be flushed or replaced. Venus then makes a downcall 193 + to the Coda FS layer to request flushes or updates in the cache. The 194 + kernel FS driver handles such requests synchronously. 195 + 196 + Among these interfaces the VFS interface and the facility to place, 197 + receive and be notified of messages are platform specific. We will 198 + not go into the calls exported to the VFS layer but we will state the 199 + requirements of the message exchange mechanism. 200 + 201 + 202 + 3. The message layer 203 + ===================== 204 + 205 + At the lowest level the communication between Venus and the FS driver 206 + proceeds through messages. The synchronization between processes 207 + requesting Coda file service and Venus relies on blocking and waking 208 + up processes. The Coda FS driver processes VFS- and pioctl-requests 209 + on behalf of a process P, creates messages for Venus, awaits replies 210 + and finally returns to the caller. The implementation of the exchange 211 + of messages is platform specific, but the semantics have (so far) 212 + appeared to be generally applicable. Data buffers are created by the 213 + FS Driver in kernel memory on behalf of P and copied to user memory in 214 + Venus. 215 + 216 + The FS Driver while servicing P makes upcalls to Venus. Such an 217 + upcall is dispatched to Venus by creating a message structure. The 218 + structure contains the identification of P, the message sequence 219 + number, the size of the request and a pointer to the data in kernel 220 + memory for the request. Since the data buffer is re-used to hold the 221 + reply from Venus, there is a field for the size of the reply. A flags 222 + field is used in the message to precisely record the status of the 223 + message. 
Additional platform dependent structures involve pointers to 224 + determine the position of the message on queues and pointers to 225 + synchronization objects. In the upcall routine the message structure 226 + is filled in, flags are set to 0, and it is placed on the *pending* 227 + queue. The routine calling upcall is responsible for allocating the 228 + data buffer; its structure will be described in the next section. 229 + 230 + A facility must exist to notify Venus that the message has been 231 + created, and it is implemented using available synchronization objects in 232 + the OS. This notification is done in the upcall context of the process 233 + P. When the message is on the pending queue, process P cannot proceed 234 + in upcall. The (kernel mode) processing of P in the filesystem 235 + request routine must be suspended until Venus has replied. Therefore 236 + the calling thread in P is blocked in upcall. A pointer in the 237 + message structure will locate the synchronization object on which P is 238 + sleeping. 239 + 240 + Venus detects the notification that a message has arrived, and the FS 241 + driver allows Venus to retrieve the message with a getmsg_from_kernel 242 + call. This action finishes in the kernel by putting the message on the 243 + queue of processing messages and setting flags to READ. Venus is 244 + passed the contents of the data buffer. The getmsg_from_kernel call 245 + now returns and Venus processes the request. 246 + 247 + At some later point the FS driver receives a message from Venus, 248 + namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS 249 + driver looks at the contents of the message and decides if: 250 + 251 + 252 + * the message is a reply for a suspended thread P. If so it removes 253 + the message from the processing queue and marks the message as 254 + WRITTEN. Finally, the FS driver unblocks P (still in the kernel 255 + mode context of Venus) and the sendmsg_to_kernel call returns to 256 + Venus. 
The process P will be scheduled at some point and continues 257 + processing its upcall with the data buffer replaced with the reply 258 + from Venus. 259 + 260 + * The message is a ``downcall``. A downcall is a request from Venus to 261 + the FS Driver. The FS driver processes the request immediately 262 + (usually a cache eviction or replacement) and when it finishes 263 + sendmsg_to_kernel returns. 264 + 265 + Now P awakes and continues processing upcall. There are some 266 + subtleties to take account of. First P will determine if it was woken 267 + up in upcall by a signal from some other source (for example an 268 + attempt to terminate P) or as is normally the case by Venus in its 269 + sendmsg_to_kernel call. In the normal case, the upcall routine will 270 + deallocate the message structure and return. The FS routine can proceed 271 + with its processing. 272 + 273 + 274 + **Sleeping and IPC arrangements** 275 + 276 + In case P is woken up by a signal and not by Venus, it will first look 277 + at the flags field. If the message is not yet READ, the process P can 278 + handle its signal without notifying Venus. If Venus has READ, and 279 + the request should not be processed, P can send Venus a signal message 280 + to indicate that it should disregard the previous message. Such 281 + signals are put in the queue at the head, and read first by Venus. If 282 + the message is already marked as WRITTEN it is too late to stop the 283 + processing. The VFS routine will now continue. (-- If a VFS request 284 + involves more than one upcall, this can lead to complicated state, an 285 + extra field "handle_signals" could be added in the message structure 286 + to indicate points of no return have been passed.--) 287 + 288 + 289 + 290 + 3.1. Implementation details 291 + ---------------------------- 292 + 293 + The Unix implementation of this mechanism has been through the 294 + implementation of a character device associated with Coda. 
Venus 295 + retrieves messages by doing a read on the device; replies are sent 296 + with a write, and notification is through the select system call on the 297 + file descriptor for the device. The process P is kept waiting on an 298 + interruptible wait queue object. 299 + 300 + In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl 301 + call is used. The DeviceIoControl call is designed to copy buffers 302 + from user memory to kernel memory with OPCODES. The sendmsg_to_kernel 303 + is issued as a synchronous call, while the getmsg_from_kernel call is 304 + asynchronous. Windows EventObjects are used for notification of 305 + message arrival. The process P is kept waiting on a KernelEvent 306 + object in NT and a semaphore in Windows 95. 307 + 308 + 309 + 4. The interface at the call level 310 + =================================== 311 + 312 + 313 + This section describes the upcalls a Coda FS driver can make to Venus. 314 + Each of these upcalls makes use of two structures: inputArgs and 315 + outputArgs. In pseudo BNF form the structures take the following 316 + form:: 317 + 318 + 319 + struct inputArgs { 320 + u_long opcode; 321 + u_long unique; /* Keep multiple outstanding msgs distinct */ 322 + u_short pid; /* Common to all */ 323 + u_short pgid; /* Common to all */ 324 + struct CodaCred cred; /* Common to all */ 325 + 326 + <union "in" of call dependent parts of inputArgs> 327 + }; 328 + 329 + struct outputArgs { 330 + u_long opcode; 331 + u_long unique; /* Keep multiple outstanding msgs distinct */ 332 + u_long result; 333 + 334 + <union "out" of call dependent parts of outputArgs> 335 + }; 336 + 337 + 338 + 339 + Before going on let us elucidate the role of the various fields. The 340 + inputArgs start with the opcode which defines the type of service 341 + requested from Venus. There are approximately 30 upcalls at present 342 + which we will discuss. 
The unique field labels the inputArgs with a 343 + number that identifies the message uniquely. A process and 344 + process group id are passed. Finally the credentials of the caller 345 + are included. 346 + 347 + Before delving into the specific calls we need to discuss a variety of 348 + data structures shared by the kernel and Venus. 349 + 350 + 351 + 352 + 353 + 4.1. Data structures shared by the kernel and Venus 354 + ---------------------------------------------------- 355 + 356 + 357 + The CodaCred structure defines a variety of user and group ids as 358 + they are set for the calling process. The vuid_t and vgid_t are 32 bit 359 + unsigned integers. It also defines group membership in an array. On 360 + Unix the CodaCred has proven sufficient to implement good security 361 + semantics for Coda but the structure may have to undergo modification 362 + for the Windows environments when these mature:: 363 + 364 + struct CodaCred { 365 + vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */ 366 + vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ 367 + vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ 368 + }; 369 + 370 + 371 + .. Note:: 372 + 373 + It is questionable if we need CodaCreds in Venus. After all, Venus 374 + doesn't know about groups, although it does create files with the 375 + default uid/gid. Perhaps the list of group membership is superfluous. 376 + 377 + 378 + The next item is the fundamental identifier used to identify Coda 379 + files, the ViceFid. A fid of a file uniquely defines a file or 380 + directory in the Coda filesystem within a cell [1]_:: 381 + 382 + typedef struct ViceFid { 383 + VolumeId Volume; 384 + VnodeId Vnode; 385 + Unique_t Unique; 386 + } ViceFid; 387 + 388 + .. [1] A cell is a group of Coda servers acting under the aegis of a single 389 + system control machine or SCM. See the Coda Administration manual 390 + for a detailed description of the role of the SCM. 
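Because a ViceFid is the key that both the kernel and Venus use to name an object, comparing two fids is a common operation. The following is a minimal sketch, not code taken from the Coda sources: it assumes that VolumeId, VnodeId and Unique_t are 32 bit unsigned integers, and the helper name ``fid_equal`` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed typedefs for this sketch: 32 bit unsigned integers. */
typedef uint32_t VolumeId;
typedef uint32_t VnodeId;
typedef uint32_t Unique_t;

typedef struct ViceFid {
    VolumeId Volume;
    VnodeId  Vnode;
    Unique_t Unique;
} ViceFid;

/*
 * Hypothetical helper: two fids name the same object within a cell
 * only if all three fields match.
 */
static bool fid_equal(const ViceFid *a, const ViceFid *b)
{
    return a->Volume == b->Volume &&
           a->Vnode  == b->Vnode  &&
           a->Unique == b->Unique;
}
```

The comparison is done field by field rather than with memcmp so that it stays correct if a platform inserts padding or if a further field, such as a cell identifier, is later added to the structure.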
391 + 392 + Each of the constituent fields, VolumeId, VnodeId and Unique_t, is an 393 + unsigned 32 bit integer. We envisage that a further field will need 394 + to be prefixed to identify the Coda cell; this will probably take the 395 + form of an IPv6-sized IP address naming the Coda cell through DNS. 396 + 397 + The next important structure shared between Venus and the kernel is 398 + the attributes of the file. The following structure is used to 399 + exchange information. It has room for future extensions such as 400 + support for device files (currently not present in Coda):: 401 + 402 + 403 + struct coda_timespec { 404 + int64_t tv_sec; /* seconds */ 405 + long tv_nsec; /* nanoseconds */ 406 + }; 407 + 408 + struct coda_vattr { 409 + enum coda_vtype va_type; /* vnode type (for create) */ 410 + u_short va_mode; /* file access mode and type */ 411 + short va_nlink; /* number of references to file */ 412 + vuid_t va_uid; /* owner user id */ 413 + vgid_t va_gid; /* owner group id */ 414 + long va_fsid; /* file system id (dev for now) */ 415 + long va_fileid; /* file id */ 416 + u_quad_t va_size; /* file size in bytes */ 417 + long va_blocksize; /* blocksize preferred for i/o */ 418 + struct coda_timespec va_atime; /* time of last access */ 419 + struct coda_timespec va_mtime; /* time of last modification */ 420 + struct coda_timespec va_ctime; /* time file changed */ 421 + u_long va_gen; /* generation number of file */ 422 + u_long va_flags; /* flags defined for file */ 423 + dev_t va_rdev; /* device special file represents */ 424 + u_quad_t va_bytes; /* bytes of disk space held by file */ 425 + u_quad_t va_filerev; /* file modification number */ 426 + u_int va_vaflags; /* operations flags, see below */ 427 + long va_spare; /* remain quad aligned */ 428 + }; 429 + 430 + 431 + 4.2. The pioctl interface 432 + -------------------------- 433 + 434 + 435 + Coda specific requests can be made by applications through the pioctl 436 + interface. 
The pioctl is implemented as an ordinary ioctl on a 437 + fictitious file /coda/.CONTROL. The pioctl call opens this file, gets 438 + a file handle and makes the ioctl call. Finally it closes the file. 439 + 440 + The kernel involvement in this is limited to providing the facility to 441 + open and close and pass the ioctl message and to verify that a path in 442 + the pioctl data buffers is a file in a Coda filesystem. 443 + 444 + The kernel is handed a data packet of the form:: 445 + 446 + struct { 447 + const char *path; 448 + struct ViceIoctl vidata; 449 + int follow; 450 + } data; 451 + 452 + 453 + 454 + where:: 455 + 456 + 457 + struct ViceIoctl { 458 + caddr_t in, out; /* Data to be transferred in, or out */ 459 + short in_size; /* Size of input buffer <= 2K */ 460 + short out_size; /* Maximum size of output buffer, <= 2K */ 461 + }; 462 + 463 + 464 + 465 + The path must be a Coda file, otherwise the ioctl upcall will not be 466 + made. 467 + 468 + .. Note:: The data structures and code are a mess. We need to clean this up. 469 + 470 + 471 + **We now proceed to document the individual calls**: 472 + 473 + 474 + 4.3. root 475 + ---------- 476 + 477 + 478 + Arguments 479 + in 480 + 481 + empty 482 + 483 + out:: 484 + 485 + struct cfs_root_out { 486 + ViceFid VFid; 487 + } cfs_root; 488 + 489 + 490 + 491 + Description 492 + This call is made to Venus during the initialization of 493 + the Coda filesystem. If the result is zero, the cfs_root structure 494 + contains the ViceFid of the root of the Coda filesystem. If a non-zero 495 + result is generated, its value is a platform dependent error code 496 + indicating the difficulty Venus encountered in locating the root of 497 + the Coda filesystem. 498 + 499 + 4.4. lookup 500 + ------------ 501 + 502 + 503 + Summary 504 + Find the ViceFid and type of an object in a directory if it exists. 
505 + 506 + Arguments 507 + in:: 508 + 509 + struct cfs_lookup_in { 510 + ViceFid VFid; 511 + char *name; /* Place holder for data. */ 512 + } cfs_lookup; 513 + 514 + 515 + 516 + out:: 517 + 518 + struct cfs_lookup_out { 519 + ViceFid VFid; 520 + int vtype; 521 + } cfs_lookup; 522 + 523 + 524 + 525 + Description 526 + This call is made to determine the ViceFid and filetype of 527 + a directory entry. The directory entry requested carries the name name 528 + and Venus will search the directory identified by cfs_lookup_in.VFid. 529 + The result may indicate that the name does not exist, or that 530 + difficulty was encountered in finding it (e.g. due to disconnection). 531 + If the result is zero, the field cfs_lookup_out.VFid contains the 532 + target's ViceFid and cfs_lookup_out.vtype the coda_vtype giving the 533 + type of object the name designates. 534 + 535 + The name of the object is an 8 bit character string of maximum length 536 + CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator). 537 + 538 + It is extremely important to realize that Venus bitwise ORs the field 539 + cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should 540 + not be put in the kernel name cache. 541 + 542 + .. Note:: 543 + 544 + The type of the vtype is currently wrong. It should be 545 + coda_vtype. Linux does not take note of CFS_NOCACHE. It should. 546 + 547 + 548 + 4.5. getattr 549 + ------------- 550 + 551 + 552 + Summary Get the attributes of a file. 553 + 554 + Arguments 555 + in:: 556 + 557 + struct cfs_getattr_in { 558 + ViceFid VFid; 559 + struct coda_vattr attr; /* XXXXX */ 560 + } cfs_getattr; 561 + 562 + 563 + 564 + out:: 565 + 566 + struct cfs_getattr_out { 567 + struct coda_vattr attr; 568 + } cfs_getattr; 569 + 570 + 571 + 572 + Description 573 + This call returns the attributes of the file identified by VFid. 
574 + 575 + Errors 576 + Errors can occur if the object with fid does not exist, is 577 + inaccessible or if the caller does not have permission to fetch 578 + attributes. 579 + 580 + .. Note:: 581 + 582 + Many kernel FS drivers (Linux, NT and Windows 95) need to acquire 583 + the attributes as well as the Fid for the instantiation of an internal 584 + "inode" or "FileHandle". A significant improvement in performance on 585 + such systems could be made by combining the lookup and getattr calls 586 + both at the Venus/kernel interaction level and at the RPC level. 587 + 588 + The vattr structure included in the input arguments is superfluous and 589 + should be removed. 590 + 591 + 592 + 4.6. setattr 593 + ------------- 594 + 595 + 596 + Summary 597 + Set the attributes of a file. 598 + 599 + Arguments 600 + in:: 601 + 602 + struct cfs_setattr_in { 603 + ViceFid VFid; 604 + struct coda_vattr attr; 605 + } cfs_setattr; 606 + 607 + 608 + 609 + 610 + out 611 + 612 + empty 613 + 614 + Description 615 + The structure attr is filled with attributes to be changed 616 + in BSD style. Attributes not to be changed are set to -1, apart from 617 + vtype which is set to VNON. Others are set to the value to be assigned. 618 + The only attributes which the FS driver may request to change are the 619 + mode, owner, groupid, atime, mtime and ctime. The return value 620 + indicates success or failure. 621 + 622 + Errors 623 + A variety of errors can occur. The object may not exist, may 624 + be inaccessible, or permission may not be granted by Venus. 625 + 626 + 627 + 4.7. access 628 + ------------ 629 + 630 + 631 + Arguments 632 + in:: 633 + 634 + struct cfs_access_in { 635 + ViceFid VFid; 636 + int flags; 637 + } cfs_access; 638 + 639 + 640 + 641 + out 642 + 643 + empty 644 + 645 + Description 646 + Verify if access to the object identified by VFid for 647 + operations described by flags is permitted. The result indicates if 648 + access will be granted. 
It is important to remember that Coda uses 649 + ACLs to enforce protection and that ultimately the servers, not the 650 + clients enforce the security of the system. The result of this call 651 + will depend on whether a token is held by the user. 652 + 653 + Errors 654 + The object may not exist, or the ACL describing the protection 655 + may not be accessible. 656 + 657 + 658 + 4.8. create 659 + ------------ 660 + 661 + 662 + Summary 663 + Invoked to create a file 664 + 665 + Arguments 666 + in:: 667 + 668 + struct cfs_create_in { 669 + ViceFid VFid; 670 + struct coda_vattr attr; 671 + int excl; 672 + int mode; 673 + char *name; /* Place holder for data. */ 674 + } cfs_create; 675 + 676 + 677 + 678 + 679 + out:: 680 + 681 + struct cfs_create_out { 682 + ViceFid VFid; 683 + struct coda_vattr attr; 684 + } cfs_create; 685 + 686 + 687 + 688 + Description 689 + This upcall is invoked to request creation of a file. 690 + The file will be created in the directory identified by VFid, its name 691 + will be name, and the mode will be mode. If excl is set an error will 692 + be returned if the file already exists. If the size field in attr is 693 + set to zero the file will be truncated. The uid and gid of the file 694 + are set by converting the CodaCred to a uid using a macro CRTOUID 695 + (this macro is platform dependent). Upon success the VFid and 696 + attributes of the file are returned. The Coda FS Driver will normally 697 + instantiate a vnode, inode or file handle at kernel level for the new 698 + object. 699 + 700 + 701 + Errors 702 + A variety of errors can occur. Permissions may be insufficient. 703 + If the object exists and is not a file the error EISDIR is returned 704 + under Unix. 705 + 706 + .. Note:: 707 + 708 + The packing of parameters is very inefficient and appears to 709 + indicate confusion between the system call creat and the VFS operation 710 + create. The VFS operation create is only called to create new objects. 
711 + This create call differs from the Unix one in that it is not invoked 712 + to return a file descriptor. The truncate and exclusive options, 713 + together with the mode, could simply be part of the mode as it is 714 + under Unix. There should be no flags argument; this is used in open 715 + (2) to return a file descriptor for READ or WRITE mode. 716 + 717 + The attributes of the directory should be returned too, since the size 718 + and mtime changed. 719 + 720 + 721 + 4.9. mkdir 722 + ----------- 723 + 724 + 725 + Summary 726 + Create a new directory. 727 + 728 + Arguments 729 + in:: 730 + 731 + struct cfs_mkdir_in { 732 + ViceFid VFid; 733 + struct coda_vattr attr; 734 + char *name; /* Place holder for data. */ 735 + } cfs_mkdir; 736 + 737 + 738 + 739 + out:: 740 + 741 + struct cfs_mkdir_out { 742 + ViceFid VFid; 743 + struct coda_vattr attr; 744 + } cfs_mkdir; 745 + 746 + 747 + 748 + 749 + Description 750 + This call is similar to create but creates a directory. 751 + Only the mode field in the input parameters is used for creation. 752 + Upon successful creation, the attr returned contains the attributes of 753 + the new directory. 754 + 755 + Errors 756 + As for create. 757 + 758 + .. Note:: 759 + 760 + The input parameter should be changed to mode instead of 761 + attributes. 762 + 763 + The attributes of the parent should be returned since the size and 764 + mtime changes. 765 + 766 + 767 + 4.10. link 768 + ----------- 769 + 770 + 771 + Summary 772 + Create a link to an existing file. 773 + 774 + Arguments 775 + in:: 776 + 777 + struct cfs_link_in { 778 + ViceFid sourceFid; /* cnode to link *to* */ 779 + ViceFid destFid; /* Directory in which to place link */ 780 + char *tname; /* Place holder for data. */ 781 + } cfs_link; 782 + 783 + 784 + 785 + out 786 + 787 + empty 788 + 789 + Description 790 + This call creates a link to the sourceFid in the directory 791 + identified by destFid with name tname. 
The source must reside in the 792 + target's parent, i.e. the source must have parent destFid; Coda 793 + does not support cross directory hard links. Only the return value is 794 + relevant. It indicates success or the type of failure. 795 + 796 + Errors 797 + The usual errors can occur. 798 + 799 + 800 + 4.11. symlink 801 + -------------- 802 + 803 + 804 + Summary 805 + Create a symbolic link. 806 + 807 + Arguments 808 + in:: 809 + 810 + struct cfs_symlink_in { 811 + ViceFid VFid; /* Directory to put symlink in */ 812 + char *srcname; 813 + struct coda_vattr attr; 814 + char *tname; 815 + } cfs_symlink; 816 + 817 + 818 + 819 + out 820 + 821 + none 822 + 823 + Description 824 + Create a symbolic link. The link is to be placed in the 825 + directory identified by VFid and named tname. It should point to the 826 + pathname srcname. The attributes of the newly created object are to 827 + be set to attr. 828 + 829 + .. Note:: 830 + 831 + The attributes of the target directory should be returned since 832 + its size changes. 833 + 834 + 835 + 4.12. remove 836 + ------------- 837 + 838 + 839 + Summary 840 + Remove a file. 841 + 842 + Arguments 843 + in:: 844 + 845 + struct cfs_remove_in { 846 + ViceFid VFid; 847 + char *name; /* Place holder for data. */ 848 + } cfs_remove; 849 + 850 + 851 + 852 + out 853 + 854 + none 855 + 856 + Description 857 + Remove the file named cfs_remove_in.name in the directory 858 + identified by VFid. 859 + 860 + 861 + .. Note:: 862 + 863 + The attributes of the directory should be returned since its 864 + mtime and size may change. 865 + 866 + 867 + 4.13. rmdir 868 + ------------ 869 + 870 + 871 + Summary 872 + Remove a directory. 873 + 874 + Arguments 875 + in:: 876 + 877 + struct cfs_rmdir_in { 878 + ViceFid VFid; 879 + char *name; /* Place holder for data. */ 880 + } cfs_rmdir; 881 + 882 + 883 + 884 + out 885 + 886 + none 887 + 888 + Description 889 + Remove the directory with name name from the directory 890 + identified by VFid. 
891 + 892 + .. Note:: The attributes of the parent directory should be returned since 893 + its mtime and size may change. 894 + 895 + 896 + 4.14. readlink 897 + --------------- 898 + 899 + 900 + Summary 901 + Read the value of a symbolic link. 902 + 903 + Arguments 904 + in:: 905 + 906 + struct cfs_readlink_in { 907 + ViceFid VFid; 908 + } cfs_readlink; 909 + 910 + 911 + 912 + out:: 913 + 914 + struct cfs_readlink_out { 915 + int count; 916 + caddr_t data; /* Place holder for data. */ 917 + } cfs_readlink; 918 + 919 + 920 + 921 + Description 922 + This routine reads the contents of symbolic link 923 + identified by VFid into the buffer data. The buffer data must be able 924 + to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). 925 + 926 + Errors 927 + No unusual errors. 928 + 929 + 930 + 4.15. open 931 + ----------- 932 + 933 + 934 + Summary 935 + Open a file. 936 + 937 + Arguments 938 + in:: 939 + 940 + struct cfs_open_in { 941 + ViceFid VFid; 942 + int flags; 943 + } cfs_open; 944 + 945 + 946 + 947 + out:: 948 + 949 + struct cfs_open_out { 950 + dev_t dev; 951 + ino_t inode; 952 + } cfs_open; 953 + 954 + 955 + 956 + Description 957 + This request asks Venus to place the file identified by 958 + VFid in its cache and to note that the calling process wishes to open 959 + it with flags as in open(2). The return value to the kernel differs 960 + for Unix and Windows systems. For Unix systems the Coda FS Driver is 961 + informed of the device and inode number of the container file in the 962 + fields dev and inode. For Windows the path of the container file is 963 + returned to the kernel. 964 + 965 + 966 + .. Note:: 967 + 968 + Currently the cfs_open_out structure is not properly adapted to 969 + deal with the Windows case. It might be best to implement two 970 + upcalls, one to open aiming at a container file name, the other at a 971 + container file inode. 972 + 973 + 974 + 4.16. 
close 975 + ------------ 976 + 977 + 978 + Summary 979 + Close a file, update it on the servers. 980 + 981 + Arguments 982 + in:: 983 + 984 + struct cfs_close_in { 985 + ViceFid VFid; 986 + int flags; 987 + } cfs_close; 988 + 989 + 990 + 991 + out 992 + 993 + none 994 + 995 + Description 996 + Close the file identified by VFid. 997 + 998 + .. Note:: 999 + 1000 + The flags argument is bogus and not used. However, Venus' code 1001 + has room to deal with an execp input field, probably this field should 1002 + be used to inform Venus that the file was closed but is still memory 1003 + mapped for execution. There are comments about fetching versus not 1004 + fetching the data in Venus vproc_vfscalls. This seems silly. If a 1005 + file is being closed, the data in the container file is to be the new 1006 + data. Here again the execp flag might be in play to create confusion: 1007 + currently Venus might think a file can be flushed from the cache when 1008 + it is still memory mapped. This needs to be understood. 1009 + 1010 + 1011 + 4.17. ioctl 1012 + ------------ 1013 + 1014 + 1015 + Summary 1016 + Do an ioctl on a file. This includes the pioctl interface. 1017 + 1018 + Arguments 1019 + in:: 1020 + 1021 + struct cfs_ioctl_in { 1022 + ViceFid VFid; 1023 + int cmd; 1024 + int len; 1025 + int rwflag; 1026 + char *data; /* Place holder for data. */ 1027 + } cfs_ioctl; 1028 + 1029 + 1030 + 1031 + out:: 1032 + 1033 + 1034 + struct cfs_ioctl_out { 1035 + int len; 1036 + caddr_t data; /* Place holder for data. */ 1037 + } cfs_ioctl; 1038 + 1039 + 1040 + 1041 + Description 1042 + Do an ioctl operation on a file. The command, len and 1043 + data arguments are filled as usual. flags is not used by Venus. 1044 + 1045 + .. Note:: 1046 + 1047 + Another bogus parameter. flags is not used. What is the 1048 + business about PREFETCHING in the Venus code? 1049 + 1050 + 1051 + 1052 + 4.18. rename 1053 + ------------- 1054 + 1055 + 1056 + Summary 1057 + Rename a fid. 
1058 + 1059 + Arguments 1060 + in:: 1061 + 1062 + struct cfs_rename_in { 1063 + ViceFid sourceFid; 1064 + char *srcname; 1065 + ViceFid destFid; 1066 + char *destname; 1067 + } cfs_rename; 1068 + 1069 + 1070 + 1071 + out 1072 + 1073 + none 1074 + 1075 + Description 1076 + Rename the object with name srcname in directory 1077 + sourceFid to destname in destFid. It is important that the names 1078 + srcname and destname are 0 terminated strings. Strings in Unix 1079 + kernels are not always null terminated. 1080 + 1081 + 1082 + 4.19. readdir 1083 + -------------- 1084 + 1085 + 1086 + Summary 1087 + Read directory entries. 1088 + 1089 + Arguments 1090 + in:: 1091 + 1092 + struct cfs_readdir_in { 1093 + ViceFid VFid; 1094 + int count; 1095 + int offset; 1096 + } cfs_readdir; 1097 + 1098 + 1099 + 1100 + 1101 + out:: 1102 + 1103 + struct cfs_readdir_out { 1104 + int size; 1105 + caddr_t data; /* Place holder for data. */ 1106 + } cfs_readdir; 1107 + 1108 + 1109 + 1110 + Description 1111 + Read directory entries from VFid starting at offset and 1112 + read at most count bytes. Returns the data in data and returns 1113 + the size in size. 1114 + 1115 + 1116 + .. Note:: 1117 + 1118 + This call is not used. Readdir operations exploit container 1119 + files. We will re-evaluate this during the directory revamp which is 1120 + about to take place. 1121 + 1122 + 1123 + 4.20. vget 1124 + ----------- 1125 + 1126 + 1127 + Summary 1128 + instructs Venus to do an FSDB->Get. 1129 + 1130 + Arguments 1131 + in:: 1132 + 1133 + struct cfs_vget_in { 1134 + ViceFid VFid; 1135 + } cfs_vget; 1136 + 1137 + 1138 + 1139 + out:: 1140 + 1141 + struct cfs_vget_out { 1142 + ViceFid VFid; 1143 + int vtype; 1144 + } cfs_vget; 1145 + 1146 + 1147 + 1148 + Description 1149 + This upcall asks Venus to do a get operation on an fsobj 1150 + labelled by VFid. 1151 + 1152 + .. Note:: 1153 + 1154 + This operation is not used. 
However, it is extremely useful 1155 + since it can be used to deal with read/write memory mapped files. 1156 + These can be "pinned" in the Venus cache using vget and released with 1157 + inactive. 1158 + 1159 + 1160 + 4.21. fsync 1161 + ------------ 1162 + 1163 + 1164 + Summary 1165 + Tell Venus to update the RVM attributes of a file. 1166 + 1167 + Arguments 1168 + in:: 1169 + 1170 + struct cfs_fsync_in { 1171 + ViceFid VFid; 1172 + } cfs_fsync; 1173 + 1174 + 1175 + 1176 + out 1177 + 1178 + none 1179 + 1180 + Description 1181 + Ask Venus to update RVM attributes of object VFid. This 1182 + should be called as part of kernel level fsync type calls. The 1183 + result indicates if the syncing was successful. 1184 + 1185 + .. Note:: Linux does not implement this call. It should. 1186 + 1187 + 1188 + 4.22. inactive 1189 + --------------- 1190 + 1191 + 1192 + Summary 1193 + Tell Venus a vnode is no longer in use. 1194 + 1195 + Arguments 1196 + in:: 1197 + 1198 + struct cfs_inactive_in { 1199 + ViceFid VFid; 1200 + } cfs_inactive; 1201 + 1202 + 1203 + 1204 + out 1205 + 1206 + none 1207 + 1208 + Description 1209 + This operation returns EOPNOTSUPP. 1210 + 1211 + .. Note:: This should perhaps be removed. 1212 + 1213 + 1214 + 4.23. rdwr 1215 + ----------- 1216 + 1217 + 1218 + Summary 1219 + Read or write from a file 1220 + 1221 + Arguments 1222 + in:: 1223 + 1224 + struct cfs_rdwr_in { 1225 + ViceFid VFid; 1226 + int rwflag; 1227 + int count; 1228 + int offset; 1229 + int ioflag; 1230 + caddr_t data; /* Place holder for data. */ 1231 + } cfs_rdwr; 1232 + 1233 + 1234 + 1235 + 1236 + out:: 1237 + 1238 + struct cfs_rdwr_out { 1239 + int rwflag; 1240 + int count; 1241 + caddr_t data; /* Place holder for data. */ 1242 + } cfs_rdwr; 1243 + 1244 + 1245 + 1246 + Description 1247 + This upcall asks Venus to read or write from a file. 1248 + 1249 + 1250 + .. 
Note:: 1251 + 1252 + It should be removed since it is against the Coda philosophy that 1253 + read/write operations never reach Venus. I have been told the 1254 + operation does not work. It is not currently used. 1255 + 1256 + 1257 + 1258 + 4.24. odymount 1259 + --------------- 1260 + 1261 + 1262 + Summary 1263 + Allows mounting multiple Coda "filesystems" on one Unix mount point. 1264 + 1265 + Arguments 1266 + in:: 1267 + 1268 + struct ody_mount_in { 1269 + char *name; /* Place holder for data. */ 1270 + } ody_mount; 1271 + 1272 + 1273 + 1274 + out:: 1275 + 1276 + struct ody_mount_out { 1277 + ViceFid VFid; 1278 + } ody_mount; 1279 + 1280 + 1281 + 1282 + Description 1283 + Asks Venus to return the rootfid of a Coda system named 1284 + name. The fid is returned in VFid. 1285 + 1286 + .. Note:: 1287 + 1288 + This call was used by David for dynamic sets. It should be 1289 + removed since it causes a jungle of pointers in the VFS mounting area. 1290 + It is not used by Coda proper. Call is not implemented by Venus. 1291 + 1292 + 1293 + 4.25. ody_lookup 1294 + ----------------- 1295 + 1296 + 1297 + Summary 1298 + Looks up something. 1299 + 1300 + Arguments 1301 + in 1302 + 1303 + irrelevant 1304 + 1305 + 1306 + out 1307 + 1308 + irrelevant 1309 + 1310 + 1311 + .. Note:: Gut it. Call is not implemented by Venus. 1312 + 1313 + 1314 + 4.26. ody_expand 1315 + ----------------- 1316 + 1317 + 1318 + Summary 1319 + expands something in a dynamic set. 1320 + 1321 + Arguments 1322 + in 1323 + 1324 + irrelevant 1325 + 1326 + out 1327 + 1328 + irrelevant 1329 + 1330 + .. Note:: Gut it. Call is not implemented by Venus. 1331 + 1332 + 1333 + 4.27. prefetch 1334 + --------------- 1335 + 1336 + 1337 + Summary 1338 + Prefetch a dynamic set. 1339 + 1340 + Arguments 1341 + 1342 + in 1343 + 1344 + Not documented. 1345 + 1346 + out 1347 + 1348 + Not documented. 1349 + 1350 + Description 1351 + Venus worker.cc has support for this call, although it is 1352 + noted that it doesn't work. 
Not surprising, since the kernel does not 1353 + have support for it. (ODY_PREFETCH is not a defined operation). 1354 + 1355 + 1356 + .. Note:: Gut it. It isn't working and isn't used by Coda. 1357 + 1358 + 1359 + 1360 + 4.28. signal 1361 + ------------- 1362 + 1363 + 1364 + Summary 1365 + Send Venus a signal about an upcall. 1366 + 1367 + Arguments 1368 + in 1369 + 1370 + none 1371 + 1372 + out 1373 + 1374 + not applicable. 1375 + 1376 + Description 1377 + This is an out-of-band upcall to Venus to inform Venus 1378 + that the calling process received a signal after Venus read the 1379 + message from the input queue. Venus is supposed to clean up the 1380 + operation. 1381 + 1382 + Errors 1383 + No reply is given. 1384 + 1385 + .. Note:: 1386 + 1387 + We need to better understand what Venus needs to clean up and if 1388 + it is doing this correctly. Also we need to handle multiple upcall 1389 + per system call situations correctly. It would be important to know 1390 + what state changes in Venus take place after an upcall for which the 1391 + kernel is responsible for notifying Venus to clean up (e.g. open 1392 + definitely is such a state change, but many others are maybe not). 1393 + 1394 + 1395 + 5. The minicache and downcalls 1396 + =============================== 1397 + 1398 + 1399 + The Coda FS Driver can cache results of lookup and access upcalls, to 1400 + limit the frequency of upcalls. Upcalls carry a price since a process 1401 + context switch needs to take place. The counterpart of caching the 1402 + information is that Venus will notify the FS Driver that cached 1403 + entries must be flushed or renamed. 1404 + 1405 + The kernel code generally has to maintain a structure which links the 1406 + internal file handles (called vnodes in BSD, inodes in Linux and 1407 + FileHandles in Windows) with the ViceFid's which Venus maintains. 
The 1408 + reason is that frequent translations back and forth are needed in 1409 + order to make upcalls and use the results of upcalls. Such linking 1410 + objects are called cnodes. 1411 + 1412 + The current minicache implementations have cache entries which record 1413 + the following: 1414 + 1415 + 1. the name of the file 1416 + 1417 + 2. the cnode of the directory containing the object 1418 + 1419 + 3. a list of CodaCred's for which the lookup is permitted. 1420 + 1421 + 4. the cnode of the object 1422 + 1423 + The lookup call in the Coda FS Driver may request the cnode of the 1424 + desired object from the cache, by passing its name, directory and the 1425 + CodaCred's of the caller. The cache will return the cnode or indicate 1426 + that it cannot be found. The Coda FS Driver must be careful to 1427 + invalidate cache entries when it modifies or removes objects. 1428 + 1429 + When Venus obtains information that indicates that cache entries are 1430 + no longer valid, it will make a downcall to the kernel. Downcalls are 1431 + intercepted by the Coda FS Driver and lead to cache invalidations of 1432 + the kind described below. The Coda FS Driver does not return an error 1433 + unless the downcall data could not be read into kernel memory. 1434 + 1435 + 1436 + 5.1. INVALIDATE 1437 + ---------------- 1438 + 1439 + 1440 + No information is available on this call. 1441 + 1442 + 1443 + 5.2. FLUSH 1444 + ----------- 1445 + 1446 + 1447 + 1448 + Arguments 1449 + None 1450 + 1451 + Summary 1452 + Flush the name cache entirely. 1453 + 1454 + Description 1455 + Venus issues this call upon startup and when it dies. This 1456 + is to prevent stale cache information being held. Some operating 1457 + systems allow the kernel name cache to be switched off dynamically. 1458 + When this is done, this downcall is made. 1459 + 1460 + 1461 + 5.3. 
PURGEUSER 1462 + --------------- 1463 + 1464 + 1465 + Arguments 1466 + :: 1467 + 1468 + struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ 1469 + struct CodaCred cred; 1470 + } cfs_purgeuser; 1471 + 1472 + 1473 + 1474 + Description 1475 + Remove all entries in the cache carrying the Cred. This 1476 + call is issued when tokens for a user expire or are flushed. 1477 + 1478 + 1479 + 5.4. ZAPFILE 1480 + ------------- 1481 + 1482 + 1483 + Arguments 1484 + :: 1485 + 1486 + struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ 1487 + ViceFid CodaFid; 1488 + } cfs_zapfile; 1489 + 1490 + 1491 + 1492 + Description 1493 + Remove all entries which have the (dir vnode, name) pair. 1494 + This is issued as a result of an invalidation of cached attributes of 1495 + a vnode. 1496 + 1497 + .. Note:: 1498 + 1499 + Call is not named correctly in NetBSD and Mach. The minicache 1500 + zapfile routine takes different arguments. Linux does not implement 1501 + the invalidation of attributes correctly. 1502 + 1503 + 1504 + 1505 + 5.5. ZAPDIR 1506 + ------------ 1507 + 1508 + 1509 + Arguments 1510 + :: 1511 + 1512 + struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ 1513 + ViceFid CodaFid; 1514 + } cfs_zapdir; 1515 + 1516 + 1517 + 1518 + Description 1519 + Remove all entries in the cache lying in a directory 1520 + CodaFid, and all children of this directory. This call is issued when 1521 + Venus receives a callback on the directory. 1522 + 1523 + 1524 + 5.6. ZAPVNODE 1525 + -------------- 1526 + 1527 + 1528 + 1529 + Arguments 1530 + :: 1531 + 1532 + struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ 1533 + struct CodaCred cred; 1534 + ViceFid VFid; 1535 + } cfs_zapvnode; 1536 + 1537 + 1538 + 1539 + Description 1540 + Remove all entries in the cache carrying the cred and VFid 1541 + as in the arguments. This downcall is probably never issued. 1542 + 1543 + 1544 + 5.7. 
PURGEFID 1545 + -------------- 1546 + 1547 + 1548 + Arguments 1549 + :: 1550 + 1551 + struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ 1552 + ViceFid CodaFid; 1553 + } cfs_purgefid; 1554 + 1555 + 1556 + 1557 + Description 1558 + Flush the attribute for the file. If it is a dir (odd 1559 + vnode), purge its children from the namecache and remove the file from the 1560 + namecache. 1561 + 1562 + 1563 + 1564 + 5.8. REPLACE 1565 + ------------- 1566 + 1567 + 1568 + Summary 1569 + Replace the Fid's for a collection of names. 1570 + 1571 + Arguments 1572 + :: 1573 + 1574 + struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ 1575 + ViceFid NewFid; 1576 + ViceFid OldFid; 1577 + } cfs_replace; 1578 + 1579 + 1580 + 1581 + Description 1582 + This routine replaces a ViceFid in the name cache with 1583 + another. It is added to allow Venus during reintegration to replace 1584 + locally allocated temp fids while disconnected with global fids even 1585 + when the reference counts on those fids are not zero. 1586 + 1587 + 1588 + 6. Initialization and cleanup 1589 + ============================== 1590 + 1591 + 1592 + This section gives brief hints as to desirable features for the Coda 1593 + FS Driver at startup and upon shutdown or Venus failures. Before 1594 + entering the discussion it is useful to repeat that the Coda FS Driver 1595 + maintains the following data: 1596 + 1597 + 1598 + 1. message queues 1599 + 1600 + 2. cnodes 1601 + 1602 + 3. name cache entries 1603 + 1604 + The name cache entries are entirely private to the driver, so they 1605 + can easily be manipulated. The message queues will generally have 1606 + clear points of initialization and destruction. The cnodes are 1607 + much more delicate. User processes hold reference counts in Coda 1608 + filesystems and it can be difficult to clean up the cnodes. 1609 + 1610 + It can expect requests through: 1611 + 1612 + 1. the message subsystem 1613 + 1614 + 2. 
the VFS layer 1615 + 1616 + 3. pioctl interface 1617 + 1618 + Currently the pioctl passes through the VFS for Coda so we can 1619 + treat these similarly. 1620 + 1621 + 1622 + 6.1. Requirements 1623 + ------------------ 1624 + 1625 + 1626 + The following requirements should be accommodated: 1627 + 1628 + 1. The message queues should have open and close routines. On Unix 1629 + the opening of the character devices are such routines. 1630 + 1631 + - Before opening, no messages can be placed. 1632 + 1633 + - Opening will remove any old messages still pending. 1634 + 1635 + - Close will notify any sleeping processes that their upcall cannot 1636 + be completed. 1637 + 1638 + - Close will free all memory allocated by the message queues. 1639 + 1640 + 1641 + 2. At open the namecache shall be initialized to empty state. 1642 + 1643 + 3. Before the message queues are open, all VFS operations will fail. 1644 + Fortunately this can be achieved by making sure than mounting the 1645 + Coda filesystem cannot succeed before opening. 1646 + 1647 + 4. After closing of the queues, no VFS operations can succeed. Here 1648 + one needs to be careful, since a few operations (lookup, 1649 + read/write, readdir) can proceed without upcalls. These must be 1650 + explicitly blocked. 1651 + 1652 + 5. Upon closing the namecache shall be flushed and disabled. 1653 + 1654 + 6. All memory held by cnodes can be freed without relying on upcalls. 1655 + 1656 + 7. Unmounting the file system can be done without relying on upcalls. 1657 + 1658 + 8. Mounting the Coda filesystem should fail gracefully if Venus cannot 1659 + get the rootfid or the attributes of the rootfid. The latter is 1660 + best implemented by Venus fetching these objects before attempting 1661 + to mount. 1662 + 1663 + .. Note:: 1664 + 1665 + NetBSD in particular but also Linux have not implemented the 1666 + above requirements fully. For smooth operation this needs to be 1667 + corrected. 1668 + 1669 + 1670 +

-1676

Documentation/filesystems/coda.txt

··· 1 - NOTE: 2 - This is one of the technical documents describing a component of 3 - Coda -- this document describes the client kernel-Venus interface. 4 - 5 - For more information: 6 - http://www.coda.cs.cmu.edu 7 - For user level software needed to run Coda: 8 - ftp://ftp.coda.cs.cmu.edu 9 - 10 - To run Coda you need to get a user level cache manager for the client, 11 - named Venus, as well as tools to manipulate ACLs, to log in, etc. The 12 - client needs to have the Coda filesystem selected in the kernel 13 - configuration. 14 - 15 - The server needs a user level server and at present does not depend on 16 - kernel support. 17 - 18 - 19 - 20 - 21 - 22 - 23 - 24 - The Venus kernel interface 25 - Peter J. Braam 26 - v1.0, Nov 9, 1997 27 - 28 - This document describes the communication between Venus and kernel 29 - level filesystem code needed for the operation of the Coda file sys- 30 - tem. This document version is meant to describe the current interface 31 - (version 1.0) as well as improvements we envisage. 32 - ______________________________________________________________________ 33 - 34 - Table of Contents 35 - 36 - 37 - 38 - 39 - 40 - 41 - 42 - 43 - 44 - 45 - 46 - 47 - 48 - 49 - 50 - 51 - 52 - 53 - 54 - 55 - 56 - 57 - 58 - 59 - 60 - 61 - 62 - 63 - 64 - 65 - 66 - 67 - 68 - 69 - 70 - 71 - 72 - 73 - 74 - 75 - 76 - 77 - 78 - 79 - 80 - 81 - 82 - 83 - 84 - 85 - 86 - 87 - 88 - 89 - 90 - 1. Introduction 91 - 92 - 2. Servicing Coda filesystem calls 93 - 94 - 3. The message layer 95 - 96 - 3.1 Implementation details 97 - 98 - 4. 
The interface at the call level 99 - 100 - 4.1 Data structures shared by the kernel and Venus 101 - 4.2 The pioctl interface 102 - 4.3 root 103 - 4.4 lookup 104 - 4.5 getattr 105 - 4.6 setattr 106 - 4.7 access 107 - 4.8 create 108 - 4.9 mkdir 109 - 4.10 link 110 - 4.11 symlink 111 - 4.12 remove 112 - 4.13 rmdir 113 - 4.14 readlink 114 - 4.15 open 115 - 4.16 close 116 - 4.17 ioctl 117 - 4.18 rename 118 - 4.19 readdir 119 - 4.20 vget 120 - 4.21 fsync 121 - 4.22 inactive 122 - 4.23 rdwr 123 - 4.24 odymount 124 - 4.25 ody_lookup 125 - 4.26 ody_expand 126 - 4.27 prefetch 127 - 4.28 signal 128 - 129 - 5. The minicache and downcalls 130 - 131 - 5.1 INVALIDATE 132 - 5.2 FLUSH 133 - 5.3 PURGEUSER 134 - 5.4 ZAPFILE 135 - 5.5 ZAPDIR 136 - 5.6 ZAPVNODE 137 - 5.7 PURGEFID 138 - 5.8 REPLACE 139 - 140 - 6. Initialization and cleanup 141 - 142 - 6.1 Requirements 143 - 144 - 145 - ______________________________________________________________________ 146 - 0wpage 147 - 148 - 11.. IInnttrroodduuccttiioonn 149 - 150 - 151 - 152 - A key component in the Coda Distributed File System is the cache 153 - manager, _V_e_n_u_s. 154 - 155 - 156 - When processes on a Coda enabled system access files in the Coda 157 - filesystem, requests are directed at the filesystem layer in the 158 - operating system. The operating system will communicate with Venus to 159 - service the request for the process. Venus manages a persistent 160 - client cache and makes remote procedure calls to Coda file servers and 161 - related servers (such as authentication servers) to service these 162 - requests it receives from the operating system. When Venus has 163 - serviced a request it replies to the operating system with appropriate 164 - return codes, and other data related to the request. Optionally the 165 - kernel support for Coda may maintain a minicache of recently processed 166 - requests to limit the number of interactions with Venus. 
Venus 167 - possesses the facility to inform the kernel when elements from its 168 - minicache are no longer valid. 169 - 170 - This document describes precisely this communication between the 171 - kernel and Venus. The definitions of so called upcalls and downcalls 172 - will be given with the format of the data they handle. We shall also 173 - describe the semantic invariants resulting from the calls. 174 - 175 - Historically Coda was implemented in a BSD file system in Mach 2.6. 176 - The interface between the kernel and Venus is very similar to the BSD 177 - VFS interface. Similar functionality is provided, and the format of 178 - the parameters and returned data is very similar to the BSD VFS. This 179 - leads to an almost natural environment for implementing a kernel-level 180 - filesystem driver for Coda in a BSD system. However, other operating 181 - systems such as Linux and Windows 95 and NT have virtual filesystem 182 - with different interfaces. 183 - 184 - To implement Coda on these systems some reverse engineering of the 185 - Venus/Kernel protocol is necessary. Also it came to light that other 186 - systems could profit significantly from certain small optimizations 187 - and modifications to the protocol. To facilitate this work as well as 188 - to make future ports easier, communication between Venus and the 189 - kernel should be documented in great detail. This is the aim of this 190 - document. 191 - 192 - 0wpage 193 - 194 - 22.. SSeerrvviicciinngg CCooddaa ffiilleessyysstteemm ccaallllss 195 - 196 - The service of a request for a Coda file system service originates in 197 - a process PP which accessing a Coda file. It makes a system call which 198 - traps to the OS kernel. Examples of such calls trapping to the kernel 199 - are _r_e_a_d_, _w_r_i_t_e_, _o_p_e_n_, _c_l_o_s_e_, _c_r_e_a_t_e_, _m_k_d_i_r_, _r_m_d_i_r_, _c_h_m_o_d in a Unix 200 - context. Similar calls exist in the Win32 environment, and are named 201 - _C_r_e_a_t_e_F_i_l_e_, . 
202 - 203 - Generally the operating system handles the request in a virtual 204 - filesystem (VFS) layer, which is named I/O Manager in NT and IFS 205 - manager in Windows 95. The VFS is responsible for partial processing 206 - of the request and for locating the specific filesystem(s) which will 207 - service parts of the request. Usually the information in the path 208 - assists in locating the correct FS drivers. Sometimes after extensive 209 - pre-processing, the VFS starts invoking exported routines in the FS 210 - driver. This is the point where the FS specific processing of the 211 - request starts, and here the Coda specific kernel code comes into 212 - play. 213 - 214 - The FS layer for Coda must expose and implement several interfaces. 215 - First and foremost the VFS must be able to make all necessary calls to 216 - the Coda FS layer, so the Coda FS driver must expose the VFS interface 217 - as applicable in the operating system. These differ very significantly 218 - among operating systems, but share features such as facilities to 219 - read/write and create and remove objects. The Coda FS layer services 220 - such VFS requests by invoking one or more well defined services 221 - offered by the cache manager Venus. When the replies from Venus have 222 - come back to the FS driver, servicing of the VFS call continues and 223 - finishes with a reply to the kernel's VFS. Finally the VFS layer 224 - returns to the process. 225 - 226 - As a result of this design a basic interface exposed by the FS driver 227 - must allow Venus to manage message traffic. In particular Venus must 228 - be able to retrieve and place messages and to be notified of the 229 - arrival of a new message. The notification must be through a mechanism 230 - which does not block Venus since Venus must attend to other tasks even 231 - when no messages are waiting or being processed. 
232 - 233 - 234 - 235 - 236 - 237 - 238 - Interfaces of the Coda FS Driver 239 - 240 - Furthermore the FS layer provides for a special path of communication 241 - between a user process and Venus, called the pioctl interface. The 242 - pioctl interface is used for Coda specific services, such as 243 - requesting detailed information about the persistent cache managed by 244 - Venus. Here the involvement of the kernel is minimal. It identifies 245 - the calling process and passes the information on to Venus. When 246 - Venus replies the response is passed back to the caller in unmodified 247 - form. 248 - 249 - Finally Venus allows the kernel FS driver to cache the results from 250 - certain services. This is done to avoid excessive context switches 251 - and results in an efficient system. However, Venus may acquire 252 - information, for example from the network which implies that cached 253 - information must be flushed or replaced. Venus then makes a downcall 254 - to the Coda FS layer to request flushes or updates in the cache. The 255 - kernel FS driver handles such requests synchronously. 256 - 257 - Among these interfaces the VFS interface and the facility to place, 258 - receive and be notified of messages are platform specific. We will 259 - not go into the calls exported to the VFS layer but we will state the 260 - requirements of the message exchange mechanism. 261 - 262 - 0wpage 263 - 264 - 33.. TThhee mmeessssaaggee llaayyeerr 265 - 266 - 267 - 268 - At the lowest level the communication between Venus and the FS driver 269 - proceeds through messages. The synchronization between processes 270 - requesting Coda file service and Venus relies on blocking and waking 271 - up processes. The Coda FS driver processes VFS- and pioctl-requests 272 - on behalf of a process P, creates messages for Venus, awaits replies 273 - and finally returns to the caller. 
The implementation of the exchange 274 - of messages is platform specific, but the semantics have (so far) 275 - appeared to be generally applicable. Data buffers are created by the 276 - FS Driver in kernel memory on behalf of P and copied to user memory in 277 - Venus. 278 - 279 - The FS Driver while servicing P makes upcalls to Venus. Such an 280 - upcall is dispatched to Venus by creating a message structure. The 281 - structure contains the identification of P, the message sequence 282 - number, the size of the request and a pointer to the data in kernel 283 - memory for the request. Since the data buffer is re-used to hold the 284 - reply from Venus, there is a field for the size of the reply. A flags 285 - field is used in the message to precisely record the status of the 286 - message. Additional platform dependent structures involve pointers to 287 - determine the position of the message on queues and pointers to 288 - synchronization objects. In the upcall routine the message structure 289 - is filled in, flags are set to 0, and it is placed on the _p_e_n_d_i_n_g 290 - queue. The routine calling upcall is responsible for allocating the 291 - data buffer; its structure will be described in the next section. 292 - 293 - A facility must exist to notify Venus that the message has been 294 - created, and implemented using available synchronization objects in 295 - the OS. This notification is done in the upcall context of the process 296 - P. When the message is on the pending queue, process P cannot proceed 297 - in upcall. The (kernel mode) processing of P in the filesystem 298 - request routine must be suspended until Venus has replied. Therefore 299 - the calling thread in P is blocked in upcall. A pointer in the 300 - message structure will locate the synchronization object on which P is 301 - sleeping. 
302 - 303 - Venus detects the notification that a message has arrived, and the FS 304 - driver allow Venus to retrieve the message with a getmsg_from_kernel 305 - call. This action finishes in the kernel by putting the message on the 306 - queue of processing messages and setting flags to READ. Venus is 307 - passed the contents of the data buffer. The getmsg_from_kernel call 308 - now returns and Venus processes the request. 309 - 310 - At some later point the FS driver receives a message from Venus, 311 - namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS 312 - driver looks at the contents of the message and decides if: 313 - 314 - 315 - +o the message is a reply for a suspended thread P. If so it removes 316 - the message from the processing queue and marks the message as 317 - WRITTEN. Finally, the FS driver unblocks P (still in the kernel 318 - mode context of Venus) and the sendmsg_to_kernel call returns to 319 - Venus. The process P will be scheduled at some point and continues 320 - processing its upcall with the data buffer replaced with the reply 321 - from Venus. 322 - 323 - +o The message is a _d_o_w_n_c_a_l_l. A downcall is a request from Venus to 324 - the FS Driver. The FS driver processes the request immediately 325 - (usually a cache eviction or replacement) and when it finishes 326 - sendmsg_to_kernel returns. 327 - 328 - Now P awakes and continues processing upcall. There are some 329 - subtleties to take account of. First P will determine if it was woken 330 - up in upcall by a signal from some other source (for example an 331 - attempt to terminate P) or as is normally the case by Venus in its 332 - sendmsg_to_kernel call. In the normal case, the upcall routine will 333 - deallocate the message structure and return. The FS routine can proceed 334 - with its processing. 
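The message life cycle described above (pending queue, flags set to READ when Venus fetches the message, WRITTEN when the reply arrives) can be sketched as follows. This is a minimal model, not the driver's code: the struct layouts, the flag values and the helper names `upcall_post` and `sendmsg_reply` are assumptions made for illustration, and the blocking and waking of process P is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the text names only the states READ and WRITTEN. */
enum { MSG_READ = 1, MSG_WRITTEN = 2 };

struct msg {
    unsigned long unique;  /* message sequence number */
    int flags;             /* 0 while pending, then READ, then WRITTEN */
    struct msg *next;      /* queue linkage */
};

struct queue { struct msg *head, *tail; };

/* Hypothetical container for the two queues the text mentions. */
struct driver { struct queue pending, processing; };

static void enqueue(struct queue *q, struct msg *m)
{
    m->next = NULL;
    if (q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
}

static struct msg *dequeue(struct queue *q)
{
    struct msg *m = q->head;
    if (m) {
        q->head = m->next;
        if (!q->head)
            q->tail = NULL;
        m->next = NULL;
    }
    return m;
}

/* Unlink the message with the given sequence number, or return NULL. */
static struct msg *remove_unique(struct queue *q, unsigned long unique)
{
    struct msg *prev = NULL, *m = q->head;
    while (m && m->unique != unique) {
        prev = m;
        m = m->next;
    }
    if (!m)
        return NULL;
    if (prev)
        prev->next = m->next;
    else
        q->head = m->next;
    if (q->tail == m)
        q->tail = prev;
    m->next = NULL;
    return m;
}

/* upcall: fill in the message, clear flags, place it on the pending queue. */
static void upcall_post(struct driver *d, struct msg *m, unsigned long unique)
{
    assert(d != NULL && m != NULL);
    m->unique = unique;
    m->flags = 0;
    enqueue(&d->pending, m);
}

/* Venus fetches a message: move it to the processing queue, mark it READ. */
static struct msg *getmsg_from_kernel(struct driver *d)
{
    struct msg *m = dequeue(&d->pending);
    if (m) {
        m->flags = MSG_READ;
        enqueue(&d->processing, m);
    }
    return m;
}

/* Reply path of sendmsg_to_kernel: returns the message whose sleeper
 * should be unblocked, or NULL if no such message is being processed. */
static struct msg *sendmsg_reply(struct driver *d, unsigned long unique)
{
    struct msg *m = remove_unique(&d->processing, unique);
    if (m)
        m->flags = MSG_WRITTEN;
    return m;
}
```

A reply is matched only against the processing queue, so a message Venus has not yet read cannot be marked WRITTEN by mistake.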
335 - 336 - 337 - 338 - 339 - 340 - 341 - 342 - Sleeping and IPC arrangements 343 - 344 - In case P is woken up by a signal and not by Venus, it will first look 345 - at the flags field. If the message is not yet READ, the process P can 346 - handle its signal without notifying Venus. If Venus has READ, and 347 - the request should not be processed, P can send Venus a signal message 348 - to indicate that it should disregard the previous message. Such 349 - signals are put in the queue at the head, and read first by Venus. If 350 - the message is already marked as WRITTEN it is too late to stop the 351 - processing. The VFS routine will now continue. (-- If a VFS request 352 - involves more than one upcall, this can lead to complicated state, an 353 - extra field "handle_signals" could be added in the message structure 354 - to indicate points of no return have been passed.--) 355 - 356 - 357 - 358 - 33..11.. IImmpplleemmeennttaattiioonn ddeettaaiillss 359 - 360 - The Unix implementation of this mechanism has been through the 361 - implementation of a character device associated with Coda. Venus 362 - retrieves messages by doing a read on the device, replies are sent 363 - with a write and notification is through the select system call on the 364 - file descriptor for the device. The process P is kept waiting on an 365 - interruptible wait queue object. 366 - 367 - In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl 368 - call is used. The DeviceIoControl call is designed to copy buffers 369 - from user memory to kernel memory with OPCODES. The sendmsg_to_kernel 370 - is issued as a synchronous call, while the getmsg_from_kernel call is 371 - asynchronous. Windows EventObjects are used for notification of 372 - message arrival. The process P is kept waiting on a KernelEvent 373 - object in NT and a semaphore in Windows 95. 374 - 375 - 0wpage 376 - 377 - 44.. 
TThhee iinntteerrffaaccee aatt tthhee ccaallll lleevveell 378 - 379 - 380 - This section describes the upcalls a Coda FS driver can make to Venus. 381 - Each of these upcalls make use of two structures: inputArgs and 382 - outputArgs. In pseudo BNF form the structures take the following 383 - form: 384 - 385 - 386 - struct inputArgs { 387 - u_long opcode; 388 - u_long unique; /* Keep multiple outstanding msgs distinct */ 389 - u_short pid; /* Common to all */ 390 - u_short pgid; /* Common to all */ 391 - struct CodaCred cred; /* Common to all */ 392 - 393 - <union "in" of call dependent parts of inputArgs> 394 - }; 395 - 396 - struct outputArgs { 397 - u_long opcode; 398 - u_long unique; /* Keep multiple outstanding msgs distinct */ 399 - u_long result; 400 - 401 - <union "out" of call dependent parts of inputArgs> 402 - }; 403 - 404 - 405 - 406 - Before going on let us elucidate the role of the various fields. The 407 - inputArgs start with the opcode which defines the type of service 408 - requested from Venus. There are approximately 30 upcalls at present 409 - which we will discuss. The unique field labels the inputArg with a 410 - unique number which will identify the message uniquely. A process and 411 - process group id are passed. Finally the credentials of the caller 412 - are included. 413 - 414 - Before delving into the specific calls we need to discuss a variety of 415 - data structures shared by the kernel and Venus. 416 - 417 - 418 - 419 - 420 - 44..11.. DDaattaa ssttrruuccttuurreess sshhaarreedd bbyy tthhee kkeerrnneell aanndd VVeennuuss 421 - 422 - 423 - The CodaCred structure defines a variety of user and group ids as 424 - they are set for the calling process. The vuid_t and vgid_t are 32 bit 425 - unsigned integers. It also defines group membership in an array. 
On 426 - Unix the CodaCred has proven sufficient to implement good security 427 - semantics for Coda but the structure may have to undergo modification 428 - for the Windows environment when these mature. 429 - 430 - struct CodaCred { 431 - vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */ 432 - vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ 433 - vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ 434 - }; 435 - 436 - 437 - 438 - NNOOTTEE It is questionable if we need CodaCreds in Venus. Finally Venus 439 - doesn't know about groups, although it does create files with the 440 - default uid/gid. Perhaps the list of group membership is superfluous. 441 - 442 - 443 - The next item is the fundamental identifier used to identify Coda 444 - files, the ViceFid. A fid of a file uniquely defines a file or 445 - directory in the Coda filesystem within a _c_e_l_l. (-- A _c_e_l_l is a 446 - group of Coda servers acting under the aegis of a single system 447 - control machine or SCM. See the Coda Administration manual for a 448 - detailed description of the role of the SCM.--) 449 - 450 - 451 - typedef struct ViceFid { 452 - VolumeId Volume; 453 - VnodeId Vnode; 454 - Unique_t Unique; 455 - } ViceFid; 456 - 457 - 458 - 459 - Each of the constituent fields: VolumeId, VnodeId and Unique_t are 460 - unsigned 32 bit integers. We envisage that a further field will need 461 - to be prefixed to identify the Coda cell; this will probably take the 462 - form of a Ipv6 size IP address naming the Coda cell through DNS. 463 - 464 - The next important structure shared between Venus and the kernel is 465 - the attributes of the file. The following structure is used to 466 - exchange information. It has room for future extensions such as 467 - support for device files (currently not present in Coda). 
468 - 469 - 470 - 471 - 472 - 473 - 474 - 475 - 476 - 477 - 478 - 479 - 480 - 481 - 482 - 483 - 484 - struct coda_timespec { 485 - int64_t tv_sec; /* seconds */ 486 - long tv_nsec; /* nanoseconds */ 487 - }; 488 - 489 - struct coda_vattr { 490 - enum coda_vtype va_type; /* vnode type (for create) */ 491 - u_short va_mode; /* files access mode and type */ 492 - short va_nlink; /* number of references to file */ 493 - vuid_t va_uid; /* owner user id */ 494 - vgid_t va_gid; /* owner group id */ 495 - long va_fsid; /* file system id (dev for now) */ 496 - long va_fileid; /* file id */ 497 - u_quad_t va_size; /* file size in bytes */ 498 - long va_blocksize; /* blocksize preferred for i/o */ 499 - struct coda_timespec va_atime; /* time of last access */ 500 - struct coda_timespec va_mtime; /* time of last modification */ 501 - struct coda_timespec va_ctime; /* time file changed */ 502 - u_long va_gen; /* generation number of file */ 503 - u_long va_flags; /* flags defined for file */ 504 - dev_t va_rdev; /* device special file represents */ 505 - u_quad_t va_bytes; /* bytes of disk space held by file */ 506 - u_quad_t va_filerev; /* file modification number */ 507 - u_int va_vaflags; /* operations flags, see below */ 508 - long va_spare; /* remain quad aligned */ 509 - }; 510 - 511 - 512 - 513 - 514 - 44..22.. TThhee ppiiooccttll iinntteerrffaaccee 515 - 516 - 517 - Coda specific requests can be made by application through the pioctl 518 - interface. The pioctl is implemented as an ordinary ioctl on a 519 - fictitious file /coda/.CONTROL. The pioctl call opens this file, gets 520 - a file handle and makes the ioctl call. Finally it closes the file. 521 - 522 - The kernel involvement in this is limited to providing the facility to 523 - open and close and pass the ioctl message _a_n_d to verify that a path in 524 - the pioctl data buffers is a file in a Coda filesystem. 
525 - 526 - The kernel is handed a data packet of the form: 527 - 528 - struct { 529 - const char *path; 530 - struct ViceIoctl vidata; 531 - int follow; 532 - } data; 533 - 534 - 535 - 536 - where 537 - 538 - 539 - struct ViceIoctl { 540 - caddr_t in, out; /* Data to be transferred in, or out */ 541 - short in_size; /* Size of input buffer <= 2K */ 542 - short out_size; /* Maximum size of output buffer, <= 2K */ 543 - }; 544 - 545 - 546 - 547 - The path must be a Coda file, otherwise the ioctl upcall will not be 548 - made. 549 - 550 - NNOOTTEE The data structures and code are a mess. We need to clean this 551 - up. 552 - 553 - We now proceed to document the individual calls: 554 - 555 - 0wpage 556 - 557 - 44..33.. rroooott 558 - 559 - 560 - AArrgguummeennttss 561 - 562 - iinn empty 563 - 564 - oouutt 565 - 566 - struct cfs_root_out { 567 - ViceFid VFid; 568 - } cfs_root; 569 - 570 - 571 - 572 - DDeessccrriippttiioonn This call is made to Venus during the initialization of 573 - the Coda filesystem. If the result is zero, the cfs_root structure 574 - contains the ViceFid of the root of the Coda filesystem. If a non-zero 575 - result is generated, its value is a platform dependent error code 576 - indicating the difficulty Venus encountered in locating the root of 577 - the Coda filesystem. 578 - 579 - 0wpage 580 - 581 - 44..44.. llooookkuupp 582 - 583 - 584 - SSuummmmaarryy Find the ViceFid and type of an object in a directory if it 585 - exists. 586 - 587 - AArrgguummeennttss 588 - 589 - iinn 590 - 591 - struct cfs_lookup_in { 592 - ViceFid VFid; 593 - char *name; /* Place holder for data. */ 594 - } cfs_lookup; 595 - 596 - 597 - 598 - oouutt 599 - 600 - struct cfs_lookup_out { 601 - ViceFid VFid; 602 - int vtype; 603 - } cfs_lookup; 604 - 605 - 606 - 607 - DDeessccrriippttiioonn This call is made to determine the ViceFid and filetype of 608 - a directory entry. 
The directory entry requested carries name name 609 - and Venus will search the directory identified by cfs_lookup_in.VFid. 610 - The result may indicate that the name does not exist, or that 611 - difficulty was encountered in finding it (e.g. due to disconnection). 612 - If the result is zero, the field cfs_lookup_out.VFid contains the 613 - targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the 614 - type of object the name designates. 615 - 616 - The name of the object is an 8 bit character string of maximum length 617 - CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.) 618 - 619 - It is extremely important to realize that Venus bitwise ors the field 620 - cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should 621 - not be put in the kernel name cache. 622 - 623 - NNOOTTEE The type of the vtype is currently wrong. It should be 624 - coda_vtype. Linux does not take note of CFS_NOCACHE. It should. 625 - 626 - 0wpage 627 - 628 - 44..55.. ggeettaattttrr 629 - 630 - 631 - SSuummmmaarryy Get the attributes of a file. 632 - 633 - AArrgguummeennttss 634 - 635 - iinn 636 - 637 - struct cfs_getattr_in { 638 - ViceFid VFid; 639 - struct coda_vattr attr; /* XXXXX */ 640 - } cfs_getattr; 641 - 642 - 643 - 644 - oouutt 645 - 646 - struct cfs_getattr_out { 647 - struct coda_vattr attr; 648 - } cfs_getattr; 649 - 650 - 651 - 652 - DDeessccrriippttiioonn This call returns the attributes of the file identified by 653 - fid. 654 - 655 - EErrrroorrss Errors can occur if the object with fid does not exist, is 656 - unaccessible or if the caller does not have permission to fetch 657 - attributes. 658 - 659 - NNoottee Many kernel FS drivers (Linux, NT and Windows 95) need to acquire 660 - the attributes as well as the Fid for the instantiation of an internal 661 - "inode" or "FileHandle". 
A significant improvement in performance on 662 - such systems could be made by combining the _l_o_o_k_u_p and _g_e_t_a_t_t_r calls 663 - both at the Venus/kernel interaction level and at the RPC level. 664 - 665 - The vattr structure included in the input arguments is superfluous and 666 - should be removed. 667 - 668 - 0wpage 669 - 670 - 44..66.. sseettaattttrr 671 - 672 - 673 - SSuummmmaarryy Set the attributes of a file. 674 - 675 - AArrgguummeennttss 676 - 677 - iinn 678 - 679 - struct cfs_setattr_in { 680 - ViceFid VFid; 681 - struct coda_vattr attr; 682 - } cfs_setattr; 683 - 684 - 685 - 686 - 687 - oouutt 688 - empty 689 - 690 - DDeessccrriippttiioonn The structure attr is filled with attributes to be changed 691 - in BSD style. Attributes not to be changed are set to -1, apart from 692 - vtype which is set to VNON. Other are set to the value to be assigned. 693 - The only attributes which the FS driver may request to change are the 694 - mode, owner, groupid, atime, mtime and ctime. The return value 695 - indicates success or failure. 696 - 697 - EErrrroorrss A variety of errors can occur. The object may not exist, may 698 - be inaccessible, or permission may not be granted by Venus. 699 - 700 - 0wpage 701 - 702 - 44..77.. aacccceessss 703 - 704 - 705 - SSuummmmaarryy 706 - 707 - AArrgguummeennttss 708 - 709 - iinn 710 - 711 - struct cfs_access_in { 712 - ViceFid VFid; 713 - int flags; 714 - } cfs_access; 715 - 716 - 717 - 718 - oouutt 719 - empty 720 - 721 - DDeessccrriippttiioonn Verify if access to the object identified by VFid for 722 - operations described by flags is permitted. The result indicates if 723 - access will be granted. It is important to remember that Coda uses 724 - ACLs to enforce protection and that ultimately the servers, not the 725 - clients enforce the security of the system. The result of this call 726 - will depend on whether a _t_o_k_e_n is held by the user. 
727 - 728 - EErrrroorrss The object may not exist, or the ACL describing the protection 729 - may not be accessible. 730 - 731 - 0wpage 732 - 733 - 44..88.. ccrreeaattee 734 - 735 - 736 - SSuummmmaarryy Invoked to create a file 737 - 738 - AArrgguummeennttss 739 - 740 - iinn 741 - 742 - struct cfs_create_in { 743 - ViceFid VFid; 744 - struct coda_vattr attr; 745 - int excl; 746 - int mode; 747 - char *name; /* Place holder for data. */ 748 - } cfs_create; 749 - 750 - 751 - 752 - 753 - oouutt 754 - 755 - struct cfs_create_out { 756 - ViceFid VFid; 757 - struct coda_vattr attr; 758 - } cfs_create; 759 - 760 - 761 - 762 - DDeessccrriippttiioonn This upcall is invoked to request creation of a file. 763 - The file will be created in the directory identified by VFid, its name 764 - will be name, and the mode will be mode. If excl is set an error will 765 - be returned if the file already exists. If the size field in attr is 766 - set to zero the file will be truncated. The uid and gid of the file 767 - are set by converting the CodaCred to a uid using a macro CRTOUID 768 - (this macro is platform dependent). Upon success the VFid and 769 - attributes of the file are returned. The Coda FS Driver will normally 770 - instantiate a vnode, inode or file handle at kernel level for the new 771 - object. 772 - 773 - 774 - EErrrroorrss A variety of errors can occur. Permissions may be insufficient. 775 - If the object exists and is not a file the error EISDIR is returned 776 - under Unix. 777 - 778 - NNOOTTEE The packing of parameters is very inefficient and appears to 779 - indicate confusion between the system call creat and the VFS operation 780 - create. The VFS operation create is only called to create new objects. 781 - This create call differs from the Unix one in that it is not invoked 782 - to return a file descriptor. The truncate and exclusive options, 783 - together with the mode, could simply be part of the mode as it is 784 - under Unix. 
There should be no flags argument; this is used in open 785 - (2) to return a file descriptor for READ or WRITE mode. 786 - 787 - The attributes of the directory should be returned too, since the size 788 - and mtime changed. 789 - 790 - 0wpage 791 - 792 - 44..99.. mmkkddiirr 793 - 794 - 795 - SSuummmmaarryy Create a new directory. 796 - 797 - AArrgguummeennttss 798 - 799 - iinn 800 - 801 - struct cfs_mkdir_in { 802 - ViceFid VFid; 803 - struct coda_vattr attr; 804 - char *name; /* Place holder for data. */ 805 - } cfs_mkdir; 806 - 807 - 808 - 809 - oouutt 810 - 811 - struct cfs_mkdir_out { 812 - ViceFid VFid; 813 - struct coda_vattr attr; 814 - } cfs_mkdir; 815 - 816 - 817 - 818 - 819 - DDeessccrriippttiioonn This call is similar to create but creates a directory. 820 - Only the mode field in the input parameters is used for creation. 821 - Upon successful creation, the attr returned contains the attributes of 822 - the new directory. 823 - 824 - EErrrroorrss As for create. 825 - 826 - NNOOTTEE The input parameter should be changed to mode instead of 827 - attributes. 828 - 829 - The attributes of the parent should be returned since the size and 830 - mtime changes. 831 - 832 - 0wpage 833 - 834 - 44..1100.. lliinnkk 835 - 836 - 837 - SSuummmmaarryy Create a link to an existing file. 838 - 839 - AArrgguummeennttss 840 - 841 - iinn 842 - 843 - struct cfs_link_in { 844 - ViceFid sourceFid; /* cnode to link *to* */ 845 - ViceFid destFid; /* Directory in which to place link */ 846 - char *tname; /* Place holder for data. */ 847 - } cfs_link; 848 - 849 - 850 - 851 - oouutt 852 - empty 853 - 854 - DDeessccrriippttiioonn This call creates a link to the sourceFid in the directory 855 - identified by destFid with name tname. The source must reside in the 856 - target's parent, i.e. the source must be have parent destFid, i.e. Coda 857 - does not support cross directory hard links. Only the return value is 858 - relevant. It indicates success or the type of failure. 
859 - 860 - EErrrroorrss The usual errors can occur.0wpage 861 - 862 - 44..1111.. ssyymmlliinnkk 863 - 864 - 865 - SSuummmmaarryy create a symbolic link 866 - 867 - AArrgguummeennttss 868 - 869 - iinn 870 - 871 - struct cfs_symlink_in { 872 - ViceFid VFid; /* Directory to put symlink in */ 873 - char *srcname; 874 - struct coda_vattr attr; 875 - char *tname; 876 - } cfs_symlink; 877 - 878 - 879 - 880 - oouutt 881 - none 882 - 883 - DDeessccrriippttiioonn Create a symbolic link. The link is to be placed in the 884 - directory identified by VFid and named tname. It should point to the 885 - pathname srcname. The attributes of the newly created object are to 886 - be set to attr. 887 - 888 - EErrrroorrss 889 - 890 - NNOOTTEE The attributes of the target directory should be returned since 891 - its size changed. 892 - 893 - 0wpage 894 - 895 - 44..1122.. rreemmoovvee 896 - 897 - 898 - SSuummmmaarryy Remove a file 899 - 900 - AArrgguummeennttss 901 - 902 - iinn 903 - 904 - struct cfs_remove_in { 905 - ViceFid VFid; 906 - char *name; /* Place holder for data. */ 907 - } cfs_remove; 908 - 909 - 910 - 911 - oouutt 912 - none 913 - 914 - DDeessccrriippttiioonn Remove file named cfs_remove_in.name in directory 915 - identified by VFid. 916 - 917 - EErrrroorrss 918 - 919 - NNOOTTEE The attributes of the directory should be returned since its 920 - mtime and size may change. 921 - 922 - 0wpage 923 - 924 - 44..1133.. rrmmddiirr 925 - 926 - 927 - SSuummmmaarryy Remove a directory 928 - 929 - AArrgguummeennttss 930 - 931 - iinn 932 - 933 - struct cfs_rmdir_in { 934 - ViceFid VFid; 935 - char *name; /* Place holder for data. */ 936 - } cfs_rmdir; 937 - 938 - 939 - 940 - oouutt 941 - none 942 - 943 - DDeessccrriippttiioonn Remove the directory with name name from the directory 944 - identified by VFid. 945 - 946 - EErrrroorrss 947 - 948 - NNOOTTEE The attributes of the parent directory should be returned since 949 - its mtime and size may change. 
950 - 951 - 0wpage 952 - 953 - 44..1144.. rreeaaddlliinnkk 954 - 955 - 956 - SSuummmmaarryy Read the value of a symbolic link. 957 - 958 - AArrgguummeennttss 959 - 960 - iinn 961 - 962 - struct cfs_readlink_in { 963 - ViceFid VFid; 964 - } cfs_readlink; 965 - 966 - 967 - 968 - oouutt 969 - 970 - struct cfs_readlink_out { 971 - int count; 972 - caddr_t data; /* Place holder for data. */ 973 - } cfs_readlink; 974 - 975 - 976 - 977 - DDeessccrriippttiioonn This routine reads the contents of symbolic link 978 - identified by VFid into the buffer data. The buffer data must be able 979 - to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). 980 - 981 - EErrrroorrss No unusual errors. 982 - 983 - 0wpage 984 - 985 - 44..1155.. ooppeenn 986 - 987 - 988 - SSuummmmaarryy Open a file. 989 - 990 - AArrgguummeennttss 991 - 992 - iinn 993 - 994 - struct cfs_open_in { 995 - ViceFid VFid; 996 - int flags; 997 - } cfs_open; 998 - 999 - 1000 - 1001 - oouutt 1002 - 1003 - struct cfs_open_out { 1004 - dev_t dev; 1005 - ino_t inode; 1006 - } cfs_open; 1007 - 1008 - 1009 - 1010 - DDeessccrriippttiioonn This request asks Venus to place the file identified by 1011 - VFid in its cache and to note that the calling process wishes to open 1012 - it with flags as in open(2). The return value to the kernel differs 1013 - for Unix and Windows systems. For Unix systems the Coda FS Driver is 1014 - informed of the device and inode number of the container file in the 1015 - fields dev and inode. For Windows the path of the container file is 1016 - returned to the kernel. 1017 - EErrrroorrss 1018 - 1019 - NNOOTTEE Currently the cfs_open_out structure is not properly adapted to 1020 - deal with the Windows case. It might be best to implement two 1021 - upcalls, one to open aiming at a container file name, the other at a 1022 - container file inode. 1023 - 1024 - 0wpage 1025 - 1026 - 44..1166.. cclloossee 1027 - 1028 - 1029 - SSuummmmaarryy Close a file, update it on the servers. 
1030 - 1031 - AArrgguummeennttss 1032 - 1033 - iinn 1034 - 1035 - struct cfs_close_in { 1036 - ViceFid VFid; 1037 - int flags; 1038 - } cfs_close; 1039 - 1040 - 1041 - 1042 - oouutt 1043 - none 1044 - 1045 - DDeessccrriippttiioonn Close the file identified by VFid. 1046 - 1047 - EErrrroorrss 1048 - 1049 - NNOOTTEE The flags argument is bogus and not used. However, Venus' code 1050 - has room to deal with an execp input field, probably this field should 1051 - be used to inform Venus that the file was closed but is still memory 1052 - mapped for execution. There are comments about fetching versus not 1053 - fetching the data in Venus vproc_vfscalls. This seems silly. If a 1054 - file is being closed, the data in the container file is to be the new 1055 - data. Here again the execp flag might be in play to create confusion: 1056 - currently Venus might think a file can be flushed from the cache when 1057 - it is still memory mapped. This needs to be understood. 1058 - 1059 - 0wpage 1060 - 1061 - 44..1177.. iiooccttll 1062 - 1063 - 1064 - SSuummmmaarryy Do an ioctl on a file. This includes the pioctl interface. 1065 - 1066 - AArrgguummeennttss 1067 - 1068 - iinn 1069 - 1070 - struct cfs_ioctl_in { 1071 - ViceFid VFid; 1072 - int cmd; 1073 - int len; 1074 - int rwflag; 1075 - char *data; /* Place holder for data. */ 1076 - } cfs_ioctl; 1077 - 1078 - 1079 - 1080 - oouutt 1081 - 1082 - 1083 - struct cfs_ioctl_out { 1084 - int len; 1085 - caddr_t data; /* Place holder for data. */ 1086 - } cfs_ioctl; 1087 - 1088 - 1089 - 1090 - DDeessccrriippttiioonn Do an ioctl operation on a file. The command, len and 1091 - data arguments are filled as usual. flags is not used by Venus. 1092 - 1093 - EErrrroorrss 1094 - 1095 - NNOOTTEE Another bogus parameter. flags is not used. What is the 1096 - business about PREFETCHING in the Venus code? 1097 - 1098 - 1099 - 0wpage 1100 - 1101 - 44..1188.. rreennaammee 1102 - 1103 - 1104 - SSuummmmaarryy Rename a fid. 
1105 - 1106 - AArrgguummeennttss 1107 - 1108 - iinn 1109 - 1110 - struct cfs_rename_in { 1111 - ViceFid sourceFid; 1112 - char *srcname; 1113 - ViceFid destFid; 1114 - char *destname; 1115 - } cfs_rename; 1116 - 1117 - 1118 - 1119 - oouutt 1120 - none 1121 - 1122 - DDeessccrriippttiioonn Rename the object with name srcname in directory 1123 - sourceFid to destname in destFid. It is important that the names 1124 - srcname and destname are 0 terminated strings. Strings in Unix 1125 - kernels are not always null terminated. 1126 - 1127 - EErrrroorrss 1128 - 1129 - 0wpage 1130 - 1131 - 44..1199.. rreeaaddddiirr 1132 - 1133 - 1134 - SSuummmmaarryy Read directory entries. 1135 - 1136 - AArrgguummeennttss 1137 - 1138 - iinn 1139 - 1140 - struct cfs_readdir_in { 1141 - ViceFid VFid; 1142 - int count; 1143 - int offset; 1144 - } cfs_readdir; 1145 - 1146 - 1147 - 1148 - 1149 - oouutt 1150 - 1151 - struct cfs_readdir_out { 1152 - int size; 1153 - caddr_t data; /* Place holder for data. */ 1154 - } cfs_readdir; 1155 - 1156 - 1157 - 1158 - DDeessccrriippttiioonn Read directory entries from VFid starting at offset and 1159 - read at most count bytes. Returns the data in data and returns 1160 - the size in size. 1161 - 1162 - EErrrroorrss 1163 - 1164 - NNOOTTEE This call is not used. Readdir operations exploit container 1165 - files. We will re-evaluate this during the directory revamp which is 1166 - about to take place. 1167 - 1168 - 0wpage 1169 - 1170 - 44..2200.. vvggeett 1171 - 1172 - 1173 - SSuummmmaarryy instructs Venus to do an FSDB->Get. 1174 - 1175 - AArrgguummeennttss 1176 - 1177 - iinn 1178 - 1179 - struct cfs_vget_in { 1180 - ViceFid VFid; 1181 - } cfs_vget; 1182 - 1183 - 1184 - 1185 - oouutt 1186 - 1187 - struct cfs_vget_out { 1188 - ViceFid VFid; 1189 - int vtype; 1190 - } cfs_vget; 1191 - 1192 - 1193 - 1194 - DDeessccrriippttiioonn This upcall asks Venus to do a get operation on an fsobj 1195 - labelled by VFid. 
1196 - 1197 - EErrrroorrss 1198 - 1199 - NNOOTTEE This operation is not used. However, it is extremely useful 1200 - since it can be used to deal with read/write memory mapped files. 1201 - These can be "pinned" in the Venus cache using vget and released with 1202 - inactive. 1203 - 1204 - 0wpage 1205 - 1206 - 44..2211.. ffssyynncc 1207 - 1208 - 1209 - SSuummmmaarryy Tell Venus to update the RVM attributes of a file. 1210 - 1211 - AArrgguummeennttss 1212 - 1213 - iinn 1214 - 1215 - struct cfs_fsync_in { 1216 - ViceFid VFid; 1217 - } cfs_fsync; 1218 - 1219 - 1220 - 1221 - oouutt 1222 - none 1223 - 1224 - DDeessccrriippttiioonn Ask Venus to update RVM attributes of object VFid. This 1225 - should be called as part of kernel level fsync type calls. The 1226 - result indicates if the syncing was successful. 1227 - 1228 - EErrrroorrss 1229 - 1230 - NNOOTTEE Linux does not implement this call. It should. 1231 - 1232 - 0wpage 1233 - 1234 - 44..2222.. iinnaaccttiivvee 1235 - 1236 - 1237 - SSuummmmaarryy Tell Venus a vnode is no longer in use. 1238 - 1239 - AArrgguummeennttss 1240 - 1241 - iinn 1242 - 1243 - struct cfs_inactive_in { 1244 - ViceFid VFid; 1245 - } cfs_inactive; 1246 - 1247 - 1248 - 1249 - oouutt 1250 - none 1251 - 1252 - DDeessccrriippttiioonn This operation returns EOPNOTSUPP. 1253 - 1254 - EErrrroorrss 1255 - 1256 - NNOOTTEE This should perhaps be removed. 1257 - 1258 - 0wpage 1259 - 1260 - 44..2233.. rrddwwrr 1261 - 1262 - 1263 - SSuummmmaarryy Read or write from a file 1264 - 1265 - AArrgguummeennttss 1266 - 1267 - iinn 1268 - 1269 - struct cfs_rdwr_in { 1270 - ViceFid VFid; 1271 - int rwflag; 1272 - int count; 1273 - int offset; 1274 - int ioflag; 1275 - caddr_t data; /* Place holder for data. */ 1276 - } cfs_rdwr; 1277 - 1278 - 1279 - 1280 - 1281 - oouutt 1282 - 1283 - struct cfs_rdwr_out { 1284 - int rwflag; 1285 - int count; 1286 - caddr_t data; /* Place holder for data. 
*/ 1287 - } cfs_rdwr; 1288 - 1289 - 1290 - 1291 - DDeessccrriippttiioonn This upcall asks Venus to read or write from a file. 1292 - 1293 - EErrrroorrss 1294 - 1295 - NNOOTTEE It should be removed since it is against the Coda philosophy that 1296 - read/write operations never reach Venus. I have been told the 1297 - operation does not work. It is not currently used. 1298 - 1299 - 1300 - 0wpage 1301 - 1302 - 44..2244.. ooddyymmoouunntt 1303 - 1304 - 1305 - SSuummmmaarryy Allows mounting multiple Coda "filesystems" on one Unix mount 1306 - point. 1307 - 1308 - AArrgguummeennttss 1309 - 1310 - iinn 1311 - 1312 - struct ody_mount_in { 1313 - char *name; /* Place holder for data. */ 1314 - } ody_mount; 1315 - 1316 - 1317 - 1318 - oouutt 1319 - 1320 - struct ody_mount_out { 1321 - ViceFid VFid; 1322 - } ody_mount; 1323 - 1324 - 1325 - 1326 - DDeessccrriippttiioonn Asks Venus to return the rootfid of a Coda system named 1327 - name. The fid is returned in VFid. 1328 - 1329 - EErrrroorrss 1330 - 1331 - NNOOTTEE This call was used by David for dynamic sets. It should be 1332 - removed since it causes a jungle of pointers in the VFS mounting area. 1333 - It is not used by Coda proper. Call is not implemented by Venus. 1334 - 1335 - 0wpage 1336 - 1337 - 44..2255.. ooddyy__llooookkuupp 1338 - 1339 - 1340 - SSuummmmaarryy Looks up something. 1341 - 1342 - AArrgguummeennttss 1343 - 1344 - iinn irrelevant 1345 - 1346 - 1347 - oouutt 1348 - irrelevant 1349 - 1350 - DDeessccrriippttiioonn 1351 - 1352 - EErrrroorrss 1353 - 1354 - NNOOTTEE Gut it. Call is not implemented by Venus. 1355 - 1356 - 0wpage 1357 - 1358 - 44..2266.. ooddyy__eexxppaanndd 1359 - 1360 - 1361 - SSuummmmaarryy expands something in a dynamic set. 1362 - 1363 - AArrgguummeennttss 1364 - 1365 - iinn irrelevant 1366 - 1367 - oouutt 1368 - irrelevant 1369 - 1370 - DDeessccrriippttiioonn 1371 - 1372 - EErrrroorrss 1373 - 1374 - NNOOTTEE Gut it. Call is not implemented by Venus. 
1375 - 1376 - 0wpage 1377 - 1378 - 44..2277.. pprreeffeettcchh 1379 - 1380 - 1381 - SSuummmmaarryy Prefetch a dynamic set. 1382 - 1383 - AArrgguummeennttss 1384 - 1385 - iinn Not documented. 1386 - 1387 - oouutt 1388 - Not documented. 1389 - 1390 - DDeessccrriippttiioonn Venus worker.cc has support for this call, although it is 1391 - noted that it doesn't work. Not surprising, since the kernel does not 1392 - have support for it. (ODY_PREFETCH is not a defined operation). 1393 - 1394 - EErrrroorrss 1395 - 1396 - NNOOTTEE Gut it. It isn't working and isn't used by Coda. 1397 - 1398 - 1399 - 0wpage 1400 - 1401 - 44..2288.. ssiiggnnaall 1402 - 1403 - 1404 - SSuummmmaarryy Send Venus a signal about an upcall. 1405 - 1406 - AArrgguummeennttss 1407 - 1408 - iinn none 1409 - 1410 - oouutt 1411 - not applicable. 1412 - 1413 - DDeessccrriippttiioonn This is an out-of-band upcall to Venus to inform Venus 1414 - that the calling process received a signal after Venus read the 1415 - message from the input queue. Venus is supposed to clean up the 1416 - operation. 1417 - 1418 - EErrrroorrss No reply is given. 1419 - 1420 - NNOOTTEE We need to better understand what Venus needs to clean up and if 1421 - it is doing this correctly. Also we need to handle multiple upcall 1422 - per system call situations correctly. It would be important to know 1423 - what state changes in Venus take place after an upcall for which the 1424 - kernel is responsible for notifying Venus to clean up (e.g. open 1425 - definitely is such a state change, but many others are maybe not). 1426 - 1427 - 0wpage 1428 - 1429 - 55.. TThhee mmiinniiccaacchhee aanndd ddoowwnnccaallllss 1430 - 1431 - 1432 - The Coda FS Driver can cache results of lookup and access upcalls, to 1433 - limit the frequency of upcalls. Upcalls carry a price since a process 1434 - context switch needs to take place. 
The counterpart of caching the 1435 - information is that Venus will notify the FS Driver that cached 1436 - entries must be flushed or renamed. 1437 - 1438 - The kernel code generally has to maintain a structure which links the 1439 - internal file handles (called vnodes in BSD, inodes in Linux and 1440 - FileHandles in Windows) with the ViceFid's which Venus maintains. The 1441 - reason is that frequent translations back and forth are needed in 1442 - order to make upcalls and use the results of upcalls. Such linking 1443 - objects are called cnodes. 1444 - 1445 - The current minicache implementations have cache entries which record 1446 - the following: 1447 - 1448 - 1. the name of the file 1449 - 1450 - 2. the cnode of the directory containing the object 1451 - 1452 - 3. a list of CodaCred's for which the lookup is permitted. 1453 - 1454 - 4. the cnode of the object 1455 - 1456 - The lookup call in the Coda FS Driver may request the cnode of the 1457 - desired object from the cache, by passing its name, directory and the 1458 - CodaCred's of the caller. The cache will return the cnode or indicate 1459 - that it cannot be found. The Coda FS Driver must be careful to 1460 - invalidate cache entries when it modifies or removes objects. 1461 - 1462 - When Venus obtains information that indicates that cache entries are 1463 - no longer valid, it will make a downcall to the kernel. Downcalls are 1464 - intercepted by the Coda FS Driver and lead to cache invalidations of 1465 - the kind described below. The Coda FS Driver does not return an error 1466 - unless the downcall data could not be read into kernel memory. 1467 - 1468 - 1469 - 5.1. INVALIDATE 1470 - 1471 - 1472 - No information is available on this call. 1473 - 1474 - 1475 - 5.2. FLUSH 1476 - 1477 - 1478 - 1479 - Arguments None 1480 - 1481 - Summary Flush the name cache entirely.
1482 - 1483 - DDeessccrriippttiioonn Venus issues this call upon startup and when it dies. This 1484 - is to prevent stale cache information being held. Some operating 1485 - systems allow the kernel name cache to be switched off dynamically. 1486 - When this is done, this downcall is made. 1487 - 1488 - 1489 - 55..33.. PPUURRGGEEUUSSEERR 1490 - 1491 - 1492 - AArrgguummeennttss 1493 - 1494 - struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ 1495 - struct CodaCred cred; 1496 - } cfs_purgeuser; 1497 - 1498 - 1499 - 1500 - DDeessccrriippttiioonn Remove all entries in the cache carrying the Cred. This 1501 - call is issued when tokens for a user expire or are flushed. 1502 - 1503 - 1504 - 55..44.. ZZAAPPFFIILLEE 1505 - 1506 - 1507 - AArrgguummeennttss 1508 - 1509 - struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ 1510 - ViceFid CodaFid; 1511 - } cfs_zapfile; 1512 - 1513 - 1514 - 1515 - DDeessccrriippttiioonn Remove all entries which have the (dir vnode, name) pair. 1516 - This is issued as a result of an invalidation of cached attributes of 1517 - a vnode. 1518 - 1519 - NNOOTTEE Call is not named correctly in NetBSD and Mach. The minicache 1520 - zapfile routine takes different arguments. Linux does not implement 1521 - the invalidation of attributes correctly. 1522 - 1523 - 1524 - 1525 - 55..55.. ZZAAPPDDIIRR 1526 - 1527 - 1528 - AArrgguummeennttss 1529 - 1530 - struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ 1531 - ViceFid CodaFid; 1532 - } cfs_zapdir; 1533 - 1534 - 1535 - 1536 - DDeessccrriippttiioonn Remove all entries in the cache lying in a directory 1537 - CodaFid, and all children of this directory. This call is issued when 1538 - Venus receives a callback on the directory. 1539 - 1540 - 1541 - 55..66.. 
ZZAAPPVVNNOODDEE 1542 - 1543 - 1544 - 1545 - AArrgguummeennttss 1546 - 1547 - struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ 1548 - struct CodaCred cred; 1549 - ViceFid VFid; 1550 - } cfs_zapvnode; 1551 - 1552 - 1553 - 1554 - DDeessccrriippttiioonn Remove all entries in the cache carrying the cred and VFid 1555 - as in the arguments. This downcall is probably never issued. 1556 - 1557 - 1558 - 55..77.. PPUURRGGEEFFIIDD 1559 - 1560 - 1561 - SSuummmmaarryy 1562 - 1563 - AArrgguummeennttss 1564 - 1565 - struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ 1566 - ViceFid CodaFid; 1567 - } cfs_purgefid; 1568 - 1569 - 1570 - 1571 - DDeessccrriippttiioonn Flush the attribute for the file. If it is a dir (odd 1572 - vnode), purge its children from the namecache and remove the file from the 1573 - namecache. 1574 - 1575 - 1576 - 1577 - 55..88.. RREEPPLLAACCEE 1578 - 1579 - 1580 - SSuummmmaarryy Replace the Fid's for a collection of names. 1581 - 1582 - AArrgguummeennttss 1583 - 1584 - struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ 1585 - ViceFid NewFid; 1586 - ViceFid OldFid; 1587 - } cfs_replace; 1588 - 1589 - 1590 - 1591 - DDeessccrriippttiioonn This routine replaces a ViceFid in the name cache with 1592 - another. It is added to allow Venus during reintegration to replace 1593 - locally allocated temp fids while disconnected with global fids even 1594 - when the reference counts on those fids are not zero. 1595 - 1596 - 0wpage 1597 - 1598 - 66.. IInniittiiaalliizzaattiioonn aanndd cclleeaannuupp 1599 - 1600 - 1601 - This section gives brief hints as to desirable features for the Coda 1602 - FS Driver at startup and upon shutdown or Venus failures. Before 1603 - entering the discussion it is useful to repeat that the Coda FS Driver 1604 - maintains the following data: 1605 - 1606 - 1607 - 1. message queues 1608 - 1609 - 2. cnodes 1610 - 1611 - 3. 
name cache entries 1612 - 1613 - The name cache entries are entirely private to the driver, so they 1614 - can easily be manipulated. The message queues will generally have 1615 - clear points of initialization and destruction. The cnodes are 1616 - much more delicate. User processes hold reference counts in Coda 1617 - filesystems and it can be difficult to clean up the cnodes. 1618 - 1619 - It can expect requests through: 1620 - 1621 - 1. the message subsystem 1622 - 1623 - 2. the VFS layer 1624 - 1625 - 3. pioctl interface 1626 - 1627 - Currently the pioctl passes through the VFS for Coda so we can 1628 - treat these similarly. 1629 - 1630 - 1631 - 6.1. Requirements 1632 - 1633 - 1634 - The following requirements should be accommodated: 1635 - 1636 - 1. The message queues should have open and close routines. On Unix 1637 - the opening of the character devices are such routines. 1638 - 1639 - * Before opening, no messages can be placed. 1640 - 1641 - * Opening will remove any old messages still pending. 1642 - 1643 - * Close will notify any sleeping processes that their upcall cannot 1644 - be completed. 1645 - 1646 - * Close will free all memory allocated by the message queues. 1647 - 1648 - 1649 - 2. At open the namecache shall be initialized to empty state. 1650 - 1651 - 3. Before the message queues are open, all VFS operations will fail. 1652 - Fortunately this can be achieved by making sure that mounting the 1653 - Coda filesystem cannot succeed before opening. 1654 - 1655 - 4. After closing of the queues, no VFS operations can succeed. Here 1656 - one needs to be careful, since a few operations (lookup, 1657 - read/write, readdir) can proceed without upcalls. These must be 1658 - explicitly blocked. 1659 - 1660 - 5. Upon closing the namecache shall be flushed and disabled. 1661 - 1662 - 6. All memory held by cnodes can be freed without relying on upcalls. 1663 - 1664 - 7.
Unmounting the file system can be done without relying on upcalls. 1665 - 1666 - 8. Mounting the Coda filesystem should fail gracefully if Venus cannot 1667 - get the rootfid or the attributes of the rootfid. The latter is 1668 - best implemented by Venus fetching these objects before attempting 1669 - to mount. 1670 - 1671 - NOTE NetBSD in particular but also Linux have not implemented the 1672 - above requirements fully. For smooth operation this needs to be 1673 - corrected. 1674 - 1675 - 1676 -

+535

Documentation/filesystems/configfs.rst

··· 1 + ======================================================= 2 + Configfs - Userspace-driven Kernel Object Configuration 3 + ======================================================= 4 + 5 + Joel Becker <joel.becker@oracle.com> 6 + 7 + Updated: 31 March 2005 8 + 9 + Copyright (c) 2005 Oracle Corporation, 10 + Joel Becker <joel.becker@oracle.com> 11 + 12 + 13 + What is configfs? 14 + ================= 15 + 16 + configfs is a ram-based filesystem that provides the converse of 17 + sysfs's functionality. Where sysfs is a filesystem-based view of 18 + kernel objects, configfs is a filesystem-based manager of kernel 19 + objects, or config_items. 20 + 21 + With sysfs, an object is created in kernel (for example, when a device 22 + is discovered) and it is registered with sysfs. Its attributes then 23 + appear in sysfs, allowing userspace to read the attributes via 24 + readdir(3)/read(2). It may allow some attributes to be modified via 25 + write(2). The important point is that the object is created and 26 + destroyed in kernel, the kernel controls the lifecycle of the sysfs 27 + representation, and sysfs is merely a window on all this. 28 + 29 + A configfs config_item is created via an explicit userspace operation: 30 + mkdir(2). It is destroyed via rmdir(2). The attributes appear at 31 + mkdir(2) time, and can be read or modified via read(2) and write(2). 32 + As with sysfs, readdir(3) queries the list of items and/or attributes. 33 + symlink(2) can be used to group items together. Unlike sysfs, the 34 + lifetime of the representation is completely driven by userspace. The 35 + kernel modules backing the items must respond to this. 36 + 37 + Both sysfs and configfs can and should exist together on the same 38 + system. One is not a replacement for the other. 39 + 40 + Using configfs 41 + ============== 42 + 43 + configfs can be compiled as a module or into the kernel. 
You can access 44 + it by doing:: 45 + 46 + mount -t configfs none /config 47 + 48 + The configfs tree will be empty unless client modules are also loaded. 49 + These are modules that register their item types with configfs as 50 + subsystems. Once a client subsystem is loaded, it will appear as a 51 + subdirectory (or more than one) under /config. Like sysfs, the 52 + configfs tree is always there, whether mounted on /config or not. 53 + 54 + An item is created via mkdir(2). The item's attributes will also 55 + appear at this time. readdir(3) can determine what the attributes are, 56 + read(2) can query their default values, and write(2) can store new 57 + values. Don't mix more than one attribute in one attribute file. 58 + 59 + There are two types of configfs attributes: 60 + 61 + * Normal attributes, which, similar to sysfs attributes, are small ASCII text 62 + files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably 63 + only one value per file should be used, and the same caveats from sysfs apply. 64 + Configfs expects write(2) to store the entire buffer at once. When writing to 65 + normal configfs attributes, userspace processes should first read the entire 66 + file, modify the portions they wish to change, and then write the entire 67 + buffer back. 68 + 69 + * Binary attributes, which are somewhat similar to sysfs binary attributes, 70 + but with a few slight changes to semantics. The PAGE_SIZE limitation does not 71 + apply, but the whole binary item must fit in a single kernel vmalloc'ed buffer. 72 + The write(2) calls from user space are buffered, and the attribute's 73 + write_bin_attribute method will be invoked on the final close; therefore it is 74 + imperative for user-space to check the return code of close(2) in order to 75 + verify that the operation finished successfully. 76 + To avoid a malicious user OOMing the kernel, there's a per-binary attribute 77 + maximum buffer value.
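The two write rules above (store the whole value in one write(2), and check the return code of close(2)) can be sketched from the userspace side. This is a minimal illustration only: the helper name attr_write_all is hypothetical and is not part of configfs or any library, and it uses nothing beyond the standard open(2)/write(2)/close(2) calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes to the attribute file at `path`
 * with a single write(2), then report any error from close(2).
 * Returns 0 on success, or a negative errno value on failure.
 */
int attr_write_all(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	int err = 0;
	ssize_t n = write(fd, buf, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;	/* short write: the value was not stored whole */

	/* For binary attributes the store only happens on the final close,
	 * so a failure reported by close(2) must not be ignored. */
	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}
```

A configuration tool could call it as attr_write_all("/config/fakenbd/disk1/rw", "1\n", 2), assuming the fictional FakeNBD layout used in this document, and treat any non-zero result as a failed store.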
78 + 79 + When an item needs to be destroyed, remove it with rmdir(2). An 80 + item cannot be destroyed if any other item has a link to it (via 81 + symlink(2)). Links can be removed via unlink(2). 82 + 83 + Configuring FakeNBD: an Example 84 + =============================== 85 + 86 + Imagine there's a Network Block Device (NBD) driver that allows you to 87 + access remote block devices. Call it FakeNBD. FakeNBD uses configfs 88 + for its configuration. Obviously, there will be a nice program that 89 + sysadmins use to configure FakeNBD, but somehow that program has to tell 90 + the driver about it. Here's where configfs comes in. 91 + 92 + When the FakeNBD driver is loaded, it registers itself with configfs. 93 + readdir(3) sees this just fine:: 94 + 95 + # ls /config 96 + fakenbd 97 + 98 + A fakenbd connection can be created with mkdir(2). The name is 99 + arbitrary, but likely the tool will make some use of the name. Perhaps 100 + it is a uuid or a disk name:: 101 + 102 + # mkdir /config/fakenbd/disk1 103 + # ls /config/fakenbd/disk1 104 + target device rw 105 + 106 + The target attribute contains the IP address of the server FakeNBD will 107 + connect to. The device attribute is the device on the server. 108 + Predictably, the rw attribute determines whether the connection is 109 + read-only or read-write:: 110 + 111 + # echo 10.0.0.1 > /config/fakenbd/disk1/target 112 + # echo /dev/sda1 > /config/fakenbd/disk1/device 113 + # echo 1 > /config/fakenbd/disk1/rw 114 + 115 + That's it. That's all there is. Now the device is configured, via the 116 + shell no less. 117 + 118 + Coding With configfs 119 + ==================== 120 + 121 + Every object in configfs is a config_item. A config_item reflects an 122 + object in the subsystem. It has attributes that match values on that 123 + object. configfs handles the filesystem representation of that object 124 + and its attributes, allowing the subsystem to ignore all but the 125 + basic show/store interaction. 
126 + 127 + Items are created and destroyed inside a config_group. A group is a 128 + collection of items that share the same attributes and operations. 129 + Items are created by mkdir(2) and removed by rmdir(2), but configfs 130 + handles that. The group has a set of operations to perform these tasks. 131 + 132 + A subsystem is the top level of a client module. During initialization, 133 + the client module registers the subsystem with configfs, and the subsystem 134 + appears as a directory at the top of the configfs filesystem. A 135 + subsystem is also a config_group, and can do everything a config_group 136 + can. 137 + 138 + struct config_item 139 + ================== 140 + 141 + :: 142 + 143 + struct config_item { 144 + char *ci_name; 145 + char ci_namebuf[UOBJ_NAME_LEN]; 146 + struct kref ci_kref; 147 + struct list_head ci_entry; 148 + struct config_item *ci_parent; 149 + struct config_group *ci_group; 150 + struct config_item_type *ci_type; 151 + struct dentry *ci_dentry; 152 + }; 153 + 154 + void config_item_init(struct config_item *); 155 + void config_item_init_type_name(struct config_item *, 156 + const char *name, 157 + struct config_item_type *type); 158 + struct config_item *config_item_get(struct config_item *); 159 + void config_item_put(struct config_item *); 160 + 161 + Generally, struct config_item is embedded in a container structure, a 162 + structure that actually represents what the subsystem is doing. The 163 + config_item portion of that structure is how the object interacts with 164 + configfs. 165 + 166 + Whether statically defined in a source file or created by a parent 167 + config_group, a config_item must have one of the _init() functions 168 + called on it. This initializes the reference count and sets up the 169 + appropriate fields. 170 + 171 + All users of a config_item should have a reference on it via 172 + config_item_get(), and drop the reference when they are done via 173 + config_item_put().
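The embedding and reference-counting pattern described above can be illustrated with a small userspace mock. Everything here is a stand-in for illustration: struct mock_config_item, the mock_item_*() functions and struct fakenbd_disk are hypothetical and only imitate the shape of the kernel API; the real struct config_item uses a struct kref and lives in kernel headers.

```c
#include <stddef.h>

/* Userspace stand-in for struct config_item (illustration only). */
struct mock_config_item {
	const char *ci_name;
	int refcount;		/* plays the role of ci_kref */
};

/* Hypothetical container: the object the subsystem really cares about. */
struct fakenbd_disk {
	struct mock_config_item item;	/* embedded, not pointed to */
	int rw;
};

/* Same idea as the kernel's container_of(): member pointer -> container. */
#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fakenbd_disk *to_fakenbd_disk(struct mock_config_item *item)
{
	return item ? mock_container_of(item, struct fakenbd_disk, item) : NULL;
}

/* Mirrors the role of the _init() functions: one initial reference. */
void mock_item_init(struct mock_config_item *item, const char *name)
{
	item->ci_name = name;
	item->refcount = 1;
}

/* Take a reference before using the item. */
struct mock_config_item *mock_item_get(struct mock_config_item *item)
{
	if (item)
		item->refcount++;
	return item;
}

/* Drop a reference; returns 1 when the last reference went away. */
int mock_item_put(struct mock_config_item *item)
{
	return --item->refcount == 0;
}
```

In the mock, the point at which mock_item_put() returns 1 corresponds to the moment the reference count reaches zero and the containing object may be released.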
174 + 175 + By itself, a config_item cannot do much more than appear in configfs. 176 + Usually a subsystem wants the item to display and/or store attributes, 177 + among other things. For that, it needs a type. 178 + 179 + struct config_item_type 180 + ======================= 181 + 182 + :: 183 + 184 + struct configfs_item_operations { 185 + void (*release)(struct config_item *); 186 + int (*allow_link)(struct config_item *src, 187 + struct config_item *target); 188 + void (*drop_link)(struct config_item *src, 189 + struct config_item *target); 190 + }; 191 + 192 + struct config_item_type { 193 + struct module *ct_owner; 194 + struct configfs_item_operations *ct_item_ops; 195 + struct configfs_group_operations *ct_group_ops; 196 + struct configfs_attribute **ct_attrs; 197 + struct configfs_bin_attribute **ct_bin_attrs; 198 + }; 199 + 200 + The most basic function of a config_item_type is to define what 201 + operations can be performed on a config_item. All items that have been 202 + allocated dynamically will need to provide the ct_item_ops->release() 203 + method. This method is called when the config_item's reference count 204 + reaches zero. 205 + 206 + struct configfs_attribute 207 + ========================= 208 + 209 + :: 210 + 211 + struct configfs_attribute { 212 + char *ca_name; 213 + struct module *ca_owner; 214 + umode_t ca_mode; 215 + ssize_t (*show)(struct config_item *, char *); 216 + ssize_t (*store)(struct config_item *, const char *, size_t); 217 + }; 218 + 219 + When a config_item wants an attribute to appear as a file in the item's 220 + configfs directory, it must define a configfs_attribute describing it. 221 + It then adds the attribute to the NULL-terminated array 222 + config_item_type->ct_attrs. When the item appears in configfs, the 223 + attribute file will appear with the configfs_attribute->ca_name 224 + filename. configfs_attribute->ca_mode specifies the file permissions. 
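As a rough illustration of the NULL-terminated attribute array, the following userspace model defines one attribute and looks it up by name. It is only a sketch under stated assumptions: struct mock_attr mirrors the shape of struct configfs_attribute shown above, but mock_item, rw_show(), rw_store(), and find_attr() are hypothetical names, the store buffer is assumed to be NUL-terminated, and -1 stands in for the negative errno a kernel method would return.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical item with a single configurable value. */
struct mock_item {
	int rw;
};

/* Userspace stand-in shaped like struct configfs_attribute. */
struct mock_attr {
	const char *name;                                   /* models ca_name */
	unsigned int mode;                                  /* models ca_mode */
	ssize_t (*show)(struct mock_item *, char *);
	ssize_t (*store)(struct mock_item *, const char *, size_t);
};

/* Format the value into the caller's buffer; return the byte count. */
ssize_t rw_show(struct mock_item *item, char *page)
{
	assert(page != NULL);
	return sprintf(page, "%d\n", item->rw);
}

/* Parse "0" or "1"; reject anything else without touching the item. */
ssize_t rw_store(struct mock_item *item, const char *page, size_t count)
{
	char *end;
	long val = strtol(page, &end, 10);

	if (end == page || (val != 0 && val != 1))
		return -1;      /* kernel code would return e.g. -EINVAL */
	item->rw = (int)val;
	return (ssize_t)count;
}

struct mock_attr attr_rw = { "rw", 0644, rw_show, rw_store };

/* NULL-terminated, in the style of config_item_type->ct_attrs. */
struct mock_attr *conn_attrs[] = { &attr_rw, NULL };

/* Walk the array until the NULL terminator, matching on name. */
struct mock_attr *find_attr(struct mock_attr **attrs, const char *name)
{
	for (; *attrs; attrs++)
		if (strcmp((*attrs)->name, name) == 0)
			return *attrs;
	return NULL;
}
```

Returning the full count from the store method signals that the whole buffer was consumed, which matches the one-value-per-file convention.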
225 + 226 + If an attribute is readable and provides a ->show method, that method will 227 + be called whenever userspace asks for a read(2) on the attribute. If an 228 + attribute is writable and provides a ->store method, that method will 229 + be called whenever userspace asks for a write(2) on the attribute. 230 + 231 + struct configfs_bin_attribute 232 + ============================= 233 + 234 + :: 235 + 236 + struct configfs_bin_attribute { 237 + struct configfs_attribute cb_attr; 238 + void *cb_private; 239 + size_t cb_max_size; 240 + }; 241 + 242 + The binary attribute is used when a binary blob needs to 243 + appear as the contents of a file in the item's configfs directory. 244 + To do so, add the binary attribute to the NULL-terminated array 245 + config_item_type->ct_bin_attrs. When the item appears in configfs, the 246 + attribute file will appear with the configfs_bin_attribute->cb_attr.ca_name 247 + filename. configfs_bin_attribute->cb_attr.ca_mode specifies the file 248 + permissions. 249 + The cb_private member is provided for use by the driver, while the 250 + cb_max_size member specifies the maximum amount of vmalloc buffer 251 + to be used. 252 + 253 + If a binary attribute is readable and the config_item provides a 254 + ct_item_ops->read_bin_attribute() method, that method will be called 255 + whenever userspace asks for a read(2) on the attribute. The converse 256 + will happen for write(2). The reads/writes are buffered so only a 257 + single read/write will occur; the attribute's method need not concern itself 258 + with it. 259 + 260 + struct config_group 261 + =================== 262 + 263 + A config_item cannot live in a vacuum. The only way one can be created 264 + is via mkdir(2) on a config_group. 
This will trigger creation of a 265 + child item:: 266 + 267 + struct config_group { 268 + struct config_item cg_item; 269 + struct list_head cg_children; 270 + struct configfs_subsystem *cg_subsys; 271 + struct list_head default_groups; 272 + struct list_head group_entry; 273 + }; 274 + 275 + void config_group_init(struct config_group *group); 276 + void config_group_init_type_name(struct config_group *group, 277 + const char *name, 278 + struct config_item_type *type); 279 + 280 + 281 + The config_group structure contains a config_item. Properly configuring 282 + that item means that a group can behave as an item in its own right. 283 + However, it can do more: it can create child items or groups. This is 284 + accomplished via the group operations specified on the group's 285 + config_item_type:: 286 + 287 + struct configfs_group_operations { 288 + struct config_item *(*make_item)(struct config_group *group, 289 + const char *name); 290 + struct config_group *(*make_group)(struct config_group *group, 291 + const char *name); 292 + int (*commit_item)(struct config_item *item); 293 + void (*disconnect_notify)(struct config_group *group, 294 + struct config_item *item); 295 + void (*drop_item)(struct config_group *group, 296 + struct config_item *item); 297 + }; 298 + 299 + A group creates child items by providing the 300 + ct_group_ops->make_item() method. If provided, this method is called from 301 + mkdir(2) in the group's directory. The subsystem allocates a new 302 + config_item (or more likely, its container structure), initializes it, 303 + and returns it to configfs. Configfs will then populate the filesystem 304 + tree to reflect the new item. 305 + 306 + If the subsystem wants the child to be a group itself, the subsystem 307 + provides ct_group_ops->make_group(). Everything else behaves the same, 308 + using the group _init() functions on the group. 
309 + 310 + Finally, when userspace calls rmdir(2) on the item or group, 311 + ct_group_ops->drop_item() is called. As a config_group is also a 312 + config_item, it is not necessary for a separate drop_group() method. 313 + The subsystem must config_item_put() the reference that was initialized 314 + upon item allocation. If a subsystem has no work to do, it may omit 315 + the ct_group_ops->drop_item() method, and configfs will call 316 + config_item_put() on the item on behalf of the subsystem. 317 + 318 + Important: 319 + drop_item() is void, and as such cannot fail. When rmdir(2) 320 + is called, configfs WILL remove the item from the filesystem tree 321 + (assuming that it has no children to keep it busy). The subsystem is 322 + responsible for responding to this. If the subsystem has references to 323 + the item in other threads, the memory is safe. It may take some time 324 + for the item to actually disappear from the subsystem's usage. But it 325 + is gone from configfs. 326 + 327 + When drop_item() is called, the item's linkage has already been torn 328 + down. It no longer has a reference on its parent and has no place in 329 + the item hierarchy. If a client needs to do some cleanup before this 330 + teardown happens, the subsystem can implement the 331 + ct_group_ops->disconnect_notify() method. The method is called after 332 + configfs has removed the item from the filesystem view but before the 333 + item is removed from its parent group. Like drop_item(), 334 + disconnect_notify() is void and cannot fail. Client subsystems should 335 + not drop any references here, as they still must do it in drop_item(). 336 + 337 + A config_group cannot be removed while it still has child items. This 338 + is implemented in the configfs rmdir(2) code. ->drop_item() will not be 339 + called, as the item has not been dropped. rmdir(2) will fail, as the 340 + directory is not empty. 
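The mkdir(2)/rmdir(2) lifecycle described above can be sketched as a userspace model. Everything here is an assumption made for illustration: the mock_* types, the fixed-size child array standing in for cg_children, and the simple_make_item()/simple_drop_item() callbacks are hypothetical, and the real configfs code does considerably more (locking, dentry handling, error codes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHILDREN 8

struct mock_group;

/* Userspace stand-in for struct config_item. */
struct mock_item {
	char name[32];
	int refcount;
	struct mock_group *parent;                /* models ci_parent */
};

/* Models the two group operations used in this sketch. */
struct mock_group_ops {
	struct mock_item *(*make_item)(struct mock_group *group,
				       const char *name);
	void (*drop_item)(struct mock_group *group, struct mock_item *item);
};

struct mock_group {
	struct mock_item item;
	struct mock_item *children[MAX_CHILDREN]; /* models cg_children */
	int nr_children;
	const struct mock_group_ops *ops;
};

int drops_seen;                               /* counts drop_item() calls */

void mock_item_put(struct mock_item *item)
{
	if (--item->refcount == 0)
		free(item);
}

struct mock_item *simple_make_item(struct mock_group *group, const char *name)
{
	struct mock_item *item = calloc(1, sizeof(*item));

	(void)group;
	if (!item)
		return NULL;
	strncpy(item->name, name, sizeof(item->name) - 1);
	item->refcount = 1;       /* the reference drop_item() must put */
	return item;
}

void simple_drop_item(struct mock_group *group, struct mock_item *item)
{
	(void)group;
	drops_seen++;
	mock_item_put(item);
}

const struct mock_group_ops simple_ops = { simple_make_item, simple_drop_item };

/* Models mkdir(2) in the group's directory. */
int mock_mkdir(struct mock_group *group, const char *name)
{
	struct mock_item *item;

	assert(group != NULL && name != NULL);
	if (!group->ops || !group->ops->make_item ||
	    group->nr_children == MAX_CHILDREN)
		return -1;
	item = group->ops->make_item(group, name);
	if (!item)
		return -1;
	item->parent = group;
	group->children[group->nr_children++] = item;
	return 0;
}

/* Models rmdir(2) on a child: unlink first, then notify the subsystem. */
int mock_rmdir(struct mock_group *group, const char *name)
{
	int i;

	for (i = 0; i < group->nr_children; i++) {
		struct mock_item *item = group->children[i];

		if (strcmp(item->name, name) != 0)
			continue;
		group->children[i] = group->children[--group->nr_children];
		item->parent = NULL;
		if (group->ops->drop_item)
			group->ops->drop_item(group, item);
		else
			mock_item_put(item);  /* default when drop_item is omitted */
		return 0;
	}
	return -1;
}

/* A group that still has children cannot itself be removed. */
int mock_group_removable(const struct mock_group *group)
{
	return group->nr_children == 0;
}
```

Note that the drop callback has no way to veto the removal: by the time it runs, the item is already unlinked from its parent in this model, mirroring the void return type of drop_item().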
341 + 342 + struct configfs_subsystem 343 + ========================= 344 + 345 + A subsystem must register itself, usually at module_init time. This 346 + tells configfs to make the subsystem appear in the file tree:: 347 + 348 + struct configfs_subsystem { 349 + struct config_group su_group; 350 + struct mutex su_mutex; 351 + }; 352 + 353 + int configfs_register_subsystem(struct configfs_subsystem *subsys); 354 + void configfs_unregister_subsystem(struct configfs_subsystem *subsys); 355 + 356 + A subsystem consists of a toplevel config_group and a mutex. 357 + The group is where child config_items are created. For a subsystem, 358 + this group is usually defined statically. Before calling 359 + configfs_register_subsystem(), the subsystem must have initialized the 360 + group via the usual group _init() functions, and it must also have 361 + initialized the mutex. 362 + 363 + When the register call returns, the subsystem is live, and it 364 + will be visible via configfs. At that point, mkdir(2) can be called and 365 + the subsystem must be ready for it. 366 + 367 + An Example 368 + ========== 369 + 370 + The best example of these basic concepts is the simple_children 371 + subsystem/group and the simple_child item in 372 + samples/configfs/configfs_sample.c. It shows a trivial object displaying 373 + and storing an attribute, and a simple group creating and destroying 374 + these children. 375 + 376 + Hierarchy Navigation and the Subsystem Mutex 377 + ============================================ 378 + 379 + There is an extra bonus that configfs provides. The config_groups and 380 + config_items are arranged in a hierarchy due to the fact that they 381 + appear in a filesystem. A subsystem is NEVER to touch the filesystem 382 + parts, but the subsystem might be interested in this hierarchy. For 383 + this reason, the hierarchy is mirrored via the config_group->cg_children 384 + and config_item->ci_parent structure members. 
385 + 386 + A subsystem can navigate the cg_children list and the ci_parent pointer 387 + to see the tree created by the subsystem. This can race with configfs' 388 + management of the hierarchy, so configfs uses the subsystem mutex to 389 + protect modifications. Whenever a subsystem wants to navigate the 390 + hierarchy, it must do so under the protection of the subsystem 391 + mutex. 392 + 393 + A subsystem will be prevented from acquiring the mutex while a newly 394 + allocated item has not been linked into this hierarchy. Similarly, it 395 + will not be able to acquire the mutex while a dropping item has not 396 + yet been unlinked. This means that an item's ci_parent pointer will 397 + never be NULL while the item is in configfs, and that an item will only 398 + be in its parent's cg_children list for the same duration. This allows 399 + a subsystem to trust ci_parent and cg_children while they hold the 400 + mutex. 401 + 402 + Item Aggregation Via symlink(2) 403 + =============================== 404 + 405 + configfs provides a simple group via the group->item parent/child 406 + relationship. Often, however, a larger environment requires aggregation 407 + outside of the parent/child connection. This is implemented via 408 + symlink(2). 409 + 410 + A config_item may provide the ct_item_ops->allow_link() and 411 + ct_item_ops->drop_link() methods. If the ->allow_link() method exists, 412 + symlink(2) may be called with the config_item as the source of the link. 413 + These links are only allowed between configfs config_items. Any 414 + symlink(2) attempt outside the configfs filesystem will be denied. 415 + 416 + When symlink(2) is called, the source config_item's ->allow_link() 417 + method is called with itself and a target item. If the source item 418 + allows linking to target item, it returns 0. A source item may wish to 419 + reject a link if it only wants links to a certain type of object (say, 420 + in its own subsystem). 
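A minimal userspace sketch of the ->allow_link() decision follows. It assumes hypothetical mock_* types and a fakenbd_allow_link() callback that only accepts targets from its own subsystem; the subsystem is identified by a plain string here, which is an invention of this model, and -1 stands in for the negative errno kernel code would return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct mock_item;

/* Models the link-related part of configfs_item_operations. */
struct mock_item_ops {
	int (*allow_link)(struct mock_item *src, struct mock_item *target);
};

/* Userspace stand-in for struct config_item. */
struct mock_item {
	const char *subsys;                 /* owning subsystem, by name */
	const struct mock_item_ops *ops;    /* NULL if links are unsupported */
	int nr_links;                       /* links created from this item */
};

/* Hypothetical ->allow_link(): only accept targets in our own subsystem. */
int fakenbd_allow_link(struct mock_item *src, struct mock_item *target)
{
	if (strcmp(src->subsys, target->subsys) != 0)
		return -1;      /* kernel code would return a negative errno */
	return 0;
}

const struct mock_item_ops fakenbd_item_ops = { fakenbd_allow_link };

/* Models the check made when symlink(2) names src as the link source. */
int mock_symlink(struct mock_item *src, struct mock_item *target)
{
	int ret;

	assert(src != NULL && target != NULL);
	if (!src->ops || !src->ops->allow_link)
		return -1;      /* no ->allow_link(): the link is refused */
	ret = src->ops->allow_link(src, target);
	if (ret == 0)
		src->nr_links++;
	return ret;
}
```

The source item makes the decision; the target is never consulted, which is why an item without ->allow_link() can still be linked to but cannot be linked from.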
421 + 422 + When unlink(2) is called on the symbolic link, the source item is 423 + notified via the ->drop_link() method. Like the ->drop_item() method, 424 + this is a void function and cannot return failure. The subsystem is 425 + responsible for responding to the change. 426 + 427 + A config_item cannot be removed while it links to any other item, nor 428 + can it be removed while an item links to it. Dangling symlinks are not 429 + allowed in configfs. 430 + 431 + Automatically Created Subgroups 432 + =============================== 433 + 434 + A new config_group may want to have two types of child config_items. 435 + While this could be codified by magic names in ->make_item(), it is much 436 + more explicit to have a method whereby userspace sees this divergence. 437 + 438 + Rather than have a group where some items behave differently than 439 + others, configfs provides a method whereby one or many subgroups are 440 + automatically created inside the parent at its creation. Thus, 441 + mkdir("parent") results in "parent", "parent/subgroup1", up through 442 + "parent/subgroupN". Items of type 1 can now be created in 443 + "parent/subgroup1", and items of type N can be created in 444 + "parent/subgroupN". 445 + 446 + These automatic subgroups, or default groups, do not preclude other 447 + children of the parent group. If ct_group_ops->make_group() exists, 448 + other child groups can be created on the parent group directly. 449 + 450 + A configfs subsystem specifies default groups by adding them using the 451 + configfs_add_default_group() function to the parent config_group 452 + structure. Each added group is populated in the configfs tree at the same 453 + time as the parent group. Similarly, they are removed at the same time 454 + as the parent. No extra notification is provided. When a ->drop_item() 455 + method call notifies the subsystem the parent group is going away, it 456 + also means every default group child associated with that parent group. 
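The way default groups follow their parent can be modeled in a few lines of userspace C. This is a sketch only: mock_add_default_group() is a hypothetical stand-in for configfs_add_default_group() (the new-group-first argument order is assumed here), a fixed array replaces the default_groups list, and a "visible" flag replaces the real filesystem tree.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFAULT 4

/* Userspace stand-in for struct config_group. */
struct mock_group {
	const char *name;
	struct mock_group *default_groups[MAX_DEFAULT]; /* models default_groups */
	int nr_default;
	int visible;                /* 1 while present in the mock tree */
};

/* Attach new_group as a default group of parent, before parent is created. */
void mock_add_default_group(struct mock_group *new_group,
			    struct mock_group *parent)
{
	assert(parent->nr_default < MAX_DEFAULT);
	parent->default_groups[parent->nr_default++] = new_group;
}

/* Creating the parent populates every default group along with it. */
void mock_populate(struct mock_group *group)
{
	int i;

	group->visible = 1;
	for (i = 0; i < group->nr_default; i++)
		mock_populate(group->default_groups[i]);
}

/* Removing the parent removes its default groups; no extra notification. */
void mock_depopulate(struct mock_group *group)
{
	int i;

	for (i = 0; i < group->nr_default; i++)
		mock_depopulate(group->default_groups[i]);
	group->visible = 0;
}
```

Because the subgroups are attached to the parent structure itself, there is no separate creation or removal step for them in this model.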
457 + 458 + As a consequence of this, default groups cannot be removed directly via 459 + rmdir(2). They also are not considered when rmdir(2) on the parent 460 + group is checking for children. 461 + 462 + Dependent Subsystems 463 + ==================== 464 + 465 + Sometimes other drivers depend on particular configfs items. For 466 + example, ocfs2 mounts depend on a heartbeat region item. If that 467 + region item is removed with rmdir(2), the ocfs2 mount must BUG or go 468 + readonly. Not happy. 469 + 470 + configfs provides two additional API calls: configfs_depend_item() and 471 + configfs_undepend_item(). A client driver can call 472 + configfs_depend_item() on an existing item to tell configfs that it is 473 + depended on. configfs will then return -EBUSY from rmdir(2) for that 474 + item. When the item is no longer depended on, the client driver calls 475 + configfs_undepend_item() on it. 476 + 477 + These APIs cannot be called underneath any configfs callbacks, as 478 + they will conflict. They can block and allocate. A client driver 479 + probably shouldn't be calling them of its own gumption. Rather, it should 480 + be providing an API that external subsystems call. 481 + 482 + How does this work? Imagine the ocfs2 mount process. When it mounts, 483 + it asks for a heartbeat region item. This is done via a call into the 484 + heartbeat code. Inside the heartbeat code, the region item is looked 485 + up. Here, the heartbeat code calls configfs_depend_item(). If it 486 + succeeds, then heartbeat knows the region is safe to give to ocfs2. 487 + If it fails, it was being torn down anyway, and heartbeat can gracefully 488 + pass up an error. 489 + 490 + Committable Items 491 + ================= 492 + 493 + Note: 494 + Committable items are currently unimplemented. 495 + 496 + Some config_items cannot have a valid initial state. That is, no 497 + default values can be specified for the item's attributes such that the 498 + item can do its work. 
Userspace must configure one or more attributes, 499 + after which the subsystem can start whatever entity this item 500 + represents. 501 + 502 + Consider the FakeNBD device from above. Without a target address *and* 503 + a target device, the subsystem has no idea what block device to import. 504 + The simple example assumes that the subsystem merely waits until all the 505 + appropriate attributes are configured, and then connects. This will, 506 + indeed, work, but now every attribute store must check if the attributes 507 + are initialized. Every attribute store must fire off the connection if 508 + that condition is met. 509 + 510 + Far better would be an explicit action notifying the subsystem that the 511 + config_item is ready to go. More importantly, an explicit action allows 512 + the subsystem to provide feedback as to whether the attributes are 513 + initialized in a way that makes sense. configfs provides this as 514 + committable items. 515 + 516 + configfs still uses only normal filesystem operations. An item is 517 + committed via rename(2). The item is moved from a directory where it 518 + can be modified to a directory where it cannot. 519 + 520 + Any group that provides the ct_group_ops->commit_item() method has 521 + committable items. When this group appears in configfs, mkdir(2) will 522 + not work directly in the group. Instead, the group will have two 523 + subdirectories: "live" and "pending". The "live" directory does not 524 + support mkdir(2) or rmdir(2) either. It only allows rename(2). The 525 + "pending" directory does allow mkdir(2) and rmdir(2). An item is 526 + created in the "pending" directory. Its attributes can be modified at 527 + will. Userspace commits the item by renaming it into the "live" 528 + directory. At this point, the subsystem receives the ->commit_item() 529 + callback. If all required attributes are filled to satisfaction, the 530 + method returns zero and the item is moved to the "live" directory. 
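Since committable items are unimplemented, the following is purely a thought experiment in userspace C of what a ->commit_item() check for FakeNBD might look like. The struct fakenbd_conn layout, the "live" flag, and mock_commit() are all hypothetical, the callback takes the container directly rather than a config_item, and -1 stands in for a negative errno.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical FakeNBD connection state. */
struct fakenbd_conn {
	char target[64];        /* server address, empty until configured */
	char device[64];        /* device on the server, empty until configured */
	int rw;
	int live;               /* 1 = in "live", 0 = in "pending" */
};

/* Hypothetical ->commit_item(): succeed only if required attributes are set. */
int fakenbd_commit_item(struct fakenbd_conn *conn)
{
	assert(conn != NULL);
	if (conn->target[0] == '\0' || conn->device[0] == '\0')
		return -1;      /* kernel code would return a negative errno */
	return 0;
}

/* Models rename(2) from the "pending" directory into "live". */
int mock_commit(struct fakenbd_conn *conn)
{
	int ret;

	if (conn->live)
		return -1;      /* already committed */
	ret = fakenbd_commit_item(conn);
	if (ret)
		return ret;     /* item stays in "pending" */
	conn->live = 1;
	return 0;
}
```

The point of the explicit step is that validation happens once, at commit time, instead of inside every attribute store.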
531 + 532 + As rmdir(2) does not work in the "live" directory, an item must be 533 + shut down, or "uncommitted". Again, this is done via rename(2), this 534 + time from the "live" directory back to the "pending" one. The subsystem 535 + is notified by the ct_group_ops->uncommit_object() method.

-508

Documentation/filesystems/configfs/configfs.txt

··· 1 - 2 - configfs - Userspace-driven kernel object configuration. 3 - 4 - Joel Becker <joel.becker@oracle.com> 5 - 6 - Updated: 31 March 2005 7 - 8 - Copyright (c) 2005 Oracle Corporation, 9 - Joel Becker <joel.becker@oracle.com> 10 - 11 - 12 - [What is configfs?] 13 - 14 - configfs is a ram-based filesystem that provides the converse of 15 - sysfs's functionality. Where sysfs is a filesystem-based view of 16 - kernel objects, configfs is a filesystem-based manager of kernel 17 - objects, or config_items. 18 - 19 - With sysfs, an object is created in kernel (for example, when a device 20 - is discovered) and it is registered with sysfs. Its attributes then 21 - appear in sysfs, allowing userspace to read the attributes via 22 - readdir(3)/read(2). It may allow some attributes to be modified via 23 - write(2). The important point is that the object is created and 24 - destroyed in kernel, the kernel controls the lifecycle of the sysfs 25 - representation, and sysfs is merely a window on all this. 26 - 27 - A configfs config_item is created via an explicit userspace operation: 28 - mkdir(2). It is destroyed via rmdir(2). The attributes appear at 29 - mkdir(2) time, and can be read or modified via read(2) and write(2). 30 - As with sysfs, readdir(3) queries the list of items and/or attributes. 31 - symlink(2) can be used to group items together. Unlike sysfs, the 32 - lifetime of the representation is completely driven by userspace. The 33 - kernel modules backing the items must respond to this. 34 - 35 - Both sysfs and configfs can and should exist together on the same 36 - system. One is not a replacement for the other. 37 - 38 - [Using configfs] 39 - 40 - configfs can be compiled as a module or into the kernel. You can access 41 - it by doing 42 - 43 - mount -t configfs none /config 44 - 45 - The configfs tree will be empty unless client modules are also loaded. 46 - These are modules that register their item types with configfs as 47 - subsystems. 
Once a client subsystem is loaded, it will appear as a 48 - subdirectory (or more than one) under /config. Like sysfs, the 49 - configfs tree is always there, whether mounted on /config or not. 50 - 51 - An item is created via mkdir(2). The item's attributes will also 52 - appear at this time. readdir(3) can determine what the attributes are, 53 - read(2) can query their default values, and write(2) can store new 54 - values. Don't mix more than one attribute in one attribute file. 55 - 56 - There are two types of configfs attributes: 57 - 58 - * Normal attributes, which similar to sysfs attributes, are small ASCII text 59 - files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably 60 - only one value per file should be used, and the same caveats from sysfs apply. 61 - Configfs expects write(2) to store the entire buffer at once. When writing to 62 - normal configfs attributes, userspace processes should first read the entire 63 - file, modify the portions they wish to change, and then write the entire 64 - buffer back. 65 - 66 - * Binary attributes, which are somewhat similar to sysfs binary attributes, 67 - but with a few slight changes to semantics. The PAGE_SIZE limitation does not 68 - apply, but the whole binary item must fit in single kernel vmalloc'ed buffer. 69 - The write(2) calls from user space are buffered, and the attributes' 70 - write_bin_attribute method will be invoked on the final close, therefore it is 71 - imperative for user-space to check the return code of close(2) in order to 72 - verify that the operation finished successfully. 73 - To avoid a malicious user OOMing the kernel, there's a per-binary attribute 74 - maximum buffer value. 75 - 76 - When an item needs to be destroyed, remove it with rmdir(2). An 77 - item cannot be destroyed if any other item has a link to it (via 78 - symlink(2)). Links can be removed via unlink(2). 
79 - 80 - [Configuring FakeNBD: an Example] 81 - 82 - Imagine there's a Network Block Device (NBD) driver that allows you to 83 - access remote block devices. Call it FakeNBD. FakeNBD uses configfs 84 - for its configuration. Obviously, there will be a nice program that 85 - sysadmins use to configure FakeNBD, but somehow that program has to tell 86 - the driver about it. Here's where configfs comes in. 87 - 88 - When the FakeNBD driver is loaded, it registers itself with configfs. 89 - readdir(3) sees this just fine: 90 - 91 - # ls /config 92 - fakenbd 93 - 94 - A fakenbd connection can be created with mkdir(2). The name is 95 - arbitrary, but likely the tool will make some use of the name. Perhaps 96 - it is a uuid or a disk name: 97 - 98 - # mkdir /config/fakenbd/disk1 99 - # ls /config/fakenbd/disk1 100 - target device rw 101 - 102 - The target attribute contains the IP address of the server FakeNBD will 103 - connect to. The device attribute is the device on the server. 104 - Predictably, the rw attribute determines whether the connection is 105 - read-only or read-write. 106 - 107 - # echo 10.0.0.1 > /config/fakenbd/disk1/target 108 - # echo /dev/sda1 > /config/fakenbd/disk1/device 109 - # echo 1 > /config/fakenbd/disk1/rw 110 - 111 - That's it. That's all there is. Now the device is configured, via the 112 - shell no less. 113 - 114 - [Coding With configfs] 115 - 116 - Every object in configfs is a config_item. A config_item reflects an 117 - object in the subsystem. It has attributes that match values on that 118 - object. configfs handles the filesystem representation of that object 119 - and its attributes, allowing the subsystem to ignore all but the 120 - basic show/store interaction. 121 - 122 - Items are created and destroyed inside a config_group. A group is a 123 - collection of items that share the same attributes and operations. 124 - Items are created by mkdir(2) and removed by rmdir(2), but configfs 125 - handles that. 
The group has a set of operations to perform these tasks 126 - 127 - A subsystem is the top level of a client module. During initialization, 128 - the client module registers the subsystem with configfs, the subsystem 129 - appears as a directory at the top of the configfs filesystem. A 130 - subsystem is also a config_group, and can do everything a config_group 131 - can. 132 - 133 - [struct config_item] 134 - 135 - struct config_item { 136 - char *ci_name; 137 - char ci_namebuf[UOBJ_NAME_LEN]; 138 - struct kref ci_kref; 139 - struct list_head ci_entry; 140 - struct config_item *ci_parent; 141 - struct config_group *ci_group; 142 - struct config_item_type *ci_type; 143 - struct dentry *ci_dentry; 144 - }; 145 - 146 - void config_item_init(struct config_item *); 147 - void config_item_init_type_name(struct config_item *, 148 - const char *name, 149 - struct config_item_type *type); 150 - struct config_item *config_item_get(struct config_item *); 151 - void config_item_put(struct config_item *); 152 - 153 - Generally, struct config_item is embedded in a container structure, a 154 - structure that actually represents what the subsystem is doing. The 155 - config_item portion of that structure is how the object interacts with 156 - configfs. 157 - 158 - Whether statically defined in a source file or created by a parent 159 - config_group, a config_item must have one of the _init() functions 160 - called on it. This initializes the reference count and sets up the 161 - appropriate fields. 162 - 163 - All users of a config_item should have a reference on it via 164 - config_item_get(), and drop the reference when they are done via 165 - config_item_put(). 166 - 167 - By itself, a config_item cannot do much more than appear in configfs. 168 - Usually a subsystem wants the item to display and/or store attributes, 169 - among other things. For that, it needs a type. 
170 - 171 - [struct config_item_type] 172 - 173 - struct configfs_item_operations { 174 - void (*release)(struct config_item *); 175 - int (*allow_link)(struct config_item *src, 176 - struct config_item *target); 177 - void (*drop_link)(struct config_item *src, 178 - struct config_item *target); 179 - }; 180 - 181 - struct config_item_type { 182 - struct module *ct_owner; 183 - struct configfs_item_operations *ct_item_ops; 184 - struct configfs_group_operations *ct_group_ops; 185 - struct configfs_attribute **ct_attrs; 186 - struct configfs_bin_attribute **ct_bin_attrs; 187 - }; 188 - 189 - The most basic function of a config_item_type is to define what 190 - operations can be performed on a config_item. All items that have been 191 - allocated dynamically will need to provide the ct_item_ops->release() 192 - method. This method is called when the config_item's reference count 193 - reaches zero. 194 - 195 - [struct configfs_attribute] 196 - 197 - struct configfs_attribute { 198 - char *ca_name; 199 - struct module *ca_owner; 200 - umode_t ca_mode; 201 - ssize_t (*show)(struct config_item *, char *); 202 - ssize_t (*store)(struct config_item *, const char *, size_t); 203 - }; 204 - 205 - When a config_item wants an attribute to appear as a file in the item's 206 - configfs directory, it must define a configfs_attribute describing it. 207 - It then adds the attribute to the NULL-terminated array 208 - config_item_type->ct_attrs. When the item appears in configfs, the 209 - attribute file will appear with the configfs_attribute->ca_name 210 - filename. configfs_attribute->ca_mode specifies the file permissions. 211 - 212 - If an attribute is readable and provides a ->show method, that method will 213 - be called whenever userspace asks for a read(2) on the attribute. If an 214 - attribute is writable and provides a ->store method, that method will be 215 - be called whenever userspace asks for a write(2) on the attribute. 
216 - 217 - [struct configfs_bin_attribute] 218 - 219 - struct configfs_bin_attribute { 220 - struct configfs_attribute cb_attr; 221 - void *cb_private; 222 - size_t cb_max_size; 223 - }; 224 - 225 - The binary attribute is used when the one needs to use binary blob to 226 - appear as the contents of a file in the item's configfs directory. 227 - To do so add the binary attribute to the NULL-terminated array 228 - config_item_type->ct_bin_attrs, and the item appears in configfs, the 229 - attribute file will appear with the configfs_bin_attribute->cb_attr.ca_name 230 - filename. configfs_bin_attribute->cb_attr.ca_mode specifies the file 231 - permissions. 232 - The cb_private member is provided for use by the driver, while the 233 - cb_max_size member specifies the maximum amount of vmalloc buffer 234 - to be used. 235 - 236 - If binary attribute is readable and the config_item provides a 237 - ct_item_ops->read_bin_attribute() method, that method will be called 238 - whenever userspace asks for a read(2) on the attribute. The converse 239 - will happen for write(2). The reads/writes are bufferred so only a 240 - single read/write will occur; the attributes' need not concern itself 241 - with it. 242 - 243 - [struct config_group] 244 - 245 - A config_item cannot live in a vacuum. The only way one can be created 246 - is via mkdir(2) on a config_group. This will trigger creation of a 247 - child item. 248 - 249 - struct config_group { 250 - struct config_item cg_item; 251 - struct list_head cg_children; 252 - struct configfs_subsystem *cg_subsys; 253 - struct list_head default_groups; 254 - struct list_head group_entry; 255 - }; 256 - 257 - void config_group_init(struct config_group *group); 258 - void config_group_init_type_name(struct config_group *group, 259 - const char *name, 260 - struct config_item_type *type); 261 - 262 - 263 - The config_group structure contains a config_item. 
Properly configuring 264 - that item means that a group can behave as an item in its own right. 265 - However, it can do more: it can create child items or groups. This is 266 - accomplished via the group operations specified on the group's 267 - config_item_type. 268 - 269 - struct configfs_group_operations { 270 - struct config_item *(*make_item)(struct config_group *group, 271 - const char *name); 272 - struct config_group *(*make_group)(struct config_group *group, 273 - const char *name); 274 - int (*commit_item)(struct config_item *item); 275 - void (*disconnect_notify)(struct config_group *group, 276 - struct config_item *item); 277 - void (*drop_item)(struct config_group *group, 278 - struct config_item *item); 279 - }; 280 - 281 - A group creates child items by providing the 282 - ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new 283 - config_item (or more likely, its container structure), initializes it, 284 - and returns it to configfs. Configfs will then populate the filesystem 285 - tree to reflect the new item. 286 - 287 - If the subsystem wants the child to be a group itself, the subsystem 288 - provides ct_group_ops->make_group(). Everything else behaves the same, 289 - using the group _init() functions on the group. 290 - 291 - Finally, when userspace calls rmdir(2) on the item or group, 292 - ct_group_ops->drop_item() is called. As a config_group is also a 293 - config_item, it is not necessary for a separate drop_group() method. 294 - The subsystem must config_item_put() the reference that was initialized 295 - upon item allocation. If a subsystem has no work to do, it may omit 296 - the ct_group_ops->drop_item() method, and configfs will call 297 - config_item_put() on the item on behalf of the subsystem. 298 - 299 - IMPORTANT: drop_item() is void, and as such cannot fail. 
When rmdir(2) 300 - is called, configfs WILL remove the item from the filesystem tree 301 - (assuming that it has no children to keep it busy). The subsystem is 302 - responsible for responding to this. If the subsystem has references to 303 - the item in other threads, the memory is safe. It may take some time 304 - for the item to actually disappear from the subsystem's usage. But it 305 - is gone from configfs. 306 - 307 - When drop_item() is called, the item's linkage has already been torn 308 - down. It no longer has a reference on its parent and has no place in 309 - the item hierarchy. If a client needs to do some cleanup before this 310 - teardown happens, the subsystem can implement the 311 - ct_group_ops->disconnect_notify() method. The method is called after 312 - configfs has removed the item from the filesystem view but before the 313 - item is removed from its parent group. Like drop_item(), 314 - disconnect_notify() is void and cannot fail. Client subsystems should 315 - not drop any references here, as they still must do it in drop_item(). 316 - 317 - A config_group cannot be removed while it still has child items. This 318 - is implemented in the configfs rmdir(2) code. ->drop_item() will not be 319 - called, as the item has not been dropped. rmdir(2) will fail, as the 320 - directory is not empty. 321 - 322 - [struct configfs_subsystem] 323 - 324 - A subsystem must register itself, usually at module_init time. This 325 - tells configfs to make the subsystem appear in the file tree. 326 - 327 - struct configfs_subsystem { 328 - struct config_group su_group; 329 - struct mutex su_mutex; 330 - }; 331 - 332 - int configfs_register_subsystem(struct configfs_subsystem *subsys); 333 - void configfs_unregister_subsystem(struct configfs_subsystem *subsys); 334 - 335 - A subsystem consists of a toplevel config_group and a mutex. 336 - The group is where child config_items are created. For a subsystem, 337 - this group is usually defined statically. 
Before calling 338 - configfs_register_subsystem(), the subsystem must have initialized the 339 - group via the usual group _init() functions, and it must also have 340 - initialized the mutex. 341 - When the register call returns, the subsystem is live, and it 342 - will be visible via configfs. At that point, mkdir(2) can be called and 343 - the subsystem must be ready for it. 344 - 345 - [An Example] 346 - 347 - The best example of these basic concepts is the simple_children 348 - subsystem/group and the simple_child item in 349 - samples/configfs/configfs_sample.c. It shows a trivial object displaying 350 - and storing an attribute, and a simple group creating and destroying 351 - these children. 352 - 353 - [Hierarchy Navigation and the Subsystem Mutex] 354 - 355 - There is an extra bonus that configfs provides. The config_groups and 356 - config_items are arranged in a hierarchy due to the fact that they 357 - appear in a filesystem. A subsystem is NEVER to touch the filesystem 358 - parts, but the subsystem might be interested in this hierarchy. For 359 - this reason, the hierarchy is mirrored via the config_group->cg_children 360 - and config_item->ci_parent structure members. 361 - 362 - A subsystem can navigate the cg_children list and the ci_parent pointer 363 - to see the tree created by the subsystem. This can race with configfs' 364 - management of the hierarchy, so configfs uses the subsystem mutex to 365 - protect modifications. Whenever a subsystem wants to navigate the 366 - hierarchy, it must do so under the protection of the subsystem 367 - mutex. 368 - 369 - A subsystem will be prevented from acquiring the mutex while a newly 370 - allocated item has not been linked into this hierarchy. Similarly, it 371 - will not be able to acquire the mutex while a dropping item has not 372 - yet been unlinked. 
This means that an item's ci_parent pointer will 373 - never be NULL while the item is in configfs, and that an item will only 374 - be in its parent's cg_children list for the same duration. This allows 375 - a subsystem to trust ci_parent and cg_children while they hold the 376 - mutex. 377 - 378 - [Item Aggregation Via symlink(2)] 379 - 380 - configfs provides a simple group via the group->item parent/child 381 - relationship. Often, however, a larger environment requires aggregation 382 - outside of the parent/child connection. This is implemented via 383 - symlink(2). 384 - 385 - A config_item may provide the ct_item_ops->allow_link() and 386 - ct_item_ops->drop_link() methods. If the ->allow_link() method exists, 387 - symlink(2) may be called with the config_item as the source of the link. 388 - These links are only allowed between configfs config_items. Any 389 - symlink(2) attempt outside the configfs filesystem will be denied. 390 - 391 - When symlink(2) is called, the source config_item's ->allow_link() 392 - method is called with itself and a target item. If the source item 393 - allows linking to target item, it returns 0. A source item may wish to 394 - reject a link if it only wants links to a certain type of object (say, 395 - in its own subsystem). 396 - 397 - When unlink(2) is called on the symbolic link, the source item is 398 - notified via the ->drop_link() method. Like the ->drop_item() method, 399 - this is a void function and cannot return failure. The subsystem is 400 - responsible for responding to the change. 401 - 402 - A config_item cannot be removed while it links to any other item, nor 403 - can it be removed while an item links to it. Dangling symlinks are not 404 - allowed in configfs. 405 - 406 - [Automatically Created Subgroups] 407 - 408 - A new config_group may want to have two types of child config_items. 
409 - While this could be codified by magic names in ->make_item(), it is much 410 - more explicit to have a method whereby userspace sees this divergence. 411 - 412 - Rather than have a group where some items behave differently than 413 - others, configfs provides a method whereby one or many subgroups are 414 - automatically created inside the parent at its creation. Thus, 415 - mkdir("parent") results in "parent", "parent/subgroup1", up through 416 - "parent/subgroupN". Items of type 1 can now be created in 417 - "parent/subgroup1", and items of type N can be created in 418 - "parent/subgroupN". 419 - 420 - These automatic subgroups, or default groups, do not preclude other 421 - children of the parent group. If ct_group_ops->make_group() exists, 422 - other child groups can be created on the parent group directly. 423 - 424 - A configfs subsystem specifies default groups by adding them using the 425 - configfs_add_default_group() function to the parent config_group 426 - structure. Each added group is populated in the configfs tree at the same 427 - time as the parent group. Similarly, they are removed at the same time 428 - as the parent. No extra notification is provided. When a ->drop_item() 429 - method call notifies the subsystem the parent group is going away, it 430 - also means every default group child associated with that parent group. 431 - 432 - As a consequence of this, default groups cannot be removed directly via 433 - rmdir(2). They also are not considered when rmdir(2) on the parent 434 - group is checking for children. 435 - 436 - [Dependent Subsystems] 437 - 438 - Sometimes other drivers depend on particular configfs items. For 439 - example, ocfs2 mounts depend on a heartbeat region item. If that 440 - region item is removed with rmdir(2), the ocfs2 mount must BUG or go 441 - readonly. Not happy. 442 - 443 - configfs provides two additional API calls: configfs_depend_item() and 444 - configfs_undepend_item(). 
A client driver can call 445 - configfs_depend_item() on an existing item to tell configfs that it is 446 - depended on. configfs will then return -EBUSY from rmdir(2) for that 447 - item. When the item is no longer depended on, the client driver calls 448 - configfs_undepend_item() on it. 449 - 450 - These API cannot be called underneath any configfs callbacks, as 451 - they will conflict. They can block and allocate. A client driver 452 - probably shouldn't calling them of its own gumption. Rather it should 453 - be providing an API that external subsystems call. 454 - 455 - How does this work? Imagine the ocfs2 mount process. When it mounts, 456 - it asks for a heartbeat region item. This is done via a call into the 457 - heartbeat code. Inside the heartbeat code, the region item is looked 458 - up. Here, the heartbeat code calls configfs_depend_item(). If it 459 - succeeds, then heartbeat knows the region is safe to give to ocfs2. 460 - If it fails, it was being torn down anyway, and heartbeat can gracefully 461 - pass up an error. 462 - 463 - [Committable Items] 464 - 465 - NOTE: Committable items are currently unimplemented. 466 - 467 - Some config_items cannot have a valid initial state. That is, no 468 - default values can be specified for the item's attributes such that the 469 - item can do its work. Userspace must configure one or more attributes, 470 - after which the subsystem can start whatever entity this item 471 - represents. 472 - 473 - Consider the FakeNBD device from above. Without a target address *and* 474 - a target device, the subsystem has no idea what block device to import. 475 - The simple example assumes that the subsystem merely waits until all the 476 - appropriate attributes are configured, and then connects. This will, 477 - indeed, work, but now every attribute store must check if the attributes 478 - are initialized. Every attribute store must fire off the connection if 479 - that condition is met. 
480 - 481 - Far better would be an explicit action notifying the subsystem that the 482 - config_item is ready to go. More importantly, an explicit action allows 483 - the subsystem to provide feedback as to whether the attributes are 484 - initialized in a way that makes sense. configfs provides this as 485 - committable items. 486 - 487 - configfs still uses only normal filesystem operations. An item is 488 - committed via rename(2). The item is moved from a directory where it 489 - can be modified to a directory where it cannot. 490 - 491 - Any group that provides the ct_group_ops->commit_item() method has 492 - committable items. When this group appears in configfs, mkdir(2) will 493 - not work directly in the group. Instead, the group will have two 494 - subdirectories: "live" and "pending". The "live" directory does not 495 - support mkdir(2) or rmdir(2) either. It only allows rename(2). The 496 - "pending" directory does allow mkdir(2) and rmdir(2). An item is 497 - created in the "pending" directory. Its attributes can be modified at 498 - will. Userspace commits the item by renaming it into the "live" 499 - directory. At this point, the subsystem receives the ->commit_item() 500 - callback. If all required attributes are filled to satisfaction, the 501 - method returns zero and the item is moved to the "live" directory. 502 - 503 - As rmdir(2) does not work in the "live" directory, an item must be 504 - shutdown, or "uncommitted". Again, this is done via rename(2), this 505 - time from the "live" directory back to the "pending" one. The subsystem 506 - is notified by the ct_group_ops->uncommit_object() method. 507 - 508 -

+1 -1

Documentation/filesystems/dax.txt

··· 74 74 exposure of uninitialized data through mmap. 75 75 76 76 These filesystems may be used for inspiration: 77 - - ext2: see Documentation/filesystems/ext2.txt 77 + - ext2: see Documentation/filesystems/ext2.rst 78 78 - ext4: see Documentation/filesystems/ext4/ 79 79 - xfs: see Documentation/admin-guide/xfs.rst 80 80

+3 -2

Documentation/filesystems/debugfs.rst

··· 166 166 }; 167 167 168 168 struct debugfs_regset32 { 169 - struct debugfs_reg32 *regs; 169 + const struct debugfs_reg32 *regs; 170 170 int nregs; 171 171 void __iomem *base; 172 + struct device *dev; /* Optional device for Runtime PM */ 172 173 }; 173 174 174 175 debugfs_create_regset32(const char *name, umode_t mode, 175 176 struct dentry *parent, 176 177 struct debugfs_regset32 *regset); 177 178 178 - void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, 179 + void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs, 179 180 int nregs, void __iomem *base, char *prefix); 180 181 181 182 The "base" argument may be 0, but you may want to build the reg32 array

+36

Documentation/filesystems/devpts.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ===================== 4 + The Devpts Filesystem 5 + ===================== 6 + 7 + Each mount of the devpts filesystem is now distinct such that ptys 8 + and their indices allocated in one mount are independent from ptys 9 + and their indices in all other mounts. 10 + 11 + All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node 12 + with permissions ``0000``. 13 + 14 + To retain backwards compatibility a ptmx device node (aka any node 15 + created with ``mknod name c 5 2``) when opened will look for an instance 16 + of devpts under the name ``pts`` in the same directory as the ptmx device 17 + node. 18 + 19 + As an option instead of placing a ``/dev/ptmx`` device node at ``/dev/ptmx`` 20 + it is possible to place a symlink to ``/dev/pts/ptmx`` at ``/dev/ptmx`` or 21 + to bind mount ``/dev/pts/ptmx`` to ``/dev/ptmx``. If you opt for using 22 + the devpts filesystem in this manner devpts should be mounted with 23 + the ``ptmxmode=0666``, or ``chmod 0666 /dev/pts/ptmx`` should be called. 24 + 25 + Total count of pty pairs in all instances is limited by sysctls:: 26 + 27 + kernel.pty.max = 4096 - global limit 28 + kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace 29 + kernel.pty.nr - current count of ptys 30 + 31 + Per-instance limit could be set by adding mount option ``max=<count>``. 32 + 33 + This feature was added in kernel 3.4 together with 34 + ``sysctl kernel.pty.reserve``. 35 + 36 + In kernels older than 3.4 sysctl ``kernel.pty.max`` works as per-instance limit.
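The pty sysctls listed in devpts.rst are exposed under /proc/sys/kernel/pty/, so a userspace program can read them as ordinary text files. The following is a minimal sketch of that, not part of the commit: the helper names parse_sysctl_long() and pty_sysctl_read() are made up for illustration, and the code assumes procfs is mounted at /proc.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the decimal number at the start of a sysctl file's contents.
 * Returns 0 and stores the value in *out, or -1 if no number is found.
 * (Hypothetical helper, not a kernel or libc interface.)
 */
int parse_sysctl_long(const char *text, long *out)
{
	char *end;
	long value;

	errno = 0;
	value = strtol(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	*out = value;
	return 0;
}

/*
 * Read one of the kernel.pty.* sysctls ("max", "reserve" or "nr").
 * Returns 0 on success, -1 if the file is missing or unparsable.
 * (Hypothetical helper; assumes procfs is mounted at /proc.)
 */
int pty_sysctl_read(const char *name, long *out)
{
	char path[128];
	char buf[64];
	FILE *f;
	int rc = -1;

	if (snprintf(path, sizeof(path), "/proc/sys/kernel/pty/%s", name) >=
	    (int)sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		rc = parse_sysctl_long(buf, out);
	fclose(f);
	return rc;
}
```

Calling pty_sysctl_read("nr", &nr) and pty_sysctl_read("max", &max) yields the current pty count and the global limit, which a program could compare before allocating many ptys.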

-26

Documentation/filesystems/devpts.txt

··· 1 - Each mount of the devpts filesystem is now distinct such that ptys 2 - and their indicies allocated in one mount are independent from ptys 3 - and their indicies in all other mounts. 4 - 5 - All mounts of the devpts filesystem now create a /dev/pts/ptmx node 6 - with permissions 0000. 7 - 8 - To retain backwards compatibility the a ptmx device node (aka any node 9 - created with "mknod name c 5 2") when opened will look for an instance 10 - of devpts under the name "pts" in the same directory as the ptmx device 11 - node. 12 - 13 - As an option instead of placing a /dev/ptmx device node at /dev/ptmx 14 - it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or 15 - to bind mount /dev/ptx/ptmx to /dev/ptmx. If you opt for using 16 - the devpts filesystem in this manner devpts should be mounted with 17 - the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called. 18 - 19 - Total count of pty pairs in all instances is limited by sysctls: 20 - kernel.pty.max = 4096 - global limit 21 - kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace 22 - kernel.pty.nr - current count of ptys 23 - 24 - Per-instance limit could be set by adding mount option "max=<count>". 25 - This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve. 26 - In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit.

+75

Documentation/filesystems/dnotify.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============================ 4 + Linux Directory Notification 5 + ============================ 6 + 7 + Stephen Rothwell <sfr@canb.auug.org.au> 8 + 9 + The intention of directory notification is to allow user applications 10 + to be notified when a directory, or any of the files in it, are changed. 11 + The basic mechanism involves the application registering for notification 12 + on a directory using a fcntl(2) call and the notifications themselves 13 + being delivered using signals. 14 + 15 + The application decides which "events" it wants to be notified about. 16 + The currently defined events are: 17 + 18 + ========= ===================================================== 19 + DN_ACCESS A file in the directory was accessed (read) 20 + DN_MODIFY A file in the directory was modified (write,truncate) 21 + DN_CREATE A file was created in the directory 22 + DN_DELETE A file was unlinked from directory 23 + DN_RENAME A file in the directory was renamed 24 + DN_ATTRIB A file in the directory had its attributes 25 + changed (chmod,chown) 26 + ========= ===================================================== 27 + 28 + Usually, the application must reregister after each notification, but 29 + if DN_MULTISHOT is or'ed with the event mask, then the registration will 30 + remain until explicitly removed (by registering for no events). 31 + 32 + By default, SIGIO will be delivered to the process and no other useful 33 + information. However, if the F_SETSIG fcntl(2) call is used to let the 34 + kernel know which signal to deliver, a siginfo structure will be passed to 35 + the signal handler and the si_fd member of that structure will contain the 36 + file descriptor associated with the directory in which the event occurred. 37 + 38 + Preferably the application will choose one of the real time signals 39 + (SIGRTMIN + <n>) so that the notifications may be queued. This is 40 + especially important if DN_MULTISHOT is specified. 
Note that SIGRTMIN 41 + is often blocked, so it is better to use (at least) SIGRTMIN + 1. 42 + 43 + Implementation expectations (features and bugs :-)) 44 + --------------------------------------------------- 45 + 46 + The notification should work for any local access to files even if the 47 + actual file system is on a remote server. This implies that remote 48 + access to files served by local user mode servers should be notified. 49 + Also, remote accesses to files served by a local kernel NFS server should 50 + be notified. 51 + 52 + In order to make the impact on the file system code as small as possible, 53 + the problem of hard links to files has been ignored. So if a file (x) 54 + exists in two directories (a and b) then a change to the file using the 55 + name "a/x" should be notified to a program expecting notifications on 56 + directory "a", but will not be notified to one expecting notifications on 57 + directory "b". 58 + 59 + Also, files that are unlinked, will still cause notifications in the 60 + last directory that they were linked to. 61 + 62 + Configuration 63 + ------------- 64 + 65 + Dnotify is controlled via the CONFIG_DNOTIFY configuration option. When 66 + disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. 67 + 68 + Example 69 + ------- 70 + See tools/testing/selftests/filesystems/dnotify_test.c for an example. 71 + 72 + NOTE 73 + ---- 74 + Beginning with Linux 2.6.13, dnotify has been replaced by inotify. 75 + See Documentation/filesystems/inotify.rst for more information on it.
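The registration steps described in dnotify.rst (pick a signal with F_SETSIG, then request events with F_NOTIFY) can be sketched in userspace as follows. This is an illustration only and not part of the commit: dnotify_watch(), on_dnotify() and last_dnotify_fd are hypothetical names, and the selftest referenced in the document remains the authoritative example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* F_NOTIFY, F_SETSIG and DN_* are Linux extensions */
#endif
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Descriptor reported by the most recent notification (hypothetical global). */
volatile sig_atomic_t last_dnotify_fd = -1;

/*
 * SA_SIGINFO-style handler: because F_SETSIG was used, si_fd holds the
 * descriptor of the directory in which the event occurred.
 */
void on_dnotify(int signo, siginfo_t *si, void *context)
{
	(void)signo;
	(void)context;
	last_dnotify_fd = si->si_fd;
}

/*
 * Open a directory and register for the given DN_* events, delivered as
 * signal signo.  Returns the directory descriptor, or -1 on failure.
 * (Hypothetical helper, not a libc interface.)
 */
int dnotify_watch(const char *dir, int signo, unsigned long events)
{
	int fd = open(dir, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETSIG, signo) < 0 ||
	    fcntl(fd, F_NOTIFY, events) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would install on_dnotify() with sigaction() and SA_SIGINFO first, then call something like dnotify_watch(".", SIGRTMIN + 1, DN_MODIFY | DN_CREATE | DN_MULTISHOT). The descriptor must stay open for as long as notifications are wanted.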

-70

Documentation/filesystems/dnotify.txt

··· 1 - Linux Directory Notification 2 - ============================ 3 - 4 - Stephen Rothwell <sfr@canb.auug.org.au> 5 - 6 - The intention of directory notification is to allow user applications 7 - to be notified when a directory, or any of the files in it, are changed. 8 - The basic mechanism involves the application registering for notification 9 - on a directory using a fcntl(2) call and the notifications themselves 10 - being delivered using signals. 11 - 12 - The application decides which "events" it wants to be notified about. 13 - The currently defined events are: 14 - 15 - DN_ACCESS A file in the directory was accessed (read) 16 - DN_MODIFY A file in the directory was modified (write,truncate) 17 - DN_CREATE A file was created in the directory 18 - DN_DELETE A file was unlinked from directory 19 - DN_RENAME A file in the directory was renamed 20 - DN_ATTRIB A file in the directory had its attributes 21 - changed (chmod,chown) 22 - 23 - Usually, the application must reregister after each notification, but 24 - if DN_MULTISHOT is or'ed with the event mask, then the registration will 25 - remain until explicitly removed (by registering for no events). 26 - 27 - By default, SIGIO will be delivered to the process and no other useful 28 - information. However, if the F_SETSIG fcntl(2) call is used to let the 29 - kernel know which signal to deliver, a siginfo structure will be passed to 30 - the signal handler and the si_fd member of that structure will contain the 31 - file descriptor associated with the directory in which the event occurred. 32 - 33 - Preferably the application will choose one of the real time signals 34 - (SIGRTMIN + <n>) so that the notifications may be queued. This is 35 - especially important if DN_MULTISHOT is specified. Note that SIGRTMIN 36 - is often blocked, so it is better to use (at least) SIGRTMIN + 1. 
37 - 38 - Implementation expectations (features and bugs :-)) 39 - --------------------------- 40 - 41 - The notification should work for any local access to files even if the 42 - actual file system is on a remote server. This implies that remote 43 - access to files served by local user mode servers should be notified. 44 - Also, remote accesses to files served by a local kernel NFS server should 45 - be notified. 46 - 47 - In order to make the impact on the file system code as small as possible, 48 - the problem of hard links to files has been ignored. So if a file (x) 49 - exists in two directories (a and b) then a change to the file using the 50 - name "a/x" should be notified to a program expecting notifications on 51 - directory "a", but will not be notified to one expecting notifications on 52 - directory "b". 53 - 54 - Also, files that are unlinked, will still cause notifications in the 55 - last directory that they were linked to. 56 - 57 - Configuration 58 - ------------- 59 - 60 - Dnotify is controlled via the CONFIG_DNOTIFY configuration option. When 61 - disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. 62 - 63 - Example 64 - ------- 65 - See tools/testing/selftests/filesystems/dnotify_test.c for an example. 66 - 67 - NOTE 68 - ---- 69 - Beginning with Linux 2.6.13, dnotify has been replaced by inotify. 70 - See Documentation/filesystems/inotify.txt for more information on it.

+17

Documentation/filesystems/efivarfs.rst

··· 24 24 as immutable files. This doesn't prevent removal - "chattr -i" will work - 25 25 but it does prevent this kind of failure from being accomplished 26 26 accidentally. 27 + 28 + .. warning :: 29 + When a content of an UEFI variable in /sys/firmware/efi/efivars is 30 + displayed, for example using "hexdump", pay attention that the first 31 + 4 bytes of the output represent the UEFI variable attributes, 32 + in little-endian format. 33 + 34 + Practically the output of each efivar is composed of: 35 + 36 + +-----------------------------------+ 37 + |4_bytes_of_attributes + efivar_data| 38 + +-----------------------------------+ 39 + 40 + *See also:* 41 + 42 + - Documentation/admin-guide/acpi/ssdt-overlays.rst 43 + - Documentation/ABI/stable/sysfs-firmware-efi-vars
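To make the layout in the warning concrete, here is a small userspace sketch that splits the raw contents of an efivarfs file into the 4-byte little-endian attribute word and the variable data that follows it. It is not part of the commit; struct efivar_view and efivar_split() are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decoded view of one efivarfs file (hypothetical type for illustration). */
struct efivar_view {
	uint32_t attributes;	/* UEFI variable attributes */
	const uint8_t *data;	/* payload, pointing into the caller's buffer */
	size_t data_len;	/* number of payload bytes */
};

/*
 * Split a buffer read from /sys/firmware/efi/efivars/<name> into
 * attributes and data.  Returns 0 on success, or -1 if the buffer is too
 * short to contain the 4 attribute bytes.
 */
int efivar_split(const uint8_t *buf, size_t len, struct efivar_view *out)
{
	if (buf == NULL || out == NULL || len < 4)
		return -1;

	/* Assemble the attribute word byte by byte: little-endian on disk. */
	out->attributes = (uint32_t)buf[0] |
			  ((uint32_t)buf[1] << 8) |
			  ((uint32_t)buf[2] << 16) |
			  ((uint32_t)buf[3] << 24);
	out->data = buf + 4;
	out->data_len = len - 4;
	return 0;
}
```

Decoding the attribute word byte by byte keeps the helper independent of the host's endianness.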

+234

Documentation/filesystems/fiemap.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============ 4 + Fiemap Ioctl 5 + ============ 6 + 7 + The fiemap ioctl is an efficient method for userspace to get file 8 + extent mappings. Instead of block-by-block mapping (such as bmap), fiemap 9 + returns a list of extents. 10 + 11 + 12 + Request Basics 13 + -------------- 14 + 15 + A fiemap request is encoded within struct fiemap:: 16 + 17 + struct fiemap { 18 + __u64 fm_start; /* logical offset (inclusive) at 19 + * which to start mapping (in) */ 20 + __u64 fm_length; /* logical length of mapping which 21 + * userspace cares about (in) */ 22 + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ 23 + __u32 fm_mapped_extents; /* number of extents that were 24 + * mapped (out) */ 25 + __u32 fm_extent_count; /* size of fm_extents array (in) */ 26 + __u32 fm_reserved; 27 + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ 28 + }; 29 + 30 + 31 + fm_start, and fm_length specify the logical range within the file 32 + which the process would like mappings for. Extents returned mirror 33 + those on disk - that is, the logical offset of the 1st returned extent 34 + may start before fm_start, and the range covered by the last returned 35 + extent may end after fm_length. All offsets and lengths are in bytes. 36 + 37 + Certain flags to modify the way in which mappings are looked up can be 38 + set in fm_flags. If the kernel doesn't understand some particular 39 + flags, it will return EBADR and the contents of fm_flags will contain 40 + the set of flags which caused the error. If the kernel is compatible 41 + with all flags passed, the contents of fm_flags will be unmodified. 42 + It is up to userspace to determine whether rejection of a particular 43 + flag is fatal to its operation. This scheme is intended to allow the 44 + fiemap interface to grow in the future but without losing 45 + compatibility with old software. 
46 + 47 + fm_extent_count specifies the number of elements in the fm_extents[] array 48 + that can be used to return extents. If fm_extent_count is zero, then the 49 + fm_extents[] array is ignored (no extents will be returned), and the 50 + fm_mapped_extents count will hold the number of extents needed in 51 + fm_extents[] to hold the file's current mapping. Note that there is 52 + nothing to prevent the file from changing between calls to FIEMAP. 53 + 54 + The following flags can be set in fm_flags: 55 + 56 + FIEMAP_FLAG_SYNC 57 + If this flag is set, the kernel will sync the file before mapping extents. 58 + 59 + FIEMAP_FLAG_XATTR 60 + If this flag is set, the extents returned will describe the inodes 61 + extended attribute lookup tree, instead of its data tree. 62 + 63 + 64 + Extent Mapping 65 + -------------- 66 + 67 + Extent information is returned within the embedded fm_extents array 68 + which userspace must allocate along with the fiemap structure. The 69 + number of elements in the fiemap_extents[] array should be passed via 70 + fm_extent_count. The number of extents mapped by kernel will be 71 + returned via fm_mapped_extents. If the number of fiemap_extents 72 + allocated is less than would be required to map the requested range, 73 + the maximum number of extents that can be mapped in the fm_extent[] 74 + array will be returned and fm_mapped_extents will be equal to 75 + fm_extent_count. In that case, the last extent in the array will not 76 + complete the requested range and will not have the FIEMAP_EXTENT_LAST 77 + flag set (see the next section on extent flags). 
78 + 79 + Each extent is described by a single fiemap_extent structure as 80 + returned in fm_extents:: 81 + 82 + struct fiemap_extent { 83 + __u64 fe_logical; /* logical offset in bytes for the start of 84 + * the extent */ 85 + __u64 fe_physical; /* physical offset in bytes for the start 86 + * of the extent */ 87 + __u64 fe_length; /* length in bytes for the extent */ 88 + __u64 fe_reserved64[2]; 89 + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ 90 + __u32 fe_reserved[3]; 91 + }; 92 + 93 + All offsets and lengths are in bytes and mirror those on disk. It is valid 94 + for an extents logical offset to start before the request or its logical 95 + length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is 96 + returned, fe_logical, fe_physical, and fe_length will be aligned to the 97 + block size of the file system. With the exception of extents flagged as 98 + FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. 99 + 100 + The fe_flags field contains flags which describe the extent returned. 101 + A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in 102 + the file so that the process making fiemap calls can determine when no 103 + more extents are available, without having to call the ioctl again. 104 + 105 + Some flags are intentionally vague and will always be set in the 106 + presence of other more specific flags. This way a program looking for 107 + a general property does not have to know all existing and future flags 108 + which imply that property. 109 + 110 + For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL 111 + are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking 112 + for inline or tail-packed data can key on the specific flag. Software 113 + which simply cares not to try operating on non-aligned extents 114 + however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to 115 + worry about all present and future flags which might imply unaligned 116 + data. 
Note that the opposite is not true - it would be valid for 117 + FIEMAP_EXTENT_NOT_ALIGNED to appear alone. 118 + 119 + FIEMAP_EXTENT_LAST 120 + This is generally the last extent in the file. A mapping attempt past 121 + this extent may return nothing. Some implementations set this flag to 122 + indicate this extent is the last one in the range queried by the user 123 + (via fiemap->fm_length). 124 + 125 + FIEMAP_EXTENT_UNKNOWN 126 + The location of this extent is currently unknown. This may indicate 127 + the data is stored on an inaccessible volume or that no storage has 128 + been allocated for the file yet. 129 + 130 + FIEMAP_EXTENT_DELALLOC 131 + This will also set FIEMAP_EXTENT_UNKNOWN. 132 + 133 + Delayed allocation - while there is data for this extent, its 134 + physical location has not been allocated yet. 135 + 136 + FIEMAP_EXTENT_ENCODED 137 + This extent does not consist of plain filesystem blocks but is 138 + encoded (e.g. encrypted or compressed). Reading the data in this 139 + extent via I/O to the block device will have undefined results. 140 + 141 + Note that it is *always* undefined to try to update the data 142 + in-place by writing to the indicated location without the 143 + assistance of the filesystem, or to access the data using the 144 + information returned by the FIEMAP interface while the filesystem 145 + is mounted. In other words, user applications may only read the 146 + extent data via I/O to the block device while the filesystem is 147 + unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is 148 + clear; user applications must not try reading or writing to the 149 + filesystem via the block device under any other circumstances. 150 + 151 + FIEMAP_EXTENT_DATA_ENCRYPTED 152 + This will also set FIEMAP_EXTENT_ENCODED 153 + The data in this extent has been encrypted by the file system. 154 + 155 + FIEMAP_EXTENT_NOT_ALIGNED 156 + Extent offsets and length are not guaranteed to be block aligned. 
157 + 158 + FIEMAP_EXTENT_DATA_INLINE 159 + This will also set FIEMAP_EXTENT_NOT_ALIGNED 160 + Data is located within a meta data block. 161 + 162 + FIEMAP_EXTENT_DATA_TAIL 163 + This will also set FIEMAP_EXTENT_NOT_ALIGNED 164 + Data is packed into a block with data from other files. 165 + 166 + FIEMAP_EXTENT_UNWRITTEN 167 + Unwritten extent - the extent is allocated but its data has not been 168 + initialized. This indicates the extent's data will be all zero if read 169 + through the filesystem but the contents are undefined if read directly from 170 + the device. 171 + 172 + FIEMAP_EXTENT_MERGED 173 + This will be set when a file does not support extents, i.e., it uses a block 174 + based addressing scheme. Since returning an extent for each block back to 175 + userspace would be highly inefficient, the kernel will try to merge most 176 + adjacent blocks into 'extents'. 177 + 178 + 179 + VFS -> File System Implementation 180 + --------------------------------- 181 + 182 + File systems wishing to support fiemap must implement a ->fiemap callback on 183 + their inode_operations structure. The fs ->fiemap call is responsible for 184 + defining its set of supported fiemap flags, and calling a helper function on 185 + each discovered extent:: 186 + 187 + struct inode_operations { 188 + ... 189 + 190 + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, 191 + u64 len); 192 + 193 + ->fiemap is passed struct fiemap_extent_info which describes the 194 + fiemap request:: 195 + 196 + struct fiemap_extent_info { 197 + unsigned int fi_flags; /* Flags as passed from user */ 198 + unsigned int fi_extents_mapped; /* Number of mapped extents */ 199 + unsigned int fi_extents_max; /* Size of fiemap_extent array */ 200 + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ 201 + }; 202 + 203 + It is intended that the file system should not need to access any of this 204 + structure directly. 
Filesystem handlers should be tolerant to signals and return 205 + EINTR once fatal signal received. 206 + 207 + 208 + Flag checking should be done at the beginning of the ->fiemap callback via the 209 + fiemap_check_flags() helper:: 210 + 211 + int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); 212 + 213 + The struct fieinfo should be passed in as received from ioctl_fiemap(). The 214 + set of fiemap flags which the fs understands should be passed via fs_flags. If 215 + fiemap_check_flags finds invalid user flags, it will place the bad values in 216 + fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from 217 + fiemap_check_flags(), it should immediately exit, returning that error back to 218 + ioctl_fiemap(). 219 + 220 + 221 + For each extent in the request range, the file system should call 222 + the helper function, fiemap_fill_next_extent():: 223 + 224 + int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, 225 + u64 phys, u64 len, u32 flags, u32 dev); 226 + 227 + fiemap_fill_next_extent() will use the passed values to populate the 228 + next free extent in the fm_extents array. 'General' extent flags will 229 + automatically be set from specific flags on behalf of the calling file 230 + system so that the userspace API is not broken. 231 + 232 + fiemap_fill_next_extent() returns 0 on success, and 1 when the 233 + user-supplied fm_extents array is full. If an error is encountered 234 + while copying the extent to user memory, -EFAULT will be returned.
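From the userspace side, the request described in fiemap.rst can be sketched as below. The example, which is not part of the commit, allocates a struct fiemap with room for a chosen number of extents and uses the documented fm_extent_count == 0 behaviour to ask how many extents a file currently needs. The helper names fiemap_request_new() and fiemap_count_extents() are hypothetical. The structures and FS_IOC_FIEMAP come from the kernel's uapi headers.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, struct fiemap_extent */

/*
 * Allocate a zeroed fiemap request with space for max_extents entries in
 * the trailing fm_extents[] array.  The caller frees it with free().
 * (Hypothetical helper, not a kernel or libc interface.)
 */
struct fiemap *fiemap_request_new(uint64_t start, uint64_t length,
				  uint32_t max_extents)
{
	size_t size = sizeof(struct fiemap) +
		      (size_t)max_extents * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, size);

	if (!fm)
		return NULL;
	fm->fm_start = start;		/* logical offset, in bytes */
	fm->fm_length = length;		/* logical length, in bytes */
	fm->fm_extent_count = max_extents;
	return fm;
}

/*
 * Ask how many extents are needed to map the whole file.  With
 * fm_extent_count set to zero no extents are copied out; only
 * fm_mapped_extents is filled in.  Returns -1 with errno set on failure.
 */
long fiemap_count_extents(int fd)
{
	struct fiemap *fm = fiemap_request_new(0, FIEMAP_MAX_OFFSET, 0);
	long count;
	int saved_errno;

	if (!fm)
		return -1;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		count = -1;
	else
		count = (long)fm->fm_mapped_extents;
	saved_errno = errno;
	free(fm);
	errno = saved_errno;
	return count;
}
```

A caller would typically pass the returned count to fiemap_request_new() for a second ioctl that fetches the extents. As the document notes, the file may change between the two calls, so the second request can still come back full.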

-231

Documentation/filesystems/fiemap.txt

··· 1 - ============ 2 - Fiemap Ioctl 3 - ============ 4 - 5 - The fiemap ioctl is an efficient method for userspace to get file 6 - extent mappings. Instead of block-by-block mapping (such as bmap), fiemap 7 - returns a list of extents. 8 - 9 - 10 - Request Basics 11 - -------------- 12 - 13 - A fiemap request is encoded within struct fiemap: 14 - 15 - struct fiemap { 16 - __u64 fm_start; /* logical offset (inclusive) at 17 - * which to start mapping (in) */ 18 - __u64 fm_length; /* logical length of mapping which 19 - * userspace cares about (in) */ 20 - __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ 21 - __u32 fm_mapped_extents; /* number of extents that were 22 - * mapped (out) */ 23 - __u32 fm_extent_count; /* size of fm_extents array (in) */ 24 - __u32 fm_reserved; 25 - struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ 26 - }; 27 - 28 - 29 - fm_start, and fm_length specify the logical range within the file 30 - which the process would like mappings for. Extents returned mirror 31 - those on disk - that is, the logical offset of the 1st returned extent 32 - may start before fm_start, and the range covered by the last returned 33 - extent may end after fm_length. All offsets and lengths are in bytes. 34 - 35 - Certain flags to modify the way in which mappings are looked up can be 36 - set in fm_flags. If the kernel doesn't understand some particular 37 - flags, it will return EBADR and the contents of fm_flags will contain 38 - the set of flags which caused the error. If the kernel is compatible 39 - with all flags passed, the contents of fm_flags will be unmodified. 40 - It is up to userspace to determine whether rejection of a particular 41 - flag is fatal to its operation. This scheme is intended to allow the 42 - fiemap interface to grow in the future but without losing 43 - compatibility with old software. 
44 - 45 - fm_extent_count specifies the number of elements in the fm_extents[] array 46 - that can be used to return extents. If fm_extent_count is zero, then the 47 - fm_extents[] array is ignored (no extents will be returned), and the 48 - fm_mapped_extents count will hold the number of extents needed in 49 - fm_extents[] to hold the file's current mapping. Note that there is 50 - nothing to prevent the file from changing between calls to FIEMAP. 51 - 52 - The following flags can be set in fm_flags: 53 - 54 - * FIEMAP_FLAG_SYNC 55 - If this flag is set, the kernel will sync the file before mapping extents. 56 - 57 - * FIEMAP_FLAG_XATTR 58 - If this flag is set, the extents returned will describe the inodes 59 - extended attribute lookup tree, instead of its data tree. 60 - 61 - 62 - Extent Mapping 63 - -------------- 64 - 65 - Extent information is returned within the embedded fm_extents array 66 - which userspace must allocate along with the fiemap structure. The 67 - number of elements in the fiemap_extents[] array should be passed via 68 - fm_extent_count. The number of extents mapped by kernel will be 69 - returned via fm_mapped_extents. If the number of fiemap_extents 70 - allocated is less than would be required to map the requested range, 71 - the maximum number of extents that can be mapped in the fm_extent[] 72 - array will be returned and fm_mapped_extents will be equal to 73 - fm_extent_count. In that case, the last extent in the array will not 74 - complete the requested range and will not have the FIEMAP_EXTENT_LAST 75 - flag set (see the next section on extent flags). 76 - 77 - Each extent is described by a single fiemap_extent structure as 78 - returned in fm_extents. 
79 - 80 - struct fiemap_extent { 81 - __u64 fe_logical; /* logical offset in bytes for the start of 82 - * the extent */ 83 - __u64 fe_physical; /* physical offset in bytes for the start 84 - * of the extent */ 85 - __u64 fe_length; /* length in bytes for the extent */ 86 - __u64 fe_reserved64[2]; 87 - __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ 88 - __u32 fe_reserved[3]; 89 - }; 90 - 91 - All offsets and lengths are in bytes and mirror those on disk. It is valid 92 - for an extents logical offset to start before the request or its logical 93 - length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is 94 - returned, fe_logical, fe_physical, and fe_length will be aligned to the 95 - block size of the file system. With the exception of extents flagged as 96 - FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. 97 - 98 - The fe_flags field contains flags which describe the extent returned. 99 - A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in 100 - the file so that the process making fiemap calls can determine when no 101 - more extents are available, without having to call the ioctl again. 102 - 103 - Some flags are intentionally vague and will always be set in the 104 - presence of other more specific flags. This way a program looking for 105 - a general property does not have to know all existing and future flags 106 - which imply that property. 107 - 108 - For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL 109 - are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking 110 - for inline or tail-packed data can key on the specific flag. Software 111 - which simply cares not to try operating on non-aligned extents 112 - however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to 113 - worry about all present and future flags which might imply unaligned 114 - data. Note that the opposite is not true - it would be valid for 115 - FIEMAP_EXTENT_NOT_ALIGNED to appear alone. 
116 - 117 - * FIEMAP_EXTENT_LAST 118 - This is generally the last extent in the file. A mapping attempt past 119 - this extent may return nothing. Some implementations set this flag to 120 - indicate this extent is the last one in the range queried by the user 121 - (via fiemap->fm_length). 122 - 123 - * FIEMAP_EXTENT_UNKNOWN 124 - The location of this extent is currently unknown. This may indicate 125 - the data is stored on an inaccessible volume or that no storage has 126 - been allocated for the file yet. 127 - 128 - * FIEMAP_EXTENT_DELALLOC 129 - - This will also set FIEMAP_EXTENT_UNKNOWN. 130 - Delayed allocation - while there is data for this extent, its 131 - physical location has not been allocated yet. 132 - 133 - * FIEMAP_EXTENT_ENCODED 134 - This extent does not consist of plain filesystem blocks but is 135 - encoded (e.g. encrypted or compressed). Reading the data in this 136 - extent via I/O to the block device will have undefined results. 137 - 138 - Note that it is *always* undefined to try to update the data 139 - in-place by writing to the indicated location without the 140 - assistance of the filesystem, or to access the data using the 141 - information returned by the FIEMAP interface while the filesystem 142 - is mounted. In other words, user applications may only read the 143 - extent data via I/O to the block device while the filesystem is 144 - unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is 145 - clear; user applications must not try reading or writing to the 146 - filesystem via the block device under any other circumstances. 147 - 148 - * FIEMAP_EXTENT_DATA_ENCRYPTED 149 - - This will also set FIEMAP_EXTENT_ENCODED 150 - The data in this extent has been encrypted by the file system. 151 - 152 - * FIEMAP_EXTENT_NOT_ALIGNED 153 - Extent offsets and length are not guaranteed to be block aligned. 
154 - 155 - * FIEMAP_EXTENT_DATA_INLINE 156 - This will also set FIEMAP_EXTENT_NOT_ALIGNED 157 - Data is located within a meta data block. 158 - 159 - * FIEMAP_EXTENT_DATA_TAIL 160 - This will also set FIEMAP_EXTENT_NOT_ALIGNED 161 - Data is packed into a block with data from other files. 162 - 163 - * FIEMAP_EXTENT_UNWRITTEN 164 - Unwritten extent - the extent is allocated but its data has not been 165 - initialized. This indicates the extent's data will be all zero if read 166 - through the filesystem but the contents are undefined if read directly from 167 - the device. 168 - 169 - * FIEMAP_EXTENT_MERGED 170 - This will be set when a file does not support extents, i.e., it uses a block 171 - based addressing scheme. Since returning an extent for each block back to 172 - userspace would be highly inefficient, the kernel will try to merge most 173 - adjacent blocks into 'extents'. 174 - 175 - 176 - VFS -> File System Implementation 177 - --------------------------------- 178 - 179 - File systems wishing to support fiemap must implement a ->fiemap callback on 180 - their inode_operations structure. The fs ->fiemap call is responsible for 181 - defining its set of supported fiemap flags, and calling a helper function on 182 - each discovered extent: 183 - 184 - struct inode_operations { 185 - ... 186 - 187 - int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, 188 - u64 len); 189 - 190 - ->fiemap is passed struct fiemap_extent_info which describes the 191 - fiemap request: 192 - 193 - struct fiemap_extent_info { 194 - unsigned int fi_flags; /* Flags as passed from user */ 195 - unsigned int fi_extents_mapped; /* Number of mapped extents */ 196 - unsigned int fi_extents_max; /* Size of fiemap_extent array */ 197 - struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ 198 - }; 199 - 200 - It is intended that the file system should not need to access any of this 201 - structure directly. 
Filesystem handlers should be tolerant to signals and return 202 - EINTR once fatal signal received. 203 - 204 - 205 - Flag checking should be done at the beginning of the ->fiemap callback via the 206 - fiemap_check_flags() helper: 207 - 208 - int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); 209 - 210 - The struct fieinfo should be passed in as received from ioctl_fiemap(). The 211 - set of fiemap flags which the fs understands should be passed via fs_flags. If 212 - fiemap_check_flags finds invalid user flags, it will place the bad values in 213 - fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from 214 - fiemap_check_flags(), it should immediately exit, returning that error back to 215 - ioctl_fiemap(). 216 - 217 - 218 - For each extent in the request range, the file system should call 219 - the helper function, fiemap_fill_next_extent(): 220 - 221 - int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, 222 - u64 phys, u64 len, u32 flags, u32 dev); 223 - 224 - fiemap_fill_next_extent() will use the passed values to populate the 225 - next free extent in the fm_extents array. 'General' extent flags will 226 - automatically be set from specific flags on behalf of the calling file 227 - system so that the userspace API is not broken. 228 - 229 - fiemap_fill_next_extent() returns 0 on success, and 1 when the 230 - user-supplied fm_extents array is full. If an error is encountered 231 - while copying the extent to user memory, -EFAULT will be returned.

+128

Documentation/filesystems/files.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =================================== 4 + File management in the Linux kernel 5 + =================================== 6 + 7 + This document describes how locking for files (struct file) 8 + and file descriptor table (struct files) works. 9 + 10 + Up until 2.6.12, the file descriptor table has been protected 11 + with a lock (files->file_lock) and reference count (files->count). 12 + ->file_lock protected accesses to all the file related fields 13 + of the table. ->count was used for sharing the file descriptor 14 + table between tasks cloned with CLONE_FILES flag. Typically 15 + this would be the case for posix threads. As with the common 16 + refcounting model in the kernel, the last task doing 17 + a put_files_struct() frees the file descriptor (fd) table. 18 + The files (struct file) themselves are protected using 19 + reference count (->f_count). 20 + 21 + In the new lock-free model of file descriptor management, 22 + the reference counting is similar, but the locking is 23 + based on RCU. The file descriptor table contains multiple 24 + elements - the fd sets (open_fds and close_on_exec, the 25 + array of file pointers, the sizes of the sets and the array 26 + etc.). In order for the updates to appear atomic to 27 + a lock-free reader, all the elements of the file descriptor 28 + table are in a separate structure - struct fdtable. 29 + files_struct contains a pointer to struct fdtable through 30 + which the actual fd table is accessed. Initially the 31 + fdtable is embedded in files_struct itself. On a subsequent 32 + expansion of fdtable, a new fdtable structure is allocated 33 + and files->fdtab points to the new structure. The fdtable 34 + structure is freed with RCU and lock-free readers either 35 + see the old fdtable or the new fdtable making the update 36 + appear atomic. Here are the locking rules for 37 + the fdtable structure - 38 + 39 + 1. 
All references to the fdtable must be done through 40 + the files_fdtable() macro:: 41 + 42 + struct fdtable *fdt; 43 + 44 + rcu_read_lock(); 45 + 46 + fdt = files_fdtable(files); 47 + .... 48 + if (n <= fdt->max_fds) 49 + .... 50 + ... 51 + rcu_read_unlock(); 52 + 53 + files_fdtable() uses rcu_dereference() macro which takes care of 54 + the memory barrier requirements for lock-free dereference. 55 + The fdtable pointer must be read within the read-side 56 + critical section. 57 + 58 + 2. Reading of the fdtable as described above must be protected 59 + by rcu_read_lock()/rcu_read_unlock(). 60 + 61 + 3. For any update to the fd table, files->file_lock must 62 + be held. 63 + 64 + 4. To look up the file structure given an fd, a reader 65 + must use either fcheck() or fcheck_files() APIs. These 66 + take care of barrier requirements due to lock-free lookup. 67 + 68 + An example:: 69 + 70 + struct file *file; 71 + 72 + rcu_read_lock(); 73 + file = fcheck(fd); 74 + if (file) { 75 + ... 76 + } 77 + .... 78 + rcu_read_unlock(); 79 + 80 + 5. Handling of the file structures is special. Since the look-up 81 + of the fd (fget()/fget_light()) are lock-free, it is possible 82 + that look-up may race with the last put() operation on the 83 + file structure. This is avoided using atomic_long_inc_not_zero() 84 + on ->f_count:: 85 + 86 + rcu_read_lock(); 87 + file = fcheck_files(files, fd); 88 + if (file) { 89 + if (atomic_long_inc_not_zero(&file->f_count)) 90 + *fput_needed = 1; 91 + else 92 + /* Didn't get the reference, someone's freed */ 93 + file = NULL; 94 + } 95 + rcu_read_unlock(); 96 + .... 97 + return file; 98 + 99 + atomic_long_inc_not_zero() detects if refcounts is already zero or 100 + goes to zero during increment. If it does, we fail 101 + fget()/fget_light(). 102 + 103 + 6. Since both fdtable and file structures can be looked up 104 + lock-free, they must be installed using rcu_assign_pointer() 105 + API. 
If they are looked up lock-free, rcu_dereference() 106 + must be used. However it is advisable to use files_fdtable() 107 + and fcheck()/fcheck_files() which take care of these issues. 108 + 109 + 7. While updating, the fdtable pointer must be looked up while 110 + holding files->file_lock. If ->file_lock is dropped, then 111 + another thread expand the files thereby creating a new 112 + fdtable and making the earlier fdtable pointer stale. 113 + 114 + For example:: 115 + 116 + spin_lock(&files->file_lock); 117 + fd = locate_fd(files, file, start); 118 + if (fd >= 0) { 119 + /* locate_fd() may have expanded fdtable, load the ptr */ 120 + fdt = files_fdtable(files); 121 + __set_open_fd(fd, fdt); 122 + __clear_close_on_exec(fd, fdt); 123 + spin_unlock(&files->file_lock); 124 + ..... 125 + 126 + Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), 127 + the fdtable pointer (fdt) must be loaded after locate_fd(). 128 +
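Reviewer's note: rule 5 in the files.rst text above relies on an "increment unless zero" operation so that a lock-free lookup cannot revive a file whose last reference has already been dropped. The sketch below models that behaviour in userspace with C11 atomics. It is an illustration only: `model_inc_not_zero` is a hypothetical name, and the kernel's atomic_long_inc_not_zero() is implemented differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for atomic_long_inc_not_zero():
 * take a reference only if the count has not already reached zero.
 */
bool model_inc_not_zero(atomic_long *count)
{
	long old;

	assert(count != NULL);
	old = atomic_load(count);
	while (old != 0) {
		/*
		 * Succeeds only if *count still equals 'old'. On failure,
		 * 'old' is refreshed with the current value and the loop
		 * re-checks it against zero before retrying.
		 */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;	/* the count was zero: the object is being freed */
}
```

A lookup path modelled on the fget() excerpt would treat a false return the same way the excerpt treats a failed atomic_long_inc_not_zero(): it sets the file pointer to NULL and reports that no file was found.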

-123

Documentation/filesystems/files.txt

··· 1 - File management in the Linux kernel 2 - ----------------------------------- 3 - 4 - This document describes how locking for files (struct file) 5 - and file descriptor table (struct files) works. 6 - 7 - Up until 2.6.12, the file descriptor table has been protected 8 - with a lock (files->file_lock) and reference count (files->count). 9 - ->file_lock protected accesses to all the file related fields 10 - of the table. ->count was used for sharing the file descriptor 11 - table between tasks cloned with CLONE_FILES flag. Typically 12 - this would be the case for posix threads. As with the common 13 - refcounting model in the kernel, the last task doing 14 - a put_files_struct() frees the file descriptor (fd) table. 15 - The files (struct file) themselves are protected using 16 - reference count (->f_count). 17 - 18 - In the new lock-free model of file descriptor management, 19 - the reference counting is similar, but the locking is 20 - based on RCU. The file descriptor table contains multiple 21 - elements - the fd sets (open_fds and close_on_exec, the 22 - array of file pointers, the sizes of the sets and the array 23 - etc.). In order for the updates to appear atomic to 24 - a lock-free reader, all the elements of the file descriptor 25 - table are in a separate structure - struct fdtable. 26 - files_struct contains a pointer to struct fdtable through 27 - which the actual fd table is accessed. Initially the 28 - fdtable is embedded in files_struct itself. On a subsequent 29 - expansion of fdtable, a new fdtable structure is allocated 30 - and files->fdtab points to the new structure. The fdtable 31 - structure is freed with RCU and lock-free readers either 32 - see the old fdtable or the new fdtable making the update 33 - appear atomic. Here are the locking rules for 34 - the fdtable structure - 35 - 36 - 1. 
All references to the fdtable must be done through 37 - the files_fdtable() macro : 38 - 39 - struct fdtable *fdt; 40 - 41 - rcu_read_lock(); 42 - 43 - fdt = files_fdtable(files); 44 - .... 45 - if (n <= fdt->max_fds) 46 - .... 47 - ... 48 - rcu_read_unlock(); 49 - 50 - files_fdtable() uses rcu_dereference() macro which takes care of 51 - the memory barrier requirements for lock-free dereference. 52 - The fdtable pointer must be read within the read-side 53 - critical section. 54 - 55 - 2. Reading of the fdtable as described above must be protected 56 - by rcu_read_lock()/rcu_read_unlock(). 57 - 58 - 3. For any update to the fd table, files->file_lock must 59 - be held. 60 - 61 - 4. To look up the file structure given an fd, a reader 62 - must use either fcheck() or fcheck_files() APIs. These 63 - take care of barrier requirements due to lock-free lookup. 64 - An example : 65 - 66 - struct file *file; 67 - 68 - rcu_read_lock(); 69 - file = fcheck(fd); 70 - if (file) { 71 - ... 72 - } 73 - .... 74 - rcu_read_unlock(); 75 - 76 - 5. Handling of the file structures is special. Since the look-up 77 - of the fd (fget()/fget_light()) are lock-free, it is possible 78 - that look-up may race with the last put() operation on the 79 - file structure. This is avoided using atomic_long_inc_not_zero() 80 - on ->f_count : 81 - 82 - rcu_read_lock(); 83 - file = fcheck_files(files, fd); 84 - if (file) { 85 - if (atomic_long_inc_not_zero(&file->f_count)) 86 - *fput_needed = 1; 87 - else 88 - /* Didn't get the reference, someone's freed */ 89 - file = NULL; 90 - } 91 - rcu_read_unlock(); 92 - .... 93 - return file; 94 - 95 - atomic_long_inc_not_zero() detects if refcounts is already zero or 96 - goes to zero during increment. If it does, we fail 97 - fget()/fget_light(). 98 - 99 - 6. Since both fdtable and file structures can be looked up 100 - lock-free, they must be installed using rcu_assign_pointer() 101 - API. 
If they are looked up lock-free, rcu_dereference() 102 - must be used. However it is advisable to use files_fdtable() 103 - and fcheck()/fcheck_files() which take care of these issues. 104 - 105 - 7. While updating, the fdtable pointer must be looked up while 106 - holding files->file_lock. If ->file_lock is dropped, then 107 - another thread expand the files thereby creating a new 108 - fdtable and making the earlier fdtable pointer stale. 109 - For example : 110 - 111 - spin_lock(&files->file_lock); 112 - fd = locate_fd(files, file, start); 113 - if (fd >= 0) { 114 - /* locate_fd() may have expanded fdtable, load the ptr */ 115 - fdt = files_fdtable(files); 116 - __set_open_fd(fd, fdt); 117 - __clear_close_on_exec(fd, fdt); 118 - spin_unlock(&files->file_lock); 119 - ..... 120 - 121 - Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), 122 - the fdtable pointer (fdt) must be loaded after locate_fd(). 123 -

+44

Documentation/filesystems/fuse-io.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============== 4 + Fuse I/O Modes 5 + ============== 6 + 7 + Fuse supports the following I/O modes: 8 + 9 + - direct-io 10 + - cached 11 + + write-through 12 + + writeback-cache 13 + 14 + The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the 15 + FUSE_OPEN reply. 16 + 17 + In direct-io mode the page cache is completely bypassed for reads and writes. 18 + No read-ahead takes place. Shared mmap is disabled. 19 + 20 + In cached mode reads may be satisfied from the page cache, and data may be 21 + read-ahead by the kernel to fill the cache. The cache is always kept consistent 22 + after any writes to the file. All mmap modes are supported. 23 + 24 + The cached mode has two sub modes controlling how writes are handled. The 25 + write-through mode is the default and is supported on all kernels. The 26 + writeback-cache mode may be selected by the FUSE_WRITEBACK_CACHE flag in the 27 + FUSE_INIT reply. 28 + 29 + In write-through mode each write is immediately sent to userspace as one or more 30 + WRITE requests, as well as updating any cached pages (and caching previously 31 + uncached, but fully written pages). No READ requests are ever sent for writes, 32 + so when an uncached page is partially written, the page is discarded. 33 + 34 + In writeback-cache mode (enabled by the FUSE_WRITEBACK_CACHE flag) writes go to 35 + the cache only, which means that the write(2) syscall can often complete very 36 + fast. Dirty pages are written back implicitly (background writeback or page 37 + reclaim on memory pressure) or explicitly (invoked by close(2), fsync(2) and 38 + when the last ref to the file is being released on munmap(2)). This mode 39 + assumes that all changes to the filesystem go through the FUSE kernel module 40 + (size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so 41 + it's generally not suitable for network filesystems. 
If a partial page is 42 + written, then the page needs to be first read from userspace. This means, that 43 + even for files opened for O_WRONLY it is possible that READ requests will be 44 + generated by the kernel.

-38

Documentation/filesystems/fuse-io.txt

··· 1 - Fuse supports the following I/O modes: 2 - 3 - - direct-io 4 - - cached 5 - + write-through 6 - + writeback-cache 7 - 8 - The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the 9 - FUSE_OPEN reply. 10 - 11 - In direct-io mode the page cache is completely bypassed for reads and writes. 12 - No read-ahead takes place. Shared mmap is disabled. 13 - 14 - In cached mode reads may be satisfied from the page cache, and data may be 15 - read-ahead by the kernel to fill the cache. The cache is always kept consistent 16 - after any writes to the file. All mmap modes are supported. 17 - 18 - The cached mode has two sub modes controlling how writes are handled. The 19 - write-through mode is the default and is supported on all kernels. The 20 - writeback-cache mode may be selected by the FUSE_WRITEBACK_CACHE flag in the 21 - FUSE_INIT reply. 22 - 23 - In write-through mode each write is immediately sent to userspace as one or more 24 - WRITE requests, as well as updating any cached pages (and caching previously 25 - uncached, but fully written pages). No READ requests are ever sent for writes, 26 - so when an uncached page is partially written, the page is discarded. 27 - 28 - In writeback-cache mode (enabled by the FUSE_WRITEBACK_CACHE flag) writes go to 29 - the cache only, which means that the write(2) syscall can often complete very 30 - fast. Dirty pages are written back implicitly (background writeback or page 31 - reclaim on memory pressure) or explicitly (invoked by close(2), fsync(2) and 32 - when the last ref to the file is being released on munmap(2)). This mode 33 - assumes that all changes to the filesystem go through the FUSE kernel module 34 - (size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so 35 - it's generally not suitable for network filesystems. If a partial page is 36 - written, then the page needs to be first read from userspace. 
This means, that 37 - even for files opened for O_WRONLY it is possible that READ requests will be 38 - generated by the kernel.

+23

Documentation/filesystems/index.rst

··· 24 24 splice 25 25 locking 26 26 directory-locking 27 + devpts 28 + dnotify 29 + fiemap 30 + files 31 + locks 32 + mandatory-locking 33 + mount_api 34 + quota 35 + seq_file 36 + sharedsubtree 37 + sysfs-pci 38 + sysfs-tagging 39 + 40 + automount-support 41 + 42 + caching/index 27 43 28 44 porting 29 45 ··· 73 57 befs 74 58 bfs 75 59 btrfs 60 + cifs/cifsroot 76 61 ceph 62 + coda 63 + configfs 77 64 cramfs 78 65 debugfs 79 66 dlmfs ··· 92 73 hfsplus 93 74 hpfs 94 75 fuse 76 + fuse-io 95 77 inotify 96 78 isofs 97 79 nilfs2 ··· 108 88 ramfs-rootfs-initramfs 109 89 relay 110 90 romfs 91 + spufs/index 111 92 squashfs 112 93 sysfs 113 94 sysv-fs ··· 118 97 udf 119 98 virtiofs 120 99 vfat 100 + xfs-delayed-logging-design 101 + xfs-self-describing-metadata 121 102 zonefs

+72

Documentation/filesystems/locks.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========================== 4 + File Locking Release Notes 5 + ========================== 6 + 7 + Andy Walker <andy@lysaker.kvaerner.no> 8 + 9 + 12 May 1997 10 + 11 + 12 + 1. What's New? 13 + ============== 14 + 15 + 1.1 Broken Flock Emulation 16 + -------------------------- 17 + 18 + The old flock(2) emulation in the kernel was swapped for proper BSD 19 + compatible flock(2) support in the 1.3.x series of kernels. With the 20 + release of the 2.1.x kernel series, support for the old emulation has 21 + been totally removed, so that we don't need to carry this baggage 22 + forever. 23 + 24 + This should not cause problems for anybody, since everybody using a 25 + 2.1.x kernel should have updated their C library to a suitable version 26 + anyway (see the file "Documentation/process/changes.rst".) 27 + 28 + 1.2 Allow Mixed Locks Again 29 + --------------------------- 30 + 31 + 1.2.1 Typical Problems - Sendmail 32 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 33 + Because sendmail was unable to use the old flock() emulation, many sendmail 34 + installations use fcntl() instead of flock(). This is true of Slackware 3.0 35 + for example. This gave rise to some other subtle problems if sendmail was 36 + configured to rebuild the alias file. Sendmail tried to lock the aliases.dir 37 + file with fcntl() at the same time as the GDBM routines tried to lock this 38 + file with flock(). With pre 1.3.96 kernels this could result in deadlocks that, 39 + over time, or under a very heavy mail load, would eventually cause the kernel 40 + to lock solid with deadlocked processes. 41 + 42 + 43 + 1.2.2 The Solution 44 + ^^^^^^^^^^^^^^^^^^ 45 + The solution I have chosen, after much experimentation and discussion, 46 + is to make flock() and fcntl() locks oblivious to each other. Both can 47 + exists, and neither will have any effect on the other. 
48 + 49 + I wanted the two lock styles to be cooperative, but there were so many 50 + race and deadlock conditions that the current solution was the only 51 + practical one. It puts us in the same position as, for example, SunOS 52 + 4.1.x and several other commercial Unices. The only OS's that support 53 + cooperative flock()/fcntl() are those that emulate flock() using 54 + fcntl(), with all the problems that implies. 55 + 56 + 57 + 1.3 Mandatory Locking As A Mount Option 58 + --------------------------------------- 59 + 60 + Mandatory locking, as described in 61 + 'Documentation/filesystems/mandatory-locking.rst' was prior to this release a 62 + general configuration option that was valid for all mounted filesystems. This 63 + had a number of inherent dangers, not the least of which was the ability to 64 + freeze an NFS server by asking it to read a file for which a mandatory lock 65 + existed. 66 + 67 + From this release of the kernel, mandatory locking can be turned on and off 68 + on a per-filesystem basis, using the mount options 'mand' and 'nomand'. 69 + The default is to disallow mandatory locking. The intention is that 70 + mandatory locking only be enabled on a local filesystem as the specific need 71 + arises. 72 +

-68

Documentation/filesystems/locks.txt

··· 1 - File Locking Release Notes 2 - 3 - Andy Walker <andy@lysaker.kvaerner.no> 4 - 5 - 12 May 1997 6 - 7 - 8 - 1. What's New? 9 - -------------- 10 - 11 - 1.1 Broken Flock Emulation 12 - -------------------------- 13 - 14 - The old flock(2) emulation in the kernel was swapped for proper BSD 15 - compatible flock(2) support in the 1.3.x series of kernels. With the 16 - release of the 2.1.x kernel series, support for the old emulation has 17 - been totally removed, so that we don't need to carry this baggage 18 - forever. 19 - 20 - This should not cause problems for anybody, since everybody using a 21 - 2.1.x kernel should have updated their C library to a suitable version 22 - anyway (see the file "Documentation/process/changes.rst".) 23 - 24 - 1.2 Allow Mixed Locks Again 25 - --------------------------- 26 - 27 - 1.2.1 Typical Problems - Sendmail 28 - --------------------------------- 29 - Because sendmail was unable to use the old flock() emulation, many sendmail 30 - installations use fcntl() instead of flock(). This is true of Slackware 3.0 31 - for example. This gave rise to some other subtle problems if sendmail was 32 - configured to rebuild the alias file. Sendmail tried to lock the aliases.dir 33 - file with fcntl() at the same time as the GDBM routines tried to lock this 34 - file with flock(). With pre 1.3.96 kernels this could result in deadlocks that, 35 - over time, or under a very heavy mail load, would eventually cause the kernel 36 - to lock solid with deadlocked processes. 37 - 38 - 39 - 1.2.2 The Solution 40 - ------------------ 41 - The solution I have chosen, after much experimentation and discussion, 42 - is to make flock() and fcntl() locks oblivious to each other. Both can 43 - exists, and neither will have any effect on the other. 44 - 45 - I wanted the two lock styles to be cooperative, but there were so many 46 - race and deadlock conditions that the current solution was the only 47 - practical one. 
It puts us in the same position as, for example, SunOS 48 - 4.1.x and several other commercial Unices. The only OS's that support 49 - cooperative flock()/fcntl() are those that emulate flock() using 50 - fcntl(), with all the problems that implies. 51 - 52 - 53 - 1.3 Mandatory Locking As A Mount Option 54 - --------------------------------------- 55 - 56 - Mandatory locking, as described in 57 - 'Documentation/filesystems/mandatory-locking.txt' was prior to this release a 58 - general configuration option that was valid for all mounted filesystems. This 59 - had a number of inherent dangers, not the least of which was the ability to 60 - freeze an NFS server by asking it to read a file for which a mandatory lock 61 - existed. 62 - 63 - From this release of the kernel, mandatory locking can be turned on and off 64 - on a per-filesystem basis, using the mount options 'mand' and 'nomand'. 65 - The default is to disallow mandatory locking. The intention is that 66 - mandatory locking only be enabled on a local filesystem as the specific need 67 - arises. 68 -

+188

Documentation/filesystems/mandatory-locking.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ===================================================== 4 + Mandatory File Locking For The Linux Operating System 5 + ===================================================== 6 + 7 + Andy Walker <andy@lysaker.kvaerner.no> 8 + 9 + 15 April 1996 10 + 11 + (Updated September 2007) 12 + 13 + 0. Why you should avoid mandatory locking 14 + ----------------------------------------- 15 + 16 + The Linux implementation is prey to a number of difficult-to-fix race 17 + conditions which in practice make it not dependable: 18 + 19 + - The write system call checks for a mandatory lock only once 20 + at its start. It is therefore possible for a lock request to 21 + be granted after this check but before the data is modified. 22 + A process may then see file data change even while a mandatory 23 + lock was held. 24 + - Similarly, an exclusive lock may be granted on a file after 25 + the kernel has decided to proceed with a read, but before the 26 + read has actually completed, and the reading process may see 27 + the file data in a state which should not have been visible 28 + to it. 29 + - Similar races make the claimed mutual exclusion between lock 30 + and mmap similarly unreliable. 31 + 32 + 1. What is mandatory locking? 33 + ------------------------------ 34 + 35 + Mandatory locking is kernel enforced file locking, as opposed to the more usual 36 + cooperative file locking used to guarantee sequential access to files among 37 + processes. File locks are applied using the flock() and fcntl() system calls 38 + (and the lockf() library routine which is a wrapper around fcntl().) It is 39 + normally a process' responsibility to check for locks on a file it wishes to 40 + update, before applying its own lock, updating the file and unlocking it again. 41 + The most commonly used example of this (and in the case of sendmail, the most 42 + troublesome) is access to a user's mailbox. 
The mail user agent and the mail 43 + transfer agent must guard against updating the mailbox at the same time, and 44 + prevent reading the mailbox while it is being updated. 45 + 46 + In a perfect world all processes would use and honour a cooperative, or 47 + "advisory" locking scheme. However, the world isn't perfect, and there's 48 + a lot of poorly written code out there. 49 + 50 + In trying to address this problem, the designers of System V UNIX came up 51 + with a "mandatory" locking scheme, whereby the operating system kernel would 52 + block attempts by a process to write to a file that another process holds a 53 + "read" -or- "shared" lock on, and block attempts to both read and write to a 54 + file that a process holds a "write " -or- "exclusive" lock on. 55 + 56 + The System V mandatory locking scheme was intended to have as little impact as 57 + possible on existing user code. The scheme is based on marking individual files 58 + as candidates for mandatory locking, and using the existing fcntl()/lockf() 59 + interface for applying locks just as if they were normal, advisory locks. 60 + 61 + .. Note:: 62 + 63 + 1. In saying "file" in the paragraphs above I am actually not telling 64 + the whole truth. System V locking is based on fcntl(). The granularity of 65 + fcntl() is such that it allows the locking of byte ranges in files, in 66 + addition to entire files, so the mandatory locking rules also have byte 67 + level granularity. 68 + 69 + 2. POSIX.1 does not specify any scheme for mandatory locking, despite 70 + borrowing the fcntl() locking scheme from System V. The mandatory locking 71 + scheme is defined by the System V Interface Definition (SVID) Version 3. 72 + 73 + 2. Marking a file for mandatory locking 74 + --------------------------------------- 75 + 76 + A file is marked as a candidate for mandatory locking by setting the group-id 77 + bit in its file mode but removing the group-execute bit. 
This is an otherwise 78 + meaningless combination, and was chosen by the System V implementors so as not 79 + to break existing user programs. 80 + 81 + Note that the group-id bit is usually automatically cleared by the kernel when 82 + a setgid file is written to. This is a security measure. The kernel has been 83 + modified to recognize the special case of a mandatory lock candidate and to 84 + refrain from clearing this bit. Similarly the kernel has been modified not 85 + to run mandatory lock candidates with setgid privileges. 86 + 87 + 3. Available implementations 88 + ---------------------------- 89 + 90 + I have considered the implementations of mandatory locking available with 91 + SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. 92 + 93 + Generally I have tried to make the most sense out of the behaviour exhibited 94 + by these three reference systems. There are many anomalies. 95 + 96 + All the reference systems reject all calls to open() for a file on which 97 + another process has outstanding mandatory locks. This is in direct 98 + contravention of SVID 3, which states that only calls to open() with the 99 + O_TRUNC flag set should be rejected. The Linux implementation follows the SVID 100 + definition, which is the "Right Thing", since only calls with O_TRUNC can 101 + modify the contents of the file. 102 + 103 + HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not 104 + just mandatory locks. That would appear to contravene POSIX.1. 105 + 106 + mmap() is another interesting case. All the operating systems mentioned 107 + prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX 108 + also disallows advisory locks for such a file. SVID actually specifies the 109 + paranoid HP-UX behaviour. 110 + 111 + In my opinion only MAP_SHARED mappings should be immune from locking, and then 112 + only from mandatory locks - that is what is currently implemented. 
113 + 114 + SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for 115 + mandatory locks, so reads and writes to locked files always block when they 116 + should return EAGAIN. 117 + 118 + I'm afraid that this is such an esoteric area that the semantics described 119 + below are just as valid as any others, so long as the main points seem to 120 + agree. 121 + 122 + 4. Semantics 123 + ------------ 124 + 125 + 1. Mandatory locks can only be applied via the fcntl()/lockf() locking 126 + interface - in other words the System V/POSIX interface. BSD style 127 + locks using flock() never result in a mandatory lock. 128 + 129 + 2. If a process has locked a region of a file with a mandatory read lock, then 130 + other processes are permitted to read from that region. If any of these 131 + processes attempts to write to the region it will block until the lock is 132 + released, unless the process has opened the file with the O_NONBLOCK 133 + flag in which case the system call will return immediately with the error 134 + status EAGAIN. 135 + 136 + 3. If a process has locked a region of a file with a mandatory write lock, all 137 + attempts to read or write to that region block until the lock is released, 138 + unless a process has opened the file with the O_NONBLOCK flag in which case 139 + the system call will return immediately with the error status EAGAIN. 140 + 141 + 4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has 142 + any mandatory locks owned by other processes will be rejected with the 143 + error status EAGAIN. 144 + 145 + 5. Attempts to apply a mandatory lock to a file that is memory mapped and 146 + shared (via mmap() with MAP_SHARED) will be rejected with the error status 147 + EAGAIN. 148 + 149 + 6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) 150 + that has any mandatory locks in effect will be rejected with the error status 151 + EAGAIN. 152 + 153 + 5.
Which system calls are affected? 154 + ----------------------------------- 155 + 156 + Those which modify a file's contents, not just the inode. That gives read(), 157 + write(), readv(), writev(), open(), creat(), mmap(), truncate() and 158 + ftruncate(). truncate() and ftruncate() are considered to be "write" actions 159 + for the purposes of mandatory locking. 160 + 161 + The affected region is usually defined as stretching from the current position 162 + for the total number of bytes read or written. For the truncate calls it is 163 + defined as the bytes of a file removed or added (we must also consider bytes 164 + added, as a lock can specify just "the whole file", rather than a specific 165 + range of bytes.) 166 + 167 + Note 3: I may have overlooked some system calls that need mandatory lock 168 + checking in my eagerness to get this code out the door. Please let me know, or 169 + better still fix the system calls yourself and submit a patch to me or Linus. 170 + 171 + 6. Warning! 172 + ----------- 173 + 174 + Not even root can override a mandatory lock, so runaway processes can wreak 175 + havoc if they lock crucial files. The way around it is to change the file 176 + permissions (remove the setgid bit) before trying to read or write to it. 177 + Of course, that might be a bit tricky if the system is hung :-( 178 + 179 + 7. The "mand" mount option 180 + -------------------------- 181 + Mandatory locking is disabled on all filesystems by default, and must be 182 + administratively enabled by mounting with "-o mand". That mount option 183 + is only allowed if the mounting task has the CAP_SYS_ADMIN capability. 184 + 185 + Since kernel v4.5, it is possible to disable mandatory locking 186 + altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel 187 + with this disabled will reject attempts to mount filesystems with the 188 + "mand" mount option with the error status EPERM.
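Sections 2 and 4 of the document above describe how a file becomes a mandatory-lock candidate (set-group-ID bit on, group-execute bit off) and that the lock itself is taken through the ordinary fcntl() interface. The following is a minimal userspace sketch of those two steps, not part of the commit. All four helper names are hypothetical. Whether the lock is actually enforced depends on the kernel still supporting mandatory locking and on the filesystem being mounted with "-o mand"; otherwise the lock is advisory only.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Mode bits for a mandatory-lock candidate: setgid on, group-execute off. */
mode_t mandatory_lock_mode(mode_t mode)
{
	return (mode | S_ISGID) & ~S_IXGRP;
}

/* Nonzero if the mode carries the "setgid without group-execute" marking. */
int is_mandatory_candidate(mode_t mode)
{
	return (mode & (S_ISGID | S_IXGRP)) == S_ISGID;
}

/* Mark an existing file as a candidate; returns 0 on success, -1 on error. */
int mark_mandatory_candidate(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	return chmod(path, mandatory_lock_mode(st.st_mode & 07777));
}

/*
 * Take a non-blocking fcntl() write lock on [start, start + len).
 * A len of 0 locks from start to the end of the file.
 */
int write_lock_region(int fd, off_t start, off_t len)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = start,
		.l_len = len,
	};

	return fcntl(fd, F_SETLK, &fl);
}
```

A caller would typically run `mark_mandatory_candidate()` once on the file and then use `write_lock_region()` as for any advisory lock. As section 6 warns, clearing the setgid bit is the way to get out from under a lock held by a runaway process.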

-181

Documentation/filesystems/mandatory-locking.txt

··· 1 - Mandatory File Locking For The Linux Operating System 2 - 3 - Andy Walker <andy@lysaker.kvaerner.no> 4 - 5 - 15 April 1996 6 - (Updated September 2007) 7 - 8 - 0. Why you should avoid mandatory locking 9 - ----------------------------------------- 10 - 11 - The Linux implementation is prey to a number of difficult-to-fix race 12 - conditions which in practice make it not dependable: 13 - 14 - - The write system call checks for a mandatory lock only once 15 - at its start. It is therefore possible for a lock request to 16 - be granted after this check but before the data is modified. 17 - A process may then see file data change even while a mandatory 18 - lock was held. 19 - - Similarly, an exclusive lock may be granted on a file after 20 - the kernel has decided to proceed with a read, but before the 21 - read has actually completed, and the reading process may see 22 - the file data in a state which should not have been visible 23 - to it. 24 - - Similar races make the claimed mutual exclusion between lock 25 - and mmap similarly unreliable. 26 - 27 - 1. What is mandatory locking? 28 - ------------------------------ 29 - 30 - Mandatory locking is kernel enforced file locking, as opposed to the more usual 31 - cooperative file locking used to guarantee sequential access to files among 32 - processes. File locks are applied using the flock() and fcntl() system calls 33 - (and the lockf() library routine which is a wrapper around fcntl().) It is 34 - normally a process' responsibility to check for locks on a file it wishes to 35 - update, before applying its own lock, updating the file and unlocking it again. 36 - The most commonly used example of this (and in the case of sendmail, the most 37 - troublesome) is access to a user's mailbox. The mail user agent and the mail 38 - transfer agent must guard against updating the mailbox at the same time, and 39 - prevent reading the mailbox while it is being updated. 
40 - 41 - In a perfect world all processes would use and honour a cooperative, or 42 - "advisory" locking scheme. However, the world isn't perfect, and there's 43 - a lot of poorly written code out there. 44 - 45 - In trying to address this problem, the designers of System V UNIX came up 46 - with a "mandatory" locking scheme, whereby the operating system kernel would 47 - block attempts by a process to write to a file that another process holds a 48 - "read" -or- "shared" lock on, and block attempts to both read and write to a 49 - file that a process holds a "write " -or- "exclusive" lock on. 50 - 51 - The System V mandatory locking scheme was intended to have as little impact as 52 - possible on existing user code. The scheme is based on marking individual files 53 - as candidates for mandatory locking, and using the existing fcntl()/lockf() 54 - interface for applying locks just as if they were normal, advisory locks. 55 - 56 - Note 1: In saying "file" in the paragraphs above I am actually not telling 57 - the whole truth. System V locking is based on fcntl(). The granularity of 58 - fcntl() is such that it allows the locking of byte ranges in files, in addition 59 - to entire files, so the mandatory locking rules also have byte level 60 - granularity. 61 - 62 - Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite 63 - borrowing the fcntl() locking scheme from System V. The mandatory locking 64 - scheme is defined by the System V Interface Definition (SVID) Version 3. 65 - 66 - 2. Marking a file for mandatory locking 67 - --------------------------------------- 68 - 69 - A file is marked as a candidate for mandatory locking by setting the group-id 70 - bit in its file mode but removing the group-execute bit. This is an otherwise 71 - meaningless combination, and was chosen by the System V implementors so as not 72 - to break existing user programs. 
73 - 74 - Note that the group-id bit is usually automatically cleared by the kernel when 75 - a setgid file is written to. This is a security measure. The kernel has been 76 - modified to recognize the special case of a mandatory lock candidate and to 77 - refrain from clearing this bit. Similarly the kernel has been modified not 78 - to run mandatory lock candidates with setgid privileges. 79 - 80 - 3. Available implementations 81 - ---------------------------- 82 - 83 - I have considered the implementations of mandatory locking available with 84 - SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. 85 - 86 - Generally I have tried to make the most sense out of the behaviour exhibited 87 - by these three reference systems. There are many anomalies. 88 - 89 - All the reference systems reject all calls to open() for a file on which 90 - another process has outstanding mandatory locks. This is in direct 91 - contravention of SVID 3, which states that only calls to open() with the 92 - O_TRUNC flag set should be rejected. The Linux implementation follows the SVID 93 - definition, which is the "Right Thing", since only calls with O_TRUNC can 94 - modify the contents of the file. 95 - 96 - HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not 97 - just mandatory locks. That would appear to contravene POSIX.1. 98 - 99 - mmap() is another interesting case. All the operating systems mentioned 100 - prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX 101 - also disallows advisory locks for such a file. SVID actually specifies the 102 - paranoid HP-UX behaviour. 103 - 104 - In my opinion only MAP_SHARED mappings should be immune from locking, and then 105 - only from mandatory locks - that is what is currently implemented. 106 - 107 - SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for 108 - mandatory locks, so reads and writes to locked files always block when they 109 - should return EAGAIN. 
110 - 111 - I'm afraid that this is such an esoteric area that the semantics described 112 - below are just as valid as any others, so long as the main points seem to 113 - agree. 114 - 115 - 4. Semantics 116 - ------------ 117 - 118 - 1. Mandatory locks can only be applied via the fcntl()/lockf() locking 119 - interface - in other words the System V/POSIX interface. BSD style 120 - locks using flock() never result in a mandatory lock. 121 - 122 - 2. If a process has locked a region of a file with a mandatory read lock, then 123 - other processes are permitted to read from that region. If any of these 124 - processes attempts to write to the region it will block until the lock is 125 - released, unless the process has opened the file with the O_NONBLOCK 126 - flag in which case the system call will return immediately with the error 127 - status EAGAIN. 128 - 129 - 3. If a process has locked a region of a file with a mandatory write lock, all 130 - attempts to read or write to that region block until the lock is released, 131 - unless a process has opened the file with the O_NONBLOCK flag in which case 132 - the system call will return immediately with the error status EAGAIN. 133 - 134 - 4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has 135 - any mandatory locks owned by other processes will be rejected with the 136 - error status EAGAIN. 137 - 138 - 5. Attempts to apply a mandatory lock to a file that is memory mapped and 139 - shared (via mmap() with MAP_SHARED) will be rejected with the error status 140 - EAGAIN. 141 - 142 - 6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) 143 - that has any mandatory locks in effect will be rejected with the error status 144 - EAGAIN. 145 - 146 - 5. Which system calls are affected? 147 - ----------------------------------- 148 - 149 - Those which modify a file's contents, not just the inode.
That gives read(), 150 - write(), readv(), writev(), open(), creat(), mmap(), truncate() and 151 - ftruncate(). truncate() and ftruncate() are considered to be "write" actions 152 - for the purposes of mandatory locking. 153 - 154 - The affected region is usually defined as stretching from the current position 155 - for the total number of bytes read or written. For the truncate calls it is 156 - defined as the bytes of a file removed or added (we must also consider bytes 157 - added, as a lock can specify just "the whole file", rather than a specific 158 - range of bytes.) 159 - 160 - Note 3: I may have overlooked some system calls that need mandatory lock 161 - checking in my eagerness to get this code out the door. Please let me know, or 162 - better still fix the system calls yourself and submit a patch to me or Linus. 163 - 164 - 6. Warning! 165 - ----------- 166 - 167 - Not even root can override a mandatory lock, so runaway processes can wreak 168 - havoc if they lock crucial files. The way around it is to change the file 169 - permissions (remove the setgid bit) before trying to read or write to it. 170 - Of course, that might be a bit tricky if the system is hung :-( 171 - 172 - 7. The "mand" mount option 173 - -------------------------- 174 - Mandatory locking is disabled on all filesystems by default, and must be 175 - administratively enabled by mounting with "-o mand". That mount option 176 - is only allowed if the mounting task has the CAP_SYS_ADMIN capability. 177 - 178 - Since kernel v4.5, it is possible to disable mandatory locking 179 - altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel 180 - with this disabled will reject attempts to mount filesystems with the 181 - "mand" mount option with the error status EPERM.

+825

Documentation/filesystems/mount_api.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ==================== 4 + Filesystem Mount API 5 + ==================== 6 + 7 + .. CONTENTS 8 + 9 + (1) Overview. 10 + 11 + (2) The filesystem context. 12 + 13 + (3) The filesystem context operations. 14 + 15 + (4) Filesystem context security. 16 + 17 + (5) VFS filesystem context API. 18 + 19 + (6) Superblock creation helpers. 20 + 21 + (7) Parameter description. 22 + 23 + (8) Parameter helper functions. 24 + 25 + 26 + Overview 27 + ======== 28 + 29 + The creation of new mounts is now to be done in a multistep process: 30 + 31 + (1) Create a filesystem context. 32 + 33 + (2) Parse the parameters and attach them to the context. Parameters are 34 + expected to be passed individually from userspace, though legacy binary 35 + parameters can also be handled. 36 + 37 + (3) Validate and pre-process the context. 38 + 39 + (4) Get or create a superblock and mountable root. 40 + 41 + (5) Perform the mount. 42 + 43 + (6) Return an error message attached to the context. 44 + 45 + (7) Destroy the context. 46 + 47 + To support this, the file_system_type struct gains two new fields:: 48 + 49 + int (*init_fs_context)(struct fs_context *fc); 50 + const struct fs_parameter_description *parameters; 51 + 52 + The first is invoked to set up the filesystem-specific parts of a filesystem 53 + context, including the additional space, and the second points to the 54 + parameter description for validation at registration time and querying by a 55 + future system call. 56 + 57 + Note that security initialisation is done *after* the filesystem is called so 58 + that the namespaces may be adjusted first. 59 + 60 + 61 + The Filesystem context 62 + ====================== 63 + 64 + The creation and reconfiguration of a superblock is governed by a filesystem 65 + context.
This is represented by the fs_context structure:: 66 + 67 + struct fs_context { 68 + const struct fs_context_operations *ops; 69 + struct file_system_type *fs_type; 70 + void *fs_private; 71 + struct dentry *root; 72 + struct user_namespace *user_ns; 73 + struct net *net_ns; 74 + const struct cred *cred; 75 + char *source; 76 + char *subtype; 77 + void *security; 78 + void *s_fs_info; 79 + unsigned int sb_flags; 80 + unsigned int sb_flags_mask; 81 + unsigned int s_iflags; 82 + unsigned int lsm_flags; 83 + enum fs_context_purpose purpose:8; 84 + ... 85 + }; 86 + 87 + The fs_context fields are as follows: 88 + 89 + * :: 90 + 91 + const struct fs_context_operations *ops 92 + 93 + These are operations that can be done on a filesystem context (see 94 + below). This must be set by the ->init_fs_context() file_system_type 95 + operation. 96 + 97 + * :: 98 + 99 + struct file_system_type *fs_type 100 + 101 + A pointer to the file_system_type of the filesystem that is being 102 + constructed or reconfigured. This retains a reference on the type owner. 103 + 104 + * :: 105 + 106 + void *fs_private 107 + 108 + A pointer to the file system's private data. This is where the filesystem 109 + will need to store any options it parses. 110 + 111 + * :: 112 + 113 + struct dentry *root 114 + 115 + A pointer to the root of the mountable tree (and indirectly, the 116 + superblock thereof). This is filled in by the ->get_tree() op. If this 117 + is set, an active reference on root->d_sb must also be held. 118 + 119 + * :: 120 + 121 + struct user_namespace *user_ns 122 + struct net *net_ns 123 + 124 + These are a subset of the namespaces in use by the invoking process. They 125 + retain references on each namespace. The subscribed namespaces may be 126 + replaced by the filesystem to reflect other sources, such as the parent 127 + mount superblock on an automount. 128 + 129 + * :: 130 + 131 + const struct cred *cred 132 + 133 + The mounter's credentials.
This retains a reference on the credentials. 134 + 135 + * :: 136 + 137 + char *source 138 + 139 + This specifies the source. It may be a block device (e.g. /dev/sda1) or 140 + something more exotic, such as the "host:/path" that NFS desires. 141 + 142 + * :: 143 + 144 + char *subtype 145 + 146 + This is a string to be added to the type displayed in /proc/mounts to 147 + qualify it (used by FUSE). This is available for the filesystem to set if 148 + desired. 149 + 150 + * :: 151 + 152 + void *security 153 + 154 + A place for the LSMs to hang their security data for the superblock. The 155 + relevant security operations are described below. 156 + 157 + * :: 158 + 159 + void *s_fs_info 160 + 161 + The proposed s_fs_info for a new superblock, set in the superblock by 162 + sget_fc(). This can be used to distinguish superblocks. 163 + 164 + * :: 165 + 166 + unsigned int sb_flags 167 + unsigned int sb_flags_mask 168 + 169 + Which SB_* flag bits are to be set/cleared in super_block::s_flags. 170 + 171 + * :: 172 + 173 + unsigned int s_iflags 174 + 175 + These will be bitwise-OR'd with s->s_iflags when a superblock is created. 176 + 177 + * :: 178 + 179 + enum fs_context_purpose 180 + 181 + This indicates the purpose for which the context is intended. The 182 + available values are: 183 + 184 + ========================== ====================================== 185 + FS_CONTEXT_FOR_MOUNT New superblock for explicit mount 186 + FS_CONTEXT_FOR_SUBMOUNT New automatic submount of extant mount 187 + FS_CONTEXT_FOR_RECONFIGURE Change an existing mount 188 + ========================== ====================================== 189 + 190 + The mount context is created by calling vfs_new_fs_context() or 191 + vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the 192 + structure is not refcounted. 193 + 194 + VFS, security and filesystem mount options are set individually with 195 + vfs_parse_mount_option().
Options provided by the old mount(2) system call as 196 + a page of data can be parsed with generic_parse_monolithic(). 197 + 198 + When mounting, the filesystem is allowed to take data from any of the pointers 199 + and attach it to the superblock (or whatever), provided it clears the pointer 200 + in the mount context. 201 + 202 + The filesystem is also allowed to allocate resources and pin them with the 203 + mount context. For instance, NFS might pin the appropriate protocol version 204 + module. 205 + 206 + 207 + The Filesystem Context Operations 208 + ================================= 209 + 210 + The filesystem context points to a table of operations:: 211 + 212 + struct fs_context_operations { 213 + void (*free)(struct fs_context *fc); 214 + int (*dup)(struct fs_context *fc, struct fs_context *src_fc); 215 + int (*parse_param)(struct fs_context *fc, 216 + struct fs_parameter *param); 217 + int (*parse_monolithic)(struct fs_context *fc, void *data); 218 + int (*get_tree)(struct fs_context *fc); 219 + int (*reconfigure)(struct fs_context *fc); 220 + }; 221 + 222 + These operations are invoked by the various stages of the mount procedure to 223 + manage the filesystem context. They are as follows: 224 + 225 + * :: 226 + 227 + void (*free)(struct fs_context *fc); 228 + 229 + Called to clean up the filesystem-specific part of the filesystem context 230 + when the context is destroyed. It should be aware that parts of the 231 + context may have been removed and NULL'd out by ->get_tree(). 232 + 233 + * :: 234 + 235 + int (*dup)(struct fs_context *fc, struct fs_context *src_fc); 236 + 237 + Called when a filesystem context has been duplicated to duplicate the 238 + filesystem-private data. An error may be returned to indicate failure to 239 + do this. 240 + 241 + .. Warning:: 242 + 243 + Note that even if this fails, put_fs_context() will be called 244 + immediately thereafter, so ->dup() *must* make the 245 + filesystem-private data safe for ->free().
246 + 247 + * :: 248 + 249 + int (*parse_param)(struct fs_context *fc, 250 + struct fs_parameter *param); 251 + 252 + Called when a parameter is being added to the filesystem context. param 253 + points to the key name and maybe a value object. VFS-specific options 254 + will have been weeded out and fc->sb_flags updated in the context. 255 + Security options will also have been weeded out and fc->security updated. 256 + 257 + The parameter can be parsed with fs_parse() and fs_lookup_param(). Note 258 + that the source(s) are presented as parameters named "source". 259 + 260 + If successful, 0 should be returned or a negative error code otherwise. 261 + 262 + * :: 263 + 264 + int (*parse_monolithic)(struct fs_context *fc, void *data); 265 + 266 + Called when the mount(2) system call is invoked to pass the entire data 267 + page in one go. If this is expected to be just a list of "key[=val]" 268 + items separated by commas, then this may be set to NULL. 269 + 270 + The return value is as for ->parse_param(). 271 + 272 + If the filesystem (e.g. NFS) needs to examine the data first and then 273 + finds it's the standard key-val list then it may pass it off to 274 + generic_parse_monolithic(). 275 + 276 + * :: 277 + 278 + int (*get_tree)(struct fs_context *fc); 279 + 280 + Called to get or create the mountable root and superblock, using the 281 + information stored in the filesystem context (reconfiguration goes via a 282 + different vector). It may detach any resources it desires from the 283 + filesystem context and transfer them to the superblock it creates. 284 + 285 + On success it should set fc->root to the mountable root and return 0. In 286 + the case of an error, it should return a negative error code. 287 + 288 + The phase on a userspace-driven context will be set to only allow this to 289 + be called once on any particular context.
290 + 291 + * :: 292 + 293 + int (*reconfigure)(struct fs_context *fc); 294 + 295 + Called to effect reconfiguration of a superblock using information stored 296 + in the filesystem context. It may detach any resources it desires from 297 + the filesystem context and transfer them to the superblock. The 298 + superblock can be found from fc->root->d_sb. 299 + 300 + On success it should return 0. In the case of an error, it should return 301 + a negative error code. 302 + 303 + .. Note:: reconfigure is intended as a replacement for remount_fs. 304 + 305 + 306 + Filesystem context Security 307 + =========================== 308 + 309 + The filesystem context contains a security pointer that the LSMs can use for 310 + building up a security context for the superblock to be mounted. There are a 311 + number of operations used by the new mount code for this purpose: 312 + 313 + * :: 314 + 315 + int security_fs_context_alloc(struct fs_context *fc, 316 + struct dentry *reference); 317 + 318 + Called to initialise fc->security (which is preset to NULL) and allocate 319 + any resources needed. It should return 0 on success or a negative error 320 + code on failure. 321 + 322 + reference will be non-NULL if the context is being created for superblock 323 + reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates 324 + the root dentry of the superblock to be reconfigured. It will also be 325 + non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case 326 + it indicates the automount point. 327 + 328 + * :: 329 + 330 + int security_fs_context_dup(struct fs_context *fc, 331 + struct fs_context *src_fc); 332 + 333 + Called to initialise fc->security (which is preset to NULL) and allocate 334 + any resources needed. The original filesystem context is pointed to by 335 + src_fc and may be used for reference. It should return 0 on success or a 336 + negative error code on failure. 
337 + 338 + * :: 339 + 340 + void security_fs_context_free(struct fs_context *fc); 341 + 342 + Called to clean up anything attached to fc->security. Note that the 343 + contents may have been transferred to a superblock and the pointer cleared 344 + during get_tree. 345 + 346 + * :: 347 + 348 + int security_fs_context_parse_param(struct fs_context *fc, 349 + struct fs_parameter *param); 350 + 351 + Called for each mount parameter, including the source. The arguments are 352 + as for the ->parse_param() method. It should return 0 to indicate that 353 + the parameter should be passed on to the filesystem, 1 to indicate that 354 + the parameter should be discarded or an error to indicate that the 355 + parameter should be rejected. 356 + 357 + The value pointed to by param may be modified (if a string) or stolen 358 + (provided the value pointer is NULL'd out). If it is stolen, 1 must be 359 + returned to prevent it being passed to the filesystem. 360 + 361 + * :: 362 + 363 + int security_fs_context_validate(struct fs_context *fc); 364 + 365 + Called after all the options have been parsed to validate the collection 366 + as a whole and to do any necessary allocation so that 367 + security_sb_get_tree() and security_sb_reconfigure() are less likely to 368 + fail. It should return 0 or a negative error code. 369 + 370 + In the case of reconfiguration, the target superblock will be accessible 371 + via fc->root. 372 + 373 + * :: 374 + 375 + int security_sb_get_tree(struct fs_context *fc); 376 + 377 + Called during the mount procedure to verify that the specified superblock 378 + is allowed to be mounted and to transfer the security data there. It 379 + should return 0 or a negative error code. 380 + 381 + * :: 382 + 383 + void security_sb_reconfigure(struct fs_context *fc); 384 + 385 + Called to apply any reconfiguration to an LSM's context. It must not 386 + fail. 
Error checking and resource allocation must be done in advance by 387 + the parameter parsing and validation hooks. 388 + 389 + * :: 390 + 391 + int security_sb_mountpoint(struct fs_context *fc, 392 + struct path *mountpoint, 393 + unsigned int mnt_flags); 394 + 395 + Called during the mount procedure to verify that the root dentry attached 396 + to the context is permitted to be attached to the specified mountpoint. 397 + It should return 0 on success or a negative error code on failure. 398 + 399 + 400 + VFS Filesystem context API 401 + ========================== 402 + 403 + There are four operations for creating a filesystem context and one for 404 + destroying a context: 405 + 406 + * :: 407 + 408 + struct fs_context *fs_context_for_mount(struct file_system_type *fs_type, 409 + unsigned int sb_flags); 410 + 411 + Allocate a filesystem context for the purpose of setting up a new mount, 412 + whether that be with a new superblock or sharing an existing one. This 413 + sets the superblock flags, initialises the security and calls 414 + fs_type->init_fs_context() to initialise the filesystem private data. 415 + 416 + fs_type specifies the filesystem type that will manage the context and 417 + sb_flags presets the superblock flags stored therein. 418 + 419 + * :: 420 + 421 + struct fs_context *fs_context_for_reconfigure( 422 + struct dentry *dentry, 423 + unsigned int sb_flags, 424 + unsigned int sb_flags_mask); 425 + 426 + Allocate a filesystem context for the purpose of reconfiguring an 427 + existing superblock. dentry provides a reference to the superblock to be 428 + configured. sb_flags and sb_flags_mask indicate which superblock flags 429 + need changing and to what. 430 + 431 + * :: 432 + 433 + struct fs_context *fs_context_for_submount( 434 + struct file_system_type *fs_type, 435 + struct dentry *reference); 436 + 437 + Allocate a filesystem context for the purpose of creating a new mount for 438 + an automount point or other derived superblock. 
fs_type specifies the 439 + filesystem type that will manage the context and the reference dentry 440 + supplies the parameters. Namespaces are propagated from the reference 441 + dentry's superblock also. 442 + 443 + Note that it's not a requirement that the reference dentry be of the same 444 + filesystem type as fs_type. 445 + 446 + * :: 447 + 448 + struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc); 449 + 450 + Duplicate a filesystem context, copying any options noted and duplicating 451 + or additionally referencing any resources held therein. This is available 452 + for use where a filesystem has to get a mount within a mount, such as NFS4 453 + does by internally mounting the root of the target server and then doing a 454 + private pathwalk to the target directory. 455 + 456 + The purpose in the new context is inherited from the old one. 457 + 458 + * :: 459 + 460 + void put_fs_context(struct fs_context *fc); 461 + 462 + Destroy a filesystem context, releasing any resources it holds. This 463 + calls the ->free() operation. This is intended to be called by anyone who 464 + created a filesystem context. 465 + 466 + .. Warning:: 467 + 468 + filesystem contexts are not refcounted, so this causes unconditional 469 + destruction. 470 + 471 + In all the above operations, apart from the put op, the return is a mount 472 + context pointer or a negative error code. 473 + 474 + For the remaining operations, if an error occurs, a negative error code will be 475 + returned. 476 + 477 + * :: 478 + 479 + int vfs_parse_fs_param(struct fs_context *fc, 480 + struct fs_parameter *param); 481 + 482 + Supply a single mount parameter to the filesystem context. This includes 483 + the specification of the source/device which is specified as the "source" 484 + parameter (which may be specified multiple times if the filesystem 485 + supports that). 486 + 487 + param specifies the parameter key name and the value. 
The parameter is 488 + first checked to see if it corresponds to a standard mount flag (in which 489 + case it is used to set an SB_xxx flag and consumed) or a security option 490 + (in which case the LSM consumes it) before it is passed on to the 491 + filesystem. 492 + 493 + The parameter value is typed and can be one of: 494 + 495 + ==================== ============================= 496 + fs_value_is_flag Parameter not given a value 497 + fs_value_is_string Value is a string 498 + fs_value_is_blob Value is a binary blob 499 + fs_value_is_filename Value is a filename* + dirfd 500 + fs_value_is_file Value is an open file (file*) 501 + ==================== ============================= 502 + 503 + If there is a value, that value is stored in a union in the struct in one 504 + of param->{string,blob,name,file}. Note that the function may steal and 505 + clear the pointer, but then becomes responsible for disposing of the 506 + object. 507 + 508 + * :: 509 + 510 + int vfs_parse_fs_string(struct fs_context *fc, const char *key, 511 + const char *value, size_t v_size); 512 + 513 + A wrapper around vfs_parse_fs_param() that copies the value string it is 514 + passed. 515 + 516 + * :: 517 + 518 + int generic_parse_monolithic(struct fs_context *fc, void *data); 519 + 520 + Parse a sys_mount() data page, assuming the form to be a text list 521 + consisting of key[=val] options separated by commas. Each item in the 522 + list is passed to vfs_mount_option(). This is the default when the 523 + ->parse_monolithic() method is NULL. 524 + 525 + * :: 526 + 527 + int vfs_get_tree(struct fs_context *fc); 528 + 529 + Get or create the mountable root and superblock, using the parameters in 530 + the filesystem context to select/configure the superblock. This invokes 531 + the ->get_tree() method. 532 + 533 + * :: 534 + 535 + struct vfsmount *vfs_create_mount(struct fs_context *fc); 536 + 537 + Create a mount given the parameters in the specified filesystem context. 
538 + Note that this does not attach the mount to anything. 539 + 540 + 541 + Superblock Creation Helpers 542 + =========================== 543 + 544 + A number of VFS helpers are available for use by filesystems for the creation 545 + or looking up of superblocks. 546 + 547 + * :: 548 + 549 + struct super_block * 550 + sget_fc(struct fs_context *fc, 551 + int (*test)(struct super_block *sb, struct fs_context *fc), 552 + int (*set)(struct super_block *sb, struct fs_context *fc)); 553 + 554 + This is the core routine. If test is non-NULL, it searches for an 555 + existing superblock matching the criteria held in the fs_context, using 556 + the test function to match them. If no match is found, a new superblock 557 + is created and the set function is called to set it up. 558 + 559 + Prior to the set function being called, fc->s_fs_info will be transferred 560 + to sb->s_fs_info - and fc->s_fs_info will be cleared if set returns 561 + success (ie. 0). 562 + 563 + The following helpers all wrap sget_fc(): 564 + 565 + * :: 566 + 567 + int vfs_get_super(struct fs_context *fc, 568 + enum vfs_get_super_keying keying, 569 + int (*fill_super)(struct super_block *sb, 570 + struct fs_context *fc)) 571 + 572 + This creates/looks up a deviceless superblock. The keying indicates how 573 + many superblocks of this type may exist and in what manner they may be 574 + shared: 575 + 576 + (1) vfs_get_single_super 577 + 578 + Only one such superblock may exist in the system. Any further 579 + attempt to get a new superblock gets this one (and any parameter 580 + differences are ignored). 581 + 582 + (2) vfs_get_keyed_super 583 + 584 + Multiple superblocks of this type may exist and they're keyed on 585 + their s_fs_info pointer (for example this may refer to a 586 + namespace). 587 + 588 + (3) vfs_get_independent_super 589 + 590 + Multiple independent superblocks of this type may exist. This 591 + function never matches an existing one and always creates a new 592 + one. 
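The lookup-or-create behaviour described for sget_fc() can be modelled in plain userspace C. This is only a toy sketch, not kernel code: every toy_* name is hypothetical, the superblock list is passed in explicitly, and failures return NULL, which is an assumption of this sketch and not something the text above specifies. It shows a keyed lookup on s_fs_info (as in vfs_get_keyed_super) and the always-create case where test is NULL (as in vfs_get_independent_super).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct toy_fs_context {
	void *s_fs_info;		/* proposed key / private data */
};

struct toy_super_block {
	void *s_fs_info;
	struct toy_super_block *next;
};

typedef int (*toy_sb_fn)(struct toy_super_block *sb, struct toy_fs_context *fc);

/* Search with test() if it is non-NULL; otherwise, or on no match, create. */
struct toy_super_block *toy_sget_fc(struct toy_super_block **list,
				    struct toy_fs_context *fc,
				    toy_sb_fn test, toy_sb_fn set)
{
	struct toy_super_block *sb;

	if (test)
		for (sb = *list; sb; sb = sb->next)
			if (test(sb, fc))
				return sb;

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;

	sb->s_fs_info = fc->s_fs_info;	/* transferred before set() runs */
	if (set(sb, fc) != 0) {
		free(sb);
		return NULL;
	}
	fc->s_fs_info = NULL;		/* cleared only if set() succeeded */

	sb->next = *list;
	*list = sb;
	return sb;
}

/* Keyed matching: superblocks are shared when s_fs_info is identical. */
int toy_test_keyed(struct toy_super_block *sb, struct toy_fs_context *fc)
{
	return sb->s_fs_info == fc->s_fs_info;
}

int toy_set_ok(struct toy_super_block *sb, struct toy_fs_context *fc)
{
	(void)sb;
	(void)fc;
	return 0;
}

/* Runs four lookups and returns how many superblocks were created. */
int toy_keyed_demo(void)
{
	int ns_a, ns_b;			/* stand-ins for two namespaces */
	struct toy_super_block *list = NULL, *sb1, *sb2, *sb3, *sb4;
	struct toy_fs_context fc1 = { &ns_a }, fc2 = { &ns_a };
	struct toy_fs_context fc3 = { &ns_b }, fc4 = { &ns_a };
	int count = 0;

	sb1 = toy_sget_fc(&list, &fc1, toy_test_keyed, toy_set_ok);
	sb2 = toy_sget_fc(&list, &fc2, toy_test_keyed, toy_set_ok);
	sb3 = toy_sget_fc(&list, &fc3, toy_test_keyed, toy_set_ok);
	sb4 = toy_sget_fc(&list, &fc4, NULL, toy_set_ok);

	assert(sb1 && sb2 == sb1);	/* same key: existing one is reused */
	assert(sb3 && sb3 != sb1);	/* different key: new superblock */
	assert(sb4 && sb4 != sb1);	/* NULL test: never matches, always new */
	assert(fc1.s_fs_info == NULL);	/* consumed by the new superblock */
	assert(fc2.s_fs_info == &ns_a);	/* match found: context keeps its data */

	while (list) {
		struct toy_super_block *next = list->next;

		free(list);
		list = next;
		count++;
	}
	return count;
}
```

Passing the list explicitly keeps the demo free of global state, so it can be called repeatedly without earlier superblocks influencing later lookups.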
593 + 594 + 595 + ===================== 596 + PARAMETER DESCRIPTION 597 + ===================== 598 + 599 + Parameters are described using structures defined in linux/fs_parser.h. 600 + There's a core description struct that links everything together:: 601 + 602 + struct fs_parameter_description { 603 + const struct fs_parameter_spec *specs; 604 + const struct fs_parameter_enum *enums; 605 + }; 606 + 607 + For example:: 608 + 609 + enum { 610 + Opt_autocell, 611 + Opt_bar, 612 + Opt_dyn, 613 + Opt_foo, 614 + Opt_source, 615 + }; 616 + 617 + static const struct fs_parameter_description afs_fs_parameters = { 618 + .specs = afs_param_specs, 619 + .enums = afs_param_enums, 620 + }; 621 + 622 + The members are as follows: 623 + 624 + (1) :: 625 + 626 + const struct fs_parameter_spec *specs; 627 + 628 + Table of parameter specifications, terminated with a null entry, where the 629 + entries are of type:: 630 + 631 + struct fs_parameter_spec { 632 + const char *name; 633 + u8 opt; 634 + enum fs_parameter_type type:8; 635 + unsigned short flags; 636 + }; 637 + 638 + The 'name' field is a string to match exactly to the parameter key (no 639 + wildcards, no patterns and no case-independence) and 'opt' is the value that 640 + will be returned by the fs_parse() function in the case of a successful 641 + match. 
642 + 643 + The 'type' field indicates the desired value type and must be one of: 644 + 645 + ======================= ======================= ===================== 646 + TYPE NAME EXPECTED VALUE RESULT IN 647 + ======================= ======================= ===================== 648 + fs_param_is_flag No value n/a 649 + fs_param_is_bool Boolean value result->boolean 650 + fs_param_is_u32 32-bit unsigned int result->uint_32 651 + fs_param_is_u32_octal 32-bit octal int result->uint_32 652 + fs_param_is_u32_hex 32-bit hex int result->uint_32 653 + fs_param_is_s32 32-bit signed int result->int_32 654 + fs_param_is_u64 64-bit unsigned int result->uint_64 655 + fs_param_is_enum Enum value name result->uint_32 656 + fs_param_is_string Arbitrary string param->string 657 + fs_param_is_blob Binary blob param->blob 658 + fs_param_is_blockdev Blockdev path * Needs lookup 659 + fs_param_is_path Path * Needs lookup 660 + fs_param_is_fd File descriptor result->int_32 661 + ======================= ======================= ===================== 662 + 663 + Note that if the value is of fs_param_is_bool type, fs_parse() will try 664 + to match any string value against "0", "1", "no", "yes", "false", "true". 665 + 666 + Each parameter can also be qualified with 'flags': 667 + 668 + ======================= ================================================ 669 + fs_param_v_optional The value is optional 670 + fs_param_neg_with_no result->negated set if key is prefixed with "no" 671 + fs_param_neg_with_empty result->negated set if value is "" 672 + fs_param_deprecated The parameter is deprecated. 
673 + ======================= ================================================ 674 + 675 + These are wrapped with a number of convenience wrappers: 676 + 677 + ======================= =============================================== 678 + MACRO SPECIFIES 679 + ======================= =============================================== 680 + fsparam_flag() fs_param_is_flag 681 + fsparam_flag_no() fs_param_is_flag, fs_param_neg_with_no 682 + fsparam_bool() fs_param_is_bool 683 + fsparam_u32() fs_param_is_u32 684 + fsparam_u32oct() fs_param_is_u32_octal 685 + fsparam_u32hex() fs_param_is_u32_hex 686 + fsparam_s32() fs_param_is_s32 687 + fsparam_u64() fs_param_is_u64 688 + fsparam_enum() fs_param_is_enum 689 + fsparam_string() fs_param_is_string 690 + fsparam_blob() fs_param_is_blob 691 + fsparam_bdev() fs_param_is_blockdev 692 + fsparam_path() fs_param_is_path 693 + fsparam_fd() fs_param_is_fd 694 + ======================= =============================================== 695 + 696 + all of which take two arguments, name string and option number - for 697 + example:: 698 + 699 + static const struct fs_parameter_spec afs_param_specs[] = { 700 + fsparam_flag ("autocell", Opt_autocell), 701 + fsparam_flag ("dyn", Opt_dyn), 702 + fsparam_string ("source", Opt_source), 703 + fsparam_flag_no ("foo", Opt_foo), 704 + {} 705 + }; 706 + 707 + An additional macro, __fsparam(), is provided that takes an additional pair 708 + of arguments to specify the type and the flags for anything that doesn't 709 + match one of the above macros. 710 + 711 + (2) :: 712 + 713 + const struct fs_parameter_enum *enums; 714 + 715 + Table of enum value names to integer mappings, terminated with a null 716 + entry. 
This is of type:: 717 + 718 + struct fs_parameter_enum { 719 + u8 opt; 720 + char name[14]; 721 + u8 value; 722 + }; 723 + 724 + Where the array is an unsorted list of { parameter ID, name }-keyed 725 + elements that indicate the value to map to, e.g.:: 726 + 727 + static const struct fs_parameter_enum afs_param_enums[] = { 728 + { Opt_bar, "x", 1}, 729 + { Opt_bar, "y", 23}, 730 + { Opt_bar, "z", 42}, 731 + }; 732 + 733 + If a parameter of type fs_param_is_enum is encountered, fs_parse() will 734 + try to look the value up in the enum table and the result will be stored 735 + in the parse result. 736 + 737 + The parser should be pointed to by the parser pointer in the file_system_type 738 + struct as this will provide validation on registration (if 739 + CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from 740 + userspace using the fsinfo() syscall. 741 + 742 + 743 + Parameter Helper Functions 744 + ========================== 745 + 746 + A number of helper functions are provided to help a filesystem or an LSM 747 + process the parameters it is given. 748 + 749 + * :: 750 + 751 + int lookup_constant(const struct constant_table tbl[], 752 + const char *name, int not_found); 753 + 754 + Look up a constant by name in a table of name -> integer mappings. The 755 + table is an array of elements of the following type:: 756 + 757 + struct constant_table { 758 + const char *name; 759 + int value; 760 + }; 761 + 762 + If a match is found, the corresponding value is returned. If a match 763 + isn't found, the not_found value is returned instead. 764 + 765 + * :: 766 + 767 + bool validate_constant_table(const struct constant_table *tbl, 768 + size_t tbl_size, 769 + int low, int high, int special); 770 + 771 + Validate a constant table. 
Checks that all the elements are appropriately 772 + ordered, that there are no duplicates and that the values are between low 773 + and high inclusive, though provision is made for one allowable special 774 + value outside of that range. If no special value is required, special 775 + should just be set to lie inside the low-to-high range. 776 + 777 + If all is good, true is returned. If the table is invalid, errors are 778 + logged to dmesg and false is returned. 779 + 780 + * :: 781 + 782 + bool fs_validate_description(const struct fs_parameter_description *desc); 783 + 784 + This performs some validation checks on a parameter description. It 785 + returns true if the description is good and false if it is not. It will 786 + log errors to dmesg if validation fails. 787 + 788 + * :: 789 + 790 + int fs_parse(struct fs_context *fc, 791 + const struct fs_parameter_description *desc, 792 + struct fs_parameter *param, 793 + struct fs_parse_result *result); 794 + 795 + This is the main interpreter of parameters. It uses the parameter 796 + description to look up a parameter by key name and to convert that to an 797 + option number (which it returns). 798 + 799 + If successful, and if the parameter type indicates the result is a 800 + boolean, integer or enum type, the value is converted by this function and 801 + the result stored in result->{boolean,int_32,uint_32,uint_64}. 802 + 803 + If a match isn't initially made, the key is prefixed with "no" and no 804 + value is present then an attempt will be made to look up the key with the 805 + prefix removed. If this matches a parameter for which the type has flag 806 + fs_param_neg_with_no set, then a match will be made and result->negated 807 + will be set to true. 808 + 809 + If the parameter isn't matched, -ENOPARAM will be returned; if the 810 + parameter is matched, but the value is erroneous, -EINVAL will be 811 + returned; otherwise the parameter's option number will be returned. 
812 + 813 + * :: 814 + 815 + int fs_lookup_param(struct fs_context *fc, 816 + struct fs_parameter *value, 817 + bool want_bdev, 818 + struct path *_path); 819 + 820 + This takes a parameter that carries a string or filename type and attempts 821 + to do a path lookup on it. If the parameter expects a blockdev, a check 822 + is made that the inode actually represents one. 823 + 824 + Returns 0 if successful and ``*_path`` will be set; returns a negative 825 + error code if not.
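The "no"-prefix fallback described for fs_parse() above can be illustrated with a small userspace model. This is a sketch under stated assumptions and not the kernel implementation: the toy_* names, the TOY_NEG_WITH_NO flag and the TOY_ENOPARAM value are all hypothetical stand-ins, and value conversion is left out entirely. The table mirrors the afs_param_specs example, where only "foo" may be negated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_ENOPARAM	519	/* placeholder error number for this sketch */
#define TOY_NEG_WITH_NO	0x1	/* models the fs_param_neg_with_no flag */

enum { Opt_autocell, Opt_dyn, Opt_foo, Opt_source };

struct toy_param_spec {
	const char *name;
	int opt;
	unsigned int flags;
};

/* Terminated with a null entry, like the spec tables described above. */
const struct toy_param_spec toy_specs[] = {
	{ "autocell",	Opt_autocell,	0 },
	{ "dyn",	Opt_dyn,	0 },
	{ "source",	Opt_source,	0 },
	{ "foo",	Opt_foo,	TOY_NEG_WITH_NO },
	{ NULL, 0, 0 }
};

/* Exact-match lookup of a key in the spec table. */
const struct toy_param_spec *toy_find(const struct toy_param_spec *specs,
				      const char *key)
{
	for (; specs->name; specs++)
		if (strcmp(specs->name, key) == 0)
			return specs;
	return NULL;
}

/*
 * Return the option number for key, or -TOY_ENOPARAM if nothing matches.
 * *negated is set to true only when the match came via the "no" prefix.
 */
int toy_parse_key(const struct toy_param_spec *specs, const char *key,
		  bool has_value, bool *negated)
{
	const struct toy_param_spec *p;

	*negated = false;
	p = toy_find(specs, key);
	if (p)
		return p->opt;

	/* Retry without "no" only for a valueless key with a non-empty rest. */
	if (has_value || strncmp(key, "no", 2) != 0 || key[2] == '\0')
		return -TOY_ENOPARAM;

	p = toy_find(specs, key + 2);
	if (!p || !(p->flags & TOY_NEG_WITH_NO))
		return -TOY_ENOPARAM;

	*negated = true;
	return p->opt;
}

/* Small usage check; aborts via assert() if the model misbehaves. */
void toy_parse_demo(void)
{
	bool neg;

	assert(toy_parse_key(toy_specs, "nofoo", false, &neg) == Opt_foo && neg);
	assert(toy_parse_key(toy_specs, "nodyn", false, &neg) == -TOY_ENOPARAM);
}
```

In this model "nodyn" is rejected because "dyn" does not carry the negation flag, and "nofoo=1" would be rejected because the fallback is only tried when no value is present.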

-724

Documentation/filesystems/mount_api.txt

··· 1 - ==================== 2 - FILESYSTEM MOUNT API 3 - ==================== 4 - 5 - CONTENTS 6 - 7 - (1) Overview. 8 - 9 - (2) The filesystem context. 10 - 11 - (3) The filesystem context operations. 12 - 13 - (4) Filesystem context security. 14 - 15 - (5) VFS filesystem context API. 16 - 17 - (6) Superblock creation helpers. 18 - 19 - (7) Parameter description. 20 - 21 - (8) Parameter helper functions. 22 - 23 - 24 - ======== 25 - OVERVIEW 26 - ======== 27 - 28 - The creation of new mounts is now to be done in a multistep process: 29 - 30 - (1) Create a filesystem context. 31 - 32 - (2) Parse the parameters and attach them to the context. Parameters are 33 - expected to be passed individually from userspace, though legacy binary 34 - parameters can also be handled. 35 - 36 - (3) Validate and pre-process the context. 37 - 38 - (4) Get or create a superblock and mountable root. 39 - 40 - (5) Perform the mount. 41 - 42 - (6) Return an error message attached to the context. 43 - 44 - (7) Destroy the context. 45 - 46 - To support this, the file_system_type struct gains two new fields: 47 - 48 - int (*init_fs_context)(struct fs_context *fc); 49 - const struct fs_parameter_description *parameters; 50 - 51 - The first is invoked to set up the filesystem-specific parts of a filesystem 52 - context, including the additional space, and the second points to the 53 - parameter description for validation at registration time and querying by a 54 - future system call. 55 - 56 - Note that security initialisation is done *after* the filesystem is called so 57 - that the namespaces may be adjusted first. 58 - 59 - 60 - ====================== 61 - THE FILESYSTEM CONTEXT 62 - ====================== 63 - 64 - The creation and reconfiguration of a superblock is governed by a filesystem 65 - context. 
This is represented by the fs_context structure: 66 - 67 - struct fs_context { 68 - const struct fs_context_operations *ops; 69 - struct file_system_type *fs_type; 70 - void *fs_private; 71 - struct dentry *root; 72 - struct user_namespace *user_ns; 73 - struct net *net_ns; 74 - const struct cred *cred; 75 - char *source; 76 - char *subtype; 77 - void *security; 78 - void *s_fs_info; 79 - unsigned int sb_flags; 80 - unsigned int sb_flags_mask; 81 - unsigned int s_iflags; 82 - unsigned int lsm_flags; 83 - enum fs_context_purpose purpose:8; 84 - ... 85 - }; 86 - 87 - The fs_context fields are as follows: 88 - 89 - (*) const struct fs_context_operations *ops 90 - 91 - These are operations that can be done on a filesystem context (see 92 - below). This must be set by the ->init_fs_context() file_system_type 93 - operation. 94 - 95 - (*) struct file_system_type *fs_type 96 - 97 - A pointer to the file_system_type of the filesystem that is being 98 - constructed or reconfigured. This retains a reference on the type owner. 99 - 100 - (*) void *fs_private 101 - 102 - A pointer to the file system's private data. This is where the filesystem 103 - will need to store any options it parses. 104 - 105 - (*) struct dentry *root 106 - 107 - A pointer to the root of the mountable tree (and indirectly, the 108 - superblock thereof). This is filled in by the ->get_tree() op. If this 109 - is set, an active reference on root->d_sb must also be held. 110 - 111 - (*) struct user_namespace *user_ns 112 - (*) struct net *net_ns 113 - 114 - There are a subset of the namespaces in use by the invoking process. They 115 - retain references on each namespace. The subscribed namespaces may be 116 - replaced by the filesystem to reflect other sources, such as the parent 117 - mount superblock on an automount. 118 - 119 - (*) const struct cred *cred 120 - 121 - The mounter's credentials. This retains a reference on the credentials. 
122 - 123 - (*) char *source 124 - 125 - This specifies the source. It may be a block device (e.g. /dev/sda1) or 126 - something more exotic, such as the "host:/path" that NFS desires. 127 - 128 - (*) char *subtype 129 - 130 - This is a string to be added to the type displayed in /proc/mounts to 131 - qualify it (used by FUSE). This is available for the filesystem to set if 132 - desired. 133 - 134 - (*) void *security 135 - 136 - A place for the LSMs to hang their security data for the superblock. The 137 - relevant security operations are described below. 138 - 139 - (*) void *s_fs_info 140 - 141 - The proposed s_fs_info for a new superblock, set in the superblock by 142 - sget_fc(). This can be used to distinguish superblocks. 143 - 144 - (*) unsigned int sb_flags 145 - (*) unsigned int sb_flags_mask 146 - 147 - Which bits SB_* flags are to be set/cleared in super_block::s_flags. 148 - 149 - (*) unsigned int s_iflags 150 - 151 - These will be bitwise-OR'd with s->s_iflags when a superblock is created. 152 - 153 - (*) enum fs_context_purpose 154 - 155 - This indicates the purpose for which the context is intended. The 156 - available values are: 157 - 158 - FS_CONTEXT_FOR_MOUNT, -- New superblock for explicit mount 159 - FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount 160 - FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount 161 - 162 - The mount context is created by calling vfs_new_fs_context() or 163 - vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the 164 - structure is not refcounted. 165 - 166 - VFS, security and filesystem mount options are set individually with 167 - vfs_parse_mount_option(). Options provided by the old mount(2) system call as 168 - a page of data can be parsed with generic_parse_monolithic(). 169 - 170 - When mounting, the filesystem is allowed to take data from any of the pointers 171 - and attach it to the superblock (or whatever), provided it clears the pointer 172 - in the mount context. 
173 - 174 - The filesystem is also allowed to allocate resources and pin them with the 175 - mount context. For instance, NFS might pin the appropriate protocol version 176 - module. 177 - 178 - 179 - ================================= 180 - THE FILESYSTEM CONTEXT OPERATIONS 181 - ================================= 182 - 183 - The filesystem context points to a table of operations: 184 - 185 - struct fs_context_operations { 186 - void (*free)(struct fs_context *fc); 187 - int (*dup)(struct fs_context *fc, struct fs_context *src_fc); 188 - int (*parse_param)(struct fs_context *fc, 189 - struct struct fs_parameter *param); 190 - int (*parse_monolithic)(struct fs_context *fc, void *data); 191 - int (*get_tree)(struct fs_context *fc); 192 - int (*reconfigure)(struct fs_context *fc); 193 - }; 194 - 195 - These operations are invoked by the various stages of the mount procedure to 196 - manage the filesystem context. They are as follows: 197 - 198 - (*) void (*free)(struct fs_context *fc); 199 - 200 - Called to clean up the filesystem-specific part of the filesystem context 201 - when the context is destroyed. It should be aware that parts of the 202 - context may have been removed and NULL'd out by ->get_tree(). 203 - 204 - (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc); 205 - 206 - Called when a filesystem context has been duplicated to duplicate the 207 - filesystem-private data. An error may be returned to indicate failure to 208 - do this. 209 - 210 - [!] Note that even if this fails, put_fs_context() will be called 211 - immediately thereafter, so ->dup() *must* make the 212 - filesystem-private data safe for ->free(). 213 - 214 - (*) int (*parse_param)(struct fs_context *fc, 215 - struct struct fs_parameter *param); 216 - 217 - Called when a parameter is being added to the filesystem context. param 218 - points to the key name and maybe a value object. VFS-specific options 219 - will have been weeded out and fc->sb_flags updated in the context. 
220 - Security options will also have been weeded out and fc->security updated. 221 - 222 - The parameter can be parsed with fs_parse() and fs_lookup_param(). Note 223 - that the source(s) are presented as parameters named "source". 224 - 225 - If successful, 0 should be returned or a negative error code otherwise. 226 - 227 - (*) int (*parse_monolithic)(struct fs_context *fc, void *data); 228 - 229 - Called when the mount(2) system call is invoked to pass the entire data 230 - page in one go. If this is expected to be just a list of "key[=val]" 231 - items separated by commas, then this may be set to NULL. 232 - 233 - The return value is as for ->parse_param(). 234 - 235 - If the filesystem (e.g. NFS) needs to examine the data first and then 236 - finds it's the standard key-val list then it may pass it off to 237 - generic_parse_monolithic(). 238 - 239 - (*) int (*get_tree)(struct fs_context *fc); 240 - 241 - Called to get or create the mountable root and superblock, using the 242 - information stored in the filesystem context (reconfiguration goes via a 243 - different vector). It may detach any resources it desires from the 244 - filesystem context and transfer them to the superblock it creates. 245 - 246 - On success it should set fc->root to the mountable root and return 0. In 247 - the case of an error, it should return a negative error code. 248 - 249 - The phase on a userspace-driven context will be set to only allow this to 250 - be called once on any particular context. 251 - 252 - (*) int (*reconfigure)(struct fs_context *fc); 253 - 254 - Called to effect reconfiguration of a superblock using information stored 255 - in the filesystem context. It may detach any resources it desires from 256 - the filesystem context and transfer them to the superblock. The 257 - superblock can be found from fc->root->d_sb. 258 - 259 - On success it should return 0. In the case of an error, it should return 260 - a negative error code. 
261 - 262 - [NOTE] reconfigure is intended as a replacement for remount_fs. 263 - 264 - 265 - =========================== 266 - FILESYSTEM CONTEXT SECURITY 267 - =========================== 268 - 269 - The filesystem context contains a security pointer that the LSMs can use for 270 - building up a security context for the superblock to be mounted. There are a 271 - number of operations used by the new mount code for this purpose: 272 - 273 - (*) int security_fs_context_alloc(struct fs_context *fc, 274 - struct dentry *reference); 275 - 276 - Called to initialise fc->security (which is preset to NULL) and allocate 277 - any resources needed. It should return 0 on success or a negative error 278 - code on failure. 279 - 280 - reference will be non-NULL if the context is being created for superblock 281 - reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates 282 - the root dentry of the superblock to be reconfigured. It will also be 283 - non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case 284 - it indicates the automount point. 285 - 286 - (*) int security_fs_context_dup(struct fs_context *fc, 287 - struct fs_context *src_fc); 288 - 289 - Called to initialise fc->security (which is preset to NULL) and allocate 290 - any resources needed. The original filesystem context is pointed to by 291 - src_fc and may be used for reference. It should return 0 on success or a 292 - negative error code on failure. 293 - 294 - (*) void security_fs_context_free(struct fs_context *fc); 295 - 296 - Called to clean up anything attached to fc->security. Note that the 297 - contents may have been transferred to a superblock and the pointer cleared 298 - during get_tree. 299 - 300 - (*) int security_fs_context_parse_param(struct fs_context *fc, 301 - struct fs_parameter *param); 302 - 303 - Called for each mount parameter, including the source. The arguments are 304 - as for the ->parse_param() method. 
It should return 0 to indicate that 305 - the parameter should be passed on to the filesystem, 1 to indicate that 306 - the parameter should be discarded or an error to indicate that the 307 - parameter should be rejected. 308 - 309 - The value pointed to by param may be modified (if a string) or stolen 310 - (provided the value pointer is NULL'd out). If it is stolen, 1 must be 311 - returned to prevent it being passed to the filesystem. 312 - 313 - (*) int security_fs_context_validate(struct fs_context *fc); 314 - 315 - Called after all the options have been parsed to validate the collection 316 - as a whole and to do any necessary allocation so that 317 - security_sb_get_tree() and security_sb_reconfigure() are less likely to 318 - fail. It should return 0 or a negative error code. 319 - 320 - In the case of reconfiguration, the target superblock will be accessible 321 - via fc->root. 322 - 323 - (*) int security_sb_get_tree(struct fs_context *fc); 324 - 325 - Called during the mount procedure to verify that the specified superblock 326 - is allowed to be mounted and to transfer the security data there. It 327 - should return 0 or a negative error code. 328 - 329 - (*) void security_sb_reconfigure(struct fs_context *fc); 330 - 331 - Called to apply any reconfiguration to an LSM's context. It must not 332 - fail. Error checking and resource allocation must be done in advance by 333 - the parameter parsing and validation hooks. 334 - 335 - (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint, 336 - unsigned int mnt_flags); 337 - 338 - Called during the mount procedure to verify that the root dentry attached 339 - to the context is permitted to be attached to the specified mountpoint. 340 - It should return 0 on success or a negative error code on failure. 
341 - 342 - 343 - ========================== 344 - VFS FILESYSTEM CONTEXT API 345 - ========================== 346 - 347 - There are four operations for creating a filesystem context and one for 348 - destroying a context: 349 - 350 - (*) struct fs_context *fs_context_for_mount( 351 - struct file_system_type *fs_type, 352 - unsigned int sb_flags); 353 - 354 - Allocate a filesystem context for the purpose of setting up a new mount, 355 - whether that be with a new superblock or sharing an existing one. This 356 - sets the superblock flags, initialises the security and calls 357 - fs_type->init_fs_context() to initialise the filesystem private data. 358 - 359 - fs_type specifies the filesystem type that will manage the context and 360 - sb_flags presets the superblock flags stored therein. 361 - 362 - (*) struct fs_context *fs_context_for_reconfigure( 363 - struct dentry *dentry, 364 - unsigned int sb_flags, 365 - unsigned int sb_flags_mask); 366 - 367 - Allocate a filesystem context for the purpose of reconfiguring an 368 - existing superblock. dentry provides a reference to the superblock to be 369 - configured. sb_flags and sb_flags_mask indicate which superblock flags 370 - need changing and to what. 371 - 372 - (*) struct fs_context *fs_context_for_submount( 373 - struct file_system_type *fs_type, 374 - struct dentry *reference); 375 - 376 - Allocate a filesystem context for the purpose of creating a new mount for 377 - an automount point or other derived superblock. fs_type specifies the 378 - filesystem type that will manage the context and the reference dentry 379 - supplies the parameters. Namespaces are propagated from the reference 380 - dentry's superblock also. 381 - 382 - Note that it's not a requirement that the reference dentry be of the same 383 - filesystem type as fs_type. 
384 - 385 - (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc); 386 - 387 - Duplicate a filesystem context, copying any options noted and duplicating 388 - or additionally referencing any resources held therein. This is available 389 - for use where a filesystem has to get a mount within a mount, such as NFS4 390 - does by internally mounting the root of the target server and then doing a 391 - private pathwalk to the target directory. 392 - 393 - The purpose in the new context is inherited from the old one. 394 - 395 - (*) void put_fs_context(struct fs_context *fc); 396 - 397 - Destroy a filesystem context, releasing any resources it holds. This 398 - calls the ->free() operation. This is intended to be called by anyone who 399 - created a filesystem context. 400 - 401 - [!] filesystem contexts are not refcounted, so this causes unconditional 402 - destruction. 403 - 404 - In all the above operations, apart from the put op, the return is a mount 405 - context pointer or a negative error code. 406 - 407 - For the remaining operations, if an error occurs, a negative error code will be 408 - returned. 409 - 410 - (*) int vfs_parse_fs_param(struct fs_context *fc, 411 - struct fs_parameter *param); 412 - 413 - Supply a single mount parameter to the filesystem context. This includes 414 - the specification of the source/device which is specified as the "source" 415 - parameter (which may be specified multiple times if the filesystem 416 - supports that). 417 - 418 - param specifies the parameter key name and the value. The parameter is 419 - first checked to see if it corresponds to a standard mount flag (in which 420 - case it is used to set an SB_xxx flag and consumed) or a security option 421 - (in which case the LSM consumes it) before it is passed on to the 422 - filesystem. 423 - 424 - The parameter value is typed and can be one of: 425 - 426 - fs_value_is_flag, Parameter not given a value. 
427 - fs_value_is_string, Value is a string 428 - fs_value_is_blob, Value is a binary blob 429 - fs_value_is_filename, Value is a filename* + dirfd 430 - fs_value_is_file, Value is an open file (file*) 431 - 432 - If there is a value, that value is stored in a union in the struct in one 433 - of param->{string,blob,name,file}. Note that the function may steal and 434 - clear the pointer, but then becomes responsible for disposing of the 435 - object. 436 - 437 - (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key, 438 - const char *value, size_t v_size); 439 - 440 - A wrapper around vfs_parse_fs_param() that copies the value string it is 441 - passed. 442 - 443 - (*) int generic_parse_monolithic(struct fs_context *fc, void *data); 444 - 445 - Parse a sys_mount() data page, assuming the form to be a text list 446 - consisting of key[=val] options separated by commas. Each item in the 447 - list is passed to vfs_mount_option(). This is the default when the 448 - ->parse_monolithic() method is NULL. 449 - 450 - (*) int vfs_get_tree(struct fs_context *fc); 451 - 452 - Get or create the mountable root and superblock, using the parameters in 453 - the filesystem context to select/configure the superblock. This invokes 454 - the ->get_tree() method. 455 - 456 - (*) struct vfsmount *vfs_create_mount(struct fs_context *fc); 457 - 458 - Create a mount given the parameters in the specified filesystem context. 459 - Note that this does not attach the mount to anything. 460 - 461 - 462 - =========================== 463 - SUPERBLOCK CREATION HELPERS 464 - =========================== 465 - 466 - A number of VFS helpers are available for use by filesystems for the creation 467 - or looking up of superblocks. 468 - 469 - (*) struct super_block * 470 - sget_fc(struct fs_context *fc, 471 - int (*test)(struct super_block *sb, struct fs_context *fc), 472 - int (*set)(struct super_block *sb, struct fs_context *fc)); 473 - 474 - This is the core routine. 
If test is non-NULL, it searches for an 475 - existing superblock matching the criteria held in the fs_context, using 476 - the test function to match them. If no match is found, a new superblock 477 - is created and the set function is called to set it up. 478 - 479 - Prior to the set function being called, fc->s_fs_info will be transferred 480 - to sb->s_fs_info - and fc->s_fs_info will be cleared if set returns 481 - success (ie. 0). 482 - 483 - The following helpers all wrap sget_fc(): 484 - 485 - (*) int vfs_get_super(struct fs_context *fc, 486 - enum vfs_get_super_keying keying, 487 - int (*fill_super)(struct super_block *sb, 488 - struct fs_context *fc)) 489 - 490 - This creates/looks up a deviceless superblock. The keying indicates how 491 - many superblocks of this type may exist and in what manner they may be 492 - shared: 493 - 494 - (1) vfs_get_single_super 495 - 496 - Only one such superblock may exist in the system. Any further 497 - attempt to get a new superblock gets this one (and any parameter 498 - differences are ignored). 499 - 500 - (2) vfs_get_keyed_super 501 - 502 - Multiple superblocks of this type may exist and they're keyed on 503 - their s_fs_info pointer (for example this may refer to a 504 - namespace). 505 - 506 - (3) vfs_get_independent_super 507 - 508 - Multiple independent superblocks of this type may exist. This 509 - function never matches an existing one and always creates a new 510 - one. 511 - 512 - 513 - ===================== 514 - PARAMETER DESCRIPTION 515 - ===================== 516 - 517 - Parameters are described using structures defined in linux/fs_parser.h. 
518 - There's a core description struct that links everything together: 519 - 520 - struct fs_parameter_description { 521 - const struct fs_parameter_spec *specs; 522 - const struct fs_parameter_enum *enums; 523 - }; 524 - 525 - For example: 526 - 527 - enum { 528 - Opt_autocell, 529 - Opt_bar, 530 - Opt_dyn, 531 - Opt_foo, 532 - Opt_source, 533 - }; 534 - 535 - static const struct fs_parameter_description afs_fs_parameters = { 536 - .specs = afs_param_specs, 537 - .enums = afs_param_enums, 538 - }; 539 - 540 - The members are as follows: 541 - 542 - (1) const struct fs_parameter_spec *specs; 543 - 544 - Table of parameter specifications, terminated with a null entry, where the 545 - entries are of type: 546 - 547 - struct fs_parameter_spec { 548 - const char *name; 549 - u8 opt; 550 - enum fs_parameter_type type:8; 551 - unsigned short flags; 552 - }; 553 - 554 - The 'name' field is a string to match exactly to the parameter key (no 555 - wildcards, patterns and no case-independence) and 'opt' is the value that 556 - will be returned by the fs_parse() function in the case of a successful 557 - match. 
558 - 559 - The 'type' field indicates the desired value type and must be one of: 560 - 561 - TYPE NAME EXPECTED VALUE RESULT IN 562 - ======================= ======================= ===================== 563 - fs_param_is_flag No value n/a 564 - fs_param_is_bool Boolean value result->boolean 565 - fs_param_is_u32 32-bit unsigned int result->uint_32 566 - fs_param_is_u32_octal 32-bit octal int result->uint_32 567 - fs_param_is_u32_hex 32-bit hex int result->uint_32 568 - fs_param_is_s32 32-bit signed int result->int_32 569 - fs_param_is_u64 64-bit unsigned int result->uint_64 570 - fs_param_is_enum Enum value name result->uint_32 571 - fs_param_is_string Arbitrary string param->string 572 - fs_param_is_blob Binary blob param->blob 573 - fs_param_is_blockdev Blockdev path * Needs lookup 574 - fs_param_is_path Path * Needs lookup 575 - fs_param_is_fd File descriptor result->int_32 576 - 577 - Note that if the value is of fs_param_is_bool type, fs_parse() will try 578 - to match any string value against "0", "1", "no", "yes", "false", "true". 579 - 580 - Each parameter can also be qualified with 'flags': 581 - 582 - fs_param_v_optional The value is optional 583 - fs_param_neg_with_no result->negated set if key is prefixed with "no" 584 - fs_param_neg_with_empty result->negated set if value is "" 585 - fs_param_deprecated The parameter is deprecated. 
586 - 587 - These are wrapped with a number of convenience wrappers: 588 - 589 - MACRO SPECIFIES 590 - ======================= =============================================== 591 - fsparam_flag() fs_param_is_flag 592 - fsparam_flag_no() fs_param_is_flag, fs_param_neg_with_no 593 - fsparam_bool() fs_param_is_bool 594 - fsparam_u32() fs_param_is_u32 595 - fsparam_u32oct() fs_param_is_u32_octal 596 - fsparam_u32hex() fs_param_is_u32_hex 597 - fsparam_s32() fs_param_is_s32 598 - fsparam_u64() fs_param_is_u64 599 - fsparam_enum() fs_param_is_enum 600 - fsparam_string() fs_param_is_string 601 - fsparam_blob() fs_param_is_blob 602 - fsparam_bdev() fs_param_is_blockdev 603 - fsparam_path() fs_param_is_path 604 - fsparam_fd() fs_param_is_fd 605 - 606 - all of which take two arguments, name string and option number - for 607 - example: 608 - 609 - static const struct fs_parameter_spec afs_param_specs[] = { 610 - fsparam_flag ("autocell", Opt_autocell), 611 - fsparam_flag ("dyn", Opt_dyn), 612 - fsparam_string ("source", Opt_source), 613 - fsparam_flag_no ("foo", Opt_foo), 614 - {} 615 - }; 616 - 617 - An additional macro, __fsparam(), is provided that takes an additional pair 618 - of arguments to specify the type and the flags for anything that doesn't 619 - match one of the above macros. 620 - 621 - (2) const struct fs_parameter_enum *enums; 622 - 623 - Table of enum value names to integer mappings, terminated with a null 624 - entry. 
This is of type: 625 - 626 - struct fs_parameter_enum { 627 - u8 opt; 628 - char name[14]; 629 - u8 value; 630 - }; 631 - 632 - Where the array is an unsorted list of { parameter ID, name }-keyed 633 - elements that indicate the value to map to, e.g.: 634 - 635 - static const struct fs_parameter_enum afs_param_enums[] = { 636 - { Opt_bar, "x", 1}, 637 - { Opt_bar, "y", 23}, 638 - { Opt_bar, "z", 42}, 639 - }; 640 - 641 - If a parameter of type fs_param_is_enum is encountered, fs_parse() will 642 - try to look the value up in the enum table and the result will be stored 643 - in the parse result. 644 - 645 - The parser should be pointed to by the parser pointer in the file_system_type 646 - struct as this will provide validation on registration (if 647 - CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from 648 - userspace using the fsinfo() syscall. 649 - 650 - 651 - ========================== 652 - PARAMETER HELPER FUNCTIONS 653 - ========================== 654 - 655 - A number of helper functions are provided to help a filesystem or an LSM 656 - process the parameters it is given. 657 - 658 - (*) int lookup_constant(const struct constant_table tbl[], 659 - const char *name, int not_found); 660 - 661 - Look up a constant by name in a table of name -> integer mappings. The 662 - table is an array of elements of the following type: 663 - 664 - struct constant_table { 665 - const char *name; 666 - int value; 667 - }; 668 - 669 - If a match is found, the corresponding value is returned. If a match 670 - isn't found, the not_found value is returned instead. 671 - 672 - (*) bool validate_constant_table(const struct constant_table *tbl, 673 - size_t tbl_size, 674 - int low, int high, int special); 675 - 676 - Validate a constant table. 
Checks that all the elements are appropriately 677 - ordered, that there are no duplicates and that the values are between low 678 - and high inclusive, though provision is made for one allowable special 679 - value outside of that range. If no special value is required, special 680 - should just be set to lie inside the low-to-high range. 681 - 682 - If all is good, true is returned. If the table is invalid, errors are 683 - logged to dmesg and false is returned. 684 - 685 - (*) bool fs_validate_description(const struct fs_parameter_description *desc); 686 - 687 - This performs some validation checks on a parameter description. It 688 - returns true if the description is good and false if it is not. It will 689 - log errors to dmesg if validation fails. 690 - 691 - (*) int fs_parse(struct fs_context *fc, 692 - const struct fs_parameter_description *desc, 693 - struct fs_parameter *param, 694 - struct fs_parse_result *result); 695 - 696 - This is the main interpreter of parameters. It uses the parameter 697 - description to look up a parameter by key name and to convert that to an 698 - option number (which it returns). 699 - 700 - If successful, and if the parameter type indicates the result is a 701 - boolean, integer or enum type, the value is converted by this function and 702 - the result stored in result->{boolean,int_32,uint_32,uint_64}. 703 - 704 - If a match isn't initially made, the key is prefixed with "no" and no 705 - value is present then an attempt will be made to look up the key with the 706 - prefix removed. If this matches a parameter for which the type has flag 707 - fs_param_neg_with_no set, then a match will be made and result->negated 708 - will be set to true. 709 - 710 - If the parameter isn't matched, -ENOPARAM will be returned; if the 711 - parameter is matched, but the value is erroneous, -EINVAL will be 712 - returned; otherwise the parameter's option number will be returned. 
713 - 714 - (*) int fs_lookup_param(struct fs_context *fc, 715 - struct fs_parameter *value, 716 - bool want_bdev, 717 - struct path *_path); 718 - 719 - This takes a parameter that carries a string or filename type and attempts 720 - to do a path lookup on it. If the parameter expects a blockdev, a check 721 - is made that the inode actually represents one. 722 - 723 - Returns 0 if successful and *_path will be set; returns a negative error 724 - code if not.
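
As an illustration of the lookup_constant() contract described in the removed text above, here is a minimal userspace sketch, not the kernel implementation. The struct layout and the function signature are taken from that text. The NULL-name terminator, the linear scan and the sketch_bool_names table are assumptions made for this example only, since the text does not say how the end of the table is found.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the struct constant_table quoted in the removed text. */
struct constant_table {
	const char *name;
	int value;
};

/*
 * Userspace sketch of the documented contract: return the value paired
 * with name, or not_found when no entry matches.
 *
 * Assumption: the table ends with an entry whose name is NULL.
 */
static int lookup_constant(const struct constant_table tbl[],
			   const char *name, int not_found)
{
	const struct constant_table *p;

	for (p = tbl; p->name; p++)
		if (strcmp(p->name, name) == 0)
			return p->value;
	return not_found;
}

/* Hypothetical table, for illustration only. */
static const struct constant_table sketch_bool_names[] = {
	{ "false", 0 },
	{ "no",    0 },
	{ "true",  1 },
	{ "yes",   1 },
	{ NULL,    0 },
};
```

A caller would pass a sentinel such as -1 as not_found and treat that return as "no such name", which only works if -1 is not itself a valid value in the table.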

+1 -3

Documentation/filesystems/orangefs.rst

··· 119 119 120 120 /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf 121 121 122 - Create an /etc/pvfs2tab file:: 123 - 124 - Localhost is fine for your pvfs2tab file: 122 + Create an /etc/pvfs2tab file (localhost is fine):: 125 123 126 124 echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ 127 125 /etc/pvfs2tab

+1 -1

Documentation/filesystems/proc.rst

··· 1871 1871 1872 1872 For more information on mount propagation see: 1873 1873 1874 - Documentation/filesystems/sharedsubtree.txt 1874 + Documentation/filesystems/sharedsubtree.rst 1875 1875 1876 1876 1877 1877 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm

+85

Documentation/filesystems/quota.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =============== 4 + Quota subsystem 5 + =============== 6 + 7 + Quota subsystem allows system administrator to set limits on used space and 8 + number of used inodes (inode is a filesystem structure which is associated with 9 + each file or directory) for users and/or groups. For both used space and number 10 + of used inodes there are actually two limits. The first one is called softlimit 11 + and the second one hardlimit. A user can never exceed a hardlimit for any 12 + resource (unless he has CAP_SYS_RESOURCE capability). User is allowed to exceed 13 + softlimit but only for limited period of time. This period is called "grace 14 + period" or "grace time". When grace time is over, user is not able to allocate 15 + more space/inodes until he frees enough of them to get below softlimit. 16 + 17 + Quota limits (and amount of grace time) are set independently for each 18 + filesystem. 19 + 20 + For more details about quota design, see the documentation in quota-tools package 21 + (http://sourceforge.net/projects/linuxquota). 22 + 23 + Quota netlink interface 24 + ======================= 25 + When user exceeds a softlimit, runs out of grace time or reaches hardlimit, 26 + quota subsystem traditionally printed a message to the controlling terminal of 27 + the process which caused the excess. This method has the disadvantage that 28 + when user is using a graphical desktop he usually cannot see the message. 29 + Thus quota netlink interface has been designed to pass information about 30 + the above events to userspace. There they can be captured by an application 31 + and processed accordingly. 32 + 33 + The interface uses generic netlink framework (see 34 + http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more 35 + details about this layer). The name of the quota generic netlink interface 36 + is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. 
37 + Since the quota netlink protocol is not namespace aware, quota netlink messages 38 + are sent only in initial network namespace. 39 + 40 + Currently, the interface supports only one message type QUOTA_NL_C_WARNING. 41 + This command is used to send a notification about any of the above mentioned 42 + events. Each message has six attributes. These are (type of the argument is 43 + in parentheses): 44 + 45 + QUOTA_NL_A_QTYPE (u32) 46 + - type of quota being exceeded (one of USRQUOTA, GRPQUOTA) 47 + QUOTA_NL_A_EXCESS_ID (u64) 48 + - UID/GID (depends on quota type) of user / group whose limit 49 + is being exceeded. 50 + QUOTA_NL_A_CAUSED_ID (u64) 51 + - UID of a user who caused the event 52 + QUOTA_NL_A_WARNING (u32) 53 + - what kind of limit is exceeded: 54 + 55 + QUOTA_NL_IHARDWARN 56 + inode hardlimit 57 + QUOTA_NL_ISOFTLONGWARN 58 + inode softlimit is exceeded longer 59 + than given grace period 60 + QUOTA_NL_ISOFTWARN 61 + inode softlimit 62 + QUOTA_NL_BHARDWARN 63 + space (block) hardlimit 64 + QUOTA_NL_BSOFTLONGWARN 65 + space (block) softlimit is exceeded 66 + longer than given grace period. 67 + QUOTA_NL_BSOFTWARN 68 + space (block) softlimit 69 + 70 + - four warnings are also defined for the event when user stops 71 + exceeding some limit: 72 + 73 + QUOTA_NL_IHARDBELOW 74 + inode hardlimit 75 + QUOTA_NL_ISOFTBELOW 76 + inode softlimit 77 + QUOTA_NL_BHARDBELOW 78 + space (block) hardlimit 79 + QUOTA_NL_BSOFTBELOW 80 + space (block) softlimit 81 + 82 + QUOTA_NL_A_DEV_MAJOR (u32) 83 + - major number of a device with the affected filesystem 84 + QUOTA_NL_A_DEV_MINOR (u32) 85 + - minor number of a device with the affected filesystem
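
A listener that has already extracted the QUOTA_NL_A_WARNING attribute could turn it into text along the following lines. This is a sketch only: it assumes the QUOTA_NL_* constants come from <linux/quota.h> as the added text states, and quota_warning_text() and quota_warning_is_below() are hypothetical helper names, not part of any kernel or libnl API. Receiving and parsing the generic netlink message itself is not shown.

```c
#include <stddef.h>
#include <linux/quota.h>	/* QUOTA_NL_* warning codes */

/*
 * Map a QUOTA_NL_A_WARNING value to a short description taken from the
 * list in the added documentation.  Unknown codes yield NULL.
 */
static const char *quota_warning_text(unsigned int warning)
{
	switch (warning) {
	case QUOTA_NL_IHARDWARN:     return "inode hardlimit";
	case QUOTA_NL_ISOFTLONGWARN: return "inode softlimit exceeded longer than grace period";
	case QUOTA_NL_ISOFTWARN:     return "inode softlimit";
	case QUOTA_NL_BHARDWARN:     return "space (block) hardlimit";
	case QUOTA_NL_BSOFTLONGWARN: return "space (block) softlimit exceeded longer than grace period";
	case QUOTA_NL_BSOFTWARN:     return "space (block) softlimit";
	case QUOTA_NL_IHARDBELOW:    return "below inode hardlimit";
	case QUOTA_NL_ISOFTBELOW:    return "below inode softlimit";
	case QUOTA_NL_BHARDBELOW:    return "below space (block) hardlimit";
	case QUOTA_NL_BSOFTBELOW:    return "below space (block) softlimit";
	default:                     return NULL;
	}
}

/* Nonzero when the code reports that the user stopped exceeding a limit. */
static int quota_warning_is_below(unsigned int warning)
{
	switch (warning) {
	case QUOTA_NL_IHARDBELOW:
	case QUOTA_NL_ISOFTBELOW:
	case QUOTA_NL_BHARDBELOW:
	case QUOTA_NL_BSOFTBELOW:
		return 1;
	default:
		return 0;
	}
}
```

Keeping the "below" test separate lets a desktop notifier decide whether to raise or clear an alert without parsing the description string.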

-68

Documentation/filesystems/quota.txt

··· 1 - 2 - Quota subsystem 3 - =============== 4 - 5 - Quota subsystem allows system administrator to set limits on used space and 6 - number of used inodes (inode is a filesystem structure which is associated with 7 - each file or directory) for users and/or groups. For both used space and number 8 - of used inodes there are actually two limits. The first one is called softlimit 9 - and the second one hardlimit. A user can never exceed a hardlimit for any 10 - resource (unless he has CAP_SYS_RESOURCE capability). User is allowed to exceed 11 - softlimit but only for limited period of time. This period is called "grace 12 - period" or "grace time". When grace time is over, user is not able to allocate 13 - more space/inodes until he frees enough of them to get below softlimit. 14 - 15 - Quota limits (and amount of grace time) are set independently for each 16 - filesystem. 17 - 18 - For more details about quota design, see the documentation in quota-tools package 19 - (http://sourceforge.net/projects/linuxquota). 20 - 21 - Quota netlink interface 22 - ======================= 23 - When user exceeds a softlimit, runs out of grace time or reaches hardlimit, 24 - quota subsystem traditionally printed a message to the controlling terminal of 25 - the process which caused the excess. This method has the disadvantage that 26 - when user is using a graphical desktop he usually cannot see the message. 27 - Thus quota netlink interface has been designed to pass information about 28 - the above events to userspace. There they can be captured by an application 29 - and processed accordingly. 30 - 31 - The interface uses generic netlink framework (see 32 - http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more 33 - details about this layer). The name of the quota generic netlink interface 34 - is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. 
35 - Since the quota netlink protocol is not namespace aware, quota netlink messages 36 - are sent only in initial network namespace. 37 - 38 - Currently, the interface supports only one message type QUOTA_NL_C_WARNING. 39 - This command is used to send a notification about any of the above mentioned 40 - events. Each message has six attributes. These are (type of the argument is 41 - in parentheses): 42 - QUOTA_NL_A_QTYPE (u32) 43 - - type of quota being exceeded (one of USRQUOTA, GRPQUOTA) 44 - QUOTA_NL_A_EXCESS_ID (u64) 45 - - UID/GID (depends on quota type) of user / group whose limit 46 - is being exceeded. 47 - QUOTA_NL_A_CAUSED_ID (u64) 48 - - UID of a user who caused the event 49 - QUOTA_NL_A_WARNING (u32) 50 - - what kind of limit is exceeded: 51 - QUOTA_NL_IHARDWARN - inode hardlimit 52 - QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer 53 - than given grace period 54 - QUOTA_NL_ISOFTWARN - inode softlimit 55 - QUOTA_NL_BHARDWARN - space (block) hardlimit 56 - QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded 57 - longer than given grace period. 58 - QUOTA_NL_BSOFTWARN - space (block) softlimit 59 - - four warnings are also defined for the event when user stops 60 - exceeding some limit: 61 - QUOTA_NL_IHARDBELOW - inode hardlimit 62 - QUOTA_NL_ISOFTBELOW - inode softlimit 63 - QUOTA_NL_BHARDBELOW - space (block) hardlimit 64 - QUOTA_NL_BSOFTBELOW - space (block) softlimit 65 - QUOTA_NL_A_DEV_MAJOR (u32) 66 - - major number of a device with the affected filesystem 67 - QUOTA_NL_A_DEV_MINOR (u32) 68 - - minor number of a device with the affected filesystem

+1 -1

Documentation/filesystems/ramfs-rootfs-initramfs.rst

··· 71 71 72 72 A ramfs derivative called tmpfs was created to add size limits, and the ability 73 73 to write the data to swap space. Normal users can be allowed write access to 74 - tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. 74 + tmpfs mounts. See Documentation/filesystems/tmpfs.rst for more information. 75 75 76 76 What is rootfs? 77 77 ---------------

+372

Documentation/filesystems/seq_file.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ====================== 4 + The seq_file Interface 5 + ====================== 6 + 7 + Copyright 2003 Jonathan Corbet <corbet@lwn.net> 8 + 9 + This file is originally from the LWN.net Driver Porting series at 10 + http://lwn.net/Articles/driver-porting/ 11 + 12 + 13 + There are numerous ways for a device driver (or other kernel component) to 14 + provide information to the user or system administrator. One useful 15 + technique is the creation of virtual files, in debugfs, /proc or elsewhere. 16 + Virtual files can provide human-readable output that is easy to get at 17 + without any special utility programs; they can also make life easier for 18 + script writers. It is not surprising that the use of virtual files has 19 + grown over the years. 20 + 21 + Creating those files correctly has always been a bit of a challenge, 22 + however. It is not that hard to make a virtual file which returns a 23 + string. But life gets trickier if the output is long - anything greater 24 + than an application is likely to read in a single operation. Handling 25 + multiple reads (and seeks) requires careful attention to the reader's 26 + position within the virtual file - that position is, likely as not, in the 27 + middle of a line of output. The kernel has traditionally had a number of 28 + implementations that got this wrong. 29 + 30 + The 2.6 kernel contains a set of functions (implemented by Alexander Viro) 31 + which are designed to make it easy for virtual file creators to get it 32 + right. 33 + 34 + The seq_file interface is available via <linux/seq_file.h>. There are 35 + three aspects to seq_file: 36 + 37 + * An iterator interface which lets a virtual file implementation 38 + step through the objects it is presenting. 39 + 40 + * Some utility functions for formatting objects for output without 41 + needing to worry about things like output buffers. 
42 + 43 + * A set of canned file_operations which implement most operations on 44 + the virtual file. 45 + 46 + We'll look at the seq_file interface via an extremely simple example: a 47 + loadable module which creates a file called /proc/sequence. The file, when 48 + read, simply produces a set of increasing integer values, one per line. The 49 + sequence will continue until the user loses patience and finds something 50 + better to do. The file is seekable, in that one can do something like the 51 + following:: 52 + 53 + dd if=/proc/sequence of=out1 count=1 54 + dd if=/proc/sequence skip=1 of=out2 count=1 55 + 56 + Then concatenate the output files out1 and out2 and get the right 57 + result. Yes, it is a thoroughly useless module, but the point is to show 58 + how the mechanism works without getting lost in other details. (Those 59 + wanting to see the full source for this module can find it at 60 + http://lwn.net/Articles/22359/). 61 + 62 + Deprecated create_proc_entry 63 + ============================ 64 + 65 + Note that the above article uses create_proc_entry which was removed in 66 + kernel 3.10. Current versions require the following update:: 67 + 68 + - entry = create_proc_entry("sequence", 0, NULL); 69 + - if (entry) 70 + - entry->proc_fops = &ct_file_ops; 71 + + entry = proc_create("sequence", 0, NULL, &ct_file_ops); 72 + 73 + The iterator interface 74 + ====================== 75 + 76 + Modules implementing a virtual file with seq_file must implement an 77 + iterator object that allows stepping through the data of interest 78 + during a "session" (roughly one read() system call). If the iterator 79 + is able to move to a specific position - like the file they implement, 80 + though with freedom to map the position number to a sequence location 81 + in whatever way is convenient - the iterator need only exist 82 + transiently during a session. 
If the iterator cannot easily find a 83 + numerical position but works well with a first/next interface, the 84 + iterator can be stored in the private data area and continue from one 85 + session to the next. 86 + 87 + A seq_file implementation that is formatting firewall rules from a 88 + table, for example, could provide a simple iterator that interprets 89 + position N as the Nth rule in the chain. A seq_file implementation 90 + that presents the content of a, potentially volatile, linked list 91 + might record a pointer into that list, providing that can be done 92 + without risk of the current location being removed. 93 + 94 + Positioning can thus be done in whatever way makes the most sense for 95 + the generator of the data, which need not be aware of how a position 96 + translates to an offset in the virtual file. The one obvious exception 97 + is that a position of zero should indicate the beginning of the file. 98 + 99 + The /proc/sequence iterator just uses the count of the next number it 100 + will output as its position. 101 + 102 + Four functions must be implemented to make the iterator work. The 103 + first, called start(), starts a session and takes a position as an 104 + argument, returning an iterator which will start reading at that 105 + position. The pos passed to start() will always be either zero, or 106 + the most recent pos used in the previous session. 107 + 108 + For our simple sequence example, 109 + the start() function looks like:: 110 + 111 + static void *ct_seq_start(struct seq_file *s, loff_t *pos) 112 + { 113 + loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL); 114 + if (! spos) 115 + return NULL; 116 + *spos = *pos; 117 + return spos; 118 + } 119 + 120 + The entire data structure for this iterator is a single loff_t value 121 + holding the current position. 
There is no upper bound for the sequence 122 + iterator, but that will not be the case for most other seq_file 123 + implementations; in most cases the start() function should check for a 124 + "past end of file" condition and return NULL if need be. 125 + 126 + For more complicated applications, the private field of the seq_file 127 + structure can be used to hold state from session to session. There is 128 + also a special value which can be returned by the start() function 129 + called SEQ_START_TOKEN; it can be used if you wish to instruct your 130 + show() function (described below) to print a header at the top of the 131 + output. SEQ_START_TOKEN should only be used if the offset is zero, 132 + however. 133 + 134 + The next function to implement is called, amazingly, next(); its job is to 135 + move the iterator forward to the next position in the sequence. The 136 + example module can simply increment the position by one; more useful 137 + modules will do what is needed to step through some data structure. The 138 + next() function returns a new iterator, or NULL if the sequence is 139 + complete. Here's the example version:: 140 + 141 + static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) 142 + { 143 + loff_t *spos = v; 144 + *pos = ++*spos; 145 + return spos; 146 + } 147 + 148 + The stop() function closes a session; its job, of course, is to clean 149 + up. If dynamic memory is allocated for the iterator, stop() is the 150 + place to free it; if a lock was taken by start(), stop() must release 151 + that lock. 
The value that ``*pos`` was set to by the last next() call 152 + before stop() is remembered, and used for the first start() call of 153 + the next session unless lseek() has been called on the file; in that 154 + case next start() will be asked to start at position zero:: 155 + 156 + static void ct_seq_stop(struct seq_file *s, void *v) 157 + { 158 + kfree(v); 159 + } 160 + 161 + Finally, the show() function should format the object currently pointed to 162 + by the iterator for output. The example module's show() function is:: 163 + 164 + static int ct_seq_show(struct seq_file *s, void *v) 165 + { 166 + loff_t *spos = v; 167 + seq_printf(s, "%lld\n", (long long)*spos); 168 + return 0; 169 + } 170 + 171 + If all is well, the show() function should return zero. A negative error 172 + code in the usual manner indicates that something went wrong; it will be 173 + passed back to user space. This function can also return SEQ_SKIP, which 174 + causes the current item to be skipped; if the show() function has already 175 + generated output before returning SEQ_SKIP, that output will be dropped. 176 + 177 + We will look at seq_printf() in a moment. But first, the definition of the 178 + seq_file iterator is finished by creating a seq_operations structure with 179 + the four functions we have just defined:: 180 + 181 + static const struct seq_operations ct_seq_ops = { 182 + .start = ct_seq_start, 183 + .next = ct_seq_next, 184 + .stop = ct_seq_stop, 185 + .show = ct_seq_show 186 + }; 187 + 188 + This structure will be needed to tie our iterator to the /proc file in 189 + a little bit. 190 + 191 + It's worth noting that the iterator value returned by start() and 192 + manipulated by the other functions is considered to be completely opaque by 193 + the seq_file code. It can thus be anything that is useful in stepping 194 + through the data to be output. Counters can be useful, but it could also be 195 + a direct pointer into an array or linked list. 
Anything goes, as long as 196 + the programmer is aware that things can happen between calls to the 197 + iterator function. However, the seq_file code (by design) will not sleep 198 + between the calls to start() and stop(), so holding a lock during that time 199 + is a reasonable thing to do. The seq_file code will also avoid taking any 200 + other locks while the iterator is active. 201 + 202 + 203 + Formatted output 204 + ================ 205 + 206 + The seq_file code manages positioning within the output created by the 207 + iterator and getting it into the user's buffer. But, for that to work, that 208 + output must be passed to the seq_file code. Some utility functions have 209 + been defined which make this task easy. 210 + 211 + Most code will simply use seq_printf(), which works pretty much like 212 + printk(), but which requires the seq_file pointer as an argument. 213 + 214 + For straight character output, the following functions may be used:: 215 + 216 + seq_putc(struct seq_file *m, char c); 217 + seq_puts(struct seq_file *m, const char *s); 218 + seq_escape(struct seq_file *m, const char *s, const char *esc); 219 + 220 + The first two output a single character and a string, just like one would 221 + expect. seq_escape() is like seq_puts(), except that any character in s 222 + which is in the string esc will be represented in octal form in the output. 223 + 224 + There is also a pair of functions for printing filenames:: 225 + 226 + int seq_path(struct seq_file *m, const struct path *path, 227 + const char *esc); 228 + int seq_path_root(struct seq_file *m, const struct path *path, 229 + const struct path *root, const char *esc); 230 + 231 + Here, path indicates the file of interest, and esc is a set of characters 232 + which should be escaped in the output. A call to seq_path() will output 233 + the path relative to the current process's filesystem root. If a different 234 + root is desired, it can be used with seq_path_root(). 
If it turns out that 235 + path cannot be reached from root, seq_path_root() returns SEQ_SKIP. 236 + 237 + A function producing complicated output may want to check:: 238 + 239 + bool seq_has_overflowed(struct seq_file *m); 240 + 241 + and avoid further seq_<output> calls if true is returned. 242 + 243 + A true return from seq_has_overflowed means that the seq_file buffer will 244 + be discarded and the seq_show function will attempt to allocate a larger 245 + buffer and retry printing. 246 + 247 + 248 + Making it all work 249 + ================== 250 + 251 + So far, we have a nice set of functions which can produce output within the 252 + seq_file system, but we have not yet turned them into a file that a user 253 + can see. Creating a file within the kernel requires, of course, the 254 + creation of a set of file_operations which implement the operations on that 255 + file. The seq_file interface provides a set of canned operations which do 256 + most of the work. The virtual file author still must implement the open() 257 + method, however, to hook everything up. The open function is often a single 258 + line, as in the example module:: 259 + 260 + static int ct_open(struct inode *inode, struct file *file) 261 + { 262 + return seq_open(file, &ct_seq_ops); 263 + } 264 + 265 + Here, the call to seq_open() takes the seq_operations structure we created 266 + before, and gets set up to iterate through the virtual file. 267 + 268 + On a successful open, seq_open() stores the struct seq_file pointer in 269 + file->private_data. If you have an application where the same iterator can 270 + be used for more than one file, you can store an arbitrary pointer in the 271 + private field of the seq_file structure; that value can then be retrieved 272 + by the iterator functions. 273 + 274 + There is also a wrapper function to seq_open() called seq_open_private(). 
It 275 + kmallocs a zero filled block of memory and stores a pointer to it in the 276 + private field of the seq_file structure, returning 0 on success. The 277 + block size is specified in a third parameter to the function, e.g.:: 278 + 279 + static int ct_open(struct inode *inode, struct file *file) 280 + { 281 + return seq_open_private(file, &ct_seq_ops, 282 + sizeof(struct mystruct)); 283 + } 284 + 285 + There is also a variant function, __seq_open_private(), which is functionally 286 + identical except that, if successful, it returns the pointer to the allocated 287 + memory block, allowing further initialisation e.g.:: 288 + 289 + static int ct_open(struct inode *inode, struct file *file) 290 + { 291 + struct mystruct *p = 292 + __seq_open_private(file, &ct_seq_ops, sizeof(*p)); 293 + 294 + if (!p) 295 + return -ENOMEM; 296 + 297 + p->foo = bar; /* initialize my stuff */ 298 + ... 299 + p->baz = true; 300 + 301 + return 0; 302 + } 303 + 304 + A corresponding close function, seq_release_private() is available which 305 + frees the memory allocated in the corresponding open. 306 + 307 + The other operations of interest - read(), llseek(), and release() - are 308 + all implemented by the seq_file code itself. So a virtual file's 309 + file_operations structure will look like:: 310 + 311 + static const struct file_operations ct_file_ops = { 312 + .owner = THIS_MODULE, 313 + .open = ct_open, 314 + .read = seq_read, 315 + .llseek = seq_lseek, 316 + .release = seq_release 317 + }; 318 + 319 + There is also a seq_release_private() which passes the contents of the 320 + seq_file private field to kfree() before releasing the structure. 321 + 322 + The final step is the creation of the /proc file itself. 
In the example 323 + code, that is done in the initialization code in the usual way:: 324 + 325 + static int ct_init(void) 326 + { 327 + struct proc_dir_entry *entry; 328 + 329 + proc_create("sequence", 0, NULL, &ct_file_ops); 330 + return 0; 331 + } 332 + 333 + module_init(ct_init); 334 + 335 + And that is pretty much it. 336 + 337 + 338 + seq_list 339 + ======== 340 + 341 + If your file will be iterating through a linked list, you may find these 342 + routines useful:: 343 + 344 + struct list_head *seq_list_start(struct list_head *head, 345 + loff_t pos); 346 + struct list_head *seq_list_start_head(struct list_head *head, 347 + loff_t pos); 348 + struct list_head *seq_list_next(void *v, struct list_head *head, 349 + loff_t *ppos); 350 + 351 + These helpers will interpret pos as a position within the list and iterate 352 + accordingly. Your start() and next() functions need only invoke the 353 + ``seq_list_*`` helpers with a pointer to the appropriate list_head structure. 354 + 355 + 356 + The extra-simple version 357 + ======================== 358 + 359 + For extremely simple virtual files, there is an even easier interface. A 360 + module can define only the show() function, which should create all the 361 + output that the virtual file will contain. The file's open() method then 362 + calls:: 363 + 364 + int single_open(struct file *file, 365 + int (*show)(struct seq_file *m, void *p), 366 + void *data); 367 + 368 + When output time comes, the show() function will be called once. The data 369 + value given to single_open() can be found in the private field of the 370 + seq_file structure. When using single_open(), the programmer should use 371 + single_release() instead of seq_release() in the file_operations structure 372 + to avoid a memory leak.
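The call order described in this file (start(), then show() and next() in a loop, then stop()) can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code: the `ct_*` functions mirror the example module's callbacks, while `run_session()`, its `max_items` limit and the plain character buffer are hypothetical stand-ins for the seq_file core and its output buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for ct_seq_start(): allocate the iterator at *pos. */
static void *ct_start(long long *pos)
{
	long long *spos = malloc(sizeof(*spos));

	if (!spos)
		return NULL;
	*spos = *pos;
	return spos;
}

/* Stand-in for ct_seq_next(): advance the iterator and record the position. */
static void *ct_next(void *v, long long *pos)
{
	long long *spos = v;

	*pos = ++*spos;
	return spos;
}

/* Stand-in for ct_seq_stop(): free what start() allocated. */
static void ct_stop(void *v)
{
	free(v);
}

/* Stand-in for ct_seq_show(): append one line to a NUL-terminated buffer. */
static int ct_show(void *v, char *out, size_t outlen)
{
	long long *spos = v;
	size_t used = strlen(out);

	if (used >= outlen)
		return -1;
	snprintf(out + used, outlen - used, "%lld\n", *spos);
	return 0;
}

/*
 * Hypothetical model of one session: start at *pos, emit at most
 * max_items entries, then stop. Returns the number of entries shown.
 * *pos is left at the value set by the last next() call, so the next
 * session resumes from there.
 */
static int run_session(long long *pos, int max_items, char *out, size_t outlen)
{
	void *v = ct_start(pos);
	int n = 0;

	while (v && n < max_items) {
		if (ct_show(v, out, outlen) < 0)
			break;
		v = ct_next(v, pos);
		n++;
	}
	ct_stop(v);	/* called even if start() returned NULL */
	return n;
}
```

Calling `run_session()` twice with the same `pos` variable models two consecutive read() calls: the second session starts at the position remembered from the first. The real seq_file core decides when to end a session based on its buffer, which this sketch replaces with a simple item count.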

-359

Documentation/filesystems/seq_file.txt

··· 1 - The seq_file interface 2 - 3 - Copyright 2003 Jonathan Corbet <corbet@lwn.net> 4 - This file is originally from the LWN.net Driver Porting series at 5 - http://lwn.net/Articles/driver-porting/ 6 - 7 - 8 - There are numerous ways for a device driver (or other kernel component) to 9 - provide information to the user or system administrator. One useful 10 - technique is the creation of virtual files, in debugfs, /proc or elsewhere. 11 - Virtual files can provide human-readable output that is easy to get at 12 - without any special utility programs; they can also make life easier for 13 - script writers. It is not surprising that the use of virtual files has 14 - grown over the years. 15 - 16 - Creating those files correctly has always been a bit of a challenge, 17 - however. It is not that hard to make a virtual file which returns a 18 - string. But life gets trickier if the output is long - anything greater 19 - than an application is likely to read in a single operation. Handling 20 - multiple reads (and seeks) requires careful attention to the reader's 21 - position within the virtual file - that position is, likely as not, in the 22 - middle of a line of output. The kernel has traditionally had a number of 23 - implementations that got this wrong. 24 - 25 - The 2.6 kernel contains a set of functions (implemented by Alexander Viro) 26 - which are designed to make it easy for virtual file creators to get it 27 - right. 28 - 29 - The seq_file interface is available via <linux/seq_file.h>. There are 30 - three aspects to seq_file: 31 - 32 - * An iterator interface which lets a virtual file implementation 33 - step through the objects it is presenting. 34 - 35 - * Some utility functions for formatting objects for output without 36 - needing to worry about things like output buffers. 37 - 38 - * A set of canned file_operations which implement most operations on 39 - the virtual file. 
40 - 41 - We'll look at the seq_file interface via an extremely simple example: a 42 - loadable module which creates a file called /proc/sequence. The file, when 43 - read, simply produces a set of increasing integer values, one per line. The 44 - sequence will continue until the user loses patience and finds something 45 - better to do. The file is seekable, in that one can do something like the 46 - following: 47 - 48 - dd if=/proc/sequence of=out1 count=1 49 - dd if=/proc/sequence skip=1 of=out2 count=1 50 - 51 - Then concatenate the output files out1 and out2 and get the right 52 - result. Yes, it is a thoroughly useless module, but the point is to show 53 - how the mechanism works without getting lost in other details. (Those 54 - wanting to see the full source for this module can find it at 55 - http://lwn.net/Articles/22359/). 56 - 57 - Deprecated create_proc_entry 58 - 59 - Note that the above article uses create_proc_entry which was removed in 60 - kernel 3.10. Current versions require the following update 61 - 62 - - entry = create_proc_entry("sequence", 0, NULL); 63 - - if (entry) 64 - - entry->proc_fops = &ct_file_ops; 65 - + entry = proc_create("sequence", 0, NULL, &ct_file_ops); 66 - 67 - The iterator interface 68 - 69 - Modules implementing a virtual file with seq_file must implement an 70 - iterator object that allows stepping through the data of interest 71 - during a "session" (roughly one read() system call). If the iterator 72 - is able to move to a specific position - like the file they implement, 73 - though with freedom to map the position number to a sequence location 74 - in whatever way is convenient - the iterator need only exist 75 - transiently during a session. If the iterator cannot easily find a 76 - numerical position but works well with a first/next interface, the 77 - iterator can be stored in the private data area and continue from one 78 - session to the next. 
79 - 80 - A seq_file implementation that is formatting firewall rules from a 81 - table, for example, could provide a simple iterator that interprets 82 - position N as the Nth rule in the chain. A seq_file implementation 83 - that presents the content of a, potentially volatile, linked list 84 - might record a pointer into that list, providing that can be done 85 - without risk of the current location being removed. 86 - 87 - Positioning can thus be done in whatever way makes the most sense for 88 - the generator of the data, which need not be aware of how a position 89 - translates to an offset in the virtual file. The one obvious exception 90 - is that a position of zero should indicate the beginning of the file. 91 - 92 - The /proc/sequence iterator just uses the count of the next number it 93 - will output as its position. 94 - 95 - Four functions must be implemented to make the iterator work. The 96 - first, called start(), starts a session and takes a position as an 97 - argument, returning an iterator which will start reading at that 98 - position. The pos passed to start() will always be either zero, or 99 - the most recent pos used in the previous session. 100 - 101 - For our simple sequence example, 102 - the start() function looks like: 103 - 104 - static void *ct_seq_start(struct seq_file *s, loff_t *pos) 105 - { 106 - loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL); 107 - if (! spos) 108 - return NULL; 109 - *spos = *pos; 110 - return spos; 111 - } 112 - 113 - The entire data structure for this iterator is a single loff_t value 114 - holding the current position. There is no upper bound for the sequence 115 - iterator, but that will not be the case for most other seq_file 116 - implementations; in most cases the start() function should check for a 117 - "past end of file" condition and return NULL if need be. 
118 - 119 - For more complicated applications, the private field of the seq_file 120 - structure can be used to hold state from session to session. There is 121 - also a special value which can be returned by the start() function 122 - called SEQ_START_TOKEN; it can be used if you wish to instruct your 123 - show() function (described below) to print a header at the top of the 124 - output. SEQ_START_TOKEN should only be used if the offset is zero, 125 - however. 126 - 127 - The next function to implement is called, amazingly, next(); its job is to 128 - move the iterator forward to the next position in the sequence. The 129 - example module can simply increment the position by one; more useful 130 - modules will do what is needed to step through some data structure. The 131 - next() function returns a new iterator, or NULL if the sequence is 132 - complete. Here's the example version: 133 - 134 - static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) 135 - { 136 - loff_t *spos = v; 137 - *pos = ++*spos; 138 - return spos; 139 - } 140 - 141 - The stop() function closes a session; its job, of course, is to clean 142 - up. If dynamic memory is allocated for the iterator, stop() is the 143 - place to free it; if a lock was taken by start(), stop() must release 144 - that lock. The value that *pos was set to by the last next() call 145 - before stop() is remembered, and used for the first start() call of 146 - the next session unless lseek() has been called on the file; in that 147 - case next start() will be asked to start at position zero. 148 - 149 - static void ct_seq_stop(struct seq_file *s, void *v) 150 - { 151 - kfree(v); 152 - } 153 - 154 - Finally, the show() function should format the object currently pointed to 155 - by the iterator for output. 
The example module's show() function is: 156 - 157 - static int ct_seq_show(struct seq_file *s, void *v) 158 - { 159 - loff_t *spos = v; 160 - seq_printf(s, "%lld\n", (long long)*spos); 161 - return 0; 162 - } 163 - 164 - If all is well, the show() function should return zero. A negative error 165 - code in the usual manner indicates that something went wrong; it will be 166 - passed back to user space. This function can also return SEQ_SKIP, which 167 - causes the current item to be skipped; if the show() function has already 168 - generated output before returning SEQ_SKIP, that output will be dropped. 169 - 170 - We will look at seq_printf() in a moment. But first, the definition of the 171 - seq_file iterator is finished by creating a seq_operations structure with 172 - the four functions we have just defined: 173 - 174 - static const struct seq_operations ct_seq_ops = { 175 - .start = ct_seq_start, 176 - .next = ct_seq_next, 177 - .stop = ct_seq_stop, 178 - .show = ct_seq_show 179 - }; 180 - 181 - This structure will be needed to tie our iterator to the /proc file in 182 - a little bit. 183 - 184 - It's worth noting that the iterator value returned by start() and 185 - manipulated by the other functions is considered to be completely opaque by 186 - the seq_file code. It can thus be anything that is useful in stepping 187 - through the data to be output. Counters can be useful, but it could also be 188 - a direct pointer into an array or linked list. Anything goes, as long as 189 - the programmer is aware that things can happen between calls to the 190 - iterator function. However, the seq_file code (by design) will not sleep 191 - between the calls to start() and stop(), so holding a lock during that time 192 - is a reasonable thing to do. The seq_file code will also avoid taking any 193 - other locks while the iterator is active. 
194 - 195 - 196 - Formatted output 197 - 198 - The seq_file code manages positioning within the output created by the 199 - iterator and getting it into the user's buffer. But, for that to work, that 200 - output must be passed to the seq_file code. Some utility functions have 201 - been defined which make this task easy. 202 - 203 - Most code will simply use seq_printf(), which works pretty much like 204 - printk(), but which requires the seq_file pointer as an argument. 205 - 206 - For straight character output, the following functions may be used: 207 - 208 - seq_putc(struct seq_file *m, char c); 209 - seq_puts(struct seq_file *m, const char *s); 210 - seq_escape(struct seq_file *m, const char *s, const char *esc); 211 - 212 - The first two output a single character and a string, just like one would 213 - expect. seq_escape() is like seq_puts(), except that any character in s 214 - which is in the string esc will be represented in octal form in the output. 215 - 216 - There are also a pair of functions for printing filenames: 217 - 218 - int seq_path(struct seq_file *m, const struct path *path, 219 - const char *esc); 220 - int seq_path_root(struct seq_file *m, const struct path *path, 221 - const struct path *root, const char *esc) 222 - 223 - Here, path indicates the file of interest, and esc is a set of characters 224 - which should be escaped in the output. A call to seq_path() will output 225 - the path relative to the current process's filesystem root. If a different 226 - root is desired, it can be used with seq_path_root(). If it turns out that 227 - path cannot be reached from root, seq_path_root() returns SEQ_SKIP. 228 - 229 - A function producing complicated output may want to check 230 - bool seq_has_overflowed(struct seq_file *m); 231 - and avoid further seq_<output> calls if true is returned. 
232 - 233 - A true return from seq_has_overflowed means that the seq_file buffer will 234 - be discarded and the seq_show function will attempt to allocate a larger 235 - buffer and retry printing. 236 - 237 - 238 - Making it all work 239 - 240 - So far, we have a nice set of functions which can produce output within the 241 - seq_file system, but we have not yet turned them into a file that a user 242 - can see. Creating a file within the kernel requires, of course, the 243 - creation of a set of file_operations which implement the operations on that 244 - file. The seq_file interface provides a set of canned operations which do 245 - most of the work. The virtual file author still must implement the open() 246 - method, however, to hook everything up. The open function is often a single 247 - line, as in the example module: 248 - 249 - static int ct_open(struct inode *inode, struct file *file) 250 - { 251 - return seq_open(file, &ct_seq_ops); 252 - } 253 - 254 - Here, the call to seq_open() takes the seq_operations structure we created 255 - before, and gets set up to iterate through the virtual file. 256 - 257 - On a successful open, seq_open() stores the struct seq_file pointer in 258 - file->private_data. If you have an application where the same iterator can 259 - be used for more than one file, you can store an arbitrary pointer in the 260 - private field of the seq_file structure; that value can then be retrieved 261 - by the iterator functions. 262 - 263 - There is also a wrapper function to seq_open() called seq_open_private(). It 264 - kmallocs a zero filled block of memory and stores a pointer to it in the 265 - private field of the seq_file structure, returning 0 on success. 
The 266 - block size is specified in a third parameter to the function, e.g.: 267 - 268 - static int ct_open(struct inode *inode, struct file *file) 269 - { 270 - return seq_open_private(file, &ct_seq_ops, 271 - sizeof(struct mystruct)); 272 - } 273 - 274 - There is also a variant function, __seq_open_private(), which is functionally 275 - identical except that, if successful, it returns the pointer to the allocated 276 - memory block, allowing further initialisation e.g.: 277 - 278 - static int ct_open(struct inode *inode, struct file *file) 279 - { 280 - struct mystruct *p = 281 - __seq_open_private(file, &ct_seq_ops, sizeof(*p)); 282 - 283 - if (!p) 284 - return -ENOMEM; 285 - 286 - p->foo = bar; /* initialize my stuff */ 287 - ... 288 - p->baz = true; 289 - 290 - return 0; 291 - } 292 - 293 - A corresponding close function, seq_release_private() is available which 294 - frees the memory allocated in the corresponding open. 295 - 296 - The other operations of interest - read(), llseek(), and release() - are 297 - all implemented by the seq_file code itself. So a virtual file's 298 - file_operations structure will look like: 299 - 300 - static const struct file_operations ct_file_ops = { 301 - .owner = THIS_MODULE, 302 - .open = ct_open, 303 - .read = seq_read, 304 - .llseek = seq_lseek, 305 - .release = seq_release 306 - }; 307 - 308 - There is also a seq_release_private() which passes the contents of the 309 - seq_file private field to kfree() before releasing the structure. 310 - 311 - The final step is the creation of the /proc file itself. In the example 312 - code, that is done in the initialization code in the usual way: 313 - 314 - static int ct_init(void) 315 - { 316 - struct proc_dir_entry *entry; 317 - 318 - proc_create("sequence", 0, NULL, &ct_file_ops); 319 - return 0; 320 - } 321 - 322 - module_init(ct_init); 323 - 324 - And that is pretty much it. 
325 - 326 - 327 - seq_list 328 - 329 - If your file will be iterating through a linked list, you may find these 330 - routines useful: 331 - 332 - struct list_head *seq_list_start(struct list_head *head, 333 - loff_t pos); 334 - struct list_head *seq_list_start_head(struct list_head *head, 335 - loff_t pos); 336 - struct list_head *seq_list_next(void *v, struct list_head *head, 337 - loff_t *ppos); 338 - 339 - These helpers will interpret pos as a position within the list and iterate 340 - accordingly. Your start() and next() functions need only invoke the 341 - seq_list_* helpers with a pointer to the appropriate list_head structure. 342 - 343 - 344 - The extra-simple version 345 - 346 - For extremely simple virtual files, there is an even easier interface. A 347 - module can define only the show() function, which should create all the 348 - output that the virtual file will contain. The file's open() method then 349 - calls: 350 - 351 - int single_open(struct file *file, 352 - int (*show)(struct seq_file *m, void *p), 353 - void *data); 354 - 355 - When output time comes, the show() function will be called once. The data 356 - value given to single_open() can be found in the private field of the 357 - seq_file structure. When using single_open(), the programmer should use 358 - single_release() instead of seq_release() in the file_operations structure 359 - to avoid a memory leak.

+995

Documentation/filesystems/sharedsubtree.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + =============== 4 + Shared Subtrees 5 + =============== 6 + 7 + .. Contents: 8 + 1) Overview 9 + 2) Features 10 + 3) Setting mount states 11 + 4) Use-case 12 + 5) Detailed semantics 13 + 6) Quiz 14 + 7) FAQ 15 + 8) Implementation 16 + 17 + 18 + 1) Overview 19 + ----------- 20 + 21 + Consider the following situation: 22 + 23 + A process wants to clone its own namespace, but still wants to access the CD 24 + that got mounted recently. Shared subtree semantics provide the necessary 25 + mechanism to accomplish the above. 26 + 27 + It provides the necessary building blocks for features like per-user-namespace 28 + and versioned filesystem. 29 + 30 + 2) Features 31 + ----------- 32 + 33 + Shared subtree provides four different flavors of mounts; struct vfsmount to be 34 + precise 35 + 36 + a. shared mount 37 + b. slave mount 38 + c. private mount 39 + d. unbindable mount 40 + 41 + 42 + 2a) A shared mount can be replicated to as many mountpoints and all the 43 + replicas continue to be exactly same. 44 + 45 + Here is an example: 46 + 47 + Let's say /mnt has a mount that is shared:: 48 + 49 + mount --make-shared /mnt 50 + 51 + Note: mount(8) command now supports the --make-shared flag, 52 + so the sample 'smount' program is no longer needed and has been 53 + removed. 54 + 55 + :: 56 + 57 + # mount --bind /mnt /tmp 58 + 59 + The above command replicates the mount at /mnt to the mountpoint /tmp 60 + and the contents of both the mounts remain identical. 61 + 62 + :: 63 + 64 + #ls /mnt 65 + a b c 66 + 67 + #ls /tmp 68 + a b c 69 + 70 + Now let's say we mount a device at /tmp/a:: 71 + 72 + # mount /dev/sd0 /tmp/a 73 + 74 + #ls /tmp/a 75 + t1 t2 t3 76 + 77 + #ls /mnt/a 78 + t1 t2 t3 79 + 80 + Note that the mount has propagated to the mount at /mnt as well. 81 + 82 + And the same is true even when /dev/sd0 is mounted on /mnt/a. The 83 + contents will be visible under /tmp/a too. 
84 + 85 + 86 + 2b) A slave mount is like a shared mount except that mount and umount events 87 + only propagate towards it. 88 + 89 + All slave mounts have a master mount which is shared. 90 + 91 + Here is an example: 92 + 93 + Let's say /mnt has a mount which is shared. 94 + # mount --make-shared /mnt 95 + 96 + Let's bind mount /mnt to /tmp 97 + # mount --bind /mnt /tmp 98 + 99 + The new mount at /tmp becomes a shared mount and it is a replica of 100 + the mount at /mnt. 101 + 102 + Now let's make the mount at /tmp a slave of /mnt 103 + # mount --make-slave /tmp 104 + 105 + Let's mount /dev/sd0 on /mnt/a 106 + # mount /dev/sd0 /mnt/a 107 + 108 + #ls /mnt/a 109 + t1 t2 t3 110 + 111 + #ls /tmp/a 112 + t1 t2 t3 113 + 114 + Note the mount event has propagated to the mount at /tmp 115 + 116 + However let's see what happens if we mount something on the mount at /tmp 117 + 118 + # mount /dev/sd1 /tmp/b 119 + 120 + #ls /tmp/b 121 + s1 s2 s3 122 + 123 + #ls /mnt/b 124 + 125 + Note how the mount event has not propagated to the mount at 126 + /mnt 127 + 128 + 129 + 2c) A private mount does not forward or receive propagation. 130 + 131 + This is the mount we are familiar with. It's the default type. 132 + 133 + 134 + 2d) An unbindable mount is an unbindable private mount 135 + 136 + Let's say we have a mount at /mnt and we make it unbindable:: 137 + 138 + # mount --make-unbindable /mnt 139 + 140 + Let's try to bind mount this mount somewhere else:: 141 + 142 + # mount --bind /mnt /tmp 143 + mount: wrong fs type, bad option, bad superblock on /mnt, 144 + or too many mounted file systems 145 + 146 + Binding an unbindable mount is an invalid operation.
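The four mount flavors in section 2 differ only in which direction events travel and in whether the mount can be bind mounted. The sketch below is a toy model of those rules, assuming a hypothetical `struct mnt_model` that tags each mount with a peer group id and a master group id. It only looks one hop away and is not how the kernel stores or walks propagation trees.

```c
#include <stdbool.h>

/*
 * Hypothetical model of a mount's propagation state.
 * peer_group:   id of the peer group the mount belongs to, 0 if not shared.
 * master_group: id of the peer group it is a slave of, 0 if not a slave.
 * unbindable:   true for an unbindable (private) mount.
 */
struct mnt_model {
	int peer_group;
	int master_group;
	bool unbindable;
};

/* Would a mount/umount event on src show up on dst? (one hop only) */
static bool event_reaches(const struct mnt_model *src,
			  const struct mnt_model *dst)
{
	/* Private, unbindable and plain slave mounts forward nothing. */
	if (src == dst || src->peer_group == 0)
		return false;

	/* Peers see each other's events; slaves see their master's events. */
	return dst->peer_group == src->peer_group ||
	       dst->master_group == src->peer_group;
}

/* 2d: an unbindable mount cannot be the source of a bind mount. */
static bool can_bind(const struct mnt_model *src)
{
	return !src->unbindable;
}
```

With /mnt and /tmp both in peer group 1, `event_reaches()` is true in both directions, matching 2a. After modelling `mount --make-slave /tmp` by setting `peer_group = 0` and `master_group = 1` on /tmp, events on /mnt still reach /tmp but not the reverse, matching 2b.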
147 + 148 + 149 + 3) Setting mount states 150 + 151 + The mount command (util-linux package) can be used to set mount 152 + states:: 153 + 154 + mount --make-shared mountpoint 155 + mount --make-slave mountpoint 156 + mount --make-private mountpoint 157 + mount --make-unbindable mountpoint 158 + 159 + 160 + 4) Use cases 161 + ------------ 162 + 163 + A) A process wants to clone its own namespace, but still wants to 164 + access the CD that got mounted recently. 165 + 166 + Solution: 167 + 168 + The system administrator can make the mount at /cdrom shared:: 169 + 170 + mount --bind /cdrom /cdrom 171 + mount --make-shared /cdrom 172 + 173 + Now any process that clones off a new namespace will have a 174 + mount at /cdrom which is a replica of the same mount in the 175 + parent namespace. 176 + 177 + So when a CD is inserted and mounted at /cdrom that mount gets 178 + propagated to the other mount at /cdrom in all the other clone 179 + namespaces. 180 + 181 + B) A process wants its mounts invisible to any other process, but 182 + still be able to see the other system mounts. 183 + 184 + Solution: 185 + 186 + To begin with, the administrator can mark the entire mount tree 187 + as shareable:: 188 + 189 + mount --make-rshared / 190 + 191 + A new process can clone off a new namespace. And mark some part 192 + of its namespace as slave:: 193 + 194 + mount --make-rslave /myprivatetree 195 + 196 + Hence forth any mounts within the /myprivatetree done by the 197 + process will not show up in any other namespace. However mounts 198 + done in the parent namespace under /myprivatetree still shows 199 + up in the process's namespace. 200 + 201 + 202 + Apart from the above semantics this feature provides the 203 + building blocks to solve the following problems: 204 + 205 + C) Per-user namespace 206 + 207 + The above semantics allows a way to share mounts across 208 + namespaces. But namespaces are associated with processes. 
If 209 + namespaces are made first class objects with user API to 210 + associate/disassociate a namespace with userid, then each user 211 + could have his/her own namespace and tailor it to his/her 212 + requirements. This needs to be supported in PAM. 213 + 214 + D) Versioned files 215 + 216 + If the entire mount tree is visible at multiple locations, then 217 + an underlying versioning file system can return different 218 + versions of the file depending on the path used to access that 219 + file. 220 + 221 + An example is:: 222 + 223 + mount --make-shared / 224 + mount --rbind / /view/v1 225 + mount --rbind / /view/v2 226 + mount --rbind / /view/v3 227 + mount --rbind / /view/v4 228 + 229 + and if /usr has a versioning filesystem mounted, then that 230 + mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and 231 + /view/v4/usr too 232 + 233 + A user can request v3 version of the file /usr/fs/namespace.c 234 + by accessing /view/v3/usr/fs/namespace.c . The underlying 235 + versioning filesystem can then decipher that v3 version of the 236 + filesystem is being requested and return the corresponding 237 + inode. 238 + 239 + 5) Detailed semantics 240 + --------------------- 241 + The section below explains the detailed semantics of 242 + bind, rbind, move, mount, umount and clone-namespace operations. 243 + 244 + Note: the word 'vfsmount' and the noun 'mount' have been used 245 + to mean the same thing, throughout this document. 246 + 247 + 5a) Mount states 248 + 249 + A given mount can be in one of the following states 250 + 251 + 1) shared 252 + 2) slave 253 + 3) shared and slave 254 + 4) private 255 + 5) unbindable 256 + 257 + A 'propagation event' is defined as event generated on a vfsmount 258 + that leads to mount or unmount actions in other vfsmounts. 259 + 260 + A 'peer group' is defined as a group of vfsmounts that propagate 261 + events to each other. 
262 + 263 + (1) Shared mounts 264 + 265 + A 'shared mount' is defined as a vfsmount that belongs to a 266 + 'peer group'. 267 + 268 + For example:: 269 + 270 + mount --make-shared /mnt 271 + mount --bind /mnt /tmp 272 + 273 + The mount at /mnt and that at /tmp are both shared and belong 274 + to the same peer group. Anything mounted or unmounted under 275 + /mnt or /tmp reflect in all the other mounts of its peer 276 + group. 277 + 278 + 279 + (2) Slave mounts 280 + 281 + A 'slave mount' is defined as a vfsmount that receives 282 + propagation events and does not forward propagation events. 283 + 284 + A slave mount as the name implies has a master mount from which 285 + mount/unmount events are received. Events do not propagate from 286 + the slave mount to the master. Only a shared mount can be made 287 + a slave by executing the following command:: 288 + 289 + mount --make-slave mount 290 + 291 + A shared mount that is made as a slave is no more shared unless 292 + modified to become shared. 293 + 294 + (3) Shared and Slave 295 + 296 + A vfsmount can be both shared as well as slave. This state 297 + indicates that the mount is a slave of some vfsmount, and 298 + has its own peer group too. This vfsmount receives propagation 299 + events from its master vfsmount, and also forwards propagation 300 + events to its 'peer group' and to its slave vfsmounts. 301 + 302 + Strictly speaking, the vfsmount is shared having its own 303 + peer group, and this peer-group is a slave of some other 304 + peer group. 305 + 306 + Only a slave vfsmount can be made as 'shared and slave' by 307 + either executing the following command:: 308 + 309 + mount --make-shared mount 310 + 311 + or by moving the slave vfsmount under a shared vfsmount. 312 + 313 + (4) Private mount 314 + 315 + A 'private mount' is defined as vfsmount that does not 316 + receive or forward any propagation events. 
317 + 318 + (5) Unbindable mount 319 + 320 + An 'unbindable mount' is defined as a vfsmount that does not 321 + receive or forward any propagation events and cannot 322 + be bind mounted. 323 + 324 + 325 + State diagram: 326 + 327 + The state diagram below explains the state transition of a mount, 328 + in response to various commands:: 329 + 330 + ----------------------------------------------------------------------- 331 + | |make-shared | make-slave | make-private |make-unbindab| 332 + --------------|------------|--------------|--------------|-------------| 333 + |shared |shared |*slave/private| private | unbindable | 334 + | | | | | | 335 + |-------------|------------|--------------|--------------|-------------| 336 + |slave |shared | **slave | private | unbindable | 337 + | |and slave | | | | 338 + |-------------|------------|--------------|--------------|-------------| 339 + |shared |shared | slave | private | unbindable | 340 + |and slave |and slave | | | | 341 + |-------------|------------|--------------|--------------|-------------| 342 + |private |shared | **private | private | unbindable | 343 + |-------------|------------|--------------|--------------|-------------| 344 + |unbindable |shared |**unbindable | private | unbindable | 345 + ------------------------------------------------------------------------ 346 + 347 + * if the shared mount is the only mount in its peer group, making it 348 + a slave makes it private automatically. Note that there is no master to 349 + which it can be slaved. 350 + 351 + ** slaving a non-shared mount has no effect on the mount. 352 + 353 + Apart from the commands listed above, the 'move' operation also changes 354 + the state of a mount depending on the type of the destination mount. It is 355 + explained in section 5d.
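The state diagram in section 5a can be read as a small transition function. The sketch below encodes the table row by row. The enum names and the `has_peers` parameter are hypothetical: `has_peers` is nonzero when the shared mount has other mounts in its peer group, which is the case the `*` footnote distinguishes. It covers only the four make-* commands, not the 'move' operation.

```c
/* Mount states from section 5a (names are illustrative only). */
enum mnt_state {
	ST_SHARED,
	ST_SLAVE,
	ST_SHARED_AND_SLAVE,
	ST_PRIVATE,
	ST_UNBINDABLE
};

/* The four state-changing commands from the table header. */
enum mnt_cmd {
	CMD_MAKE_SHARED,
	CMD_MAKE_SLAVE,
	CMD_MAKE_PRIVATE,
	CMD_MAKE_UNBINDABLE
};

/* Return the state a mount ends up in after cmd, following the table. */
static enum mnt_state next_state(enum mnt_state cur, enum mnt_cmd cmd,
				 int has_peers)
{
	switch (cmd) {
	case CMD_MAKE_PRIVATE:
		return ST_PRIVATE;
	case CMD_MAKE_UNBINDABLE:
		return ST_UNBINDABLE;
	case CMD_MAKE_SHARED:
		/* A slave keeps its master and gains a peer group. */
		if (cur == ST_SLAVE || cur == ST_SHARED_AND_SLAVE)
			return ST_SHARED_AND_SLAVE;
		return ST_SHARED;
	case CMD_MAKE_SLAVE:
		/* (*) a lone shared mount has no master to be slaved to. */
		if (cur == ST_SHARED)
			return has_peers ? ST_SLAVE : ST_PRIVATE;
		if (cur == ST_SHARED_AND_SLAVE)
			return ST_SLAVE;
		/* (**) slaving a non-shared mount has no effect. */
		return cur;
	}
	return cur;
}
```

For example, `next_state(ST_SHARED, CMD_MAKE_SLAVE, 0)` yields `ST_PRIVATE`, the automatic fallback described in the first footnote, while `next_state(ST_SLAVE, CMD_MAKE_SHARED, 0)` yields `ST_SHARED_AND_SLAVE`.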
356 +
357 + 5b) Bind semantics
358 +
359 +     Consider the following command::
360 +
361 +         mount --bind A/a B/b
362 +
363 +     where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
364 +     is the destination mount and 'b' is the dentry in the destination mount.
365 +
366 +     The outcome depends on the type of mount of 'A' and 'B'. The table
367 +     below is a quick reference::
368 +
369 +     --------------------------------------------------------------------------
370 +     |         BIND MOUNT OPERATION                                           |
371 +     |************************************************************************|
372 +     |source(A)->| shared      |       private  |       slave    | unbindable |
373 +     | dest(B)  |              |                |                |            |
374 +     |   |      |              |                |                |            |
375 +     |   v      |              |                |                |            |
376 +     |************************************************************************|
377 +     |  shared  | shared       |     shared     | shared & slave |  invalid   |
378 +     |          |              |                |                |            |
379 +     |non-shared| shared       |      private   |      slave     |  invalid   |
380 +     **************************************************************************
381 +
382 +     Details:
383 +
384 +     1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C',
385 +        which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
386 +        mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
387 +        are created and mounted at the dentry 'b' on all mounts to which 'B'
388 +        propagates. A new propagation tree containing 'C1',..,'Cn' is
389 +        created. This propagation tree is identical to the propagation tree of
390 +        'B'. And finally the peer group of 'C' is merged with the peer group
391 +        of 'A'.
392 +
393 +     2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C',
394 +        which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
395 +        mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
396 +        are created and mounted at the dentry 'b' on all mounts to which 'B'
397 +        propagates.
A new propagation tree is set up containing all new mounts
398 +        'C', 'C1', .., 'Cn' with exactly the same configuration as the
399 +        propagation tree for 'B'.
400 +
401 +     3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
402 +        mount 'C', which is a clone of 'A', is created. Its root dentry is 'a'.
403 +        'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
404 +        'C3' ... are created and mounted at the dentry 'b' on all mounts to which
405 +        'B' propagates. A new propagation tree containing the new mounts
406 +        'C','C1',.. 'Cn' is created. This propagation tree is identical to the
407 +        propagation tree for 'B'. And finally the mount 'C' and its peer group
408 +        are made the slave of mount 'Z'. In other words, mount 'C' is in the
409 +        state 'slave and shared'.
410 +
411 +     4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
412 +        invalid operation.
413 +
414 +     5. 'A' is a private mount and 'B' is a non-shared (private or slave or
415 +        unbindable) mount. A new mount 'C', which is a clone of 'A', is created.
416 +        Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
417 +
418 +     6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C',
419 +        which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
420 +        mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
421 +        peer group of 'A'.
422 +
423 +     7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
424 +        new mount 'C', which is a clone of 'A', is created. Its root dentry is
425 +        'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
426 +        slave mount of 'Z'. In other words, 'A' and 'C' are both slave mounts of
427 +        'Z'. All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
428 +        mount/unmount events on 'A' do not propagate anywhere else. Similarly,
429 +        mount/unmount events on 'C' do not propagate anywhere else.
430 +
431 +     8. 'A' is an unbindable mount and 'B' is a non-shared mount.
This is an
432 +        invalid operation. An unbindable mount cannot be bind mounted.
433 +
434 + 5c) Rbind semantics
435 +
436 +     Rbind is the same as bind, with one difference: bind replicates only the
437 +     specified mount, while rbind replicates all the mounts in the tree
438 +     belonging to the specified mount. Rbind is bind applied to every mount in the tree.
439 +
440 +     If the source tree that is rbound has some unbindable mounts,
441 +     then the subtree under each unbindable mount is pruned in the new
442 +     location.
443 +
444 +     e.g.:
445 +
446 +     Let's say we have the following mount tree::
447 +
448 +             A
449 +           /   \
450 +           B   C
451 +          / \ / \
452 +          D E F G
453 +
454 +     Let's say all the mounts in the tree except mount C are
455 +     of a type other than unbindable.
456 +
457 +     If this tree is rbound to, say, Z
458 +
459 +     We will have the following tree at the new location::
460 +
461 +             Z
462 +             |
463 +             A'
464 +            /
465 +           B'        Note how the tree under C is pruned
466 +          / \        in the new location.
467 +         D' E'
468 +
469 +
470 +
471 + 5d) Move semantics
472 +
473 +     Consider the following command::
474 +
475 +         mount --move A B/b
476 +
477 +     where 'A' is the source mount, 'B' is the destination mount and 'b' is
478 +     the dentry in the destination mount.
479 +
480 +     The outcome depends on the type of the mount of 'A' and 'B'. The table
481 +     below is a quick reference::
482 +
483 +     ---------------------------------------------------------------------------
484 +     |         MOVE MOUNT OPERATION                                            |
485 +     |**************************************************************************
486 +     | source(A)->| shared      |       private  |       slave    | unbindable |
487 +     | dest(B)  |               |                |                |            |
488 +     |   |      |               |                |                |            |
489 +     |   v      |               |                |                |            |
490 +     |**************************************************************************
491 +     |  shared  | shared        |     shared     |shared and slave|  invalid   |
492 +     |          |               |                |                |            |
493 +     |non-shared| shared        |      private   |    slave       | unbindable |
494 +     ***************************************************************************
495 +
496 +     ..
Note:: moving a mount residing under a shared mount is invalid. 497 + 498 + Details follow: 499 + 500 + 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is 501 + mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An' 502 + are created and mounted at dentry 'b' on all mounts that receive 503 + propagation from mount 'B'. A new propagation tree is created in the 504 + exact same configuration as that of 'B'. This new propagation tree 505 + contains all the new mounts 'A1', 'A2'... 'An'. And this new 506 + propagation tree is appended to the already existing propagation tree 507 + of 'A'. 508 + 509 + 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is 510 + mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An' 511 + are created and mounted at dentry 'b' on all mounts that receive 512 + propagation from mount 'B'. The mount 'A' becomes a shared mount and a 513 + propagation tree is created which is identical to that of 514 + 'B'. This new propagation tree contains all the new mounts 'A1', 515 + 'A2'... 'An'. 516 + 517 + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The 518 + mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 519 + 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that 520 + receive propagation from mount 'B'. A new propagation tree is created 521 + in the exact same configuration as that of 'B'. This new propagation 522 + tree contains all the new mounts 'A1', 'A2'... 'An'. And this new 523 + propagation tree is appended to the already existing propagation tree of 524 + 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also 525 + becomes 'shared'. 526 + 527 + 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation 528 + is invalid. Because mounting anything on the shared mount 'B' can 529 + create new mounts that get mounted on the mounts that receive 530 + propagation from 'B'. 
And since the mount 'A' is unbindable, cloning 531 + it to mount at other mountpoints is not possible. 532 + 533 + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or 534 + unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. 535 + 536 + 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' 537 + is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a 538 + shared mount. 539 + 540 + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. 541 + The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' 542 + continues to be a slave mount of mount 'Z'. 543 + 544 + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount 545 + 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a 546 + unbindable mount. 547 + 548 + 5e) Mount semantics 549 + 550 + Consider the following command:: 551 + 552 + mount device B/b 553 + 554 + 'B' is the destination mount and 'b' is the dentry in the destination 555 + mount. 556 + 557 + The above operation is the same as bind operation with the exception 558 + that the source mount is always a private mount. 559 + 560 + 561 + 5f) Unmount semantics 562 + 563 + Consider the following command:: 564 + 565 + umount A 566 + 567 + where 'A' is a mount mounted on mount 'B' at dentry 'b'. 568 + 569 + If mount 'B' is shared, then all most-recently-mounted mounts at dentry 570 + 'b' on mounts that receive propagation from mount 'B' and does not have 571 + sub-mounts within them are unmounted. 572 + 573 + Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to 574 + each other. 575 + 576 + let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount 577 + 'B1', 'B2' and 'B3' respectively. 578 + 579 + let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on 580 + mount 'B1', 'B2' and 'B3' respectively. 
581 + 582 + if 'C1' is unmounted, all the mounts that are most-recently-mounted on 583 + 'B1' and on the mounts that 'B1' propagates-to are unmounted. 584 + 585 + 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount 586 + on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'. 587 + 588 + So all 'C1', 'C2' and 'C3' should be unmounted. 589 + 590 + If any of 'C2' or 'C3' has some child mounts, then that mount is not 591 + unmounted, but all other mounts are unmounted. However if 'C1' is told 592 + to be unmounted and 'C1' has some sub-mounts, the umount operation is 593 + failed entirely. 594 + 595 + 5g) Clone Namespace 596 + 597 + A cloned namespace contains all the mounts as that of the parent 598 + namespace. 599 + 600 + Let's say 'A' and 'B' are the corresponding mounts in the parent and the 601 + child namespace. 602 + 603 + If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to 604 + each other. 605 + 606 + If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of 607 + 'Z'. 608 + 609 + If 'A' is a private mount, then 'B' is a private mount too. 610 + 611 + If 'A' is unbindable mount, then 'B' is a unbindable mount too. 612 + 613 + 614 + 6) Quiz 615 + 616 + A. What is the result of the following command sequence? 617 + 618 + :: 619 + 620 + mount --bind /mnt /mnt 621 + mount --make-shared /mnt 622 + mount --bind /mnt /tmp 623 + mount --move /tmp /mnt/1 624 + 625 + what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? 626 + Should they all be identical? or should /mnt and /mnt/1 be 627 + identical only? 628 + 629 + 630 + B. What is the result of the following command sequence? 631 + 632 + :: 633 + 634 + mount --make-rshared / 635 + mkdir -p /v/1 636 + mount --rbind / /v/1 637 + 638 + what should be the content of /v/1/v/1 be? 639 + 640 + 641 + C. What is the result of the following command sequence? 
642 + 643 + :: 644 + 645 + mount --bind /mnt /mnt 646 + mount --make-shared /mnt 647 + mkdir -p /mnt/1/2/3 /mnt/1/test 648 + mount --bind /mnt/1 /tmp 649 + mount --make-slave /mnt 650 + mount --make-shared /mnt 651 + mount --bind /mnt/1/2 /tmp1 652 + mount --make-slave /mnt 653 + 654 + At this point we have the first mount at /tmp and 655 + its root dentry is 1. Let's call this mount 'A' 656 + And then we have a second mount at /tmp1 with root 657 + dentry 2. Let's call this mount 'B' 658 + Next we have a third mount at /mnt with root dentry 659 + mnt. Let's call this mount 'C' 660 + 661 + 'B' is the slave of 'A' and 'C' is a slave of 'B' 662 + A -> B -> C 663 + 664 + at this point if we execute the following command 665 + 666 + mount --bind /bin /tmp/test 667 + 668 + The mount is attempted on 'A' 669 + 670 + will the mount propagate to 'B' and 'C' ? 671 + 672 + what would be the contents of 673 + /mnt/1/test be? 674 + 675 + 7) FAQ 676 + 677 + Q1. Why is bind mount needed? How is it different from symbolic links? 678 + symbolic links can get stale if the destination mount gets 679 + unmounted or moved. Bind mounts continue to exist even if the 680 + other mount is unmounted or moved. 681 + 682 + Q2. Why can't the shared subtree be implemented using exportfs? 683 + 684 + exportfs is a heavyweight way of accomplishing part of what 685 + shared subtree can do. I cannot imagine a way to implement the 686 + semantics of slave mount using exportfs? 687 + 688 + Q3 Why is unbindable mount needed? 689 + 690 + Let's say we want to replicate the mount tree at multiple 691 + locations within the same subtree. 692 + 693 + if one rbind mounts a tree within the same subtree 'n' times 694 + the number of mounts created is an exponential function of 'n'. 695 + Having unbindable mount can help prune the unneeded bind 696 + mounts. Here is an example. 
697 + 698 + step 1: 699 + let's say the root tree has just two directories with 700 + one vfsmount:: 701 + 702 + root 703 + / \ 704 + tmp usr 705 + 706 + And we want to replicate the tree at multiple 707 + mountpoints under /root/tmp 708 + 709 + step 2: 710 + :: 711 + 712 + 713 + mount --make-shared /root 714 + 715 + mkdir -p /tmp/m1 716 + 717 + mount --rbind /root /tmp/m1 718 + 719 + the new tree now looks like this:: 720 + 721 + root 722 + / \ 723 + tmp usr 724 + / 725 + m1 726 + / \ 727 + tmp usr 728 + / 729 + m1 730 + 731 + it has two vfsmounts 732 + 733 + step 3: 734 + :: 735 + 736 + mkdir -p /tmp/m2 737 + mount --rbind /root /tmp/m2 738 + 739 + the new tree now looks like this:: 740 + 741 + root 742 + / \ 743 + tmp usr 744 + / \ 745 + m1 m2 746 + / \ / \ 747 + tmp usr tmp usr 748 + / \ / 749 + m1 m2 m1 750 + / \ / \ 751 + tmp usr tmp usr 752 + / / \ 753 + m1 m1 m2 754 + / \ 755 + tmp usr 756 + / \ 757 + m1 m2 758 + 759 + it has 6 vfsmounts 760 + 761 + step 4: 762 + :: 763 + mkdir -p /tmp/m3 764 + mount --rbind /root /tmp/m3 765 + 766 + I won't draw the tree..but it has 24 vfsmounts 767 + 768 + 769 + at step i the number of vfsmounts is V[i] = i*V[i-1]. 770 + This is an exponential function. And this tree has way more 771 + mounts than what we really needed in the first place. 772 + 773 + One could use a series of umount at each step to prune 774 + out the unneeded mounts. But there is a better solution. 775 + Unclonable mounts come in handy here. 
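The growth described above can be checked numerically. The sketch below takes the recurrence V[i] = i*V[i-1] from the text at face value and assumes V[1] = 1, matching the single vfsmount of step 1; the function name `vfsmount_counts` is hypothetical. Under these assumptions the counts are exactly the factorials, which grow even faster than a fixed-base exponential.

```python
import math


def vfsmount_counts(steps):
    """Return [V[1], ..., V[steps]] for V[1] = 1 and V[i] = i * V[i-1]."""
    counts = [1]
    for i in range(2, steps + 1):
        counts.append(i * counts[-1])
    return counts


counts = vfsmount_counts(6)
print(counts)  # [1, 2, 6, 24, 120, 720]

# Each entry equals i!, so the series matches the 2, 6 and 24 vfsmounts
# quoted for steps 2, 3 and 4.
print(counts == [math.factorial(i) for i in range(1, 7)])  # True
```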
776 + 777 + step 1: 778 + let's say the root tree has just two directories with 779 + one vfsmount:: 780 + 781 + root 782 + / \ 783 + tmp usr 784 + 785 + How do we set up the same tree at multiple locations under 786 + /root/tmp 787 + 788 + step 2: 789 + :: 790 + 791 + 792 + mount --bind /root/tmp /root/tmp 793 + 794 + mount --make-rshared /root 795 + mount --make-unbindable /root/tmp 796 + 797 + mkdir -p /tmp/m1 798 + 799 + mount --rbind /root /tmp/m1 800 + 801 + the new tree now looks like this:: 802 + 803 + root 804 + / \ 805 + tmp usr 806 + / 807 + m1 808 + / \ 809 + tmp usr 810 + 811 + step 3: 812 + :: 813 + 814 + mkdir -p /tmp/m2 815 + mount --rbind /root /tmp/m2 816 + 817 + the new tree now looks like this:: 818 + 819 + root 820 + / \ 821 + tmp usr 822 + / \ 823 + m1 m2 824 + / \ / \ 825 + tmp usr tmp usr 826 + 827 + step 4: 828 + :: 829 + 830 + mkdir -p /tmp/m3 831 + mount --rbind /root /tmp/m3 832 + 833 + the new tree now looks like this:: 834 + 835 + root 836 + / \ 837 + tmp usr 838 + / \ \ 839 + m1 m2 m3 840 + / \ / \ / \ 841 + tmp usr tmp usr tmp usr 842 + 843 + 8) Implementation 844 + 845 + 8A) Datastructure 846 + 847 + 4 new fields are introduced to struct vfsmount: 848 + 849 + * ->mnt_share 850 + * ->mnt_slave_list 851 + * ->mnt_slave 852 + * ->mnt_master 853 + 854 + ->mnt_share 855 + links together all the mount to/from which this vfsmount 856 + send/receives propagation events. 857 + 858 + ->mnt_slave_list 859 + links all the mounts to which this vfsmount propagates 860 + to. 861 + 862 + ->mnt_slave 863 + links together all the slaves that its master vfsmount 864 + propagates to. 865 + 866 + ->mnt_master 867 + points to the master vfsmount from which this vfsmount 868 + receives propagation. 869 + 870 + ->mnt_flags 871 + takes two more flags to indicate the propagation status of 872 + the vfsmount. MNT_SHARE indicates that the vfsmount is a shared 873 + vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be 874 + replicated. 
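The fields described above can be pictured with a small model. This is a hedged Python sketch, not the kernel data structure: the kernel links mounts through list heads, while the hypothetical `Mount` class here uses plain Python lists, and `make_peers`, `make_slave` and `propagation_targets` are illustrative helper names. It only shows the direction in which events travel: sideways between peers and downwards to slaves, never up to a master.

```python
class Mount:
    """Toy stand-in for struct vfsmount; only propagation links are modelled."""

    def __init__(self, name):
        self.name = name
        self.peers = [self]   # plays the role of the ->mnt_share list
        self.slaves = []      # plays the role of ->mnt_slave_list
        self.master = None    # plays the role of ->mnt_master


def make_peers(a, b):
    """Merge the peer groups of a and b so they propagate to each other."""
    merged = list(a.peers)
    for m in b.peers:
        if m not in merged:
            merged.append(m)
    for m in merged:
        m.peers = merged


def make_slave(slave, master):
    """Record that `slave` receives propagation from `master`."""
    slave.master = master
    master.slaves.append(slave)


def propagation_targets(origin):
    """Sorted names of all mounts that receive an event generated on `origin`."""
    reached = []
    pending = [origin]
    seen = {id(origin)}
    while pending:
        current = pending.pop(0)
        # Follow peer links and slave links only; master links are never followed.
        for nxt in current.peers + current.slaves:
            if id(nxt) not in seen:
                seen.add(id(nxt))
                reached.append(nxt.name)
                pending.append(nxt)
    return sorted(reached)


a, b, e, h = Mount("A"), Mount("B"), Mount("E"), Mount("H")
make_peers(a, b)      # A and B form one peer group
make_slave(e, a)      # E is a slave of A
make_slave(h, b)      # H is a slave of B
print(propagation_targets(a))  # ['B', 'E', 'H']
print(propagation_targets(e))  # []
```

An event on 'A' reaches its peer 'B' and the slaves of both, while an event on the pure slave 'E' reaches nothing.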
875 + 876 + All the shared vfsmounts in a peer group form a cyclic list through 877 + ->mnt_share. 878 + 879 + All vfsmounts with the same ->mnt_master form on a cyclic list anchored 880 + in ->mnt_master->mnt_slave_list and going through ->mnt_slave. 881 + 882 + ->mnt_master can point to arbitrary (and possibly different) members 883 + of master peer group. To find all immediate slaves of a peer group 884 + you need to go through _all_ ->mnt_slave_list of its members. 885 + Conceptually it's just a single set - distribution among the 886 + individual lists does not affect propagation or the way propagation 887 + tree is modified by operations. 888 + 889 + All vfsmounts in a peer group have the same ->mnt_master. If it is 890 + non-NULL, they form a contiguous (ordered) segment of slave list. 891 + 892 + A example propagation tree looks as shown in the figure below. 893 + [ NOTE: Though it looks like a forest, if we consider all the shared 894 + mounts as a conceptual entity called 'pnode', it becomes a tree]:: 895 + 896 + 897 + A <--> B <--> C <---> D 898 + /|\ /| |\ 899 + / F G J K H I 900 + / 901 + E<-->K 902 + /|\ 903 + M L N 904 + 905 + In the above figure A,B,C and D all are shared and propagate to each 906 + other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave 907 + mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'. 908 + 'E' is also shared with 'K' and they propagate to each other. 
And 909 + 'K' has 3 slaves 'M', 'L' and 'N' 910 + 911 + A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D' 912 + 913 + A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' 914 + 915 + E's ->mnt_share links with ->mnt_share of K 916 + 917 + 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A' 918 + 919 + 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' 920 + 921 + K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' 922 + 923 + C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' 924 + 925 + J and K's ->mnt_master points to struct vfsmount of C 926 + 927 + and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' 928 + 929 + 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. 930 + 931 + 932 + NOTE: The propagation tree is orthogonal to the mount tree. 933 + 934 + 8B Locking: 935 + 936 + ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected 937 + by namespace_sem (exclusive for modifications, shared for reading). 938 + 939 + Normally we have ->mnt_flags modifications serialized by vfsmount_lock. 940 + There are two exceptions: do_add_mount() and clone_mnt(). 941 + The former modifies a vfsmount that has not been visible in any shared 942 + data structures yet. 943 + The latter holds namespace_sem and the only references to vfsmount 944 + are in lists that can't be traversed without namespace_sem. 945 + 946 + 8C Algorithm: 947 + 948 + The crux of the implementation resides in rbind/move operation. 949 + 950 + The overall algorithm breaks the operation into 3 phases: (look at 951 + attach_recursive_mnt() and propagate_mnt()) 952 + 953 + 1. prepare phase. 954 + 2. commit phases. 955 + 3. abort phases. 956 + 957 + Prepare phase: 958 + 959 + for each mount in the source tree: 960 + 961 + a) Create the necessary number of mount trees to 962 + be attached to each of the mounts that receive 963 + propagation from the destination mount. 
964 + b) Do not attach any of the trees to its destination. 965 + However note down its ->mnt_parent and ->mnt_mountpoint 966 + c) Link all the new mounts to form a propagation tree that 967 + is identical to the propagation tree of the destination 968 + mount. 969 + 970 + If this phase is successful, there should be 'n' new 971 + propagation trees; where 'n' is the number of mounts in the 972 + source tree. Go to the commit phase 973 + 974 + Also there should be 'm' new mount trees, where 'm' is 975 + the number of mounts to which the destination mount 976 + propagates to. 977 + 978 + if any memory allocations fail, go to the abort phase. 979 + 980 + Commit phase 981 + attach each of the mount trees to their corresponding 982 + destination mounts. 983 + 984 + Abort phase 985 + delete all the newly created trees. 986 + 987 + .. Note:: 988 + all the propagation related functionality resides in the file pnode.c 989 + 990 + 991 + ------------------------------------------------------------------------ 992 + 993 + version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com) 994 + 995 + version 0.2 (Incorporated comments from Al Viro)

-939

Documentation/filesystems/sharedsubtree.txt

··· 1 - Shared Subtrees 2 - --------------- 3 - 4 - Contents: 5 - 1) Overview 6 - 2) Features 7 - 3) Setting mount states 8 - 4) Use-case 9 - 5) Detailed semantics 10 - 6) Quiz 11 - 7) FAQ 12 - 8) Implementation 13 - 14 - 15 - 1) Overview 16 - ----------- 17 - 18 - Consider the following situation: 19 - 20 - A process wants to clone its own namespace, but still wants to access the CD 21 - that got mounted recently. Shared subtree semantics provide the necessary 22 - mechanism to accomplish the above. 23 - 24 - It provides the necessary building blocks for features like per-user-namespace 25 - and versioned filesystem. 26 - 27 - 2) Features 28 - ----------- 29 - 30 - Shared subtree provides four different flavors of mounts; struct vfsmount to be 31 - precise 32 - 33 - a. shared mount 34 - b. slave mount 35 - c. private mount 36 - d. unbindable mount 37 - 38 - 39 - 2a) A shared mount can be replicated to as many mountpoints and all the 40 - replicas continue to be exactly same. 41 - 42 - Here is an example: 43 - 44 - Let's say /mnt has a mount that is shared. 45 - mount --make-shared /mnt 46 - 47 - Note: mount(8) command now supports the --make-shared flag, 48 - so the sample 'smount' program is no longer needed and has been 49 - removed. 50 - 51 - # mount --bind /mnt /tmp 52 - The above command replicates the mount at /mnt to the mountpoint /tmp 53 - and the contents of both the mounts remain identical. 54 - 55 - #ls /mnt 56 - a b c 57 - 58 - #ls /tmp 59 - a b c 60 - 61 - Now let's say we mount a device at /tmp/a 62 - # mount /dev/sd0 /tmp/a 63 - 64 - #ls /tmp/a 65 - t1 t2 t3 66 - 67 - #ls /mnt/a 68 - t1 t2 t3 69 - 70 - Note that the mount has propagated to the mount at /mnt as well. 71 - 72 - And the same is true even when /dev/sd0 is mounted on /mnt/a. The 73 - contents will be visible under /tmp/a too. 74 - 75 - 76 - 2b) A slave mount is like a shared mount except that mount and umount events 77 - only propagate towards it. 
78 - 79 - All slave mounts have a master mount which is a shared. 80 - 81 - Here is an example: 82 - 83 - Let's say /mnt has a mount which is shared. 84 - # mount --make-shared /mnt 85 - 86 - Let's bind mount /mnt to /tmp 87 - # mount --bind /mnt /tmp 88 - 89 - the new mount at /tmp becomes a shared mount and it is a replica of 90 - the mount at /mnt. 91 - 92 - Now let's make the mount at /tmp; a slave of /mnt 93 - # mount --make-slave /tmp 94 - 95 - let's mount /dev/sd0 on /mnt/a 96 - # mount /dev/sd0 /mnt/a 97 - 98 - #ls /mnt/a 99 - t1 t2 t3 100 - 101 - #ls /tmp/a 102 - t1 t2 t3 103 - 104 - Note the mount event has propagated to the mount at /tmp 105 - 106 - However let's see what happens if we mount something on the mount at /tmp 107 - 108 - # mount /dev/sd1 /tmp/b 109 - 110 - #ls /tmp/b 111 - s1 s2 s3 112 - 113 - #ls /mnt/b 114 - 115 - Note how the mount event has not propagated to the mount at 116 - /mnt 117 - 118 - 119 - 2c) A private mount does not forward or receive propagation. 120 - 121 - This is the mount we are familiar with. Its the default type. 122 - 123 - 124 - 2d) A unbindable mount is a unbindable private mount 125 - 126 - let's say we have a mount at /mnt and we make it unbindable 127 - 128 - # mount --make-unbindable /mnt 129 - 130 - Let's try to bind mount this mount somewhere else. 131 - # mount --bind /mnt /tmp 132 - mount: wrong fs type, bad option, bad superblock on /mnt, 133 - or too many mounted file systems 134 - 135 - Binding a unbindable mount is a invalid operation. 136 - 137 - 138 - 3) Setting mount states 139 - 140 - The mount command (util-linux package) can be used to set mount 141 - states: 142 - 143 - mount --make-shared mountpoint 144 - mount --make-slave mountpoint 145 - mount --make-private mountpoint 146 - mount --make-unbindable mountpoint 147 - 148 - 149 - 4) Use cases 150 - ------------ 151 - 152 - A) A process wants to clone its own namespace, but still wants to 153 - access the CD that got mounted recently. 
154 - 155 - Solution: 156 - 157 - The system administrator can make the mount at /cdrom shared 158 - mount --bind /cdrom /cdrom 159 - mount --make-shared /cdrom 160 - 161 - Now any process that clones off a new namespace will have a 162 - mount at /cdrom which is a replica of the same mount in the 163 - parent namespace. 164 - 165 - So when a CD is inserted and mounted at /cdrom that mount gets 166 - propagated to the other mount at /cdrom in all the other clone 167 - namespaces. 168 - 169 - B) A process wants its mounts invisible to any other process, but 170 - still be able to see the other system mounts. 171 - 172 - Solution: 173 - 174 - To begin with, the administrator can mark the entire mount tree 175 - as shareable. 176 - 177 - mount --make-rshared / 178 - 179 - A new process can clone off a new namespace. And mark some part 180 - of its namespace as slave 181 - 182 - mount --make-rslave /myprivatetree 183 - 184 - Hence forth any mounts within the /myprivatetree done by the 185 - process will not show up in any other namespace. However mounts 186 - done in the parent namespace under /myprivatetree still shows 187 - up in the process's namespace. 188 - 189 - 190 - Apart from the above semantics this feature provides the 191 - building blocks to solve the following problems: 192 - 193 - C) Per-user namespace 194 - 195 - The above semantics allows a way to share mounts across 196 - namespaces. But namespaces are associated with processes. If 197 - namespaces are made first class objects with user API to 198 - associate/disassociate a namespace with userid, then each user 199 - could have his/her own namespace and tailor it to his/her 200 - requirements. This needs to be supported in PAM. 201 - 202 - D) Versioned files 203 - 204 - If the entire mount tree is visible at multiple locations, then 205 - an underlying versioning file system can return different 206 - versions of the file depending on the path used to access that 207 - file. 
208 - 209 - An example is: 210 - 211 - mount --make-shared / 212 - mount --rbind / /view/v1 213 - mount --rbind / /view/v2 214 - mount --rbind / /view/v3 215 - mount --rbind / /view/v4 216 - 217 - and if /usr has a versioning filesystem mounted, then that 218 - mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and 219 - /view/v4/usr too 220 - 221 - A user can request v3 version of the file /usr/fs/namespace.c 222 - by accessing /view/v3/usr/fs/namespace.c . The underlying 223 - versioning filesystem can then decipher that v3 version of the 224 - filesystem is being requested and return the corresponding 225 - inode. 226 - 227 - 5) Detailed semantics: 228 - ------------------- 229 - The section below explains the detailed semantics of 230 - bind, rbind, move, mount, umount and clone-namespace operations. 231 - 232 - Note: the word 'vfsmount' and the noun 'mount' have been used 233 - to mean the same thing, throughout this document. 234 - 235 - 5a) Mount states 236 - 237 - A given mount can be in one of the following states 238 - 1) shared 239 - 2) slave 240 - 3) shared and slave 241 - 4) private 242 - 5) unbindable 243 - 244 - A 'propagation event' is defined as event generated on a vfsmount 245 - that leads to mount or unmount actions in other vfsmounts. 246 - 247 - A 'peer group' is defined as a group of vfsmounts that propagate 248 - events to each other. 249 - 250 - (1) Shared mounts 251 - 252 - A 'shared mount' is defined as a vfsmount that belongs to a 253 - 'peer group'. 254 - 255 - For example: 256 - mount --make-shared /mnt 257 - mount --bind /mnt /tmp 258 - 259 - The mount at /mnt and that at /tmp are both shared and belong 260 - to the same peer group. Anything mounted or unmounted under 261 - /mnt or /tmp reflect in all the other mounts of its peer 262 - group. 263 - 264 - 265 - (2) Slave mounts 266 - 267 - A 'slave mount' is defined as a vfsmount that receives 268 - propagation events and does not forward propagation events. 
269 - 270 - A slave mount as the name implies has a master mount from which 271 - mount/unmount events are received. Events do not propagate from 272 - the slave mount to the master. Only a shared mount can be made 273 - a slave by executing the following command 274 - 275 - mount --make-slave mount 276 - 277 - A shared mount that is made as a slave is no more shared unless 278 - modified to become shared. 279 - 280 - (3) Shared and Slave 281 - 282 - A vfsmount can be both shared as well as slave. This state 283 - indicates that the mount is a slave of some vfsmount, and 284 - has its own peer group too. This vfsmount receives propagation 285 - events from its master vfsmount, and also forwards propagation 286 - events to its 'peer group' and to its slave vfsmounts. 287 - 288 - Strictly speaking, the vfsmount is shared having its own 289 - peer group, and this peer-group is a slave of some other 290 - peer group. 291 - 292 - Only a slave vfsmount can be made as 'shared and slave' by 293 - either executing the following command 294 - mount --make-shared mount 295 - or by moving the slave vfsmount under a shared vfsmount. 296 - 297 - (4) Private mount 298 - 299 - A 'private mount' is defined as vfsmount that does not 300 - receive or forward any propagation events. 301 - 302 - (5) Unbindable mount 303 - 304 - A 'unbindable mount' is defined as vfsmount that does not 305 - receive or forward any propagation events and cannot 306 - be bind mounted. 307 - 308 - 309 - State diagram: 310 - The state diagram below explains the state transition of a mount, 311 - in response to various commands. 
312 -     ------------------------------------------------------------------------
313 -     |             |make-shared | make-slave   | make-private |make-unbindab|
314 -     --------------|------------|--------------|--------------|-------------|
315 -     |shared       |shared      |*slave/private|   private    | unbindable  |
316 -     |             |            |              |              |             |
317 -     |-------------|------------|--------------|--------------|-------------|
318 -     |slave        |shared      |    **slave   |    private   | unbindable  |
319 -     |             |and slave   |              |              |             |
320 -     |-------------|------------|--------------|--------------|-------------|
321 -     |shared       |shared      |    slave     |    private   | unbindable  |
322 -     |and slave    |and slave   |              |              |             |
323 -     |-------------|------------|--------------|--------------|-------------|
324 -     |private      |shared      |  **private   |    private   | unbindable  |
325 -     |-------------|------------|--------------|--------------|-------------|
326 -     |unbindable   |shared      |**unbindable  |    private   | unbindable  |
327 -     ------------------------------------------------------------------------
328 -
329 -     * if the shared mount is the only mount in its peer group, making it
330 -     slave, makes it private automatically. Note that there is no master to
331 -     which it can be slaved to.
332 -
333 -     ** slaving a non-shared mount has no effect on the mount.
334 -
335 -     Apart from the commands listed below, the 'move' operation also changes
336 -     the state of a mount depending on type of the destination mount. Its
337 -     explained in section 5d.
338 -
339 - 5b) Bind semantics
340 -
341 -     Consider the following command
342 -
343 -         mount --bind A/a B/b
344 -
345 -     where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
346 -     is the destination mount and 'b' is the dentry in the destination mount.
347 -
348 -     The outcome depends on the type of mount of 'A' and 'B'. The table
349 -     below contains quick reference.
350 - --------------------------------------------------------------------------- 351 - | BIND MOUNT OPERATION | 352 - |************************************************************************** 353 - |source(A)->| shared | private | slave | unbindable | 354 - | dest(B) | | | | | 355 - | | | | | | | 356 - | v | | | | | 357 - |************************************************************************** 358 - | shared | shared | shared | shared & slave | invalid | 359 - | | | | | | 360 - |non-shared| shared | private | slave | invalid | 361 - *************************************************************************** 362 - 363 - Details: 364 - 365 - 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C' 366 - which is clone of 'A', is created. Its root dentry is 'a' . 'C' is 367 - mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... 368 - are created and mounted at the dentry 'b' on all mounts where 'B' 369 - propagates to. A new propagation tree containing 'C1',..,'Cn' is 370 - created. This propagation tree is identical to the propagation tree of 371 - 'B'. And finally the peer-group of 'C' is merged with the peer group 372 - of 'A'. 373 - 374 - 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C' 375 - which is clone of 'A', is created. Its root dentry is 'a'. 'C' is 376 - mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... 377 - are created and mounted at the dentry 'b' on all mounts where 'B' 378 - propagates to. A new propagation tree is set containing all new mounts 379 - 'C', 'C1', .., 'Cn' with exactly the same configuration as the 380 - propagation tree for 'B'. 381 - 382 - 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new 383 - mount 'C' which is clone of 'A', is created. Its root dentry is 'a' . 384 - 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 385 - 'C3' ... are created and mounted at the dentry 'b' on all mounts where 386 - 'B' propagates to. 
A new propagation tree containing the new mounts 387 - 'C','C1',.. 'Cn' is created. This propagation tree is identical to the 388 - propagation tree for 'B'. And finally the mount 'C' and its peer group 389 - is made the slave of mount 'Z'. In other words, mount 'C' is in the 390 - state 'slave and shared'. 391 - 392 - 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a 393 - invalid operation. 394 - 395 - 5. 'A' is a private mount and 'B' is a non-shared(private or slave or 396 - unbindable) mount. A new mount 'C' which is clone of 'A', is created. 397 - Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. 398 - 399 - 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C' 400 - which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is 401 - mounted on mount 'B' at dentry 'b'. 'C' is made a member of the 402 - peer-group of 'A'. 403 - 404 - 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A 405 - new mount 'C' which is a clone of 'A' is created. Its root dentry is 406 - 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a 407 - slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of 408 - 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But 409 - mount/unmount on 'A' do not propagate anywhere else. Similarly 410 - mount/unmount on 'C' do not propagate anywhere else. 411 - 412 - 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a 413 - invalid operation. A unbindable mount cannot be bind mounted. 414 - 415 - 5c) Rbind semantics 416 - 417 - rbind is same as bind. Bind replicates the specified mount. Rbind 418 - replicates all the mounts in the tree belonging to the specified mount. 419 - Rbind mount is bind mount applied to all the mounts in the tree. 420 - 421 - If the source tree that is rbind has some unbindable mounts, 422 - then the subtree under the unbindable mount is pruned in the new 423 - location. 
424 - 425 - eg: let's say we have the following mount tree. 426 - 427 - A 428 - / \ 429 - B C 430 - / \ / \ 431 - D E F G 432 - 433 - Let's say all the mount except the mount C in the tree are 434 - of a type other than unbindable. 435 - 436 - If this tree is rbound to say Z 437 - 438 - We will have the following tree at the new location. 439 - 440 - Z 441 - | 442 - A' 443 - / 444 - B' Note how the tree under C is pruned 445 - / \ in the new location. 446 - D' E' 447 - 448 - 449 - 450 - 5d) Move semantics 451 - 452 - Consider the following command 453 - 454 - mount --move A B/b 455 - 456 - where 'A' is the source mount, 'B' is the destination mount and 'b' is 457 - the dentry in the destination mount. 458 - 459 - The outcome depends on the type of the mount of 'A' and 'B'. The table 460 - below is a quick reference. 461 - --------------------------------------------------------------------------- 462 - | MOVE MOUNT OPERATION | 463 - |************************************************************************** 464 - | source(A)->| shared | private | slave | unbindable | 465 - | dest(B) | | | | | 466 - | | | | | | | 467 - | v | | | | | 468 - |************************************************************************** 469 - | shared | shared | shared |shared and slave| invalid | 470 - | | | | | | 471 - |non-shared| shared | private | slave | unbindable | 472 - *************************************************************************** 473 - NOTE: moving a mount residing under a shared mount is invalid. 474 - 475 - Details follow: 476 - 477 - 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is 478 - mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An' 479 - are created and mounted at dentry 'b' on all mounts that receive 480 - propagation from mount 'B'. A new propagation tree is created in the 481 - exact same configuration as that of 'B'. This new propagation tree 482 - contains all the new mounts 'A1', 'A2'... 'An'. 
And this new 483 - propagation tree is appended to the already existing propagation tree 484 - of 'A'. 485 - 486 - 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is 487 - mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An' 488 - are created and mounted at dentry 'b' on all mounts that receive 489 - propagation from mount 'B'. The mount 'A' becomes a shared mount and a 490 - propagation tree is created which is identical to that of 491 - 'B'. This new propagation tree contains all the new mounts 'A1', 492 - 'A2'... 'An'. 493 - 494 - 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The 495 - mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 496 - 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that 497 - receive propagation from mount 'B'. A new propagation tree is created 498 - in the exact same configuration as that of 'B'. This new propagation 499 - tree contains all the new mounts 'A1', 'A2'... 'An'. And this new 500 - propagation tree is appended to the already existing propagation tree of 501 - 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also 502 - becomes 'shared'. 503 - 504 - 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation 505 - is invalid. Because mounting anything on the shared mount 'B' can 506 - create new mounts that get mounted on the mounts that receive 507 - propagation from 'B'. And since the mount 'A' is unbindable, cloning 508 - it to mount at other mountpoints is not possible. 509 - 510 - 5. 'A' is a private mount and 'B' is a non-shared(private or slave or 511 - unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. 512 - 513 - 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' 514 - is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a 515 - shared mount. 516 - 517 - 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. 
518 - The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' 519 - continues to be a slave mount of mount 'Z'. 520 - 521 - 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount 522 - 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a 523 - unbindable mount. 524 - 525 - 5e) Mount semantics 526 - 527 - Consider the following command 528 - 529 - mount device B/b 530 - 531 - 'B' is the destination mount and 'b' is the dentry in the destination 532 - mount. 533 - 534 - The above operation is the same as bind operation with the exception 535 - that the source mount is always a private mount. 536 - 537 - 538 - 5f) Unmount semantics 539 - 540 - Consider the following command 541 - 542 - umount A 543 - 544 - where 'A' is a mount mounted on mount 'B' at dentry 'b'. 545 - 546 - If mount 'B' is shared, then all most-recently-mounted mounts at dentry 547 - 'b' on mounts that receive propagation from mount 'B' and does not have 548 - sub-mounts within them are unmounted. 549 - 550 - Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to 551 - each other. 552 - 553 - let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount 554 - 'B1', 'B2' and 'B3' respectively. 555 - 556 - let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on 557 - mount 'B1', 'B2' and 'B3' respectively. 558 - 559 - if 'C1' is unmounted, all the mounts that are most-recently-mounted on 560 - 'B1' and on the mounts that 'B1' propagates-to are unmounted. 561 - 562 - 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount 563 - on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'. 564 - 565 - So all 'C1', 'C2' and 'C3' should be unmounted. 566 - 567 - If any of 'C2' or 'C3' has some child mounts, then that mount is not 568 - unmounted, but all other mounts are unmounted. However if 'C1' is told 569 - to be unmounted and 'C1' has some sub-mounts, the umount operation is 570 - failed entirely. 
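The unmount rule in section 5f can be made concrete with a small model. The following is only a sketch in Python, not kernel code: the `Mount` class, the `umount_propagated` function, and the representation of each parent mount as a stack of mounts at dentry 'b' are all hypothetical names invented for illustration. It assumes that every parent listed in `stacks` receives propagation from the origin, as in the 'B1', 'B2', 'B3' example.

```python
from dataclasses import dataclass


@dataclass
class Mount:
    """Hypothetical stand-in for a mount stacked at dentry 'b'."""
    name: str
    has_submounts: bool = False


def umount_propagated(stacks, origin):
    """Unmount the most recent mount at 'b' on `origin` and propagate.

    `stacks` maps each parent mount name to the list of mounts stacked at
    dentry 'b' (last element = most recently mounted). All parents are
    assumed to be peers that receive propagation from `origin`.
    """
    target = stacks[origin][-1]
    if target.has_submounts:
        # The mount the user asked to unmount has sub-mounts:
        # the whole operation fails and nothing is changed.
        raise RuntimeError(f"{target.name} has sub-mounts; umount fails")

    removed = [stacks[origin].pop().name]
    for parent, stack in stacks.items():
        if parent == origin:
            continue
        # On each receiving mount, only the most recently mounted mount
        # is a candidate, and it is skipped if it has sub-mounts.
        if stack and not stack[-1].has_submounts:
            removed.append(stack.pop().name)
    return removed


if __name__ == "__main__":
    stacks = {
        "B1": [Mount("A1"), Mount("C1")],
        "B2": [Mount("A2"), Mount("C2")],
        "B3": [Mount("A3"), Mount("C3", has_submounts=True)],
    }
    print(umount_propagated(stacks, "B1"))  # prints ['C1', 'C2']
```

In this run 'C3' stays mounted because it has a child mount, while 'C1' and 'C2' are removed, which matches the last paragraph of the section.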
571 - 572 - 5g) Clone Namespace 573 - 574 - A cloned namespace contains all the mounts as that of the parent 575 - namespace. 576 - 577 - Let's say 'A' and 'B' are the corresponding mounts in the parent and the 578 - child namespace. 579 - 580 - If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to 581 - each other. 582 - 583 - If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of 584 - 'Z'. 585 - 586 - If 'A' is a private mount, then 'B' is a private mount too. 587 - 588 - If 'A' is unbindable mount, then 'B' is a unbindable mount too. 589 - 590 - 591 - 6) Quiz 592 - 593 - A. What is the result of the following command sequence? 594 - 595 - mount --bind /mnt /mnt 596 - mount --make-shared /mnt 597 - mount --bind /mnt /tmp 598 - mount --move /tmp /mnt/1 599 - 600 - what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? 601 - Should they all be identical? or should /mnt and /mnt/1 be 602 - identical only? 603 - 604 - 605 - B. What is the result of the following command sequence? 606 - 607 - mount --make-rshared / 608 - mkdir -p /v/1 609 - mount --rbind / /v/1 610 - 611 - what should be the content of /v/1/v/1 be? 612 - 613 - 614 - C. What is the result of the following command sequence? 615 - 616 - mount --bind /mnt /mnt 617 - mount --make-shared /mnt 618 - mkdir -p /mnt/1/2/3 /mnt/1/test 619 - mount --bind /mnt/1 /tmp 620 - mount --make-slave /mnt 621 - mount --make-shared /mnt 622 - mount --bind /mnt/1/2 /tmp1 623 - mount --make-slave /mnt 624 - 625 - At this point we have the first mount at /tmp and 626 - its root dentry is 1. Let's call this mount 'A' 627 - And then we have a second mount at /tmp1 with root 628 - dentry 2. Let's call this mount 'B' 629 - Next we have a third mount at /mnt with root dentry 630 - mnt. 
Let's call this mount 'C' 631 - 632 - 'B' is the slave of 'A' and 'C' is a slave of 'B' 633 - A -> B -> C 634 - 635 - at this point if we execute the following command 636 - 637 - mount --bind /bin /tmp/test 638 - 639 - The mount is attempted on 'A' 640 - 641 - will the mount propagate to 'B' and 'C' ? 642 - 643 - what would be the contents of 644 - /mnt/1/test be? 645 - 646 - 7) FAQ 647 - 648 - Q1. Why is bind mount needed? How is it different from symbolic links? 649 - symbolic links can get stale if the destination mount gets 650 - unmounted or moved. Bind mounts continue to exist even if the 651 - other mount is unmounted or moved. 652 - 653 - Q2. Why can't the shared subtree be implemented using exportfs? 654 - 655 - exportfs is a heavyweight way of accomplishing part of what 656 - shared subtree can do. I cannot imagine a way to implement the 657 - semantics of slave mount using exportfs? 658 - 659 - Q3 Why is unbindable mount needed? 660 - 661 - Let's say we want to replicate the mount tree at multiple 662 - locations within the same subtree. 663 - 664 - if one rbind mounts a tree within the same subtree 'n' times 665 - the number of mounts created is an exponential function of 'n'. 666 - Having unbindable mount can help prune the unneeded bind 667 - mounts. Here is an example. 668 - 669 - step 1: 670 - let's say the root tree has just two directories with 671 - one vfsmount. 
672 - root 673 - / \ 674 - tmp usr 675 - 676 - And we want to replicate the tree at multiple 677 - mountpoints under /root/tmp 678 - 679 - step2: 680 - mount --make-shared /root 681 - 682 - mkdir -p /tmp/m1 683 - 684 - mount --rbind /root /tmp/m1 685 - 686 - the new tree now looks like this: 687 - 688 - root 689 - / \ 690 - tmp usr 691 - / 692 - m1 693 - / \ 694 - tmp usr 695 - / 696 - m1 697 - 698 - it has two vfsmounts 699 - 700 - step3: 701 - mkdir -p /tmp/m2 702 - mount --rbind /root /tmp/m2 703 - 704 - the new tree now looks like this: 705 - 706 - root 707 - / \ 708 - tmp usr 709 - / \ 710 - m1 m2 711 - / \ / \ 712 - tmp usr tmp usr 713 - / \ / 714 - m1 m2 m1 715 - / \ / \ 716 - tmp usr tmp usr 717 - / / \ 718 - m1 m1 m2 719 - / \ 720 - tmp usr 721 - / \ 722 - m1 m2 723 - 724 - it has 6 vfsmounts 725 - 726 - step 4: 727 - mkdir -p /tmp/m3 728 - mount --rbind /root /tmp/m3 729 - 730 - I won't draw the tree..but it has 24 vfsmounts 731 - 732 - 733 - at step i the number of vfsmounts is V[i] = i*V[i-1]. 734 - This is an exponential function. And this tree has way more 735 - mounts than what we really needed in the first place. 736 - 737 - One could use a series of umount at each step to prune 738 - out the unneeded mounts. But there is a better solution. 739 - Unclonable mounts come in handy here. 740 - 741 - step 1: 742 - let's say the root tree has just two directories with 743 - one vfsmount. 
744 - root 745 - / \ 746 - tmp usr 747 - 748 - How do we set up the same tree at multiple locations under 749 - /root/tmp 750 - 751 - step2: 752 - mount --bind /root/tmp /root/tmp 753 - 754 - mount --make-rshared /root 755 - mount --make-unbindable /root/tmp 756 - 757 - mkdir -p /tmp/m1 758 - 759 - mount --rbind /root /tmp/m1 760 - 761 - the new tree now looks like this: 762 - 763 - root 764 - / \ 765 - tmp usr 766 - / 767 - m1 768 - / \ 769 - tmp usr 770 - 771 - step3: 772 - mkdir -p /tmp/m2 773 - mount --rbind /root /tmp/m2 774 - 775 - the new tree now looks like this: 776 - 777 - root 778 - / \ 779 - tmp usr 780 - / \ 781 - m1 m2 782 - / \ / \ 783 - tmp usr tmp usr 784 - 785 - step4: 786 - 787 - mkdir -p /tmp/m3 788 - mount --rbind /root /tmp/m3 789 - 790 - the new tree now looks like this: 791 - 792 - root 793 - / \ 794 - tmp usr 795 - / \ \ 796 - m1 m2 m3 797 - / \ / \ / \ 798 - tmp usr tmp usr tmp usr 799 - 800 - 8) Implementation 801 - 802 - 8A) Datastructure 803 - 804 - 4 new fields are introduced to struct vfsmount 805 - ->mnt_share 806 - ->mnt_slave_list 807 - ->mnt_slave 808 - ->mnt_master 809 - 810 - ->mnt_share links together all the mount to/from which this vfsmount 811 - send/receives propagation events. 812 - 813 - ->mnt_slave_list links all the mounts to which this vfsmount propagates 814 - to. 815 - 816 - ->mnt_slave links together all the slaves that its master vfsmount 817 - propagates to. 818 - 819 - ->mnt_master points to the master vfsmount from which this vfsmount 820 - receives propagation. 821 - 822 - ->mnt_flags takes two more flags to indicate the propagation status of 823 - the vfsmount. MNT_SHARE indicates that the vfsmount is a shared 824 - vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be 825 - replicated. 826 - 827 - All the shared vfsmounts in a peer group form a cyclic list through 828 - ->mnt_share. 
829 - 830 - All vfsmounts with the same ->mnt_master form on a cyclic list anchored 831 - in ->mnt_master->mnt_slave_list and going through ->mnt_slave. 832 - 833 - ->mnt_master can point to arbitrary (and possibly different) members 834 - of master peer group. To find all immediate slaves of a peer group 835 - you need to go through _all_ ->mnt_slave_list of its members. 836 - Conceptually it's just a single set - distribution among the 837 - individual lists does not affect propagation or the way propagation 838 - tree is modified by operations. 839 - 840 - All vfsmounts in a peer group have the same ->mnt_master. If it is 841 - non-NULL, they form a contiguous (ordered) segment of slave list. 842 - 843 - A example propagation tree looks as shown in the figure below. 844 - [ NOTE: Though it looks like a forest, if we consider all the shared 845 - mounts as a conceptual entity called 'pnode', it becomes a tree] 846 - 847 - 848 - A <--> B <--> C <---> D 849 - /|\ /| |\ 850 - / F G J K H I 851 - / 852 - E<-->K 853 - /|\ 854 - M L N 855 - 856 - In the above figure A,B,C and D all are shared and propagate to each 857 - other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave 858 - mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'. 859 - 'E' is also shared with 'K' and they propagate to each other. 
And 860 - 'K' has 3 slaves 'M', 'L' and 'N' 861 - 862 - A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D' 863 - 864 - A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' 865 - 866 - E's ->mnt_share links with ->mnt_share of K 867 - 'E', 'K', 'F', 'G' have their ->mnt_master point to struct 868 - vfsmount of 'A' 869 - 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' 870 - K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' 871 - 872 - C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' 873 - J and K's ->mnt_master points to struct vfsmount of C 874 - and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' 875 - 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. 876 - 877 - 878 - NOTE: The propagation tree is orthogonal to the mount tree. 879 - 880 - 8B Locking: 881 - 882 - ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected 883 - by namespace_sem (exclusive for modifications, shared for reading). 884 - 885 - Normally we have ->mnt_flags modifications serialized by vfsmount_lock. 886 - There are two exceptions: do_add_mount() and clone_mnt(). 887 - The former modifies a vfsmount that has not been visible in any shared 888 - data structures yet. 889 - The latter holds namespace_sem and the only references to vfsmount 890 - are in lists that can't be traversed without namespace_sem. 891 - 892 - 8C Algorithm: 893 - 894 - The crux of the implementation resides in rbind/move operation. 895 - 896 - The overall algorithm breaks the operation into 3 phases: (look at 897 - attach_recursive_mnt() and propagate_mnt()) 898 - 899 - 1. prepare phase. 900 - 2. commit phases. 901 - 3. abort phases. 902 - 903 - Prepare phase: 904 - 905 - for each mount in the source tree: 906 - a) Create the necessary number of mount trees to 907 - be attached to each of the mounts that receive 908 - propagation from the destination mount. 
909 - b) Do not attach any of the trees to its destination. 910 - However note down its ->mnt_parent and ->mnt_mountpoint 911 - c) Link all the new mounts to form a propagation tree that 912 - is identical to the propagation tree of the destination 913 - mount. 914 - 915 - If this phase is successful, there should be 'n' new 916 - propagation trees; where 'n' is the number of mounts in the 917 - source tree. Go to the commit phase 918 - 919 - Also there should be 'm' new mount trees, where 'm' is 920 - the number of mounts to which the destination mount 921 - propagates to. 922 - 923 - if any memory allocations fail, go to the abort phase. 924 - 925 - Commit phase 926 - attach each of the mount trees to their corresponding 927 - destination mounts. 928 - 929 - Abort phase 930 - delete all the newly created trees. 931 - 932 - NOTE: all the propagation related functionality resides in the file 933 - pnode.c 934 - 935 - 936 - ------------------------------------------------------------------------ 937 - 938 - version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com) 939 - version 0.2 (Incorporated comments from Al Viro)
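As a reading aid for the state diagram in section 5a of the removed shared-subtrees document, the table can be written down as a lookup structure. This is a minimal Python sketch, not the kernel's implementation: `TRANSITIONS`, `next_state` and the `alone_in_peer_group` parameter are hypothetical names chosen here. The parameter encodes the first footnote of the table (a shared mount that is the only member of its peer group becomes private when slaved).

```python
SHARED = "shared"
SLAVE = "slave"
SHARED_AND_SLAVE = "shared and slave"
PRIVATE = "private"
UNBINDABLE = "unbindable"

# Rows are the current state, columns are the mount(8) command,
# transcribed from the state diagram in section 5a.
TRANSITIONS = {
    SHARED: {
        "make-shared": SHARED,
        "make-slave": SLAVE,  # or private, see next_state()
        "make-private": PRIVATE,
        "make-unbindable": UNBINDABLE,
    },
    SLAVE: {
        "make-shared": SHARED_AND_SLAVE,
        "make-slave": SLAVE,  # slaving a non-shared mount has no effect
        "make-private": PRIVATE,
        "make-unbindable": UNBINDABLE,
    },
    SHARED_AND_SLAVE: {
        "make-shared": SHARED_AND_SLAVE,
        "make-slave": SLAVE,
        "make-private": PRIVATE,
        "make-unbindable": UNBINDABLE,
    },
    PRIVATE: {
        "make-shared": SHARED,
        "make-slave": PRIVATE,  # no effect
        "make-private": PRIVATE,
        "make-unbindable": UNBINDABLE,
    },
    UNBINDABLE: {
        "make-shared": SHARED,
        "make-slave": UNBINDABLE,  # no effect
        "make-private": PRIVATE,
        "make-unbindable": UNBINDABLE,
    },
}


def next_state(state, command, alone_in_peer_group=False):
    """Return the propagation type after applying `command` to `state`."""
    if state == SHARED and command == "make-slave" and alone_in_peer_group:
        # There is no master to be slaved to, so the mount becomes private.
        return PRIVATE
    return TRANSITIONS[state][command]
```

For example, `next_state("slave", "make-shared")` gives `"shared and slave"`, and `next_state("shared", "make-slave", alone_in_peer_group=True)` gives `"private"`. The 'move' operation also changes state, but it is not covered by this table.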

-521

Documentation/filesystems/spufs.txt

··· 1 - SPUFS(2) Linux Programmer's Manual SPUFS(2) 2 - 3 - 4 - 5 - NAME 6 - spufs - the SPU file system 7 - 8 - 9 - DESCRIPTION 10 - The SPU file system is used on PowerPC machines that implement the Cell 11 - Broadband Engine Architecture in order to access Synergistic Processor 12 - Units (SPUs). 13 - 14 - The file system provides a name space similar to posix shared memory or 15 - message queues. Users that have write permissions on the file system 16 - can use spu_create(2) to establish SPU contexts in the spufs root. 17 - 18 - Every SPU context is represented by a directory containing a predefined 19 - set of files. These files can be used for manipulating the state of the 20 - logical SPU. Users can change permissions on those files, but not actu- 21 - ally add or remove files. 22 - 23 - 24 - MOUNT OPTIONS 25 - uid=<uid> 26 - set the user owning the mount point, the default is 0 (root). 27 - 28 - gid=<gid> 29 - set the group owning the mount point, the default is 0 (root). 30 - 31 - 32 - FILES 33 - The files in spufs mostly follow the standard behavior for regular sys- 34 - tem calls like read(2) or write(2), but often support only a subset of 35 - the operations supported on regular file systems. This list details the 36 - supported operations and the deviations from the behaviour in the 37 - respective man pages. 38 - 39 - All files that support the read(2) operation also support readv(2) and 40 - all files that support the write(2) operation also support writev(2). 41 - All files support the access(2) and stat(2) family of operations, but 42 - only the st_mode, st_nlink, st_uid and st_gid fields of struct stat 43 - contain reliable information. 44 - 45 - All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera- 46 - tions, but will not be able to grant permissions that contradict the 47 - possible operations, e.g. read access on the wbox file. 
48 - 49 - The current set of files is: 50 - 51 - 52 - /mem 53 - the contents of the local storage memory of the SPU. This can be 54 - accessed like a regular shared memory file and contains both code and 55 - data in the address space of the SPU. The possible operations on an 56 - open mem file are: 57 - 58 - read(2), pread(2), write(2), pwrite(2), lseek(2) 59 - These operate as documented, with the exception that seek(2), 60 - write(2) and pwrite(2) are not supported beyond the end of the 61 - file. The file size is the size of the local storage of the SPU, 62 - which normally is 256 kilobytes. 63 - 64 - mmap(2) 65 - Mapping mem into the process address space gives access to the 66 - SPU local storage within the process address space. Only 67 - MAP_SHARED mappings are allowed. 68 - 69 - 70 - /mbox 71 - The first SPU to CPU communication mailbox. This file is read-only and 72 - can be read in units of 32 bits. The file can only be used in non- 73 - blocking mode and it even poll() will not block on it. The possible 74 - operations on an open mbox file are: 75 - 76 - read(2) 77 - If a count smaller than four is requested, read returns -1 and 78 - sets errno to EINVAL. If there is no data available in the mail 79 - box, the return value is set to -1 and errno becomes EAGAIN. 80 - When data has been read successfully, four bytes are placed in 81 - the data buffer and the value four is returned. 82 - 83 - 84 - /ibox 85 - The second SPU to CPU communication mailbox. This file is similar to 86 - the first mailbox file, but can be read in blocking I/O mode, and the 87 - poll family of system calls can be used to wait for it. The possible 88 - operations on an open ibox file are: 89 - 90 - read(2) 91 - If a count smaller than four is requested, read returns -1 and 92 - sets errno to EINVAL. If there is no data available in the mail 93 - box and the file descriptor has been opened with O_NONBLOCK, the 94 - return value is set to -1 and errno becomes EAGAIN. 
95 - 96 - If there is no data available in the mail box and the file 97 - descriptor has been opened without O_NONBLOCK, the call will 98 - block until the SPU writes to its interrupt mailbox channel. 99 - When data has been read successfully, four bytes are placed in 100 - the data buffer and the value four is returned. 101 - 102 - poll(2) 103 - Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever 104 - data is available for reading. 105 - 106 - 107 - /wbox 108 - The CPU to SPU communation mailbox. It is write-only and can be written 109 - in units of 32 bits. If the mailbox is full, write() will block and 110 - poll can be used to wait for it becoming empty again. The possible 111 - operations on an open wbox file are: write(2) If a count smaller than 112 - four is requested, write returns -1 and sets errno to EINVAL. If there 113 - is no space available in the mail box and the file descriptor has been 114 - opened with O_NONBLOCK, the return value is set to -1 and errno becomes 115 - EAGAIN. 116 - 117 - If there is no space available in the mail box and the file descriptor 118 - has been opened without O_NONBLOCK, the call will block until the SPU 119 - reads from its PPE mailbox channel. When data has been read success- 120 - fully, four bytes are placed in the data buffer and the value four is 121 - returned. 122 - 123 - poll(2) 124 - Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever 125 - space is available for writing. 126 - 127 - 128 - /mbox_stat 129 - /ibox_stat 130 - /wbox_stat 131 - Read-only files that contain the length of the current queue, i.e. how 132 - many words can be read from mbox or ibox or how many words can be 133 - written to wbox without blocking. The files can be read only in 4-byte 134 - units and return a big-endian binary integer number. The possible 135 - operations on an open *box_stat file are: 136 - 137 - read(2) 138 - If a count smaller than four is requested, read returns -1 and 139 - sets errno to EINVAL. 
Otherwise, a four byte value is placed in 140 - the data buffer, containing the number of elements that can be 141 - read from (for mbox_stat and ibox_stat) or written to (for 142 - wbox_stat) the respective mail box without blocking or resulting 143 - in EAGAIN. 144 - 145 - 146 - /npc 147 - /decr 148 - /decr_status 149 - /spu_tag_mask 150 - /event_mask 151 - /srr0 152 - Internal registers of the SPU. The representation is an ASCII string 153 - with the numeric value of the next instruction to be executed. These 154 - can be used in read/write mode for debugging, but normal operation of 155 - programs should not rely on them because access to any of them except 156 - npc requires an SPU context save and is therefore very inefficient. 157 - 158 - The contents of these files are: 159 - 160 - npc Next Program Counter 161 - 162 - decr SPU Decrementer 163 - 164 - decr_status Decrementer Status 165 - 166 - spu_tag_mask MFC tag mask for SPU DMA 167 - 168 - event_mask Event mask for SPU interrupts 169 - 170 - srr0 Interrupt Return address register 171 - 172 - 173 - The possible operations on an open npc, decr, decr_status, 174 - spu_tag_mask, event_mask or srr0 file are: 175 - 176 - read(2) 177 - When the count supplied to the read call is shorter than the 178 - required length for the pointer value plus a newline character, 179 - subsequent reads from the same file descriptor will result in 180 - completing the string, regardless of changes to the register by 181 - a running SPU task. When a complete string has been read, all 182 - subsequent read operations will return zero bytes and a new file 183 - descriptor needs to be opened to read the value again. 184 - 185 - write(2) 186 - A write operation on the file results in setting the register to 187 - the value given in the string. The string is parsed from the 188 - beginning to the first non-numeric character or the end of the 189 - buffer. 
Subsequent writes to the same file descriptor overwrite 190 - the previous setting. 191 - 192 - 193 - /fpcr 194 - This file gives access to the Floating Point Status and Control Regis- 195 - ter as a four byte long file. The operations on the fpcr file are: 196 - 197 - read(2) 198 - If a count smaller than four is requested, read returns -1 and 199 - sets errno to EINVAL. Otherwise, a four byte value is placed in 200 - the data buffer, containing the current value of the fpcr regis- 201 - ter. 202 - 203 - write(2) 204 - If a count smaller than four is requested, write returns -1 and 205 - sets errno to EINVAL. Otherwise, a four byte value is copied 206 - from the data buffer, updating the value of the fpcr register. 207 - 208 - 209 - /signal1 210 - /signal2 211 - The two signal notification channels of an SPU. These are read-write 212 - files that operate on a 32 bit word. Writing to one of these files 213 - triggers an interrupt on the SPU. The value written to the signal 214 - files can be read from the SPU through a channel read or from host user 215 - space through the file. After the value has been read by the SPU, it 216 - is reset to zero. The possible operations on an open signal1 or sig- 217 - nal2 file are: 218 - 219 - read(2) 220 - If a count smaller than four is requested, read returns -1 and 221 - sets errno to EINVAL. Otherwise, a four byte value is placed in 222 - the data buffer, containing the current value of the specified 223 - signal notification register. 224 - 225 - write(2) 226 - If a count smaller than four is requested, write returns -1 and 227 - sets errno to EINVAL. Otherwise, a four byte value is copied 228 - from the data buffer, updating the value of the specified signal 229 - notification register. 
The signal notification register will 230 - either be replaced with the input data or will be updated to the 231 - bitwise OR or the old value and the input data, depending on the 232 - contents of the signal1_type, or signal2_type respectively, 233 - file. 234 - 235 - 236 - /signal1_type 237 - /signal2_type 238 - These two files change the behavior of the signal1 and signal2 notifi- 239 - cation files. The contain a numerical ASCII string which is read as 240 - either "1" or "0". In mode 0 (overwrite), the hardware replaces the 241 - contents of the signal channel with the data that is written to it. in 242 - mode 1 (logical OR), the hardware accumulates the bits that are subse- 243 - quently written to it. The possible operations on an open signal1_type 244 - or signal2_type file are: 245 - 246 - read(2) 247 - When the count supplied to the read call is shorter than the 248 - required length for the digit plus a newline character, subse- 249 - quent reads from the same file descriptor will result in com- 250 - pleting the string. When a complete string has been read, all 251 - subsequent read operations will return zero bytes and a new file 252 - descriptor needs to be opened to read the value again. 253 - 254 - write(2) 255 - A write operation on the file results in setting the register to 256 - the value given in the string. The string is parsed from the 257 - beginning to the first non-numeric character or the end of the 258 - buffer. Subsequent writes to the same file descriptor overwrite 259 - the previous setting. 
260 - 261 - 262 - EXAMPLES 263 - /etc/fstab entry 264 - none /spu spufs gid=spu 0 0 265 - 266 - 267 - AUTHORS 268 - Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, 269 - Ulrich Weigand <Ulrich.Weigand@de.ibm.com> 270 - 271 - SEE ALSO 272 - capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) 273 - 274 - 275 - 276 - Linux 2005-09-28 SPUFS(2) 277 - 278 - ------------------------------------------------------------------------------ 279 - 280 - SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2) 281 - 282 - 283 - 284 - NAME 285 - spu_run - execute an spu context 286 - 287 - 288 - SYNOPSIS 289 - #include <sys/spu.h> 290 - 291 - int spu_run(int fd, unsigned int *npc, unsigned int *event); 292 - 293 - DESCRIPTION 294 - The spu_run system call is used on PowerPC machines that implement the 295 - Cell Broadband Engine Architecture in order to access Synergistic Pro- 296 - cessor Units (SPUs). It uses the fd that was returned from spu_cre- 297 - ate(2) to address a specific SPU context. When the context gets sched- 298 - uled to a physical SPU, it starts execution at the instruction pointer 299 - passed in npc. 300 - 301 - Execution of SPU code happens synchronously, meaning that spu_run does 302 - not return while the SPU is still running. If there is a need to exe- 303 - cute SPU code in parallel with other code on either the main CPU or 304 - other SPUs, you need to create a new thread of execution first, e.g. 305 - using the pthread_create(3) call. 306 - 307 - When spu_run returns, the current value of the SPU instruction pointer 308 - is written back to npc, so you can call spu_run again without updating 309 - the pointers. 310 - 311 - event can be a NULL pointer or point to an extended status code that 312 - gets filled when spu_run returns. 
It can be one of the following con- 313 - stants: 314 - 315 - SPE_EVENT_DMA_ALIGNMENT 316 - A DMA alignment error 317 - 318 - SPE_EVENT_SPE_DATA_SEGMENT 319 - A DMA segmentation error 320 - 321 - SPE_EVENT_SPE_DATA_STORAGE 322 - A DMA storage error 323 - 324 - If NULL is passed as the event argument, these errors will result in a 325 - signal delivered to the calling process. 326 - 327 - RETURN VALUE 328 - spu_run returns the value of the spu_status register or -1 to indicate 329 - an error and set errno to one of the error codes listed below. The 330 - spu_status register value contains a bit mask of status codes and 331 - optionally a 14 bit code returned from the stop-and-signal instruction 332 - on the SPU. The bit masks for the status codes are: 333 - 334 - 0x02 SPU was stopped by stop-and-signal. 335 - 336 - 0x04 SPU was stopped by halt. 337 - 338 - 0x08 SPU is waiting for a channel. 339 - 340 - 0x10 SPU is in single-step mode. 341 - 342 - 0x20 SPU has tried to execute an invalid instruction. 343 - 344 - 0x40 SPU has tried to access an invalid channel. 345 - 346 - 0x3fff0000 347 - The bits masked with this value contain the code returned from 348 - stop-and-signal. 349 - 350 - There are always one or more of the lower eight bits set or an error 351 - code is returned from spu_run. 352 - 353 - ERRORS 354 - EAGAIN or EWOULDBLOCK 355 - fd is in non-blocking mode and spu_run would block. 356 - 357 - EBADF fd is not a valid file descriptor. 358 - 359 - EFAULT npc is not a valid pointer or status is neither NULL nor a valid 360 - pointer. 361 - 362 - EINTR A signal occurred while spu_run was in progress. The npc value 363 - has been updated to the new program counter value if necessary. 364 - 365 - EINVAL fd is not a file descriptor returned from spu_create(2). 366 - 367 - ENOMEM Insufficient memory was available to handle a page fault result- 368 - ing from an MFC direct memory access. 
369 - 370 - ENOSYS the functionality is not provided by the current system, because 371 - either the hardware does not provide SPUs or the spufs module is 372 - not loaded. 373 - 374 - 375 - NOTES 376 - spu_run is meant to be used from libraries that implement a more 377 - abstract interface to SPUs, not to be used from regular applications. 378 - See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- 379 - ommended libraries. 380 - 381 - 382 - CONFORMING TO 383 - This call is Linux specific and only implemented by the ppc64 architec- 384 - ture. Programs using this system call are not portable. 385 - 386 - 387 - BUGS 388 - The code does not yet fully implement all features lined out here. 389 - 390 - 391 - AUTHOR 392 - Arnd Bergmann <arndb@de.ibm.com> 393 - 394 - SEE ALSO 395 - capabilities(7), close(2), spu_create(2), spufs(7) 396 - 397 - 398 - 399 - Linux 2005-09-28 SPU_RUN(2) 400 - 401 - ------------------------------------------------------------------------------ 402 - 403 - SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2) 404 - 405 - 406 - 407 - NAME 408 - spu_create - create a new spu context 409 - 410 - 411 - SYNOPSIS 412 - #include <sys/types.h> 413 - #include <sys/spu.h> 414 - 415 - int spu_create(const char *pathname, int flags, mode_t mode); 416 - 417 - DESCRIPTION 418 - The spu_create system call is used on PowerPC machines that implement 419 - the Cell Broadband Engine Architecture in order to access Synergistic 420 - Processor Units (SPUs). It creates a new logical context for an SPU in 421 - pathname and returns a handle to associated with it. pathname must 422 - point to a non-existing directory in the mount point of the SPU file 423 - system (spufs). When spu_create is successful, a directory gets cre- 424 - ated on pathname and it is populated with files. 425 - 426 - The returned file handle can only be passed to spu_run(2) or closed, 427 - other operations are not defined on it. 
When it is closed, all associ- 428 - ated directory entries in spufs are removed. When the last file handle 429 - pointing either inside of the context directory or to this file 430 - descriptor is closed, the logical SPU context is destroyed. 431 - 432 - The parameter flags can be zero or any bitwise or'd combination of the 433 - following constants: 434 - 435 - SPU_RAWIO 436 - Allow mapping of some of the hardware registers of the SPU into 437 - user space. This flag requires the CAP_SYS_RAWIO capability, see 438 - capabilities(7). 439 - 440 - The mode parameter specifies the permissions used for creating the new 441 - directory in spufs. mode is modified with the user's umask(2) value 442 - and then used for both the directory and the files contained in it. The 443 - file permissions mask out some more bits of mode because they typically 444 - support only read or write access. See stat(2) for a full list of the 445 - possible mode values. 446 - 447 - 448 - RETURN VALUE 449 - spu_create returns a new file descriptor. It may return -1 to indicate 450 - an error condition and set errno to one of the error codes listed 451 - below. 452 - 453 - 454 - ERRORS 455 - EACCES 456 - The current user does not have write access on the spufs mount 457 - point. 458 - 459 - EEXIST An SPU context already exists at the given path name. 460 - 461 - EFAULT pathname is not a valid string pointer in the current address 462 - space. 463 - 464 - EINVAL pathname is not a directory in the spufs mount point. 465 - 466 - ELOOP Too many symlinks were found while resolving pathname. 467 - 468 - EMFILE The process has reached its maximum open file limit. 469 - 470 - ENAMETOOLONG 471 - pathname was too long. 472 - 473 - ENFILE The system has reached the global open file limit. 474 - 475 - ENOENT Part of pathname could not be resolved. 476 - 477 - ENOMEM The kernel could not allocate all resources required. 
478 - 479 - ENOSPC There are not enough SPU resources available to create a new 480 - context or the user specific limit for the number of SPU con- 481 - texts has been reached. 482 - 483 - ENOSYS the functionality is not provided by the current system, because 484 - either the hardware does not provide SPUs or the spufs module is 485 - not loaded. 486 - 487 - ENOTDIR 488 - A part of pathname is not a directory. 489 - 490 - 491 - 492 - NOTES 493 - spu_create is meant to be used from libraries that implement a more 494 - abstract interface to SPUs, not to be used from regular applications. 495 - See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- 496 - ommended libraries. 497 - 498 - 499 - FILES 500 - pathname must point to a location beneath the mount point of spufs. By 501 - convention, it gets mounted in /spu. 502 - 503 - 504 - CONFORMING TO 505 - This call is Linux specific and only implemented by the ppc64 architec- 506 - ture. Programs using this system call are not portable. 507 - 508 - 509 - BUGS 510 - The code does not yet fully implement all features lined out here. 511 - 512 - 513 - AUTHOR 514 - Arnd Bergmann <arndb@de.ibm.com> 515 - 516 - SEE ALSO 517 - capabilities(7), close(2), spu_run(2), spufs(7) 518 - 519 - 520 - 521 - Linux 2005-09-28 SPU_CREATE(2)

+13

Documentation/filesystems/spufs/index.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============== 4 + SPU Filesystem 5 + ============== 6 + 7 + 8 + .. toctree:: 9 + :maxdepth: 1 10 + 11 + spufs 12 + spu_create 13 + spu_run

+131

Documentation/filesystems/spufs/spu_create.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========== 4 + spu_create 5 + ========== 6 + 7 + Name 8 + ==== 9 + spu_create - create a new spu context 10 + 11 + 12 + Synopsis 13 + ======== 14 + 15 + :: 16 + 17 + #include <sys/types.h> 18 + #include <sys/spu.h> 19 + 20 + int spu_create(const char *pathname, int flags, mode_t mode); 21 + 22 + Description 23 + =========== 24 + The spu_create system call is used on PowerPC machines that implement 25 + the Cell Broadband Engine Architecture in order to access Synergistic 26 + Processor Units (SPUs). It creates a new logical context for an SPU in 27 + pathname and returns a handle associated with it. pathname must 28 + point to a non-existing directory in the mount point of the SPU file 29 + system (spufs). When spu_create is successful, a directory gets cre- 30 + ated on pathname and it is populated with files. 31 + 32 + The returned file handle can only be passed to spu_run(2) or closed, 33 + other operations are not defined on it. When it is closed, all associ- 34 + ated directory entries in spufs are removed. When the last file handle 35 + pointing either inside of the context directory or to this file 36 + descriptor is closed, the logical SPU context is destroyed. 37 + 38 + The parameter flags can be zero or any bitwise or'd combination of the 39 + following constants: 40 + 41 + SPU_RAWIO 42 + Allow mapping of some of the hardware registers of the SPU into 43 + user space. This flag requires the CAP_SYS_RAWIO capability, see 44 + capabilities(7). 45 + 46 + The mode parameter specifies the permissions used for creating the new 47 + directory in spufs. mode is modified with the user's umask(2) value 48 + and then used for both the directory and the files contained in it. The 49 + file permissions mask out some more bits of mode because they typically 50 + support only read or write access. See stat(2) for a full list of the 51 + possible mode values. 
52 + 53 + 54 + Return Value 55 + ============ 56 + spu_create returns a new file descriptor. It may return -1 to indicate 57 + an error condition and set errno to one of the error codes listed 58 + below. 59 + 60 + 61 + Errors 62 + ====== 63 + EACCES 64 + The current user does not have write access on the spufs mount 65 + point. 66 + 67 + EEXIST An SPU context already exists at the given path name. 68 + 69 + EFAULT pathname is not a valid string pointer in the current address 70 + space. 71 + 72 + EINVAL pathname is not a directory in the spufs mount point. 73 + 74 + ELOOP Too many symlinks were found while resolving pathname. 75 + 76 + EMFILE The process has reached its maximum open file limit. 77 + 78 + ENAMETOOLONG 79 + pathname was too long. 80 + 81 + ENFILE The system has reached the global open file limit. 82 + 83 + ENOENT Part of pathname could not be resolved. 84 + 85 + ENOMEM The kernel could not allocate all resources required. 86 + 87 + ENOSPC There are not enough SPU resources available to create a new 88 + context or the user specific limit for the number of SPU con- 89 + texts has been reached. 90 + 91 + ENOSYS the functionality is not provided by the current system, because 92 + either the hardware does not provide SPUs or the spufs module is 93 + not loaded. 94 + 95 + ENOTDIR 96 + A part of pathname is not a directory. 97 + 98 + 99 + 100 + Notes 101 + ===== 102 + spu_create is meant to be used from libraries that implement a more 103 + abstract interface to SPUs, not to be used from regular applications. 104 + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- 105 + ommended libraries. 106 + 107 + 108 + Files 109 + ===== 110 + pathname must point to a location beneath the mount point of spufs. By 111 + convention, it gets mounted in /spu. 112 + 113 + 114 + Conforming to 115 + ============= 116 + This call is Linux specific and only implemented by the ppc64 architec- 117 + ture. Programs using this system call are not portable. 
118 + 119 + 120 + Bugs 121 + ==== 122 + The code does not yet fully implement all features lined out here. 123 + 124 + 125 + Author 126 + ====== 127 + Arnd Bergmann <arndb@de.ibm.com> 128 + 129 + See Also 130 + ======== 131 + capabilities(7), close(2), spu_run(2), spufs(7)
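The spu_create text above says that mode "is modified with the user's umask(2) value" before it is applied to the new context directory. A minimal sketch of that step follows, assuming the usual `mode & ~umask` rule used by directory-creating calls. The helper names `effective_mode` and `describe` are hypothetical, and the additional per-file masking mentioned in the text is not modeled because the source does not specify it.

```python
import stat


def effective_mode(requested_mode: int, umask: int) -> int:
    """Permission bits that remain after applying a umask.

    Assumption: spufs applies the umask the same way mkdir(2) does,
    i.e. requested_mode & ~umask, restricted to the rwx bits.
    """
    return requested_mode & ~umask & 0o777


def describe(mode: int) -> str:
    """Render permission bits in ls-style form for a directory."""
    # stat.filemode() expects a full st_mode, so add the directory type bit.
    return stat.filemode(stat.S_IFDIR | mode)
```

For example, a requested mode of 0775 under a umask of 022 would leave 0755 for the context directory under this assumption.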

+138

Documentation/filesystems/spufs/spu_run.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ======= 4 + spu_run 5 + ======= 6 + 7 + 8 + Name 9 + ==== 10 + spu_run - execute an spu context 11 + 12 + 13 + Synopsis 14 + ======== 15 + 16 + :: 17 + 18 + #include <sys/spu.h> 19 + 20 + int spu_run(int fd, unsigned int *npc, unsigned int *event); 21 + 22 + Description 23 + =========== 24 + The spu_run system call is used on PowerPC machines that implement the 25 + Cell Broadband Engine Architecture in order to access Synergistic Pro- 26 + cessor Units (SPUs). It uses the fd that was returned from spu_cre- 27 + ate(2) to address a specific SPU context. When the context gets sched- 28 + uled to a physical SPU, it starts execution at the instruction pointer 29 + passed in npc. 30 + 31 + Execution of SPU code happens synchronously, meaning that spu_run does 32 + not return while the SPU is still running. If there is a need to exe- 33 + cute SPU code in parallel with other code on either the main CPU or 34 + other SPUs, you need to create a new thread of execution first, e.g. 35 + using the pthread_create(3) call. 36 + 37 + When spu_run returns, the current value of the SPU instruction pointer 38 + is written back to npc, so you can call spu_run again without updating 39 + the pointers. 40 + 41 + event can be a NULL pointer or point to an extended status code that 42 + gets filled when spu_run returns. It can be one of the following con- 43 + stants: 44 + 45 + SPE_EVENT_DMA_ALIGNMENT 46 + A DMA alignment error 47 + 48 + SPE_EVENT_SPE_DATA_SEGMENT 49 + A DMA segmentation error 50 + 51 + SPE_EVENT_SPE_DATA_STORAGE 52 + A DMA storage error 53 + 54 + If NULL is passed as the event argument, these errors will result in a 55 + signal delivered to the calling process. 56 + 57 + Return Value 58 + ============ 59 + spu_run returns the value of the spu_status register or -1 to indicate 60 + an error and set errno to one of the error codes listed below. 
The 61 + spu_status register value contains a bit mask of status codes and 62 + optionally a 14 bit code returned from the stop-and-signal instruction 63 + on the SPU. The bit masks for the status codes are: 64 + 65 + 0x02 66 + SPU was stopped by stop-and-signal. 67 + 68 + 0x04 69 + SPU was stopped by halt. 70 + 71 + 0x08 72 + SPU is waiting for a channel. 73 + 74 + 0x10 75 + SPU is in single-step mode. 76 + 77 + 0x20 78 + SPU has tried to execute an invalid instruction. 79 + 80 + 0x40 81 + SPU has tried to access an invalid channel. 82 + 83 + 0x3fff0000 84 + The bits masked with this value contain the code returned from 85 + stop-and-signal. 86 + 87 + There are always one or more of the lower eight bits set or an error 88 + code is returned from spu_run. 89 + 90 + Errors 91 + ====== 92 + EAGAIN or EWOULDBLOCK 93 + fd is in non-blocking mode and spu_run would block. 94 + 95 + EBADF fd is not a valid file descriptor. 96 + 97 + EFAULT npc is not a valid pointer or status is neither NULL nor a valid 98 + pointer. 99 + 100 + EINTR A signal occurred while spu_run was in progress. The npc value 101 + has been updated to the new program counter value if necessary. 102 + 103 + EINVAL fd is not a file descriptor returned from spu_create(2). 104 + 105 + ENOMEM Insufficient memory was available to handle a page fault result- 106 + ing from an MFC direct memory access. 107 + 108 + ENOSYS the functionality is not provided by the current system, because 109 + either the hardware does not provide SPUs or the spufs module is 110 + not loaded. 111 + 112 + 113 + Notes 114 + ===== 115 + spu_run is meant to be used from libraries that implement a more 116 + abstract interface to SPUs, not to be used from regular applications. 117 + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- 118 + ommended libraries. 119 + 120 + 121 + Conforming to 122 + ============= 123 + This call is Linux specific and only implemented by the ppc64 architec- 124 + ture. 
Programs using this system call are not portable. 125 + 126 + 127 + Bugs 128 + ==== 129 + The code does not yet fully implement all features lined out here. 130 + 131 + 132 + Author 133 + ====== 134 + Arnd Bergmann <arndb@de.ibm.com> 135 + 136 + See Also 137 + ======== 138 + capabilities(7), close(2), spu_create(2), spufs(7)
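The Return Value section of spu_run.rst above lists the status bit masks and the 0x3fff0000 field holding the stop-and-signal code. The following sketch shows one way a caller might split a successful return value into those parts. It is written in Python purely for illustration; `decode_spu_status` and the table names are hypothetical, and the function assumes it is given a non-error value (the -1 error return has to be handled before decoding).

```python
# Status bits as listed in the Return Value section.
STATUS_FLAGS = {
    0x02: "stopped by stop-and-signal",
    0x04: "stopped by halt",
    0x08: "waiting for a channel",
    0x10: "single-step mode",
    0x20: "invalid instruction",
    0x40: "invalid channel",
}

# Bits holding the 14 bit code returned from stop-and-signal.
STOP_CODE_MASK = 0x3FFF0000
STOP_CODE_SHIFT = 16


def decode_spu_status(status: int):
    """Split an spu_status value into (list of flag names, stop code)."""
    flags = [name for bit, name in STATUS_FLAGS.items() if status & bit]
    stop_code = (status & STOP_CODE_MASK) >> STOP_CODE_SHIFT
    return flags, stop_code
```

The stop code is only meaningful when the 0x02 flag is set; for other stop reasons the sketch simply reports whatever the field contains.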

+273

Documentation/filesystems/spufs/spufs.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ===== 4 + spufs 5 + ===== 6 + 7 + Name 8 + ==== 9 + 10 + spufs - the SPU file system 11 + 12 + 13 + Description 14 + =========== 15 + 16 + The SPU file system is used on PowerPC machines that implement the Cell 17 + Broadband Engine Architecture in order to access Synergistic Processor 18 + Units (SPUs). 19 + 20 + The file system provides a name space similar to posix shared memory or 21 + message queues. Users that have write permissions on the file system 22 + can use spu_create(2) to establish SPU contexts in the spufs root. 23 + 24 + Every SPU context is represented by a directory containing a predefined 25 + set of files. These files can be used for manipulating the state of the 26 + logical SPU. Users can change permissions on those files, but not actu- 27 + ally add or remove files. 28 + 29 + 30 + Mount Options 31 + ============= 32 + 33 + uid=<uid> 34 + set the user owning the mount point, the default is 0 (root). 35 + 36 + gid=<gid> 37 + set the group owning the mount point, the default is 0 (root). 38 + 39 + 40 + Files 41 + ===== 42 + 43 + The files in spufs mostly follow the standard behavior for regular sys- 44 + tem calls like read(2) or write(2), but often support only a subset of 45 + the operations supported on regular file systems. This list details the 46 + supported operations and the deviations from the behaviour in the 47 + respective man pages. 48 + 49 + All files that support the read(2) operation also support readv(2) and 50 + all files that support the write(2) operation also support writev(2). 51 + All files support the access(2) and stat(2) family of operations, but 52 + only the st_mode, st_nlink, st_uid and st_gid fields of struct stat 53 + contain reliable information. 54 + 55 + All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera- 56 + tions, but will not be able to grant permissions that contradict the 57 + possible operations, e.g. read access on the wbox file. 
 58 + 59 + The current set of files is: 60 + 61 + 62 + /mem 63 + The contents of the local storage memory of the SPU. This can be 64 + accessed like a regular shared memory file and contains both code and 65 + data in the address space of the SPU. The possible operations on an 66 + open mem file are: 67 + 68 + read(2), pread(2), write(2), pwrite(2), lseek(2) 69 + These operate as documented, with the exception that lseek(2), 70 + write(2) and pwrite(2) are not supported beyond the end of the 71 + file. The file size is the size of the local storage of the SPU, 72 + which normally is 256 kilobytes. 73 + 74 + mmap(2) 75 + Mapping mem into the process address space gives access to the 76 + SPU local storage within the process address space. Only 77 + MAP_SHARED mappings are allowed. 78 + 79 + 80 + /mbox 81 + The first SPU to CPU communication mailbox. This file is read-only and 82 + can be read in units of 32 bits. The file can only be used in non- 83 + blocking mode and even poll() will not block on it. The possible 84 + operations on an open mbox file are: 85 + 86 + read(2) 87 + If a count smaller than four is requested, read returns -1 and 88 + sets errno to EINVAL. If there is no data available in the mail 89 + box, the return value is set to -1 and errno becomes EAGAIN. 90 + When data has been read successfully, four bytes are placed in 91 + the data buffer and the value four is returned. 92 + 93 + 94 + /ibox 95 + The second SPU to CPU communication mailbox. This file is similar to 96 + the first mailbox file, but can be read in blocking I/O mode, and the 97 + poll family of system calls can be used to wait for it. The possible 98 + operations on an open ibox file are: 99 + 100 + read(2) 101 + If a count smaller than four is requested, read returns -1 and 102 + sets errno to EINVAL. If there is no data available in the mail 103 + box and the file descriptor has been opened with O_NONBLOCK, the 104 + return value is set to -1 and errno becomes EAGAIN. 
105 + 106 + If there is no data available in the mail box and the file 107 + descriptor has been opened without O_NONBLOCK, the call will 108 + block until the SPU writes to its interrupt mailbox channel. 109 + When data has been read successfully, four bytes are placed in 110 + the data buffer and the value four is returned. 111 + 112 + poll(2) 113 + Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever 114 + data is available for reading. 115 + 116 + 117 + /wbox 118 + The CPU to SPU communication mailbox. It is write-only and can be written 119 + in units of 32 bits. If the mailbox is full, write() will block and 120 + poll can be used to wait for it becoming empty again. The possible 121 + operations on an open wbox file are: write(2) If a count smaller than 122 + four is requested, write returns -1 and sets errno to EINVAL. If there 123 + is no space available in the mail box and the file descriptor has been 124 + opened with O_NONBLOCK, the return value is set to -1 and errno becomes 125 + EAGAIN. 126 + 127 + If there is no space available in the mail box and the file descriptor 128 + has been opened without O_NONBLOCK, the call will block until the SPU 129 + reads from its PPE mailbox channel. When data has been written success- 130 + fully, the value four is 131 + returned. 132 + 133 + poll(2) 134 + Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever 135 + space is available for writing. 136 + 137 + 138 + /mbox_stat, /ibox_stat, /wbox_stat 139 + Read-only files that contain the length of the current queue, i.e. how 140 + many words can be read from mbox or ibox or how many words can be 141 + written to wbox without blocking. The files can be read only in 4-byte 142 + units and return a big-endian binary integer number. The possible 143 + operations on an open ``*box_stat`` file are: 144 + 145 + read(2) 146 + If a count smaller than four is requested, read returns -1 and 147 + sets errno to EINVAL. 
Otherwise, a four byte value is placed in 148 + the data buffer, containing the number of elements that can be 149 + read from (for mbox_stat and ibox_stat) or written to (for 150 + wbox_stat) the respective mail box without blocking or resulting 151 + in EAGAIN. 152 + 153 + 154 + /npc, /decr, /decr_status, /spu_tag_mask, /event_mask, /srr0 155 + Internal registers of the SPU. The representation is an ASCII string 156 + with the numeric value of the next instruction to be executed. These 157 + can be used in read/write mode for debugging, but normal operation of 158 + programs should not rely on them because access to any of them except 159 + npc requires an SPU context save and is therefore very inefficient. 160 + 161 + The contents of these files are: 162 + 163 + =================== =================================== 164 + npc Next Program Counter 165 + decr SPU Decrementer 166 + decr_status Decrementer Status 167 + spu_tag_mask MFC tag mask for SPU DMA 168 + event_mask Event mask for SPU interrupts 169 + srr0 Interrupt Return address register 170 + =================== =================================== 171 + 172 + 173 + The possible operations on an open npc, decr, decr_status, 174 + spu_tag_mask, event_mask or srr0 file are: 175 + 176 + read(2) 177 + When the count supplied to the read call is shorter than the 178 + required length for the pointer value plus a newline character, 179 + subsequent reads from the same file descriptor will result in 180 + completing the string, regardless of changes to the register by 181 + a running SPU task. When a complete string has been read, all 182 + subsequent read operations will return zero bytes and a new file 183 + descriptor needs to be opened to read the value again. 184 + 185 + write(2) 186 + A write operation on the file results in setting the register to 187 + the value given in the string. The string is parsed from the 188 + beginning to the first non-numeric character or the end of the 189 + buffer. 
Subsequent writes to the same file descriptor overwrite 190 + the previous setting. 191 + 192 + 193 + /fpcr 194 + This file gives access to the Floating Point Status and Control Regis- 195 + ter as a four byte long file. The operations on the fpcr file are: 196 + 197 + read(2) 198 + If a count smaller than four is requested, read returns -1 and 199 + sets errno to EINVAL. Otherwise, a four byte value is placed in 200 + the data buffer, containing the current value of the fpcr regis- 201 + ter. 202 + 203 + write(2) 204 + If a count smaller than four is requested, write returns -1 and 205 + sets errno to EINVAL. Otherwise, a four byte value is copied 206 + from the data buffer, updating the value of the fpcr register. 207 + 208 + 209 + /signal1, /signal2 210 + The two signal notification channels of an SPU. These are read-write 211 + files that operate on a 32 bit word. Writing to one of these files 212 + triggers an interrupt on the SPU. The value written to the signal 213 + files can be read from the SPU through a channel read or from host user 214 + space through the file. After the value has been read by the SPU, it 215 + is reset to zero. The possible operations on an open signal1 or sig- 216 + nal2 file are: 217 + 218 + read(2) 219 + If a count smaller than four is requested, read returns -1 and 220 + sets errno to EINVAL. Otherwise, a four byte value is placed in 221 + the data buffer, containing the current value of the specified 222 + signal notification register. 223 + 224 + write(2) 225 + If a count smaller than four is requested, write returns -1 and 226 + sets errno to EINVAL. Otherwise, a four byte value is copied 227 + from the data buffer, updating the value of the specified signal 228 + notification register. 
The signal notification register will 229 + either be replaced with the input data or will be updated to the 230 + bitwise OR of the old value and the input data, depending on the 231 + contents of the signal1_type, or signal2_type respectively, 232 + file. 233 + 234 + 235 + /signal1_type, /signal2_type 236 + These two files change the behavior of the signal1 and signal2 notifi- 237 + cation files. They contain a numerical ASCII string which is read as 238 + either "1" or "0". In mode 0 (overwrite), the hardware replaces the 239 + contents of the signal channel with the data that is written to it. In 240 + mode 1 (logical OR), the hardware accumulates the bits that are subse- 241 + quently written to it. The possible operations on an open signal1_type 242 + or signal2_type file are: 243 + 244 + read(2) 245 + When the count supplied to the read call is shorter than the 246 + required length for the digit plus a newline character, subse- 247 + quent reads from the same file descriptor will result in com- 248 + pleting the string. When a complete string has been read, all 249 + subsequent read operations will return zero bytes and a new file 250 + descriptor needs to be opened to read the value again. 251 + 252 + write(2) 253 + A write operation on the file results in setting the register to 254 + the value given in the string. The string is parsed from the 255 + beginning to the first non-numeric character or the end of the 256 + buffer. Subsequent writes to the same file descriptor overwrite 257 + the previous setting. 258 + 259 + 260 + Examples 261 + ======== 262 + /etc/fstab entry 263 + none /spu spufs gid=spu 0 0 264 + 265 + 266 + Authors 267 + ======= 268 + Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, 269 + Ulrich Weigand <Ulrich.Weigand@de.ibm.com> 270 + 271 + See Also 272 + ======== 273 + capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
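Two details from the spufs.rst text above lend themselves to a short illustration: the `*box_stat` files return a four-byte big-endian integer, and a write to signal1 or signal2 either overwrites the register (mode 0) or ORs into it (mode 1). The sketch below models both in plain Python without touching a real spufs mount. The function names `parse_box_stat` and `apply_signal_write` are hypothetical, and the second function is only a software model of the described hardware behaviour.

```python
import struct


def parse_box_stat(data: bytes) -> int:
    """Decode the four bytes read from mbox_stat, ibox_stat or wbox_stat."""
    if len(data) != 4:
        raise ValueError("expected exactly four bytes")
    # ">I" is a big-endian unsigned 32 bit integer, as the text describes.
    (count,) = struct.unpack(">I", data)
    return count


def apply_signal_write(old: int, written: int, mode: int) -> int:
    """Model the effect of a write on a signal notification register.

    mode 0 (overwrite): the register takes the written value.
    mode 1 (logical OR): the written bits accumulate into the old value.
    """
    if mode == 0:
        return written & 0xFFFFFFFF
    if mode == 1:
        return (old | written) & 0xFFFFFFFF
    raise ValueError("mode must be 0 or 1")
```

In a real program the four bytes would come from a read(2) of at least four bytes on the opened stat file; shorter reads fail with EINVAL according to the text.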

+138

Documentation/filesystems/sysfs-pci.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============================================ 4 + Accessing PCI device resources through sysfs 5 + ============================================ 6 + 7 + sysfs, usually mounted at /sys, provides access to PCI resources on platforms 8 + that support it. For example, a given bus might look like this:: 9 + 10 + /sys/devices/pci0000:17 11 + |-- 0000:17:00.0 12 + | |-- class 13 + | |-- config 14 + | |-- device 15 + | |-- enable 16 + | |-- irq 17 + | |-- local_cpus 18 + | |-- remove 19 + | |-- resource 20 + | |-- resource0 21 + | |-- resource1 22 + | |-- resource2 23 + | |-- revision 24 + | |-- rom 25 + | |-- subsystem_device 26 + | |-- subsystem_vendor 27 + | `-- vendor 28 + `-- ... 29 + 30 + The topmost element describes the PCI domain and bus number. In this case, 31 + the domain number is 0000 and the bus number is 17 (both values are in hex). 32 + This bus contains a single function device in slot 0. The domain and bus 33 + numbers are reproduced for convenience. Under the device directory are several 34 + files, each with their own function. 
35 + 36 + =================== ===================================================== 37 + file function 38 + =================== ===================================================== 39 + class PCI class (ascii, ro) 40 + config PCI config space (binary, rw) 41 + device PCI device (ascii, ro) 42 + enable Whether the device is enabled (ascii, rw) 43 + irq IRQ number (ascii, ro) 44 + local_cpus nearby CPU mask (cpumask, ro) 45 + remove remove device from kernel's list (ascii, wo) 46 + resource PCI resource host addresses (ascii, ro) 47 + resource0..N PCI resource N, if present (binary, mmap, rw\ [1]_) 48 + resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) 49 + revision PCI revision (ascii, ro) 50 + rom PCI ROM resource, if present (binary, ro) 51 + subsystem_device PCI subsystem device (ascii, ro) 52 + subsystem_vendor PCI subsystem vendor (ascii, ro) 53 + vendor PCI vendor (ascii, ro) 54 + =================== ===================================================== 55 + 56 + :: 57 + 58 + ro - read only file 59 + rw - file is readable and writable 60 + wo - write only file 61 + mmap - file is mmapable 62 + ascii - file contains ascii text 63 + binary - file contains binary data 64 + cpumask - file contains a cpumask type 65 + 66 + .. [1] rw for RESOURCE_IO (I/O port) regions only 67 + 68 + The read only files are informational, writes to them will be ignored, with 69 + the exception of the 'rom' file. Writable files can be used to perform 70 + actions on the device (e.g. changing config space, detaching a device). 71 + mmapable files are available via an mmap of the file at offset 0 and can be 72 + used to do actual device programming from userspace. Note that some platforms 73 + don't support mmapping of certain resources, so be sure to check the return 74 + value from any attempted mmap. The most notable of these are I/O port 75 + resources, which also provide read/write access. 
76 + 77 + The 'enable' file provides a counter that indicates how many times the device 78 + has been enabled. If the 'enable' file currently returns '4', and a '1' is 79 + echoed into it, it will then return '5'. Echoing a '0' into it will decrease 80 + the count. Even when it returns to 0, though, some of the initialisation 81 + may not be reversed. 82 + 83 + The 'rom' file is special in that it provides read-only access to the device's 84 + ROM file, if available. It's disabled by default, however, so applications 85 + should write the string "1" to the file to enable it before attempting a read 86 + call, and disable it following the access by writing "0" to the file. Note 87 + that the device must be enabled for a rom read to return data successfully. 88 + In the event a driver is not bound to the device, it can be enabled using the 89 + 'enable' file, documented above. 90 + 91 + The 'remove' file is used to remove the PCI device, by writing a non-zero 92 + integer to the file. This does not involve any kind of hot-plug functionality, 93 + e.g. powering off the device. The device is removed from the kernel's list of 94 + PCI devices, the sysfs directory for it is removed, and the device will be 95 + removed from any drivers attached to it. Removal of PCI root buses is 96 + disallowed. 97 + 98 + Accessing legacy resources through sysfs 99 + ---------------------------------------- 100 + 101 + Legacy I/O port and ISA memory resources are also provided in sysfs if the 102 + underlying platform supports them. They're located in the PCI class hierarchy, 103 + e.g.:: 104 + 105 + /sys/class/pci_bus/0000:17/ 106 + |-- bridge -> ../../../devices/pci0000:17 107 + |-- cpuaffinity 108 + |-- legacy_io 109 + `-- legacy_mem 110 + 111 + The legacy_io file is a read/write file that can be used by applications to 112 + do legacy port I/O. The application should open the file, seek to the desired 113 + port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. 
The legacy_mem 114 + file should be mmapped with an offset corresponding to the memory offset 115 + desired, e.g. 0xa0000 for the VGA frame buffer. The application can then 116 + simply dereference the returned pointer (after checking for errors of course) 117 + to access legacy memory space. 118 + 119 + Supporting PCI access on new platforms 120 + -------------------------------------- 121 + 122 + In order to support PCI resource mapping as described above, Linux platform 123 + code should ideally define ARCH_GENERIC_PCI_MMAP_RESOURCE and use the generic 124 + implementation of that functionality. To support the historical interface of 125 + mmap() through files in /proc/bus/pci, platforms may also set HAVE_PCI_MMAP. 126 + 127 + Alternatively, platforms which set HAVE_PCI_MMAP may provide their own 128 + implementation of pci_mmap_page_range() instead of defining 129 + ARCH_GENERIC_PCI_MMAP_RESOURCE. 130 + 131 + Platforms which support write-combining maps of PCI resources must define 132 + arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when 133 + write-combining is permitted. Platforms which support maps of I/O resources 134 + define arch_can_pci_mmap_io() similarly. 135 + 136 + Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms 137 + wishing to support legacy functionality should define it and provide 138 + pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
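The legacy_io access pattern described in the added sysfs-pci.rst (open the file, seek to the port, then read 1, 2 or 4 bytes) can be sketched as follows. This is a minimal illustration, not kernel-provided code: the helper name `read_legacy_port` is hypothetical, the raw bytes are returned without any byte-order interpretation, and a real access to a path such as /sys/class/pci_bus/0000:17/legacy_io is assumed to need sufficient privileges and platform support.

```python
import os


def read_legacy_port(legacy_io_path, port, width=1):
    """Read `width` raw bytes at offset `port` from a legacy_io-style file.

    Hypothetical helper: the name and the error handling are assumptions
    made for illustration only.
    """
    if width not in (1, 2, 4):
        raise ValueError("legacy port accesses are 1, 2 or 4 bytes wide")
    fd = os.open(legacy_io_path, os.O_RDONLY)
    try:
        # The file offset selects the I/O port, e.g. 0x3e8.
        os.lseek(fd, port, os.SEEK_SET)
        data = os.read(fd, width)
    finally:
        os.close(fd)
    if len(data) != width:
        # Platforms may refuse or truncate the access; never assume success.
        raise OSError("short read from %s at port %#x" % (legacy_io_path, port))
    return data
```

On a platform that provides the file, a call might look like `read_legacy_port("/sys/class/pci_bus/0000:17/legacy_io", 0x3e8, 1)`; any OSError should be treated as "not supported here" rather than retried blindly.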

-131

Documentation/filesystems/sysfs-pci.txt

··· 1 - Accessing PCI device resources through sysfs 2 - -------------------------------------------- 3 - 4 - sysfs, usually mounted at /sys, provides access to PCI resources on platforms 5 - that support it. For example, a given bus might look like this: 6 - 7 - /sys/devices/pci0000:17 8 - |-- 0000:17:00.0 9 - | |-- class 10 - | |-- config 11 - | |-- device 12 - | |-- enable 13 - | |-- irq 14 - | |-- local_cpus 15 - | |-- remove 16 - | |-- resource 17 - | |-- resource0 18 - | |-- resource1 19 - | |-- resource2 20 - | |-- revision 21 - | |-- rom 22 - | |-- subsystem_device 23 - | |-- subsystem_vendor 24 - | `-- vendor 25 - `-- ... 26 - 27 - The topmost element describes the PCI domain and bus number. In this case, 28 - the domain number is 0000 and the bus number is 17 (both values are in hex). 29 - This bus contains a single function device in slot 0. The domain and bus 30 - numbers are reproduced for convenience. Under the device directory are several 31 - files, each with their own function. 
32 - 33 - file function 34 - ---- -------- 35 - class PCI class (ascii, ro) 36 - config PCI config space (binary, rw) 37 - device PCI device (ascii, ro) 38 - enable Whether the device is enabled (ascii, rw) 39 - irq IRQ number (ascii, ro) 40 - local_cpus nearby CPU mask (cpumask, ro) 41 - remove remove device from kernel's list (ascii, wo) 42 - resource PCI resource host addresses (ascii, ro) 43 - resource0..N PCI resource N, if present (binary, mmap, rw[1]) 44 - resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) 45 - revision PCI revision (ascii, ro) 46 - rom PCI ROM resource, if present (binary, ro) 47 - subsystem_device PCI subsystem device (ascii, ro) 48 - subsystem_vendor PCI subsystem vendor (ascii, ro) 49 - vendor PCI vendor (ascii, ro) 50 - 51 - ro - read only file 52 - rw - file is readable and writable 53 - wo - write only file 54 - mmap - file is mmapable 55 - ascii - file contains ascii text 56 - binary - file contains binary data 57 - cpumask - file contains a cpumask type 58 - 59 - [1] rw for RESOURCE_IO (I/O port) regions only 60 - 61 - The read only files are informational, writes to them will be ignored, with 62 - the exception of the 'rom' file. Writable files can be used to perform 63 - actions on the device (e.g. changing config space, detaching a device). 64 - mmapable files are available via an mmap of the file at offset 0 and can be 65 - used to do actual device programming from userspace. Note that some platforms 66 - don't support mmapping of certain resources, so be sure to check the return 67 - value from any attempted mmap. The most notable of these are I/O port 68 - resources, which also provide read/write access. 69 - 70 - The 'enable' file provides a counter that indicates how many times the device 71 - has been enabled. If the 'enable' file currently returns '4', and a '1' is 72 - echoed into it, it will then return '5'. Echoing a '0' into it will decrease 73 - the count. 
Even when it returns to 0, though, some of the initialisation 74 - may not be reversed. 75 - 76 - The 'rom' file is special in that it provides read-only access to the device's 77 - ROM file, if available. It's disabled by default, however, so applications 78 - should write the string "1" to the file to enable it before attempting a read 79 - call, and disable it following the access by writing "0" to the file. Note 80 - that the device must be enabled for a rom read to return data successfully. 81 - In the event a driver is not bound to the device, it can be enabled using the 82 - 'enable' file, documented above. 83 - 84 - The 'remove' file is used to remove the PCI device, by writing a non-zero 85 - integer to the file. This does not involve any kind of hot-plug functionality, 86 - e.g. powering off the device. The device is removed from the kernel's list of 87 - PCI devices, the sysfs directory for it is removed, and the device will be 88 - removed from any drivers attached to it. Removal of PCI root buses is 89 - disallowed. 90 - 91 - Accessing legacy resources through sysfs 92 - ---------------------------------------- 93 - 94 - Legacy I/O port and ISA memory resources are also provided in sysfs if the 95 - underlying platform supports them. They're located in the PCI class hierarchy, 96 - e.g. 97 - 98 - /sys/class/pci_bus/0000:17/ 99 - |-- bridge -> ../../../devices/pci0000:17 100 - |-- cpuaffinity 101 - |-- legacy_io 102 - `-- legacy_mem 103 - 104 - The legacy_io file is a read/write file that can be used by applications to 105 - do legacy port I/O. The application should open the file, seek to the desired 106 - port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem 107 - file should be mmapped with an offset corresponding to the memory offset 108 - desired, e.g. 0xa0000 for the VGA frame buffer. 
The application can then 109 - simply dereference the returned pointer (after checking for errors of course) 110 - to access legacy memory space. 111 - 112 - Supporting PCI access on new platforms 113 - -------------------------------------- 114 - 115 - In order to support PCI resource mapping as described above, Linux platform 116 - code should ideally define ARCH_GENERIC_PCI_MMAP_RESOURCE and use the generic 117 - implementation of that functionality. To support the historical interface of 118 - mmap() through files in /proc/bus/pci, platforms may also set HAVE_PCI_MMAP. 119 - 120 - Alternatively, platforms which set HAVE_PCI_MMAP may provide their own 121 - implementation of pci_mmap_page_range() instead of defining 122 - ARCH_GENERIC_PCI_MMAP_RESOURCE. 123 - 124 - Platforms which support write-combining maps of PCI resources must define 125 - arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when 126 - write-combining is permitted. Platforms which support maps of I/O resources 127 - define arch_can_pci_mmap_io() similarly. 128 - 129 - Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms 130 - wishing to support legacy functionality should define it and provide 131 - pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.

+48

Documentation/filesystems/sysfs-tagging.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============= 4 + Sysfs tagging 5 + ============= 6 + 7 + (Taken almost verbatim from Eric Biederman's netns tagging patch 8 + commit msg) 9 + 10 + The problem. Network devices show up in sysfs and with the network 11 + namespace active multiple devices with the same name can show up in 12 + the same directory, ouch! 13 + 14 + To avoid that problem and allow existing applications in network 15 + namespaces to see the same interface that is currently presented in 16 + sysfs, sysfs now has tagging directory support. 17 + 18 + By using the network namespace pointers as tags to separate out the 19 + sysfs directory entries we ensure that we don't have conflicts 20 + in the directories and applications only see a limited set of 21 + the network devices. 22 + 23 + Each sysfs directory entry may be tagged with a namespace via the 24 + ``void *ns`` member of its ``kernfs_node``. If a directory entry is tagged, 25 + then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE 26 + and KOBJ_NS_TYPES, and ns will point to the namespace to which it 27 + belongs. 28 + 29 + Each sysfs superblock's kernfs_super_info contains an array 30 + ``void *ns[KOBJ_NS_TYPES]``. When a task in a tagging namespace 31 + kobj_nstype first mounts sysfs, a new superblock is created. It 32 + will be differentiated from other sysfs mounts by having its 33 + ``s_fs_info->ns[kobj_nstype]`` set to the new namespace. Note that 34 + through bind mounting and mounts propagation, a task can easily view 35 + the contents of other namespaces' sysfs mounts. Therefore, when a 36 + namespace exits, it will call kobj_ns_exit() to invalidate any 37 + kernfs_node->ns pointers pointing to it. 38 + 39 + Users of this interface: 40 + 41 + - define a type in the ``kobj_ns_type`` enumeration. 
42 + - call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has 43 + 44 + - current_ns() which returns current's namespace 45 + - netlink_ns() which returns a socket's namespace 46 + - initial_ns() which returns the initial namespace 47 + 48 + - call kobj_ns_exit() when an individual tag is no longer valid
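The effect of tagging described in sysfs-tagging.rst can be modelled in a few lines. This is only a toy model under stated assumptions: `visible_entries`, `init_net` and `container_net` are hypothetical names, namespaces are represented by plain Python objects rather than kernel pointers, and untagged entries are assumed to be visible from every mount.

```python
def visible_entries(directory, mount_ns):
    """Return the entry names a sysfs mount tagged with `mount_ns` would show.

    `directory` is a list of (name, ns_tag) pairs; ns_tag is None for an
    untagged entry. Hypothetical model, not the kernfs implementation.
    """
    return sorted(
        name
        for name, ns_tag in directory
        if ns_tag is None or ns_tag is mount_ns
    )


# Two network namespaces, each with its own "eth0" in the same directory.
init_net = object()
container_net = object()
class_net = [
    ("eth0", init_net),
    ("wlan0", init_net),
    ("eth0", container_net),
    ("lo", container_net),
]
```

Each mount then sees exactly one `eth0`, which is the conflict the tagging support is meant to avoid.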

-42

Documentation/filesystems/sysfs-tagging.txt

··· 1 - Sysfs tagging 2 - ------------- 3 - 4 - (Taken almost verbatim from Eric Biederman's netns tagging patch 5 - commit msg) 6 - 7 - The problem. Network devices show up in sysfs and with the network 8 - namespace active multiple devices with the same name can show up in 9 - the same directory, ouch! 10 - 11 - To avoid that problem and allow existing applications in network 12 - namespaces to see the same interface that is currently presented in 13 - sysfs, sysfs now has tagging directory support. 14 - 15 - By using the network namespace pointers as tags to separate out the 16 - the sysfs directory entries we ensure that we don't have conflicts 17 - in the directories and applications only see a limited set of 18 - the network devices. 19 - 20 - Each sysfs directory entry may be tagged with a namespace via the 21 - void *ns member of its kernfs_node. If a directory entry is tagged, 22 - then kernfs_node->flags will have a flag between KOBJ_NS_TYPE_NONE 23 - and KOBJ_NS_TYPES, and ns will point to the namespace to which it 24 - belongs. 25 - 26 - Each sysfs superblock's kernfs_super_info contains an array void 27 - *ns[KOBJ_NS_TYPES]. When a task in a tagging namespace 28 - kobj_nstype first mounts sysfs, a new superblock is created. It 29 - will be differentiated from other sysfs mounts by having its 30 - s_fs_info->ns[kobj_nstype] set to the new namespace. Note that 31 - through bind mounting and mounts propagation, a task can easily view 32 - the contents of other namespaces' sysfs mounts. Therefore, when a 33 - namespace exits, it will call kobj_ns_exit() to invalidate any 34 - kernfs_node->ns pointers pointing to it. 35 - 36 - Users of this interface: 37 - - define a type in the kobj_ns_type enumeration. 
38 - - call kobj_ns_type_register() with its kobj_ns_type_operations which has 39 - - current_ns() which returns current's namespace 40 - - netlink_ns() which returns a socket's namespace 41 - - initial_ns() which returns the initial namesapce 42 - - call kobj_ns_exit() when an individual tag is no longer valid

+1 -1

Documentation/filesystems/sysfs.rst

··· 20 20 linkages between them to userspace. 21 21 22 22 sysfs is tied inherently to the kobject infrastructure. Please read 23 - Documentation/kobject.txt for more information concerning the kobject 23 + Documentation/core-api/kobject.rst for more information concerning the kobject 24 24 interface. 25 25 26 26

+804

Documentation/filesystems/xfs-delayed-logging-design.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========================== 4 + XFS Delayed Logging Design 5 + ========================== 6 + 7 + Introduction to Re-logging in XFS 8 + ================================= 9 + 10 + XFS logging is a combination of logical and physical logging. Some objects, 11 + such as inodes and dquots, are logged in logical format where the details 12 + logged are made up of the changes to in-core structures rather than on-disk 13 + structures. Other objects - typically buffers - have their physical changes 14 + logged. The reason for these differences is to reduce the amount of log space 15 + required for objects that are frequently logged. Some parts of inodes are more 16 + frequently logged than others, and inodes are typically more frequently logged 17 + than any other object (except maybe the superblock buffer) so keeping the 18 + amount of metadata logged low is of prime importance. 19 + 20 + The reason that this is such a concern is that XFS allows multiple separate 21 + modifications to a single object to be carried in the log at any given time. 22 + This allows the log to avoid needing to flush each change to disk before 23 + recording a new change to the object. XFS does this via a method called 24 + "re-logging". Conceptually, this is quite simple - all it requires is that any 25 + new change to the object is recorded with a *new copy* of all the existing 26 + changes in the new transaction that is written to the log. 
27 + 28 + That is, if we have a sequence of changes A through to F, and the object was 29 + written to disk after change D, we would see in the log the following series 30 + of transactions, their contents and the log sequence number (LSN) of the 31 + transaction:: 32 + 33 + Transaction Contents LSN 34 + A A X 35 + B A+B X+n 36 + C A+B+C X+n+m 37 + D A+B+C+D X+n+m+o 38 + <object written to disk> 39 + E E Y (> X+n+m+o) 40 + F E+F Y+p 41 + 42 + In other words, each time an object is relogged, the new transaction contains 43 + the aggregation of all the previous changes currently held only in the log. 44 + 45 + This relogging technique also allows objects to be moved forward in the log so 46 + that an object being relogged does not prevent the tail of the log from ever 47 + moving forward. This can be seen in the table above by the changing 48 + (increasing) LSN of each subsequent transaction - the LSN is effectively a 49 + direct encoding of the location in the log of the transaction. 50 + 51 + This relogging is also used to implement long-running, multiple-commit 52 + transactions. These transactions are known as rolling transactions, and require 53 + a special log reservation known as a permanent transaction reservation. A 54 + typical example of a rolling transaction is the removal of extents from an 55 + inode which can only be done at a rate of two extents per transaction because 56 + of reservation size limitations. Hence a rolling extent removal transaction 57 + keeps relogging the inode and btree buffers as they get modified in each 58 + removal operation. This keeps them moving forward in the log as the operation 59 + progresses, ensuring that the current operation never gets blocked by itself if the 60 + log wraps around. 61 + 62 + Hence it can be seen that the relogging operation is fundamental to the correct 63 + working of the XFS journalling subsystem. 
From the above description, most 64 + people should be able to see why the XFS metadata operations write so much to 65 + the log - repeated operations to the same objects write the same changes to 66 + the log over and over again. Worse is the fact that objects tend to get 67 + dirtier as they get relogged, so each subsequent transaction is writing more 68 + metadata into the log. 69 + 70 + Another feature of the XFS transaction subsystem is that most transactions are 71 + asynchronous. That is, they don't commit to disk until either a log buffer is 72 + filled (a log buffer can hold multiple transactions) or a synchronous operation 73 + forces the log buffers holding the transactions to disk. This means that XFS is 74 + doing aggregation of transactions in memory - batching them, if you like - to 75 + minimise the impact of the log IO on transaction throughput. 76 + 77 + The limitation on asynchronous transaction throughput is the number and size of 78 + log buffers made available by the log manager. By default there are 8 log 79 + buffers available and the size of each is 32kB - the size can be increased up 80 + to 256kB by use of a mount option. 81 + 82 + Effectively, this gives us the maximum bound of outstanding metadata changes 83 + that can be made to the filesystem at any point in time - if all the log 84 + buffers are full and under IO, then no more transactions can be committed until 85 + the current batch completes. It is now common for a single current CPU core to 86 + be able to issue enough transactions to keep the log buffers full and under 87 + IO permanently. Hence the XFS journalling subsystem can be considered to be IO 88 + bound. 89 + 90 + Delayed Logging: Concepts 91 + ========================= 92 + 93 + The key thing to note about the asynchronous logging combined with the 94 + relogging technique XFS uses is that we can be relogging changed objects 95 + multiple times before they are committed to disk in the log buffers. 
If we 96 + return to the previous relogging example, it is entirely possible that 97 + transactions A through D are committed to disk in the same log buffer. 98 + 99 + That is, a single log buffer may contain multiple copies of the same object, 100 + but only one of those copies needs to be there - the last one "D", as it 101 + contains all the changes from the previous changes. In other words, we have one 102 + necessary copy in the log buffer, and three stale copies that are simply 103 + wasting space. When we are doing repeated operations on the same set of 104 + objects, these "stale objects" can be over 90% of the space used in the log 105 + buffers. It is clear that reducing the number of stale objects written to the 106 + log would greatly reduce the amount of metadata we write to the log, and this 107 + is the fundamental goal of delayed logging. 108 + 109 + From a conceptual point of view, XFS is already doing relogging in memory (where 110 + memory == log buffer), only it is doing it extremely inefficiently. It is using 111 + logical to physical formatting to do the relogging because there is no 112 + infrastructure to keep track of logical changes in memory prior to physically 113 + formatting the changes in a transaction to the log buffer. Hence we cannot avoid 114 + accumulating stale objects in the log buffers. 115 + 116 + Delayed logging is the name we've given to keeping and tracking transactional 117 + changes to objects in memory outside the log buffer infrastructure. Because of 118 + the relogging concept fundamental to the XFS journalling subsystem, this is 119 + actually relatively easy to do - all the changes to logged items are already 120 + tracked in the current infrastructure. The big problem is how to accumulate 121 + them and get them to the log in a consistent, recoverable manner. 122 + Describing the problems and how they have been solved is the focus of this 123 + document. 
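The relogging example and the stale-copy argument above can be reproduced with a small simulation. This is a sketch under stated assumptions: `relog` and `stale_copies` are hypothetical names, a change is just a label, and every transaction passed to `stale_copies` is assumed to log the same object with no write to disk in between.

```python
def relog(changes, written_to_disk_after=None):
    """Return the contents of each transaction when an object is relogged.

    Each new transaction carries every change currently held only in the
    log. Illustrative model only, not the XFS implementation.
    """
    transactions = []
    pending = []  # changes that so far exist only in the log
    for change in changes:
        pending.append(change)
        transactions.append(list(pending))
        if change == written_to_disk_after:
            # The object is now clean on disk, so later transactions
            # no longer need to carry the earlier changes.
            pending = []
    return transactions


def stale_copies(transactions_in_buffer):
    """Count copies in one log buffer that a later copy supersedes."""
    return max(len(transactions_in_buffer) - 1, 0)
```

With changes A through F and a write to disk after D, `relog("ABCDEF", "D")` yields the same contents column as the earlier table, and a log buffer holding transactions A through D contains three stale copies alongside the one that matters.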
124 + 125 + One of the key changes that delayed logging makes to the operation of the 126 + journalling subsystem is that it disassociates the amount of outstanding 127 + metadata changes from the size and number of log buffers available. In other 128 + words, instead of there only being a maximum of 2MB of transaction changes not 129 + written to the log at any point in time, there may be a much greater amount 130 + being accumulated in memory. Hence the potential for loss of metadata on a 131 + crash is much greater than for the existing logging mechanism. 132 + 133 + It should be noted that this does not change the guarantee that log recovery 134 + will result in a consistent filesystem. What it does mean is that as far as the 135 + recovered filesystem is concerned, there may be many thousands of transactions 136 + that simply did not occur as a result of the crash. This makes it even more 137 + important that applications that care about their data use fsync() where they 138 + need to ensure application level data integrity is maintained. 139 + 140 + It should be noted that delayed logging is not an innovative new concept that 141 + warrants rigorous proofs to determine whether it is correct or not. The method 142 + of accumulating changes in memory for some period before writing them to the 143 + log is used effectively in many filesystems including ext3 and ext4. Hence 144 + no time is spent in this document trying to convince the reader that the 145 + concept is sound. Instead it is simply considered a "solved problem" and as 146 + such implementing it in XFS is purely an exercise in software engineering. 147 + 148 + The fundamental requirements for delayed logging in XFS are simple: 149 + 150 + 1. Reduce the amount of metadata written to the log by at least 151 + an order of magnitude. 152 + 2. Supply sufficient statistics to validate Requirement #1. 153 + 3. Supply sufficient new tracing infrastructure to be able to debug 154 + problems with the new code. 
155 + 4. No on-disk format change (metadata or log format). 156 + 5. Enable and disable with a mount option. 157 + 6. No performance regressions for synchronous transaction workloads. 158 + 159 + Delayed Logging: Design 160 + ======================= 161 + 162 + Storing Changes 163 + --------------- 164 + 165 + The problem with accumulating changes at a logical level (i.e. just using the 166 + existing log item dirty region tracking) is that when it comes to writing the 167 + changes to the log buffers, we need to ensure that the object we are formatting 168 + is not changing while we do this. This requires locking the object to prevent 169 + concurrent modification. Hence flushing the logical changes to the log would 170 + require us to lock every object, format them, and then unlock them again. 171 + 172 + This introduces lots of scope for deadlocks with transactions that are already 173 + running. For example, a transaction has object A locked and modified, but needs 174 + the delayed logging tracking lock to commit the transaction. However, the 175 + flushing thread has the delayed logging tracking lock already held, and is 176 + trying to get the lock on object A to flush it to the log buffer. This appears 177 + to be an unsolvable deadlock condition, and it was solving this problem that 178 + was the barrier to implementing delayed logging for so long. 179 + 180 + The solution is relatively simple - it just took a long time to recognise it. 181 + Put simply, the current logging code formats the changes to each item into a 182 + vector array that points to the changed regions in the item. The log write code 183 + simply copies the memory these vectors point to into the log buffer during 184 + transaction commit while the item is locked in the transaction. Instead of 185 + using the log buffer as the destination of the formatting code, we can use an 186 + allocated memory buffer big enough to fit the formatted vector.
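As a rough illustration of formatting into a private memory buffer instead of the log buffer, the sketch below copies the changed regions of an object into a new buffer and returns vectors that point into that buffer. The name `format_to_buffer` and the (offset, length) representation of a vector are assumptions made for this example; the real log vectors are kernel structures.

```python
def format_to_buffer(obj, regions):
    """Copy the dirty regions of `obj` into a private buffer.

    `regions` is a list of (offset, length) pairs describing changed
    regions of `obj`. Returns (buffer, vectors), where each vector is an
    (offset, length) pair into the returned buffer rather than into `obj`.
    Hypothetical model for illustration only.
    """
    buf = bytearray()
    vectors = []
    for offset, length in regions:
        # Rewrite the vector so it points at the copy, not at the object.
        vectors.append((len(buf), length))
        buf += obj[offset:offset + length]
    return bytes(buf), vectors
```

Because the copy is taken while the caller still holds the object, later modifications to the object do not alter the returned buffer, which is the property that lets it be read without locking the owning item.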
187 + 188 + If we then copy the vector into the memory buffer and rewrite the vector to 189 + point to the memory buffer rather than the object itself, we now have a copy of 190 + the changes in a format that is compatible with the log buffer writing code 191 + that does not require us to lock the item to access. This formatting and 192 + rewriting can all be done while the object is locked during transaction commit, 193 + resulting in a vector that is transactionally consistent and can be accessed 194 + without needing to lock the owning item. 195 + 196 + Hence we avoid the need to lock items when we need to flush outstanding 197 + asynchronous transactions to the log. The differences between the existing 198 + formatting method and the delayed logging formatting can be seen in the 199 + diagram below. 200 + 201 + Current format log vector:: 202 + 203 + Object +---------------------------------------------+ 204 + Vector 1 +----+ 205 + Vector 2 +----+ 206 + Vector 3 +----------+ 207 + 208 + After formatting:: 209 + 210 + Log Buffer +-V1-+-V2-+----V3----+ 211 + 212 + Delayed logging vector:: 213 + 214 + Object +---------------------------------------------+ 215 + Vector 1 +----+ 216 + Vector 2 +----+ 217 + Vector 3 +----------+ 218 + 219 + After formatting:: 220 + 221 + Memory Buffer +-V1-+-V2-+----V3----+ 222 + Vector 1 +----+ 223 + Vector 2 +----+ 224 + Vector 3 +----------+ 225 + 226 + The memory buffer and associated vector need to be passed as a single object, 227 + but still need to be associated with the parent object so if the object is 228 + relogged we can replace the current memory buffer with a new memory buffer that 229 + contains the latest changes. 230 + 231 + The reason for keeping the vector around after we've formatted the memory 232 + buffer is to support splitting vectors across log buffer boundaries correctly. 
233 + If we don't keep the vector around, we do not know where the region boundaries 234 + are in the item, so we'd need a new encapsulation method for regions in the log 235 + buffer writing (i.e. double encapsulation). This would be an on-disk format 236 + change and as such is not desirable. It also means we'd have to write the log 237 + region headers in the formatting stage, which is problematic as there is per 238 + region state that needs to be placed into the headers during the log write. 239 + 240 + Hence we need to keep the vector, but by attaching the memory buffer to it and 241 + rewriting the vector addresses to point at the memory buffer we end up with a 242 + self-describing object that can be passed to the log buffer write code to be 243 + handled in exactly the same manner as the existing log vectors are handled. 244 + Hence we avoid needing a new on-disk format to handle items that have been 245 + relogged in memory. 246 + 247 + 248 + Tracking Changes 249 + ---------------- 250 + 251 + Now that we can record transactional changes in memory in a form that allows 252 + them to be used without limitations, we need to be able to track and accumulate 253 + them so that they can be written to the log at some later point in time. The 254 + log item is the natural place to store this vector and buffer, and also makes sense 255 + to be the object that is used to track committed objects as it will always 256 + exist once the object has been included in a transaction. 257 + 258 + The log item is already used to track the log items that have been written to 259 + the log but not yet written to disk. Such log items are considered "active" 260 + and as such are stored in the Active Item List (AIL) which is a LSN-ordered 261 + double linked list. Items are inserted into this list during log buffer IO 262 + completion, after which they are unpinned and can be written to disk. 
An object 263 + that is in the AIL can be relogged, which causes the object to be pinned again 264 + and then moved forward in the AIL when the log buffer IO completes for that 265 + transaction. 266 + 267 + Essentially, this shows that an item that is in the AIL can still be modified 268 + and relogged, so any tracking must be separate to the AIL infrastructure. As 269 + such, we cannot reuse the AIL list pointers for tracking committed items, nor 270 + can we store state in any field that is protected by the AIL lock. Hence the 271 + committed item tracking needs its own locks, lists and state fields in the log 272 + item. 273 + 274 + Similar to the AIL, tracking of committed items is done through a new list 275 + called the Committed Item List (CIL). The list tracks log items that have been 276 + committed and have formatted memory buffers attached to them. It tracks objects 277 + in transaction commit order, so when an object is relogged it is removed from 278 + its place in the list and re-inserted at the tail. This is entirely arbitrary 279 + and done to make it easy for debugging - the last items in the list are the 280 + ones that are most recently modified. Ordering of the CIL is not necessary for 281 + transactional integrity (as discussed in the next section) so the ordering is 282 + done for convenience/sanity of the developers. 283 + 284 + 285 + Delayed Logging: Checkpoints 286 + ---------------------------- 287 + 288 + When we have a log synchronisation event, commonly known as a "log force", 289 + all the items in the CIL must be written into the log via the log buffers. 290 + We need to write these items in the order that they exist in the CIL, and they 291 + need to be written as an atomic transaction. 
The need for all the objects to be 292 + written as an atomic transaction comes from the requirements of relogging and 293 + log replay - all the changes in all the objects in a given transaction must 294 + either be completely replayed during log recovery, or not replayed at all. If 295 + a transaction is not replayed because it is not complete in the log, then 296 + no later transactions should be replayed, either. 297 + 298 + To fulfill this requirement, we need to write the entire CIL in a single log 299 + transaction. Fortunately, the XFS log code has no fixed limit on the size of a 300 + transaction, nor does the log replay code. The only fundamental limit is that 301 + the transaction cannot be larger than just under half the size of the log. The 302 + reason for this limit is that to find the head and tail of the log, there must 303 + be at least one complete transaction in the log at any given time. If a 304 + transaction is larger than half the log, then there is the possibility that a 305 + crash during the write of such a transaction could partially overwrite the 306 + only complete previous transaction in the log. This will result in a recovery 307 + failure and an inconsistent filesystem and hence we must enforce the maximum 308 + size of a checkpoint to be slightly less than half the log. 309 + 310 + Apart from this size requirement, a checkpoint transaction looks no different 311 + to any other transaction - it contains a transaction header, a series of 312 + formatted log items and a commit record at the tail. From a recovery 313 + perspective, the checkpoint transaction is also no different - just a lot 314 + bigger with a lot more items in it. The worst case effect of this is that we 315 + might need to tune the recovery transaction object hash size. 
316 + 317 + Because the checkpoint is just another transaction and all the changes to log 318 + items are stored as log vectors, we can use the existing log buffer writing 319 + code to write the changes into the log. To do this efficiently, we need to 320 + minimise the time we hold the CIL locked while writing the checkpoint 321 + transaction. The current log write code enables us to do this easily with the 322 + way it separates the writing of the transaction contents (the log vectors) from 323 + the transaction commit record, but tracking this requires us to have a 324 + per-checkpoint context that travels through the log write process through to 325 + checkpoint completion. 326 + 327 + Hence a checkpoint has a context that tracks the state of the current 328 + checkpoint from initiation to checkpoint completion. A new context is initiated 329 + at the same time a checkpoint transaction is started. That is, when we remove 330 + all the current items from the CIL during a checkpoint operation, we move all 331 + those changes into the current checkpoint context. We then initialise a new 332 + context and attach that to the CIL for aggregation of new transactions. 333 + 334 + This allows us to unlock the CIL immediately after transfer of all the 335 + committed items and effectively allow new transactions to be issued while we 336 + are formatting the checkpoint into the log. It also allows concurrent 337 + checkpoints to be written into the log buffers in the case of log force heavy 338 + workloads, just like the existing transaction commit code does. This, however, 339 + requires that we strictly order the commit records in the log so that 340 + checkpoint sequence order is maintained during log replay. 
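The context switch described above can be sketched in a few lines of C. All structure and function names here are hypothetical and kept deliberately small; locking is omitted, and the caller is assumed to hold whatever lock excludes transaction commits while the switch happens.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal log item: only the CIL list linkage is modelled. */
struct log_item {
    struct log_item *next;
};

/* Hypothetical checkpoint context: owns the items captured for one checkpoint. */
struct ckpt_ctx {
    struct log_item *items;
};

/* Hypothetical CIL: a list of committed items plus the context aggregating them. */
struct cil {
    struct ckpt_ctx *ctx;
    struct log_item *head;
};

/*
 * Move every item currently in the CIL into the outgoing context, then
 * attach a fresh context for new transactions. The outgoing context is
 * returned so the caller can format it into the log after unlocking.
 */
static struct ckpt_ctx *cil_switch_context(struct cil *cil,
                                           struct ckpt_ctx *fresh)
{
    struct ckpt_ctx *outgoing = cil->ctx;

    assert(outgoing != NULL && fresh != NULL);
    outgoing->items = cil->head; /* transfer the whole list in one step */
    cil->head = NULL;            /* the CIL is now empty */
    fresh->items = NULL;
    cil->ctx = fresh;            /* new commits aggregate here */
    return outgoing;
}
```

Because the transfer is a pointer swap rather than a copy, the time the CIL is held locked does not depend on how long it takes to write the checkpoint.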
341 + 342 + To ensure that we can be writing an item into a checkpoint transaction at 343 + the same time another transaction modifies the item and inserts the log item 344 + into the new CIL, then checkpoint transaction commit code cannot use log items 345 + to store the list of log vectors that need to be written into the transaction. 346 + Hence log vectors need to be able to be chained together to allow them to be 347 + detached from the log items. That is, when the CIL is flushed the memory 348 + buffer and log vector attached to each log item needs to be attached to the 349 + checkpoint context so that the log item can be released. In diagrammatic form, 350 + the CIL would look like this before the flush:: 351 + 352 + CIL Head 353 + | 354 + V 355 + Log Item <-> log vector 1 -> memory buffer 356 + | -> vector array 357 + V 358 + Log Item <-> log vector 2 -> memory buffer 359 + | -> vector array 360 + V 361 + ...... 362 + | 363 + V 364 + Log Item <-> log vector N-1 -> memory buffer 365 + | -> vector array 366 + V 367 + Log Item <-> log vector N -> memory buffer 368 + -> vector array 369 + 370 + And after the flush the CIL head is empty, and the checkpoint context log 371 + vector list would look like:: 372 + 373 + Checkpoint Context 374 + | 375 + V 376 + log vector 1 -> memory buffer 377 + | -> vector array 378 + | -> Log Item 379 + V 380 + log vector 2 -> memory buffer 381 + | -> vector array 382 + | -> Log Item 383 + V 384 + ...... 385 + | 386 + V 387 + log vector N-1 -> memory buffer 388 + | -> vector array 389 + | -> Log Item 390 + V 391 + log vector N -> memory buffer 392 + -> vector array 393 + -> Log Item 394 + 395 + Once this transfer is done, the CIL can be unlocked and new transactions can 396 + start, while the checkpoint flush code works over the log vector chain to 397 + commit the checkpoint. 
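The transfer shown in the two diagrams above can be sketched as follows. This is a simplified model under stated assumptions: the structure and field names are hypothetical, the memory buffer and vector array are left out, and the CIL flush lock is assumed to be held by the caller.

```c
#include <assert.h>
#include <stddef.h>

struct log_item;

/* Hypothetical log vector: chain link plus a back pointer to its log item. */
struct log_vec {
    struct log_vec *next;
    struct log_item *item;
};

/* Hypothetical log item: CIL linkage and the currently attached log vector. */
struct log_item {
    struct log_item *cil_next;
    struct log_vec *lv;
};

struct ckpt_ctx {
    struct log_vec *lv_chain; /* vectors to write, in CIL order */
};

/*
 * Walk the CIL, break the item -> vector link, and chain the vectors onto
 * the checkpoint context in the same order. Returns the number of vectors
 * moved. Afterwards the CIL is empty and each item is free to be relogged
 * with a new vector.
 */
static size_t cil_detach_vectors(struct log_item **cil_head,
                                 struct ckpt_ctx *ctx)
{
    struct log_vec **tail = &ctx->lv_chain;
    size_t moved = 0;

    assert(ctx->lv_chain == NULL); /* context must start with an empty chain */
    while (*cil_head != NULL) {
        struct log_item *item = *cil_head;
        struct log_vec *lv = item->lv;

        *cil_head = item->cil_next; /* remove the item from the CIL */
        item->cil_next = NULL;
        if (lv == NULL)
            continue;               /* defensive: nothing attached */
        item->lv = NULL;            /* item no longer references the vector */
        lv->item = item;            /* vector keeps the item for completion */
        lv->next = NULL;
        *tail = lv;                 /* append, preserving CIL order */
        tail = &lv->next;
        moved++;
    }
    return moved;
}
```

The back pointer from vector to item is what later allows checkpoint completion to find each item again for AIL insertion and unpinning.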
398 + 399 + Once the checkpoint is written into the log buffers, the checkpoint context is 400 + attached to the log buffer that the commit record was written to along with a 401 + completion callback. Log IO completion will call that callback, which can then 402 + run transaction committed processing for the log items (i.e. insert into AIL 403 + and unpin) in the log vector chain and then free the log vector chain and 404 + checkpoint context. 405 + 406 + Discussion Point: I am uncertain as to whether the log item is the most 407 + efficient way to track vectors, even though it seems like the natural way to do 408 + it. The fact that we walk the log items (in the CIL) just to chain the log 409 + vectors and break the link between the log item and the log vector means that 410 + we take a cache line hit for the log item list modification, then another for 411 + the log vector chaining. If we track by the log vectors, then we only need to 412 + break the link between the log item and the log vector, which means we should 413 + dirty only the log item cachelines. Normally I wouldn't be concerned about one 414 + vs two dirty cachelines except for the fact I've seen upwards of 80,000 log 415 + vectors in one checkpoint transaction. I'd guess this is a "measure and 416 + compare" situation that can be done after a working and reviewed implementation 417 + is in the dev tree.... 418 + 419 + Delayed Logging: Checkpoint Sequencing 420 + -------------------------------------- 421 + 422 + One of the key aspects of the XFS transaction subsystem is that it tags 423 + committed transactions with the log sequence number of the transaction commit. 424 + This allows transactions to be issued asynchronously even though there may be 425 + future operations that cannot be completed until that transaction is fully 426 + committed to the log. In the rare case that a dependent operation occurs (e.g. 
427 + re-using a freed metadata extent for a data extent), a special, optimised log 428 + force can be issued to force the dependent transaction to disk immediately. 429 + 430 + To do this, transactions need to record the LSN of the commit record of the 431 + transaction. This LSN comes directly from the log buffer the transaction is 432 + written into. While this works just fine for the existing transaction 433 + mechanism, it does not work for delayed logging because transactions are not 434 + written directly into the log buffers. Hence some other method of sequencing 435 + transactions is required. 436 + 437 + As discussed in the checkpoint section, delayed logging uses per-checkpoint 438 + contexts, and as such it is simple to assign a sequence number to each 439 + checkpoint. Because the switching of checkpoint contexts must be done 440 + atomically, it is simple to ensure that each new context has a monotonically 441 + increasing sequence number assigned to it without the need for an external 442 + atomic counter - we can just take the current context sequence number and add 443 + one to it for the new context. 444 + 445 + Then, instead of assigning a log buffer LSN to the transaction commit LSN 446 + during the commit, we can assign the current checkpoint sequence. This allows 447 + operations that track transactions that have not yet completed know what 448 + checkpoint sequence needs to be committed before they can continue. As a 449 + result, the code that forces the log to a specific LSN now needs to ensure that 450 + the log forces to a specific checkpoint. 451 + 452 + To ensure that we can do this, we need to track all the checkpoint contexts 453 + that are currently committing to the log. When we flush a checkpoint, the 454 + context gets added to a "committing" list which can be searched. When a 455 + checkpoint commit completes, it is removed from the committing list. 
Because 456 + the checkpoint context records the LSN of the commit record for the checkpoint, 457 + we can also wait on the log buffer that contains the commit record, thereby 458 + using the existing log force mechanisms to execute synchronous forces. 459 + 460 + It should be noted that the synchronous forces may need to be extended with 461 + mitigation algorithms similar to the current log buffer code to allow 462 + aggregation of multiple synchronous transactions if there are already 463 + synchronous transactions being flushed. Investigation of the performance of the 464 + current design is needed before making any decisions here. 465 + 466 + The main concern with log forces is to ensure that all the previous checkpoints 467 + are also committed to disk before the one we need to wait for. Therefore we 468 + need to check that all the prior contexts in the committing list are also 469 + complete before waiting on the one we need to complete. We do this 470 + synchronisation in the log force code so that we don't need to wait anywhere 471 + else for such serialisation - it only matters when we do a log force. 472 + 473 + The only remaining complexity is that a log force now also has to handle the 474 + case where the forcing sequence number is the same as the current context. That 475 + is, we need to flush the CIL and potentially wait for it to complete. This is a 476 + simple addition to the existing log forcing code to check the sequence numbers 477 + and push if required. Indeed, placing the current sequence checkpoint flush in 478 + the log force code enables the current mechanism for issuing synchronous 479 + transactions to remain untouched (i.e. commit an asynchronous transaction, then 480 + force the log at the LSN of that transaction) and so the higher level code 481 + behaves the same regardless of whether delayed logging is being used or not. 
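The sequencing rules in this section can be summarised in a short sketch. Everything named here is hypothetical (the structures, the `force_action` enum and both functions); it models only the decision a log force has to make, not the waiting itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checkpoint context with its sequence number. */
struct ckpt_ctx {
    uint64_t sequence;
    struct ckpt_ctx *next_committing; /* link on the committing list */
};

struct cil {
    struct ckpt_ctx *ctx;        /* context aggregating new commits */
    struct ckpt_ctx *committing; /* flushed, but commit not yet complete */
};

enum force_action {
    FORCE_DONE,          /* nothing at or before seq is outstanding */
    FORCE_WAIT,          /* wait for committing contexts at or before seq */
    FORCE_PUSH_AND_WAIT  /* seq is the current context: flush the CIL first */
};

/* Contexts are switched atomically, so "current + 1" needs no atomic counter. */
static uint64_t next_sequence(const struct ckpt_ctx *current)
{
    return current->sequence + 1;
}

/* Decide what a log force to checkpoint sequence seq has to do. */
static enum force_action log_force_action(const struct cil *cil, uint64_t seq)
{
    const struct ckpt_ctx *c;

    assert(cil->ctx != NULL);
    if (seq == cil->ctx->sequence)
        return FORCE_PUSH_AND_WAIT;
    /* All earlier checkpoints must be complete too, not only seq itself. */
    for (c = cil->committing; c != NULL; c = c->next_committing)
        if (c->sequence <= seq)
            return FORCE_WAIT;
    return FORCE_DONE;
}
```

Keeping this decision inside the log force path is what lets higher level code keep its existing "commit asynchronously, then force" pattern unchanged.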
482 + 483 + Delayed Logging: Checkpoint Log Space Accounting 484 + ------------------------------------------------ 485 + 486 + The big issue for a checkpoint transaction is the log space reservation for the 487 + transaction. We don't know how big a checkpoint transaction is going to be 488 + ahead of time, nor how many log buffers it will take to write out, nor how 489 + many split log vector regions are going to be used. We can track the 490 + amount of log space required as we add items to the commit item list, but we 491 + still need to reserve the space in the log for the checkpoint. 492 + 493 + A typical transaction reserves enough space in the log for the worst case space 494 + usage of the transaction. The reservation accounts for log record headers, 495 + transaction and region headers, headers for split regions, buffer tail padding, 496 + etc. as well as the actual space for all the changed metadata in the 497 + transaction. While some of this is fixed overhead, much of it is dependent on 498 + the size of the transaction and the number of regions being logged (the number 499 + of log vectors in the transaction). 500 + 501 + An example of the differences would be logging directory changes versus logging 502 + inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then 503 + there are lots of transactions that only contain an inode core and an inode log 504 + format structure. That is, two vectors totaling roughly 150 bytes. If we modify 505 + 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each 506 + vector is 12 bytes, so the total to be logged is approximately 1.75MB. In 507 + comparison, if we are logging full directory buffers, they are typically 4KB 508 + each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a 509 + buffer format structure for each buffer - roughly 800 vectors or 1.51MB total 510 + space. 
From this, it should be obvious that a static log space reservation is 511 + not particularly flexible and is difficult to select the "optimal value" for 512 + all workloads. 513 + 514 + Further, if we are going to use a static reservation, which bit of the entire 515 + reservation does it cover? We account for space used by the transaction 516 + reservation by tracking the space currently used by the object in the CIL and 517 + then calculating the increase or decrease in space used as the object is 518 + relogged. This allows for a checkpoint reservation to only have to account for 519 + log buffer metadata used such as log header records. 520 + 521 + However, even using a static reservation for just the log metadata is 522 + problematic. Typically log record headers use at least 16KB of log space per 523 + 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be 524 + large enough to handle arbitrary sized checkpoint transactions. This 525 + reservation needs to be made before the checkpoint is started, and we need to 526 + be able to reserve the space without sleeping. For an 8MB checkpoint, we need a 527 + reservation of around 150KB, which is a non-trivial amount of space. 528 + 529 + A static reservation needs to manipulate the log grant counters - we can take a 530 + permanent reservation on the space, but we still need to make sure we refresh 531 + the write reservation (the actual space available to the transaction) after 532 + every checkpoint transaction completion. Unfortunately, if this space is not 533 + available when required, then the regrant code will sleep waiting for it. 534 + 535 + The problem with this is that it can lead to deadlocks as we may need to commit 536 + checkpoints to be able to free up log space (refer back to the description of 537 + rolling transactions for an example of this). 
Hence we *must* always have 538 + space available in the log if we are to use static reservations, and that is 539 + very difficult and complex to arrange. It is possible to do, but there is a 540 + simpler way. 541 + 542 + The simpler way of doing this is tracking the entire log space used by the 543 + items in the CIL and using this to dynamically calculate the amount of log 544 + space required by the log metadata. If this log metadata space changes as a 545 + result of a transaction commit inserting a new memory buffer into the CIL, then 546 + the difference in space required is removed from the transaction that causes 547 + the change. Transactions at this level will *always* have enough space 548 + available in their reservation for this as they have already reserved the 549 + maximal amount of log metadata space they require, and such a delta reservation 550 + will always be less than or equal to the maximal amount in the reservation. 551 + 552 + Hence we can grow the checkpoint transaction reservation dynamically as items 553 + are added to the CIL and avoid the need for reserving and regranting log space 554 + up front. This avoids deadlocks and removes a blocking point from the 555 + checkpoint flush code. 556 + 557 + As mentioned earlier, transactions can't grow to more than half the size of the 558 + log. Hence as part of the reservation growing, we need to also check the size 559 + of the reservation against the maximum allowed transaction size. If we reach 560 + the maximum threshold, we need to push the CIL to the log. This is effectively 561 + a "background flush" and is done on demand. This is identical to 562 + a CIL push triggered by a log force, only that there is no waiting for the 563 + checkpoint commit to complete. This background push is checked and executed by 564 + transaction commit code. 
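The dynamic reservation scheme can be sketched as below. This is a simplified model, not the XFS implementation: the names are hypothetical, only log record headers are counted as log metadata (using the "512 bytes per 32k" figure from the text), and only growth of the CIL is modelled, not the shrinkage that relogging can cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOG_RECORD_BYTES 32768u /* one log record, per "512 bytes per 32k" */
#define LOG_HEADER_BYTES 512u   /* header overhead for each log record */

/* Hypothetical running totals for the items currently in the CIL. */
struct cil_space {
    uint64_t item_bytes;   /* formatted item space held in the CIL */
    uint64_t header_bytes; /* log metadata space already taken from commits */
};

/* Log record header space needed to write item_bytes of checkpoint data. */
static uint64_t header_bytes_for(uint64_t item_bytes)
{
    uint64_t records = (item_bytes + LOG_RECORD_BYTES - 1) / LOG_RECORD_BYTES;

    return records * LOG_HEADER_BYTES;
}

/*
 * Account for new_bytes of formatted data entering the CIL. Returns the
 * extra log metadata space to remove from the committing transaction's
 * reservation; this is zero when no additional log record is needed.
 */
static uint64_t cil_account_insert(struct cil_space *cs, uint64_t new_bytes)
{
    uint64_t needed, delta;

    cs->item_bytes += new_bytes;
    needed = header_bytes_for(cs->item_bytes);
    assert(needed >= cs->header_bytes); /* growth-only model */
    delta = needed - cs->header_bytes;
    cs->header_bytes = needed;
    return delta;
}

/* True once the checkpoint has reached the maximum allowed size. */
static bool cil_needs_background_push(const struct cil_space *cs,
                                      uint64_t max_checkpoint_bytes)
{
    return cs->item_bytes + cs->header_bytes >= max_checkpoint_bytes;
}
```

In this model 1MB of item data needs 32 record headers, or 16KB, which matches the figure quoted earlier in the section. Most commits return a delta of zero, so the common case costs the committing transaction nothing.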
565 + 566 + If the transaction subsystem goes idle while we still have items in the CIL, 567 + they will be flushed by the periodic log force issued by the xfssyncd. This log 568 + force will push the CIL to disk, and if the transaction subsystem stays idle, 569 + allow the idle log to be covered (effectively marked clean) in exactly the same 570 + manner that is done for the existing logging method. A discussion point is 571 + whether this log force needs to be done more frequently than the current rate 572 + which is once every 30s. 573 + 574 + 575 + Delayed Logging: Log Item Pinning 576 + --------------------------------- 577 + 578 + Currently log items are pinned during transaction commit while the items are 579 + still locked. This happens just after the items are formatted, though it could 580 + be done any time before the items are unlocked. The result of this mechanism is 581 + that items get pinned once for every transaction that is committed to the log 582 + buffers. Hence items that are relogged in the log buffers will have a pin count 583 + for every outstanding transaction they were dirtied in. When each of these 584 + transactions is completed, they will unpin the item once. As a result, the item 585 + only becomes unpinned when all the transactions complete and there are no 586 + pending transactions. Thus the pinning and unpinning of a log item is symmetric 587 + as there is a 1:1 relationship with transaction commit and log item completion. 588 + 589 + For delayed logging, however, we have an asymmetric transaction commit to 590 + completion relationship. Every time an object is relogged in the CIL it goes 591 + through the commit process without a corresponding completion being registered. 592 + That is, we now have a many-to-one relationship between transaction commit and 593 + log item completion. 
The result of this is that pinning and unpinning of the 594 + log items becomes unbalanced if we retain the "pin on transaction commit, unpin 595 + on transaction completion" model. 596 + 597 + To keep pin/unpin symmetry, the algorithm needs to change to a "pin on 598 + insertion into the CIL, unpin on checkpoint completion". In other words, the 599 + pinning and unpinning becomes symmetric around a checkpoint context. We have to 600 + pin the object the first time it is inserted into the CIL - if it is already in 601 + the CIL during a transaction commit, then we do not pin it again. Because there 602 + can be multiple outstanding checkpoint contexts, we can still see elevated pin 603 + counts, but as each checkpoint completes the pin count will retain the correct 604 + value according to its context. 605 + 606 + Just to make matters slightly more complex, this checkpoint level context 607 + for the pin count means that the pinning of an item must take place under the 608 + CIL commit/flush lock. If we pin the object outside this lock, we cannot 609 + guarantee which context the pin count is associated with. This is because of 610 + the fact pinning the item is dependent on whether the item is present in the 611 + current CIL or not. If we don't pin the CIL first before we check and pin the 612 + object, we have a race with CIL being flushed between the check and the pin 613 + (or not pinning, as the case may be). Hence we must hold the CIL flush/commit 614 + lock to guarantee that we pin the items correctly. 615 + 616 + Delayed Logging: Concurrent Scalability 617 + --------------------------------------- 618 + 619 + A fundamental requirement for the CIL is that accesses through transaction 620 + commits must scale to many concurrent commits. The current transaction commit 621 + code does not break down even when there are transactions coming from 2048 622 + processors at once. 
The current transaction code does not go any faster than if 623 + there was only one CPU using it, but it does not slow down either. 624 + 625 + As a result, the delayed logging transaction commit code needs to be designed 626 + for concurrency from the ground up. It is obvious that there are serialisation 627 + points in the design - the three important ones are: 628 + 629 + 1. Locking out new transaction commits while flushing the CIL 630 + 2. Adding items to the CIL and updating item space accounting 631 + 3. Checkpoint commit ordering 632 + 633 + Looking at the transaction commit and CIL flushing interactions, it is clear 634 + that we have a many-to-one interaction here. That is, the only restriction on 635 + the number of concurrent transactions that can be trying to commit at once is 636 + the amount of space available in the log for their reservations. The practical 637 + limit here is in the order of several hundred concurrent transactions for a 638 + 128MB log, which means that it is generally one per CPU in a machine. 639 + 640 + The amount of time a transaction commit needs to hold out a flush is a 641 + relatively long period of time - the pinning of log items needs to be done 642 + while we are holding out a CIL flush, so at the moment that means it is held 643 + across the formatting of the objects into memory buffers (i.e. while memcpy()s 644 + are in progress). Ultimately a two pass algorithm where the formatting is done 645 + separately to the pinning of objects could be used to reduce the hold time of 646 + the transaction commit side. 647 + 648 + Because of the number of potential transaction commit side holders, the lock 649 + really needs to be a sleeping lock - if the CIL flush takes the lock, we do not 650 + want every other CPU in the machine spinning on the CIL lock. 
Given that 651 + flushing the CIL could involve walking a list of tens of thousands of log 652 + items, it will get held for a significant time and so spin contention is a 653 + significant concern. Preventing lots of CPUs spinning doing nothing is the 654 + main reason for choosing a sleeping lock even though nothing in either the 655 + transaction commit or CIL flush side sleeps with the lock held. 656 + 657 + It should also be noted that CIL flushing is also a relatively rare operation 658 + compared to transaction commit for asynchronous transaction workloads - only 659 + time will tell if using a read-write semaphore for exclusion will limit 660 + transaction commit concurrency due to cache line bouncing of the lock on the 661 + read side. 662 + 663 + The second serialisation point is on the transaction commit side where items 664 + are inserted into the CIL. Because transactions can enter this code 665 + concurrently, the CIL needs to be protected separately from the above 666 + commit/flush exclusion. It also needs to be an exclusive lock but it is only 667 + held for a very short time and so a spin lock is appropriate here. It is 668 + possible that this lock will become a contention point, but given the short 669 + hold time once per transaction I think that contention is unlikely. 670 + 671 + The final serialisation point is the checkpoint commit record ordering code 672 + that is run as part of the checkpoint commit and log force sequencing. The code 673 + path that triggers a CIL flush (i.e. whatever triggers the log force) will enter 674 + an ordering loop after writing all the log vectors into the log buffers but 675 + before writing the commit record. This loop walks the list of committing 676 + checkpoints and needs to block waiting for checkpoints to complete their commit 677 + record write. As a result it needs a lock and a wait variable. 
Log force 678 + sequencing also requires the same lock, list walk, and blocking mechanism to 679 + ensure completion of checkpoints. 680 + 681 + These two sequencing operations can use the mechanism even though the 682 + events they are waiting for are different. The checkpoint commit record 683 + sequencing needs to wait until checkpoint contexts contain a commit LSN 684 + (obtained through completion of a commit record write) while log force 685 + sequencing needs to wait until previous checkpoint contexts are removed from 686 + the committing list (i.e. they've completed). A simple wait variable and 687 + broadcast wakeups (thundering herds) has been used to implement these two 688 + serialisation queues. They use the same lock as the CIL, too. If we see too 689 + much contention on the CIL lock, or too many context switches as a result of 690 + the broadcast wakeups these operations can be put under a new spinlock and 691 + given separate wait lists to reduce lock contention and the number of processes 692 + woken by the wrong event. 693 + 694 + 695 + Lifecycle Changes 696 + ----------------- 697 + 698 + The existing log item life cycle is as follows:: 699 + 700 + 1. Transaction allocate 701 + 2. Transaction reserve 702 + 3. Lock item 703 + 4. Join item to transaction 704 + If not already attached, 705 + Allocate log item 706 + Attach log item to owner item 707 + Attach log item to transaction 708 + 5. Modify item 709 + Record modifications in log item 710 + 6. Transaction commit 711 + Pin item in memory 712 + Format item into log buffer 713 + Write commit LSN into transaction 714 + Unlock item 715 + Attach transaction to log buffer 716 + 717 + <log buffer IO dispatched> 718 + <log buffer IO completes> 719 + 720 + 7. Transaction completion 721 + Mark log item committed 722 + Insert log item into AIL 723 + Write commit LSN into log item 724 + Unpin log item 725 + 8. 
AIL traversal 726 + Lock item 727 + Mark log item clean 728 + Flush item to disk 729 + 730 + <item IO completion> 731 + 732 + 9. Log item removed from AIL 733 + Moves log tail 734 + Item unlocked 735 + 736 + Essentially, steps 1-6 operate independently from step 7, which is also 737 + independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 738 + at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur 739 + at the same time. If the log item is in the AIL or between steps 6 and 7 740 + and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 741 + are entered and completed is the object considered clean. 742 + 743 + With delayed logging, there are new steps inserted into the life cycle:: 744 + 745 + 1. Transaction allocate 746 + 2. Transaction reserve 747 + 3. Lock item 748 + 4. Join item to transaction 749 + If not already attached, 750 + Allocate log item 751 + Attach log item to owner item 752 + Attach log item to transaction 753 + 5. Modify item 754 + Record modifications in log item 755 + 6. Transaction commit 756 + Pin item in memory if not pinned in CIL 757 + Format item into log vector + buffer 758 + Attach log vector and buffer to log item 759 + Insert log item into CIL 760 + Write CIL context sequence into transaction 761 + Unlock item 762 + 763 + <next log force> 764 + 765 + 7. CIL push 766 + lock CIL flush 767 + Chain log vectors and buffers together 768 + Remove items from CIL 769 + unlock CIL flush 770 + write log vectors into log 771 + sequence commit records 772 + attach checkpoint context to log buffer 773 + 774 + <log buffer IO dispatched> 775 + <log buffer IO completes> 776 + 777 + 8. Checkpoint completion 778 + Mark log item committed 779 + Insert item into AIL 780 + Write commit LSN into log item 781 + Unpin log item 782 + 9. AIL traversal 783 + Lock item 784 + Mark log item clean 785 + Flush item to disk 786 + <item IO completion> 787 + 10. 
Log item removed from AIL 788 + Moves log tail 789 + Item unlocked 790 + 791 + From this, it can be seen that the only life cycle differences between the two 792 + logging methods are in the middle of the life cycle - they still have the same 793 + beginning and end and execution constraints. The only differences are in the 794 + committing of the log items to the log itself and the completion processing. 795 + Hence delayed logging should not introduce any constraints on log item 796 + behaviour, allocation or freeing that don't already exist. 797 + 798 + As a result of this zero-impact "insertion" of delayed logging infrastructure 799 + and the design of the internal structures to avoid on disk format changes, we 800 + can basically switch between delayed logging and the existing mechanism with a 801 + mount option. Fundamentally, there is no reason why the log manager would not 802 + be able to swap methods automatically and transparently depending on load 803 + characteristics, but this should not be necessary if delayed logging works as 804 + designed.

-793

Documentation/filesystems/xfs-delayed-logging-design.txt

··· 1 - XFS Delayed Logging Design 2 - -------------------------- 3 - 4 - Introduction to Re-logging in XFS 5 - --------------------------------- 6 - 7 - XFS logging is a combination of logical and physical logging. Some objects, 8 - such as inodes and dquots, are logged in logical format where the details 9 - logged are made up of the changes to in-core structures rather than on-disk 10 - structures. Other objects - typically buffers - have their physical changes 11 - logged. The reason for these differences is to reduce the amount of log space 12 - required for objects that are frequently logged. Some parts of inodes are more 13 - frequently logged than others, and inodes are typically more frequently logged 14 - than any other object (except maybe the superblock buffer) so keeping the 15 - amount of metadata logged low is of prime importance. 16 - 17 - The reason that this is such a concern is that XFS allows multiple separate 18 - modifications to a single object to be carried in the log at any given time. 19 - This allows the log to avoid needing to flush each change to disk before 20 - recording a new change to the object. XFS does this via a method called 21 - "re-logging". Conceptually, this is quite simple - all it requires is that any 22 - new change to the object is recorded with a *new copy* of all the existing 23 - changes in the new transaction that is written to the log. 
24 - 25 - That is, if we have a sequence of changes A through to F, and the object was 26 - written to disk after change D, we would see in the log the following series 27 - of transactions, their contents and the log sequence number (LSN) of the 28 - transaction: 29 - 30 - Transaction Contents LSN 31 - A A X 32 - B A+B X+n 33 - C A+B+C X+n+m 34 - D A+B+C+D X+n+m+o 35 - <object written to disk> 36 - E E Y (> X+n+m+o) 37 - F E+F Y+p 38 - 39 - In other words, each time an object is relogged, the new transaction contains 40 - the aggregation of all the previous changes currently held only in the log. 41 - 42 - This relogging technique also allows objects to be moved forward in the log so 43 - that an object being relogged does not prevent the tail of the log from ever 44 - moving forward. This can be seen in the table above by the changing 45 - (increasing) LSN of each subsequent transaction - the LSN is effectively a 46 - direct encoding of the location in the log of the transaction. 47 - 48 - This relogging is also used to implement long-running, multiple-commit 49 - transactions. These transaction are known as rolling transactions, and require 50 - a special log reservation known as a permanent transaction reservation. A 51 - typical example of a rolling transaction is the removal of extents from an 52 - inode which can only be done at a rate of two extents per transaction because 53 - of reservation size limitations. Hence a rolling extent removal transaction 54 - keeps relogging the inode and btree buffers as they get modified in each 55 - removal operation. This keeps them moving forward in the log as the operation 56 - progresses, ensuring that current operation never gets blocked by itself if the 57 - log wraps around. 58 - 59 - Hence it can be seen that the relogging operation is fundamental to the correct 60 - working of the XFS journalling subsystem. 
From the above description, most 61 - people should be able to see why the XFS metadata operations writes so much to 62 - the log - repeated operations to the same objects write the same changes to 63 - the log over and over again. Worse is the fact that objects tend to get 64 - dirtier as they get relogged, so each subsequent transaction is writing more 65 - metadata into the log. 66 - 67 - Another feature of the XFS transaction subsystem is that most transactions are 68 - asynchronous. That is, they don't commit to disk until either a log buffer is 69 - filled (a log buffer can hold multiple transactions) or a synchronous operation 70 - forces the log buffers holding the transactions to disk. This means that XFS is 71 - doing aggregation of transactions in memory - batching them, if you like - to 72 - minimise the impact of the log IO on transaction throughput. 73 - 74 - The limitation on asynchronous transaction throughput is the number and size of 75 - log buffers made available by the log manager. By default there are 8 log 76 - buffers available and the size of each is 32kB - the size can be increased up 77 - to 256kB by use of a mount option. 78 - 79 - Effectively, this gives us the maximum bound of outstanding metadata changes 80 - that can be made to the filesystem at any point in time - if all the log 81 - buffers are full and under IO, then no more transactions can be committed until 82 - the current batch completes. It is now common for a single current CPU core to 83 - be to able to issue enough transactions to keep the log buffers full and under 84 - IO permanently. Hence the XFS journalling subsystem can be considered to be IO 85 - bound. 86 - 87 - Delayed Logging: Concepts 88 - ------------------------- 89 - 90 - The key thing to note about the asynchronous logging combined with the 91 - relogging technique XFS uses is that we can be relogging changed objects 92 - multiple times before they are committed to disk in the log buffers. 
If we 93 - return to the previous relogging example, it is entirely possible that 94 - transactions A through D are committed to disk in the same log buffer. 95 - 96 - That is, a single log buffer may contain multiple copies of the same object, 97 - but only one of those copies needs to be there - the last one "D", as it 98 - contains all the changes from the previous changes. In other words, we have one 99 - necessary copy in the log buffer, and three stale copies that are simply 100 - wasting space. When we are doing repeated operations on the same set of 101 - objects, these "stale objects" can be over 90% of the space used in the log 102 - buffers. It is clear that reducing the number of stale objects written to the 103 - log would greatly reduce the amount of metadata we write to the log, and this 104 - is the fundamental goal of delayed logging. 105 - 106 - From a conceptual point of view, XFS is already doing relogging in memory (where 107 - memory == log buffer), only it is doing it extremely inefficiently. It is using 108 - logical to physical formatting to do the relogging because there is no 109 - infrastructure to keep track of logical changes in memory prior to physically 110 - formatting the changes in a transaction to the log buffer. Hence we cannot avoid 111 - accumulating stale objects in the log buffers. 112 - 113 - Delayed logging is the name we've given to keeping and tracking transactional 114 - changes to objects in memory outside the log buffer infrastructure. Because of 115 - the relogging concept fundamental to the XFS journalling subsystem, this is 116 - actually relatively easy to do - all the changes to logged items are already 117 - tracked in the current infrastructure. The big problem is how to accumulate 118 - them and get them to the log in a consistent, recoverable manner. 119 - Describing the problems and how they have been solved is the focus of this 120 - document. 
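To put rough numbers on the stale copies described above for transactions A through D, the following C sketch compares the bytes written when every relog rewrites all changes so far against the bytes that are actually needed. The function names are hypothetical, and the model assumes each change dirties a distinct region of the object, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes written to the log when each relog of an object rewrites the
 * aggregation of all changes so far (A, then A+B, then A+B+C, ...).
 * Assumption: change i dirties change_sizes[i] bytes not touched before.
 */
size_t relog_bytes_written(const size_t *change_sizes, size_t n)
{
    size_t cumulative = 0;
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        cumulative += change_sizes[i];  /* transaction i holds changes 0..i */
        total += cumulative;
    }
    return total;
}

/* Bytes that are actually needed: only the last, fully aggregated copy. */
size_t relog_bytes_needed(const size_t *change_sizes, size_t n)
{
    size_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += change_sizes[i];

    /* The necessary copy can never exceed what was written in total. */
    assert(sum <= relog_bytes_written(change_sizes, n));
    return sum;
}
```

With four equal changes of 100 bytes, 1000 bytes are written and 400 are needed, so 60% of the space is stale. Under the same model, 20 equal relogs give a stale fraction of 190/210, or just over 90%.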
121 - 122 - One of the key changes that delayed logging makes to the operation of the 123 - journalling subsystem is that it disassociates the amount of outstanding 124 - metadata changes from the size and number of log buffers available. In other 125 - words, instead of there only being a maximum of 2MB of transaction changes not 126 - written to the log at any point in time, there may be a much greater amount 127 - being accumulated in memory. Hence the potential for loss of metadata on a 128 - crash is much greater than for the existing logging mechanism. 129 - 130 - It should be noted that this does not change the guarantee that log recovery 131 - will result in a consistent filesystem. What it does mean is that as far as the 132 - recovered filesystem is concerned, there may be many thousands of transactions 133 - that simply did not occur as a result of the crash. This makes it even more 134 - important that applications that care about their data use fsync() where they 135 - need to ensure application level data integrity is maintained. 136 - 137 - It should be noted that delayed logging is not an innovative new concept that 138 - warrants rigorous proofs to determine whether it is correct or not. The method 139 - of accumulating changes in memory for some period before writing them to the 140 - log is used effectively in many filesystems including ext3 and ext4. Hence 141 - no time is spent in this document trying to convince the reader that the 142 - concept is sound. Instead it is simply considered a "solved problem" and as 143 - such implementing it in XFS is purely an exercise in software engineering. 144 - 145 - The fundamental requirements for delayed logging in XFS are simple: 146 - 147 - 1. Reduce the amount of metadata written to the log by at least 148 - an order of magnitude. 149 - 2. Supply sufficient statistics to validate Requirement #1. 150 - 3. Supply sufficient new tracing infrastructure to be able to debug 151 - problems with the new code. 
152 - 4. No on-disk format change (metadata or log format). 153 - 5. Enable and disable with a mount option. 154 - 6. No performance regressions for synchronous transaction workloads. 155 - 156 - Delayed Logging: Design 157 - ----------------------- 158 - 159 - Storing Changes 160 - 161 - The problem with accumulating changes at a logical level (i.e. just using the 162 - existing log item dirty region tracking) is that when it comes to writing the 163 - changes to the log buffers, we need to ensure that the object we are formatting 164 - is not changing while we do this. This requires locking the object to prevent 165 - concurrent modification. Hence flushing the logical changes to the log would 166 - require us to lock every object, format them, and then unlock them again. 167 - 168 - This introduces lots of scope for deadlocks with transactions that are already 169 - running. For example, a transaction has object A locked and modified, but needs 170 - the delayed logging tracking lock to commit the transaction. However, the 171 - flushing thread has the delayed logging tracking lock already held, and is 172 - trying to get the lock on object A to flush it to the log buffer. This appears 173 - to be an unsolvable deadlock condition, and it was solving this problem that 174 - was the barrier to implementing delayed logging for so long. 175 - 176 - The solution is relatively simple - it just took a long time to recognise it. 177 - Put simply, the current logging code formats the changes to each item into a 178 - vector array that points to the changed regions in the item. The log write code 179 - simply copies the memory these vectors point to into the log buffer during 180 - transaction commit while the item is locked in the transaction. Instead of 181 - using the log buffer as the destination of the formatting code, we can use an 182 - allocated memory buffer big enough to fit the formatted vector. 
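As an illustration only, the idea of formatting into an allocated memory buffer instead of the log buffer can be sketched in userspace C as below. The struct and function names are hypothetical stand-ins and are not the XFS log vector API; the sketch copies each changed region into one buffer and points each vector at its copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one log vector entry: one changed region. */
struct region_vec {
    void   *addr;   /* where the region's bytes currently live */
    size_t  len;    /* region length in bytes */
};

/*
 * Copy every region into one freshly allocated buffer and repoint each
 * vector at its copy. In the design this would run while the owning
 * object is still locked by the committing transaction.
 *
 * Returns the buffer (caller frees), or NULL on allocation failure, in
 * which case the vectors are left untouched.
 */
void *format_to_memory_buffer(struct region_vec *vecs, size_t nvecs)
{
    size_t total = 0;

    for (size_t i = 0; i < nvecs; i++)
        total += vecs[i].len;

    char *buf = malloc(total ? total : 1);
    if (!buf)
        return NULL;

    size_t off = 0;
    for (size_t i = 0; i < nvecs; i++) {
        memcpy(buf + off, vecs[i].addr, vecs[i].len);
        vecs[i].addr = buf + off;   /* vector now describes the copy */
        off += vecs[i].len;
    }
    assert(off == total);
    return buf;
}
```

After the call, the object can keep changing without affecting the copy, so a later flush can read the vectors without taking the object's lock.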
183 - 184 - If we then copy the vector into the memory buffer and rewrite the vector to 185 - point to the memory buffer rather than the object itself, we now have a copy of 186 - the changes in a format that is compatible with the log buffer writing code 187 - that does not require us to lock the item to access it. This formatting and 188 - rewriting can all be done while the object is locked during transaction commit, 189 - resulting in a vector that is transactionally consistent and can be accessed 190 - without needing to lock the owning item. 191 - 192 - Hence we avoid the need to lock items when we need to flush outstanding 193 - asynchronous transactions to the log. The differences between the existing 194 - formatting method and the delayed logging formatting can be seen in the 195 - diagram below. 196 - 197 - Current format log vector: 198 - 199 - Object +---------------------------------------------+ 200 - Vector 1 +----+ 201 - Vector 2 +----+ 202 - Vector 3 +----------+ 203 - 204 - After formatting: 205 - 206 - Log Buffer +-V1-+-V2-+----V3----+ 207 - 208 - Delayed logging vector: 209 - 210 - Object +---------------------------------------------+ 211 - Vector 1 +----+ 212 - Vector 2 +----+ 213 - Vector 3 +----------+ 214 - 215 - After formatting: 216 - 217 - Memory Buffer +-V1-+-V2-+----V3----+ 218 - Vector 1 +----+ 219 - Vector 2 +----+ 220 - Vector 3 +----------+ 221 - 222 - The memory buffer and associated vector need to be passed as a single object, 223 - but still need to be associated with the parent object so if the object is 224 - relogged we can replace the current memory buffer with a new memory buffer that 225 - contains the latest changes. 226 - 227 - The reason for keeping the vector around after we've formatted the memory 228 - buffer is to support splitting vectors across log buffer boundaries correctly. 
229 - If we don't keep the vector around, we do not know where the region boundaries 230 - are in the item, so we'd need a new encapsulation method for regions in the log 231 - buffer writing (i.e. double encapsulation). This would be an on-disk format 232 - change and as such is not desirable. It also means we'd have to write the log 233 - region headers in the formatting stage, which is problematic as there is per 234 - region state that needs to be placed into the headers during the log write. 235 - 236 - Hence we need to keep the vector, but by attaching the memory buffer to it and 237 - rewriting the vector addresses to point at the memory buffer we end up with a 238 - self-describing object that can be passed to the log buffer write code to be 239 - handled in exactly the same manner as the existing log vectors are handled. 240 - Hence we avoid needing a new on-disk format to handle items that have been 241 - relogged in memory. 242 - 243 - 244 - Tracking Changes 245 - 246 - Now that we can record transactional changes in memory in a form that allows 247 - them to be used without limitations, we need to be able to track and accumulate 248 - them so that they can be written to the log at some later point in time. The 249 - log item is the natural place to store this vector and buffer, and it also makes sense 250 - for it to be the object that is used to track committed objects as it will always 251 - exist once the object has been included in a transaction. 252 - 253 - The log item is already used to track the log items that have been written to 254 - the log but not yet written to disk. Such log items are considered "active" 255 - and as such are stored in the Active Item List (AIL) which is an LSN-ordered 256 - doubly linked list. Items are inserted into this list during log buffer IO 257 - completion, after which they are unpinned and can be written to disk. 
An object 258 - that is in the AIL can be relogged, which causes the object to be pinned again 259 - and then moved forward in the AIL when the log buffer IO completes for that 260 - transaction. 261 - 262 - Essentially, this shows that an item that is in the AIL can still be modified 263 - and relogged, so any tracking must be separate to the AIL infrastructure. As 264 - such, we cannot reuse the AIL list pointers for tracking committed items, nor 265 - can we store state in any field that is protected by the AIL lock. Hence the 266 - committed item tracking needs its own locks, lists and state fields in the log 267 - item. 268 - 269 - Similar to the AIL, tracking of committed items is done through a new list 270 - called the Committed Item List (CIL). The list tracks log items that have been 271 - committed and have formatted memory buffers attached to them. It tracks objects 272 - in transaction commit order, so when an object is relogged it is removed from 273 - its place in the list and re-inserted at the tail. This is entirely arbitrary 274 - and done to make it easy for debugging - the last items in the list are the 275 - ones that are most recently modified. Ordering of the CIL is not necessary for 276 - transactional integrity (as discussed in the next section) so the ordering is 277 - done for convenience/sanity of the developers. 278 - 279 - 280 - Delayed Logging: Checkpoints 281 - 282 - When we have a log synchronisation event, commonly known as a "log force", 283 - all the items in the CIL must be written into the log via the log buffers. 284 - We need to write these items in the order that they exist in the CIL, and they 285 - need to be written as an atomic transaction. The need for all the objects to be 286 - written as an atomic transaction comes from the requirements of relogging and 287 - log replay - all the changes in all the objects in a given transaction must 288 - either be completely replayed during log recovery, or not replayed at all. 
If 289 - a transaction is not replayed because it is not complete in the log, then 290 - no later transactions should be replayed, either. 291 - 292 - To fulfill this requirement, we need to write the entire CIL in a single log 293 - transaction. Fortunately, the XFS log code has no fixed limit on the size of a 294 - transaction, nor does the log replay code. The only fundamental limit is that 295 - the transaction cannot be larger than just under half the size of the log. The 296 - reason for this limit is that to find the head and tail of the log, there must 297 - be at least one complete transaction in the log at any given time. If a 298 - transaction is larger than half the log, then there is the possibility that a 299 - crash during the write of such a transaction could partially overwrite the 300 - only complete previous transaction in the log. This will result in a recovery 301 - failure and an inconsistent filesystem and hence we must enforce the maximum 302 - size of a checkpoint to be slightly less than half the log. 303 - 304 - Apart from this size requirement, a checkpoint transaction looks no different 305 - to any other transaction - it contains a transaction header, a series of 306 - formatted log items and a commit record at the tail. From a recovery 307 - perspective, the checkpoint transaction is also no different - just a lot 308 - bigger with a lot more items in it. The worst case effect of this is that we 309 - might need to tune the recovery transaction object hash size. 310 - 311 - Because the checkpoint is just another transaction and all the changes to log 312 - items are stored as log vectors, we can use the existing log buffer writing 313 - code to write the changes into the log. To do this efficiently, we need to 314 - minimise the time we hold the CIL locked while writing the checkpoint 315 - transaction. 
The current log write code enables us to do this easily with the 316 - way it separates the writing of the transaction contents (the log vectors) from 317 - the transaction commit record, but tracking this requires us to have a 318 - per-checkpoint context that travels through the log write process through to 319 - checkpoint completion. 320 - 321 - Hence a checkpoint has a context that tracks the state of the current 322 - checkpoint from initiation to checkpoint completion. A new context is initiated 323 - at the same time a checkpoint transaction is started. That is, when we remove 324 - all the current items from the CIL during a checkpoint operation, we move all 325 - those changes into the current checkpoint context. We then initialise a new 326 - context and attach that to the CIL for aggregation of new transactions. 327 - 328 - This allows us to unlock the CIL immediately after transfer of all the 329 - committed items and effectively allow new transactions to be issued while we 330 - are formatting the checkpoint into the log. It also allows concurrent 331 - checkpoints to be written into the log buffers in the case of log force heavy 332 - workloads, just like the existing transaction commit code does. This, however, 333 - requires that we strictly order the commit records in the log so that 334 - checkpoint sequence order is maintained during log replay. 335 - 336 - To ensure that we can be writing an item into a checkpoint transaction at 337 - the same time another transaction modifies the item and inserts the log item 338 - into the new CIL, the checkpoint transaction commit code cannot use log items 339 - to store the list of log vectors that need to be written into the transaction. 340 - Hence log vectors need to be able to be chained together to allow them to be 341 - detached from the log items. 
That is, when the CIL is flushed the memory 342 - buffer and log vector attached to each log item need to be attached to the 343 - checkpoint context so that the log item can be released. In diagrammatic form, 344 - the CIL would look like this before the flush: 345 - 346 - CIL Head 347 - | 348 - V 349 - Log Item <-> log vector 1 -> memory buffer 350 - | -> vector array 351 - V 352 - Log Item <-> log vector 2 -> memory buffer 353 - | -> vector array 354 - V 355 - ...... 356 - | 357 - V 358 - Log Item <-> log vector N-1 -> memory buffer 359 - | -> vector array 360 - V 361 - Log Item <-> log vector N -> memory buffer 362 - -> vector array 363 - 364 - And after the flush the CIL head is empty, and the checkpoint context log 365 - vector list would look like: 366 - 367 - Checkpoint Context 368 - | 369 - V 370 - log vector 1 -> memory buffer 371 - | -> vector array 372 - | -> Log Item 373 - V 374 - log vector 2 -> memory buffer 375 - | -> vector array 376 - | -> Log Item 377 - V 378 - ...... 379 - | 380 - V 381 - log vector N-1 -> memory buffer 382 - | -> vector array 383 - | -> Log Item 384 - V 385 - log vector N -> memory buffer 386 - -> vector array 387 - -> Log Item 388 - 389 - Once this transfer is done, the CIL can be unlocked and new transactions can 390 - start, while the checkpoint flush code works over the log vector chain to 391 - commit the checkpoint. 392 - 393 - Once the checkpoint is written into the log buffers, the checkpoint context is 394 - attached to the log buffer that the commit record was written to along with a 395 - completion callback. Log IO completion will call that callback, which can then 396 - run transaction committed processing for the log items (i.e. insert into AIL 397 - and unpin) in the log vector chain and then free the log vector chain and 398 - checkpoint context. 
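The before and after diagrams above can be read as a list transfer. The following is a minimal userspace sketch of that transfer, assuming hypothetical struct layouts and a fresh, empty checkpoint context; it omits all locking, the memory buffers and the completion callback, and it is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

struct log_item;

/* Hypothetical log vector: chained by the checkpoint after a flush. */
struct log_vec {
    struct log_vec  *next;  /* next vector in the checkpoint chain */
    struct log_item *item;  /* back pointer used at checkpoint completion */
};

/* Hypothetical log item: linked into the CIL in commit order. */
struct log_item {
    struct log_item *cil_next;  /* next item in the CIL */
    struct log_vec  *lv;        /* formatted changes, NULL when not in CIL */
};

struct cil {
    struct log_item *head;
};

struct checkpoint_ctx {
    struct log_vec *lv_chain;   /* log vectors to write, in CIL order */
};

/*
 * Detach every log vector from its item and chain it onto the context,
 * preserving CIL order, then leave the CIL empty. Returns the number of
 * vectors moved. Assumes ctx->lv_chain starts out empty.
 */
size_t cil_flush(struct cil *cil, struct checkpoint_ctx *ctx)
{
    struct log_vec **tailp = &ctx->lv_chain;
    struct log_item *item = cil->head;
    size_t moved = 0;

    while (item) {
        struct log_item *next = item->cil_next;
        struct log_vec *lv = item->lv;

        assert(lv != NULL);     /* every item in the CIL has a vector */
        lv->item = item;        /* vector -> Log Item link in the diagram */
        lv->next = NULL;
        *tailp = lv;
        tailp = &lv->next;

        item->lv = NULL;        /* item is free to be relogged */
        item->cil_next = NULL;
        item = next;
        moved++;
    }
    cil->head = NULL;
    return moved;
}
```

Once `cil_flush()` returns, nothing on the chain is reachable from the CIL, which is why new transactions can insert the same items into the next context while the chain is being written.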
399 - 400 - Discussion Point: I am uncertain as to whether the log item is the most 401 - efficient way to track vectors, even though it seems like the natural way to do 402 - it. The fact that we walk the log items (in the CIL) just to chain the log 403 - vectors and break the link between the log item and the log vector means that 404 - we take a cache line hit for the log item list modification, then another for 405 - the log vector chaining. If we track by the log vectors, then we only need to 406 - break the link between the log item and the log vector, which means we should 407 - dirty only the log item cachelines. Normally I wouldn't be concerned about one 408 - vs two dirty cachelines except for the fact I've seen upwards of 80,000 log 409 - vectors in one checkpoint transaction. I'd guess this is a "measure and 410 - compare" situation that can be done after a working and reviewed implementation 411 - is in the dev tree.... 412 - 413 - Delayed Logging: Checkpoint Sequencing 414 - 415 - One of the key aspects of the XFS transaction subsystem is that it tags 416 - committed transactions with the log sequence number of the transaction commit. 417 - This allows transactions to be issued asynchronously even though there may be 418 - future operations that cannot be completed until that transaction is fully 419 - committed to the log. In the rare case that a dependent operation occurs (e.g. 420 - re-using a freed metadata extent for a data extent), a special, optimised log 421 - force can be issued to force the dependent transaction to disk immediately. 422 - 423 - To do this, transactions need to record the LSN of the commit record of the 424 - transaction. This LSN comes directly from the log buffer the transaction is 425 - written into. While this works just fine for the existing transaction 426 - mechanism, it does not work for delayed logging because transactions are not 427 - written directly into the log buffers. 
Hence some other method of sequencing 428 - transactions is required. 429 - 430 - As discussed in the checkpoint section, delayed logging uses per-checkpoint 431 - contexts, and as such it is simple to assign a sequence number to each 432 - checkpoint. Because the switching of checkpoint contexts must be done 433 - atomically, it is simple to ensure that each new context has a monotonically 434 - increasing sequence number assigned to it without the need for an external 435 - atomic counter - we can just take the current context sequence number and add 436 - one to it for the new context. 437 - 438 - Then, instead of assigning a log buffer LSN to the transaction commit LSN 439 - during the commit, we can assign the current checkpoint sequence. This allows 440 - operations that track transactions that have not yet completed to know what 441 - checkpoint sequence needs to be committed before they can continue. As a 442 - result, the code that forces the log to a specific LSN now needs to ensure that 443 - the log forces to a specific checkpoint. 444 - 445 - To ensure that we can do this, we need to track all the checkpoint contexts 446 - that are currently committing to the log. When we flush a checkpoint, the 447 - context gets added to a "committing" list which can be searched. When a 448 - checkpoint commit completes, it is removed from the committing list. Because 449 - the checkpoint context records the LSN of the commit record for the checkpoint, 450 - we can also wait on the log buffer that contains the commit record, thereby 451 - using the existing log force mechanisms to execute synchronous forces. 452 - 453 - It should be noted that the synchronous forces may need to be extended with 454 - mitigation algorithms similar to the current log buffer code to allow 455 - aggregation of multiple synchronous transactions if there are already 456 - synchronous transactions being flushed. 
Investigation of the performance of the 457 - current design is needed before making any decisions here. 458 - 459 - The main concern with log forces is to ensure that all the previous checkpoints 460 - are also committed to disk before the one we need to wait for. Therefore we 461 - need to check that all the prior contexts in the committing list are also 462 - complete before waiting on the one we need to complete. We do this 463 - synchronisation in the log force code so that we don't need to wait anywhere 464 - else for such serialisation - it only matters when we do a log force. 465 - 466 - The only remaining complexity is that a log force now also has to handle the 467 - case where the forcing sequence number is the same as the current context. That 468 - is, we need to flush the CIL and potentially wait for it to complete. This is a 469 - simple addition to the existing log forcing code to check the sequence numbers 470 - and push if required. Indeed, placing the current sequence checkpoint flush in 471 - the log force code enables the current mechanism for issuing synchronous 472 - transactions to remain untouched (i.e. commit an asynchronous transaction, then 473 - force the log at the LSN of that transaction) and so the higher level code 474 - behaves the same regardless of whether delayed logging is being used or not. 475 - 476 - Delayed Logging: Checkpoint Log Space Accounting 477 - 478 - The big issue for a checkpoint transaction is the log space reservation for the 479 - transaction. We don't know how big a checkpoint transaction is going to be 480 - ahead of time, nor how many log buffers it will take to write out, nor how 481 - many split log vector regions are going to be used. We can track the 482 - amount of log space required as we add items to the commit item list, but we 483 - still need to reserve the space in the log for the checkpoint. 
484 - 485 - A typical transaction reserves enough space in the log for the worst case space 486 - usage of the transaction. The reservation accounts for log record headers, 487 - transaction and region headers, headers for split regions, buffer tail padding, 488 - etc. as well as the actual space for all the changed metadata in the 489 - transaction. While some of this is fixed overhead, much of it is dependent on 490 - the size of the transaction and the number of regions being logged (the number 491 - of log vectors in the transaction). 492 - 493 - An example of the differences would be logging directory changes versus logging 494 - inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then 495 - there are lots of transactions that only contain an inode core and an inode log 496 - format structure. That is, two vectors totaling roughly 150 bytes. If we modify 497 - 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each 498 - vector is 12 bytes, so the total to be logged is approximately 1.75MB. In 499 - comparison, if we are logging full directory buffers, they are typically 4KB 500 - each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a 501 - buffer format structure for each buffer - roughly 800 vectors or 1.51MB total 502 - space. From this, it should be obvious that a static log space reservation is 503 - not particularly flexible and is difficult to select the "optimal value" for 504 - all workloads. 505 - 506 - Further, if we are going to use a static reservation, which bit of the entire 507 - reservation does it cover? We account for space used by the transaction 508 - reservation by tracking the space currently used by the object in the CIL and 509 - then calculating the increase or decrease in space used as the object is 510 - relogged. This allows for a checkpoint reservation to only have to account for 511 - log buffer metadata used such as log header records. 
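The inode versus directory comparison above can be checked with a small worked calculation. The sketch below uses the figures quoted in the text (12 bytes per vector, roughly 150 bytes per inode change, roughly 400 directory buffers in 1.5MB); the macro and function names are hypothetical, and decimal megabytes are assumed for the round numbers.

```c
#include <assert.h>

/* Per-vector overhead, taken from the text ("Each vector is 12 bytes"). */
#define VEC_OVERHEAD_BYTES 12UL

/* Total bytes to log: changed metadata plus per-vector overhead. */
unsigned long logged_bytes(unsigned long payload_bytes, unsigned long nvecs)
{
    unsigned long total = payload_bytes + nvecs * VEC_OVERHEAD_BYTES;

    assert(total >= payload_bytes);     /* guard against wraparound */
    return total;
}

/* 10,000 inodes, two vectors and roughly 150 bytes of changes each. */
unsigned long inode_example_bytes(void)
{
    return logged_bytes(10000UL * 150UL, 10000UL * 2UL);
}

/* Roughly 1.5MB of directory buffers: about 400 buffers, 800 vectors. */
unsigned long dir_example_bytes(void)
{
    return logged_bytes(1500000UL, 400UL * 2UL);
}
```

The inode case comes to 1,740,000 bytes (about 1.75MB) and the directory case to 1,509,600 bytes (about 1.51MB), so the overhead is roughly 16% of the payload in one workload and under 1% in the other.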
512 - 513 - However, even using a static reservation for just the log metadata is 514 - problematic. Typically log record headers use at least 16KB of log space per 515 - 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be 516 - large enough to handle arbitrary sized checkpoint transactions. This 517 - reservation needs to be made before the checkpoint is started, and we need to 518 - be able to reserve the space without sleeping. For an 8MB checkpoint, we need a 519 - reservation of around 150KB, which is a non-trivial amount of space. 520 - 521 - A static reservation needs to manipulate the log grant counters - we can take a 522 - permanent reservation on the space, but we still need to make sure we refresh 523 - the write reservation (the actual space available to the transaction) after 524 - every checkpoint transaction completion. Unfortunately, if this space is not 525 - available when required, then the regrant code will sleep waiting for it. 526 - 527 - The problem with this is that it can lead to deadlocks as we may need to commit 528 - checkpoints to be able to free up log space (refer back to the description of 529 - rolling transactions for an example of this). Hence we *must* always have 530 - space available in the log if we are to use static reservations, and that is 531 - very difficult and complex to arrange. It is possible to do, but there is a 532 - simpler way. 533 - 534 - The simpler way of doing this is tracking the entire log space used by the 535 - items in the CIL and using this to dynamically calculate the amount of log 536 - space required by the log metadata. If this log metadata space changes as a 537 - result of a transaction commit inserting a new memory buffer into the CIL, then 538 - the difference in space required is removed from the transaction that causes 539 - the change. 
Transactions at this level will *always* have enough space 540 - available in their reservation for this as they have already reserved the 541 - maximal amount of log metadata space they require, and such a delta reservation 542 - will always be less than or equal to the maximal amount in the reservation. 543 - 544 - Hence we can grow the checkpoint transaction reservation dynamically as items 545 - are added to the CIL and avoid the need for reserving and regranting log space 546 - up front. This avoids deadlocks and removes a blocking point from the 547 - checkpoint flush code. 548 - 549 - As mentioned earlier, transactions can't grow to more than half the size of the 550 - log. Hence as part of the reservation growing, we need to also check the size 551 - of the reservation against the maximum allowed transaction size. If we reach 552 - the maximum threshold, we need to push the CIL to the log. This is effectively 553 - a "background flush" and is done on demand. This is identical to 554 - a CIL push triggered by a log force, only that there is no waiting for the 555 - checkpoint commit to complete. This background push is checked and executed by 556 - transaction commit code. 557 - 558 - If the transaction subsystem goes idle while we still have items in the CIL, 559 - they will be flushed by the periodic log force issued by the xfssyncd. This log 560 - force will push the CIL to disk, and if the transaction subsystem stays idle, 561 - allow the idle log to be covered (effectively marked clean) in exactly the same 562 - manner that is done for the existing logging method. A discussion point is 563 - whether this log force needs to be done more frequently than the current rate 564 - which is once every 30s. 565 - 566 - 567 - Delayed Logging: Log Item Pinning 568 - 569 - Currently log items are pinned during transaction commit while the items are 570 - still locked. 
This happens just after the items are formatted, though it could 571 - be done any time before the items are unlocked. The result of this mechanism is 572 - that items get pinned once for every transaction that is committed to the log 573 - buffers. Hence items that are relogged in the log buffers will have a pin count 574 - for every outstanding transaction they were dirtied in. When each of these 575 - transactions is completed, they will unpin the item once. As a result, the item 576 - only becomes unpinned when all the transactions complete and there are no 577 - pending transactions. Thus the pinning and unpinning of a log item is symmetric 578 - as there is a 1:1 relationship with transaction commit and log item completion. 579 - 580 - For delayed logging, however, we have an asymmetric transaction commit to 581 - completion relationship. Every time an object is relogged in the CIL it goes 582 - through the commit process without a corresponding completion being registered. 583 - That is, we now have a many-to-one relationship between transaction commit and 584 - log item completion. The result of this is that pinning and unpinning of the 585 - log items becomes unbalanced if we retain the "pin on transaction commit, unpin 586 - on transaction completion" model. 587 - 588 - To keep pin/unpin symmetry, the algorithm needs to change to a "pin on 589 - insertion into the CIL, unpin on checkpoint completion". In other words, the 590 - pinning and unpinning becomes symmetric around a checkpoint context. We have to 591 - pin the object the first time it is inserted into the CIL - if it is already in 592 - the CIL during a transaction commit, then we do not pin it again. Because there 593 - can be multiple outstanding checkpoint contexts, we can still see elevated pin 594 - counts, but as each checkpoint completes the pin count will retain the correct 595 - value according to its context. 
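The "pin on insertion into the CIL, unpin on checkpoint completion" rule can be illustrated with a small state sketch. The struct and function names below are hypothetical, and the locking that real code would need around the check and the pin is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical log item state, reduced to what pinning needs. */
struct log_item {
    int  pin_count;     /* one pin per checkpoint context holding the item */
    bool in_cil;        /* already inserted into the current CIL? */
};

/* Transaction commit: pin only on first insertion into the current CIL. */
void cil_commit_item(struct log_item *item)
{
    if (!item->in_cil) {
        item->pin_count++;
        item->in_cil = true;
    }
    /* A relog of an item already in the CIL takes no extra pin. */
}

/* CIL flush: the item leaves the CIL; its pin travels with the checkpoint. */
void cil_flush_item(struct log_item *item)
{
    item->in_cil = false;
}

/* Checkpoint completion: drop the single pin that checkpoint owned. */
void checkpoint_complete_item(struct log_item *item)
{
    assert(item->pin_count > 0);
    item->pin_count--;
}
```

Committing an item three times before a flush leaves a pin count of one. Committing it again after the flush, while the first checkpoint is still outstanding, raises the count to two, and each checkpoint completion then drops exactly one pin.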
596 - 597 - Just to make matters slightly more complex, this checkpoint level context 598 - for the pin count means that the pinning of an item must take place under the 599 - CIL commit/flush lock. If we pin the object outside this lock, we cannot 600 - guarantee which context the pin count is associated with. This is because of 601 - the fact pinning the item is dependent on whether the item is present in the 602 - current CIL or not. If we don't pin the CIL first before we check and pin the 603 - object, we have a race with CIL being flushed between the check and the pin 604 - (or not pinning, as the case may be). Hence we must hold the CIL flush/commit 605 - lock to guarantee that we pin the items correctly. 606 - 607 - Delayed Logging: Concurrent Scalability 608 - 609 - A fundamental requirement for the CIL is that accesses through transaction 610 - commits must scale to many concurrent commits. The current transaction commit 611 - code does not break down even when there are transactions coming from 2048 612 - processors at once. The current transaction code does not go any faster than if 613 - there was only one CPU using it, but it does not slow down either. 614 - 615 - As a result, the delayed logging transaction commit code needs to be designed 616 - for concurrency from the ground up. It is obvious that there are serialisation 617 - points in the design - the three important ones are: 618 - 619 - 1. Locking out new transaction commits while flushing the CIL 620 - 2. Adding items to the CIL and updating item space accounting 621 - 3. Checkpoint commit ordering 622 - 623 - Looking at the transaction commit and CIL flushing interactions, it is clear 624 - that we have a many-to-one interaction here. That is, the only restriction on 625 - the number of concurrent transactions that can be trying to commit at once is 626 - the amount of space available in the log for their reservations. 
The practical 627 - limit here is in the order of several hundred concurrent transactions for a 628 - 128MB log, which means that it is generally one per CPU in a machine. 629 - 630 - The amount of time a transaction commit needs to hold out a flush is a 631 - relatively long period of time - the pinning of log items needs to be done 632 - while we are holding out a CIL flush, so at the moment that means it is held 633 - across the formatting of the objects into memory buffers (i.e. while memcpy()s 634 - are in progress). Ultimately a two pass algorithm where the formatting is done 635 - separately to the pinning of objects could be used to reduce the hold time of 636 - the transaction commit side. 637 - 638 - Because of the number of potential transaction commit side holders, the lock 639 - really needs to be a sleeping lock - if the CIL flush takes the lock, we do not 640 - want every other CPU in the machine spinning on the CIL lock. Given that 641 - flushing the CIL could involve walking a list of tens of thousands of log 642 - items, it will get held for a significant time and so spin contention is a 643 - significant concern. Preventing lots of CPUs spinning doing nothing is the 644 - main reason for choosing a sleeping lock even though nothing in either the 645 - transaction commit or CIL flush side sleeps with the lock held. 646 - 647 - It should also be noted that CIL flushing is also a relatively rare operation 648 - compared to transaction commit for asynchronous transaction workloads - only 649 - time will tell if using a read-write semaphore for exclusion will limit 650 - transaction commit concurrency due to cache line bouncing of the lock on the 651 - read side. 652 - 653 - The second serialisation point is on the transaction commit side where items 654 - are inserted into the CIL. Because transactions can enter this code 655 - concurrently, the CIL needs to be protected separately from the above 656 - commit/flush exclusion. 
It also needs to be an exclusive lock but it is only 657 - held for a very short time and so a spin lock is appropriate here. It is 658 - possible that this lock will become a contention point, but given the short 659 - hold time once per transaction I think that contention is unlikely. 660 - 661 - The final serialisation point is the checkpoint commit record ordering code 662 - that is run as part of the checkpoint commit and log force sequencing. The code 663 - path that triggers a CIL flush (i.e. whatever triggers the log force) will enter 664 - an ordering loop after writing all the log vectors into the log buffers but 665 - before writing the commit record. This loop walks the list of committing 666 - checkpoints and needs to block waiting for checkpoints to complete their commit 667 - record write. As a result it needs a lock and a wait variable. Log force 668 - sequencing also requires the same lock, list walk, and blocking mechanism to 669 - ensure completion of checkpoints. 670 - 671 - These two sequencing operations can use the mechanism even though the 672 - events they are waiting for are different. The checkpoint commit record 673 - sequencing needs to wait until checkpoint contexts contain a commit LSN 674 - (obtained through completion of a commit record write) while log force 675 - sequencing needs to wait until previous checkpoint contexts are removed from 676 - the committing list (i.e. they've completed). A simple wait variable and 677 - broadcast wakeups (thundering herds) has been used to implement these two 678 - serialisation queues. They use the same lock as the CIL, too. If we see too 679 - much contention on the CIL lock, or too many context switches as a result of 680 - the broadcast wakeups these operations can be put under a new spinlock and 681 - given separate wait lists to reduce lock contention and the number of processes 682 - woken by the wrong event. 
683 - 684 - 685 - Lifecycle Changes 686 - 687 - The existing log item life cycle is as follows: 688 - 689 - 1. Transaction allocate 690 - 2. Transaction reserve 691 - 3. Lock item 692 - 4. Join item to transaction 693 - If not already attached, 694 - Allocate log item 695 - Attach log item to owner item 696 - Attach log item to transaction 697 - 5. Modify item 698 - Record modifications in log item 699 - 6. Transaction commit 700 - Pin item in memory 701 - Format item into log buffer 702 - Write commit LSN into transaction 703 - Unlock item 704 - Attach transaction to log buffer 705 - 706 - <log buffer IO dispatched> 707 - <log buffer IO completes> 708 - 709 - 7. Transaction completion 710 - Mark log item committed 711 - Insert log item into AIL 712 - Write commit LSN into log item 713 - Unpin log item 714 - 8. AIL traversal 715 - Lock item 716 - Mark log item clean 717 - Flush item to disk 718 - 719 - <item IO completion> 720 - 721 - 9. Log item removed from AIL 722 - Moves log tail 723 - Item unlocked 724 - 725 - Essentially, steps 1-6 operate independently from step 7, which is also 726 - independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 727 - at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur 728 - at the same time. If the log item is in the AIL or between steps 6 and 7 729 - and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 730 - are entered and completed is the object considered clean. 731 - 732 - With delayed logging, there are new steps inserted into the life cycle: 733 - 734 - 1. Transaction allocate 735 - 2. Transaction reserve 736 - 3. Lock item 737 - 4. Join item to transaction 738 - If not already attached, 739 - Allocate log item 740 - Attach log item to owner item 741 - Attach log item to transaction 742 - 5. Modify item 743 - Record modifications in log item 744 - 6. 
Transaction commit 745 - Pin item in memory if not pinned in CIL 746 - Format item into log vector + buffer 747 - Attach log vector and buffer to log item 748 - Insert log item into CIL 749 - Write CIL context sequence into transaction 750 - Unlock item 751 - 752 - <next log force> 753 - 754 - 7. CIL push 755 - lock CIL flush 756 - Chain log vectors and buffers together 757 - Remove items from CIL 758 - unlock CIL flush 759 - write log vectors into log 760 - sequence commit records 761 - attach checkpoint context to log buffer 762 - 763 - <log buffer IO dispatched> 764 - <log buffer IO completes> 765 - 766 - 8. Checkpoint completion 767 - Mark log item committed 768 - Insert item into AIL 769 - Write commit LSN into log item 770 - Unpin log item 771 - 9. AIL traversal 772 - Lock item 773 - Mark log item clean 774 - Flush item to disk 775 - <item IO completion> 776 - 10. Log item removed from AIL 777 - Moves log tail 778 - Item unlocked 779 - 780 - From this, it can be seen that the only life cycle differences between the two 781 - logging methods are in the middle of the life cycle - they still have the same 782 - beginning and end and execution constraints. The only differences are in the 783 - committing of the log items to the log itself and the completion processing. 784 - Hence delayed logging should not introduce any constraints on log item 785 - behaviour, allocation or freeing that don't already exist. 786 - 787 - As a result of this zero-impact "insertion" of delayed logging infrastructure 788 - and the design of the internal structures to avoid on disk format changes, we 789 - can basically switch between delayed logging and the existing mechanism with a 790 - mount option. Fundamentally, there is no reason why the log manager would not 791 - be able to swap methods automatically and transparently depending on load 792 - characteristics, but this should not be necessary if delayed logging works as 793 - designed.

+352

Documentation/filesystems/xfs-self-describing-metadata.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============================ 4 + XFS Self Describing Metadata 5 + ============================ 6 + 7 + Introduction 8 + ============ 9 + 10 + The largest scalability problem facing XFS is not one of algorithmic 11 + scalability, but of verification of the filesystem structure. Scalability of the 12 + structures and indexes on disk and the algorithms for iterating them are 13 + adequate for supporting PB scale filesystems with billions of inodes, however it 14 + is this very scalability that causes the verification problem. 15 + 16 + Almost all metadata on XFS is dynamically allocated. The only fixed location 17 + metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all 18 + other metadata structures need to be discovered by walking the filesystem 19 + structure in different ways. While this is already done by userspace tools for 20 + validating and repairing the structure, there are limits to what they can 21 + verify, and this in turn limits the supportable size of an XFS filesystem. 22 + 23 + For example, it is entirely possible to manually use xfs_db and a bit of 24 + scripting to analyse the structure of a 100TB filesystem when trying to 25 + determine the root cause of a corruption problem, but it is still mainly a 26 + manual task of verifying that things like single bit errors or misplaced writes 27 + weren't the ultimate cause of a corruption event. It may take a few hours to a 28 + few days to perform such forensic analysis, so at this scale root cause 29 + analysis is entirely possible. 30 + 31 + However, if we scale the filesystem up to 1PB, we now have 10x as much metadata 32 + to analyse and so that analysis blows out towards weeks/months of forensic work. 33 + Most of the analysis work is slow and tedious, so as the amount of analysis goes 34 + up, the more likely that the cause will be lost in the noise. 
Hence the primary 35 + concern for supporting PB scale filesystems is minimising the time and effort 36 + required for basic forensic analysis of the filesystem structure. 37 + 38 + 39 + Self Describing Metadata 40 + ======================== 41 + 42 + One of the problems with the current metadata format is that apart from the 43 + magic number in the metadata block, we have no other way of identifying what it 44 + is supposed to be. We can't even identify if it is the right place. Put simply, 45 + you can't look at a single metadata block in isolation and say "yes, it is 46 + supposed to be there and the contents are valid". 47 + 48 + Hence most of the time spent on forensic analysis is spent doing basic 49 + verification of metadata values, looking for values that are in range (and hence 50 + not detected by automated verification checks) but are not correct. Finding and 51 + understanding how things like cross linked block lists (e.g. sibling 52 + pointers in a btree end up with loops in them) are the key to understanding what 53 + went wrong, but it is impossible to tell what order the blocks were linked into 54 + each other or written to disk after the fact. 55 + 56 + Hence we need to record more information into the metadata to allow us to 57 + quickly determine if the metadata is intact and can be ignored for the purpose 58 + of analysis. We can't protect against every possible type of error, but we can 59 + ensure that common types of errors are easily detectable. Hence the concept of 60 + self describing metadata. 61 + 62 + The first, fundamental requirement of self describing metadata is that the 63 + metadata object contains some form of unique identifier in a well known 64 + location. This allows us to identify the expected contents of the block and 65 + hence parse and verify the metadata object. IF we can't independently identify 66 + the type of metadata in the object, then the metadata doesn't describe itself 67 + very well at all! 
68 + 69 + Luckily, almost all XFS metadata has magic numbers embedded already - only the 70 + AGFL, remote symlinks and remote attribute blocks do not contain identifying 71 + magic numbers. Hence we can change the on-disk format of all these objects to 72 + add more identifying information and detect this simply by changing the magic 73 + numbers in the metadata objects. That is, if it has the current magic number, 74 + the metadata isn't self identifying. If it contains a new magic number, it is 75 + self identifying and we can do much more expansive automated verification of the 76 + metadata object at runtime, during forensic analysis or repair. 77 + 78 + As a primary concern, self describing metadata needs some form of overall 79 + integrity checking. We cannot trust the metadata if we cannot verify that it has 80 + not been changed as a result of external influences. Hence we need some form of 81 + integrity check, and this is done by adding CRC32c validation to the metadata 82 + block. If we can verify the block contains the metadata it was intended to 83 + contain, a large amount of the manual verification work can be skipped. 84 + 85 + CRC32c was selected as metadata cannot be more than 64k in length in XFS and 86 + hence a 32 bit CRC is more than sufficient to detect multi-bit errors in 87 + metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is 88 + fast. So while CRC32c is not the strongest of possible integrity checks that 89 + could be used, it is more than sufficient for our needs and has relatively 90 + little overhead. Adding support for larger integrity fields and/or algorithms 91 + does not really provide any extra value over CRC32c, but it does add a lot of 92 + complexity and so there is no provision for changing the integrity checking 93 + mechanism. 
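To make the CRC32c discussion above concrete, here is a minimal bitwise sketch of the standard CRC-32C (Castagnoli) checksum in userspace C. The function name `sketch_crc32c` is hypothetical. The sketch does not show how XFS seeds the calculation, skips the CRC field, or stores the result on disk, and a real implementation would use table-driven or hardware-accelerated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (Castagnoli) using the reflected polynomial 0x82F63B78,
 * an initial value of all ones and a final inversion.
 */
static uint32_t sketch_crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++) {
			/* Mask is all ones when the low bit is set, else zero. */
			uint32_t mask = 0u - (crc & 1u);

			crc = (crc >> 1) ^ (0x82F63B78u & mask);
		}
	}
	return ~crc;
}
```

For the nine ASCII bytes `"123456789"` this yields the well-known CRC-32C check value `0xE3069283`, and flipping any single bit of the input changes the result.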
94 + 95 + Self describing metadata needs to contain enough information so that the 96 + metadata block can be verified as being in the correct place without needing to 97 + look at any other metadata. This means it needs to contain location information. 98 + Just adding a block number to the metadata is not sufficient to protect against 99 + mis-directed writes - a write might be misdirected to the wrong LUN and so be 100 + written to the "correct block" of the wrong filesystem. Hence location 101 + information must contain a filesystem identifier as well as a block number. 102 + 103 + Another key information point in forensic analysis is knowing who the metadata 104 + block belongs to. We already know the type, the location, that it is valid 105 + and/or corrupted, and how long ago it was last modified. Knowing the owner 106 + of the block is important as it allows us to find other related metadata to 107 + determine the scope of the corruption. For example, if we have an extent btree 108 + object, we don't know what inode it belongs to and hence have to walk the entire 109 + filesystem to find the owner of the block. Worse, the corruption could mean that 110 + no owner can be found (i.e. it's an orphan block), and so without an owner field 111 + in the metadata we have no idea of the scope of the corruption. If we have an 112 + owner field in the metadata object, we can immediately do top down validation to 113 + determine the scope of the problem. 114 + 115 + Different types of metadata have different owner identifiers. For example, 116 + directory, attribute and extent tree blocks are all owned by an inode, while 117 + freespace btree blocks are owned by an allocation group. Hence the size and 118 + contents of the owner field are determined by the type of metadata object we are 119 + looking at. The owner information can also identify misplaced writes (e.g. 120 + freespace btree block written to the wrong AG). 
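The location and owner requirements above can be illustrated with a small userspace sketch. `struct sketch_hdr` and `sketch_location_ok()` are hypothetical names. The fields are kept in host byte order for simplicity, whereas the on-disk fields are big-endian.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory copy of the identifying fields of a block. */
struct sketch_hdr {
	uint8_t		uuid[16];	/* filesystem identifier */
	uint64_t	owner;		/* owning inode or AG number */
	uint64_t	blkno;		/* block the metadata was written for */
};

/* Check that a block read from @read_blkno really belongs there. */
static bool sketch_location_ok(const struct sketch_hdr *hdr,
			       const uint8_t fs_uuid[16],
			       uint64_t read_blkno)
{
	if (memcmp(hdr->uuid, fs_uuid, 16) != 0)
		return false;	/* written for a different filesystem */
	if (hdr->blkno != read_blkno)
		return false;	/* right filesystem, wrong location */
	return hdr->owner != 0;	/* this sketch treats owner 0 as unset */
}
```

A block number alone would pass the second check even when the write landed on the wrong LUN, which is why the filesystem identifier is compared first.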
121 + 122 + Self describing metadata also needs to contain some indication of when it was 123 + written to the filesystem. One of the key information points when doing forensic 124 + analysis is how recently the block was modified. Correlation of a set of corrupted 125 + metadata blocks based on modification times is important as it can indicate 126 + whether the corruptions are related, whether there's been multiple corruption 127 + events that led to the eventual failure, and even whether there are corruptions 128 + present that the run-time verification is not detecting. 129 + 130 + For example, we can determine whether a metadata object is supposed to be free 131 + space or still allocated if it is still referenced by its owner by looking at 132 + when the free space btree block that contains the block was last written 133 + compared to when the metadata object itself was last written. If the free space 134 + block is more recent than the object and the object's owner, then there is a 135 + very good chance that the block should have been removed from the owner. 136 + 137 + To provide this "written timestamp", each metadata block gets the Log Sequence 138 + Number (LSN) of the most recent transaction it was modified on written into it. 139 + This number will always increase over the life of the filesystem, and the only 140 + thing that resets it is running xfs_repair on the filesystem. Further, by use of 141 + the LSN we can tell if the corrupted metadata all belonged to the same log 142 + checkpoint and hence have some idea of how much modification occurred between 143 + the first and last instance of corrupt metadata on disk and, further, how much 144 + modification occurred between the corruption being written and when it was 145 + detected. 
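The free space example above reduces to ordering LSNs. The following sketch assumes, as a labelled simplification, that an LSN is a 64 bit value with a log cycle number in the upper 32 bits and a block offset in the lower 32 bits, so that a plain integer comparison orders writes in time. The `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing for this sketch: cycle in the high half, block in the low. */
static uint64_t sketch_lsn(uint32_t cycle, uint32_t block)
{
	return ((uint64_t)cycle << 32) | block;
}

/*
 * If the free space btree block was written more recently than both the
 * object and the object's owner, the object has probably been freed and the
 * owner's reference to it is stale.
 */
static bool sketch_probably_freed(uint64_t freesp_lsn, uint64_t obj_lsn,
				  uint64_t owner_lsn)
{
	return freesp_lsn > obj_lsn && freesp_lsn > owner_lsn;
}
```

This is a heuristic for narrowing down forensic analysis, not a proof: it only says which write happened last, which is the information the "written timestamp" is meant to provide.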
146 + 147 + Runtime Validation 148 + ================== 149 + 150 + Validation of self-describing metadata takes place at runtime in two places: 151 + 152 + - immediately after a successful read from disk 153 + - immediately prior to write IO submission 154 + 155 + The verification is completely stateless - it is done independently of the 156 + modification process, and seeks only to check that the metadata is what it says 157 + it is and that the metadata fields are within bounds and internally consistent. 158 + As such, we cannot catch all types of corruption that can occur within a block 159 + as there may be certain limitations that operational state enforces of the 160 + metadata, or there may be corruption of interblock relationships (e.g. corrupted 161 + sibling pointer lists). Hence we still need stateful checking in the main code 162 + body, but in general most of the per-field validation is handled by the 163 + verifiers. 164 + 165 + For read verification, the caller needs to specify the expected type of metadata 166 + that it should see, and the IO completion process verifies that the metadata 167 + object matches what was expected. If the verification process fails, then it 168 + marks the object being read as EFSCORRUPTED. The caller needs to catch this 169 + error (same as for IO errors), and if it needs to take special action due to a 170 + verification error it can do so by catching the EFSCORRUPTED error value. If we 171 + need more discrimination of error type at higher levels, we can define new 172 + error numbers for different errors as necessary. 173 + 174 + The first step in read verification is checking the magic number and determining 175 + whether CRC validating is necessary. If it is, the CRC32c is calculated and 176 + compared against the value stored in the object itself. Once this is validated, 177 + further checks are made against the location information, followed by extensive 178 + object specific metadata validation. 
If any of these checks fail, then the 179 + buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. 180 + 181 + Write verification is the opposite of the read verification - first the object 182 + is extensively verified and if it is OK we then update the LSN from the last 183 + modification made to the object. After this, we calculate the CRC and insert it 184 + into the object. Once this is done the write IO is allowed to continue. If any 185 + error occurs during this process, the buffer is again marked with an EFSCORRUPTED 186 + error for the higher layers to catch. 187 + 188 + Structures 189 + ========== 190 + 191 + A typical on-disk structure needs to contain the following information:: 192 + 193 + struct xfs_ondisk_hdr { 194 + __be32 magic; /* magic number */ 195 + __be32 crc; /* CRC, not logged */ 196 + uuid_t uuid; /* filesystem identifier */ 197 + __be64 owner; /* parent object */ 198 + __be64 blkno; /* location on disk */ 199 + __be64 lsn; /* last modification in log, not logged */ 200 + }; 201 + 202 + Depending on the metadata, this information may be part of a header structure 203 + separate to the metadata contents, or may be distributed through an existing 204 + structure. The latter occurs with metadata that already contains some of this 205 + information, such as the superblock and AG headers. 206 + 207 + Other metadata may have different formats for the information, but the same 208 + level of information is generally provided. For example: 209 + 210 + - short btree blocks have a 32 bit owner (ag number) and a 32 bit block 211 + number for location. The two of these combined provide the same 212 + information as @owner and @blkno in the above structure, but using 8 213 + bytes less space on disk. 214 + 215 + - directory/attribute node blocks have a 16 bit magic number, and the 216 + header that contains the magic number has other information in it as 217 + well. 
Hence the additional metadata headers change the overall format 218 + of the metadata. 219 + 220 + A typical buffer read verifier is structured as follows:: 221 + 222 + #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) 223 + 224 + static void 225 + xfs_foo_read_verify( 226 + struct xfs_buf *bp) 227 + { 228 + struct xfs_mount *mp = bp->b_mount; 229 + 230 + if ((xfs_sb_version_hascrc(&mp->m_sb) && 231 + !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), 232 + XFS_FOO_CRC_OFF)) || 233 + !xfs_foo_verify(bp)) { 234 + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); 235 + xfs_buf_ioerror(bp, EFSCORRUPTED); 236 + } 237 + } 238 + 239 + The code ensures that the CRC is only checked if the filesystem has CRCs enabled 240 + by checking the feature bit in the superblock, and then if the CRC verifies OK 241 + (or is not needed) it verifies the actual contents of the block. 242 + 243 + The verifier function will take a couple of different forms, depending on 244 + whether the magic number can be used to determine the format of the block. 
In 245 + the case it can't, the code is structured as follows:: 246 + 247 + static bool 248 + xfs_foo_verify( 249 + struct xfs_buf *bp) 250 + { 251 + struct xfs_mount *mp = bp->b_mount; 252 + struct xfs_ondisk_hdr *hdr = bp->b_addr; 253 + 254 + if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) 255 + return false; 256 + 257 + if (xfs_sb_version_hascrc(&mp->m_sb)) { 258 + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) 259 + return false; 260 + if (bp->b_bn != be64_to_cpu(hdr->blkno)) 261 + return false; 262 + if (hdr->owner == 0) 263 + return false; 264 + } 265 + 266 + /* object specific verification checks here */ 267 + 268 + return true; 269 + } 270 + 271 + If there are different magic numbers for the different formats, the verifier 272 + will look like:: 273 + 274 + static bool 275 + xfs_foo_verify( 276 + struct xfs_buf *bp) 277 + { 278 + struct xfs_mount *mp = bp->b_mount; 279 + struct xfs_ondisk_hdr *hdr = bp->b_addr; 280 + 281 + if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { 282 + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) 283 + return false; 284 + if (bp->b_bn != be64_to_cpu(hdr->blkno)) 285 + return false; 286 + if (hdr->owner == 0) 287 + return false; 288 + } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) 289 + return false; 290 + 291 + /* object specific verification checks here */ 292 + 293 + return true; 294 + } 295 + 296 + Write verifiers are very similar to the read verifiers; they just do things in 297 + the opposite order to the read verifiers. 
A typical write verifier:: 298 + 299 + static void 300 + xfs_foo_write_verify( 301 + struct xfs_buf *bp) 302 + { 303 + struct xfs_mount *mp = bp->b_mount; 304 + struct xfs_buf_log_item *bip = bp->b_fspriv; 305 + 306 + if (!xfs_foo_verify(bp)) { 307 + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); 308 + xfs_buf_ioerror(bp, EFSCORRUPTED); 309 + return; 310 + } 311 + 312 + if (!xfs_sb_version_hascrc(&mp->m_sb)) 313 + return; 314 + 315 + 316 + if (bip) { 317 + struct xfs_ondisk_hdr *hdr = bp->b_addr; 318 + hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); 319 + } 320 + xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); 321 + } 322 + 323 + This will verify the internal structure of the metadata before we go any 324 + further, detecting corruptions that have occurred as the metadata has been 325 + modified in memory. If the metadata verifies OK, and CRCs are enabled, we then 326 + update the LSN field (when it was last modified) and calculate the CRC on the 327 + metadata. Once this is done, we can issue the IO. 328 + 329 + Inodes and Dquots 330 + ================= 331 + 332 + Inodes and dquots are special snowflakes. They have per-object CRC and 333 + self-identifiers, but they are packed so that there are multiple objects per 334 + buffer. Hence we do not use per-buffer verifiers to do the work of per-object 335 + verification and CRC calculations. The per-buffer verifiers simply perform basic 336 + identification of the buffer - that they contain inodes or dquots, and that 337 + there are magic numbers in all the expected spots. All further CRC and 338 + verification checks are done when each inode is read from or written back to the 339 + buffer. 340 + 341 + The structure of the verifiers and the identifiers checks is very similar to the 342 + buffer code described above. The only difference is where they are called. 
For 343 + example, inode read verification is done in xfs_iread() when the inode is first 344 + read out of the buffer and the struct xfs_inode is instantiated. The inode is 345 + already extensively verified during writeback in xfs_iflush_int, so the only 346 + addition here is to add the LSN and CRC to the inode as it is copied back into 347 + the buffer. 348 + 349 + XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of 350 + the unlinked list modifications check or update CRCs, neither during unlink nor 351 + log recovery. So, it's gone unnoticed until now. This won't matter immediately - 352 + repair will probably complain about it - but it needs to be fixed.
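The split between per-buffer and per-object verification described in the "Inodes and Dquots" section can be sketched as follows. Everything here is hypothetical: `struct sketch_obj`, the magic value and the function name are made up for illustration and do not match the on-disk inode or dquot layout. The per-buffer pass only confirms that every slot carries the expected magic number. CRC and content checks are left to the per-object code and are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OBJ_MAGIC	0xABCDu	/* made-up magic for this sketch */

/* Hypothetical packed object; several of these share one buffer. */
struct sketch_obj {
	uint16_t	magic;		/* identifies the object type */
	uint16_t	pad;
	uint32_t	crc;		/* per-object CRC, checked elsewhere */
	uint8_t		payload[24];
};

/* Per-buffer verifier: basic identification of every slot, nothing more. */
static bool sketch_buf_verify(const struct sketch_obj *objs, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (objs[i].magic != SKETCH_OBJ_MAGIC)
			return false;
	}
	return true;
}
```

Keeping the per-buffer check this cheap means one damaged object marks the buffer as suspect, while the more expensive per-object checks run only for the objects that are actually read or written back.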

-350

Documentation/filesystems/xfs-self-describing-metadata.txt

··· 1 - XFS Self Describing Metadata 2 - ---------------------------- 3 - 4 - Introduction 5 - ------------ 6 - 7 - The largest scalability problem facing XFS is not one of algorithmic 8 - scalability, but of verification of the filesystem structure. Scalabilty of the 9 - structures and indexes on disk and the algorithms for iterating them are 10 - adequate for supporting PB scale filesystems with billions of inodes, however it 11 - is this very scalability that causes the verification problem. 12 - 13 - Almost all metadata on XFS is dynamically allocated. The only fixed location 14 - metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all 15 - other metadata structures need to be discovered by walking the filesystem 16 - structure in different ways. While this is already done by userspace tools for 17 - validating and repairing the structure, there are limits to what they can 18 - verify, and this in turn limits the supportable size of an XFS filesystem. 19 - 20 - For example, it is entirely possible to manually use xfs_db and a bit of 21 - scripting to analyse the structure of a 100TB filesystem when trying to 22 - determine the root cause of a corruption problem, but it is still mainly a 23 - manual task of verifying that things like single bit errors or misplaced writes 24 - weren't the ultimate cause of a corruption event. It may take a few hours to a 25 - few days to perform such forensic analysis, so for at this scale root cause 26 - analysis is entirely possible. 27 - 28 - However, if we scale the filesystem up to 1PB, we now have 10x as much metadata 29 - to analyse and so that analysis blows out towards weeks/months of forensic work. 30 - Most of the analysis work is slow and tedious, so as the amount of analysis goes 31 - up, the more likely that the cause will be lost in the noise. 
Hence the primary 32 - concern for supporting PB scale filesystems is minimising the time and effort 33 - required for basic forensic analysis of the filesystem structure. 34 - 35 - 36 - Self Describing Metadata 37 - ------------------------ 38 - 39 - One of the problems with the current metadata format is that apart from the 40 - magic number in the metadata block, we have no other way of identifying what it 41 - is supposed to be. We can't even identify if it is the right place. Put simply, 42 - you can't look at a single metadata block in isolation and say "yes, it is 43 - supposed to be there and the contents are valid". 44 - 45 - Hence most of the time spent on forensic analysis is spent doing basic 46 - verification of metadata values, looking for values that are in range (and hence 47 - not detected by automated verification checks) but are not correct. Finding and 48 - understanding how things like cross linked block lists (e.g. sibling 49 - pointers in a btree end up with loops in them) are the key to understanding what 50 - went wrong, but it is impossible to tell what order the blocks were linked into 51 - each other or written to disk after the fact. 52 - 53 - Hence we need to record more information into the metadata to allow us to 54 - quickly determine if the metadata is intact and can be ignored for the purpose 55 - of analysis. We can't protect against every possible type of error, but we can 56 - ensure that common types of errors are easily detectable. Hence the concept of 57 - self describing metadata. 58 - 59 - The first, fundamental requirement of self describing metadata is that the 60 - metadata object contains some form of unique identifier in a well known 61 - location. This allows us to identify the expected contents of the block and 62 - hence parse and verify the metadata object. IF we can't independently identify 63 - the type of metadata in the object, then the metadata doesn't describe itself 64 - very well at all! 
65 - 66 - Luckily, almost all XFS metadata has magic numbers embedded already - only the 67 - AGFL, remote symlinks and remote attribute blocks do not contain identifying 68 - magic numbers. Hence we can change the on-disk format of all these objects to 69 - add more identifying information and detect this simply by changing the magic 70 - numbers in the metadata objects. That is, if it has the current magic number, 71 - the metadata isn't self identifying. If it contains a new magic number, it is 72 - self identifying and we can do much more expansive automated verification of the 73 - metadata object at runtime, during forensic analysis or repair. 74 - 75 - As a primary concern, self describing metadata needs some form of overall 76 - integrity checking. We cannot trust the metadata if we cannot verify that it has 77 - not been changed as a result of external influences. Hence we need some form of 78 - integrity check, and this is done by adding CRC32c validation to the metadata 79 - block. If we can verify the block contains the metadata it was intended to 80 - contain, a large amount of the manual verification work can be skipped. 81 - 82 - CRC32c was selected as metadata cannot be more than 64k in length in XFS and 83 - hence a 32 bit CRC is more than sufficient to detect multi-bit errors in 84 - metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is 85 - fast. So while CRC32c is not the strongest of possible integrity checks that 86 - could be used, it is more than sufficient for our needs and has relatively 87 - little overhead. Adding support for larger integrity fields and/or algorithms 88 - does really provide any extra value over CRC32c, but it does add a lot of 89 - complexity and so there is no provision for changing the integrity checking 90 - mechanism. 
91 - 92 - Self describing metadata needs to contain enough information so that the 93 - metadata block can be verified as being in the correct place without needing to 94 - look at any other metadata. This means it needs to contain location information. 95 - Just adding a block number to the metadata is not sufficient to protect against 96 - mis-directed writes - a write might be misdirected to the wrong LUN and so be 97 - written to the "correct block" of the wrong filesystem. Hence location 98 - information must contain a filesystem identifier as well as a block number. 99 - 100 - Another key information point in forensic analysis is knowing who the metadata 101 - block belongs to. We already know the type, the location, that it is valid 102 - and/or corrupted, and how long ago that it was last modified. Knowing the owner 103 - of the block is important as it allows us to find other related metadata to 104 - determine the scope of the corruption. For example, if we have a extent btree 105 - object, we don't know what inode it belongs to and hence have to walk the entire 106 - filesystem to find the owner of the block. Worse, the corruption could mean that 107 - no owner can be found (i.e. it's an orphan block), and so without an owner field 108 - in the metadata we have no idea of the scope of the corruption. If we have an 109 - owner field in the metadata object, we can immediately do top down validation to 110 - determine the scope of the problem. 111 - 112 - Different types of metadata have different owner identifiers. For example, 113 - directory, attribute and extent tree blocks are all owned by an inode, while 114 - freespace btree blocks are owned by an allocation group. Hence the size and 115 - contents of the owner field are determined by the type of metadata object we are 116 - looking at. The owner information can also identify misplaced writes (e.g. 117 - freespace btree block written to the wrong AG). 
118 - 119 - Self describing metadata also needs to contain some indication of when it was 120 - written to the filesystem. One of the key information points when doing forensic 121 - analysis is how recently the block was modified. Correlation of set of corrupted 122 - metadata blocks based on modification times is important as it can indicate 123 - whether the corruptions are related, whether there's been multiple corruption 124 - events that lead to the eventual failure, and even whether there are corruptions 125 - present that the run-time verification is not detecting. 126 - 127 - For example, we can determine whether a metadata object is supposed to be free 128 - space or still allocated if it is still referenced by its owner by looking at 129 - when the free space btree block that contains the block was last written 130 - compared to when the metadata object itself was last written. If the free space 131 - block is more recent than the object and the object's owner, then there is a 132 - very good chance that the block should have been removed from the owner. 133 - 134 - To provide this "written timestamp", each metadata block gets the Log Sequence 135 - Number (LSN) of the most recent transaction it was modified on written into it. 136 - This number will always increase over the life of the filesystem, and the only 137 - thing that resets it is running xfs_repair on the filesystem. Further, by use of 138 - the LSN we can tell if the corrupted metadata all belonged to the same log 139 - checkpoint and hence have some idea of how much modification occurred between 140 - the first and last instance of corrupt metadata on disk and, further, how much 141 - modification occurred between the corruption being written and when it was 142 - detected. 
143 - 144 - Runtime Validation 145 - ------------------ 146 - 147 - Validation of self-describing metadata takes place at runtime in two places: 148 - 149 - - immediately after a successful read from disk 150 - - immediately prior to write IO submission 151 - 152 - The verification is completely stateless - it is done independently of the 153 - modification process, and seeks only to check that the metadata is what it says 154 - it is and that the metadata fields are within bounds and internally consistent. 155 - As such, we cannot catch all types of corruption that can occur within a block 156 - as there may be certain limitations that operational state enforces of the 157 - metadata, or there may be corruption of interblock relationships (e.g. corrupted 158 - sibling pointer lists). Hence we still need stateful checking in the main code 159 - body, but in general most of the per-field validation is handled by the 160 - verifiers. 161 - 162 - For read verification, the caller needs to specify the expected type of metadata 163 - that it should see, and the IO completion process verifies that the metadata 164 - object matches what was expected. If the verification process fails, then it 165 - marks the object being read as EFSCORRUPTED. The caller needs to catch this 166 - error (same as for IO errors), and if it needs to take special action due to a 167 - verification error it can do so by catching the EFSCORRUPTED error value. If we 168 - need more discrimination of error type at higher levels, we can define new 169 - error numbers for different errors as necessary. 170 - 171 - The first step in read verification is checking the magic number and determining 172 - whether CRC validating is necessary. If it is, the CRC32c is calculated and 173 - compared against the value stored in the object itself. Once this is validated, 174 - further checks are made against the location information, followed by extensive 175 - object specific metadata validation. 
If any of these checks fail, then the 176 - buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. 177 - 178 - Write verification is the opposite of the read verification - first the object 179 - is extensively verified and if it is OK we then update the LSN from the last 180 - modification made to the object. After this, we calculate the CRC and insert it 181 - into the object. Once this is done the write IO is allowed to continue. If any 182 - error occurs during this process, the buffer is again marked with an EFSCORRUPTED 183 - error for the higher layers to catch. 184 - 185 - Structures 186 - ---------- 187 - 188 - A typical on-disk structure needs to contain the following information: 189 - 190 - struct xfs_ondisk_hdr { 191 - __be32 magic; /* magic number */ 192 - __be32 crc; /* CRC, not logged */ 193 - uuid_t uuid; /* filesystem identifier */ 194 - __be64 owner; /* parent object */ 195 - __be64 blkno; /* location on disk */ 196 - __be64 lsn; /* last modification in log, not logged */ 197 - }; 198 - 199 - Depending on the metadata, this information may be part of a header structure 200 - separate to the metadata contents, or may be distributed through an existing 201 - structure. The latter occurs with metadata that already contains some of this 202 - information, such as the superblock and AG headers. 203 - 204 - Other metadata may have different formats for the information, but the same 205 - level of information is generally provided. For example: 206 - 207 - - short btree blocks have a 32 bit owner (ag number) and a 32 bit block 208 - number for location. The two of these combined provide the same 209 - information as @owner and @blkno in the above structure, but using 8 210 - bytes less space on disk. 211 - 212 - - directory/attribute node blocks have a 16 bit magic number, and the 213 - header that contains the magic number has other information in it as 214 - well. 
hence the additional metadata headers change the overall format 215 - of the metadata. 216 - 217 - A typical buffer read verifier is structured as follows: 218 - 219 - #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) 220 - 221 - static void 222 - xfs_foo_read_verify( 223 - struct xfs_buf *bp) 224 - { 225 - struct xfs_mount *mp = bp->b_mount; 226 - 227 - if ((xfs_sb_version_hascrc(&mp->m_sb) && 228 - !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), 229 - XFS_FOO_CRC_OFF)) || 230 - !xfs_foo_verify(bp)) { 231 - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); 232 - xfs_buf_ioerror(bp, EFSCORRUPTED); 233 - } 234 - } 235 - 236 - The code ensures that the CRC is only checked if the filesystem has CRCs enabled 237 - by checking the superblock of the feature bit, and then if the CRC verifies OK 238 - (or is not needed) it verifies the actual contents of the block. 239 - 240 - The verifier function will take a couple of different forms, depending on 241 - whether the magic number can be used to determine the format of the block. 
In 242 - the case it can't, the code is structured as follows: 243 - 244 - static bool 245 - xfs_foo_verify( 246 - struct xfs_buf *bp) 247 - { 248 - struct xfs_mount *mp = bp->b_mount; 249 - struct xfs_ondisk_hdr *hdr = bp->b_addr; 250 - 251 - if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) 252 - return false; 253 - 254 - if (xfs_sb_version_hascrc(&mp->m_sb)) { 255 - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) 256 - return false; 257 - if (bp->b_bn != be64_to_cpu(hdr->blkno)) 258 - return false; 259 - if (hdr->owner == 0) 260 - return false; 261 - } 262 - 263 - /* object specific verification checks here */ 264 - 265 - return true; 266 - } 267 - 268 - If there are different magic numbers for the different formats, the verifier 269 - will look like: 270 - 271 - static bool 272 - xfs_foo_verify( 273 - struct xfs_buf *bp) 274 - { 275 - struct xfs_mount *mp = bp->b_mount; 276 - struct xfs_ondisk_hdr *hdr = bp->b_addr; 277 - 278 - if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { 279 - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) 280 - return false; 281 - if (bp->b_bn != be64_to_cpu(hdr->blkno)) 282 - return false; 283 - if (hdr->owner == 0) 284 - return false; 285 - } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) 286 - return false; 287 - 288 - /* object specific verification checks here */ 289 - 290 - return true; 291 - } 292 - 293 - Write verifiers are very similar to the read verifiers, they just do things in 294 - the opposite order to the read verifiers. 
A typical write verifier: 295 - 296 - static void 297 - xfs_foo_write_verify( 298 - struct xfs_buf *bp) 299 - { 300 - struct xfs_mount *mp = bp->b_mount; 301 - struct xfs_buf_log_item *bip = bp->b_fspriv; 302 - 303 - if (!xfs_foo_verify(bp)) { 304 - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); 305 - xfs_buf_ioerror(bp, EFSCORRUPTED); 306 - return; 307 - } 308 - 309 - if (!xfs_sb_version_hascrc(&mp->m_sb)) 310 - return; 311 - 312 - 313 - if (bip) { 314 - struct xfs_ondisk_hdr *hdr = bp->b_addr; 315 - hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); 316 - } 317 - xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); 318 - } 319 - 320 - This will verify the internal structure of the metadata before we go any 321 - further, detecting corruptions that have occurred as the metadata has been 322 - modified in memory. If the metadata verifies OK, and CRCs are enabled, we then 323 - update the LSN field (when it was last modified) and calculate the CRC on the 324 - metadata. Once this is done, we can issue the IO. 325 - 326 - Inodes and Dquots 327 - ----------------- 328 - 329 - Inodes and dquots are special snowflakes. They have per-object CRC and 330 - self-identifiers, but they are packed so that there are multiple objects per 331 - buffer. Hence we do not use per-buffer verifiers to do the work of per-object 332 - verification and CRC calculations. The per-buffer verifiers simply perform basic 333 - identification of the buffer - that they contain inodes or dquots, and that 334 - there are magic numbers in all the expected spots. All further CRC and 335 - verification checks are done when each inode is read from or written back to the 336 - buffer. 337 - 338 - The structure of the verifiers and the identifiers checks is very similar to the 339 - buffer code described above. The only difference is where they are called. 
For 340 - example, inode read verification is done in xfs_iread() when the inode is first 341 - read out of the buffer and the struct xfs_inode is instantiated. The inode is 342 - already extensively verified during writeback in xfs_iflush_int, so the only 343 - addition here is to add the LSN and CRC to the inode as it is copied back into 344 - the buffer. 345 - 346 - XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of 347 - the unlinked list modifications check or update CRCs, neither during unlink nor 348 - log recovery. So, it's gone unnoticed until now. This won't matter immediately - 349 - repair will probably complain about it - but it needs to be fixed. 350 -
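
Editorial sketch (not part of the removed file): the read and write verification order described in the document above can be tried out in user space with a small, self-contained C program. Everything named demo_* and DEMO_MAGIC is hypothetical and exists only for this illustration. The sketch assumes a host-byte-order header (the real on-disk fields are big-endian), uses a plain bitwise CRC32c rather than the kernel's accelerated implementation, and assumes the checksum is computed with the crc field treated as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAGIC 0x44454d4fu	/* "DEMO": hypothetical magic number */

/*
 * Hypothetical header modelled on struct xfs_ondisk_hdr. Fields are kept in
 * host byte order for brevity; the real on-disk format is big-endian.
 */
struct demo_hdr {
	uint32_t magic;
	uint32_t crc;		/* CRC32c of the block, with this field as zero */
	uint8_t  uuid[16];	/* filesystem identifier */
	uint64_t owner;		/* parent object */
	uint64_t blkno;		/* location on disk */
	uint64_t lsn;		/* last modification */
};

/* Bitwise CRC32c (Castagnoli, reflected polynomial 0x82F63B78). */
static uint32_t demo_crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Checksum the whole block, substituting zeroes for the crc field. */
static uint32_t demo_block_cksum(const uint8_t *blk, size_t len)
{
	const size_t off = offsetof(struct demo_hdr, crc);
	const uint32_t zero = 0;
	uint32_t crc;

	assert(len >= sizeof(struct demo_hdr));
	crc = demo_crc32c(0, blk, off);
	crc = demo_crc32c(crc, &zero, sizeof(zero));
	return demo_crc32c(crc, blk + off + sizeof(zero),
			   len - off - sizeof(zero));
}

/* Write side: stamp the LSN first, then checksum the final contents. */
static void demo_write_prepare(uint8_t *blk, size_t len, uint64_t lsn)
{
	struct demo_hdr hdr;

	memcpy(&hdr, blk, sizeof(hdr));
	hdr.lsn = lsn;
	memcpy(blk, &hdr, sizeof(hdr));
	hdr.crc = demo_block_cksum(blk, len);
	memcpy(blk, &hdr, sizeof(hdr));
}

/* Read side: magic, then CRC, then location and owner. 0 = OK, -1 = corrupt. */
static int demo_read_verify(const uint8_t *blk, size_t len,
			    const uint8_t fs_uuid[16], uint64_t blkno)
{
	struct demo_hdr hdr;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, blk, sizeof(hdr));
	if (hdr.magic != DEMO_MAGIC)
		return -1;
	if (hdr.crc != demo_block_cksum(blk, len))
		return -1;
	if (memcmp(hdr.uuid, fs_uuid, sizeof(hdr.uuid)) != 0)
		return -1;
	if (hdr.blkno != blkno)
		return -1;
	if (hdr.owner == 0)
		return -1;
	return 0;
}
```

The write side updates the LSN before computing the CRC so that the checksum covers the final contents of the block, mirroring the "verify, stamp LSN, checksum" order in the text. Passing the filesystem UUID and the expected block number into the read check is what lets a block written to the right offset of the wrong device be rejected.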

Documentation/futex-requeue-pi.txt Documentation/locking/futex-requeue-pi.rst

Documentation/hwspinlock.txt Documentation/locking/hwspinlock.rst

-1341

Documentation/i2c/i2c.svg

··· 1 - <?xml version="1.0" encoding="UTF-8" standalone="no"?> 2 -  3 - 4 - <svg 5 - xmlns:dc="http://purl.org/dc/elements/1.1/" 6 - xmlns:cc="http://creativecommons.org/ns#" 7 - xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 8 - xmlns:svg="http://www.w3.org/2000/svg" 9 - xmlns="http://www.w3.org/2000/svg" 10 - xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" 11 - xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" 12 - sodipodi:docname="i2c.svg" 13 - inkscape:version="0.92.3 (2405546, 2018-03-11)" 14 - version="1.1" 15 - id="svg2" 16 - viewBox="0 0 813.34215 261.01596" 17 - height="73.664505mm" 18 - width="229.54323mm"> 19 - <defs 20 - id="defs4"> 21 - <marker 22 - inkscape:stockid="DotM" 23 - orient="auto" 24 - refY="0" 25 - refX="0" 26 - id="marker8861" 27 - style="overflow:visible" 28 - inkscape:isstock="true"> 29 - <path 30 - inkscape:connector-curvature="0" 31 - id="path8859" 32 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 33 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 34 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 35 - </marker> 36 - <marker 37 - inkscape:isstock="true" 38 - style="overflow:visible" 39 - id="marker6165" 40 - refX="0" 41 - refY="0" 42 - orient="auto" 43 - inkscape:stockid="DotM" 44 - inkscape:collect="always"> 45 - <path 46 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 47 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 48 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 49 - id="path6163" 50 - inkscape:connector-curvature="0" /> 51 - </marker> 52 - <marker 53 - inkscape:isstock="true" 54 - style="overflow:visible" 55 - id="marker2713" 56 - refX="0" 57 - refY="0" 58 - orient="auto" 59 - inkscape:stockid="DotM"> 60 - <path 61 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 
62 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 63 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 64 - id="path2711" 65 - inkscape:connector-curvature="0" /> 66 - </marker> 67 - <marker 68 - inkscape:stockid="DotM" 69 - orient="auto" 70 - refY="0" 71 - refX="0" 72 - id="DotM" 73 - style="overflow:visible" 74 - inkscape:isstock="true" 75 - inkscape:collect="always"> 76 - <path 77 - id="path1795" 78 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 79 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 80 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 81 - inkscape:connector-curvature="0" /> 82 - </marker> 83 - <marker 84 - inkscape:isstock="true" 85 - style="overflow:visible" 86 - id="marker6389" 87 - refX="0" 88 - refY="0" 89 - orient="auto" 90 - inkscape:stockid="Arrow2Mend"> 91 - <path 92 - transform="scale(-0.6)" 93 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 94 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 95 - id="path6387" 96 - inkscape:connector-curvature="0" /> 97 - </marker> 98 - <marker 99 - inkscape:stockid="TriangleOutM" 100 - orient="auto" 101 - refY="0" 102 - refX="0" 103 - id="TriangleOutM" 104 - style="overflow:visible" 105 - inkscape:isstock="true"> 106 - <path 107 - id="path4107" 108 - d="M 5.77,0 -2.88,5 V -5 Z" 109 - style="fill:#008000;fill-opacity:1;fill-rule:evenodd;stroke:#008000;stroke-width:1.00000003pt;stroke-opacity:1" 110 - transform="scale(0.4)" 111 - inkscape:connector-curvature="0" /> 112 - </marker> 113 - <marker 114 - inkscape:stockid="TriangleOutL" 115 - orient="auto" 116 - refY="0" 117 - refX="0" 118 - id="marker5333" 119 - 
style="overflow:visible" 120 - inkscape:isstock="true"> 121 - <path 122 - id="path5335" 123 - d="M 5.77,0 -2.88,5 V -5 Z" 124 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 125 - transform="scale(0.8)" 126 - inkscape:connector-curvature="0" /> 127 - </marker> 128 - <marker 129 - inkscape:stockid="TriangleOutL" 130 - orient="auto" 131 - refY="0" 132 - refX="0" 133 - id="TriangleOutL" 134 - style="overflow:visible" 135 - inkscape:isstock="true"> 136 - <path 137 - id="path5049" 138 - d="M 5.77,0 -2.88,5 V -5 Z" 139 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 140 - transform="scale(0.8)" 141 - inkscape:connector-curvature="0" /> 142 - </marker> 143 - <marker 144 - inkscape:stockid="DotS" 145 - orient="auto" 146 - refY="0" 147 - refX="0" 148 - id="DotS" 149 - style="overflow:visible" 150 - inkscape:isstock="true"> 151 - <path 152 - id="path9326" 153 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 154 - style="fill:#404040;fill-opacity:1;fill-rule:evenodd;stroke:#404040;stroke-width:1.00000003pt;stroke-opacity:1" 155 - transform="matrix(0.2,0,0,0.2,1.48,0.2)" 156 - inkscape:connector-curvature="0" /> 157 - </marker> 158 - <marker 159 - inkscape:stockid="Arrow2Mstart" 160 - orient="auto" 161 - refY="0" 162 - refX="0" 163 - id="Arrow2Mstart" 164 - style="overflow:visible" 165 - inkscape:isstock="true"> 166 - <path 167 - id="path9283" 168 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 169 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 170 - transform="scale(0.6)" 171 - inkscape:connector-curvature="0" /> 172 - </marker> 173 - <marker 174 - inkscape:stockid="Arrow2Mend" 175 - orient="auto" 176 - refY="0" 177 - refX="0" 178 
- id="marker9095" 179 - style="overflow:visible" 180 - inkscape:isstock="true"> 181 - <path 182 - id="path9097" 183 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 184 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 185 - transform="scale(-0.6)" 186 - inkscape:connector-curvature="0" /> 187 - </marker> 188 - <marker 189 - inkscape:stockid="Arrow2Mend" 190 - orient="auto" 191 - refY="0" 192 - refX="0" 193 - id="marker8935" 194 - style="overflow:visible" 195 - inkscape:isstock="true"> 196 - <path 197 - id="path8937" 198 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 199 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 200 - transform="scale(-0.6)" 201 - inkscape:connector-curvature="0" /> 202 - </marker> 203 - <marker 204 - inkscape:stockid="Arrow2Mend" 205 - orient="auto" 206 - refY="0" 207 - refX="0" 208 - id="marker8781" 209 - style="overflow:visible" 210 - inkscape:isstock="true"> 211 - <path 212 - id="path8783" 213 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 214 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 215 - transform="scale(-0.6)" 216 - inkscape:connector-curvature="0" /> 217 - </marker> 218 - <marker 219 - inkscape:stockid="Arrow2Lend" 220 - orient="auto" 221 - refY="0" 222 - refX="0" 223 - id="Arrow2Lend" 224 - style="overflow:visible" 225 - inkscape:isstock="true"> 226 - <path 227 - id="path4700" 228 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 229 - d="M 8.7185878,4.0337352 
-2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 230 - transform="matrix(-1.1,0,0,-1.1,-1.1,0)" 231 - inkscape:connector-curvature="0" /> 232 - </marker> 233 - <marker 234 - inkscape:stockid="EmptyTriangleOutL" 235 - orient="auto" 236 - refY="0" 237 - refX="0" 238 - id="EmptyTriangleOutL" 239 - style="overflow:visible" 240 - inkscape:isstock="true"> 241 - <path 242 - id="path4502" 243 - d="M 5.77,0 -2.88,5 V -5 Z" 244 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 245 - transform="matrix(0.8,0,0,0.8,-4.8,0)" 246 - inkscape:connector-curvature="0" /> 247 - </marker> 248 - <marker 249 - inkscape:stockid="EmptyTriangleOutL" 250 - orient="auto" 251 - refY="0" 252 - refX="0" 253 - id="EmptyTriangleOutL-4" 254 - style="overflow:visible" 255 - inkscape:isstock="true"> 256 - <path 257 - inkscape:connector-curvature="0" 258 - id="path4502-7" 259 - d="M 5.77,0 -2.88,5 V -5 Z" 260 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 261 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 262 - </marker> 263 - <marker 264 - inkscape:stockid="EmptyTriangleOutL" 265 - orient="auto" 266 - refY="0" 267 - refX="0" 268 - id="EmptyTriangleOutL-4-4" 269 - style="overflow:visible" 270 - inkscape:isstock="true"> 271 - <path 272 - inkscape:connector-curvature="0" 273 - id="path4502-7-5" 274 - d="M 5.77,0 -2.88,5 V -5 Z" 275 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 276 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 277 - </marker> 278 - <marker 279 - inkscape:stockid="EmptyTriangleOutL" 280 - orient="auto" 281 - refY="0" 282 - refX="0" 283 - id="EmptyTriangleOutL-7" 284 - style="overflow:visible" 285 - inkscape:isstock="true"> 286 - <path 287 - inkscape:connector-curvature="0" 288 - id="path4502-5" 289 - d="M 5.77,0 -2.88,5 V -5 Z" 290 - 
style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 291 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 292 - </marker> 293 - <marker 294 - inkscape:stockid="EmptyTriangleOutL" 295 - orient="auto" 296 - refY="0" 297 - refX="0" 298 - id="EmptyTriangleOutL-6" 299 - style="overflow:visible" 300 - inkscape:isstock="true"> 301 - <path 302 - inkscape:connector-curvature="0" 303 - id="path4502-2" 304 - d="M 5.77,0 -2.88,5 V -5 Z" 305 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 306 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 307 - </marker> 308 - <marker 309 - inkscape:stockid="EmptyTriangleOutL" 310 - orient="auto" 311 - refY="0" 312 - refX="0" 313 - id="EmptyTriangleOutL-78" 314 - style="overflow:visible" 315 - inkscape:isstock="true"> 316 - <path 317 - inkscape:connector-curvature="0" 318 - id="path4502-57" 319 - d="M 5.77,0 -2.88,5 V -5 Z" 320 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 321 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 322 - </marker> 323 - <marker 324 - inkscape:stockid="EmptyTriangleOutL" 325 - orient="auto" 326 - refY="0" 327 - refX="0" 328 - id="EmptyTriangleOutL-1" 329 - style="overflow:visible" 330 - inkscape:isstock="true"> 331 - <path 332 - inkscape:connector-curvature="0" 333 - id="path4502-8" 334 - d="M 5.77,0 -2.88,5 V -5 Z" 335 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 336 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 337 - </marker> 338 - <marker 339 - inkscape:stockid="EmptyTriangleOutL" 340 - orient="auto" 341 - refY="0" 342 - refX="0" 343 - id="EmptyTriangleOutL-9" 344 - style="overflow:visible" 345 - inkscape:isstock="true"> 346 - <path 347 - inkscape:connector-curvature="0" 348 - id="path4502-75" 349 - d="M 5.77,0 -2.88,5 V -5 Z" 350 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 351 
- transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 352 - </marker> 353 - <marker 354 - inkscape:stockid="EmptyTriangleOutL" 355 - orient="auto" 356 - refY="0" 357 - refX="0" 358 - id="EmptyTriangleOutL-78-2" 359 - style="overflow:visible" 360 - inkscape:isstock="true"> 361 - <path 362 - inkscape:connector-curvature="0" 363 - id="path4502-57-6" 364 - d="M 5.77,0 -2.88,5 V -5 Z" 365 - style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 366 - transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 367 - </marker> 368 - <marker 369 - inkscape:stockid="Arrow2Mend" 370 - orient="auto" 371 - refY="0" 372 - refX="0" 373 - id="Arrow2Mend-3" 374 - style="overflow:visible" 375 - inkscape:isstock="true"> 376 - <path 377 - inkscape:connector-curvature="0" 378 - id="path4369-6" 379 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 380 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 381 - transform="scale(-0.6)" /> 382 - </marker> 383 - <marker 384 - inkscape:stockid="Arrow2Mend" 385 - orient="auto" 386 - refY="0" 387 - refX="0" 388 - id="Arrow2Mend-3-5" 389 - style="overflow:visible" 390 - inkscape:isstock="true"> 391 - <path 392 - inkscape:connector-curvature="0" 393 - id="path4369-6-3" 394 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 395 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 396 - transform="scale(-0.6)" /> 397 - </marker> 398 - <marker 399 - inkscape:stockid="Arrow2Mend" 400 - orient="auto" 401 - refY="0" 402 - refX="0" 403 - id="Arrow2Mend-3-5-6" 404 - style="overflow:visible" 405 - inkscape:isstock="true"> 406 - <path 407 - inkscape:connector-curvature="0" 408 - id="path4369-6-3-2" 409 - 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 410 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 411 - transform="scale(-0.6)" /> 412 - </marker> 413 - <marker 414 - inkscape:stockid="Arrow2Mend" 415 - orient="auto" 416 - refY="0" 417 - refX="0" 418 - id="Arrow2Mend-3-5-6-1" 419 - style="overflow:visible" 420 - inkscape:isstock="true"> 421 - <path 422 - inkscape:connector-curvature="0" 423 - id="path4369-6-3-2-2" 424 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 425 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 426 - transform="scale(-0.6)" /> 427 - </marker> 428 - <marker 429 - inkscape:stockid="Arrow2Mend" 430 - orient="auto" 431 - refY="0" 432 - refX="0" 433 - id="Arrow2Mend-3-5-7" 434 - style="overflow:visible" 435 - inkscape:isstock="true"> 436 - <path 437 - inkscape:connector-curvature="0" 438 - id="path4369-6-3-0" 439 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 440 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 441 - transform="scale(-0.6)" /> 442 - </marker> 443 - <marker 444 - inkscape:stockid="Arrow2Mend" 445 - orient="auto" 446 - refY="0" 447 - refX="0" 448 - id="Arrow2Mend-3-9" 449 - style="overflow:visible" 450 - inkscape:isstock="true"> 451 - <path 452 - inkscape:connector-curvature="0" 453 - id="path4369-6-36" 454 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 455 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 
-1.7354408,5.6174519 -6e-7,8.035443 z" 456 - transform="scale(-0.6)" /> 457 - </marker> 458 - <marker 459 - inkscape:stockid="Arrow2Mend" 460 - orient="auto" 461 - refY="0" 462 - refX="0" 463 - id="Arrow2Mend-31" 464 - style="overflow:visible" 465 - inkscape:isstock="true"> 466 - <path 467 - inkscape:connector-curvature="0" 468 - id="path4369-9" 469 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 470 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 471 - transform="scale(-0.6)" /> 472 - </marker> 473 - <marker 474 - inkscape:stockid="Arrow2Mend" 475 - orient="auto" 476 - refY="0" 477 - refX="0" 478 - id="Arrow2Mend-3-5-7-5" 479 - style="overflow:visible" 480 - inkscape:isstock="true"> 481 - <path 482 - inkscape:connector-curvature="0" 483 - id="path4369-6-3-0-4" 484 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 485 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 486 - transform="scale(-0.6)" /> 487 - </marker> 488 - <marker 489 - inkscape:stockid="Arrow2Mend" 490 - orient="auto" 491 - refY="0" 492 - refX="0" 493 - id="Arrow2Mend-3-9-6" 494 - style="overflow:visible" 495 - inkscape:isstock="true"> 496 - <path 497 - inkscape:connector-curvature="0" 498 - id="path4369-6-36-5" 499 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 500 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 501 - transform="scale(-0.6)" /> 502 - </marker> 503 - <marker 504 - inkscape:stockid="Arrow2Mend" 505 - orient="auto" 506 - refY="0" 507 - refX="0" 508 - id="Arrow2Mend-3-5-7-3" 509 - 
style="overflow:visible" 510 - inkscape:isstock="true"> 511 - <path 512 - inkscape:connector-curvature="0" 513 - id="path4369-6-3-0-5" 514 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 515 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 516 - transform="scale(-0.6)" /> 517 - </marker> 518 - <marker 519 - inkscape:stockid="Arrow2Mend" 520 - orient="auto" 521 - refY="0" 522 - refX="0" 523 - id="marker9095-3" 524 - style="overflow:visible" 525 - inkscape:isstock="true"> 526 - <path 527 - inkscape:connector-curvature="0" 528 - id="path9097-1" 529 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 530 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 531 - transform="scale(-0.6)" /> 532 - </marker> 533 - <marker 534 - inkscape:stockid="Arrow2Mend" 535 - orient="auto" 536 - refY="0" 537 - refX="0" 538 - id="marker9095-3-7" 539 - style="overflow:visible" 540 - inkscape:isstock="true"> 541 - <path 542 - inkscape:connector-curvature="0" 543 - id="path9097-1-8" 544 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 545 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 546 - transform="scale(-0.6)" /> 547 - </marker> 548 - <marker 549 - inkscape:stockid="Arrow2Mend" 550 - orient="auto" 551 - refY="0" 552 - refX="0" 553 - id="marker9095-1-5" 554 - style="overflow:visible" 555 - inkscape:isstock="true"> 556 - <path 557 - inkscape:connector-curvature="0" 558 - id="path9097-2-0" 559 - 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 560 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 561 - transform="scale(-0.6)" /> 562 - </marker> 563 - <marker 564 - inkscape:stockid="Arrow2Mend" 565 - orient="auto" 566 - refY="0" 567 - refX="0" 568 - id="marker9095-1-5-6" 569 - style="overflow:visible" 570 - inkscape:isstock="true"> 571 - <path 572 - inkscape:connector-curvature="0" 573 - id="path9097-2-0-1" 574 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 575 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 576 - transform="scale(-0.6)" /> 577 - </marker> 578 - <marker 579 - inkscape:stockid="Arrow2Mend" 580 - orient="auto" 581 - refY="0" 582 - refX="0" 583 - id="marker9095-1-5-6-6" 584 - style="overflow:visible" 585 - inkscape:isstock="true"> 586 - <path 587 - inkscape:connector-curvature="0" 588 - id="path9097-2-0-1-3" 589 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 590 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 591 - transform="scale(-0.6)" /> 592 - </marker> 593 - <marker 594 - inkscape:stockid="Arrow2Mend" 595 - orient="auto" 596 - refY="0" 597 - refX="0" 598 - id="marker9095-1-5-2" 599 - style="overflow:visible" 600 - inkscape:isstock="true"> 601 - <path 602 - inkscape:connector-curvature="0" 603 - id="path9097-2-0-0" 604 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 605 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 
-1.7354408,5.6174519 -6e-7,8.035443 z" 606 - transform="scale(-0.6)" /> 607 - </marker> 608 - <marker 609 - inkscape:stockid="Arrow2Mend" 610 - orient="auto" 611 - refY="0" 612 - refX="0" 613 - id="marker9095-1-5-6-5" 614 - style="overflow:visible" 615 - inkscape:isstock="true"> 616 - <path 617 - inkscape:connector-curvature="0" 618 - id="path9097-2-0-1-5" 619 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 620 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 621 - transform="scale(-0.6)" /> 622 - </marker> 623 - <marker 624 - inkscape:stockid="Arrow2Mend" 625 - orient="auto" 626 - refY="0" 627 - refX="0" 628 - id="marker9095-1-5-4" 629 - style="overflow:visible" 630 - inkscape:isstock="true"> 631 - <path 632 - inkscape:connector-curvature="0" 633 - id="path9097-2-0-7" 634 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 635 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 636 - transform="scale(-0.6)" /> 637 - </marker> 638 - <marker 639 - inkscape:stockid="Arrow2Mend" 640 - orient="auto" 641 - refY="0" 642 - refX="0" 643 - id="marker9095-1-5-6-5-9" 644 - style="overflow:visible" 645 - inkscape:isstock="true"> 646 - <path 647 - inkscape:connector-curvature="0" 648 - id="path9097-2-0-1-5-3" 649 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 650 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 651 - transform="scale(-0.6)" /> 652 - </marker> 653 - <marker 654 - inkscape:stockid="Arrow2Mend" 655 - orient="auto" 656 - refY="0" 657 - refX="0" 658 - id="marker9095-1-5-6-5-4" 659 - 
style="overflow:visible" 660 - inkscape:isstock="true"> 661 - <path 662 - inkscape:connector-curvature="0" 663 - id="path9097-2-0-1-5-5" 664 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 665 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 666 - transform="scale(-0.6)" /> 667 - </marker> 668 - <marker 669 - inkscape:stockid="Arrow2Mend" 670 - orient="auto" 671 - refY="0" 672 - refX="0" 673 - id="marker9095-1-5-6-5-4-4" 674 - style="overflow:visible" 675 - inkscape:isstock="true"> 676 - <path 677 - inkscape:connector-curvature="0" 678 - id="path9097-2-0-1-5-5-3" 679 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 680 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 681 - transform="scale(-0.6)" /> 682 - </marker> 683 - <marker 684 - inkscape:stockid="Arrow2Mend" 685 - orient="auto" 686 - refY="0" 687 - refX="0" 688 - id="marker9095-1-5-6-5-4-4-9" 689 - style="overflow:visible" 690 - inkscape:isstock="true"> 691 - <path 692 - inkscape:connector-curvature="0" 693 - id="path9097-2-0-1-5-5-3-2" 694 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 695 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 696 - transform="scale(-0.6)" /> 697 - </marker> 698 - <marker 699 - inkscape:stockid="Arrow2Mend" 700 - orient="auto" 701 - refY="0" 702 - refX="0" 703 - id="marker9095-1-5-6-5-4-4-6" 704 - style="overflow:visible" 705 - inkscape:isstock="true"> 706 - <path 707 - inkscape:connector-curvature="0" 708 - id="path9097-2-0-1-5-5-3-8" 709 - 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 710 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 711 - transform="scale(-0.6)" /> 712 - </marker> 713 - <marker 714 - inkscape:stockid="Arrow2Mend" 715 - orient="auto" 716 - refY="0" 717 - refX="0" 718 - id="marker9095-1-5-6-5-4-4-2" 719 - style="overflow:visible" 720 - inkscape:isstock="true"> 721 - <path 722 - inkscape:connector-curvature="0" 723 - id="path9097-2-0-1-5-5-3-7" 724 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 725 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 726 - transform="scale(-0.6)" /> 727 - </marker> 728 - <marker 729 - inkscape:stockid="Arrow2Mend" 730 - orient="auto" 731 - refY="0" 732 - refX="0" 733 - id="marker9095-1-5-3" 734 - style="overflow:visible" 735 - inkscape:isstock="true"> 736 - <path 737 - inkscape:connector-curvature="0" 738 - id="path9097-2-0-6" 739 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 740 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 741 - transform="scale(-0.6)" /> 742 - </marker> 743 - <marker 744 - inkscape:stockid="Arrow2Mend" 745 - orient="auto" 746 - refY="0" 747 - refX="0" 748 - id="marker9095-1-5-6-5-4-4-6-5" 749 - style="overflow:visible" 750 - inkscape:isstock="true"> 751 - <path 752 - inkscape:connector-curvature="0" 753 - id="path9097-2-0-1-5-5-3-8-3" 754 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 755 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c 
-1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 756 - transform="scale(-0.6)" /> 757 - </marker> 758 - <marker 759 - inkscape:stockid="Arrow2Mend" 760 - orient="auto" 761 - refY="0" 762 - refX="0" 763 - id="marker9095-1-5-27" 764 - style="overflow:visible" 765 - inkscape:isstock="true"> 766 - <path 767 - inkscape:connector-curvature="0" 768 - id="path9097-2-0-09" 769 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 770 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 771 - transform="scale(-0.6)" /> 772 - </marker> 773 - <marker 774 - inkscape:stockid="Arrow2Mend" 775 - orient="auto" 776 - refY="0" 777 - refX="0" 778 - id="marker9095-1-5-27-6" 779 - style="overflow:visible" 780 - inkscape:isstock="true"> 781 - <path 782 - inkscape:connector-curvature="0" 783 - id="path9097-2-0-09-2" 784 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 785 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 786 - transform="scale(-0.6)" /> 787 - </marker> 788 - <marker 789 - inkscape:stockid="Arrow2Mend" 790 - orient="auto" 791 - refY="0" 792 - refX="0" 793 - id="marker9095-1-5-27-0" 794 - style="overflow:visible" 795 - inkscape:isstock="true"> 796 - <path 797 - inkscape:connector-curvature="0" 798 - id="path9097-2-0-09-23" 799 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 800 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 801 - transform="scale(-0.6)" /> 802 - </marker> 803 - <marker 804 - inkscape:stockid="Arrow2Mend" 805 - orient="auto" 806 - refY="0" 807 - refX="0" 808 - 
id="marker9095-1-9" 809 - style="overflow:visible" 810 - inkscape:isstock="true"> 811 - <path 812 - inkscape:connector-curvature="0" 813 - id="path9097-2-7" 814 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 815 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 816 - transform="scale(-0.6)" /> 817 - </marker> 818 - <marker 819 - inkscape:stockid="Arrow2Mend" 820 - orient="auto" 821 - refY="0" 822 - refX="0" 823 - id="marker9095-1-5-4-6" 824 - style="overflow:visible" 825 - inkscape:isstock="true"> 826 - <path 827 - inkscape:connector-curvature="0" 828 - id="path9097-2-0-7-7" 829 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 830 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 831 - transform="scale(-0.6)" /> 832 - </marker> 833 - <marker 834 - inkscape:stockid="Arrow2Mend" 835 - orient="auto" 836 - refY="0" 837 - refX="0" 838 - id="marker9095-1-5-4-6-3" 839 - style="overflow:visible" 840 - inkscape:isstock="true"> 841 - <path 842 - inkscape:connector-curvature="0" 843 - id="path9097-2-0-7-7-5" 844 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 845 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 846 - transform="scale(-0.6)" /> 847 - </marker> 848 - <marker 849 - inkscape:stockid="Arrow2Mend" 850 - orient="auto" 851 - refY="0" 852 - refX="0" 853 - id="marker9095-1-5-4-6-2" 854 - style="overflow:visible" 855 - inkscape:isstock="true"> 856 - <path 857 - inkscape:connector-curvature="0" 858 - id="path9097-2-0-7-7-9" 859 - 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 860 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 861 - transform="scale(-0.6)" /> 862 - </marker> 863 - <marker 864 - inkscape:stockid="TriangleOutM" 865 - orient="auto" 866 - refY="0" 867 - refX="0" 868 - id="TriangleOutM-2" 869 - style="overflow:visible" 870 - inkscape:isstock="true"> 871 - <path 872 - inkscape:connector-curvature="0" 873 - id="path4107-7" 874 - d="M 5.77,0 -2.88,5 V -5 Z" 875 - style="fill:#008000;fill-opacity:1;fill-rule:evenodd;stroke:#008000;stroke-width:1.00000003pt;stroke-opacity:1" 876 - transform="scale(0.4)" /> 877 - </marker> 878 - <marker 879 - inkscape:stockid="Arrow2Mstart" 880 - orient="auto" 881 - refY="0" 882 - refX="0" 883 - id="Arrow2Mstart-9" 884 - style="overflow:visible" 885 - inkscape:isstock="true"> 886 - <path 887 - inkscape:connector-curvature="0" 888 - id="path9283-3" 889 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 890 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 891 - transform="scale(0.6)" /> 892 - </marker> 893 - <marker 894 - inkscape:stockid="Arrow2Mend" 895 - orient="auto" 896 - refY="0" 897 - refX="0" 898 - id="marker9095-1-5-4-6-6" 899 - style="overflow:visible" 900 - inkscape:isstock="true"> 901 - <path 902 - inkscape:connector-curvature="0" 903 - id="path9097-2-0-7-7-0" 904 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 905 - d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 906 - transform="scale(-0.6)" /> 907 - </marker> 908 - <marker 909 - inkscape:stockid="DotM" 910 - 
orient="auto" 911 - refY="0" 912 - refX="0" 913 - id="DotM-3" 914 - style="overflow:visible" 915 - inkscape:isstock="true"> 916 - <path 917 - inkscape:connector-curvature="0" 918 - id="path1795-7" 919 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 920 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 921 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 922 - </marker> 923 - <marker 924 - inkscape:isstock="true" 925 - style="overflow:visible" 926 - id="marker2713-9" 927 - refX="0" 928 - refY="0" 929 - orient="auto" 930 - inkscape:stockid="DotM"> 931 - <path 932 - inkscape:connector-curvature="0" 933 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 934 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 935 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 936 - id="path2711-2" /> 937 - </marker> 938 - <marker 939 - inkscape:stockid="DotM" 940 - orient="auto" 941 - refY="0" 942 - refX="0" 943 - id="DotM-2" 944 - style="overflow:visible" 945 - inkscape:isstock="true" 946 - inkscape:collect="always"> 947 - <path 948 - inkscape:connector-curvature="0" 949 - id="path1795-8" 950 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 951 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 952 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 953 - </marker> 954 - <marker 955 - inkscape:isstock="true" 956 - style="overflow:visible" 957 - id="marker2713-3" 958 - refX="0" 959 - refY="0" 960 - orient="auto" 961 - inkscape:stockid="DotM"> 962 - <path 963 - inkscape:connector-curvature="0" 964 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 965 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 966 
- d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 967 - id="path2711-6" /> 968 - </marker> 969 - <marker 970 - inkscape:stockid="DotM" 971 - orient="auto" 972 - refY="0" 973 - refX="0" 974 - id="DotM-1" 975 - style="overflow:visible" 976 - inkscape:isstock="true" 977 - inkscape:collect="always"> 978 - <path 979 - inkscape:connector-curvature="0" 980 - id="path1795-2" 981 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 982 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 983 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 984 - </marker> 985 - <marker 986 - inkscape:isstock="true" 987 - style="overflow:visible" 988 - id="marker2713-94" 989 - refX="0" 990 - refY="0" 991 - orient="auto" 992 - inkscape:stockid="DotM"> 993 - <path 994 - inkscape:connector-curvature="0" 995 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 996 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 997 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 998 - id="path2711-7" /> 999 - </marker> 1000 - <marker 1001 - inkscape:stockid="DotM" 1002 - orient="auto" 1003 - refY="0" 1004 - refX="0" 1005 - id="DotM-8" 1006 - style="overflow:visible" 1007 - inkscape:isstock="true" 1008 - inkscape:collect="always"> 1009 - <path 1010 - inkscape:connector-curvature="0" 1011 - id="path1795-4" 1012 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 1013 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 1014 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 1015 - </marker> 1016 - <marker 1017 - inkscape:isstock="true" 1018 - style="overflow:visible" 1019 - id="marker2713-36" 1020 - refX="0" 1021 - refY="0" 1022 - orient="auto" 
1023 - inkscape:stockid="DotM"> 1024 - <path 1025 - inkscape:connector-curvature="0" 1026 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 1027 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 1028 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 1029 - id="path2711-1" /> 1030 - </marker> 1031 - <marker 1032 - inkscape:isstock="true" 1033 - style="overflow:visible" 1034 - id="marker6165-5" 1035 - refX="0" 1036 - refY="0" 1037 - orient="auto" 1038 - inkscape:stockid="DotM"> 1039 - <path 1040 - transform="matrix(0.4,0,0,0.4,2.96,0.4)" 1041 - style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 1042 - d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 1043 - id="path6163-5" 1044 - inkscape:connector-curvature="0" /> 1045 - </marker> 1046 - </defs> 1047 - <sodipodi:namedview 1048 - showguides="true" 1049 - inkscape:window-maximized="1" 1050 - inkscape:window-y="0" 1051 - inkscape:window-x="0" 1052 - inkscape:window-height="1015" 1053 - inkscape:window-width="1920" 1054 - showgrid="true" 1055 - inkscape:current-layer="layer1" 1056 - inkscape:document-units="px" 1057 - inkscape:cy="214.66765" 1058 - inkscape:cx="-167.56857" 1059 - inkscape:zoom="0.70710678" 1060 - inkscape:pageshadow="2" 1061 - inkscape:pageopacity="0.0" 1062 - borderopacity="1.0" 1063 - bordercolor="#666666" 1064 - pagecolor="#ffffff" 1065 - id="base" 1066 - inkscape:snap-to-guides="true" 1067 - inkscape:snap-grids="true" 1068 - inkscape:snap-bbox="false" 1069 - inkscape:object-nodes="true" 1070 - fit-margin-top="5" 1071 - fit-margin-left="5" 1072 - fit-margin-right="5" 1073 - fit-margin-bottom="5"> 1074 - <inkscape:grid 1075 - type="xygrid" 1076 - id="grid4451" 1077 - originx="-93.377219" 1078 - originy="-347.2523" /> 1079 - </sodipodi:namedview> 1080 - <metadata 1081 - id="metadata7"> 
1082 - <rdf:RDF> 1083 - <cc:Work 1084 - rdf:about=""> 1085 - <dc:format>image/svg+xml</dc:format> 1086 - <dc:type 1087 - rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> 1088 - <dc:title></dc:title> 1089 - <dc:creator> 1090 - <cc:Agent> 1091 - <dc:title>Luca Ceresoli</dc:title> 1092 - </cc:Agent> 1093 - </dc:creator> 1094 - <dc:date>2020</dc:date> 1095 - <cc:license 1096 - rdf:resource="http://creativecommons.org/licenses/by-sa/4.0/" /> 1097 - </cc:Work> 1098 - <cc:License 1099 - rdf:about="http://creativecommons.org/licenses/by-sa/4.0/"> 1100 - <cc:permits 1101 - rdf:resource="http://creativecommons.org/ns#Reproduction" /> 1102 - <cc:permits 1103 - rdf:resource="http://creativecommons.org/ns#Distribution" /> 1104 - <cc:requires 1105 - rdf:resource="http://creativecommons.org/ns#Notice" /> 1106 - <cc:requires 1107 - rdf:resource="http://creativecommons.org/ns#Attribution" /> 1108 - <cc:permits 1109 - rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /> 1110 - <cc:requires 1111 - rdf:resource="http://creativecommons.org/ns#ShareAlike" /> 1112 - </cc:License> 1113 - </rdf:RDF> 1114 - </metadata> 1115 - <g 1116 - inkscape:label="Livello 1" 1117 - inkscape:groupmode="layer" 1118 - id="layer1" 1119 - transform="translate(-93.377215,-444.09395)"> 1120 - <rect 1121 - style="opacity:1;fill:#ffb9b9;fill-opacity:1;stroke:#f00000;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1122 - id="rect4424-3-2-9-7" 1123 - width="112.5" 1124 - height="113.75008" 1125 - x="112.5" 1126 - y="471.11221" 1127 - rx="0" 1128 - ry="0" /> 1129 - <text 1130 - xml:space="preserve" 1131 - style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1132 - x="167.5354" 1133 - y="521.46259" 1134 - id="text4349"><tspan 1135 - 
sodipodi:role="line" 1136 - x="167.5354" 1137 - y="521.46259" 1138 - style="font-size:25px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle" 1139 - id="tspan1273">I2C</tspan><tspan 1140 - sodipodi:role="line" 1141 - x="167.5354" 1142 - y="552.71259" 1143 - style="font-size:25px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle" 1144 - id="tspan1285">Master</tspan></text> 1145 - <rect 1146 - style="color:#000000;clip-rule:nonzero;display:inline;overflow:visible;visibility:visible;opacity:1;isolation:auto;mix-blend-mode:normal;color-interpolation:sRGB;color-interpolation-filters:linearRGB;solid-color:#000000;solid-opacity:1;fill:#b9ffb9;fill-opacity:1;fill-rule:nonzero;stroke:#006400;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;color-rendering:auto;image-rendering:auto;shape-rendering:auto;text-rendering:auto;enable-background:accumulate" 1147 - id="rect4424-3-2-9-7-3-3-5-3" 1148 - width="112.49998" 1149 - height="112.50001" 1150 - x="262.5" 1151 - y="471.11218" 1152 - rx="0" 1153 - ry="0" /> 1154 - <path 1155 - style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968767;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1156 - d="m 112.50002,639.86223 c 712.50002,0 712.50002,0 712.50002,0" 1157 - id="path4655-9-3-65-5-6" 1158 - inkscape:connector-curvature="0" /> 1159 - <text 1160 - xml:space="preserve" 1161 - style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1162 - x="318.59131" 1163 - y="520.83752" 1164 - id="text4349-26"><tspan 1165 - sodipodi:role="line" 1166 - x="318.59131" 1167 - y="520.83752" 1168 - 
style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1169 - id="tspan1273-8">I2C</tspan><tspan 1170 - sodipodi:role="line" 1171 - x="318.59131" 1172 - y="552.08752" 1173 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1174 - id="tspan1287">Slave</tspan></text> 1175 - <path 1176 - style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968767;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1177 - d="m 112.49995,677.36223 c 712.50005,0 712.50005,0 712.50005,0" 1178 - id="path4655-9-3-65-5-6-2" 1179 - inkscape:connector-curvature="0" /> 1180 - <text 1181 - xml:space="preserve" 1182 - style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1183 - x="861.07312" 1184 - y="687.03937" 1185 - id="text4349-7"><tspan 1186 - sodipodi:role="line" 1187 - x="861.07312" 1188 - y="687.03937" 1189 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;writing-mode:lr-tb;direction:ltr;text-anchor:middle;stroke-width:1px" 1190 - id="tspan1285-9">SCL</tspan></text> 1191 - <flowRoot 1192 - xml:space="preserve" 1193 - id="flowRoot1627" 1194 - style="font-style:normal;font-weight:normal;font-size:40px;line-height:125%;font-family:Sans;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"><flowRegion 1195 - id="flowRegion1629"><rect 1196 - id="rect1631" 1197 - width="220" 1198 - height="120" 1199 - x="140" 1200 - y="-126.29921" /></flowRegion><flowPara 1201 - id="flowPara1633" /></flowRoot> <text 1202 - xml:space="preserve" 1203 - 
style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1204 - x="863.31921" 1205 - y="648.96735" 1206 - id="text4349-7-3"><tspan 1207 - sodipodi:role="line" 1208 - x="863.31921" 1209 - y="648.96735" 1210 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1211 - id="tspan1285-9-6">SDA</tspan></text> 1212 - <rect 1213 - style="color:#000000;clip-rule:nonzero;display:inline;overflow:visible;visibility:visible;opacity:1;isolation:auto;mix-blend-mode:normal;color-interpolation:sRGB;color-interpolation-filters:linearRGB;solid-color:#000000;solid-opacity:1;vector-effect:none;fill:#b9ffb9;fill-opacity:1;fill-rule:nonzero;stroke:#006400;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;color-rendering:auto;image-rendering:auto;shape-rendering:auto;text-rendering:auto;enable-background:accumulate" 1214 - id="rect4424-3-2-9-7-3-3-5-3-0" 1215 - width="112.49998" 1216 - height="112.50002" 1217 - x="412.5" 1218 - y="471.11215" 1219 - rx="0" 1220 - ry="0" /> 1221 - <text 1222 - xml:space="preserve" 1223 - style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1224 - x="468.59131" 1225 - y="520.83746" 1226 - id="text4349-26-6"><tspan 1227 - sodipodi:role="line" 1228 - x="468.59131" 1229 - y="520.83746" 1230 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1231 - id="tspan1273-8-2">I2C</tspan><tspan 1232 - sodipodi:role="line" 1233 - x="468.59131" 1234 - 
y="552.08746" 1235 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1236 - id="tspan1287-6">Slave</tspan></text> 1237 - <rect 1238 - style="color:#000000;clip-rule:nonzero;display:inline;overflow:visible;visibility:visible;opacity:1;isolation:auto;mix-blend-mode:normal;color-interpolation:sRGB;color-interpolation-filters:linearRGB;solid-color:#000000;solid-opacity:1;vector-effect:none;fill:#b9ffb9;fill-opacity:1;fill-rule:nonzero;stroke:#006400;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;color-rendering:auto;image-rendering:auto;shape-rendering:auto;text-rendering:auto;enable-background:accumulate" 1239 - id="rect4424-3-2-9-7-3-3-5-3-1" 1240 - width="112.49998" 1241 - height="112.50002" 1242 - x="562.5" 1243 - y="471.11215" 1244 - rx="0" 1245 - ry="0" /> 1246 - <text 1247 - xml:space="preserve" 1248 - style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1249 - x="618.59131" 1250 - y="520.83746" 1251 - id="text4349-26-8"><tspan 1252 - sodipodi:role="line" 1253 - x="618.59131" 1254 - y="520.83746" 1255 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1256 - id="tspan1273-8-7">I2C</tspan><tspan 1257 - sodipodi:role="line" 1258 - x="618.59131" 1259 - y="552.08746" 1260 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1261 - id="tspan1287-9">Slave</tspan></text> 1262 - <path 1263 - 
style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM)" 1264 - d="m 150,583.61221 v 93.75" 1265 - id="path4655-9-3-65-5-6-20" 1266 - inkscape:connector-curvature="0" 1267 - sodipodi:nodetypes="cc" /> 1268 - <path 1269 - style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713)" 1270 - d="m 187.5,583.61221 v 56.25" 1271 - id="path4655-9-3-65-5-6-20-2" 1272 - inkscape:connector-curvature="0" 1273 - sodipodi:nodetypes="cc" /> 1274 - <path 1275 - style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM-2)" 1276 - d="m 300,583.61221 v 93.75" 1277 - id="path4655-9-3-65-5-6-20-9" 1278 - inkscape:connector-curvature="0" 1279 - sodipodi:nodetypes="cc" /> 1280 - <path 1281 - style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713-9)" 1282 - d="m 337.5,583.61221 v 56.25" 1283 - id="path4655-9-3-65-5-6-20-2-7" 1284 - inkscape:connector-curvature="0" 1285 - sodipodi:nodetypes="cc" /> 1286 - <path 1287 - style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM-1)" 1288 - d="m 450,583.61221 v 93.75" 1289 - id="path4655-9-3-65-5-6-20-93" 1290 - inkscape:connector-curvature="0" 1291 - sodipodi:nodetypes="cc" /> 1292 - <path 1293 - 
style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713-3)" 1294 - d="m 487.5,583.61221 v 56.25" 1295 - id="path4655-9-3-65-5-6-20-2-1" 1296 - inkscape:connector-curvature="0" 1297 - sodipodi:nodetypes="cc" /> 1298 - <path 1299 - style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM-8)" 1300 - d="m 600,583.61221 v 93.75" 1301 - id="path4655-9-3-65-5-6-20-5" 1302 - inkscape:connector-curvature="0" 1303 - sodipodi:nodetypes="cc" /> 1304 - <path 1305 - style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713-94)" 1306 - d="m 637.5,583.61221 v 56.25" 1307 - id="path4655-9-3-65-5-6-20-2-0" 1308 - inkscape:connector-curvature="0" 1309 - sodipodi:nodetypes="cc" /> 1310 - <path 1311 - style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-start:url(#marker6165);marker-end:url(#marker6165)" 1312 - d="m 750,471.11221 v 28.125 l 9.375,9.375 -18.74999,18.75 18.74999,18.75 -18.74999,18.75 18.74999,18.75 -9.375,9.375 v 28.125 0 0 56.25" 1313 - id="path6135" 1314 - inkscape:connector-curvature="0" 1315 - sodipodi:nodetypes="cccccccccccc" /> 1316 - <path 1317 - 
style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-start:url(#marker8861);marker-end:url(#marker6165-5)" 1318 - d="m 787.49999,471.11221 v 28.125 l 9.375,9.375 -18.74999,18.75 18.74999,18.75 -18.74999,18.75 18.74999,18.75 -9.375,9.375 v 28.125 0 0 18.75001" 1319 - id="path6135-4" 1320 - inkscape:connector-curvature="0" 1321 - sodipodi:nodetypes="cccccccccccc" /> 1322 - <path 1323 - style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968719;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1324 - d="m 712.5,471.11221 c 112.49999,0 112.49999,0 112.49999,0" 1325 - id="path4655-9-3-65-5-6-7" 1326 - inkscape:connector-curvature="0" /> 1327 - <text 1328 - xml:space="preserve" 1329 - style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1330 - x="859.94275" 1331 - y="480.03558" 1332 - id="text4349-7-3-6"><tspan 1333 - sodipodi:role="line" 1334 - x="859.94275" 1335 - y="480.03558" 1336 - style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1337 - id="tspan1285-9-6-5">V<tspan 1338 - style="font-size:64.99999762%;baseline-shift:sub" 1339 - id="tspan9307">DD</tspan></tspan></text> 1340 - </g> 1341 - </svg>

+1341

Documentation/i2c/i2c_bus.svg

··· 1 + <?xml version="1.0" encoding="UTF-8" standalone="no"?> 2 +  3 + 4 + <svg 5 + xmlns:dc="http://purl.org/dc/elements/1.1/" 6 + xmlns:cc="http://creativecommons.org/ns#" 7 + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 8 + xmlns:svg="http://www.w3.org/2000/svg" 9 + xmlns="http://www.w3.org/2000/svg" 10 + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" 11 + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" 12 + sodipodi:docname="i2c_bus.svg" 13 + inkscape:version="0.92.3 (2405546, 2018-03-11)" 14 + version="1.1" 15 + id="svg2" 16 + viewBox="0 0 813.34215 261.01596" 17 + height="73.664505mm" 18 + width="229.54323mm"> 19 + <defs 20 + id="defs4"> 21 + <marker 22 + inkscape:stockid="DotM" 23 + orient="auto" 24 + refY="0" 25 + refX="0" 26 + id="marker8861" 27 + style="overflow:visible" 28 + inkscape:isstock="true"> 29 + <path 30 + inkscape:connector-curvature="0" 31 + id="path8859" 32 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 33 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 34 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 35 + </marker> 36 + <marker 37 + inkscape:isstock="true" 38 + style="overflow:visible" 39 + id="marker6165" 40 + refX="0" 41 + refY="0" 42 + orient="auto" 43 + inkscape:stockid="DotM" 44 + inkscape:collect="always"> 45 + <path 46 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 47 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 48 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 49 + id="path6163" 50 + inkscape:connector-curvature="0" /> 51 + </marker> 52 + <marker 53 + inkscape:isstock="true" 54 + style="overflow:visible" 55 + id="marker2713" 56 + refX="0" 57 + refY="0" 58 + orient="auto" 59 + inkscape:stockid="DotM"> 60 + <path 61 + 
transform="matrix(0.4,0,0,0.4,2.96,0.4)" 62 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 63 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 64 + id="path2711" 65 + inkscape:connector-curvature="0" /> 66 + </marker> 67 + <marker 68 + inkscape:stockid="DotM" 69 + orient="auto" 70 + refY="0" 71 + refX="0" 72 + id="DotM" 73 + style="overflow:visible" 74 + inkscape:isstock="true" 75 + inkscape:collect="always"> 76 + <path 77 + id="path1795" 78 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 79 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 80 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 81 + inkscape:connector-curvature="0" /> 82 + </marker> 83 + <marker 84 + inkscape:isstock="true" 85 + style="overflow:visible" 86 + id="marker6389" 87 + refX="0" 88 + refY="0" 89 + orient="auto" 90 + inkscape:stockid="Arrow2Mend"> 91 + <path 92 + transform="scale(-0.6)" 93 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 94 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 95 + id="path6387" 96 + inkscape:connector-curvature="0" /> 97 + </marker> 98 + <marker 99 + inkscape:stockid="TriangleOutM" 100 + orient="auto" 101 + refY="0" 102 + refX="0" 103 + id="TriangleOutM" 104 + style="overflow:visible" 105 + inkscape:isstock="true"> 106 + <path 107 + id="path4107" 108 + d="M 5.77,0 -2.88,5 V -5 Z" 109 + style="fill:#008000;fill-opacity:1;fill-rule:evenodd;stroke:#008000;stroke-width:1.00000003pt;stroke-opacity:1" 110 + transform="scale(0.4)" 111 + inkscape:connector-curvature="0" /> 112 + </marker> 113 + <marker 114 + inkscape:stockid="TriangleOutL" 115 + orient="auto" 116 + refY="0" 117 + 
refX="0" 118 + id="marker5333" 119 + style="overflow:visible" 120 + inkscape:isstock="true"> 121 + <path 122 + id="path5335" 123 + d="M 5.77,0 -2.88,5 V -5 Z" 124 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 125 + transform="scale(0.8)" 126 + inkscape:connector-curvature="0" /> 127 + </marker> 128 + <marker 129 + inkscape:stockid="TriangleOutL" 130 + orient="auto" 131 + refY="0" 132 + refX="0" 133 + id="TriangleOutL" 134 + style="overflow:visible" 135 + inkscape:isstock="true"> 136 + <path 137 + id="path5049" 138 + d="M 5.77,0 -2.88,5 V -5 Z" 139 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 140 + transform="scale(0.8)" 141 + inkscape:connector-curvature="0" /> 142 + </marker> 143 + <marker 144 + inkscape:stockid="DotS" 145 + orient="auto" 146 + refY="0" 147 + refX="0" 148 + id="DotS" 149 + style="overflow:visible" 150 + inkscape:isstock="true"> 151 + <path 152 + id="path9326" 153 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 154 + style="fill:#404040;fill-opacity:1;fill-rule:evenodd;stroke:#404040;stroke-width:1.00000003pt;stroke-opacity:1" 155 + transform="matrix(0.2,0,0,0.2,1.48,0.2)" 156 + inkscape:connector-curvature="0" /> 157 + </marker> 158 + <marker 159 + inkscape:stockid="Arrow2Mstart" 160 + orient="auto" 161 + refY="0" 162 + refX="0" 163 + id="Arrow2Mstart" 164 + style="overflow:visible" 165 + inkscape:isstock="true"> 166 + <path 167 + id="path9283" 168 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 169 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 170 + transform="scale(0.6)" 171 + inkscape:connector-curvature="0" /> 172 + </marker> 173 + <marker 174 + inkscape:stockid="Arrow2Mend" 175 + 
orient="auto" 176 + refY="0" 177 + refX="0" 178 + id="marker9095" 179 + style="overflow:visible" 180 + inkscape:isstock="true"> 181 + <path 182 + id="path9097" 183 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 184 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 185 + transform="scale(-0.6)" 186 + inkscape:connector-curvature="0" /> 187 + </marker> 188 + <marker 189 + inkscape:stockid="Arrow2Mend" 190 + orient="auto" 191 + refY="0" 192 + refX="0" 193 + id="marker8935" 194 + style="overflow:visible" 195 + inkscape:isstock="true"> 196 + <path 197 + id="path8937" 198 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 199 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 200 + transform="scale(-0.6)" 201 + inkscape:connector-curvature="0" /> 202 + </marker> 203 + <marker 204 + inkscape:stockid="Arrow2Mend" 205 + orient="auto" 206 + refY="0" 207 + refX="0" 208 + id="marker8781" 209 + style="overflow:visible" 210 + inkscape:isstock="true"> 211 + <path 212 + id="path8783" 213 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 214 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 215 + transform="scale(-0.6)" 216 + inkscape:connector-curvature="0" /> 217 + </marker> 218 + <marker 219 + inkscape:stockid="Arrow2Lend" 220 + orient="auto" 221 + refY="0" 222 + refX="0" 223 + id="Arrow2Lend" 224 + style="overflow:visible" 225 + inkscape:isstock="true"> 226 + <path 227 + id="path4700" 228 + 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 229 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 230 + transform="matrix(-1.1,0,0,-1.1,-1.1,0)" 231 + inkscape:connector-curvature="0" /> 232 + </marker> 233 + <marker 234 + inkscape:stockid="EmptyTriangleOutL" 235 + orient="auto" 236 + refY="0" 237 + refX="0" 238 + id="EmptyTriangleOutL" 239 + style="overflow:visible" 240 + inkscape:isstock="true"> 241 + <path 242 + id="path4502" 243 + d="M 5.77,0 -2.88,5 V -5 Z" 244 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 245 + transform="matrix(0.8,0,0,0.8,-4.8,0)" 246 + inkscape:connector-curvature="0" /> 247 + </marker> 248 + <marker 249 + inkscape:stockid="EmptyTriangleOutL" 250 + orient="auto" 251 + refY="0" 252 + refX="0" 253 + id="EmptyTriangleOutL-4" 254 + style="overflow:visible" 255 + inkscape:isstock="true"> 256 + <path 257 + inkscape:connector-curvature="0" 258 + id="path4502-7" 259 + d="M 5.77,0 -2.88,5 V -5 Z" 260 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 261 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 262 + </marker> 263 + <marker 264 + inkscape:stockid="EmptyTriangleOutL" 265 + orient="auto" 266 + refY="0" 267 + refX="0" 268 + id="EmptyTriangleOutL-4-4" 269 + style="overflow:visible" 270 + inkscape:isstock="true"> 271 + <path 272 + inkscape:connector-curvature="0" 273 + id="path4502-7-5" 274 + d="M 5.77,0 -2.88,5 V -5 Z" 275 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 276 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 277 + </marker> 278 + <marker 279 + inkscape:stockid="EmptyTriangleOutL" 280 + orient="auto" 281 + refY="0" 282 + refX="0" 283 + id="EmptyTriangleOutL-7" 284 + style="overflow:visible" 285 + inkscape:isstock="true"> 286 + 
<path 287 + inkscape:connector-curvature="0" 288 + id="path4502-5" 289 + d="M 5.77,0 -2.88,5 V -5 Z" 290 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 291 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 292 + </marker> 293 + <marker 294 + inkscape:stockid="EmptyTriangleOutL" 295 + orient="auto" 296 + refY="0" 297 + refX="0" 298 + id="EmptyTriangleOutL-6" 299 + style="overflow:visible" 300 + inkscape:isstock="true"> 301 + <path 302 + inkscape:connector-curvature="0" 303 + id="path4502-2" 304 + d="M 5.77,0 -2.88,5 V -5 Z" 305 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 306 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 307 + </marker> 308 + <marker 309 + inkscape:stockid="EmptyTriangleOutL" 310 + orient="auto" 311 + refY="0" 312 + refX="0" 313 + id="EmptyTriangleOutL-78" 314 + style="overflow:visible" 315 + inkscape:isstock="true"> 316 + <path 317 + inkscape:connector-curvature="0" 318 + id="path4502-57" 319 + d="M 5.77,0 -2.88,5 V -5 Z" 320 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 321 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 322 + </marker> 323 + <marker 324 + inkscape:stockid="EmptyTriangleOutL" 325 + orient="auto" 326 + refY="0" 327 + refX="0" 328 + id="EmptyTriangleOutL-1" 329 + style="overflow:visible" 330 + inkscape:isstock="true"> 331 + <path 332 + inkscape:connector-curvature="0" 333 + id="path4502-8" 334 + d="M 5.77,0 -2.88,5 V -5 Z" 335 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 336 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 337 + </marker> 338 + <marker 339 + inkscape:stockid="EmptyTriangleOutL" 340 + orient="auto" 341 + refY="0" 342 + refX="0" 343 + id="EmptyTriangleOutL-9" 344 + style="overflow:visible" 345 + inkscape:isstock="true"> 346 + <path 347 + inkscape:connector-curvature="0" 348 + id="path4502-75" 349 + d="M 5.77,0 -2.88,5 V -5 Z" 
350 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 351 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 352 + </marker> 353 + <marker 354 + inkscape:stockid="EmptyTriangleOutL" 355 + orient="auto" 356 + refY="0" 357 + refX="0" 358 + id="EmptyTriangleOutL-78-2" 359 + style="overflow:visible" 360 + inkscape:isstock="true"> 361 + <path 362 + inkscape:connector-curvature="0" 363 + id="path4502-57-6" 364 + d="M 5.77,0 -2.88,5 V -5 Z" 365 + style="fill:#ffffff;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 366 + transform="matrix(0.8,0,0,0.8,-4.8,0)" /> 367 + </marker> 368 + <marker 369 + inkscape:stockid="Arrow2Mend" 370 + orient="auto" 371 + refY="0" 372 + refX="0" 373 + id="Arrow2Mend-3" 374 + style="overflow:visible" 375 + inkscape:isstock="true"> 376 + <path 377 + inkscape:connector-curvature="0" 378 + id="path4369-6" 379 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 380 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 381 + transform="scale(-0.6)" /> 382 + </marker> 383 + <marker 384 + inkscape:stockid="Arrow2Mend" 385 + orient="auto" 386 + refY="0" 387 + refX="0" 388 + id="Arrow2Mend-3-5" 389 + style="overflow:visible" 390 + inkscape:isstock="true"> 391 + <path 392 + inkscape:connector-curvature="0" 393 + id="path4369-6-3" 394 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 395 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 396 + transform="scale(-0.6)" /> 397 + </marker> 398 + <marker 399 + inkscape:stockid="Arrow2Mend" 400 + orient="auto" 401 + refY="0" 402 + refX="0" 403 + id="Arrow2Mend-3-5-6" 404 + style="overflow:visible" 405 + inkscape:isstock="true"> 406 
+ <path 407 + inkscape:connector-curvature="0" 408 + id="path4369-6-3-2" 409 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 410 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 411 + transform="scale(-0.6)" /> 412 + </marker> 413 + <marker 414 + inkscape:stockid="Arrow2Mend" 415 + orient="auto" 416 + refY="0" 417 + refX="0" 418 + id="Arrow2Mend-3-5-6-1" 419 + style="overflow:visible" 420 + inkscape:isstock="true"> 421 + <path 422 + inkscape:connector-curvature="0" 423 + id="path4369-6-3-2-2" 424 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 425 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 426 + transform="scale(-0.6)" /> 427 + </marker> 428 + <marker 429 + inkscape:stockid="Arrow2Mend" 430 + orient="auto" 431 + refY="0" 432 + refX="0" 433 + id="Arrow2Mend-3-5-7" 434 + style="overflow:visible" 435 + inkscape:isstock="true"> 436 + <path 437 + inkscape:connector-curvature="0" 438 + id="path4369-6-3-0" 439 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 440 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 441 + transform="scale(-0.6)" /> 442 + </marker> 443 + <marker 444 + inkscape:stockid="Arrow2Mend" 445 + orient="auto" 446 + refY="0" 447 + refX="0" 448 + id="Arrow2Mend-3-9" 449 + style="overflow:visible" 450 + inkscape:isstock="true"> 451 + <path 452 + inkscape:connector-curvature="0" 453 + id="path4369-6-36" 454 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 455 + d="M 8.7185878,4.0337352 
-2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 456 + transform="scale(-0.6)" /> 457 + </marker> 458 + <marker 459 + inkscape:stockid="Arrow2Mend" 460 + orient="auto" 461 + refY="0" 462 + refX="0" 463 + id="Arrow2Mend-31" 464 + style="overflow:visible" 465 + inkscape:isstock="true"> 466 + <path 467 + inkscape:connector-curvature="0" 468 + id="path4369-9" 469 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 470 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 471 + transform="scale(-0.6)" /> 472 + </marker> 473 + <marker 474 + inkscape:stockid="Arrow2Mend" 475 + orient="auto" 476 + refY="0" 477 + refX="0" 478 + id="Arrow2Mend-3-5-7-5" 479 + style="overflow:visible" 480 + inkscape:isstock="true"> 481 + <path 482 + inkscape:connector-curvature="0" 483 + id="path4369-6-3-0-4" 484 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 485 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 486 + transform="scale(-0.6)" /> 487 + </marker> 488 + <marker 489 + inkscape:stockid="Arrow2Mend" 490 + orient="auto" 491 + refY="0" 492 + refX="0" 493 + id="Arrow2Mend-3-9-6" 494 + style="overflow:visible" 495 + inkscape:isstock="true"> 496 + <path 497 + inkscape:connector-curvature="0" 498 + id="path4369-6-36-5" 499 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 500 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 501 + transform="scale(-0.6)" /> 502 + </marker> 503 + <marker 504 + inkscape:stockid="Arrow2Mend" 505 + orient="auto" 506 + refY="0" 507 + 
refX="0" 508 + id="Arrow2Mend-3-5-7-3" 509 + style="overflow:visible" 510 + inkscape:isstock="true"> 511 + <path 512 + inkscape:connector-curvature="0" 513 + id="path4369-6-3-0-5" 514 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 515 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 516 + transform="scale(-0.6)" /> 517 + </marker> 518 + <marker 519 + inkscape:stockid="Arrow2Mend" 520 + orient="auto" 521 + refY="0" 522 + refX="0" 523 + id="marker9095-3" 524 + style="overflow:visible" 525 + inkscape:isstock="true"> 526 + <path 527 + inkscape:connector-curvature="0" 528 + id="path9097-1" 529 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 530 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 531 + transform="scale(-0.6)" /> 532 + </marker> 533 + <marker 534 + inkscape:stockid="Arrow2Mend" 535 + orient="auto" 536 + refY="0" 537 + refX="0" 538 + id="marker9095-3-7" 539 + style="overflow:visible" 540 + inkscape:isstock="true"> 541 + <path 542 + inkscape:connector-curvature="0" 543 + id="path9097-1-8" 544 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 545 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 546 + transform="scale(-0.6)" /> 547 + </marker> 548 + <marker 549 + inkscape:stockid="Arrow2Mend" 550 + orient="auto" 551 + refY="0" 552 + refX="0" 553 + id="marker9095-1-5" 554 + style="overflow:visible" 555 + inkscape:isstock="true"> 556 + <path 557 + inkscape:connector-curvature="0" 558 + id="path9097-2-0" 559 + 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 560 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 561 + transform="scale(-0.6)" /> 562 + </marker> 563 + <marker 564 + inkscape:stockid="Arrow2Mend" 565 + orient="auto" 566 + refY="0" 567 + refX="0" 568 + id="marker9095-1-5-6" 569 + style="overflow:visible" 570 + inkscape:isstock="true"> 571 + <path 572 + inkscape:connector-curvature="0" 573 + id="path9097-2-0-1" 574 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 575 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 576 + transform="scale(-0.6)" /> 577 + </marker> 578 + <marker 579 + inkscape:stockid="Arrow2Mend" 580 + orient="auto" 581 + refY="0" 582 + refX="0" 583 + id="marker9095-1-5-6-6" 584 + style="overflow:visible" 585 + inkscape:isstock="true"> 586 + <path 587 + inkscape:connector-curvature="0" 588 + id="path9097-2-0-1-3" 589 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 590 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 591 + transform="scale(-0.6)" /> 592 + </marker> 593 + <marker 594 + inkscape:stockid="Arrow2Mend" 595 + orient="auto" 596 + refY="0" 597 + refX="0" 598 + id="marker9095-1-5-2" 599 + style="overflow:visible" 600 + inkscape:isstock="true"> 601 + <path 602 + inkscape:connector-curvature="0" 603 + id="path9097-2-0-0" 604 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 605 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 
-1.7354408,5.6174519 -6e-7,8.035443 z" 606 + transform="scale(-0.6)" /> 607 + </marker> 608 + <marker 609 + inkscape:stockid="Arrow2Mend" 610 + orient="auto" 611 + refY="0" 612 + refX="0" 613 + id="marker9095-1-5-6-5" 614 + style="overflow:visible" 615 + inkscape:isstock="true"> 616 + <path 617 + inkscape:connector-curvature="0" 618 + id="path9097-2-0-1-5" 619 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 620 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 621 + transform="scale(-0.6)" /> 622 + </marker> 623 + <marker 624 + inkscape:stockid="Arrow2Mend" 625 + orient="auto" 626 + refY="0" 627 + refX="0" 628 + id="marker9095-1-5-4" 629 + style="overflow:visible" 630 + inkscape:isstock="true"> 631 + <path 632 + inkscape:connector-curvature="0" 633 + id="path9097-2-0-7" 634 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 635 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 636 + transform="scale(-0.6)" /> 637 + </marker> 638 + <marker 639 + inkscape:stockid="Arrow2Mend" 640 + orient="auto" 641 + refY="0" 642 + refX="0" 643 + id="marker9095-1-5-6-5-9" 644 + style="overflow:visible" 645 + inkscape:isstock="true"> 646 + <path 647 + inkscape:connector-curvature="0" 648 + id="path9097-2-0-1-5-3" 649 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 650 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 651 + transform="scale(-0.6)" /> 652 + </marker> 653 + <marker 654 + inkscape:stockid="Arrow2Mend" 655 + orient="auto" 656 + refY="0" 657 + refX="0" 658 + id="marker9095-1-5-6-5-4" 659 + 
style="overflow:visible" 660 + inkscape:isstock="true"> 661 + <path 662 + inkscape:connector-curvature="0" 663 + id="path9097-2-0-1-5-5" 664 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 665 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 666 + transform="scale(-0.6)" /> 667 + </marker> 668 + <marker 669 + inkscape:stockid="Arrow2Mend" 670 + orient="auto" 671 + refY="0" 672 + refX="0" 673 + id="marker9095-1-5-6-5-4-4" 674 + style="overflow:visible" 675 + inkscape:isstock="true"> 676 + <path 677 + inkscape:connector-curvature="0" 678 + id="path9097-2-0-1-5-5-3" 679 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 680 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 681 + transform="scale(-0.6)" /> 682 + </marker> 683 + <marker 684 + inkscape:stockid="Arrow2Mend" 685 + orient="auto" 686 + refY="0" 687 + refX="0" 688 + id="marker9095-1-5-6-5-4-4-9" 689 + style="overflow:visible" 690 + inkscape:isstock="true"> 691 + <path 692 + inkscape:connector-curvature="0" 693 + id="path9097-2-0-1-5-5-3-2" 694 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 695 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 696 + transform="scale(-0.6)" /> 697 + </marker> 698 + <marker 699 + inkscape:stockid="Arrow2Mend" 700 + orient="auto" 701 + refY="0" 702 + refX="0" 703 + id="marker9095-1-5-6-5-4-4-6" 704 + style="overflow:visible" 705 + inkscape:isstock="true"> 706 + <path 707 + inkscape:connector-curvature="0" 708 + id="path9097-2-0-1-5-5-3-8" 709 + 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 710 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 711 + transform="scale(-0.6)" /> 712 + </marker> 713 + <marker 714 + inkscape:stockid="Arrow2Mend" 715 + orient="auto" 716 + refY="0" 717 + refX="0" 718 + id="marker9095-1-5-6-5-4-4-2" 719 + style="overflow:visible" 720 + inkscape:isstock="true"> 721 + <path 722 + inkscape:connector-curvature="0" 723 + id="path9097-2-0-1-5-5-3-7" 724 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 725 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 726 + transform="scale(-0.6)" /> 727 + </marker> 728 + <marker 729 + inkscape:stockid="Arrow2Mend" 730 + orient="auto" 731 + refY="0" 732 + refX="0" 733 + id="marker9095-1-5-3" 734 + style="overflow:visible" 735 + inkscape:isstock="true"> 736 + <path 737 + inkscape:connector-curvature="0" 738 + id="path9097-2-0-6" 739 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 740 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 741 + transform="scale(-0.6)" /> 742 + </marker> 743 + <marker 744 + inkscape:stockid="Arrow2Mend" 745 + orient="auto" 746 + refY="0" 747 + refX="0" 748 + id="marker9095-1-5-6-5-4-4-6-5" 749 + style="overflow:visible" 750 + inkscape:isstock="true"> 751 + <path 752 + inkscape:connector-curvature="0" 753 + id="path9097-2-0-1-5-5-3-8-3" 754 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 755 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c 
-1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 756 + transform="scale(-0.6)" /> 757 + </marker> 758 + <marker 759 + inkscape:stockid="Arrow2Mend" 760 + orient="auto" 761 + refY="0" 762 + refX="0" 763 + id="marker9095-1-5-27" 764 + style="overflow:visible" 765 + inkscape:isstock="true"> 766 + <path 767 + inkscape:connector-curvature="0" 768 + id="path9097-2-0-09" 769 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 770 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 771 + transform="scale(-0.6)" /> 772 + </marker> 773 + <marker 774 + inkscape:stockid="Arrow2Mend" 775 + orient="auto" 776 + refY="0" 777 + refX="0" 778 + id="marker9095-1-5-27-6" 779 + style="overflow:visible" 780 + inkscape:isstock="true"> 781 + <path 782 + inkscape:connector-curvature="0" 783 + id="path9097-2-0-09-2" 784 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 785 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 786 + transform="scale(-0.6)" /> 787 + </marker> 788 + <marker 789 + inkscape:stockid="Arrow2Mend" 790 + orient="auto" 791 + refY="0" 792 + refX="0" 793 + id="marker9095-1-5-27-0" 794 + style="overflow:visible" 795 + inkscape:isstock="true"> 796 + <path 797 + inkscape:connector-curvature="0" 798 + id="path9097-2-0-09-23" 799 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 800 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 801 + transform="scale(-0.6)" /> 802 + </marker> 803 + <marker 804 + inkscape:stockid="Arrow2Mend" 805 + orient="auto" 806 + refY="0" 807 + refX="0" 808 + 
id="marker9095-1-9" 809 + style="overflow:visible" 810 + inkscape:isstock="true"> 811 + <path 812 + inkscape:connector-curvature="0" 813 + id="path9097-2-7" 814 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 815 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 816 + transform="scale(-0.6)" /> 817 + </marker> 818 + <marker 819 + inkscape:stockid="Arrow2Mend" 820 + orient="auto" 821 + refY="0" 822 + refX="0" 823 + id="marker9095-1-5-4-6" 824 + style="overflow:visible" 825 + inkscape:isstock="true"> 826 + <path 827 + inkscape:connector-curvature="0" 828 + id="path9097-2-0-7-7" 829 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 830 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 831 + transform="scale(-0.6)" /> 832 + </marker> 833 + <marker 834 + inkscape:stockid="Arrow2Mend" 835 + orient="auto" 836 + refY="0" 837 + refX="0" 838 + id="marker9095-1-5-4-6-3" 839 + style="overflow:visible" 840 + inkscape:isstock="true"> 841 + <path 842 + inkscape:connector-curvature="0" 843 + id="path9097-2-0-7-7-5" 844 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 845 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 846 + transform="scale(-0.6)" /> 847 + </marker> 848 + <marker 849 + inkscape:stockid="Arrow2Mend" 850 + orient="auto" 851 + refY="0" 852 + refX="0" 853 + id="marker9095-1-5-4-6-2" 854 + style="overflow:visible" 855 + inkscape:isstock="true"> 856 + <path 857 + inkscape:connector-curvature="0" 858 + id="path9097-2-0-7-7-9" 859 + 
style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 860 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 861 + transform="scale(-0.6)" /> 862 + </marker> 863 + <marker 864 + inkscape:stockid="TriangleOutM" 865 + orient="auto" 866 + refY="0" 867 + refX="0" 868 + id="TriangleOutM-2" 869 + style="overflow:visible" 870 + inkscape:isstock="true"> 871 + <path 872 + inkscape:connector-curvature="0" 873 + id="path4107-7" 874 + d="M 5.77,0 -2.88,5 V -5 Z" 875 + style="fill:#008000;fill-opacity:1;fill-rule:evenodd;stroke:#008000;stroke-width:1.00000003pt;stroke-opacity:1" 876 + transform="scale(0.4)" /> 877 + </marker> 878 + <marker 879 + inkscape:stockid="Arrow2Mstart" 880 + orient="auto" 881 + refY="0" 882 + refX="0" 883 + id="Arrow2Mstart-9" 884 + style="overflow:visible" 885 + inkscape:isstock="true"> 886 + <path 887 + inkscape:connector-curvature="0" 888 + id="path9283-3" 889 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 890 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 891 + transform="scale(0.6)" /> 892 + </marker> 893 + <marker 894 + inkscape:stockid="Arrow2Mend" 895 + orient="auto" 896 + refY="0" 897 + refX="0" 898 + id="marker9095-1-5-4-6-6" 899 + style="overflow:visible" 900 + inkscape:isstock="true"> 901 + <path 902 + inkscape:connector-curvature="0" 903 + id="path9097-2-0-7-7-0" 904 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:0.625;stroke-linejoin:round;stroke-opacity:1" 905 + d="M 8.7185878,4.0337352 -2.2072895,0.01601326 8.7185884,-4.0017078 c -1.7454984,2.3720609 -1.7354408,5.6174519 -6e-7,8.035443 z" 906 + transform="scale(-0.6)" /> 907 + </marker> 908 + <marker 909 + inkscape:stockid="DotM" 910 + 
orient="auto" 911 + refY="0" 912 + refX="0" 913 + id="DotM-3" 914 + style="overflow:visible" 915 + inkscape:isstock="true"> 916 + <path 917 + inkscape:connector-curvature="0" 918 + id="path1795-7" 919 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 920 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 921 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 922 + </marker> 923 + <marker 924 + inkscape:isstock="true" 925 + style="overflow:visible" 926 + id="marker2713-9" 927 + refX="0" 928 + refY="0" 929 + orient="auto" 930 + inkscape:stockid="DotM"> 931 + <path 932 + inkscape:connector-curvature="0" 933 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 934 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 935 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 936 + id="path2711-2" /> 937 + </marker> 938 + <marker 939 + inkscape:stockid="DotM" 940 + orient="auto" 941 + refY="0" 942 + refX="0" 943 + id="DotM-2" 944 + style="overflow:visible" 945 + inkscape:isstock="true" 946 + inkscape:collect="always"> 947 + <path 948 + inkscape:connector-curvature="0" 949 + id="path1795-8" 950 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 951 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 952 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 953 + </marker> 954 + <marker 955 + inkscape:isstock="true" 956 + style="overflow:visible" 957 + id="marker2713-3" 958 + refX="0" 959 + refY="0" 960 + orient="auto" 961 + inkscape:stockid="DotM"> 962 + <path 963 + inkscape:connector-curvature="0" 964 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 965 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 966 
+ d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 967 + id="path2711-6" /> 968 + </marker> 969 + <marker 970 + inkscape:stockid="DotM" 971 + orient="auto" 972 + refY="0" 973 + refX="0" 974 + id="DotM-1" 975 + style="overflow:visible" 976 + inkscape:isstock="true" 977 + inkscape:collect="always"> 978 + <path 979 + inkscape:connector-curvature="0" 980 + id="path1795-2" 981 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 982 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 983 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 984 + </marker> 985 + <marker 986 + inkscape:isstock="true" 987 + style="overflow:visible" 988 + id="marker2713-94" 989 + refX="0" 990 + refY="0" 991 + orient="auto" 992 + inkscape:stockid="DotM"> 993 + <path 994 + inkscape:connector-curvature="0" 995 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 996 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 997 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 998 + id="path2711-7" /> 999 + </marker> 1000 + <marker 1001 + inkscape:stockid="DotM" 1002 + orient="auto" 1003 + refY="0" 1004 + refX="0" 1005 + id="DotM-8" 1006 + style="overflow:visible" 1007 + inkscape:isstock="true" 1008 + inkscape:collect="always"> 1009 + <path 1010 + inkscape:connector-curvature="0" 1011 + id="path1795-4" 1012 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 1013 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 1014 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" /> 1015 + </marker> 1016 + <marker 1017 + inkscape:isstock="true" 1018 + style="overflow:visible" 1019 + id="marker2713-36" 1020 + refX="0" 1021 + refY="0" 1022 + orient="auto" 
1023 + inkscape:stockid="DotM"> 1024 + <path 1025 + inkscape:connector-curvature="0" 1026 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 1027 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 1028 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 1029 + id="path2711-1" /> 1030 + </marker> 1031 + <marker 1032 + inkscape:isstock="true" 1033 + style="overflow:visible" 1034 + id="marker6165-5" 1035 + refX="0" 1036 + refY="0" 1037 + orient="auto" 1038 + inkscape:stockid="DotM"> 1039 + <path 1040 + transform="matrix(0.4,0,0,0.4,2.96,0.4)" 1041 + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.00000003pt;stroke-opacity:1" 1042 + d="m -2.5,-1 c 0,2.76 -2.24,5 -5,5 -2.76,0 -5,-2.24 -5,-5 0,-2.76 2.24,-5 5,-5 2.76,0 5,2.24 5,5 z" 1043 + id="path6163-5" 1044 + inkscape:connector-curvature="0" /> 1045 + </marker> 1046 + </defs> 1047 + <sodipodi:namedview 1048 + showguides="true" 1049 + inkscape:window-maximized="1" 1050 + inkscape:window-y="0" 1051 + inkscape:window-x="0" 1052 + inkscape:window-height="1015" 1053 + inkscape:window-width="1920" 1054 + showgrid="true" 1055 + inkscape:current-layer="layer1" 1056 + inkscape:document-units="px" 1057 + inkscape:cy="214.66765" 1058 + inkscape:cx="-167.56857" 1059 + inkscape:zoom="0.70710678" 1060 + inkscape:pageshadow="2" 1061 + inkscape:pageopacity="0.0" 1062 + borderopacity="1.0" 1063 + bordercolor="#666666" 1064 + pagecolor="#ffffff" 1065 + id="base" 1066 + inkscape:snap-to-guides="true" 1067 + inkscape:snap-grids="true" 1068 + inkscape:snap-bbox="false" 1069 + inkscape:object-nodes="true" 1070 + fit-margin-top="5" 1071 + fit-margin-left="5" 1072 + fit-margin-right="5" 1073 + fit-margin-bottom="5"> 1074 + <inkscape:grid 1075 + type="xygrid" 1076 + id="grid4451" 1077 + originx="-93.377219" 1078 + originy="-347.2523" /> 1079 + </sodipodi:namedview> 1080 + <metadata 1081 + id="metadata7"> 
1082 + <rdf:RDF> 1083 + <cc:Work 1084 + rdf:about=""> 1085 + <dc:format>image/svg+xml</dc:format> 1086 + <dc:type 1087 + rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> 1088 + <dc:title></dc:title> 1089 + <dc:creator> 1090 + <cc:Agent> 1091 + <dc:title>Luca Ceresoli</dc:title> 1092 + </cc:Agent> 1093 + </dc:creator> 1094 + <dc:date>2020</dc:date> 1095 + <cc:license 1096 + rdf:resource="http://creativecommons.org/licenses/by-sa/4.0/" /> 1097 + </cc:Work> 1098 + <cc:License 1099 + rdf:about="http://creativecommons.org/licenses/by-sa/4.0/"> 1100 + <cc:permits 1101 + rdf:resource="http://creativecommons.org/ns#Reproduction" /> 1102 + <cc:permits 1103 + rdf:resource="http://creativecommons.org/ns#Distribution" /> 1104 + <cc:requires 1105 + rdf:resource="http://creativecommons.org/ns#Notice" /> 1106 + <cc:requires 1107 + rdf:resource="http://creativecommons.org/ns#Attribution" /> 1108 + <cc:permits 1109 + rdf:resource="http://creativecommons.org/ns#DerivativeWorks" /> 1110 + <cc:requires 1111 + rdf:resource="http://creativecommons.org/ns#ShareAlike" /> 1112 + </cc:License> 1113 + </rdf:RDF> 1114 + </metadata> 1115 + <g 1116 + inkscape:label="Livello 1" 1117 + inkscape:groupmode="layer" 1118 + id="layer1" 1119 + transform="translate(-93.377215,-444.09395)"> 1120 + <rect 1121 + style="opacity:1;fill:#ffb9b9;fill-opacity:1;stroke:#f00000;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1122 + id="rect4424-3-2-9-7" 1123 + width="112.5" 1124 + height="113.75008" 1125 + x="112.5" 1126 + y="471.11221" 1127 + rx="0" 1128 + ry="0" /> 1129 + <text 1130 + xml:space="preserve" 1131 + style="font-style:normal;font-weight:normal;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1132 + x="167.5354" 1133 + y="521.46259" 1134 + id="text4349"><tspan 1135 + 
sodipodi:role="line" 1136 + x="167.5354" 1137 + y="521.46259" 1138 + style="font-size:25px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle" 1139 + id="tspan1273">I2C</tspan><tspan 1140 + sodipodi:role="line" 1141 + x="167.5354" 1142 + y="552.71259" 1143 + style="font-size:25px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle" 1144 + id="tspan1285">Master</tspan></text> 1145 + <rect 1146 + style="color:#000000;clip-rule:nonzero;display:inline;overflow:visible;visibility:visible;opacity:1;isolation:auto;mix-blend-mode:normal;color-interpolation:sRGB;color-interpolation-filters:linearRGB;solid-color:#000000;solid-opacity:1;fill:#b9ffb9;fill-opacity:1;fill-rule:nonzero;stroke:#006400;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;color-rendering:auto;image-rendering:auto;shape-rendering:auto;text-rendering:auto;enable-background:accumulate" 1147 + id="rect4424-3-2-9-7-3-3-5-3" 1148 + width="112.49998" 1149 + height="112.50001" 1150 + x="262.5" 1151 + y="471.11218" 1152 + rx="0" 1153 + ry="0" /> 1154 + <path 1155 + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968767;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1156 + d="m 112.50002,639.86223 c 712.50002,0 712.50002,0 712.50002,0" 1157 + id="path4655-9-3-65-5-6" 1158 + inkscape:connector-curvature="0" /> 1159 + <text 1160 + xml:space="preserve" 1161 + style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1162 + x="318.59131" 1163 + y="520.83752" 1164 + id="text4349-26"><tspan 1165 + sodipodi:role="line" 1166 + x="318.59131" 1167 + y="520.83752" 1168 + 
style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1169 + id="tspan1273-8">I2C</tspan><tspan 1170 + sodipodi:role="line" 1171 + x="318.59131" 1172 + y="552.08752" 1173 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1174 + id="tspan1287">Slave</tspan></text> 1175 + <path 1176 + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968767;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1177 + d="m 112.49995,677.36223 c 712.50005,0 712.50005,0 712.50005,0" 1178 + id="path4655-9-3-65-5-6-2" 1179 + inkscape:connector-curvature="0" /> 1180 + <text 1181 + xml:space="preserve" 1182 + style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1183 + x="861.07312" 1184 + y="687.03937" 1185 + id="text4349-7"><tspan 1186 + sodipodi:role="line" 1187 + x="861.07312" 1188 + y="687.03937" 1189 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;writing-mode:lr-tb;direction:ltr;text-anchor:middle;stroke-width:1px" 1190 + id="tspan1285-9">SCL</tspan></text> 1191 + <flowRoot 1192 + xml:space="preserve" 1193 + id="flowRoot1627" 1194 + style="font-style:normal;font-weight:normal;font-size:40px;line-height:125%;font-family:Sans;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1"><flowRegion 1195 + id="flowRegion1629"><rect 1196 + id="rect1631" 1197 + width="220" 1198 + height="120" 1199 + x="140" 1200 + y="-126.29921" /></flowRegion><flowPara 1201 + id="flowPara1633" /></flowRoot> <text 1202 + xml:space="preserve" 1203 + 
style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1204 + x="863.31921" 1205 + y="648.96735" 1206 + id="text4349-7-3"><tspan 1207 + sodipodi:role="line" 1208 + x="863.31921" 1209 + y="648.96735" 1210 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1211 + id="tspan1285-9-6">SDA</tspan></text> 1212 + <rect 1213 + style="color:#000000;clip-rule:nonzero;display:inline;overflow:visible;visibility:visible;opacity:1;isolation:auto;mix-blend-mode:normal;color-interpolation:sRGB;color-interpolation-filters:linearRGB;solid-color:#000000;solid-opacity:1;vector-effect:none;fill:#b9ffb9;fill-opacity:1;fill-rule:nonzero;stroke:#006400;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;color-rendering:auto;image-rendering:auto;shape-rendering:auto;text-rendering:auto;enable-background:accumulate" 1214 + id="rect4424-3-2-9-7-3-3-5-3-0" 1215 + width="112.49998" 1216 + height="112.50002" 1217 + x="412.5" 1218 + y="471.11215" 1219 + rx="0" 1220 + ry="0" /> 1221 + <text 1222 + xml:space="preserve" 1223 + style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1224 + x="468.59131" 1225 + y="520.83746" 1226 + id="text4349-26-6"><tspan 1227 + sodipodi:role="line" 1228 + x="468.59131" 1229 + y="520.83746" 1230 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1231 + id="tspan1273-8-2">I2C</tspan><tspan 1232 + sodipodi:role="line" 1233 + x="468.59131" 1234 + 
y="552.08746" 1235 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1236 + id="tspan1287-6">Slave</tspan></text> 1237 + <rect 1238 + style="color:#000000;clip-rule:nonzero;display:inline;overflow:visible;visibility:visible;opacity:1;isolation:auto;mix-blend-mode:normal;color-interpolation:sRGB;color-interpolation-filters:linearRGB;solid-color:#000000;solid-opacity:1;vector-effect:none;fill:#b9ffb9;fill-opacity:1;fill-rule:nonzero;stroke:#006400;stroke-width:2.8125;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;color-rendering:auto;image-rendering:auto;shape-rendering:auto;text-rendering:auto;enable-background:accumulate" 1239 + id="rect4424-3-2-9-7-3-3-5-3-1" 1240 + width="112.49998" 1241 + height="112.50002" 1242 + x="562.5" 1243 + y="471.11215" 1244 + rx="0" 1245 + ry="0" /> 1246 + <text 1247 + xml:space="preserve" 1248 + style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1249 + x="618.59131" 1250 + y="520.83746" 1251 + id="text4349-26-8"><tspan 1252 + sodipodi:role="line" 1253 + x="618.59131" 1254 + y="520.83746" 1255 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1256 + id="tspan1273-8-7">I2C</tspan><tspan 1257 + sodipodi:role="line" 1258 + x="618.59131" 1259 + y="552.08746" 1260 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1261 + id="tspan1287-9">Slave</tspan></text> 1262 + <path 1263 + 
style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM)" 1264 + d="m 150,583.61221 v 93.75" 1265 + id="path4655-9-3-65-5-6-20" 1266 + inkscape:connector-curvature="0" 1267 + sodipodi:nodetypes="cc" /> 1268 + <path 1269 + style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713)" 1270 + d="m 187.5,583.61221 v 56.25" 1271 + id="path4655-9-3-65-5-6-20-2" 1272 + inkscape:connector-curvature="0" 1273 + sodipodi:nodetypes="cc" /> 1274 + <path 1275 + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM-2)" 1276 + d="m 300,583.61221 v 93.75" 1277 + id="path4655-9-3-65-5-6-20-9" 1278 + inkscape:connector-curvature="0" 1279 + sodipodi:nodetypes="cc" /> 1280 + <path 1281 + style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713-9)" 1282 + d="m 337.5,583.61221 v 56.25" 1283 + id="path4655-9-3-65-5-6-20-2-7" 1284 + inkscape:connector-curvature="0" 1285 + sodipodi:nodetypes="cc" /> 1286 + <path 1287 + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM-1)" 1288 + d="m 450,583.61221 v 93.75" 1289 + id="path4655-9-3-65-5-6-20-93" 1290 + inkscape:connector-curvature="0" 1291 + sodipodi:nodetypes="cc" /> 1292 + <path 1293 + 
style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713-3)" 1294 + d="m 487.5,583.61221 v 56.25" 1295 + id="path4655-9-3-65-5-6-20-2-1" 1296 + inkscape:connector-curvature="0" 1297 + sodipodi:nodetypes="cc" /> 1298 + <path 1299 + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#DotM-8)" 1300 + d="m 600,583.61221 v 93.75" 1301 + id="path4655-9-3-65-5-6-20-5" 1302 + inkscape:connector-curvature="0" 1303 + sodipodi:nodetypes="cc" /> 1304 + <path 1305 + style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-end:url(#marker2713-94)" 1306 + d="m 637.5,583.61221 v 56.25" 1307 + id="path4655-9-3-65-5-6-20-2-0" 1308 + inkscape:connector-curvature="0" 1309 + sodipodi:nodetypes="cc" /> 1310 + <path 1311 + style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-start:url(#marker6165);marker-end:url(#marker6165)" 1312 + d="m 750,471.11221 v 28.125 l 9.375,9.375 -18.74999,18.75 18.74999,18.75 -18.74999,18.75 18.74999,18.75 -9.375,9.375 v 28.125 0 0 56.25" 1313 + id="path6135" 1314 + inkscape:connector-curvature="0" 1315 + sodipodi:nodetypes="cccccccccccc" /> 1316 + <path 1317 + 
style="opacity:1;vector-effect:none;fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968743;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;marker-start:url(#marker8861);marker-end:url(#marker6165-5)" 1318 + d="m 787.49999,471.11221 v 28.125 l 9.375,9.375 -18.74999,18.75 18.74999,18.75 -18.74999,18.75 18.74999,18.75 -9.375,9.375 v 28.125 0 0 18.75001" 1319 + id="path6135-4" 1320 + inkscape:connector-curvature="0" 1321 + sodipodi:nodetypes="cccccccccccc" /> 1322 + <path 1323 + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:1.99968719;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" 1324 + d="m 712.5,471.11221 c 112.49999,0 112.49999,0 112.49999,0" 1325 + id="path4655-9-3-65-5-6-7" 1326 + inkscape:connector-curvature="0" /> 1327 + <text 1328 + xml:space="preserve" 1329 + style="font-style:normal;font-weight:normal;font-size:12px;line-height:0%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" 1330 + x="859.94275" 1331 + y="480.03558" 1332 + id="text4349-7-3-6"><tspan 1333 + sodipodi:role="line" 1334 + x="859.94275" 1335 + y="480.03558" 1336 + style="font-size:25.00000191px;line-height:1.25;font-family:sans-serif;text-align:center;text-anchor:middle;stroke-width:1px" 1337 + id="tspan1285-9-6-5">V<tspan 1338 + style="font-size:64.99999762%;baseline-shift:sub" 1339 + id="tspan9307">DD</tspan></tspan></text> 1340 + </g> 1341 + </svg>

+1 -1

Documentation/i2c/summary.rst

··· 34 34 Using the terminology from the official documentation, the I2C bus connects 35 35 one or more *master* chips and one or more *slave* chips. 36 36 37 - .. kernel-figure:: i2c.svg 37 + .. kernel-figure:: i2c_bus.svg 38 38 :alt: Simple I2C bus with one master and 3 slaves 39 39 40 40 Simple I2C bus

+1 -1

Documentation/ia64/irq-redir.rst

··· 7 7 8 8 By writing to /proc/irq/IRQ#/smp_affinity the interrupt routing can be 9 9 controlled. The behavior on IA64 platforms is slightly different from 10 - that described in Documentation/IRQ-affinity.txt for i386 systems. 10 + that described in Documentation/core-api/irq/irq-affinity.rst for i386 systems. 11 11 12 12 Because of the usage of SAPIC mode and physical destination mode the 13 13 IRQ target is one particular CPU and cannot be a mask of several
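
The hunk above refers to controlling interrupt routing by writing to /proc/irq/IRQ#/smp_affinity. As a minimal sketch of how such a value could be built, the helper below turns a list of CPU numbers into a hex bitmask string. The function name `cpus_to_affinity_mask` is hypothetical, and the comma-separated 32-bit grouping is an assumption about the format the kernel accepts for wide masks. Note that the IA64 text says the target is one particular CPU, so on that platform a single-CPU list is the relevant case.

```python
def cpus_to_affinity_mask(cpus):
    """Return a hex bitmask string with one bit set per CPU number.

    Hypothetical helper for illustration only. The mask is split into
    comma-separated groups of up to 8 hex digits (32 bits each), which is
    assumed to be the form expected for masks wider than 32 CPUs.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # set the bit that corresponds to this CPU

    hex_digits = format(mask, "x")
    groups = []
    while hex_digits:
        # Peel off 8 hex digits at a time, starting from the low-order end.
        groups.insert(0, hex_digits[-8:])
        hex_digits = hex_digits[:-8]
    return ",".join(groups)


if __name__ == "__main__":
    # Single-CPU target, as in the IA64 case: CPU 2 only.
    print(cpus_to_affinity_mask([2]))
    # Several CPUs, as a mask-based platform would allow.
    print(cpus_to_affinity_mask([0, 1, 2, 3]))
```

The resulting string would then be written to the smp_affinity file of the chosen IRQ by a privileged process; that step is left out here because it depends on the running system.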

+1 -1

Documentation/iio/iio_configfs.rst

··· 9 9 objects that could be easily configured using configfs (e.g.: devices, 10 10 triggers). 11 11 12 - See Documentation/filesystems/configfs/configfs.txt for more information 12 + See Documentation/filesystems/configfs.rst for more information 13 13 about how configfs works. 14 14 15 15 2. Usage

Documentation/irqflags-tracing.txt Documentation/core-api/irq/irqflags-tracing.rst

Documentation/kref.txt Documentation/core-api/kref.rst

+7

Documentation/locking/index.rst

··· 16 16 rt-mutex 17 17 spinlocks 18 18 ww-mutex-design 19 + preempt-locking 20 + pi-futex 21 + futex-requeue-pi 22 + hwspinlock 23 + percpu-rw-semaphore 24 + robust-futexes 25 + robust-futex-ABI 19 26 20 27 .. only:: subproject and html 21 28

+1 -1

Documentation/locking/locktorture.rst

··· 110 110 same period of time. Defaults to "stutter=5", so as 111 111 to run and pause for (roughly) five-second intervals. 112 112 Specifying "stutter=0" causes the test to run continuously 113 - without pausing, which is the old default behavior. 113 + without pausing. 114 114 115 115 shuffle_interval 116 116 The number of seconds to keep the test threads affinitied

+1 -1

Documentation/locking/rt-mutex.rst

··· 4 4 5 5 RT-mutexes with priority inheritance are used to support PI-futexes, 6 6 which enable pthread_mutex_t priority inheritance attributes 7 - (PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details 7 + (PTHREAD_PRIO_INHERIT). [See Documentation/locking/pi-futex.rst for more details 8 8 about PI-futexes.] 9 9 10 10 This technology was developed in the -rt tree and streamlined for

+6 -6

Documentation/maintainer/maintainer-entry-profile.rst

··· 7 7 (submitting-patches, submitting drivers...) with 8 8 subsystem/device-driver-local customs as well as details about the patch 9 9 submission life-cycle. A contributor uses this document to level set 10 - their expectations and avoid common mistakes, maintainers may use these 10 + their expectations and avoid common mistakes; maintainers may use these 11 11 profiles to look across subsystems for opportunities to converge on 12 12 common practices. 13 13 ··· 26 26 - Does the subsystem have a patchwork instance? Are patchwork state 27 27 changes notified? 28 28 - Any bots or CI infrastructure that watches the list, or automated 29 - testing feedback that the subsystem gates acceptance? 29 + testing feedback that the subsystem uses to gate acceptance? 30 30 - Git branches that are pulled into -next? 31 31 - What branch should contributors submit against? 32 32 - Links to any other Maintainer Entry Profiles? For example a ··· 54 54 sent at any time before the merge window closes and can still be 55 55 considered for the next -rc1. The reality is that most patches need to 56 56 be settled in soaking in linux-next in advance of the merge window 57 - opening. Clarify for the submitter the key dates (in terms rc release 58 - week) that patches might considered for merging and when patches need to 57 + opening. Clarify for the submitter the key dates (in terms of -rc release 58 + week) that patches might be considered for merging and when patches need to 59 59 wait for the next -rc. At a minimum: 60 60 61 61 - Last -rc for new feature submissions: ··· 70 70 - Last -rc to merge features: Deadline for merge decisions 71 71 Indicate to contributors the point at which an as yet un-applied patch 72 72 set will need to wait for the NEXT+1 merge window. 
Of course there is no 73 - obligation to ever except any given patchset, but if the review has not 74 - concluded by this point the expectation the contributor should wait and 73 + obligation to ever accept any given patchset, but if the review has not 74 + concluded by this point the expectation is the contributor should wait and 75 75 resubmit for the following merge window. 76 76 77 77 Optional:

+1 -1

Documentation/memory-barriers.txt

··· 620 620 until they are certain (1) that the write will actually happen, (2) 621 621 of the location of the write, and (3) of the value to be written. 622 622 But please carefully read the "CONTROL DEPENDENCIES" section and the 623 - Documentation/RCU/rcu_dereference.txt file: The compiler can and does 623 + Documentation/RCU/rcu_dereference.rst file: The compiler can and does 624 624 break dependencies in a great many highly creative ways. 625 625 626 626 CPU 1 CPU 2

+1

Documentation/misc-devices/index.rst

··· 21 21 lis3lv02d 22 22 max6875 23 23 mic/index 24 + uacce 24 25 xilinx_sdfec

+2 -2

Documentation/networking/scaling.rst

··· 81 81 an IRQ may be handled on any CPU. Because a non-negligible part of packet 82 82 processing takes place in receive interrupt handling, it is advantageous 83 83 to spread receive interrupts between CPUs. To manually adjust the IRQ 84 - affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems 84 + affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst. Some systems 85 85 will be running irqbalance, a daemon that dynamically optimizes IRQ 86 86 assignments and as a result may override any manual settings. 87 87 ··· 160 160 161 161 This file implements a bitmap of CPUs. RPS is disabled when it is zero 162 162 (the default), in which case packets are processed on the interrupting 163 - CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to 163 + CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to 164 164 the bitmap. 165 165 166 166
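
The second hunk above describes a file that holds a bitmap of CPUs, with RPS disabled when the value is zero. The sketch below shows one way to read such a value back into a list of CPU numbers. The names `parse_cpu_bitmap` and `rps_enabled` are hypothetical, and the parser assumes the text is hexadecimal with optional commas between fully padded 32-bit groups.

```python
def parse_cpu_bitmap(text):
    """Return the sorted list of CPU numbers whose bits are set in `text`.

    Hypothetical helper for illustration only. Assumes `text` is a hex
    bitmap, optionally split by commas into zero-padded 32-bit groups.
    """
    value = int(text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def rps_enabled(text):
    """RPS is described as disabled when the bitmap is zero."""
    return bool(parse_cpu_bitmap(text))


if __name__ == "__main__":
    print(parse_cpu_bitmap("f"))      # CPUs covered by the mask "f"
    print(rps_enabled("00000000"))    # an all-zero bitmap
```

Keeping the parsing separate from any file access makes the logic easy to check without touching a live system.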

+7 -7

Documentation/nvdimm/maintainer-entry-profile.rst

··· 4 4 Overview 5 5 -------- 6 6 The libnvdimm subsystem manages persistent memory across multiple 7 - architectures. The mailing list, is tracked by patchwork here: 7 + architectures. The mailing list is tracked by patchwork here: 8 8 https://patchwork.kernel.org/project/linux-nvdimm/list/ 9 9 ...and that instance is configured to give feedback to submitters on 10 10 patch acceptance and upstream merge. Patches are merged to either the 11 - 'libnvdimm-fixes', or 'libnvdimm-for-next' branch. Those branches are 11 + 'libnvdimm-fixes' or 'libnvdimm-for-next' branch. Those branches are 12 12 available here: 13 13 https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/ 14 14 15 - In general patches can be submitted against the latest -rc, however if 15 + In general patches can be submitted against the latest -rc; however, if 16 16 the incoming code change is dependent on other pending changes then the 17 17 patch should be based on the libnvdimm-for-next branch. However, since 18 18 persistent memory sits at the intersection of storage and memory there ··· 35 35 36 36 ACPI Device Specific Methods (_DSM) 37 37 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 38 - Before patches enabling for a new _DSM family will be considered it must 38 + Before patches enabling a new _DSM family will be considered, it must 39 39 be assigned a format-interface-code from the NVDIMM Sub-team of the ACPI 40 40 Specification Working Group. In general, the stance of the subsystem is 41 - to push back on the proliferation of NVDIMM command sets, do strongly 41 + to push back on the proliferation of NVDIMM command sets, so do strongly 42 42 consider implementing support for an existing command set. See 43 - drivers/acpi/nfit/nfit.h for the set of support command sets. 43 + drivers/acpi/nfit/nfit.h for the set of supported command sets. 
44 44 45 45 46 46 Key Cycle Dates ··· 48 48 New submissions can be sent at any time, but if they intend to hit the 49 49 next merge window they should be sent before -rc4, and ideally 50 50 stabilized in the libnvdimm-for-next branch by -rc6. Of course if a 51 - patch set requires more than 2 weeks of review -rc4 is already too late 51 + patch set requires more than 2 weeks of review, -rc4 is already too late 52 52 and some patches may require multiple development cycles to review. 53 53 54 54

Documentation/percpu-rw-semaphore.txt Documentation/locking/percpu-rw-semaphore.rst

Documentation/pi-futex.txt Documentation/locking/pi-futex.rst

+2

Documentation/powerpc/cxl.rst

···
133 133 ========
134 134
135 135 1. AFU character devices
136 + ^^^^^^^^^^^^^^^^^^^^^^^^
136 137
137 138 For AFUs operating in AFU directed mode, two character device
138 139 files will be created. /dev/cxl/afu0.0m will correspond to a
···
396 395
397 396
398 397 2. Card character device (powerVM guest only)
398 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
399 399
400 400 In a powerVM guest, an extra character device is created for the
401 401 card. The device is only used to write (flash) a new image on the

+1 -1

Documentation/powerpc/firmware-assisted-dump.rst

···
344 344
345 345
346 346 NOTE:
347 - Please refer to Documentation/filesystems/debugfs.txt on
347 + Please refer to Documentation/filesystems/debugfs.rst on
348 348 how to mount the debugfs filesystem.
349 349
350 350

Documentation/preempt-locking.txt Documentation/locking/preempt-locking.rst

+1 -1

Documentation/process/adding-syscalls.rst

···
33 33 to a somewhat opaque API.
34 34
35 35 - If you're just exposing runtime system information, a new node in sysfs
36 - (see ``Documentation/filesystems/sysfs.txt``) or the ``/proc`` filesystem may
36 + (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
37 37 be more appropriate. However, access to these mechanisms requires that the
38 38 relevant filesystem is mounted, which might not always be the case (e.g.
39 39 in a namespaced/sandboxed/chrooted environment). Avoid adding any API to

+1

Documentation/process/index.rst

···
61 61 botching-up-ioctls
62 62 clang-format
63 63 ../riscv/patch-acceptance
64 + unaligned-memory-access
64 65
65 66 .. only:: subproject and html
66 67

+1 -1

Documentation/process/submit-checklist.rst

···
107 107 and why.
108 108
109 109 26) If any ioctl's are added by the patch, then also update
110 - ``Documentation/ioctl/ioctl-number.rst``.
110 + ``Documentation/userspace-api/ioctl/ioctl-number.rst``.
111 111
112 112 27) If your modified source code depends on or uses any of the kernel
113 113 APIs or features that are related to the following ``Kconfig`` symbols,

Documentation/rbtree.txt Documentation/core-api/rbtree.rst

Documentation/robust-futex-ABI.txt Documentation/locking/robust-futex-ABI.rst

Documentation/robust-futexes.txt Documentation/locking/robust-futexes.rst

+1 -1

Documentation/s390/vfio-ap.rst

···
484 484 05.00ff CEX5A Accelerator
485 485 =========== ===== ============
486 486
487 - Guest2
487 + Guest3
488 488 ------
489 489 =========== ===== ============
490 490 CARD.DOMAIN TYPE MODE

+6 -4

Documentation/scheduler/sched-domains.rst

···
19 19 Each scheduling domain must have one or more CPU groups (struct sched_group)
20 20 which are organised as a circular one way linked list from the ->groups
21 21 pointer. The union of cpumasks of these groups MUST be the same as the
22 - domain's span. The intersection of cpumasks from any two of these groups
23 - MUST be the empty set. The group pointed to by the ->groups pointer MUST
24 - contain the CPU to which the domain belongs. Groups may be shared among
25 - CPUs as they contain read only data after they have been set up.
22 + domain's span. The group pointed to by the ->groups pointer MUST contain the CPU
23 + to which the domain belongs. Groups may be shared among CPUs as they contain
24 + read only data after they have been set up. The intersection of cpumasks from
25 + any two of these groups may be non empty. If this is the case the SD_OVERLAP
26 + flag is set on the corresponding scheduling domain and its groups may not be
27 + shared between CPUs.
26 28
27 29 Balancing within a sched domain occurs between groups. That is, each group
28 30 is treated as one entity. The load of a group is defined as the sum of the
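The two group invariants described in the hunk above can be illustrated with a small user-space sketch. This is not kernel code: it assumes a simplified model in which a cpumask is a single 64-bit word, and the names `cpumask_model_t`, `groups_cover_span` and `groups_overlap` are hypothetical helpers introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* True when the union of all group masks equals the domain's span. */
bool groups_cover_span(const cpumask_model_t *groups, size_t n,
                       cpumask_model_t span)
{
    cpumask_model_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc |= groups[i];
    return acc == span;
}

/* True when any two groups share a CPU (the case SD_OVERLAP describes). */
bool groups_overlap(const cpumask_model_t *groups, size_t n)
{
    cpumask_model_t seen = 0;

    for (size_t i = 0; i < n; i++) {
        if (groups[i] & seen)
            return true;
        seen |= groups[i];
    }
    return false;
}
```

In this model, groups `{0x3, 0xC}` cover a span of `0xF` without overlapping, while `{0x7, 0xC}` cover the same span but share CPU 2, which is the situation the new text associates with SD_OVERLAP.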

+1

Documentation/security/index.rst

···
15 15 self-protection
16 16 siphash
17 17 tpm/index
18 + digsig

+62 -132

Documentation/security/lsm.rst

···
35 35 migrating the Linux capabilities code into such a module.
36 36
37 37 The Linux Security Modules (LSM) project was started by WireX to develop
38 - such a framework. LSM is a joint development effort by several security
38 + such a framework. LSM was a joint development effort by several security
39 39 projects, including Immunix, SELinux, SGI and Janus, and several
40 40 individuals, including Greg Kroah-Hartman and James Morris, to develop a
41 - Linux kernel patch that implements this framework. The patch is
42 - currently tracking the 2.4 series and is targeted for integration into
43 - the 2.5 development series. This technical report provides an overview
44 - of the framework and the example capabilities security module provided
45 - by the LSM kernel patch.
41 + Linux kernel patch that implements this framework. The work was
42 + incorporated in the mainstream in December of 2003. This technical
43 + report provides an overview of the framework and the capabilities
44 + security module.
46 45
47 46 LSM Framework
48 47 =============
49 48
50 - The LSM kernel patch provides a general kernel framework to support
49 + The LSM framework provides a general kernel framework to support
51 50 security modules. In particular, the LSM framework is primarily focused
52 51 on supporting access control modules, although future development is
53 - likely to address other security needs such as auditing. By itself, the
52 + likely to address other security needs such as sandboxing. By itself, the
54 53 framework does not provide any additional security; it merely provides
55 - the infrastructure to support security modules. The LSM kernel patch
56 - also moves most of the capabilities logic into an optional security
57 - module, with the system defaulting to the traditional superuser logic.
54 + the infrastructure to support security modules. The LSM framework is
55 + optional, requiring `CONFIG_SECURITY` to be enabled. The capabilities
56 + logic is implemented as a security module.
58 57 This capabilities module is discussed further in
59 58 `LSM Capabilities Module`_.
60 59
61 - The LSM kernel patch adds security fields to kernel data structures and
62 - inserts calls to hook functions at critical points in the kernel code to
63 - manage the security fields and to perform access control. It also adds
64 - functions for registering and unregistering security modules, and adds a
65 - general :c:func:`security()` system call to support new system calls
66 - for security-aware applications.
60 + The LSM framework includes security fields in kernel data structures and
61 + calls to hook functions at critical points in the kernel code to
62 + manage the security fields and to perform access control.
63 + It also adds functions for registering security modules.
64 + An interface `/sys/kernel/security/lsm` reports a comma separated list
65 + of security modules that are active on the system.
67 66
68 - The LSM security fields are simply ``void*`` pointers. For process and
69 - program execution security information, security fields were added to
67 + The LSM security fields are simply ``void*`` pointers.
68 + The data is referred to as a blob, which may be managed by
69 + the framework or by the individual security modules that use it.
70 + Security blobs that are used by more than one security module are
71 + typically managed by the framework.
72 + For process and
73 + program execution security information, security fields are included in
70 74 :c:type:`struct task_struct <task_struct>` and
71 - :c:type:`struct linux_binprm <linux_binprm>`. For filesystem
72 - security information, a security field was added to :c:type:`struct
75 + :c:type:`struct cred <cred>`.
76 + For filesystem
77 + security information, a security field is included in :c:type:`struct
73 78 super_block <super_block>`. For pipe, file, and socket security
74 - information, security fields were added to :c:type:`struct inode
75 - <inode>` and :c:type:`struct file <file>`. For packet and
76 - network device security information, security fields were added to
77 - :c:type:`struct sk_buff <sk_buff>` and :c:type:`struct
78 - net_device <net_device>`. For System V IPC security information,
79 + information, security fields are included in :c:type:`struct inode
80 + <inode>` and :c:type:`struct file <file>`.
81 + For System V IPC security information,
79 82 security fields were added to :c:type:`struct kern_ipc_perm
80 83 <kern_ipc_perm>` and :c:type:`struct msg_msg
81 84 <msg_msg>`; additionally, the definitions for :c:type:`struct
···
87 84 ``include/linux/shm.h`` as appropriate) to allow the security modules to
88 85 use these definitions.
89 86
90 - Each LSM hook is a function pointer in a global table, security_ops.
91 - This table is a :c:type:`struct security_operations
92 - <security_operations>` structure as defined by
93 - ``include/linux/security.h``. Detailed documentation for each hook is
94 - included in this header file. At present, this structure consists of a
95 - collection of substructures that group related hooks based on the kernel
96 - object (e.g. task, inode, file, sk_buff, etc) as well as some top-level
97 - hook function pointers for system operations. This structure is likely
98 - to be flattened in the future for performance. The placement of the hook
99 - calls in the kernel code is described by the "called:" lines in the
100 - per-hook documentation in the header file. The hook calls can also be
101 - easily found in the kernel code by looking for the string
102 - "security_ops->".
87 + For packet and
88 + network device security information, security fields were added to
89 + :c:type:`struct sk_buff <sk_buff>` and
90 + :c:type:`struct scm_cookie <scm_cookie>`.
91 + Unlike the other security module data, the data used here is a
92 + 32-bit integer. The security modules are required to map or otherwise
93 + associate these values with real security attributes.
103 94
104 - Linus mentioned per-process security hooks in his original remarks as a
105 - possible alternative to global security hooks. However, if LSM were to
106 - start from the perspective of per-process hooks, then the base framework
107 - would have to deal with how to handle operations that involve multiple
108 - processes (e.g. kill), since each process might have its own hook for
109 - controlling the operation. This would require a general mechanism for
110 - composing hooks in the base framework. Additionally, LSM would still
111 - need global hooks for operations that have no process context (e.g.
112 - network input operations). Consequently, LSM provides global security
113 - hooks, but a security module is free to implement per-process hooks
114 - (where that makes sense) by storing a security_ops table in each
115 - process' security field and then invoking these per-process hooks from
116 - the global hooks. The problem of composition is thus deferred to the
117 - module.
95 + LSM hooks are maintained in lists. A list is maintained for each
96 + hook, and the hooks are called in the order specified by CONFIG_LSM.
97 + Detailed documentation for each hook is
98 + included in the `include/linux/lsm_hooks.h` header file.
118 99
119 - The global security_ops table is initialized to a set of hook functions
120 - provided by a dummy security module that provides traditional superuser
121 - logic. A :c:func:`register_security()` function (in
122 - ``security/security.c``) is provided to allow a security module to set
123 - security_ops to refer to its own hook functions, and an
124 - :c:func:`unregister_security()` function is provided to revert
125 - security_ops to the dummy module hooks. This mechanism is used to set
126 - the primary security module, which is responsible for making the final
127 - decision for each hook.
100 + The LSM framework provides for a close approximation of
101 + general security module stacking. It defines
102 + security_add_hooks() to which each security module passes a
103 + :c:type:`struct security_hooks_list <security_hooks_list>`,
104 + which are added to the lists.
105 + The LSM framework does not provide a mechanism for removing hooks that
106 + have been registered. The SELinux security module has implemented
107 + a way to remove itself, however the feature has been deprecated.
128 108
129 - LSM also provides a simple mechanism for stacking additional security
130 - modules with the primary security module. It defines
131 - :c:func:`register_security()` and
132 - :c:func:`unregister_security()` hooks in the :c:type:`struct
133 - security_operations <security_operations>` structure and
134 - provides :c:func:`mod_reg_security()` and
135 - :c:func:`mod_unreg_security()` functions that invoke these hooks
136 - after performing some sanity checking. A security module can call these
137 - functions in order to stack with other modules. However, the actual
138 - details of how this stacking is handled are deferred to the module,
139 - which can implement these hooks in any way it wishes (including always
140 - returning an error if it does not wish to support stacking). In this
141 - manner, LSM again defers the problem of composition to the module.
142 -
143 - Although the LSM hooks are organized into substructures based on kernel
144 - object, all of the hooks can be viewed as falling into two major
109 + The hooks can be viewed as falling into two major
145 110 categories: hooks that are used to manage the security fields and hooks
146 111 that are used to perform access control. Examples of the first category
147 - of hooks include the :c:func:`alloc_security()` and
148 - :c:func:`free_security()` hooks defined for each kernel data
149 - structure that has a security field. These hooks are used to allocate
150 - and free security structures for kernel objects. The first category of
151 - hooks also includes hooks that set information in the security field
152 - after allocation, such as the :c:func:`post_lookup()` hook in
153 - :c:type:`struct inode_security_ops <inode_security_ops>`.
154 - This hook is used to set security information for inodes after
155 - successful lookup operations. An example of the second category of hooks
156 - is the :c:func:`permission()` hook in :c:type:`struct
157 - inode_security_ops <inode_security_ops>`. This hook checks
158 - permission when accessing an inode.
112 + of hooks include the security_inode_alloc() and security_inode_free()
113 + These hooks are used to allocate
114 + and free security structures for inode objects.
115 + An example of the second category of hooks
116 + is the security_inode_permission() hook.
117 + This hook checks permission when accessing an inode.
159 118
160 119 LSM Capabilities Module
161 120 =======================
162 121
163 - The LSM kernel patch moves most of the existing POSIX.1e capabilities
164 - logic into an optional security module stored in the file
165 - ``security/capability.c``. This change allows users who do not want to
166 - use capabilities to omit this code entirely from their kernel, instead
167 - using the dummy module for traditional superuser logic or any other
168 - module that they desire. This change also allows the developers of the
169 - capabilities logic to maintain and enhance their code more freely,
170 - without needing to integrate patches back into the base kernel.
171 -
172 - In addition to moving the capabilities logic, the LSM kernel patch could
173 - move the capability-related fields from the kernel data structures into
174 - the new security fields managed by the security modules. However, at
175 - present, the LSM kernel patch leaves the capability fields in the kernel
176 - data structures. In his original remarks, Linus suggested that this
177 - might be preferable so that other security modules can be easily stacked
178 - with the capabilities module without needing to chain multiple security
179 - structures on the security field. It also avoids imposing extra overhead
180 - on the capabilities module to manage the security fields. However, the
181 - LSM framework could certainly support such a move if it is determined to
182 - be desirable, with only a few additional changes described below.
183 -
184 - At present, the capabilities logic for computing process capabilities on
185 - :c:func:`execve()` and :c:func:`set\*uid()`, checking
186 - capabilities for a particular process, saving and checking capabilities
187 - for netlink messages, and handling the :c:func:`capget()` and
188 - :c:func:`capset()` system calls have been moved into the
189 - capabilities module. There are still a few locations in the base kernel
190 - where capability-related fields are directly examined or modified, but
191 - the current version of the LSM patch does allow a security module to
192 - completely replace the assignment and testing of capabilities. These few
193 - locations would need to be changed if the capability-related fields were
194 - moved into the security field. The following is a list of known
195 - locations that still perform such direct examination or modification of
196 - capability-related fields:
197 -
198 - - ``fs/open.c``::c:func:`sys_access()`
199 -
200 - - ``fs/lockd/host.c``::c:func:`nlm_bind_host()`
201 -
202 - - ``fs/nfsd/auth.c``::c:func:`nfsd_setuser()`
203 -
204 - - ``fs/proc/array.c``::c:func:`task_cap()`
122 + The POSIX.1e capabilities logic is maintained as a security module
123 + stored in the file ``security/commoncap.c``. The capabilities
124 + module uses the order field of the :c:type:`lsm_info` description
125 + to identify it as the first security module to be registered.
126 + The capabilities security module does not use the general security
127 + blobs, unlike other modules. The reasons are historical and are
128 + based on overhead, complexity and performance concerns.
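The per-hook list that the rewritten lsm.rst text describes can be modelled in user space. The sketch below is an illustration only, not the kernel's implementation: `struct hook_entry`, `struct hook_list`, `hook_list_add()` and `hook_list_call()` are hypothetical names, and the convention that zero means "allow" and the walk stops at the first non-zero result is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model of one hook list; not the kernel's types. */
typedef int (*inode_permission_fn)(int inode_id, int mask);

struct hook_entry {
    inode_permission_fn fn;
    struct hook_entry *next;
};

struct hook_list {
    struct hook_entry *head;
    struct hook_entry *tail;
};

/* Append a hook, so call order matches registration order. */
void hook_list_add(struct hook_list *list, struct hook_entry *entry)
{
    entry->next = NULL;
    if (list->tail)
        list->tail->next = entry;
    else
        list->head = entry;
    list->tail = entry;
}

/*
 * Walk the list in order and return the first non-zero result.
 * Assumption of this sketch: zero means "allow", non-zero is an error.
 */
int hook_list_call(const struct hook_list *list, int inode_id, int mask)
{
    for (const struct hook_entry *e = list->head; e; e = e->next) {
        int rc = e->fn(inode_id, mask);

        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Two toy modules: one allows everything, one denies writes to inode 7. */
int allow_all(int inode_id, int mask)
{
    (void)inode_id;
    (void)mask;
    return 0;
}

int deny_write_on_7(int inode_id, int mask)
{
    /* Bit value 2 stands in for a "write" request in this model. */
    return (inode_id == 7 && (mask & 2)) ? -EACCES : 0;
}
```

Registering `allow_all` and then `deny_write_on_7` on the same list shows the stacking idea in miniature: every module is consulted in registration order, and any one of them can refuse the access.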

+1 -1

Documentation/sphinx/requirements.txt

···
1 1 docutils
2 - Sphinx==1.7.9
2 + Sphinx==2.4.4
3 3 sphinx_rtd_theme

+1

Documentation/trace/coresight/coresight-ect.rst

···
1 1 .. SPDX-License-Identifier: GPL-2.0
2 +
2 3 =============================================
3 4 CoreSight Embedded Cross Trigger (CTI & CTM).
4 5 =============================================

+14 -14

Documentation/trace/events.rst

···
527 527
528 528 See Documentation/trace/histogram.rst for details and examples.
529 529
530 - 6.3 In-kernel trace event API
531 - -----------------------------
530 + 7. In-kernel trace event API
531 + ============================
532 532
533 533 In most cases, the command-line interface to trace events is more than
534 534 sufficient. Sometimes, however, applications might find the need for
···
560 560 - tracing synthetic events from in-kernel code
561 561 - the low-level "dynevent_cmd" API
562 562
563 - 6.3.1 Dyamically creating synthetic event definitions
564 - -----------------------------------------------------
563 + 7.1 Dyamically creating synthetic event definitions
564 + ---------------------------------------------------
565 565
566 566 There are a couple ways to create a new synthetic event from a kernel
567 567 module or other kernel code.
···
666 666 At this point, the event object is ready to be used for tracing new
667 667 events.
668 668
669 - 6.3.3 Tracing synthetic events from in-kernel code
670 - --------------------------------------------------
669 + 7.2 Tracing synthetic events from in-kernel code
670 + ------------------------------------------------
671 671
672 672 To trace a synthetic event, there are several options. The first
673 673 option is to trace the event in one call, using synth_event_trace()
···
678 678 synth_event_add_next_val() or synth_event_add_val() to add the values
679 679 piecewise.
680 680
681 - 6.3.3.1 Tracing a synthetic event all at once
682 - ---------------------------------------------
681 + 7.2.1 Tracing a synthetic event all at once
682 + -------------------------------------------
683 683
684 684 To trace a synthetic event all at once, the synth_event_trace() or
685 685 synth_event_trace_array() functions can be used.
···
780 780
781 781 ret = synth_event_delete("schedtest");
782 782
783 - 6.3.3.1 Tracing a synthetic event piecewise
784 - -------------------------------------------
783 + 7.2.2 Tracing a synthetic event piecewise
784 + -----------------------------------------
785 785
786 786 To trace a synthetic using the piecewise method described above, the
787 787 synth_event_trace_start() function is used to 'open' the synthetic
···
864 864 of whether any of the add calls failed (say due to a bad field name
865 865 being passed in).
866 866
867 - 6.3.4 Dyamically creating kprobe and kretprobe event definitions
868 - ----------------------------------------------------------------
867 + 7.3 Dyamically creating kprobe and kretprobe event definitions
868 + --------------------------------------------------------------
869 869
870 870 To create a kprobe or kretprobe trace event from kernel code, the
871 871 kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
···
941 941
942 942 ret = kprobe_event_delete("gen_kprobe_test");
943 943
944 - 6.3.4 The "dynevent_cmd" low-level API
945 - --------------------------------------
944 + 7.4 The "dynevent_cmd" low-level API
945 + ------------------------------------
946 946
947 947 Both the in-kernel synthetic event and kprobe interfaces are built on
948 948 top of a lower-level "dynevent_cmd" interface. This interface is

+16 -9

Documentation/translations/it_IT/doc-guide/kernel-doc.rst

··· 515 515 .. kernel-doc:: drivers/gpu/drm/i915/intel_audio.c 516 516 :internal: 517 517 518 + identifiers: *[ function/type ...]* 519 + Include la documentazione per ogni *function* e *type* in *source*. 520 + Se non vengono esplicitamente specificate le funzioni da includere, allora 521 + verranno incluse tutte quelle disponibili in *source*. 522 + 523 + Esempi:: 524 + 525 + .. kernel-doc:: lib/bitmap.c 526 + :identifiers: bitmap_parselist bitmap_parselist_user 527 + 528 + .. kernel-doc:: lib/idr.c 529 + :identifiers: 530 + 531 + functions: *[ function ...]* 532 + Questo è uno pseudonimo, deprecato, per la direttiva 'identifiers'. 533 + 518 534 doc: *title* 519 535 Include la documentazione del paragrafo ``DOC:`` identificato dal titolo 520 536 (*title*) all'interno del file sorgente (*source*). Gli spazi in *title* sono ··· 543 527 544 528 .. kernel-doc:: drivers/gpu/drm/i915/intel_audio.c 545 529 :doc: High Definition Audio over HDMI and Display Port 546 - 547 - functions: *function* *[...]* 548 - Dal file sorgente (*source*) include la documentazione per le funzioni 549 - elencate (*function*). 550 - 551 - Esempio:: 552 - 553 - .. kernel-doc:: lib/bitmap.c 554 - :functions: bitmap_parselist bitmap_parselist_user 555 530 556 531 Senza alcuna opzione, la direttiva kernel-doc include tutti i commenti di 557 532 documentazione presenti nel file sorgente (*source*).

+18

Documentation/translations/it_IT/kernel-hacking/hacking.rst

··· 627 627 :c:func:`EXPORT_SYMBOL_GPL()` quando si aggiungono nuove funzionalità o 628 628 interfacce. 629 629 630 + :c:func:`EXPORT_SYMBOL_NS()` 631 + ---------------------------- 632 + 633 + Definita in ``include/linux/export.h`` 634 + 635 + Questa è una variate di `EXPORT_SYMBOL()` che permette di specificare uno 636 + spazio dei nomi. Lo spazio dei nomi è documentato in 637 + :doc:`../core-api/symbol-namespaces` 638 + 639 + :c:func:`EXPORT_SYMBOL_NS_GPL()` 640 + -------------------------------- 641 + 642 + Definita in ``include/linux/export.h`` 643 + 644 + Questa è una variate di `EXPORT_SYMBOL_GPL()` che permette di specificare uno 645 + spazio dei nomi. Lo spazio dei nomi è documentato in 646 + :doc:`../core-api/symbol-namespaces` 647 + 630 648 Procedure e convenzioni 631 649 ======================= 632 650

+86 -86

Documentation/translations/it_IT/kernel-hacking/locking.rst

··· 159 159 Se avete una struttura dati che verrà utilizzata solo dal contesto utente, 160 160 allora, per proteggerla, potete utilizzare un semplice mutex 161 161 (``include/linux/mutex.h``). Questo è il caso più semplice: inizializzate il 162 - mutex; invocate :c:func:`mutex_lock_interruptible()` per trattenerlo e 163 - :c:func:`mutex_unlock()` per rilasciarlo. C'è anche :c:func:`mutex_lock()` 162 + mutex; invocate mutex_lock_interruptible() per trattenerlo e 163 + mutex_unlock() per rilasciarlo. C'è anche mutex_lock() 164 164 ma questa dovrebbe essere evitata perché non ritorna in caso di segnali. 165 165 166 166 Per esempio: ``net/netfilter/nf_sockopt.c`` permette la registrazione 167 - di nuove chiamate per :c:func:`setsockopt()` e :c:func:`getsockopt()` 168 - usando la funzione :c:func:`nf_register_sockopt()`. La registrazione e 167 + di nuove chiamate per setsockopt() e getsockopt() 168 + usando la funzione nf_register_sockopt(). La registrazione e 169 169 la rimozione vengono eseguite solamente quando il modulo viene caricato 170 170 o scaricato (e durante l'avvio del sistema, qui non abbiamo concorrenza), 171 171 e la lista delle funzioni registrate viene consultata solamente quando 172 - :c:func:`setsockopt()` o :c:func:`getsockopt()` sono sconosciute al sistema. 172 + setsockopt() o getsockopt() sono sconosciute al sistema. 173 173 In questo caso ``nf_sockopt_mutex`` è perfetto allo scopo, in particolar modo 174 174 visto che setsockopt e getsockopt potrebbero dormire. 175 175 ··· 179 179 Se un softirq condivide dati col contesto utente, avete due problemi. 180 180 Primo, il contesto utente corrente potrebbe essere interroto da un softirq, 181 181 e secondo, la sezione critica potrebbe essere eseguita da un altro 182 - processore. Questo è quando :c:func:`spin_lock_bh()` 182 + processore. Questo è quando spin_lock_bh() 183 183 (``include/linux/spinlock.h``) viene utilizzato. Questo disabilita i softirq 184 - sul processore e trattiene il *lock*. 
Invece, :c:func:`spin_unlock_bh()` fa 184 + sul processore e trattiene il *lock*. Invece, spin_unlock_bh() fa 185 185 l'opposto. (Il suffisso '_bh' è un residuo storico che fa riferimento al 186 186 "Bottom Halves", il vecchio nome delle interruzioni software. In un mondo 187 187 perfetto questa funzione si chiamerebbe 'spin_lock_softirq()'). 188 188 189 - Da notare che in questo caso potete utilizzare anche :c:func:`spin_lock_irq()` 190 - o :c:func:`spin_lock_irqsave()`, queste fermano anche le interruzioni hardware: 189 + Da notare che in questo caso potete utilizzare anche spin_lock_irq() 190 + o spin_lock_irqsave(), queste fermano anche le interruzioni hardware: 191 191 vedere :ref:`Contesto di interruzione hardware <it_hardirq-context>`. 192 192 193 193 Questo funziona alla perfezione anche sui sistemi monoprocessore: gli spinlock 194 - svaniscono e questa macro diventa semplicemente :c:func:`local_bh_disable()` 194 + svaniscono e questa macro diventa semplicemente local_bh_disable() 195 195 (``include/linux/interrupt.h``), la quale impedisce ai softirq d'essere 196 196 eseguiti. 197 197 ··· 224 224 ~~~~~~~~~~~~~~~~~~~~~~~~ 225 225 226 226 Se un altro tasklet/timer vuole condividere dati col vostro tasklet o timer, 227 - allora avrete bisogno entrambe di :c:func:`spin_lock()` e 228 - :c:func:`spin_unlock()`. Qui :c:func:`spin_lock_bh()` è inutile, siete già 227 + allora avrete bisogno entrambe di spin_lock() e 228 + spin_unlock(). Qui spin_lock_bh() è inutile, siete già 229 229 in un tasklet ed avete la garanzia che nessun altro verrà eseguito sullo 230 230 stesso processore. 231 231 ··· 243 243 fino a questo punto nell'uso dei softirq, probabilmente tenete alla scalabilità 244 244 delle prestazioni abbastanza da giustificarne la complessità aggiuntiva. 245 245 246 - Dovete utilizzare :c:func:`spin_lock()` e :c:func:`spin_unlock()` per 246 + Dovete utilizzare spin_lock() e spin_unlock() per 247 247 proteggere i dati condivisi. 
248 248 249 249 Diversi Softirqs 250 250 ~~~~~~~~~~~~~~~~ 251 251 252 - Dovete utilizzare :c:func:`spin_lock()` e :c:func:`spin_unlock()` per 252 + Dovete utilizzare spin_lock() e spin_unlock() per 253 253 proteggere i dati condivisi, che siano timer, tasklet, diversi softirq o 254 254 lo stesso o altri softirq: uno qualsiasi di essi potrebbe essere in esecuzione 255 255 su un diverso processore. ··· 270 270 avrete due preoccupazioni. Primo, il softirq può essere interrotto da 271 271 un'interruzione hardware, e secondo, la sezione critica potrebbe essere 272 272 eseguita da un'interruzione hardware su un processore diverso. Questo è il caso 273 - dove :c:func:`spin_lock_irq()` viene utilizzato. Disabilita le interruzioni 274 - sul processore che l'esegue, poi trattiene il lock. :c:func:`spin_unlock_irq()` 273 + dove spin_lock_irq() viene utilizzato. Disabilita le interruzioni 274 + sul processore che l'esegue, poi trattiene il lock. spin_unlock_irq() 275 275 fa l'opposto. 276 276 277 - Il gestore d'interruzione hardware non usa :c:func:`spin_lock_irq()` perché 278 - i softirq non possono essere eseguiti quando il gestore d'interruzione hardware 279 - è in esecuzione: per questo si può usare :c:func:`spin_lock()`, che è un po' 277 + Il gestore d'interruzione hardware non ha bisogno di usare spin_lock_irq() 278 + perché i softirq non possono essere eseguiti quando il gestore d'interruzione 279 + hardware è in esecuzione: per questo si può usare spin_lock(), che è un po' 280 280 più veloce. L'unica eccezione è quando un altro gestore d'interruzioni 281 - hardware utilizza lo stesso *lock*: :c:func:`spin_lock_irq()` impedirà a questo 281 + hardware utilizza lo stesso *lock*: spin_lock_irq() impedirà a questo 282 282 secondo gestore di interrompere quello in esecuzione. 
283 283 284 284 Questo funziona alla perfezione anche sui sistemi monoprocessore: gli spinlock 285 - svaniscono e questa macro diventa semplicemente :c:func:`local_irq_disable()` 285 + svaniscono e questa macro diventa semplicemente local_irq_disable() 286 286 (``include/asm/smp.h``), la quale impedisce a softirq/tasklet/BH d'essere 287 287 eseguiti. 288 288 289 - :c:func:`spin_lock_irqsave()` (``include/linux/spinlock.h``) è una variante che 289 + spin_lock_irqsave() (``include/linux/spinlock.h``) è una variante che 290 290 salva lo stato delle interruzioni in una variabile, questa verrà poi passata 291 - a :c:func:`spin_unlock_irqrestore()`. Questo significa che lo stesso codice 291 + a spin_unlock_irqrestore(). Questo significa che lo stesso codice 292 292 potrà essere utilizzato in un'interruzione hardware (dove le interruzioni sono 293 293 già disabilitate) e in un softirq (dove la disabilitazione delle interruzioni 294 294 è richiesta). 295 295 296 296 Da notare che i softirq (e quindi tasklet e timer) sono eseguiti al ritorno 297 - da un'interruzione hardware, quindi :c:func:`spin_lock_irq()` interrompe 297 + da un'interruzione hardware, quindi spin_lock_irq() interrompe 298 298 anche questi. Tenuto conto di questo si può dire che 299 - :c:func:`spin_lock_irqsave()` è la funzione di sincronizzazione più generica 299 + spin_lock_irqsave() è la funzione di sincronizzazione più generica 300 300 e potente. 301 301 302 302 Sincronizzazione fra due gestori d'interruzioni hardware 303 303 -------------------------------------------------------- 304 304 305 305 Condividere dati fra due gestori di interruzione hardware è molto raro, ma se 306 - succede, dovreste usare :c:func:`spin_lock_irqsave()`: è una specificità 306 + succede, dovreste usare spin_lock_irqsave(): è una specificità 307 307 dell'architettura il fatto che tutte le interruzioni vengano interrotte 308 308 quando si eseguono di gestori di interruzioni. 
309 309 ··· 317 317 il mutex e dormire (``copy_from_user*(`` o ``kmalloc(x,GFP_KERNEL)``). 318 318 319 319 - Altrimenti (== i dati possono essere manipolati da un'interruzione) usate 320 - :c:func:`spin_lock_irqsave()` e :c:func:`spin_unlock_irqrestore()`. 320 + spin_lock_irqsave() e spin_unlock_irqrestore(). 321 321 322 322 - Evitate di trattenere uno spinlock per più di 5 righe di codice incluse 323 323 le chiamate a funzione (ad eccezione di quell per l'accesso come 324 - :c:func:`readb()`). 324 + readb()). 325 325 326 326 Tabella dei requisiti minimi 327 327 ---------------------------- ··· 334 334 la sincronizzazione è necessaria). 335 335 336 336 Ricordatevi il suggerimento qui sopra: potete sempre usare 337 - :c:func:`spin_lock_irqsave()`, che è un sovrainsieme di tutte le altre funzioni 337 + spin_lock_irqsave(), che è un sovrainsieme di tutte le altre funzioni 338 338 per spinlock. 339 339 340 340 ============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== ··· 378 378 trattenendo il *lock*. Potrete acquisire il *lock* più tardi se vi 379 379 serve accedere ai dati protetti da questo *lock*. 380 380 381 - La funzione :c:func:`spin_trylock()` non ritenta di acquisire il *lock*, 381 + La funzione spin_trylock() non ritenta di acquisire il *lock*, 382 382 se ci riesce al primo colpo ritorna un valore diverso da zero, altrimenti 383 383 se fallisce ritorna 0. Questa funzione può essere utilizzata in un qualunque 384 - contesto, ma come :c:func:`spin_lock()`: dovete disabilitare i contesti che 384 + contesto, ma come spin_lock(): dovete disabilitare i contesti che 385 385 potrebbero interrompervi e quindi trattenere lo spinlock. 
386 386 387 - La funzione :c:func:`mutex_trylock()` invece di sospendere il vostro processo 387 + La funzione mutex_trylock() invece di sospendere il vostro processo 388 388 ritorna un valore diverso da zero se è possibile trattenere il lock al primo 389 389 colpo, altrimenti se fallisce ritorna 0. Nonostante non dorma, questa funzione 390 390 non può essere usata in modo sicuro in contesti di interruzione hardware o ··· 506 506 caso è semplice dato che copiamo i dati dall'utente e non permettiamo 507 507 mai loro di accedere direttamente agli oggetti. 508 508 509 - C'è una piccola ottimizzazione qui: nella funzione :c:func:`cache_add()` 509 + C'è una piccola ottimizzazione qui: nella funzione cache_add() 510 510 impostiamo i campi dell'oggetto prima di acquisire il *lock*. Questo è 511 511 sicuro perché nessun altro potrà accedervi finché non lo inseriremo 512 512 nella memoria. ··· 514 514 Accesso dal contesto utente 515 515 --------------------------- 516 516 517 - Ora consideriamo il caso in cui :c:func:`cache_find()` può essere invocata 517 + Ora consideriamo il caso in cui cache_find() può essere invocata 518 518 dal contesto d'interruzione: sia hardware che software. Un esempio potrebbe 519 519 essere un timer che elimina oggetti dalla memoria. 520 520 ··· 583 583 return ret; 584 584 } 585 585 586 - Da notare che :c:func:`spin_lock_irqsave()` disabiliterà le interruzioni 586 + Da notare che spin_lock_irqsave() disabiliterà le interruzioni 587 587 se erano attive, altrimenti non farà niente (quando siamo già in un contesto 588 588 d'interruzione); dunque queste funzioni possono essere chiamante in 589 589 sicurezza da qualsiasi contesto. 590 590 591 - Sfortunatamente, :c:func:`cache_add()` invoca :c:func:`kmalloc()` con 591 + Sfortunatamente, cache_add() invoca kmalloc() con 592 592 l'opzione ``GFP_KERNEL`` che è permessa solo in contesto utente. 
Ho supposto 593 - che :c:func:`cache_add()` venga chiamata dal contesto utente, altrimenti 594 - questa opzione deve diventare un parametro di :c:func:`cache_add()`. 593 + che cache_add() venga chiamata dal contesto utente, altrimenti 594 + questa opzione deve diventare un parametro di cache_add(). 595 595 596 596 Esporre gli oggetti al di fuori del file 597 597 ---------------------------------------- ··· 610 610 mantiene un puntatore ad un oggetto, presumibilmente si aspetta che questo 611 611 puntatore rimanga valido. Sfortunatamente, questo è garantito solo mentre 612 612 si trattiene il *lock*, altrimenti qualcuno potrebbe chiamare 613 - :c:func:`cache_delete()` o peggio, aggiungere un oggetto che riutilizza lo 613 + cache_delete() o peggio, aggiungere un oggetto che riutilizza lo 614 614 stesso indirizzo. 615 615 616 616 Dato che c'è un solo *lock*, non potete trattenerlo a vita: altrimenti ··· 710 710 } 711 711 712 712 Abbiamo incapsulato il contatore di riferimenti nelle tipiche funzioni 713 - di 'get' e 'put'. Ora possiamo ritornare l'oggetto da :c:func:`cache_find()` 713 + di 'get' e 'put'. Ora possiamo ritornare l'oggetto da cache_find() 714 714 col vantaggio che l'utente può dormire trattenendo l'oggetto (per esempio, 715 - :c:func:`copy_to_user()` per copiare il nome verso lo spazio utente). 715 + copy_to_user() per copiare il nome verso lo spazio utente). 716 716 717 717 Un altro punto da notare è che ho detto che il contatore dovrebbe incrementarsi 718 718 per ogni puntatore ad un oggetto: quindi il contatore di riferimenti è 1 ··· 727 727 in ``include/asm/atomic.h``: queste sono garantite come atomiche su qualsiasi 728 728 processore del sistema, quindi non sono necessari i *lock*. In questo caso è 729 729 più semplice rispetto all'uso degli spinlock, benché l'uso degli spinlock 730 - sia più elegante per casi non banali. 
Le funzioni :c:func:`atomic_inc()` e 731 - :c:func:`atomic_dec_and_test()` vengono usate al posto dei tipici operatori di 730 + sia più elegante per casi non banali. Le funzioni atomic_inc() e 731 + atomic_dec_and_test() vengono usate al posto dei tipici operatori di 732 732 incremento e decremento, e i *lock* non sono più necessari per proteggere il 733 733 contatore stesso. 734 734 ··· 820 820 - Si può togliere static da ``cache_lock`` e dire agli utenti che devono 821 821 trattenere il *lock* prima di modificare il nome di un oggetto. 822 822 823 - - Si può fornire una funzione :c:func:`cache_obj_rename()` che prende il 823 + - Si può fornire una funzione cache_obj_rename() che prende il 824 824 *lock* e cambia il nome per conto del chiamante; si dirà poi agli utenti 825 825 di usare questa funzione. 826 826 ··· 878 878 protetto da ``cache_lock`` piuttosto che dal *lock* dell'oggetto; questo 879 879 perché è logicamente parte dell'infrastruttura (come 880 880 :c:type:`struct list_head <list_head>` nell'oggetto). In questo modo, 881 - in :c:func:`__cache_add()`, non ho bisogno di trattenere il *lock* di ogni 881 + in __cache_add(), non ho bisogno di trattenere il *lock* di ogni 882 882 oggetto mentre si cerca il meno popolare. 883 883 884 884 Ho anche deciso che il campo id è immutabile, quindi non ho bisogno di 885 - trattenere il lock dell'oggetto quando si usa :c:func:`__cache_find()` 885 + trattenere il lock dell'oggetto quando si usa __cache_find() 886 886 per leggere questo campo; il *lock* dell'oggetto è usato solo dal chiamante 887 887 che vuole leggere o scrivere il campo name. 888 888 ··· 907 907 sveglio 5 notti a parlare da solo. 908 908 909 909 Un caso un pochino più complesso; immaginate d'avere una spazio condiviso 910 - fra un softirq ed il contesto utente. Se usate :c:func:`spin_lock()` per 910 + fra un softirq ed il contesto utente. 
Se usate spin_lock() per 911 911 proteggerlo, il contesto utente potrebbe essere interrotto da un softirq 912 912 mentre trattiene il lock, da qui il softirq rimarrà in attesa attiva provando 913 913 ad acquisire il *lock* già trattenuto nel contesto utente. ··· 1006 1006 spin_unlock_bh(&list_lock); 1007 1007 1008 1008 Primo o poi, questo esploderà su un sistema multiprocessore perché un 1009 - temporizzatore potrebbe essere già partiro prima di :c:func:`spin_lock_bh()`, 1010 - e prenderà il *lock* solo dopo :c:func:`spin_unlock_bh()`, e cercherà 1009 + temporizzatore potrebbe essere già partiro prima di spin_lock_bh(), 1010 + e prenderà il *lock* solo dopo spin_unlock_bh(), e cercherà 1011 1011 di eliminare il suo oggetto (che però è già stato eliminato). 1012 1012 1013 1013 Questo può essere evitato controllando il valore di ritorno di 1014 - :c:func:`del_timer()`: se ritorna 1, il temporizzatore è stato già 1014 + del_timer(): se ritorna 1, il temporizzatore è stato già 1015 1015 rimosso. Se 0, significa (in questo caso) che il temporizzatore è in 1016 1016 esecuzione, quindi possiamo fare come segue:: 1017 1017 ··· 1032 1032 spin_unlock_bh(&list_lock); 1033 1033 1034 1034 Un altro problema è l'eliminazione dei temporizzatori che si riavviano 1035 - da soli (chiamando :c:func:`add_timer()` alla fine della loro esecuzione). 1035 + da soli (chiamando add_timer() alla fine della loro esecuzione). 1036 1036 Dato che questo è un problema abbastanza comune con una propensione 1037 - alle corse critiche, dovreste usare :c:func:`del_timer_sync()` 1037 + alle corse critiche, dovreste usare del_timer_sync() 1038 1038 (``include/linux/timer.h``) per gestire questo caso. Questa ritorna il 1039 1039 numero di volte che il temporizzatore è stato interrotto prima che 1040 1040 fosse in grado di fermarlo senza che si riavviasse. 
··· 1116 1116 wmb(); 1117 1117 list->next = new; 1118 1118 1119 - La funzione :c:func:`wmb()` è una barriera di sincronizzazione delle 1119 + La funzione wmb() è una barriera di sincronizzazione delle 1120 1120 scritture. Questa garantisce che la prima operazione (impostare l'elemento 1121 1121 ``next`` del nuovo elemento) venga completata e vista da tutti i processori 1122 1122 prima che venga eseguita la seconda operazione (che sarebbe quella di mettere ··· 1127 1127 il puntatore ``next`` deve puntare al resto della lista. 1128 1128 1129 1129 Fortunatamente, c'è una funzione che fa questa operazione sulle liste 1130 - :c:type:`struct list_head <list_head>`: :c:func:`list_add_rcu()` 1130 + :c:type:`struct list_head <list_head>`: list_add_rcu() 1131 1131 (``include/linux/list.h``). 1132 1132 1133 1133 Rimuovere un elemento dalla lista è anche più facile: sostituiamo il puntatore ··· 1138 1138 1139 1139 list->next = old->next; 1140 1140 1141 - La funzione :c:func:`list_del_rcu()` (``include/linux/list.h``) fa esattamente 1141 + La funzione list_del_rcu() (``include/linux/list.h``) fa esattamente 1142 1142 questo (la versione normale corrompe il vecchio oggetto, e non vogliamo che 1143 1143 accada). 1144 1144 ··· 1146 1146 attraverso il puntatore ``next`` il contenuto dell'elemento successivo 1147 1147 troppo presto, ma non accorgersi che il contenuto caricato è sbagliato quando 1148 1148 il puntatore ``next`` viene modificato alla loro spalle. Ancora una volta 1149 - c'è una funzione che viene in vostro aiuto :c:func:`list_for_each_entry_rcu()` 1149 + c'è una funzione che viene in vostro aiuto list_for_each_entry_rcu() 1150 1150 (``include/linux/list.h``). Ovviamente, gli scrittori possono usare 1151 - :c:func:`list_for_each_entry()` dato che non ci possono essere due scrittori 1151 + list_for_each_entry() dato che non ci possono essere due scrittori 1152 1152 in contemporanea. 
1153 1153 1154 1154 Il nostro ultimo dilemma è il seguente: quando possiamo realmente distruggere ··· 1156 1156 elemento proprio ora: se eliminiamo questo elemento ed il puntatore ``next`` 1157 1157 cambia, il lettore salterà direttamente nella spazzatura e scoppierà. Dobbiamo 1158 1158 aspettare finché tutti i lettori che stanno attraversando la lista abbiano 1159 - finito. Utilizziamo :c:func:`call_rcu()` per registrare una funzione di 1159 + finito. Utilizziamo call_rcu() per registrare una funzione di 1160 1160 richiamo che distrugga l'oggetto quando tutti i lettori correnti hanno 1161 1161 terminato. In alternative, potrebbe essere usata la funzione 1162 - :c:func:`synchronize_rcu()` che blocca l'esecuzione finché tutti i lettori 1162 + synchronize_rcu() che blocca l'esecuzione finché tutti i lettori 1163 1163 non terminano di ispezionare la lista. 1164 1164 1165 1165 Ma come fa l'RCU a sapere quando i lettori sono finiti? Il meccanismo è 1166 1166 il seguente: innanzi tutto i lettori accedono alla lista solo fra la coppia 1167 - :c:func:`rcu_read_lock()`/:c:func:`rcu_read_unlock()` che disabilita la 1167 + rcu_read_lock()/rcu_read_unlock() che disabilita la 1168 1168 prelazione così che i lettori non vengano sospesi mentre stanno leggendo 1169 1169 la lista. 1170 1170 ··· 1253 1253 } 1254 1254 1255 1255 Da notare che i lettori modificano il campo popularity nella funzione 1256 - :c:func:`__cache_find()`, e ora non trattiene alcun *lock*. Una soluzione 1256 + __cache_find(), e ora non trattiene alcun *lock*. Una soluzione 1257 1257 potrebbe essere quella di rendere la variabile ``atomic_t``, ma per l'uso 1258 1258 che ne abbiamo fatto qui, non ci interessano queste corse critiche perché un 1259 1259 risultato approssimativo è comunque accettabile, quindi non l'ho cambiato. 
1260 1260 1261 - Il risultato è che la funzione :c:func:`cache_find()` non ha bisogno di alcuna 1261 + Il risultato è che la funzione cache_find() non ha bisogno di alcuna 1262 1262 sincronizzazione con le altre funzioni, quindi è veloce su un sistema 1263 1263 multi-processore tanto quanto lo sarebbe su un sistema mono-processore. 1264 1264 ··· 1271 1271 1272 1272 Ora, dato che il '*lock* di lettura' di un RCU non fa altro che disabilitare 1273 1273 la prelazione, un chiamante che ha sempre la prelazione disabilitata fra le 1274 - chiamate :c:func:`cache_find()` e :c:func:`object_put()` non necessita 1274 + chiamate cache_find() e object_put() non necessita 1275 1275 di incrementare e decrementare il contatore di riferimenti. Potremmo 1276 - esporre la funzione :c:func:`__cache_find()` dichiarandola non-static, 1276 + esporre la funzione __cache_find() dichiarandola non-static, 1277 1277 e quel chiamante potrebbe usare direttamente questa funzione. 1278 1278 1279 1279 Il beneficio qui sta nel fatto che il contatore di riferimenti no ··· 1293 1293 Se questo dovesse essere troppo lento (solitamente non lo è, ma se avete 1294 1294 dimostrato che lo è devvero), potreste usare un contatore per ogni processore 1295 1295 e quindi non sarebbe più necessaria la mutua esclusione. Vedere 1296 - :c:func:`DEFINE_PER_CPU()`, :c:func:`get_cpu_var()` e :c:func:`put_cpu_var()` 1296 + DEFINE_PER_CPU(), get_cpu_var() e put_cpu_var() 1297 1297 (``include/linux/percpu.h``). 1298 1298 1299 - Il tipo di dato ``local_t``, la funzione :c:func:`cpu_local_inc()` e tutte 1299 + Il tipo di dato ``local_t``, la funzione cpu_local_inc() e tutte 1300 1300 le altre funzioni associate, sono di particolare utilità per semplici contatori 1301 1301 per-processore; su alcune architetture sono anche più efficienti 1302 1302 (``include/asm/local.h``). 
··· 1324 1324 enable_irq(irq); 1325 1325 spin_unlock(&lock); 1326 1326 1327 - La funzione :c:func:`disable_irq()` impedisce al gestore d'interruzioni 1327 + La funzione disable_irq() impedisce al gestore d'interruzioni 1328 1328 d'essere eseguito (e aspetta che finisca nel caso fosse in esecuzione su 1329 1329 un altro processore). Lo spinlock, invece, previene accessi simultanei. 1330 1330 Naturalmente, questo è più lento della semplice chiamata 1331 - :c:func:`spin_lock_irq()`, quindi ha senso solo se questo genere di accesso 1331 + spin_lock_irq(), quindi ha senso solo se questo genere di accesso 1332 1332 è estremamente raro. 1333 1333 1334 1334 .. _`it_sleeping-things`: ··· 1336 1336 Quali funzioni possono essere chiamate in modo sicuro dalle interruzioni? 1337 1337 ========================================================================= 1338 1338 1339 - Molte funzioni del kernel dormono (in sostanza, chiamano ``schedule()``) 1339 + Molte funzioni del kernel dormono (in sostanza, chiamano schedule()) 1340 1340 direttamente od indirettamente: non potete chiamarle se trattenere uno 1341 1341 spinlock o avete la prelazione disabilitata, mai. Questo significa che 1342 1342 dovete necessariamente essere nel contesto utente: chiamarle da un ··· 1354 1354 1355 1355 - Accessi allo spazio utente: 1356 1356 1357 - - :c:func:`copy_from_user()` 1357 + - copy_from_user() 1358 1358 1359 - - :c:func:`copy_to_user()` 1359 + - copy_to_user() 1360 1360 1361 - - :c:func:`get_user()` 1361 + - get_user() 1362 1362 1363 - - :c:func:`put_user()` 1363 + - put_user() 1364 1364 1365 - - :c:func:`kmalloc(GFP_KERNEL) <kmalloc>` 1365 + - kmalloc(GFP_KERNEL) <kmalloc>` 1366 1366 1367 - - :c:func:`mutex_lock_interruptible()` and 1368 - :c:func:`mutex_lock()` 1367 + - mutex_lock_interruptible() and 1368 + mutex_lock() 1369 1369 1370 - C'è anche :c:func:`mutex_trylock()` che però non dorme. 1370 + C'è anche mutex_trylock() che però non dorme. 
1371 1371 Comunque, non deve essere usata in un contesto d'interruzione dato 1372 1372 che la sua implementazione non è sicura in quel contesto. 1373 - Anche :c:func:`mutex_unlock()` non dorme mai. Non può comunque essere 1373 + Anche mutex_unlock() non dorme mai. Non può comunque essere 1374 1374 usata in un contesto d'interruzione perché un mutex deve essere rilasciato 1375 1375 dallo stesso processo che l'ha acquisito. 1376 1376 ··· 1380 1380 Alcune funzioni possono essere chiamate tranquillamente da qualsiasi 1381 1381 contesto, o trattenendo un qualsiasi *lock*. 1382 1382 1383 - - :c:func:`printk()` 1383 + - printk() 1384 1384 1385 - - :c:func:`kfree()` 1385 + - kfree() 1386 1386 1387 - - :c:func:`add_timer()` e :c:func:`del_timer()` 1387 + - add_timer() e del_timer() 1388 1388 1389 1389 Riferimento per l'API dei Mutex 1390 1390 =============================== ··· 1444 1444 bh 1445 1445 Bottom Half: per ragioni storiche, le funzioni che contengono '_bh' nel 1446 1446 loro nome ora si riferiscono a qualsiasi interruzione software; per esempio, 1447 - :c:func:`spin_lock_bh()` blocca qualsiasi interuzione software sul processore 1447 + spin_lock_bh() blocca qualsiasi interuzione software sul processore 1448 1448 corrente. I *Bottom Halves* sono deprecati, e probabilmente verranno 1449 1449 sostituiti dai tasklet. In un dato momento potrà esserci solo un 1450 1450 *bottom half* in esecuzione. 1451 1451 1452 1452 contesto d'interruzione 1453 1453 Non è il contesto utente: qui si processano le interruzioni hardware e 1454 - software. La macro :c:func:`in_interrupt()` ritorna vero. 1454 + software. La macro in_interrupt() ritorna vero. 1455 1455 1456 1456 contesto utente 1457 1457 Il kernel che esegue qualcosa per conto di un particolare processo (per ··· 1461 1461 che hardware. 1462 1462 1463 1463 interruzione hardware 1464 - Richiesta di interruzione hardware. :c:func:`in_irq()` ritorna vero in un 1464 + Richiesta di interruzione hardware. 
in_irq() ritorna vero in un 1465 1465 gestore d'interruzioni hardware. 1466 1466 1467 1467 interruzione software / softirq 1468 - Gestore di interruzioni software: :c:func:`in_irq()` ritorna falso; 1469 - :c:func:`in_softirq()` ritorna vero. I tasklet e le softirq sono entrambi 1468 + Gestore di interruzioni software: in_irq() ritorna falso; 1469 + in_softirq() ritorna vero. I tasklet e le softirq sono entrambi 1470 1470 considerati 'interruzioni software'. 1471 1471 1472 1472 In soldoni, un softirq è uno delle 32 interruzioni software che possono
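
Note on the locking.rst hunks above: they describe wrapping a reference counter in 'get' and 'put' helpers and replacing the lock-protected counter with atomic_inc() and atomic_dec_and_test(). A minimal userspace sketch of that pattern, assuming C11 `<stdatomic.h>` in place of the kernel's `atomic_t`, might look like this. `struct toy_object`, `toy_object_init()`, `toy_object_get()` and `toy_object_put()` are hypothetical names, and the `freed` flag exists only so the behaviour can be observed.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_object {
	atomic_int refcnt;	/* one count per pointer to this object */
	bool freed;		/* illustration only: set when the last ref is dropped */
};

static void toy_object_init(struct toy_object *obj)
{
	atomic_init(&obj->refcnt, 1);	/* the creator holds the first reference */
	obj->freed = false;
}

/* Take an additional reference; no lock is needed for the counter. */
static void toy_object_get(struct toy_object *obj)
{
	atomic_fetch_add(&obj->refcnt, 1);
}

/* Drop a reference. Returns true if this call dropped the last one. */
static bool toy_object_put(struct toy_object *obj)
{
	/* fetch_sub returns the old value: 1 means the counter is now 0. */
	if (atomic_fetch_sub(&obj->refcnt, 1) == 1) {
		obj->freed = true;	/* real code would free the object here */
		return true;
	}
	return false;
}
```

The decrement and the "did it reach zero" test happen in one atomic step, which is the property the text relies on when it drops the lock around the counter.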

+49 -46

Documentation/translations/it_IT/process/2.Process.rst

··· 23 23 I rilasci più recenti sono stati: 24 24 25 25 ====== ================= 26 - 4.11 Aprile 30, 2017 27 - 4.12 Luglio 2, 2017 28 - 4.13 Settembre 3, 2017 29 - 4.14 Novembre 12, 2017 30 - 4.15 Gennaio 28, 2018 31 - 4.16 Aprile 1, 2018 26 + 5.0 3 marzo, 2019 27 + 5.1 5 maggio, 2019 28 + 5.2 7 luglio, 2019 29 + 5.3 15 settembre, 2019 30 + 5.4 24 novembre, 2019 31 + 5.5 6 gennaio, 2020 32 32 ====== ================= 33 33 34 - Ciascun rilascio 4.x è un importante rilascio del kernel con nuove 34 + Ciascun rilascio 5.x è un importante rilascio del kernel con nuove 35 35 funzionalità, modifiche interne dell'API, e molto altro. Un tipico 36 - rilascio 4.x contiene quasi 13,000 gruppi di modifiche con ulteriori 37 - modifiche a parecchie migliaia di linee di codice. La 4.x. è pertanto la 36 + rilascio contiene quasi 13,000 gruppi di modifiche con ulteriori 37 + modifiche a parecchie migliaia di linee di codice. La 5.x. è pertanto la 38 38 linea di confine nello sviluppo del kernel Linux; il kernel utilizza un sistema 39 39 di sviluppo continuo che integra costantemente nuove importanti modifiche. 40 40 ··· 55 55 La finestra di inclusione resta attiva approssimativamente per due settimane. 56 56 Al termine di questo periodo, Linus Torvald dichiarerà che la finestra è 57 57 chiusa e rilascerà il primo degli "rc" del kernel. 58 - Per il kernel che è destinato ad essere 2.6.40, per esempio, il rilascio 59 - che emerge al termine della finestra d'inclusione si chiamerà 2.6.40-rc1. 58 + Per il kernel che è destinato ad essere 5.6, per esempio, il rilascio 59 + che emerge al termine della finestra d'inclusione si chiamerà 5.6-rc1. 60 60 Questo rilascio indica che il momento di aggiungere nuovi componenti è 61 61 passato, e che è iniziato il periodo di stabilizzazione del prossimo kernel. 62 62 ··· 76 76 il ritmo delle modifiche rallenta col tempo. 
Linus rilascia un nuovo 77 77 kernel -rc circa una volta alla settimana; e ne usciranno circa 6 o 9 prima 78 78 che il kernel venga considerato sufficientemente stabile e che il rilascio 79 - finale 2.6.x venga fatto. A quel punto tutto il processo ricomincerà. 79 + finale venga fatto. A quel punto tutto il processo ricomincerà. 80 80 81 - Esempio: ecco com'è andato il ciclo di sviluppo della versione 4.16 81 + Esempio: ecco com'è andato il ciclo di sviluppo della versione 5.4 82 82 (tutte le date si collocano nel 2018) 83 83 84 84 85 85 ============== ======================================= 86 - Gennaio 28 4.15 rilascio stabile 87 - Febbraio 11 4.16-rc1, finestra di inclusione chiusa 88 - Febbraio 18 4.16-rc2 89 - Febbraio 25 4.16-rc3 90 - Marzo 4 4.16-rc4 91 - Marzo 11 4.16-rc5 92 - Marzo 18 4.16-rc6 93 - Marzo 25 4.16-rc7 94 - Aprile 1 4.17 rilascio stabile 86 + 15 settembre 5.3 rilascio stabile 87 + 30 settembre 5.4-rc1, finestra di inclusione chiusa 88 + 6 ottobre 5.4-rc2 89 + 13 ottobre 5.4-rc3 90 + 20 ottobre 5.4-rc4 91 + 27 ottobre 5.4-rc5 92 + 3 novembre 5.4-rc6 93 + 10 novembre 5.4-rc7 94 + 17 novembre 5.4-rc8 95 + 24 novembre 5.4 rilascio stabile 95 96 ============== ======================================= 96 97 97 98 In che modo gli sviluppatori decidono quando chiudere il ciclo di sviluppo e ··· 109 108 in un progetto di questa portata. Arriva un punto dove ritardare il rilascio 110 109 finale peggiora la situazione; la quantità di modifiche in attesa della 111 110 prossima finestra di inclusione crescerà enormemente, creando ancor più 112 - regressioni al giro successivo. Quindi molti kernel 4.x escono con una 111 + regressioni al giro successivo. Quindi molti kernel 5.x escono con una 113 112 manciata di regressioni delle quali, si spera, nessuna è grave. 114 113 115 114 Una volta che un rilascio stabile è fatto, il suo costante mantenimento è 116 115 affidato al "squadra stabilità", attualmente composta da Greg Kroah-Hartman. 
117 116 Questa squadra rilascia occasionalmente degli aggiornamenti relativi al 118 - rilascio stabile usando la numerazione 4.x.y. Per essere presa in 117 + rilascio stabile usando la numerazione 5.x.y. Per essere presa in 119 118 considerazione per un rilascio d'aggiornamento, una modifica deve: 120 119 (1) correggere un baco importante (2) essere già inserita nel ramo principale 121 120 per il prossimo sviluppo del kernel. Solitamente, passato il loro rilascio 122 121 iniziale, i kernel ricevono aggiornamenti per più di un ciclo di sviluppo. 123 - Quindi, per esempio, la storia del kernel 4.13 appare così: 122 + Quindi, per esempio, la storia del kernel 5.2 appare così (anno 2019): 124 123 125 124 ============== =============================== 126 - Settembre 3 4.13 rilascio stabile 127 - Settembre 13 4.13.1 128 - Settembre 20 4.13.2 129 - Settembre 27 4.13.3 130 - Ottobre 5 4.13.4 131 - Ottobre 12 4.13.5 125 + 15 settembre 5.2 rilascio stabile FIXME settembre è sbagliato 126 + 14 luglio 5.2.1 127 + 21 luglio 5.2.2 128 + 26 luglio 5.2.3 129 + 28 luglio 5.2.4 130 + 31 luglio 5.2.5 132 131 ... ... 133 - Novembre 24 4.13.16 132 + 11 ottobre 5.2.21 134 133 ============== =============================== 135 134 136 - La 4.13.16 fu l'aggiornamento finale per la versione 4.13. 135 + La 5.2.21 fu l'aggiornamento finale per la versione 5.2. 137 136 138 137 Alcuni kernel sono destinati ad essere kernel a "lungo termine"; questi 139 138 riceveranno assistenza per un lungo periodo di tempo. 
Al momento in cui 140 139 scriviamo, i manutentori dei kernel stabili a lungo termine sono: 141 140 142 - ====== ====================== ========================================== 143 - 3.16 Ben Hutchings (kernel stabile molto più a lungo termine) 144 - 4.1 Sasha Levin 145 - 4.4 Greg Kroah-Hartman (kernel stabile molto più a lungo termine) 146 - 4.9 Greg Kroah-Hartman 147 - 4.14 Greg Kroah-Hartman 148 - ====== ====================== ========================================== 141 + ====== ================================ ========================================== 142 + 3.16 Ben Hutchings (kernel stabile molto più a lungo termine) 143 + 4.4 Greg Kroah-Hartman e Sasha Levin (kernel stabile molto più a lungo termine) 144 + 4.9 Greg Kroah-Hartman e Sasha Levin 145 + 4.14 Greg Kroah-Hartman e Sasha Levin 146 + 4.19 Greg Kroah-Hartman e Sasha Levin 147 + 5.4i Greg Kroah-Hartman e Sasha Levin 148 + ====== ================================ ========================================== 149 149 150 150 151 151 Questa selezione di kernel di lungo periodo sono puramente dovuti ai loro ··· 231 229 -------------------------------------- 232 230 233 231 Esiste una sola persona che può inserire le patch nel repositorio principale 234 - del kernel: Linus Torvalds. Ma, di tutte le 9500 patch che entrarono nella 235 - versione 2.6.38 del kernel, solo 112 (circa l'1,3%) furono scelte direttamente 236 - da Linus in persona. Il progetto del kernel è cresciuto fino a raggiungere 237 - una dimensione tale per cui un singolo sviluppatore non può controllare e 238 - selezionare indipendentemente ogni modifica senza essere supportato. 239 - La via scelta dagli sviluppatori per indirizzare tale crescita è stata quella 232 + del kernel: Linus Torvalds. Ma, per esempio, di tutte le 9500 patch 233 + che entrarono nella versione 2.6.38 del kernel, solo 112 (circa 234 + l'1,3%) furono scelte direttamente da Linus in persona. 
Il progetto 235 + del kernel è cresciuto fino a raggiungere una dimensione tale per cui 236 + un singolo sviluppatore non può controllare e selezionare 237 + indipendentemente ogni modifica senza essere supportato. La via 238 + scelta dagli sviluppatori per indirizzare tale crescita è stata quella 240 239 di utilizzare un sistema di "sottotenenti" basato sulla fiducia. 241 240 242 241 Il codice base del kernel è spezzato in una serie si sottosistemi: rete,

+1 -1

Documentation/translations/it_IT/process/adding-syscalls.rst

···
39 39 un qualche modo opaca.
40 40
41 41 - Se dovete esporre solo delle informazioni sul sistema, un nuovo nodo in
42 - sysfs (vedere ``Documentation/filesystems/sysfs.txt``) o
42 + sysfs (vedere ``Documentation/filesystems/sysfs.rst``) o
43 43 in procfs potrebbe essere sufficiente. Tuttavia, l'accesso a questi
44 44 meccanismi richiede che il filesystem sia montato, il che potrebbe non
45 45 essere sempre vero (per esempio, in ambienti come namespace/sandbox/chroot).

+3 -3

Documentation/translations/it_IT/process/coding-style.rst

···
313 313 qualcosa di simile, **non** dovreste chiamarla ``cntusr()``.
314 314
315 315 Codificare il tipo di funzione nel suo nome (quella cosa chiamata notazione
316 - ungherese) fa male al cervello - il compilatore conosce comunque il tipo e
316 + ungherese) è stupido - il compilatore conosce comunque il tipo e
317 317 può verificarli, e inoltre confonde i programmatori. Non c'è da
318 318 sorprendersi che MicroSoft faccia programmi bacati.
319 319
···
825 825
826 826 Agli sviluppatori del kernel piace essere visti come dotti. Tenete un occhio
827 827 di riguardo per l'ortografia e farete una belle figura. In inglese, evitate
828 - l'uso di parole mozzate come ``dont``: usate ``do not`` oppure ``don't``.
829 - Scrivete messaggi concisi, chiari, e inequivocabili.
828 + l'uso incorretto di abbreviazioni come ``dont``: usate ``do not`` oppure
829 + ``don't``. Scrivete messaggi concisi, chiari, e inequivocabili.
830 830
831 831 I messaggi del kernel non devono terminare con un punto fermo.
832 832

+115 -15

Documentation/translations/it_IT/process/deprecated.rst

··· 34 34 deve essere rimossa dal kernel, o aggiunta a questo documento per scoraggiarne 35 35 l'uso. 36 36 37 + BUG() e BUG_ON() 38 + ---------------- 39 + Al loro posto usate WARN() e WARN_ON() per gestire le 40 + condizioni "impossibili" e gestitele come se fosse possibile farlo. 41 + Nonostante le funzioni della famiglia BUG() siano state progettate 42 + per asserire "situazioni impossibili" e interrompere in sicurezza un 43 + thread del kernel, queste si sono rivelate essere troppo rischiose 44 + (per esempio, in quale ordine rilasciare i *lock*? Ci sono stati che 45 + sono stati ripristinati?). Molto spesso l'uso di BUG() 46 + destabilizza il sistema o lo corrompe del tutto, il che rende 47 + impossibile un'attività di debug o anche solo leggere un rapporto 48 + circa l'errore. Linus ha un'opinione molto critica al riguardo: 49 + `email 1 50 + <https://lore.kernel.org/lkml/CA+55aFy6jNLsywVYdGp83AMrXBo_P-pkjkphPGrO=82SPKCpLQ@mail.gmail.com/>`_, 51 + `email 2 52 + <https://lore.kernel.org/lkml/CAHk-=whDHsbK3HTOpTF=ue_o04onRwTEaK_ZoJp_fjbqq4+=Jw@mail.gmail.com/>`_ 53 + 54 + Tenete presente che la famiglia di funzioni WARN() dovrebbe essere 55 + usato solo per situazioni che si suppone siano "impossibili". Se 56 + volete avvisare gli utenti riguardo a qualcosa di possibile anche se 57 + indesiderato, usare le funzioni della famiglia pr_warn(). Chi 58 + amministra il sistema potrebbe aver attivato l'opzione sysctl 59 + *panic_on_warn* per essere sicuri che il sistema smetta di funzionare 60 + in caso si verifichino delle condizioni "inaspettate". 
(per esempio, 61 + date un'occhiata al questo `commit 62 + <https://git.kernel.org/linus/d4689846881d160a4d12a514e991a740bcb5d65a>`_) 63 + 37 64 Calcoli codificati negli argomenti di un allocatore 38 65 ---------------------------------------------------- 39 66 Il calcolo dinamico delle dimensioni (specialmente le moltiplicazioni) non ··· 95 68 96 69 header = kzalloc(struct_size(header, item, count), GFP_KERNEL); 97 70 98 - Per maggiori dettagli fate riferimento a :c:func:`array_size`, 99 - :c:func:`array3_size`, e :c:func:`struct_size`, così come la famiglia di 100 - funzioni :c:func:`check_add_overflow` e :c:func:`check_mul_overflow`. 71 + Per maggiori dettagli fate riferimento a array_size(), 72 + array3_size(), e struct_size(), così come la famiglia di 73 + funzioni check_add_overflow() e check_mul_overflow(). 101 74 102 75 simple_strtol(), simple_strtoll(), simple_strtoul(), simple_strtoull() 103 76 ---------------------------------------------------------------------- 104 - Le funzioni :c:func:`simple_strtol`, :c:func:`simple_strtoll`, 105 - :c:func:`simple_strtoul`, e :c:func:`simple_strtoull` ignorano volutamente 77 + Le funzioni simple_strtol(), simple_strtoll(), 78 + simple_strtoul(), e simple_strtoull() ignorano volutamente 106 79 i possibili overflow, e questo può portare il chiamante a generare risultati 107 - inaspettati. Le rispettive funzioni :c:func:`kstrtol`, :c:func:`kstrtoll`, 108 - :c:func:`kstrtoul`, e :c:func:`kstrtoull` sono da considerarsi le corrette 80 + inaspettati. Le rispettive funzioni kstrtol(), kstrtoll(), 81 + kstrtoul(), e kstrtoull() sono da considerarsi le corrette 109 82 sostitute; tuttavia va notato che queste richiedono che la stringa sia 110 83 terminata con il carattere NUL o quello di nuova riga. 111 84 112 85 strcpy() 113 86 -------- 114 - La funzione :c:func:`strcpy` non fa controlli agli estremi del buffer 87 + La funzione strcpy() non fa controlli agli estremi del buffer 115 88 di destinazione. 
Questo può portare ad un overflow oltre i limiti del 116 89 buffer e generare svariati tipi di malfunzionamenti. Nonostante l'opzione 117 90 `CONFIG_FORTIFY_SOURCE=y` e svariate opzioni del compilatore aiutano 118 91 a ridurne il rischio, non c'è alcuna buona ragione per continuare ad usare 119 - questa funzione. La versione sicura da usare è :c:func:`strscpy`. 92 + questa funzione. La versione sicura da usare è strscpy(). 120 93 121 94 strncpy() su stringe terminate con NUL 122 95 -------------------------------------- 123 - L'utilizzo di :c:func:`strncpy` non fornisce alcuna garanzia sul fatto che 96 + L'utilizzo di strncpy() non fornisce alcuna garanzia sul fatto che 124 97 il buffer di destinazione verrà terminato con il carattere NUL. Questo 125 98 potrebbe portare a diversi overflow di lettura o altri malfunzionamenti 126 99 causati, appunto, dalla mancanza del terminatore. Questa estende la 127 100 terminazione nel buffer di destinazione quando la stringa d'origine è più 128 101 corta; questo potrebbe portare ad una penalizzazione delle prestazioni per 129 102 chi usa solo stringe terminate. La versione sicura da usare è 130 - :c:func:`strscpy`. (chi usa :c:func:`strscpy` e necessita di estendere la 131 - terminazione con NUL deve aggiungere una chiamata a :c:func:`memset`) 103 + strscpy(). (chi usa strscpy() e necessita di estendere la 104 + terminazione con NUL deve aggiungere una chiamata a memset()) 132 105 133 - Se il chiamate no usa stringhe terminate con NUL, allore :c:func:`strncpy()` 106 + Se il chiamate no usa stringhe terminate con NUL, allore strncpy()() 134 107 può continuare ad essere usata, ma i buffer di destinazione devono essere 135 108 marchiati con l'attributo `__nonstring <https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html>`_ 136 109 per evitare avvisi durante la compilazione. 
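Note on the strcpy()/strncpy() passages above: the recommended replacement, strscpy(), always terminates the destination and reports truncation. The sketch below is a hypothetical userspace helper with a similar contract, written only to illustrate that behaviour. `toy_bounded_copy()` is an invented name and is not the kernel's strscpy(); in particular it returns -1 on truncation, where the kernel function returns -E2BIG.

```c
#include <stddef.h>

/*
 * Copy src into dst, which has room for 'size' bytes.
 * dst is always NUL-terminated when size > 0.
 * Returns the number of characters copied (without the NUL),
 * or -1 if size is 0 or src had to be truncated.
 */
static long toy_bounded_copy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -1;

	/* Copy at most size - 1 characters, leaving room for the NUL. */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If src still has characters left, the copy was truncated. */
	return src[i] == '\0' ? (long)i : -1;
}
```

Unlike strncpy(), this helper does not pad the rest of the buffer with NUL bytes; as the text says for strscpy(), a caller that needs the padding would add an explicit memset().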
137 110 138 111 strlcpy() 139 112 --------- 140 - La funzione :c:func:`strlcpy`, per prima cosa, legge interamente il buffer di 113 + La funzione strlcpy(), per prima cosa, legge interamente il buffer di 141 114 origine, magari leggendo più di quanto verrà effettivamente copiato. Questo 142 115 è inefficiente e può portare a overflow di lettura quando la stringa non è 143 - terminata con NUL. La versione sicura da usare è :c:func:`strscpy`. 116 + terminata con NUL. La versione sicura da usare è strscpy(). 117 + 118 + Segnaposto %p nella stringa di formato 119 + -------------------------------------- 120 + 121 + Tradizionalmente, l'uso del segnaposto "%p" nella stringa di formato 122 + esponne un indirizzo di memoria in dmesg, proc, sysfs, eccetera. Per 123 + evitare che questi indirizzi vengano sfruttati da malintenzionati, 124 + tutto gli usi di "%p" nel kernel rappresentano l'hash dell'indirizzo, 125 + rendendolo di fatto inutilizzabile. Nuovi usi di "%p" non dovrebbero 126 + essere aggiunti al kernel. Per una rappresentazione testuale di un 127 + indirizzo usate "%pS", l'output è migliore perché mostrerà il nome del 128 + simbolo. Per tutto il resto, semplicemente non usate "%p". 129 + 130 + Parafrasando la `guida 131 + <https://lore.kernel.org/lkml/CA+55aFwQEd_d40g4mUCSsVRZzrFPUJt74vc6PPpb675hYNXcKw@mail.gmail.com/>`_ 132 + di Linus: 133 + 134 + - Se il valore hash di "%p" è inutile, chiediti se il puntatore stesso 135 + è importante. Forse dovrebbe essere rimosso del tutto? 136 + - Se credi davvero che il vero valore del puntatore sia importante, 137 + perché alcuni stati del sistema o i livelli di privilegi di un 138 + utente sono considerati "special"? Se pensi di poterlo giustificare 139 + (in un commento e nel messaggio del commit) abbastanza bene da 140 + affrontare il giudizio di Linus, allora forse potrai usare "%px", 141 + assicurandosi anche di averne il permesso. 
142 + 143 + Infine, sappi che un cambio in favore di "%p" con hash `non verrà 144 + accettato 145 + <https://lore.kernel.org/lkml/CA+55aFwieC1-nAs+NFq9RTwaR8ef9hWa4MjNBWL41F-8wM49eA@mail.gmail.com/>`_. 144 146 145 147 Vettori a dimensione variabile (VLA) 146 148 ------------------------------------ ··· 183 127 dati importanti alla fine dello stack (quando il kernel è compilato senza 184 128 `CONFIG_THREAD_INFO_IN_TASK=y`), o sovrascrivere un pezzo di memoria adiacente 185 129 allo stack (quando il kernel è compilato senza `CONFIG_VMAP_STACK=y`). 130 + 131 + Salto implicito nell'istruzione switch-case 132 + ------------------------------------------- 133 + 134 + Il linguaggio C permette ai casi di un'istruzione `switch` di saltare al 135 + prossimo caso quando l'istruzione "break" viene omessa alla fine del caso 136 + corrente. Tuttavia questo rende il codice ambiguo perché non è sempre ovvio se 137 + l'istruzione "break" viene omessa intenzionalmente o è un baco. Per esempio, 138 + osservando il seguente pezzo di codice non è chiaro se lo stato 139 + `STATE_ONE` è stato progettato apposta per eseguire anche `STATE_TWO`:: 140 + 141 + switch (value) { 142 + case STATE_ONE: 143 + do_something(); 144 + case STATE_TWO: 145 + do_other(); 146 + break; 147 + default: 148 + WARN("unknown state"); 149 + } 150 + 151 + Dato che c'è stata una lunga lista di problemi `dovuti alla mancanza dell'istruzione 152 + "break" <https://cwe.mitre.org/data/definitions/484.html>`_, oggigiorno non 153 + permettiamo più che vi sia un "salto implicito" (*fall-through*). Per 154 + identificare un salto implicito intenzionale abbiamo adottato la pseudo 155 + parola chiave 'fallthrough' che viene espansa nell'estensione di gcc 156 + `__attribute__((fallthrough))` `Statement Attributes 157 + <https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html>`_. 
158 + (Quando la sintassi C17/C18 `[[fallthrough]]` sarà più comunemente 159 + supportata dai compilatori C, analizzatori statici, e dagli IDE, 160 + allora potremo usare quella sintassi per la pseudo parola chiave) 161 + 162 + Quando la sintassi [[fallthrough]] sarà più comunemente supportata dai 163 + compilatori, analizzatori statici, e ambienti di sviluppo IDE, 164 + allora potremo usarla anche noi. 165 + 166 + Ne consegue che tutti i blocchi switch/case devono finire in uno dei seguenti 167 + modi: 168 + 169 + * ``break;`` 170 + * `fallthrough;`` 171 + * ``continue;`` 172 + * ``goto <label>;`` 173 + * ``return [expression];``
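The switch/case rule added by this hunk can be illustrated with a small userspace sketch. The `fallthrough` macro below is defined locally for the example, following the text's description that the pseudo-keyword expands to GCC's `__attribute__((fallthrough))`; the `enum state` values and the `handle()` function are hypothetical names standing in for the `STATE_ONE`/`STATE_TWO` example in the documentation.

```c
#include <assert.h>

/*
 * Local stand-in for the kernel's pseudo-keyword (an assumption for this
 * example): use the GCC/Clang statement attribute when it is available,
 * otherwise fall back to an empty statement.
 */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

enum state { STATE_ONE, STATE_TWO, STATE_OTHER };

/* Every case ends in break, fallthrough, or return, as the list requires. */
static int handle(enum state value)
{
	int steps = 0;

	switch (value) {
	case STATE_ONE:
		steps += 1;	/* stands in for do_something() */
		fallthrough;	/* intentional: STATE_ONE also runs STATE_TWO */
	case STATE_TWO:
		steps += 10;	/* stands in for do_other() */
		break;
	default:
		steps = -1;	/* unknown state */
		break;
	}

	return steps;
}
```

With the explicit marker, a reader (and a compiler run with `-Wimplicit-fallthrough`) can tell that `STATE_ONE` is meant to continue into `STATE_TWO`, which is exactly the ambiguity the original snippet leaves open.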

+327 -5

Documentation/translations/it_IT/process/email-clients.rst

··· 1 1 .. include:: ../disclaimer-ita.rst 2 2 3 - :Original: :ref:`Documentation/process/email-clients.rst <email_clients>` 4 - 5 - .. _it_email_clients: 3 + :Original: :doc:`../../../process/email-clients` 4 + :Translator: Alessia Mantegazza <amantegazza@vaga.pv.it> 6 5 7 6 Informazioni sui programmi di posta elettronica per Linux 8 7 ========================================================= 9 8 10 - .. warning:: 9 + Git 10 + --- 11 11 12 - TODO ancora da tradurre 12 + Oggigiorno, la maggior parte degli sviluppatori utilizza ``git send-email`` 13 + al posto dei classici programmi di posta elettronica. Le pagine man sono 14 + abbastanza buone. Dal lato del ricevente, i manutentori utilizzano ``git am`` 15 + per applicare le patch. 16 + 17 + Se siete dei novelli utilizzatori di ``git`` allora inviate la patch a voi 18 + stessi. Salvatela come testo includendo tutte le intestazioni. Poi eseguite 19 + il comando ``git am messaggio-formato-testo.txt`` e revisionatene il risultato 20 + con ``git log``. Quando tutto funziona correttamente, allora potete inviare 21 + la patch alla lista di discussione più appropriata. 22 + 23 + Panoramica delle opzioni 24 + ------------------------ 25 + 26 + Le patch per il kernel vengono inviate per posta elettronica, preferibilmente 27 + come testo integrante del messaggio. Alcuni manutentori accettano gli 28 + allegati, ma in questo caso gli allegati devono avere il *content-type* 29 + impostato come ``text/plain``. Tuttavia, generalmente gli allegati non sono 30 + ben apprezzati perché rende più difficile citare porzioni di patch durante il 31 + processo di revisione. 32 + 33 + I programmi di posta elettronica che vengono usati per inviare le patch per il 34 + kernel Linux dovrebbero inviarle senza alterazioni. Per esempio, non 35 + dovrebbero modificare o rimuovere tabulazioni o spazi, nemmeno all'inizio o 36 + alla fine delle righe. 37 + 38 + Non inviate patch con ``format=flowed``. 
Questo potrebbe introdurre 39 + interruzioni di riga inaspettate e indesiderate. 40 + 41 + Non lasciate che il vostro programma di posta vada a capo automaticamente. 42 + Questo può corrompere le patch. 43 + 44 + I programmi di posta non dovrebbero modificare la codifica dei caratteri nel 45 + testo. Le patch inviate per posta elettronica dovrebbero essere codificate in 46 + ASCII o UTF-8. 47 + Se configurate il vostro programma per inviare messaggi codificati con UTF-8 48 + eviterete possibili problemi di codifica. 49 + 50 + I programmi di posta dovrebbero generare e mantenere le intestazioni 51 + "References" o "In-Reply-To:" cosicché la discussione non venga interrotta. 52 + 53 + Di solito, il copia-e-incolla (o taglia-e-incolla) non funziona con le patch 54 + perché le tabulazioni vengono convertite in spazi. Usando xclipboard, xclip 55 + e/o xcutsel potrebbe funzionare, ma è meglio che lo verifichiate o meglio 56 + ancora: non usate il copia-e-incolla. 57 + 58 + Non usate firme PGP/GPG nei messaggi che contengono delle patch. Questo 59 + impedisce il corretto funzionamento di alcuni script per leggere o applicare 60 + patch (questo si dovrebbe poter correggere). 61 + 62 + Prima di inviare le patch sulle liste di discussione Linux, può essere una 63 + buona idea quella di inviare la patch a voi stessi, salvare il messaggio 64 + ricevuto, e applicarlo ai sorgenti con successo. 65 + 66 + 67 + Alcuni suggerimenti per i programmi di posta elettronica (MUA) 68 + -------------------------------------------------------------- 69 + 70 + Qui troverete alcuni suggerimenti per configurare i vostri MUA allo scopo 71 + di modificare ed inviare patch per il kernel Linux. Tuttavia, questi 72 + suggerimenti non sono da considerarsi come un riassunto di una configurazione 73 + completa. 
74 + 75 + Legenda: 76 + 77 + - TUI = interfaccia utente testuale (*text-based user interface*) 78 + - GUI = interfaccia utente grafica (*graphical user interface*) 79 + 80 + Alpine (TUI) 81 + ************ 82 + 83 + Opzioni per la configurazione: 84 + 85 + Nella sezione :menuselection:`Sending Preferences`: 86 + 87 + - :menuselection:`Do Not Send Flowed Text` deve essere ``enabled`` 88 + - :menuselection:`Strip Whitespace Before Sending` deve essere ``disabled`` 89 + 90 + Quando state scrivendo un messaggio, il cursore dev'essere posizionato 91 + dove volete che la patch inizi, poi premendo :kbd:`CTRL-R` vi verrà chiesto 92 + di selezionare il file patch da inserire nel messaggio. 93 + 94 + Claws Mail (GUI) 95 + **************** 96 + 97 + Funziona. Alcune persone riescono ad usarlo con successo per inviare le patch. 98 + 99 + Per inserire una patch usate :menuselection:`Messaggio-->Inserisci file` 100 + (:kbd:`CTRL-I`) oppure un editor esterno. 101 + 102 + Se la patch che avete inserito dev'essere modificata usato la finestra di 103 + scrittura di Claws, allora assicuratevi che l'"auto-interruzione" sia 104 + disabilitata :menuselection:`Configurazione-->Preferenze-->Composizione-->Interruzione riga`. 105 + 106 + Evolution (GUI) 107 + *************** 108 + 109 + Alcune persone riescono ad usarlo con successo per inviare le patch. 110 + 111 + Quando state scrivendo una lettera selezionate: Preformattato 112 + da :menuselection:`Formato-->Stile del paragrafo-->Preformattato` 113 + (:kbd:`CTRL-7`) o dalla barra degli strumenti 114 + 115 + Poi per inserire la patch usate: 116 + :menuselection:`Inserisci--> File di testo...` (:kbd:`ALT-N x`) 117 + 118 + Potete anche eseguire ``diff -Nru old.c new.c | xclip``, selezionare 119 + :menuselection:`Preformattato`, e poi usare il tasto centrale del mouse. 120 + 121 + Kmail (GUI) 122 + *********** 123 + 124 + Alcune persone riescono ad usarlo con successo per inviare le patch. 
125 + 126 + La configurazione base che disabilita la composizione di messaggi HTML è 127 + corretta; non abilitatela. 128 + 129 + Quando state scrivendo un messaggio, nel menu opzioni, togliete la selezione a 130 + "A capo automatico". L'unico svantaggio sarà che qualsiasi altra cosa scriviate 131 + nel messaggio non verrà mandata a capo in automatico ma dovrete farlo voi. 132 + Il modo più semplice per ovviare a questo problema è quello di scrivere il 133 + messaggio con l'opzione abilitata e poi di salvarlo nelle bozze. Riaprendo ora 134 + il messaggio dalle bozze le andate a capo saranno parte integrante del 135 + messaggio, per cui togliendo l'opzione "A capo automatico" non perderete nulla. 136 + 137 + Alla fine del vostro messaggio, appena prima di inserire la vostra patch, 138 + aggiungete il delimitatore di patch: tre trattini (``---``). 139 + 140 + Ora, dal menu :menuselection:`Messaggio`, selezionate :menuselection:`Inserisci file di testo...` 141 + quindi scegliete la vostra patch. 142 + Come soluzione aggiuntiva potreste personalizzare la vostra barra degli 143 + strumenti aggiungendo un'icona per :menuselection:`Inserisci file di testo...`. 144 + 145 + Allargate la finestra di scrittura abbastanza da evitare andate a capo. 146 + Questo perché in Kmail 1.13.5 (KDE 4.5.4), Kmail aggiunge andate a capo 147 + automaticamente al momento dell'invio per tutte quelle righe che graficamente, 148 + nella vostra finestra di composizione, si sono estete su una riga successiva. 149 + Disabilitare l'andata a capo automatica non è sufficiente. Dunque, se la vostra 150 + patch contiene delle righe molto lunghe, allora dovrete allargare la finestra 151 + di composizione per evitare che quelle righe vadano a capo. Vedere: 152 + https://bugs.kde.org/show_bug.cgi?id=174034 153 + 154 + Potete firmare gli allegati con GPG, ma per le patch si preferisce aggiungerle 155 + al testo del messaggio per cui non usate la firma GPG. 
Firmare le patch 156 + inserite come testo del messaggio le rende più difficili da estrarre dalla loro 157 + codifica a 7-bit. 158 + 159 + Se dovete assolutamente inviare delle patch come allegati invece di integrarle 160 + nel testo del messaggio, allora premete il tasto destro sull'allegato e 161 + selezionate :menuselection:`Proprietà`, e poi attivate 162 + :menuselection:`Suggerisci visualizzazione automatica` per far si che 163 + l'allegato sia più leggibile venendo visualizzato come parte del messaggio. 164 + 165 + Per salvare le patch inviate come parte di un messaggio, selezionate il 166 + messaggio che la contiene, premete il tasto destro e selezionate 167 + :menuselection:`Salva come`. Se il messaggio fu ben preparato, allora potrete 168 + usarlo interamente senza alcuna modifica. 169 + I messaggi vengono salvati con permessi di lettura-scrittura solo per l'utente, 170 + nel caso in cui vogliate copiarli altrove per renderli disponibili ad altri 171 + gruppi o al mondo, ricordatevi di usare ``chmod`` per cambiare i permessi. 172 + 173 + Lotus Notes (GUI) 174 + ***************** 175 + 176 + Scappate finché potete. 177 + 178 + IBM Verse (Web GUI) 179 + ******************* 180 + 181 + Vedi il commento per Lotus Notes. 182 + 183 + Mutt (TUI) 184 + ********** 185 + 186 + Un sacco di sviluppatori Linux usano ``mutt``, per cui deve funzionare 187 + abbastanza bene. 188 + 189 + Mutt non ha un proprio editor, quindi qualunque sia il vostro editor dovrete 190 + configurarlo per non aggiungere automaticamente le andate a capo. Molti 191 + editor hanno un'opzione :menuselection:`Inserisci file` che inserisce il 192 + contenuto di un file senza alterarlo. 193 + 194 + Per usare ``vim`` come editor per mutt:: 195 + 196 + set editor="vi" 197 + 198 + Se per inserire la patch nel messaggio usate xclip, scrivete il comando:: 199 + 200 + :set paste 201 + 202 + prima di premere il tasto centrale o shift-insert. 
Oppure usate il 203 + comando:: 204 + 205 + :r filename 206 + 207 + (a)llega funziona bene senza ``set paste`` 208 + 209 + Potete generare le patch con ``git format-patch`` e usare Mutt per inviarle:: 210 + 211 + $ mutt -H 0001-some-bug-fix.patch 212 + 213 + Opzioni per la configurazione: 214 + 215 + Tutto dovrebbe funzionare già nella configurazione base. 216 + Tuttavia, è una buona idea quella di impostare ``send_charset``:: 217 + 218 + set send_charset="us-ascii:utf-8" 219 + 220 + Mutt è molto personalizzabile. Qui di seguito trovate la configurazione minima 221 + per iniziare ad usare Mutt per inviare patch usando Gmail:: 222 + 223 + # .muttrc 224 + # ================ IMAP ==================== 225 + set imap_user = 'yourusername@gmail.com' 226 + set imap_pass = 'yourpassword' 227 + set spoolfile = imaps://imap.gmail.com/INBOX 228 + set folder = imaps://imap.gmail.com/ 229 + set record="imaps://imap.gmail.com/[Gmail]/Sent Mail" 230 + set postponed="imaps://imap.gmail.com/[Gmail]/Drafts" 231 + set mbox="imaps://imap.gmail.com/[Gmail]/All Mail" 232 + 233 + # ================ SMTP ==================== 234 + set smtp_url = "smtp://username@smtp.gmail.com:587/" 235 + set smtp_pass = $imap_pass 236 + set ssl_force_tls = yes # Require encrypted connection 237 + 238 + # ================ Composition ==================== 239 + set editor = `echo \$EDITOR` 240 + set edit_headers = yes # See the headers when editing 241 + set charset = UTF-8 # value of $LANG; also fallback for send_charset 242 + # Sender, email address, and sign-off line must match 243 + unset use_domain # because joe@localhost is just embarrassing 244 + set realname = "YOUR NAME" 245 + set from = "username@gmail.com" 246 + set use_from = yes 247 + 248 + La documentazione di Mutt contiene molte più informazioni: 249 + 250 + https://gitlab.com/muttmua/mutt/-/wikis/UseCases/Gmail 251 + 252 + http://www.mutt.org/doc/manual/ 253 + 254 + Pine (TUI) 255 + ********** 256 + 257 + Pine aveva alcuni problemi con gli 
spazi vuoti, ma questi dovrebbero essere 258 + stati risolti. 259 + 260 + Se potete usate alpine (il successore di pine). 261 + 262 + Opzioni di configurazione: 263 + 264 + - Nelle versioni più recenti è necessario avere ``quell-flowed-text`` 265 + - l'opzione ``no-strip-whitespace-before-send`` è necessaria 266 + 267 + Sylpheed (GUI) 268 + ************** 269 + 270 + - funziona bene per aggiungere testo in linea (o usando allegati) 271 + - permette di utilizzare editor esterni 272 + - è lento su cartelle grandi 273 + - non farà l'autenticazione TSL SMTP su una connessione non SSL 274 + - ha un utile righello nella finestra di scrittura 275 + - la rubrica non comprende correttamente il nome da visualizzare e 276 + l'indirizzo associato 277 + 278 + Thunderbird (GUI) 279 + ***************** 280 + 281 + Thunderbird è un clone di Outlook a cui piace maciullare il testo, ma esistono 282 + modi per impedirglielo. 283 + 284 + - permettere l'uso di editor esterni: 285 + La cosa più semplice da fare con Thunderbird e le patch è quello di usare 286 + l'estensione "external editor" e di usare il vostro ``$EDITOR`` preferito per 287 + leggere/includere patch nel vostro messaggio. Per farlo, scaricate ed 288 + installate l'estensione e aggiungete un bottone per chiamarla rapidamente 289 + usando :menuselection:`Visualizza-->Barra degli strumenti-->Personalizza...`; 290 + una volta fatto potrete richiamarlo premendo sul bottone mentre siete nella 291 + finestra :menuselection:`Scrivi` 292 + 293 + Tenete presente che "external editor" richiede che il vostro editor non 294 + faccia alcun fork, in altre parole, l'editor non deve ritornare prima di 295 + essere stato chiuso. Potreste dover passare dei parametri aggiuntivi al 296 + vostro editor oppure cambiargli la configurazione. 
Per esempio, usando 297 + gvim dovrete aggiungere l'opzione -f ``/usr/bin/gvim -f`` (Se il binario 298 + si trova in ``/usr/bin``) nell'apposito campo nell'interfaccia di 299 + configurazione di :menuselection:`external editor`. Se usate altri editor 300 + consultate il loro manuale per sapere come configurarli. 301 + 302 + Per rendere l'editor interno un po' più sensato, fate così: 303 + 304 + - Modificate le impostazioni di Thunderbird per far si che non usi 305 + ``format=flowed``. Andate in :menuselection:`Modifica-->Preferenze-->Avanzate-->Editor di configurazione` 306 + per invocare il registro delle impostazioni. 307 + 308 + - impostate ``mailnews.send_plaintext_flowed`` a ``false`` 309 + 310 + - impostate ``mailnews.wraplength`` da ``72`` a ``0`` 311 + 312 + - :menuselection:`Visualizza-->Corpo del messaggio come-->Testo semplice` 313 + 314 + - :menuselection:`Visualizza-->Codifica del testo-->Unicode` 315 + 316 + 317 + TkRat (GUI) 318 + *********** 319 + 320 + Funziona. Usare "Inserisci file..." o un editor esterno. 321 + 322 + Gmail (Web GUI) 323 + *************** 324 + 325 + Non funziona per inviare le patch. 326 + 327 + Il programma web Gmail converte automaticamente i tab in spazi. 328 + 329 + Allo stesso tempo aggiunge andata a capo ogni 78 caratteri. Comunque 330 + il problema della conversione fra spazi e tab può essere risolto usando 331 + un editor esterno. 332 + 333 + Un altro problema è che Gmail usa la codifica base64 per tutti quei messaggi 334 + che contengono caratteri non ASCII. Questo include cose tipo i nomi europei.

+1

Documentation/translations/it_IT/process/index.rst

··· 59 59 magic-number 60 60 volatile-considered-harmful 61 61 clang-format 62 + ../riscv/patch-acceptance 62 63 63 64 .. only:: subproject and html 64 65

+287 -6

Documentation/translations/it_IT/process/management-style.rst

··· 1 1 .. include:: ../disclaimer-ita.rst 2 2 3 - :Original: :ref:`Documentation/process/management-style.rst <managementstyle>` 3 + :Original: :doc:`../../../process/management-style` 4 + :Translator: Alessia Mantegazza <amantegazza@vaga.pv.it> 4 5 5 - .. _it_managementstyle: 6 + Il modello di gestione del kernel Linux 7 + ======================================= 6 8 7 - Tipo di gestione del kernel Linux 8 - ================================= 9 + Questo breve documento descrive il modello di gestione del kernel Linux. 10 + Per certi versi, esso rispecchia il documento 11 + :ref:`translations/it_IT/process/coding-style.rst <it_codingstyle>`, 12 + ed è principalmente scritto per evitare di rispondere [#f1]_ in continuazione 13 + alle stesse identiche (o quasi) domande. 9 14 10 - .. warning:: 15 + Il modello di gestione è qualcosa di molto personale e molto più difficile da 16 + qualificare rispetto a delle semplici regole di codifica, quindi questo 17 + documento potrebbe avere più o meno a che fare con la realtà. È cominciato 18 + come un gioco, ma ciò non significa che non possa essere vero. 19 + Lo dovrete decidere voi stessi. 11 20 12 - TODO ancora da tradurre 21 + In ogni caso, quando si parla del "dirigente del kernel", ci si riferisce 22 + sempre alla persona che dirige tecnicamente, e non a coloro che 23 + tradizionalmente hanno un ruolo direttivo all'interno delle aziende. Se vi 24 + occupate di convalidare acquisti o avete una qualche idea sul budget del vostro 25 + gruppo, probabilmente non siete un dirigente del kernel. Quindi i suggerimenti 26 + qui indicati potrebbero fare al caso vostro, oppure no. 27 + 28 + Prima di tutto, suggerirei di acquistare "Le sette regole per avere successo", 29 + e di non leggerlo. Bruciatelo, è un grande gesto simbolico. 30 + 31 + .. [#f1] Questo documento non fa molto per risponde alla domanda, ma rende 32 + così dannatamente ovvio a chi la pone che non abbiamo la minima idea 33 + di come rispondere. 
34 + 35 + Comunque, partiamo: 36 + 37 + .. _it_decisions: 38 + 39 + 1) Le decisioni 40 + --------------- 41 + 42 + Tutti pensano che i dirigenti decidano, e che questo prendere decisioni 43 + sia importante. Più grande e dolorosa è la decisione, più importante deve 44 + essere il dirigente che la prende. Questo è molto profondo ed ovvio, ma non è 45 + del tutto vero. 46 + 47 + Il gioco consiste nell'"evitare" di dover prendere decisioni. In particolare 48 + se qualcuno vi chiede di "Decidere" tra (a) o (b), e vi dice che ha 49 + davvero bisogno di voi per questo, come dirigenti siete nei guai. 50 + Le persone che gestite devono conoscere i dettagli più di quanto li conosciate 51 + voi, quindi se vengono da voi per una decisione tecnica, siete fottuti. 52 + Non sarete chiaramente competente per prendere quella decisione per loro. 53 + 54 + (Corollario: se le persone che gestite non conoscono i dettagli meglio di voi, 55 + anche in questo caso sarete fregati, tuttavia per altre ragioni. Ossia state 56 + facendo il lavoro sbagliato, e che invece dovrebbero essere "loro" a gestirvi) 57 + 58 + Quindi il gioco si chiama "evitare" decisioni, almeno le più grandi e 59 + difficili. Prendere decisioni piccoli e senza conseguenze va bene, e vi fa 60 + sembrare competenti in quello che state facendo, quindi quello che un dirigente 61 + del kernel ha bisogno di fare è trasformare le decisioni grandi e difficili 62 + in minuzie delle quali nessuno importa. 63 + 64 + Ciò aiuta a capire che la differenza chiave tra una grande decisione ed una 65 + piccola sta nella possibilità di modificare tale decisione in seguito. 66 + Qualsiasi decisione importante può essere ridotta in decisioni meno importanti, 67 + ma dovete assicurarvi che possano essere reversibili in caso di errori 68 + (presenti o futuri). Improvvisamente, dovrete essere doppiamente dirigenti 69 + per **due** decisioni non sequenziali - quella sbagliata **e** quella giusta. 
70 + 71 + E le persone vedranno tutto ciò come prova di vera capacità di comando 72 + (*cough* cavolata *cough*) 73 + 74 + Così la chiave per evitare le decisioni difficili diviene l'evitare 75 + di fare cose che non possono essere disfatte. Non infilatevi in un angolo 76 + dal quale non potrete sfuggire. Un topo messo all'angolo può rivelarsi 77 + pericoloso - un dirigente messo all'angolo è solo pietoso. 78 + 79 + **In ogni caso** dato che nessuno è stupido al punto da lasciare veramente ad 80 + un dirigente del kernel un enorme responsabilità, solitamente è facile fare 81 + marcia indietro. Annullare una decisione è molto facile: semplicemente dite a 82 + tutti che siete stati degli scemi incompetenti, dite che siete dispiaciuti, ed 83 + annullate tutto l'inutile lavoro sul quale gli altri hanno lavorato nell'ultimo 84 + anno. Improvvisamente la decisione che avevate preso un anno fa non era poi 85 + così grossa, dato che può essere facilmente annullata. 86 + 87 + È emerso che alcune persone hanno dei problemi con questo tipo di approccio, 88 + questo per due ragioni: 89 + 90 + - ammettere di essere degli idioti è più difficile di quanto sembri. A tutti 91 + noi piace mantenere le apparenze, ed uscire allo scoperto in pubblico per 92 + ammettere che ci si è sbagliati è qualcosa di davvero impegnativo. 93 + - avere qualcuno che ti dice che ciò su cui hai lavorato nell'ultimo anno 94 + non era del tutto valido, può rivelarsi difficile anche per un povero ed 95 + umile ingegnere, e mentre il **lavoro** vero era abbastanza facile da 96 + cancellare, dall'altro canto potreste aver irrimediabilmente perso la 97 + fiducia di quell'ingegnere. E ricordate che l'"irrevocabile" era quello 98 + che avevamo cercato di evitare fin dall'inizio, e la vostra decisione 99 + ha finito per esserlo. 
100 + 101 + Fortunatamente, entrambe queste ragioni posso essere mitigate semplicemente 102 + ammettendo fin dal principio che non avete una cavolo di idea, dicendo 103 + agli altri in anticipo che la vostra decisione è puramente ipotetica, e che 104 + potrebbe essere sbagliata. Dovreste sempre riservarvi il diritto di cambiare 105 + la vostra opinione, e rendere gli altri ben **consapevoli** di ciò. 106 + Ed è molto più facile ammettere di essere stupidi quando non avete **ancora** 107 + fatto quella cosa stupida. 108 + 109 + Poi, quando è realmente emersa la vostra stupidità, le persone semplicemente 110 + roteeranno gli occhi e diranno "Uffa, no, ancora". 111 + 112 + Questa ammissione preventiva di incompetenza potrebbe anche portare le persone 113 + che stanno facendo il vero lavoro, a pensarci due volte. Dopo tutto, se 114 + **loro** non sono certi se sia una buona idea, voi, sicuro come la morte, 115 + non dovreste incoraggiarli promettendogli che ciò su cui stanno lavorando 116 + verrà incluso. Fate si che ci pensino due volte prima che si imbarchino in un 117 + grosso lavoro. 118 + 119 + Ricordate: loro devono sapere più cose sui dettagli rispetto a voi, e 120 + solitamente pensano di avere già la risposta a tutto. La miglior cosa che 121 + potete fare in qualità di dirigente è di non instillare troppa fiducia, ma 122 + invece fornire una salutare dose di pensiero critico su quanto stanno facendo. 123 + 124 + Comunque, un altro modo di evitare una decisione è quello di lamentarsi 125 + malinconicamente dicendo : "non possiamo farli entrambi e basta?" e con uno 126 + sguardo pietoso. Fidatevi, funziona. Se non è chiaro quale sia il miglior 127 + approccio, lo scopriranno. La risposta potrebbe essere data dal fatto che 128 + entrambe i gruppi di lavoro diventano frustati al punto di rinunciarvi. 
129 + 130 + Questo può suonare come un fallimento, ma di solito questo è un segno che 131 + c'era qualcosa che non andava in entrambe i progetti, e il motivo per 132 + il quale le persone coinvolte non abbiano potuto decidere era che entrambe 133 + sbagliavano. Voi ne uscirete freschi come una rosa, e avrete evitato un'altra 134 + decisione con la quale avreste potuto fregarvi. 135 + 136 + 137 + 2) Le persone 138 + ------------- 139 + 140 + Ci sono molte persone stupide, ed essere un dirigente significa che dovrete 141 + scendere a patti con questo, e molto più importate, che **loro** devono avere 142 + a che fare con **voi**. 143 + 144 + Ne emerge che mentre è facile annullare degli errori tecnici, non è invece 145 + così facile rimuovere i disordini della personalità. Dovrete semplicemente 146 + convivere con i loro, ed i vostri, problemi. 147 + 148 + Comunque, al fine di preparavi in qualità di dirigenti del kernel, è meglio 149 + ricordare di non abbattere alcun ponte, bombardare alcun paesano innocente, 150 + o escludere troppi sviluppatori kernel. Ne emerge che escludere le persone 151 + è piuttosto facile, mentre includerle nuovamente è difficile. Così 152 + "l'esclusione" immediatamente cade sotto il titolo di "non reversibile", e 153 + diviene un no-no secondo la sezione :ref:`it_decisions`. 154 + 155 + Esistono alcune semplici regole qui: 156 + 157 + (1) non chiamate le persone teste di c*** (al meno, non in pubblico) 158 + (2) imparate a scusarvi quando dimenticate la regola (1) 159 + 160 + Il problema del punto numero 1 è che è molto facile da rispettare, dato che 161 + è possibile dire "sei una testa di c***" in milioni di modi differenti [#f2]_, 162 + a volte senza nemmeno pensarci, e praticamente sempre con la calda convinzione 163 + di essere nel giusto. 
164 + 165 + E più convinti sarete che avete ragione (e diciamolo, potete chiamare 166 + praticamente **tutti** testa di c**, e spesso **sarete** nel giusto), più 167 + difficile sarà scusarvi successivamente. 168 + 169 + Per risolvere questo problema, avete due possibilità: 170 + 171 + - diventare davvero bravi nello scusarsi 172 + - essere amabili così che nessuno finirà col sentirsi preso di mira. Siate 173 + creativi abbastanza, e potrebbero esserne divertiti. 174 + 175 + L'opzione dell'essere immancabilmente educati non esiste proprio. Nessuno 176 + si fiderà di qualcuno che chiaramente sta nascondendo il suo vero carattere. 177 + 178 + .. [#f2] Paul Simon cantava: "50 modi per lasciare il vostro amante", perché, 179 + molto francamente, "Un milione di modi per dire ad uno sviluppatore 180 + Testa di c***" non avrebbe funzionato. Ma sono sicuro che ci abbia 181 + pensato. 182 + 183 + 184 + 3) Le persone II - quelle buone 185 + ------------------------------- 186 + 187 + Mentre emerge che la maggior parte delle persone sono stupide, il corollario 188 + a questo è il triste fatto che anche voi siete fra queste, e che mentre 189 + possiamo tutti crogiolarci nella sicurezza di essere migliori della media 190 + delle persone (diciamocelo, nessuno crede di essere nelle media o sotto di 191 + essa), dovremmo anche ammettere che non siamo il "coltello più affilato" del 192 + circondario, e che ci saranno altre persone che sono meno stupide di quanto 193 + lo siete voi. 194 + 195 + Molti reagiscono male davanti alle persone intelligenti. Altri le usano a 196 + proprio vantaggio. 197 + 198 + Assicuratevi che voi, in qualità di manutentori del kernel, siate nel secondo 199 + gruppo. Inchinatevi dinanzi a loro perché saranno le persone che vi renderanno 200 + il lavoro più facile. In particolare, prenderanno le decisioni per voi, che è 201 + l'oggetto di questo gioco. 202 + 203 + Quindi quando trovate qualcuno più sveglio di voi, prendetevela comoda. 
204 + Le vostre responsabilità dirigenziali si ridurranno in gran parte nel dire 205 + "Sembra una buona idea - Vai", oppure "Sembra buono, ma invece circa questo e 206 + quello?". La seconda versione in particolare è una gran modo per imparare 207 + qualcosa di nuovo circa "questo e quello" o di sembrare **extra** dirigenziali 208 + sottolineando qualcosa alla quale i più svegli non avevano pensato. In 209 + entrambe i casi, vincete. 210 + 211 + Una cosa alla quale dovete fare attenzione è che l'essere grandi in qualcosa 212 + non si traduce automaticamente nell'essere grandi anche in altre cose. Quindi 213 + dovreste dare una spintarella alle persone in una specifica direzione, ma 214 + diciamocelo, potrebbero essere bravi in ciò che fanno e far schifo in tutto 215 + il resto. La buona notizia è che le persone tendono a gravitare attorno a ciò 216 + in cui sono bravi, quindi non state facendo nulla di irreversibile quando li 217 + spingete verso una certa direzione, solo non spingete troppo. 218 + 219 + 220 + 4) Addossare le colpe 221 + --------------------- 222 + 223 + Le cose andranno male, e le persone vogliono qualcuno da incolpare. Sarete voi. 224 + 225 + Non è poi così difficile accettare la colpa, specialmente se le persone 226 + riescono a capire che non era **tutta** colpa vostra. Il che ci porta 227 + sulla miglior strada per assumersi la colpa: fatelo per qualcun'altro. 228 + Vi sentirete bene nel assumervi la responsabilità, e loro si sentiranno 229 + bene nel non essere incolpati, e coloro che hanno perso i loro 36GB di 230 + pornografia a causa della vostra incompetenza ammetteranno a malincuore che 231 + almeno non avete cercato di fare il furbetto. 232 + 233 + Successivamente fate in modo che gli sviluppatori che in realtà hanno fallito 234 + (se riuscite a trovarli) sappiano **in privato** che sono "fottuti". 235 + Questo non per fargli sapere che la prossima volta possono evitarselo ma per 236 + fargli capire che sono in debito. 
E, forse cosa più importante, sono loro che 237 + devono sistemare la cosa. Perché, ammettiamolo, è sicuro non sarete voi a 238 + farlo. 239 + 240 + Assumersi la colpa è anche ciò che vi rendere dirigenti in prima battuta. 241 + È parte di ciò che spinge gli altri a fidarsi di voi, e vi garantisce 242 + la gloria potenziale, perché siete gli unici a dire "Ho fatto una cavolata". 243 + E se avete seguito le regole precedenti, sarete decisamente bravi nel dirlo. 244 + 245 + 246 + 5) Le cose da evitare 247 + --------------------- 248 + 249 + Esiste una cosa che le persone odiano più che essere chiamate "teste di c****", 250 + ed è essere chiamate "teste di c****" con fare da bigotto. Se per il primo 251 + caso potrete comunque scusarvi, per il secondo non ve ne verrà data nemmeno 252 + l'opportunità. Probabilmente smetteranno di ascoltarvi anche se tutto sommato 253 + state svolgendo un buon lavoro. 254 + 255 + Tutti crediamo di essere migliori degli altri, il che significa che quando 256 + qualcuno inizia a darsi delle arie, ci da **davvero** fastidio. Potreste anche 257 + essere moralmente ed intellettualmente superiore a tutti quelli attorno a voi, 258 + ma non cercate di renderlo ovvio per gli altri a meno che non **vogliate** 259 + veramente far arrabbiare qualcuno [#f3]_. 260 + 261 + Allo stesso modo evitate di essere troppo gentili e pacati. Le buone maniere 262 + facilmente finiscono per strabordare e nascondere i problemi, e come si usa 263 + dire, "su internet nessuno può sentire la vostra pacatezza". Usate argomenti 264 + diretti per farvi capire, non potete sperare che la gente capisca in altro 265 + modo. 266 + 267 + Un po' di umorismo può aiutare a smorzare sia la franchezza che la moralità. 268 + Andare oltre i limiti al punto d'essere ridicolo può portare dei punti a casa 269 + senza renderlo spiacevole per i riceventi, i quali penseranno che stavate 270 + facendo gli scemi. 
Può anche aiutare a lasciare andare quei blocchi mentali 271 + che abbiamo nei confronti delle critiche. 272 + 273 + .. [#f3] Suggerimento: i forum di discussione su internet, che non sono 274 + collegati col vostro lavoro, sono ottimi modi per sfogare la frustrazione 275 + verso altre persone. Di tanto in tanto scrivete messaggi offensivi col ghigno 276 + in faccia per infiammare qualche discussione: vi sentirete purificati. Solo 277 + cercate di non cagare troppo vicino a casa. 278 + 279 + 6) Perché io? 280 + ------------- 281 + 282 + Dato che la vostra responsabilità principale è quella di prendervi le colpe 283 + d'altri, e rendere dolorosamente ovvio a tutti che siete degli incompetenti, 284 + la domanda naturale che ne segue sarà : perché dovrei fare tutto ciò? 285 + 286 + Innanzitutto, potreste diventare o no popolari al punto da avere la fila di 287 + ragazzine (o ragazzini, evitiamo pregiudizi o sessismo) che gridano e bussano 288 + alla porta del vostro camerino, ma comunque **proverete** un immenso senso di 289 + realizzazione personale dall'essere "in carica". Dimenticate il fatto che voi 290 + state discutendo con tutti e che cercate di inseguirli il più velocemente che 291 + potete. Tutti continueranno a pensare che voi siete la persona in carica. 292 + 293 + È un bel lavoro se riuscite ad adattarlo a voi.

+1 -1

Documentation/translations/it_IT/process/submit-checklist.rst

··· 117 117 sorgenti che ne spieghi la logica: cosa fanno e perché. 118 118 119 119 25) Se la patch aggiunge nuove chiamate ioctl, allora aggiornate 120 - ``Documentation/ioctl/ioctl-number.rst``. 120 + ``Documentation/userspace-api/ioctl/ioctl-number.rst``. 121 121 122 122 26) Se il codice che avete modificato dipende o usa una qualsiasi interfaccia o 123 123 funzionalità del kernel che è associata a uno dei seguenti simboli

+40

Documentation/translations/it_IT/riscv/patch-acceptance.rst

··· 1 + .. include:: ../disclaimer-ita.rst 2 + 3 + :Original: :doc:`../../../riscv/patch-acceptance` 4 + :Translator: Federico Vaga <federico.vaga@vaga.pv.it> 5 + 6 + arch/riscv linee guida alla manutenzione per gli sviluppatori 7 + ============================================================= 8 + 9 + Introduzione 10 + ------------ 11 + 12 + L'insieme di istruzioni RISC-V sono sviluppate in modo aperto: le 13 + bozze in fase di sviluppo sono disponibili a tutti per essere 14 + revisionate e per essere sperimentare nelle implementazioni. Le bozze 15 + dei nuovi moduli o estensioni possono cambiare in fase di sviluppo - a 16 + volte in modo incompatibile rispetto a bozze precedenti. Questa 17 + flessibilità può portare a dei problemi di manutenzioni per il 18 + supporto RISC-V nel kernel Linux. I manutentori Linux non amano 19 + l'abbandono del codice, e il processo di sviluppo del kernel 20 + preferisce codice ben revisionato e testato rispetto a quello 21 + sperimentale. Desideriamo estendere questi stessi principi al codice 22 + relativo all'architettura RISC-V che verrà accettato per l'inclusione 23 + nel kernel. 24 + 25 + In aggiunta alla lista delle verifiche da fare prima di inviare una patch 26 + ------------------------------------------------------------------------- 27 + 28 + Accetteremo le patch per un nuovo modulo o estensione se la fondazione 29 + RISC-V li classifica come "Frozen" o "Retified". (Ovviamente, gli 30 + sviluppatori sono liberi di mantenere una copia del kernel Linux 31 + contenente il codice per una bozza di estensione). 32 + 33 + In aggiunta, la specifica RISC-V permette agli implementatori di 34 + creare le proprie estensioni. Queste estensioni non passano 35 + attraverso il processo di revisione della fondazione RISC-V. Per 36 + questo motivo, al fine di evitare complicazioni o problemi di 37 + prestazioni, accetteremo patch solo per quelle estensioni che sono 38 + state ufficialmente accettate dalla fondazione RISC-V. 
(Ovviamente, 39 + gli implementatori sono liberi di mantenere una copia del kernel Linux 40 + contenente il codice per queste specifiche estensioni).

+1 -1

Documentation/translations/ko_KR/memory-barriers.txt

··· 641 641 리눅스 커널이 지원하는 CPU 들은 (1) 쓰기가 정말로 일어날지, (2) 쓰기가 어디에 642 642 이루어질지, 그리고 (3) 쓰여질 값을 확실히 알기 전까지는 쓰기를 수행하지 않기 643 643 때문입니다. 하지만 "컨트롤 의존성" 섹션과 644 - Documentation/RCU/rcu_dereference.txt 파일을 주의 깊게 읽어 주시기 바랍니다: 644 + Documentation/RCU/rcu_dereference.rst 파일을 주의 깊게 읽어 주시기 바랍니다: 645 645 컴파일러는 매우 창의적인 많은 방법으로 종속성을 깰 수 있습니다. 646 646 647 647 CPU 1 CPU 2

+2 -2

Documentation/translations/zh_CN/IRQ.txt

··· 1 - Chinese translated version of Documentation/IRQ.txt 1 + Chinese translated version of Documentation/core-api/irq/index.rst 2 2 3 3 If you have any comment or update to the content, please contact the 4 4 original document maintainer directly. However, if you have a problem ··· 9 9 Maintainer: Eric W. Biederman <ebiederman@xmission.com> 10 10 Chinese maintainer: Fu Wei <tekkamanninja@gmail.com> 11 11 --------------------------------------------------------------------- 12 - Documentation/IRQ.txt 的中文翻译 12 + Documentation/core-api/irq/index.rst 的中文翻译 13 13 14 14 如果想评论或更新本文的内容，请直接联系原文档的维护者。如果你使用英文 15 15 交流有困难的话，也可以向中文版维护者求助。如果本翻译更新不及时或者翻

+221

Documentation/translations/zh_CN/filesystems/debugfs.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + .. include:: ../disclaimer-zh_CN.rst 4 + 5 + :Original: :ref:`Documentation/filesystems/debugfs.txt <debugfs_index>` 6 + 7 + ======= 8 + Debugfs 9 + ======= 10 + 11 + 译者 12 + :: 13 + 14 + 中文版维护者：罗楚成 Chucheng Luo <luochucheng@vivo.com> 15 + 中文版翻译者：罗楚成 Chucheng Luo <luochucheng@vivo.com> 16 + 中文版校译者: 罗楚成 Chucheng Luo <luochucheng@vivo.com> 17 + 18 + 19 + 20 + 版权所有2020 罗楚成 <luochucheng@vivo.com> 21 + 22 + 23 + Debugfs是内核开发人员在用户空间获取信息的简单方法。与/proc不同，proc只提供进程 24 + 信息。也不像sysfs,具有严格的“每个文件一个值“的规则。debugfs根本没有规则,开发 25 + 人员可以在这里放置他们想要的任何信息。debugfs文件系统也不能用作稳定的ABI接口。 26 + 从理论上讲，debugfs导出文件的时候没有任何约束。但是[1]实际情况并不总是那么 27 + 简单。即使是debugfs接口，也最好根据需要进行设计,并尽量保持接口不变。 28 + 29 + 30 + Debugfs通常使用以下命令安装:: 31 + 32 + mount -t debugfs none /sys/kernel/debug 33 + 34 + （或等效的/etc/fstab行）。 35 + debugfs根目录默认仅可由root用户访问。要更改对文件树的访问，请使用“ uid”，“ gid” 36 + 和“ mode”挂载选项。请注意，debugfs API仅按照GPL协议导出到模块。 37 + 38 + 使用debugfs的代码应包含<linux/debugfs.h>。然后，首先是创建至少一个目录来保存 39 + 一组debugfs文件:: 40 + 41 + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); 42 + 43 + 如果成功，此调用将在指定的父目录下创建一个名为name的目录。如果parent参数为空， 44 + 则会在debugfs根目录中创建。创建目录成功时，返回值是一个指向dentry结构体的指针。 45 + 该dentry结构体的指针可用于在目录中创建文件（以及最后将其清理干净）。ERR_PTR 46 + （-ERROR）返回值表明出错。如果返回ERR_PTR（-ENODEV），则表明内核是在没有debugfs 47 + 支持的情况下构建的，并且下述函数都不会起作用。 48 + 49 + 在debugfs目录中创建文件的最通用方法是:: 50 + 51 + struct dentry *debugfs_create_file(const char *name, umode_t mode, 52 + struct dentry *parent, void *data, 53 + const struct file_operations *fops); 54 + 55 + 在这里，name是要创建的文件的名称，mode描述了访问文件应具有的权限，parent指向 56 + 应该保存文件的目录，data将存储在产生的inode结构体的i_private字段中，而fops是 57 + 一组文件操作函数，这些函数中实现文件操作的具体行为。至少，read（）和/或 58 + write（）操作应提供；其他可以根据需要包括在内。同样的，返回值将是指向创建文件 59 + 的dentry指针，错误时返回ERR_PTR（-ERROR），系统不支持debugfs时返回值为ERR_PTR 60 + （-ENODEV）。创建一个初始大小的文件，可以使用以下函数代替:: 61 + 62 + struct dentry *debugfs_create_file_size(const char *name, umode_t mode, 63 + struct dentry *parent, void *data, 64 + const struct file_operations *fops, 65 + 
loff_t file_size); 66 + 67 + file_size是初始文件大小。其他参数跟函数debugfs_create_file的相同。 68 + 69 + 在许多情况下，没必要自己去创建一组文件操作;对于一些简单的情况,debugfs代码提供 70 + 了许多帮助函数。包含单个整数值的文件可以使用以下任何一项创建:: 71 + 72 + void debugfs_create_u8(const char *name, umode_t mode, 73 + struct dentry *parent, u8 *value); 74 + void debugfs_create_u16(const char *name, umode_t mode, 75 + struct dentry *parent, u16 *value); 76 + struct dentry *debugfs_create_u32(const char *name, umode_t mode, 77 + struct dentry *parent, u32 *value); 78 + void debugfs_create_u64(const char *name, umode_t mode, 79 + struct dentry *parent, u64 *value); 80 + 81 + 这些文件支持读取和写入给定值。如果某个文件不支持写入，只需根据需要设置mode 82 + 参数位。这些文件中的值以十进制表示；如果需要使用十六进制，可以使用以下函数 83 + 替代:: 84 + 85 + void debugfs_create_x8(const char *name, umode_t mode, 86 + struct dentry *parent, u8 *value); 87 + void debugfs_create_x16(const char *name, umode_t mode, 88 + struct dentry *parent, u16 *value); 89 + void debugfs_create_x32(const char *name, umode_t mode, 90 + struct dentry *parent, u32 *value); 91 + void debugfs_create_x64(const char *name, umode_t mode, 92 + struct dentry *parent, u64 *value); 93 + 94 + 这些功能只有在开发人员知道导出值的大小的时候才有用。某些数据类型在不同的架构上 95 + 有不同的宽度，这样会使情况变得有些复杂。在这种特殊情况下可以使用以下函数:: 96 + 97 + void debugfs_create_size_t(const char *name, umode_t mode, 98 + struct dentry *parent, size_t *value); 99 + 100 + 不出所料，此函数将创建一个debugfs文件来表示类型为size_t的变量。 101 + 102 + 同样地，也有导出无符号长整型变量的函数，分别以十进制和十六进制表示如下:: 103 + 104 + struct dentry *debugfs_create_ulong(const char *name, umode_t mode, 105 + struct dentry *parent, 106 + unsigned long *value); 107 + void debugfs_create_xul(const char *name, umode_t mode, 108 + struct dentry *parent, unsigned long *value); 109 + 110 + 布尔值可以通过以下方式放置在debugfs中:: 111 + 112 + struct dentry *debugfs_create_bool(const char *name, umode_t mode, 113 + struct dentry *parent, bool *value); 114 + 115 + 116 + 读取结果文件将产生Y（对于非零值）或N，后跟换行符写入的时候，它只接受大写或小写 117 + 值或1或0。任何其他输入将被忽略。 118 + 119 + 同样，atomic_t类型的值也可以放置在debugfs中:: 120 + 121 + void debugfs_create_atomic_t(const 
char *name, umode_t mode, 122 + struct dentry *parent, atomic_t *value) 123 + 124 + 读取此文件将获得atomic_t值，写入此文件将设置atomic_t值。 125 + 126 + 另一个选择是通过以下结构体和函数导出一个任意二进制数据块:: 127 + 128 + struct debugfs_blob_wrapper { 129 + void *data; 130 + unsigned long size; 131 + }; 132 + 133 + struct dentry *debugfs_create_blob(const char *name, umode_t mode, 134 + struct dentry *parent, 135 + struct debugfs_blob_wrapper *blob); 136 + 137 + 读取此文件将返回由指针指向debugfs_blob_wrapper结构体的数据。一些驱动使用“blobs” 138 + 作为一种返回几行（静态）格式化文本的简单方法。这个函数可用于导出二进制信息，但 139 + 似乎在主线中没有任何代码这样做。请注意，使用debugfs_create_blob（）命令创建的 140 + 所有文件是只读的。 141 + 142 + 如果您要转储一个寄存器块（在开发过程中经常会这么做，但是这样的调试代码很少上传 143 + 到主线中。Debugfs提供两个函数：一个用于创建仅寄存器文件，另一个把一个寄存器块 144 + 插入一个顺序文件中:: 145 + 146 + struct debugfs_reg32 { 147 + char *name; 148 + unsigned long offset; 149 + }; 150 + 151 + struct debugfs_regset32 { 152 + struct debugfs_reg32 *regs; 153 + int nregs; 154 + void __iomem *base; 155 + }; 156 + 157 + struct dentry *debugfs_create_regset32(const char *name, umode_t mode, 158 + struct dentry *parent, 159 + struct debugfs_regset32 *regset); 160 + 161 + void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, 162 + int nregs, void __iomem *base, char *prefix); 163 + 164 + “base”参数可能为0，但您可能需要使用__stringify构建reg32数组，实际上有许多寄存器 165 + 名称（宏）是寄存器块在基址上的字节偏移量。 166 + 167 + 如果要在debugfs中转储u32数组，可以使用以下函数创建文件:: 168 + 169 + void debugfs_create_u32_array(const char *name, umode_t mode, 170 + struct dentry *parent, 171 + u32 *array, u32 elements); 172 + 173 + “array”参数提供数据，而“elements”参数为数组中元素的数量。注意：数组创建后，数组 174 + 大小无法更改。 175 + 176 + 有一个函数来创建与设备相关的seq_file:: 177 + 178 + struct dentry *debugfs_create_devm_seqfile(struct device *dev, 179 + const char *name, 180 + struct dentry *parent, 181 + int (*read_fn)(struct seq_file *s, 182 + void *data)); 183 + 184 + “dev”参数是与此debugfs文件相关的设备，并且“read_fn”是一个函数指针，这个函数在 185 + 打印seq_file内容的时候被回调。 186 + 187 + 还有一些其他的面向目录的函数:: 188 + 189 + struct dentry *debugfs_rename(struct dentry *old_dir, 190 + struct dentry 
*old_dentry, 191 + struct dentry *new_dir, 192 + const char *new_name); 193 + 194 + struct dentry *debugfs_create_symlink(const char *name, 195 + struct dentry *parent, 196 + const char *target); 197 + 198 + 调用debugfs_rename()将为现有的debugfs文件重命名，可能同时切换目录。 new_name 199 + 函数调用之前不能存在；返回值为old_dentry，其中包含更新的信息。可以使用 200 + debugfs_create_symlink（）创建符号链接。 201 + 202 + 所有debugfs用户必须考虑的一件事是： 203 + 204 + debugfs不会自动清除在其中创建的任何目录。如果一个模块在不显式删除debugfs目录的 205 + 情况下卸载模块，结果将会遗留很多野指针，从而导致系统不稳定。因此，所有debugfs 206 + 用户-至少是那些可以作为模块构建的用户-必须做模块卸载的时候准备删除在此创建的 207 + 所有文件和目录。一份文件可以通过以下方式删除:: 208 + 209 + void debugfs_remove(struct dentry *dentry); 210 + 211 + dentry值可以为NULL或错误值，在这种情况下，不会有任何文件被删除。 212 + 213 + 很久以前，内核开发者使用debugfs时需要记录他们创建的每个dentry指针，以便最后所有 214 + 文件都可以被清理掉。但是，现在debugfs用户能调用以下函数递归清除之前创建的文件:: 215 + 216 + void debugfs_remove_recursive(struct dentry *dentry); 217 + 218 + 如果将对应顶层目录的dentry传递给以上函数，则该目录下的整个层次结构将会被删除。 219 + 220 + 注释： 221 + [1] http://lwn.net/Articles/309298/

+1

Documentation/translations/zh_CN/filesystems/index.rst

··· 24 24 :maxdepth: 2 25 25 26 26 virtiofs 27 + debugfs 27 28

+4 -4

Documentation/translations/zh_CN/filesystems/sysfs.txt

··· 1 - Chinese translated version of Documentation/filesystems/sysfs.txt 1 + Chinese translated version of Documentation/filesystems/sysfs.rst 2 2 3 3 If you have any comment or update to the content, please contact the 4 4 original document maintainer directly. However, if you have a problem ··· 10 10 Mike Murphy <mamurph@cs.clemson.edu> 11 11 Chinese maintainer: Fu Wei <tekkamanninja@gmail.com> 12 12 --------------------------------------------------------------------- 13 - Documentation/filesystems/sysfs.txt 的中文翻译 13 + Documentation/filesystems/sysfs.rst 的中文翻译 14 14 15 15 如果想评论或更新本文的内容，请直接联系原文档的维护者。如果你使用英文 16 16 交流有困难的话，也可以向中文版维护者求助。如果本翻译更新不及时或者翻 ··· 40 40 数据结构及其属性，以及它们之间的关联到用户空间的方法。 41 41 42 42 sysfs 始终与 kobject 的底层结构紧密相关。请阅读 43 - Documentation/kobject.txt 文档以获得更多关于 kobject 接口的 43 + Documentation/core-api/kobject.rst 文档以获得更多关于 kobject 接口的 44 44 信息。 45 45 46 46 ··· 281 281 假定驱动没有跨越多个总线类型)。 282 282 283 283 fs/ 包含了一个为文件系统设立的目录。现在每个想要导出属性的文件系统必须 284 - 在 fs/ 下创建自己的层次结构(参见Documentation/filesystems/fuse.txt)。 284 + 在 fs/ 下创建自己的层次结构(参见Documentation/filesystems/fuse.rst)。 285 285 286 286 dev/ 包含两个子目录： char/ 和 block/。在这两个子目录中，有以 287 287 <major>:<minor> 格式命名的符号链接。这些符号链接指向 sysfs 目录

+1 -1

Documentation/translations/zh_CN/process/submit-checklist.rst

··· 97 97 24) 所有内存屏障例如 ``barrier()``, ``rmb()``, ``wmb()`` 都需要源代码中的注 98 98 释来解释它们正在执行的操作及其原因的逻辑。 99 99 100 - 25) 如果补丁添加了任何ioctl，那么也要更新 ``Documentation/ioctl/ioctl-number.rst`` 100 + 25) 如果补丁添加了任何ioctl，那么也要更新 ``Documentation/userspace-api/ioctl/ioctl-number.rst`` 101 101 102 102 26) 如果修改后的源代码依赖或使用与以下 ``Kconfig`` 符号相关的任何内核API或 103 103 功能，则在禁用相关 ``Kconfig`` 符号和/或 ``=m`` （如果该选项可用）的情况

+1 -1

Documentation/translations/zh_CN/video4linux/v4l2-framework.txt

··· 488 488 489 489 这个函数会加载给定的模块（如果没有模块需要加载，可以为 NULL）， 490 490 并用给定的 i2c 适配器结构体指针（i2c_adapter）和器件地址（chip/address） 491 - 作为参数调用 i2c_new_device()。如果一切顺利，则就在 v4l2_device 491 + 作为参数调用 i2c_new_client_device()。如果一切顺利，则就在 v4l2_device 492 492 中注册了子设备。 493 493 494 494 你也可以利用 v4l2_i2c_new_subdev()的最后一个参数，传递一个可能的

Documentation/unaligned-memory-access.txt → Documentation/process/unaligned-memory-access.rst

+2 -2

Documentation/usb/gadget_configfs.rst

··· 24 24 Creating a gadget means deciding what configurations there will be 25 25 and which functions each configuration will provide. 26 26 27 - Configfs (please see `Documentation/filesystems/configfs/*`) lends itself nicely 27 + Configfs (please see `Documentation/filesystems/configfs.rst`) lends itself nicely 28 28 for the purpose of telling the kernel about the above mentioned decision. 29 29 This document is about how to do it. 30 30 ··· 354 354 a number of its default sub-groups created automatically. 355 355 356 356 For more information on configfs please see 357 - `Documentation/filesystems/configfs/*`. 357 + `Documentation/filesystems/configfs.rst`. 358 358 359 359 The concepts described above translate to USB gadgets like this: 360 360

+1

Documentation/userspace-api/ioctl/ioctl-number.rst

··· 146 146 'H' 40-4F sound/hdspm.h conflict! 147 147 'H' 40-4F sound/hdsp.h conflict! 148 148 'H' 90 sound/usb/usx2y/usb_stream.h 149 + 'H' 00-0F uapi/misc/habanalabs.h conflict! 149 150 'H' A0 uapi/linux/usb/cdc-wdm.h 150 151 'H' C0-F0 net/bluetooth/hci.h conflict! 151 152 'H' C0-DF net/bluetooth/hidp/hidp.h conflict!
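The "conflict!" marker in the table above means that two headers claim overlapping command numbers under the same ioctl type letter. As a minimal sketch of why that matters, the snippet below builds three command numbers with the standard `_IO()` macro and compares their type and number fields with `_IOC_TYPE()` and `_IOC_NR()`. The `DEMO_*` names and the helper function are hypothetical and do not come from any of the listed headers; only the macros from `<sys/ioctl.h>` are real, and the sketch assumes a Linux userspace build.

```c
#include <stdbool.h>
#include <sys/ioctl.h>

/*
 * Hypothetical command numbers, for illustration only. Two imaginary
 * drivers both pick type 'H' and number 0x05; a third uses 0x90.
 */
#define DEMO_A_CMD _IO('H', 0x05)
#define DEMO_B_CMD _IO('H', 0x05)
#define DEMO_C_CMD _IO('H', 0x90)

/*
 * Return true when two ioctl command numbers share both the type
 * letter and the sequence number, i.e. when they fall in the same
 * slot of the ioctl-number table.
 */
static bool demo_same_type_and_nr(unsigned long a, unsigned long b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}
```

Here `DEMO_A_CMD` and `DEMO_B_CMD` land in the same slot, while `DEMO_C_CMD` does not. Two drivers whose commands collide this way can still be told apart by the device node they are issued on, which is presumably why the table only flags such overlaps rather than forbidding them.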

+1 -1

Documentation/virt/kvm/amd-memory-encryption.rst

··· 74 74 device, if needed (see individual commands). 75 75 76 76 On output, ``error`` is zero on success, or an error code. Error codes 77 - are defined in ``<linux/psp-dev.h>`. 77 + are defined in ``<linux/psp-dev.h>``. 78 78 79 79 KVM implements the following commands to support common lifecycle events of SEV 80 80 guests, such as launching, running, snapshotting, migrating and decommissioning.

+7 -5

Documentation/virt/kvm/api.rst

··· 2572 2572 :Parameters: None 2573 2573 :Returns: 0 on success, -1 on error 2574 2574 2575 - This signals to the host kernel that the specified guest is being paused by 2576 - userspace. The host will set a flag in the pvclock structure that is checked 2577 - from the soft lockup watchdog. The flag is part of the pvclock structure that 2578 - is shared between guest and host, specifically the second bit of the flags 2575 + This ioctl sets a flag accessible to the guest indicating that the specified 2576 + vCPU has been paused by the host userspace. 2577 + 2578 + The host will set a flag in the pvclock structure that is checked from the 2579 + soft lockup watchdog. The flag is part of the pvclock structure that is 2580 + shared between guest and host, specifically the second bit of the flags 2579 2581 field of the pvclock_vcpu_time_info structure. It will be set exclusively by 2580 2582 the host and read/cleared exclusively by the guest. The guest operation of 2581 - checking and clearing the flag must an atomic operation so 2583 + checking and clearing the flag must be an atomic operation so 2582 2584 load-link/store-conditional, or equivalent must be used. There are two cases 2583 2585 where the guest will clear the flag: when the soft lockup watchdog timer resets 2584 2586 itself or when a soft lockup is detected. This ioctl can be called any time
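The requirement that the guest check and clear the flag in one atomic operation can be sketched in portable C11 as a fetch-and-AND on the flags byte. This is an illustrative userspace sketch, not the kernel's guest code: the macro and function names are hypothetical, and it assumes that "the second bit of the flags field" means bit 1 of an 8-bit field.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "second bit" is bit 1. The name is hypothetical. */
#define DEMO_GUEST_STOPPED (1u << 1)

/*
 * Atomically clear the "guest stopped" bit and report whether it was
 * set beforehand. A single read-modify-write means a concurrent set
 * by the host cannot be lost between the check and the clear.
 */
static bool demo_test_and_clear_stopped(_Atomic uint8_t *flags)
{
	uint8_t old = atomic_fetch_and(flags, (uint8_t)~DEMO_GUEST_STOPPED);

	return (old & DEMO_GUEST_STOPPED) != 0;
}
```

A separate load followed by a store would leave a window in which the host could set the bit again and have it silently cleared. On architectures without a native atomic AND, the compiler typically lowers this call to a load-link/store-conditional loop, which matches the wording in the text above.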

+1 -1

Documentation/virt/kvm/arm/pvtime.rst

··· 76 76 these structures and not used for other purposes, this enables the guest to map 77 77 the region using 64k pages and avoids conflicting attributes with other memory. 78 78 79 - For the user space interface see Documentation/virt/kvm/devices/vcpu.txt 79 + For the user space interface see Documentation/virt/kvm/devices/vcpu.rst 80 80 section "3. GROUP: KVM_ARM_VCPU_PVTIME_CTRL".

+1 -1

Documentation/virt/kvm/devices/vcpu.rst

··· 110 110 111 111 Specifies the base address of the stolen time structure for this VCPU. The 112 112 base address must be 64 byte aligned and exist within a valid guest memory 113 - region. See Documentation/virt/kvm/arm/pvtime.txt for more information 113 + region. See Documentation/virt/kvm/arm/pvtime.rst for more information 114 114 including the layout of the stolen time structure.

+2 -2

Documentation/virt/kvm/hypercalls.rst

··· 22 22 number in R1. 23 23 24 24 For further information on the S390 diagnose call as supported by KVM, 25 - refer to Documentation/virt/kvm/s390-diag.txt. 25 + refer to Documentation/virt/kvm/s390-diag.rst. 26 26 27 27 PowerPC: 28 28 It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers. ··· 30 30 31 31 KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions' 32 32 property inside the device tree's /hypervisor node. 33 - For more information refer to Documentation/virt/kvm/ppc-pv.txt 33 + For more information refer to Documentation/virt/kvm/ppc-pv.rst 34 34 35 35 MIPS: 36 36 KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall

+1 -1

Documentation/virt/kvm/mmu.rst

··· 319 319 320 320 - If both P bit and R/W bit of error code are set, this could possibly 321 321 be handled as a "fast page fault" (fixed without taking the MMU lock). See 322 - the description in Documentation/virt/kvm/locking.txt. 322 + the description in Documentation/virt/kvm/locking.rst. 323 323 324 324 - if needed, walk the guest page tables to determine the guest translation 325 325 (gva->gpa or ngpa->gpa)

+1 -1

Documentation/virt/kvm/review-checklist.rst

··· 10 10 2. Patches should be against kvm.git master branch. 11 11 12 12 3. If the patch introduces or modifies a new userspace API: 13 - - the API must be documented in Documentation/virt/kvm/api.txt 13 + - the API must be documented in Documentation/virt/kvm/api.rst 14 14 - the API must be discoverable using KVM_CHECK_EXTENSION 15 15 16 16 4. New state must include support for save/restore.

+1

Documentation/vm/index.rst

··· 31 31 active_mm 32 32 balance 33 33 cleancache 34 + free_page_reporting 34 35 frontswap 35 36 highmem 36 37 hmm

+1 -1

Documentation/vm/page_frags.rst

··· 26 26 27 27 The network stack uses two separate caches per CPU to handle fragment 28 28 allocation. The netdev_alloc_cache is used by callers making use of the 29 - __netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is 29 + netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is 30 30 used by callers of the __napi_alloc_frag and __napi_alloc_skb calls. The 31 31 main difference between these two calls is the context in which they may be 32 32 called. The "netdev" prefixed functions are usable in any context as these

+2 -2

Documentation/vm/zswap.rst

··· 140 140 special parameter has been introduced to implement a sort of hysteresis to 141 141 refuse taking pages into zswap pool until it has sufficient space if the limit 142 142 has been hit. To set the threshold at which zswap would start accepting pages 143 - again after it became full, use the sysfs ``accept_threhsold_percent`` 143 + again after it became full, use the sysfs ``accept_threshold_percent`` 144 144 attribute, e. g.:: 145 145 146 - echo 80 > /sys/module/zswap/parameters/accept_threhsold_percent 146 + echo 80 > /sys/module/zswap/parameters/accept_threshold_percent 147 147 148 148 Setting this parameter to 100 will disable the hysteresis. 149 149

+2 -2

Documentation/watchdog/convert_drivers_to_kernel_api.rst

··· 2 2 Converting old watchdog drivers to the watchdog framework 3 3 ========================================================= 4 4 5 - by Wolfram Sang <w.sang@pengutronix.de> 5 + by Wolfram Sang <wsa@kernel.org> 6 6 7 7 Before the watchdog framework came into the kernel, every driver had to 8 8 implement the API on its own. Now, as the framework factored out the common ··· 115 115 --------------------------- 116 116 117 117 All possible callbacks are defined in 'struct watchdog_ops'. You can find it 118 - explained in 'watchdog-kernel-api.txt' in this directory. start(), stop() and 118 + explained in 'watchdog-kernel-api.txt' in this directory. start() and 119 119 owner must be set, the rest are optional. You will easily find corresponding 120 120 functions in the old driver. Note that you will now get a pointer to the 121 121 watchdog_device as a parameter to these functions, so you probably have to

+1 -1

Documentation/watchdog/watchdog-kernel-api.rst

··· 123 123 struct module *owner; 124 124 /* mandatory operations */ 125 125 int (*start)(struct watchdog_device *); 126 - int (*stop)(struct watchdog_device *); 127 126 /* optional operations */ 127 + int (*stop)(struct watchdog_device *); 128 128 int (*ping)(struct watchdog_device *); 129 129 unsigned int (*status)(struct watchdog_device *); 130 130 int (*set_timeout)(struct watchdog_device *, unsigned int);

+1 -1

Documentation/x86/x86_64/uefi.rst

··· 36 36 37 37 elilo bootloader with x86_64 support, elilo configuration file, 38 38 kernel image built in first step and corresponding 39 - initrd. Instructions on building elilo and its dependencies 39 + initrd. Instructions on building elilo and its dependencies 40 40 can be found in the elilo sourceforge project. 41 41 42 42 - Boot to EFI shell and invoke elilo choosing the kernel image built

+14 -14

MAINTAINERS

··· 3742 3742 M: David Howells <dhowells@redhat.com> 3743 3743 L: linux-cachefs@redhat.com (moderated for non-subscribers) 3744 3744 S: Supported 3745 - F: Documentation/filesystems/caching/cachefiles.txt 3745 + F: Documentation/filesystems/caching/cachefiles.rst 3746 3746 F: fs/cachefiles/ 3747 3747 3748 3748 CADENCE MIPI-CSI2 BRIDGES ··· 4219 4219 L: codalist@coda.cs.cmu.edu 4220 4220 S: Maintained 4221 4221 W: http://www.coda.cs.cmu.edu/ 4222 - F: Documentation/filesystems/coda.txt 4222 + F: Documentation/filesystems/coda.rst 4223 4223 F: fs/coda/ 4224 4224 F: include/linux/coda*.h 4225 4225 F: include/uapi/linux/coda*.h ··· 5012 5012 R: Amir Goldstein <amir73il@gmail.com> 5013 5013 L: linux-fsdevel@vger.kernel.org 5014 5014 S: Maintained 5015 - F: Documentation/filesystems/dnotify.txt 5015 + F: Documentation/filesystems/dnotify.rst 5016 5016 F: fs/notify/dnotify/ 5017 5017 F: include/linux/dnotify.h 5018 5018 ··· 5026 5026 DISKQUOTA 5027 5027 M: Jan Kara <jack@suse.com> 5028 5028 S: Maintained 5029 - F: Documentation/filesystems/quota.txt 5029 + F: Documentation/filesystems/quota.rst 5030 5030 F: fs/quota/ 5031 5031 F: include/linux/quota*.h 5032 5032 F: include/uapi/linux/quota*.h ··· 7040 7040 L: linux-kernel@vger.kernel.org 7041 7041 S: Maintained 7042 7042 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core 7043 - F: Documentation/*futex* 7043 + F: Documentation/locking/*futex* 7044 7044 F: include/asm-generic/futex.h 7045 7045 F: include/linux/futex.h 7046 7046 F: include/uapi/linux/futex.h 7047 7047 F: kernel/futex.c 7048 7048 F: tools/perf/bench/futex* 7049 - F: tools/testing/selftests/futex/ 7049 + F: Documentation/locking/*futex* 7050 7050 7051 7051 GATEWORKS SYSTEM CONTROLLER (GSC) DRIVER 7052 7052 M: Tim Harvey <tharvey@gateworks.com> ··· 7527 7527 S: Maintained 7528 7528 T: git git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc.git hwspinlock-next 7529 7529 F: Documentation/devicetree/bindings/hwlock/ 7530 - 
F: Documentation/hwspinlock.txt 7530 + F: Documentation/locking/hwspinlock.rst 7531 7531 F: drivers/hwspinlock/ 7532 7532 F: include/linux/hwspinlock.h 7533 7533 ··· 8934 8934 L: openipmi-developer@lists.sourceforge.net (moderated for non-subscribers) 8935 8935 S: Supported 8936 8936 W: http://openipmi.sourceforge.net/ 8937 - F: Documentation/IPMI.txt 8937 + F: Documentation/driver-api/ipmi.rst 8938 8938 F: Documentation/devicetree/bindings/ipmi/ 8939 8939 F: drivers/char/ipmi/ 8940 8940 F: include/linux/ipmi* ··· 8976 8976 M: Marc Zyngier <maz@kernel.org> 8977 8977 S: Maintained 8978 8978 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core 8979 - F: Documentation/IRQ-domain.txt 8979 + F: Documentation/core-api/irq/irq-domain.rst 8980 8980 F: include/linux/irqdomain.h 8981 8981 F: kernel/irq/irqdomain.c 8982 8982 F: kernel/irq/msi.c ··· 15703 15703 F: include/linux/ssb/ 15704 15704 15705 15705 SONY IMX214 SENSOR DRIVER 15706 - M: Ricardo Ribalda <ricardo.ribalda@gmail.com> 15706 + M: Ricardo Ribalda <ribalda@kernel.org> 15707 15707 L: linux-media@vger.kernel.org 15708 15708 S: Maintained 15709 15709 T: git git://linuxtv.org/media_tree.git ··· 15943 15943 L: linuxppc-dev@lists.ozlabs.org 15944 15944 S: Supported 15945 15945 W: http://www.ibm.com/developerworks/power/cell/ 15946 - F: Documentation/filesystems/spufs.txt 15946 + F: Documentation/filesystems/spufs/spufs.rst 15947 15947 F: arch/powerpc/platforms/cell/spufs/ 15948 15948 15949 15949 SQUASHFS FILE SYSTEM ··· 16690 16690 F: sound/soc/ti/ 16691 16691 16692 16692 TEXAS INSTRUMENTS' DAC7612 DAC DRIVER 16693 - M: Ricardo Ribalda <ricardo@ribalda.com> 16693 + M: Ricardo Ribalda <ribalda@kernel.org> 16694 16694 L: linux-iio@vger.kernel.org 16695 16695 S: Supported 16696 16696 F: Documentation/devicetree/bindings/iio/dac/ti,dac7612.txt ··· 18594 18594 T: git git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git 18595 18595 F: Documentation/ABI/testing/sysfs-fs-xfs 18596 18596 F: 
Documentation/admin-guide/xfs.rst 18597 - F: Documentation/filesystems/xfs-delayed-logging-design.txt 18598 - F: Documentation/filesystems/xfs-self-describing-metadata.txt 18597 + F: Documentation/filesystems/xfs-delayed-logging-design.rst 18598 + F: Documentation/filesystems/xfs-self-describing-metadata.rst 18599 18599 F: fs/xfs/ 18600 18600 F: include/uapi/linux/dqblk_xfs.h 18601 18601 F: include/uapi/linux/fsmap.h

+1 -1

arch/powerpc/include/uapi/asm/kvm_para.h

··· 31 31 * Struct fields are always 32 or 64 bit aligned, depending on them being 32 32 32 * or 64 bit wide respectively. 33 33 * 34 - * See Documentation/virt/kvm/ppc-pv.txt 34 + * See Documentation/virt/kvm/ppc-pv.rst 35 35 */ 36 36 struct kvm_vcpu_arch_shared { 37 37 __u64 scratch1;

+1 -1

arch/x86/kvm/mmu/mmu.c

··· 3586 3586 /* 3587 3587 * Currently, fast page fault only works for direct mapping 3588 3588 * since the gfn is not stable for indirect shadow page. See 3589 - * Documentation/virt/kvm/locking.txt to get more detail. 3589 + * Documentation/virt/kvm/locking.rst to get more detail. 3590 3590 */ 3591 3591 fault_handled = fast_pf_fix_direct_spte(vcpu, sp, 3592 3592 iterator.sptep, spte,

+1 -1

drivers/ata/libata-core.c

··· 5209 5209 * sata_link_init_spd - Initialize link->sata_spd_limit 5210 5210 * @link: Link to configure sata_spd_limit for 5211 5211 * 5212 - * Initialize @link->[hw_]sata_spd_limit to the currently 5212 + * Initialize ``link->[hw_]sata_spd_limit`` to the currently 5213 5213 * configured value. 5214 5214 * 5215 5215 * LOCKING:

+1 -1

drivers/base/core.c

··· 1393 1393 else if (dev->class && dev->class->dev_release) 1394 1394 dev->class->dev_release(dev); 1395 1395 else 1396 - WARN(1, KERN_ERR "Device '%s' does not have a release() function, it is broken and must be fixed. See Documentation/kobject.txt.\n", 1396 + WARN(1, KERN_ERR "Device '%s' does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst.\n", 1397 1397 dev_name(dev)); 1398 1398 kfree(p); 1399 1399 }

+4 -2

drivers/base/platform.c

··· 147 147 * request_irq() APIs. This is the same as platform_get_irq(), except that it 148 148 * does not print an error message if an IRQ can not be obtained. 149 149 * 150 - * Example: 150 + * For example:: 151 + * 151 152 * int irq = platform_get_irq_optional(pdev, 0); 152 153 * if (irq < 0) 153 154 * return irq; ··· 227 226 * IRQ fails. Device drivers should check the return value for errors so as to 228 227 * not pass a negative integer value to the request_irq() APIs. 229 228 * 230 - * Example: 229 + * For example:: 230 + * 231 231 * int irq = platform_get_irq(pdev, 0); 232 232 * if (irq < 0) 233 233 * return irq;

+1 -1

drivers/char/ipmi/Kconfig

··· 14 14 IPMI is a standard for managing sensors (temperature, 15 15 voltage, etc.) in a system. 16 16 17 - See <file:Documentation/IPMI.txt> for more details on the driver. 17 + See <file:Documentation/driver-api/ipmi.rst> for more details on the driver. 18 18 19 19 If unsure, say N. 20 20

+1 -1

drivers/char/ipmi/ipmi_si_hotmod.c

··· 18 18 19 19 module_param_call(hotmod, hotmod_handler, NULL, NULL, 0200); 20 20 MODULE_PARM_DESC(hotmod, "Add and remove interfaces. See" 21 - " Documentation/IPMI.txt in the kernel sources for the" 21 + " Documentation/driver-api/ipmi.rst in the kernel sources for the" 22 22 " gory details."); 23 23 24 24 /*

+1 -1

drivers/char/ipmi/ipmi_si_intf.c

··· 968 968 * that are not BT and do not have interrupts. It starts spinning 969 969 * when an operation is complete or until max_busy tells it to stop 970 970 * (if that is enabled). See the paragraph on kimid_max_busy_us in 971 - * Documentation/IPMI.txt for details. 971 + * Documentation/driver-api/ipmi.rst for details. 972 972 */ 973 973 static int ipmi_thread(void *data) 974 974 {

+1 -1

drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c

··· 8 8 * This file add support for AES cipher with 128,192,256 bits keysize in 9 9 * CBC and ECB mode. 10 10 * 11 - * You could find a link for the datasheet in Documentation/arm/sunxi/README 11 + * You could find a link for the datasheet in Documentation/arm/sunxi.rst 12 12 */ 13 13 14 14 #include <linux/crypto.h>

+1 -1

drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c

··· 7 7 * 8 8 * Core file which registers crypto algorithms supported by the CryptoEngine. 9 9 * 10 - * You could find a link for the datasheet in Documentation/arm/sunxi/README 10 + * You could find a link for the datasheet in Documentation/arm/sunxi.rst 11 11 */ 12 12 #include <linux/clk.h> 13 13 #include <linux/crypto.h>

+1 -1

drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c

··· 8 8 * This file add support for AES cipher with 128,192,256 bits keysize in 9 9 * CBC and ECB mode. 10 10 * 11 - * You could find a link for the datasheet in Documentation/arm/sunxi/README 11 + * You could find a link for the datasheet in Documentation/arm/sunxi.rst 12 12 */ 13 13 14 14 #include <linux/crypto.h>

+1 -1

drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c

··· 7 7 * 8 8 * Core file which registers crypto algorithms supported by the SecuritySystem 9 9 * 10 - * You could find a link for the datasheet in Documentation/arm/sunxi/README 10 + * You could find a link for the datasheet in Documentation/arm/sunxi.rst 11 11 */ 12 12 #include <linux/clk.h> 13 13 #include <linux/crypto.h>

+1 -1

drivers/gpu/drm/Kconfig

··· 161 161 monitor are unable to provide appropriate EDID data. Since this 162 162 feature is provided as a workaround for broken hardware, the 163 163 default case is N. Details and instructions how to build your own 164 - EDID data are given in Documentation/driver-api/edid.rst. 164 + EDID data are given in Documentation/admin-guide/edid.rst. 165 165 166 166 config DRM_DP_CEC 167 167 bool "Enable DisplayPort CEC-Tunneling-over-AUX HDMI support"

+1 -1

drivers/gpu/drm/drm_ioctl.c

··· 741 741 * }; 742 742 * 743 743 * Please make sure that you follow all the best practices from 744 - * ``Documentation/ioctl/botching-up-ioctls.rst``. Note that drm_ioctl() 744 + * ``Documentation/process/botching-up-ioctls.rst``. Note that drm_ioctl() 745 745 * automatically zero-extends structures, hence make sure you can add more stuff 746 746 * at the end, i.e. don't put a variable sized array there. 747 747 *

+1 -1

drivers/gpu/drm/msm/disp/dpu1/dpu_kms.h

··· 170 170 * 171 171 * Main debugfs documentation is located at, 172 172 * 173 - * Documentation/filesystems/debugfs.txt 173 + * Documentation/filesystems/debugfs.rst 174 174 * 175 175 * @dpu_debugfs_setup_regset32: Initialize data for dpu_debugfs_create_regset32 176 176 * @dpu_debugfs_create_regset32: Create 32-bit register dump file

+1 -1

drivers/hwtracing/coresight/Kconfig

··· 107 107 can quickly get to know program counter (PC), secure state, 108 108 exception level, etc. Before use debugging functionality, platform 109 109 needs to ensure the clock domain and power domain are enabled 110 - properly, please refer Documentation/trace/coresight-cpu-debug.rst 110 + properly, please refer Documentation/trace/coresight/coresight-cpu-debug.rst 111 111 for detailed description and the example for usage. 112 112 113 113 config CORESIGHT_CTI

+2 -2

drivers/iio/dac/ad5761.c

··· 3 3 * AD5721, AD5721R, AD5761, AD5761R, Voltage Output Digital to Analog Converter 4 4 * 5 5 * Copyright 2016 Qtechnology A/S 6 - * 2016 Ricardo Ribalda <ricardo.ribalda@gmail.com> 6 + * 2016 Ricardo Ribalda <ribalda@kernel.org> 7 7 */ 8 8 #include <linux/kernel.h> 9 9 #include <linux/module.h> ··· 423 423 }; 424 424 module_spi_driver(ad5761_driver); 425 425 426 - MODULE_AUTHOR("Ricardo Ribalda <ricardo.ribalda@gmail.com>"); 426 + MODULE_AUTHOR("Ricardo Ribalda <ribalda@kernel.org>"); 427 427 MODULE_DESCRIPTION("Analog Devices AD5721, AD5721R, AD5761, AD5761R driver"); 428 428 MODULE_LICENSE("GPL v2");

+2 -2

drivers/iio/dac/ti-dac7612.c

··· 3 3 * DAC7612 Dual, 12-Bit Serial input Digital-to-Analog Converter 4 4 * 5 5 * Copyright 2019 Qtechnology A/S 6 - * 2019 Ricardo Ribalda <ricardo@ribalda.com> 6 + * 2019 Ricardo Ribalda <ribalda@kernel.org> 7 7 * 8 8 * Licensed under the GPL-2. 9 9 */ ··· 179 179 }; 180 180 module_spi_driver(dac7612_driver); 181 181 182 - MODULE_AUTHOR("Ricardo Ribalda <ricardo@ribalda.com>"); 182 + MODULE_AUTHOR("Ricardo Ribalda <ribalda@kernel.org>"); 183 183 MODULE_DESCRIPTION("Texas Instruments DAC7612 DAC driver"); 184 184 MODULE_LICENSE("GPL v2");

+1 -1

drivers/leds/leds-pca963x.c

··· 4 4 * Copyright 2013 Qtechnology/AS 5 5 * 6 6 * Author: Peter Meerwald <p.meerwald@bct-electronic.com> 7 - * Author: Ricardo Ribalda <ricardo.ribalda@gmail.com> 7 + * Author: Ricardo Ribalda <ribalda@kernel.org> 8 8 * 9 9 * Based on leds-pca955x.c 10 10 *

+2 -2

drivers/media/i2c/imx214.c

··· 4 4 * 5 5 * Copyright 2018 Qtechnology A/S 6 6 * 7 - * Ricardo Ribalda <ricardo.ribalda@gmail.com> 7 + * Ricardo Ribalda <ribalda@kernel.org> 8 8 */ 9 9 #include <linux/clk.h> 10 10 #include <linux/delay.h> ··· 1120 1120 module_i2c_driver(imx214_i2c_driver); 1121 1121 1122 1122 MODULE_DESCRIPTION("Sony IMX214 Camera driver"); 1123 - MODULE_AUTHOR("Ricardo Ribalda <ricardo.ribalda@gmail.com>"); 1123 + MODULE_AUTHOR("Ricardo Ribalda <ribalda@kernel.org>"); 1124 1124 MODULE_LICENSE("GPL v2");

+1 -1

drivers/media/v4l2-core/v4l2-fwnode.c

··· 980 980 * 981 981 * THIS EXAMPLE EXISTS MERELY TO DOCUMENT THIS FUNCTION. DO NOT USE IT AS A 982 982 * REFERENCE IN HOW ACPI TABLES SHOULD BE WRITTEN!! See documentation under 983 - * Documentation/acpi/dsd instead and especially graph.txt, 983 + * Documentation/firmware-guide/acpi/dsd/ instead and especially graph.txt, 984 984 * data-node-references.txt and leds.txt . 985 985 * 986 986 * Scope (\_SB.PCI0.I2C2)

+1 -1

fs/Kconfig

··· 166 166 space. If you unmount a tmpfs instance, everything stored therein is 167 167 lost. 168 168 169 - See <file:Documentation/filesystems/tmpfs.txt> for details. 169 + See <file:Documentation/filesystems/tmpfs.rst> for details. 170 170 171 171 config TMPFS_POSIX_ACL 172 172 bool "Tmpfs POSIX Access Control Lists"

+1 -1

fs/Kconfig.binfmt

··· 78 78 79 79 The core dump behavior can be controlled per process using 80 80 the /proc/PID/coredump_filter pseudo-file; this setting is 81 - inherited. See Documentation/filesystems/proc.txt for details. 81 + inherited. See Documentation/filesystems/proc.rst for details. 82 82 83 83 This config option changes the default setting of coredump_filter 84 84 seen at boot time. If unsure, say Y.

+1 -1

fs/adfs/Kconfig

··· 12 12 13 13 The ADFS partition should be the first partition (i.e., 14 14 /dev/[hs]d?1) on each of your drives. Please read the file 15 - <file:Documentation/filesystems/adfs.txt> for further details. 15 + <file:Documentation/filesystems/adfs.rst> for further details. 16 16 17 17 To compile this code as a module, choose M here: the module will be 18 18 called adfs.

+1 -1

fs/affs/Kconfig

··· 9 9 FFS partition on your hard drive. Amiga floppies however cannot be 10 10 read with this driver due to an incompatibility of the floppy 11 11 controller used in an Amiga and the standard floppy controller in 12 - PCs and workstations. Read <file:Documentation/filesystems/affs.txt> 12 + PCs and workstations. Read <file:Documentation/filesystems/affs.rst> 13 13 and <file:fs/affs/Changes>. 14 14 15 15 With this driver you can also mount disk files used by Bernd

+3 -3

fs/afs/Kconfig

··· 8 8 If you say Y here, you will get an experimental Andrew File System 9 9 driver. It currently only supports unsecured read-only AFS access. 10 10 11 - See <file:Documentation/filesystems/afs.txt> for more information. 11 + See <file:Documentation/filesystems/afs.rst> for more information. 12 12 13 13 If unsure, say N. 14 14 ··· 18 18 help 19 19 Say Y here to make runtime controllable debugging messages appear. 20 20 21 - See <file:Documentation/filesystems/afs.txt> for more information. 21 + See <file:Documentation/filesystems/afs.rst> for more information. 22 22 23 23 If unsure, say N. 24 24 ··· 37 37 the dmesg log if the server rotation algorithm fails to successfully 38 38 contact a server. 39 39 40 - See <file:Documentation/filesystems/afs.txt> for more information. 40 + See <file:Documentation/filesystems/afs.rst> for more information. 41 41 42 42 If unsure, say N.

+1 -1

fs/bfs/Kconfig

··· 11 11 on your /stand slice from within Linux. You then also need to say Y 12 12 to "UnixWare slices support", below. More information about the BFS 13 13 file system is contained in the file 14 - <file:Documentation/filesystems/bfs.txt>. 14 + <file:Documentation/filesystems/bfs.rst>. 15 15 16 16 If you don't know what this is about, say N. 17 17

+2 -2

fs/cachefiles/Kconfig

··· 8 8 filesystems - primarily networking filesystems - thus allowing fast 9 9 local disk to enhance the speed of slower devices. 10 10 11 - See Documentation/filesystems/caching/cachefiles.txt for more 11 + See Documentation/filesystems/caching/cachefiles.rst for more 12 12 information. 13 13 14 14 config CACHEFILES_DEBUG ··· 36 36 bouncing between CPUs. On the other hand, the histogram may be 37 37 useful for debugging purposes. Saying 'N' here is recommended. 38 38 39 - See Documentation/filesystems/caching/cachefiles.txt for more 39 + See Documentation/filesystems/caching/cachefiles.rst for more 40 40 information.

+1 -1

fs/coda/Kconfig

··· 15 15 *client*. You will need user level code as well, both for the 16 16 client and server. Servers are currently user level, i.e. they need 17 17 no kernel support. Please read 18 - <file:Documentation/filesystems/coda.txt> and check out the Coda 18 + <file:Documentation/filesystems/coda.rst> and check out the Coda 19 19 home page <http://www.coda.cs.cmu.edu/>. 20 20 21 21 To compile the coda client support as a module, choose M here: the

+1 -1

fs/configfs/inode.c

··· 9 9 * 10 10 * configfs Copyright (C) 2005 Oracle. All rights reserved. 11 11 * 12 - * Please see Documentation/filesystems/configfs/configfs.txt for more 12 + * Please see Documentation/filesystems/configfs.rst for more 13 13 * information. 14 14 */ 15 15

+1 -1

fs/configfs/item.c

··· 9 9 * 10 10 * configfs Copyright (C) 2005 Oracle. All rights reserved. 11 11 * 12 - * Please see the file Documentation/filesystems/configfs/configfs.txt for 12 + * Please see the file Documentation/filesystems/configfs.rst for 13 13 * critical information about using the config_item interface. 14 14 */ 15 15

+1 -1

fs/cramfs/Kconfig

··· 9 9 limited to 256MB file systems (with 16MB files), and doesn't support 10 10 16/32 bits uid/gid, hard links and timestamps. 11 11 12 - See <file:Documentation/filesystems/cramfs.txt> and 12 + See <file:Documentation/filesystems/cramfs.rst> and 13 13 <file:fs/cramfs/README> for further information. 14 14 15 15 To compile this as a module, choose M here: the module will be called

+1 -1

fs/ecryptfs/Kconfig

··· 7 7 select CRYPTO_MD5 8 8 help 9 9 Encrypted filesystem that operates on the VFS layer. See 10 - <file:Documentation/filesystems/ecryptfs.txt> to learn more about 10 + <file:Documentation/filesystems/ecryptfs.rst> to learn more about 11 11 eCryptfs. Userspace components are required and can be 12 12 obtained from <http://ecryptfs.sf.net>. 13 13

+4 -4

fs/fat/Kconfig

··· 69 69 70 70 The VFAT support enlarges your kernel by about 10 KB and it only 71 71 works if you said Y to the "DOS FAT fs support" above. Please read 72 - the file <file:Documentation/filesystems/vfat.txt> for details. If 72 + the file <file:Documentation/filesystems/vfat.rst> for details. If 73 73 unsure, say Y. 74 74 75 75 To compile this as a module, choose M here: the module will be called ··· 82 82 help 83 83 This option should be set to the codepage of your FAT filesystems. 84 84 It can be overridden with the "codepage" mount option. 85 - See <file:Documentation/filesystems/vfat.txt> for more information. 85 + See <file:Documentation/filesystems/vfat.rst> for more information. 86 86 87 87 config FAT_DEFAULT_IOCHARSET 88 88 string "Default iocharset for FAT" ··· 96 96 Note that "utf8" is not recommended for FAT filesystems. 97 97 If unsure, you shouldn't set "utf8" here - select the next option 98 98 instead if you would like to use UTF-8 encoded file names by default. 99 - See <file:Documentation/filesystems/vfat.txt> for more information. 99 + See <file:Documentation/filesystems/vfat.rst> for more information. 100 100 101 101 Enable any character sets you need in File Systems/Native Language 102 102 Support. ··· 114 114 115 115 Say Y if you use UTF-8 encoding for file names, N otherwise. 116 116 117 - See <file:Documentation/filesystems/vfat.txt> for more information. 117 + See <file:Documentation/filesystems/vfat.rst> for more information.

+4 -4

fs/fscache/Kconfig

··· 8 8 Different sorts of caches can be plugged in, depending on the 9 9 resources available. 10 10 11 - See Documentation/filesystems/caching/fscache.txt for more information. 11 + See Documentation/filesystems/caching/fscache.rst for more information. 12 12 13 13 config FSCACHE_STATS 14 14 bool "Gather statistical information on local caching" ··· 25 25 between CPUs. On the other hand, the stats are very useful for 26 26 debugging purposes. Saying 'Y' here is recommended. 27 27 28 - See Documentation/filesystems/caching/fscache.txt for more information. 28 + See Documentation/filesystems/caching/fscache.rst for more information. 29 29 30 30 config FSCACHE_HISTOGRAM 31 31 bool "Gather latency information on local caching" ··· 42 42 bouncing between CPUs. On the other hand, the histogram may be 43 43 useful for debugging purposes. Saying 'N' here is recommended. 44 44 45 - See Documentation/filesystems/caching/fscache.txt for more information. 45 + See Documentation/filesystems/caching/fscache.rst for more information. 46 46 47 47 config FSCACHE_DEBUG 48 48 bool "Debug FS-Cache" ··· 52 52 management module. If this is set, the debugging output may be 53 53 enabled by setting bits in /sys/modules/fscache/parameter/debug. 54 54 55 - See Documentation/filesystems/caching/fscache.txt for more information. 55 + See Documentation/filesystems/caching/fscache.rst for more information. 56 56 57 57 config FSCACHE_OBJECT_LIST 58 58 bool "Maintain global object list for debugging purposes"

+4 -4

fs/fscache/cache.c

··· 172 172 * 173 173 * Initialise a record of a cache and fill in the name. 174 174 * 175 - * See Documentation/filesystems/caching/backend-api.txt for a complete 175 + * See Documentation/filesystems/caching/backend-api.rst for a complete 176 176 * description. 177 177 */ 178 178 void fscache_init_cache(struct fscache_cache *cache, ··· 207 207 * 208 208 * Add a cache to the system, making it available for netfs's to use. 209 209 * 210 - * See Documentation/filesystems/caching/backend-api.txt for a complete 210 + * See Documentation/filesystems/caching/backend-api.rst for a complete 211 211 * description. 212 212 */ 213 213 int fscache_add_cache(struct fscache_cache *cache, ··· 307 307 * Note that an I/O error occurred in a cache and that it should no longer be 308 308 * used for anything. This also reports the error into the kernel log. 309 309 * 310 - * See Documentation/filesystems/caching/backend-api.txt for a complete 310 + * See Documentation/filesystems/caching/backend-api.rst for a complete 311 311 * description. 312 312 */ 313 313 void fscache_io_error(struct fscache_cache *cache) ··· 355 355 * Withdraw a cache from service, unbinding all its cache objects from the 356 356 * netfs cookies they're currently representing. 357 357 * 358 - * See Documentation/filesystems/caching/backend-api.txt for a complete 358 + * See Documentation/filesystems/caching/backend-api.rst for a complete 359 359 * description. 360 360 */ 361 361 void fscache_withdraw_cache(struct fscache_cache *cache)

+1 -1

fs/fscache/cookie.c

···

+2 -2

fs/fscache/object.c

··· 4 4 * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved. 5 5 * Written by David Howells (dhowells@redhat.com) 6 6 * 7 - * See Documentation/filesystems/caching/object.txt for a description of the 7 + * See Documentation/filesystems/caching/object.rst for a description of the 8 8 * object state machine and the in-kernel representations. 9 9 */ 10 10 ··· 295 295 * 296 296 * Initialise a cache object description to its basic values. 297 297 * 298 - * See Documentation/filesystems/caching/backend-api.txt for a complete 298 + * See Documentation/filesystems/caching/backend-api.rst for a complete 299 299 * description. 300 300 */ 301 301 void fscache_object_init(struct fscache_object *object,

+1 -1

fs/fscache/operation.c

··· 4 4 * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved. 5 5 * Written by David Howells (dhowells@redhat.com) 6 6 * 7 - * See Documentation/filesystems/caching/operations.txt 7 + * See Documentation/filesystems/caching/operations.rst 8 8 */ 9 9 10 10 #define FSCACHE_DEBUG_LEVEL OPERATION

+1 -1

fs/fuse/Kconfig

··· 12 12 although chances are your distribution already has that library 13 13 installed if you've installed the "fuse" package itself. 14 14 15 - See <file:Documentation/filesystems/fuse.txt> for more information. 15 + See <file:Documentation/filesystems/fuse.rst> for more information. 16 16 See <file:Documentation/Changes> for needed library/utility version. 17 17 18 18 If you want to develop a userspace FS, or if you want to use

+1 -1

fs/fuse/dev.c

··· 2081 2081 * The same effect is usually achievable through killing the filesystem daemon 2082 2082 * and all users of the filesystem. The exception is the combination of an 2083 2083 * asynchronous request and the tricky deadlock (see 2084 - * Documentation/filesystems/fuse.txt). 2084 + * Documentation/filesystems/fuse.rst). 2085 2085 * 2086 2086 * Aborting requests under I/O goes as follows: 1: Separate out unlocked 2087 2087 * requests, they should be finished off immediately. Locked requests will be

+1 -1

fs/hfs/Kconfig

··· 6 6 help 7 7 If you say Y here, you will be able to mount Macintosh-formatted 8 8 floppy disks and hard drive partitions with full read-write access. 9 - Please read <file:Documentation/filesystems/hfs.txt> to learn about 9 + Please read <file:Documentation/filesystems/hfs.rst> to learn about 10 10 the available mount options. 11 11 12 12 To compile this file system support as a module, choose M here: the

+1 -1

fs/hpfs/Kconfig

··· 9 9 write files to an OS/2 HPFS partition on your hard drive. OS/2 10 10 floppies however are in regular MSDOS format, so you don't need this 11 11 option in order to be able to read them. Read 12 - <file:Documentation/filesystems/hpfs.txt>. 12 + <file:Documentation/filesystems/hpfs.rst>. 13 13 14 14 To compile this file system support as a module, choose M here: the 15 15 module will be called hpfs. If unsure, say N.

+3 -3

fs/inode.c

··· 1606 1606 * @inode: inode owning the block number being requested 1607 1607 * @block: pointer containing the block to find 1608 1608 * 1609 - * Replaces the value in *block with the block number on the device holding 1609 + * Replaces the value in ``*block`` with the block number on the device holding 1610 1610 * corresponding to the requested block number in the file. 1611 1611 * That is, asked for block 4 of inode 1 the function will replace the 1612 - * 4 in *block, with disk block relative to the disk start that holds that 1612 + * 4 in ``*block``, with disk block relative to the disk start that holds that 1613 1613 * block of the file. 1614 1614 * 1615 1615 * Returns -EINVAL in case of error, 0 otherwise. If mapping falls into a 1616 - * hole, returns 0 and *block is also set to 0. 1616 + * hole, returns 0 and ``*block`` is also set to 0. 1617 1617 */ 1618 1618 int bmap(struct inode *inode, sector_t *block) 1619 1619 {
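
The bmap() comment above describes an in/out parameter: the caller stores a file block number in ``*block`` and the function overwrites it with the disk block number, or with 0 for a hole. The following userspace sketch mimics only that calling convention; `struct toy_file` and `toy_bmap()` are hypothetical, the mapping table is made up, and treating out-of-range blocks as holes is an assumption of this sketch rather than documented kernel behaviour.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy file: file block i lives at disk block map[i]; 0 means hole. */
struct toy_file {
	const uint64_t *map;
	size_t nblocks;
};

/*
 * Replace *block (a file-relative block number) with the disk block that
 * holds it. Returns -EINVAL if no mapping is available, 0 otherwise.
 * A hole yields 0 with *block set to 0.
 */
int toy_bmap(const struct toy_file *f, uint64_t *block)
{
	if (!f || !f->map || !block)
		return -EINVAL;

	if (*block >= f->nblocks)
		*block = 0;		/* assumption: past the table counts as a hole */
	else
		*block = f->map[*block];

	return 0;
}
```

Because a hole and a successful call both return 0, a caller of a function with this convention has to inspect ``*block`` as well as the return value.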

+1 -1

fs/isofs/Kconfig

··· 8 8 long Unix filenames and symbolic links are also supported by this 9 9 driver. If you have a CD-ROM drive and want to do more with it than 10 10 just listen to audio CDs and watch its LEDs, say Y (and read 11 - <file:Documentation/filesystems/isofs.txt> and the CD-ROM-HOWTO, 11 + <file:Documentation/filesystems/isofs.rst> and the CD-ROM-HOWTO, 12 12 available from <http://www.tldp.org/docs.html#howto>), thereby 13 13 enlarging your kernel by about 27 KB; otherwise say N. 14 14

+1 -1

fs/locks.c

··· 61 61 * 62 62 * Initial implementation of mandatory locks. SunOS turned out to be 63 63 * a rotten model, so I implemented the "obvious" semantics. 64 - * See 'Documentation/filesystems/mandatory-locking.txt' for details. 64 + * See 'Documentation/filesystems/mandatory-locking.rst' for details. 65 65 * Andy Walker (andy@lysaker.kvaerner.no), April 06, 1996. 66 66 * 67 67 * Don't allow mandatory locks on mmap()'ed files. Added simple functions to

+1 -1

fs/namespace.c

··· 3595 3595 * file system may be mounted on put_old. After all, new_root is a mountpoint. 3596 3596 * 3597 3597 * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem. 3598 - * See Documentation/filesystems/ramfs-rootfs-initramfs.txt for alternatives 3598 + * See Documentation/filesystems/ramfs-rootfs-initramfs.rst for alternatives 3599 3599 * in this situation. 3600 3600 * 3601 3601 * Notes:

+1 -1

fs/notify/inotify/Kconfig

··· 12 12 new features including multiple file events, one-shot support, and 13 13 unmount notification. 14 14 15 - For more information, see <file:Documentation/filesystems/inotify.txt> 15 + For more information, see <file:Documentation/filesystems/inotify.rst> 16 16 17 17 If unsure, say Y.

+1 -1

fs/ntfs/Kconfig

··· 18 18 the Linux 2.4 kernel series is separately available as a patch 19 19 from the project web site. 20 20 21 - For more information see <file:Documentation/filesystems/ntfs.txt> 21 + For more information see <file:Documentation/filesystems/ntfs.rst> 22 22 and <http://www.linux-ntfs.org/>. 23 23 24 24 To compile this file system support as a module, choose M here: the

+1 -1

fs/ocfs2/Kconfig

··· 21 21 OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ 22 22 23 23 For more information on OCFS2, see the file 24 - <file:Documentation/filesystems/ocfs2.txt>. 24 + <file:Documentation/filesystems/ocfs2.rst>. 25 25 26 26 config OCFS2_FS_O2CB 27 27 tristate "O2CB Kernelspace Clustering"

+3 -3

fs/overlayfs/Kconfig

··· 9 9 'lower' filesystem is either hidden or, in the case of directories, 10 10 merged with the 'upper' object. 11 11 12 - For more information see Documentation/filesystems/overlayfs.txt 12 + For more information see Documentation/filesystems/overlayfs.rst 13 13 14 14 config OVERLAY_FS_REDIRECT_DIR 15 15 bool "Overlayfs: turn on redirect directory feature by default" ··· 38 38 If backward compatibility is not an issue, then it is safe and 39 39 recommended to say N here. 40 40 41 - For more information, see Documentation/filesystems/overlayfs.txt 41 + For more information, see Documentation/filesystems/overlayfs.rst 42 42 43 43 If unsure, say Y. 44 44 ··· 103 103 If compatibility with applications that expect 32bit inodes is not an 104 104 issue, then it is safe and recommended to say Y here. 105 105 106 - For more information, see Documentation/filesystems/overlayfs.txt 106 + For more information, see Documentation/filesystems/overlayfs.rst 107 107 108 108 If unsure, say N. 109 109

+2 -2

fs/proc/Kconfig

··· 23 23 /proc" or the equivalent line in /etc/fstab does the job. 24 24 25 25 The /proc file system is explained in the file 26 - <file:Documentation/filesystems/proc.txt> and on the proc(5) manpage 26 + <file:Documentation/filesystems/proc.rst> and on the proc(5) manpage 27 27 ("man 5 proc"). 28 28 29 29 This option will enlarge your kernel by about 67 KB. Several ··· 95 95 default n 96 96 help 97 97 Provides a fast way to retrieve first level children pids of a task. See 98 - <file:Documentation/filesystems/proc.txt> for more information. 98 + <file:Documentation/filesystems/proc.rst> for more information. 99 99 100 100 Say Y if you are running any user-space software which takes benefit from 101 101 this interface. For example, rkt is such a piece of software.

+1 -1

fs/romfs/Kconfig

··· 6 6 This is a very small read-only file system mainly intended for 7 7 initial ram disks of installation disks, but it could be used for 8 8 other read-only media as well. Read 9 - <file:Documentation/filesystems/romfs.txt> for details. 9 + <file:Documentation/filesystems/romfs.rst> for details. 10 10 11 11 To compile this file system support as a module, choose M here: the 12 12 module will be called romfs. Note that the file system of your

+1 -1

fs/sysfs/dir.c

··· 6 6 * Copyright (c) 2007 SUSE Linux Products GmbH 7 7 * Copyright (c) 2007 Tejun Heo <teheo@suse.de> 8 8 * 9 - * Please see Documentation/filesystems/sysfs.txt for more information. 9 + * Please see Documentation/filesystems/sysfs.rst for more information. 10 10 */ 11 11 12 12 #define pr_fmt(fmt) "sysfs: " fmt

+1 -1

fs/sysfs/file.c

··· 6 6 * Copyright (c) 2007 SUSE Linux Products GmbH 7 7 * Copyright (c) 2007 Tejun Heo <teheo@suse.de> 8 8 * 9 - * Please see Documentation/filesystems/sysfs.txt for more information. 9 + * Please see Documentation/filesystems/sysfs.rst for more information. 10 10 */ 11 11 12 12 #include <linux/module.h>

+1 -1

fs/sysfs/mount.c

··· 6 6 * Copyright (c) 2007 SUSE Linux Products GmbH 7 7 * Copyright (c) 2007 Tejun Heo <teheo@suse.de> 8 8 * 9 - * Please see Documentation/filesystems/sysfs.txt for more information. 9 + * Please see Documentation/filesystems/sysfs.rst for more information. 10 10 */ 11 11 12 12 #include <linux/fs.h>

+1 -1

fs/sysfs/symlink.c

··· 6 6 * Copyright (c) 2007 SUSE Linux Products GmbH 7 7 * Copyright (c) 2007 Tejun Heo <teheo@suse.de> 8 8 * 9 - * Please see Documentation/filesystems/sysfs.txt for more information. 9 + * Please see Documentation/filesystems/sysfs.rst for more information. 10 10 */ 11 11 12 12 #include <linux/fs.h>

+1 -1

fs/sysv/Kconfig

··· 28 28 tar" or preferably "info tar"). Note also that this option has 29 29 nothing whatsoever to do with the option "System V IPC". Read about 30 30 the System V file system in 31 - <file:Documentation/filesystems/sysv-fs.txt>. 31 + <file:Documentation/filesystems/sysv-fs.rst>. 32 32 Saying Y here will enlarge your kernel by about 27 KB. 33 33 34 34 To compile this as a module, choose M here: the module will be called

+1 -1

fs/udf/Kconfig

··· 9 9 compatible with standard unix file systems, it is also suitable for 10 10 removable USB disks. Say Y if you intend to mount DVD discs or CDRW's 11 11 written in packet mode, or if you want to use UDF for removable USB 12 - disks. Please read <file:Documentation/filesystems/udf.txt>. 12 + disks. Please read <file:Documentation/filesystems/udf.rst>. 13 13 14 14 To compile this file system support as a module, choose M here: the 15 15 module will be called udf.

+1 -1

include/linux/configfs.h

··· 13 13 * 14 14 * configfs Copyright (C) 2005 Oracle. All rights reserved. 15 15 * 16 - * Please read Documentation/filesystems/configfs/configfs.txt before using 16 + * Please read Documentation/filesystems/configfs.rst before using 17 17 * the configfs interface, ESPECIALLY the parts about reference counts and 18 18 * item destructors. 19 19 */

+1 -1

include/linux/fs_context.h

··· 85 85 * Superblock creation fills in ->root whereas reconfiguration begins with this 86 86 * already set. 87 87 * 88 - * See Documentation/filesystems/mount_api.txt 88 + * See Documentation/filesystems/mount_api.rst 89 89 */ 90 90 struct fs_context { 91 91 const struct fs_context_operations *ops;

+2 -2

include/linux/fscache-cache.h

··· 6 6 * 7 7 * NOTE!!! See: 8 8 * 9 - * Documentation/filesystems/caching/backend-api.txt 9 + * Documentation/filesystems/caching/backend-api.rst 10 10 * 11 11 * for a description of the cache backend interface declared here. 12 12 */ ··· 454 454 * Set the maximum size an object is permitted to reach, implying the highest 455 455 * byte that may be written. Intended to be called by the attr_changed() op. 456 456 * 457 - * See Documentation/filesystems/caching/backend-api.txt for a complete 457 + * See Documentation/filesystems/caching/backend-api.rst for a complete 458 458 * description. 459 459 */ 460 460 static inline

+21 -21

include/linux/fscache.h

··· 6 6 * 7 7 * NOTE!!! See: 8 8 * 9 - * Documentation/filesystems/caching/netfs-api.txt 9 + * Documentation/filesystems/caching/netfs-api.rst 10 10 * 11 11 * for a description of the network filesystem interface declared here. 12 12 */ ··· 233 233 * 234 234 * Register a filesystem as desiring caching services if they're available. 235 235 * 236 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 236 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 237 237 * description. 238 238 */ 239 239 static inline ··· 253 253 * Indicate that a filesystem no longer desires caching services for the 254 254 * moment. 255 255 * 256 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 256 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 257 257 * description. 258 258 */ 259 259 static inline ··· 270 270 * Acquire a specific cache referral tag that can be used to select a specific 271 271 * cache in which to cache an index. 272 272 * 273 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 273 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 274 274 * description. 275 275 */ 276 276 static inline ··· 288 288 * 289 289 * Release a reference to a cache referral tag previously looked up. 290 290 * 291 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 291 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 292 292 * description. 293 293 */ 294 294 static inline ··· 315 315 * that can be used to locate files. This is done by requesting a cookie for 316 316 * each index in the path to the file. 317 317 * 318 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 318 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 319 319 * description. 320 320 */ 321 321 static inline ··· 351 351 * provided to update the auxiliary data in the cache before the object is 352 352 * disconnected. 
353 353 * 354 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 354 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 355 355 * description. 356 356 */ 357 357 static inline ··· 394 394 * cookie. The auxiliary data on the cookie will be updated first if @aux_data 395 395 * is set. 396 396 * 397 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 397 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 398 398 * description. 399 399 */ 400 400 static inline ··· 410 410 * 411 411 * Permit data-storage cache objects to be pinned in the cache. 412 412 * 413 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 413 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 414 414 * description. 415 415 */ 416 416 static inline ··· 425 425 * 426 426 * Permit data-storage cache objects to be unpinned from the cache. 427 427 * 428 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 428 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 429 429 * description. 430 430 */ 431 431 static inline ··· 441 441 * changed. This includes the data size. These attributes will be obtained 442 442 * through the get_attr() cookie definition op. 443 443 * 444 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 444 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 445 445 * description. 446 446 */ 447 447 static inline ··· 463 463 * 464 464 * This can be called with spinlocks held. 465 465 * 466 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 466 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 467 467 * description. 468 468 */ 469 469 static inline ··· 479 479 * 480 480 * Wait for the invalidation of an object to complete. 
481 481 * 482 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 482 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 483 483 * description. 484 484 */ 485 485 static inline ··· 498 498 * cookie so that a write to that object within the space can always be 499 499 * honoured. 500 500 * 501 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 501 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 502 502 * description. 503 503 */ 504 504 static inline ··· 533 533 * Else, if the page is unbacked, -ENODATA is returned and a block may have 534 534 * been allocated in the cache. 535 535 * 536 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 536 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 537 537 * description. 538 538 */ 539 539 static inline ··· 582 582 * regard to different pages, the return values are prioritised in that order. 583 583 * Any pages submitted for reading are removed from the pages list. 584 584 * 585 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 585 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 586 586 * description. 587 587 */ 588 588 static inline ··· 617 617 * Else, a block will be allocated if one wasn't already, and 0 will be 618 618 * returned 619 619 * 620 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 620 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 621 621 * description. 622 622 */ 623 623 static inline ··· 667 667 * be cleared at the completion of the write to indicate the success or failure 668 668 * of the operation. Note that the completion may happen before the return. 669 669 * 670 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 670 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 671 671 * description. 
672 672 */ 673 673 static inline ··· 693 693 * Note that this cannot cancel any outstanding I/O operations between this 694 694 * page and the cache. 695 695 * 696 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 696 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 697 697 * description. 698 698 */ 699 699 static inline ··· 711 711 * 712 712 * Ask the cache if a page is being written to the cache. 713 713 * 714 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 714 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 715 715 * description. 716 716 */ 717 717 static inline ··· 731 731 * Ask the cache to wake us up when a page is no longer being written to the 732 732 * cache. 733 733 * 734 - * See Documentation/filesystems/caching/netfs-api.txt for a complete 734 + * See Documentation/filesystems/caching/netfs-api.rst for a complete 735 735 * description. 736 736 */ 737 737 static inline

+1 -1

include/linux/kobject.h

··· 7 7 * Copyright (c) 2006-2008 Greg Kroah-Hartman <greg@kroah.com> 8 8 * Copyright (c) 2006-2008 Novell Inc. 9 9 * 10 - * Please read Documentation/kobject.txt before using the kobject 10 + * Please read Documentation/core-api/kobject.rst before using the kobject 11 11 * interface, ESPECIALLY the parts about reference counts and object 12 12 * destructors. 13 13 */

+1 -1

include/linux/kobject_ns.h

··· 8 8 * 9 9 * Split from kobject.h by David Howells (dhowells@redhat.com) 10 10 * 11 - * Please read Documentation/kobject.txt before using the kobject 11 + * Please read Documentation/core-api/kobject.rst before using the kobject 12 12 * interface, ESPECIALLY the parts about reference counts and object 13 13 * destructors. 14 14 */

+1 -1

include/linux/lsm_hooks.h

··· 77 77 * state. This is called immediately after commit_creds(). 78 78 * 79 79 * Security hooks for mount using fs_context. 80 - * [See also Documentation/filesystems/mount_api.txt] 80 + * [See also Documentation/filesystems/mount_api.rst] 81 81 * 82 82 * @fs_context_dup: 83 83 * Allocate and attach a security structure to sc->security. This pointer

+2 -2

include/linux/mm.h

··· 1226 1226 * used to track the pincount (instead using of the GUP_PIN_COUNTING_BIAS 1227 1227 * scheme). 1228 1228 * 1229 - * For more information, please see Documentation/vm/pin_user_pages.rst. 1229 + * For more information, please see Documentation/core-api/pin_user_pages.rst. 1230 1230 * 1231 1231 * @page: pointer to page to be queried. 1232 1232 * @Return: True, if it is likely that the page has been "dma-pinned". ··· 2841 2841 * releasing pages: get_user_pages*() pages must be released via put_page(), 2842 2842 * while pin_user_pages*() pages must be released via unpin_user_page(). 2843 2843 * 2844 - * Please see Documentation/vm/pin_user_pages.rst for more information. 2844 + * Please see Documentation/core-api/pin_user_pages.rst for more information. 2845 2845 */ 2846 2846 2847 2847 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)

+1 -1

include/linux/platform_data/ad5761.h

··· 3 3 * AD5721, AD5721R, AD5761, AD5761R, Voltage Output Digital to Analog Converter 4 4 * 5 5 * Copyright 2016 Qtechnology A/S 6 - * 2016 Ricardo Ribalda <ricardo.ribalda@gmail.com> 6 + * 2016 Ricardo Ribalda <ribalda@kernel.org> 7 7 */ 8 8 #ifndef __LINUX_PLATFORM_DATA_AD5761_H__ 9 9 #define __LINUX_PLATFORM_DATA_AD5761_H__

+100 -12

include/linux/printk.h

··· 279 279 280 280 extern int kptr_restrict; 281 281 282 + /** 283 + * pr_fmt - used by the pr_*() macros to generate the printk format string 284 + * @fmt: format string passed from a pr_*() macro 285 + * 286 + * This macro can be used to generate a unified format string for pr_*() 287 + * macros. A common use is to prefix all pr_*() messages in a file with a common 288 + * string. For example, defining this at the top of a source file: 289 + * 290 + * #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 291 + * 292 + * would prefix all pr_info, pr_emerg... messages in the file with the module 293 + * name. 294 + */ 282 295 #ifndef pr_fmt 283 296 #define pr_fmt(fmt) fmt 284 297 #endif 285 298 286 - /* 287 - * These can be used to print at the various log levels. 288 - * All of these will print unconditionally, although note that pr_debug() 289 - * and other debug macros are compiled out unless either DEBUG is defined 290 - * or CONFIG_DYNAMIC_DEBUG is set. 299 + /** 300 + * pr_emerg - Print an emergency-level message 301 + * @fmt: format string 302 + * @...: arguments for the format string 303 + * 304 + * This macro expands to a printk with KERN_EMERG loglevel. It uses pr_fmt() to 305 + * generate the format string. 291 306 */ 292 307 #define pr_emerg(fmt, ...) \ 293 308 printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__) 309 + /** 310 + * pr_alert - Print an alert-level message 311 + * @fmt: format string 312 + * @...: arguments for the format string 313 + * 314 + * This macro expands to a printk with KERN_ALERT loglevel. It uses pr_fmt() to 315 + * generate the format string. 316 + */ 294 317 #define pr_alert(fmt, ...) \ 295 318 printk(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__) 319 + /** 320 + * pr_crit - Print a critical-level message 321 + * @fmt: format string 322 + * @...: arguments for the format string 323 + * 324 + * This macro expands to a printk with KERN_CRIT loglevel. It uses pr_fmt() to 325 + * generate the format string. 326 + */ 296 327 #define pr_crit(fmt, ...) 
\ 297 328 printk(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__) 329 + /** 330 + * pr_err - Print an error-level message 331 + * @fmt: format string 332 + * @...: arguments for the format string 333 + * 334 + * This macro expands to a printk with KERN_ERR loglevel. It uses pr_fmt() to 335 + * generate the format string. 336 + */ 298 337 #define pr_err(fmt, ...) \ 299 338 printk(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__) 339 + /** 340 + * pr_warn - Print a warning-level message 341 + * @fmt: format string 342 + * @...: arguments for the format string 343 + * 344 + * This macro expands to a printk with KERN_WARNING loglevel. It uses pr_fmt() 345 + * to generate the format string. 346 + */ 300 347 #define pr_warn(fmt, ...) \ 301 348 printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__) 349 + /** 350 + * pr_notice - Print a notice-level message 351 + * @fmt: format string 352 + * @...: arguments for the format string 353 + * 354 + * This macro expands to a printk with KERN_NOTICE loglevel. It uses pr_fmt() to 355 + * generate the format string. 356 + */ 302 357 #define pr_notice(fmt, ...) \ 303 358 printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__) 359 + /** 360 + * pr_info - Print an info-level message 361 + * @fmt: format string 362 + * @...: arguments for the format string 363 + * 364 + * This macro expands to a printk with KERN_INFO loglevel. It uses pr_fmt() to 365 + * generate the format string. 366 + */ 304 367 #define pr_info(fmt, ...) \ 305 368 printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) 306 - /* 307 - * Like KERN_CONT, pr_cont() should only be used when continuing 308 - * a line with no newline ('\n') enclosed. Otherwise it defaults 309 - * back to KERN_DEFAULT. 369 + 370 + /** 371 + * pr_cont - Continues a previous log message in the same line. 372 + * @fmt: format string 373 + * @...: arguments for the format string 374 + * 375 + * This macro expands to a printk with KERN_CONT loglevel. It should only be 376 + * used when continuing a log message with no newline ('\n') enclosed. 
Otherwise 377 + * it defaults back to KERN_DEFAULT loglevel. 310 378 */ 311 379 #define pr_cont(fmt, ...) \ 312 380 printk(KERN_CONT fmt, ##__VA_ARGS__) 313 381 314 - /* pr_devel() should produce zero code unless DEBUG is defined */ 382 + /** 383 + * pr_devel - Print a debug-level message conditionally 384 + * @fmt: format string 385 + * @...: arguments for the format string 386 + * 387 + * This macro expands to a printk with KERN_DEBUG loglevel if DEBUG is 388 + * defined. Otherwise it does nothing. 389 + * 390 + * It uses pr_fmt() to generate the format string. 391 + */ 315 392 #ifdef DEBUG 316 393 #define pr_devel(fmt, ...) \ 317 394 printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__) ··· 402 325 #if defined(CONFIG_DYNAMIC_DEBUG) 403 326 #include <linux/dynamic_debug.h> 404 327 405 - /* dynamic_pr_debug() uses pr_fmt() internally so we don't need it here */ 406 - #define pr_debug(fmt, ...) \ 328 + /** 329 + * pr_debug - Print a debug-level message conditionally 330 + * @fmt: format string 331 + * @...: arguments for the format string 332 + * 333 + * This macro expands to dynamic_pr_debug() if CONFIG_DYNAMIC_DEBUG is 334 + * set. Otherwise, if DEBUG is defined, it's equivalent to a printk with 335 + * KERN_DEBUG loglevel. If DEBUG is not defined it does nothing. 336 + * 337 + * It uses pr_fmt() to generate the format string (dynamic_pr_debug() uses 338 + * pr_fmt() internally). 339 + */ 340 + #define pr_debug(fmt, ...) \ 407 341 dynamic_pr_debug(fmt, ##__VA_ARGS__) 408 342 #elif defined(DEBUG) 409 343 #define pr_debug(fmt, ...) \
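
Note on the pr_fmt() kernel-doc added above: the prefixing works purely by string-literal concatenation at compile time. A minimal userspace sketch, assuming stand-in definitions (KBUILD_MODNAME set to "demo", KERN_INFO written as the text "<6>", and mock_pr_info() formatting into a buffer instead of calling printk(); none of these stand-ins are the kernel's own definitions):

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-ins for illustration only; the kernel provides the real ones. */
#define KBUILD_MODNAME "demo"   /* hypothetical module name */
#define KERN_INFO "<6>"         /* hypothetical textual loglevel marker */

/* Defined before any pr_*() use, as the kernel-doc comment describes. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Mock of pr_info(): same expansion shape, but writes into a buffer. */
#define mock_pr_info(buf, size, fmt, ...) \
	snprintf(buf, size, KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

/* Formats one message; the literal pieces are joined by the compiler. */
int demo_format_info(char *buf, size_t size, int irq)
{
	return mock_pr_info(buf, size, "probe done, irq=%d\n", irq);
}
```

In a real source file the #define of pr_fmt has to come before the include of linux/printk.h, because the header only supplies its default when pr_fmt is not already defined (the #ifndef shown in the hunk). The ##__VA_ARGS__ form is a GNU extension, which the kernel macros above also rely on.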

+1 -1

include/linux/rbtree.h

··· 11 11 I know it's not the cleaner way, but in C (not in C++) to get 12 12 performances and genericity... 13 13 14 - See Documentation/rbtree.txt for documentation and samples. 14 + See Documentation/core-api/rbtree.rst for documentation and samples. 15 15 */ 16 16 17 17 #ifndef _LINUX_RBTREE_H

+1 -1

include/linux/rbtree_augmented.h

··· 21 21 * rb_insert_augmented() and rb_erase_augmented() are intended to be public. 22 22 * The rest are implementation details you are not expected to depend on. 23 23 * 24 - * See Documentation/rbtree.txt for documentation and samples. 24 + * See Documentation/core-api/rbtree.rst for documentation and samples. 25 25 */ 26 26 27 27 struct rb_augment_callbacks {

+1 -1

include/linux/relay.h

··· 141 141 * cause relay_open() to create a single global buffer rather 142 142 * than the default set of per-cpu buffers. 143 143 * 144 - * See Documentation/filesystems/relay.txt for more info. 144 + * See Documentation/filesystems/relay.rst for more info. 145 145 */ 146 146 struct dentry *(*create_buf_file)(const char *filename, 147 147 struct dentry *parent,

+1

include/linux/spi/spi.h

··· 394 394 * for example doing DMA mapping. Called from threaded 395 395 * context. 396 396 * @transfer_one: transfer a single spi_transfer. 397 + * 397 398 * - return 0 if the transfer is finished, 398 399 * - return 1 if the transfer is still in progress. When 399 400 * the driver is finished with this transfer it must

+1 -1

include/linux/sysfs.h

··· 7 7 * Copyright (c) 2007 SUSE Linux Products GmbH 8 8 * Copyright (c) 2007 Tejun Heo <teheo@suse.de> 9 9 * 10 - * Please see Documentation/filesystems/sysfs.txt for more information. 10 + * Please see Documentation/filesystems/sysfs.rst for more information. 11 11 */ 12 12 13 13 #ifndef _SYSFS_H_

+2 -2

include/linux/watchdog.h

··· 37 37 * 38 38 * The watchdog_ops structure contains a list of low-level operations 39 39 * that control a watchdog device. It also contains the module that owns 40 - * these operations. The start and stop function are mandatory, all other 40 + * these operations. The start function is mandatory, all other 41 41 * functions are optional. 42 42 */ 43 43 struct watchdog_ops { 44 44 struct module *owner; 45 45 /* mandatory operations */ 46 46 int (*start)(struct watchdog_device *); 47 - int (*stop)(struct watchdog_device *); 48 47 /* optional operations */ 48 + int (*stop)(struct watchdog_device *); 49 49 int (*ping)(struct watchdog_device *); 50 50 unsigned int (*status)(struct watchdog_device *); 51 51 int (*set_timeout)(struct watchdog_device *, unsigned int);
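
Note on the watchdog_ops comment change above: with stop moved to the optional group, an ops table that only sets start matches the documented minimum. A small userspace sketch, assuming a trimmed mock of the structure and a hypothetical mock_check_ops() helper (neither is kernel code, and the real registration path may apply further conditions that this sketch does not model):

```c
#include <errno.h>
#include <stddef.h>

struct watchdog_device;  /* opaque here; only pointers to it are used */

/* Trimmed mock of struct watchdog_ops keeping two of its members. */
struct mock_watchdog_ops {
	int (*start)(struct watchdog_device *);  /* mandatory */
	int (*stop)(struct watchdog_device *);   /* optional */
};

/* Hypothetical validation: only the mandatory operation is required. */
int mock_check_ops(const struct mock_watchdog_ops *ops)
{
	if (!ops || !ops->start)
		return -EINVAL;
	return 0;
}

/* Trivial start callback used by the example table below. */
int demo_start(struct watchdog_device *wdd)
{
	(void)wdd;
	return 0;
}

/* An ops table without stop, and one with nothing set at all. */
const struct mock_watchdog_ops demo_ops_no_stop = { .start = demo_start };
const struct mock_watchdog_ops demo_ops_empty = { 0 };
```

Here mock_check_ops() accepts demo_ops_no_stop and rejects demo_ops_empty with -EINVAL.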

+1 -1

include/uapi/linux/ethtool_netlink.h

··· 2 2 /* 3 3 * include/uapi/linux/ethtool_netlink.h - netlink interface for ethtool 4 4 * 5 - * See Documentation/networking/ethtool-netlink.txt in kernel source tree for 5 + * See Documentation/networking/ethtool-netlink.rst in kernel source tree for 6 6 * documentation of the interface. 7 7 */ 8 8

+1 -1

include/uapi/linux/firewire-cdev.h

··· 308 308 /** 309 309 * struct fw_cdev_event_iso_resource - Iso resources were allocated or freed 310 310 * @closure: See &fw_cdev_event_common; 311 - * set by %FW_CDEV_IOC_(DE)ALLOCATE_ISO_RESOURCE(_ONCE) ioctl 311 + * set by``FW_CDEV_IOC_(DE)ALLOCATE_ISO_RESOURCE(_ONCE)`` ioctl 312 312 * @type: %FW_CDEV_EVENT_ISO_RESOURCE_ALLOCATED or 313 313 * %FW_CDEV_EVENT_ISO_RESOURCE_DEALLOCATED 314 314 * @handle: Reference by which an allocated resource can be deallocated

+2 -2

include/uapi/linux/kvm.h

··· 116 116 * ACPI gsi notion of irq. 117 117 * For IA-64 (APIC model) IOAPIC0: irq 0-23; IOAPIC1: irq 24-47.. 118 118 * For X86 (standard AT mode) PIC0/1: irq 0-15. IOAPIC0: 0-23.. 119 - * For ARM: See Documentation/virt/kvm/api.txt 119 + * For ARM: See Documentation/virt/kvm/api.rst 120 120 */ 121 121 union { 122 122 __u32 irq; ··· 1107 1107 * 1108 1108 * KVM_IRQFD_FLAG_RESAMPLE indicates resamplefd is valid and specifies 1109 1109 * the irqfd to operate in resampling mode for level triggered interrupt 1110 - * emulation. See Documentation/virt/kvm/api.txt. 1110 + * emulation. See Documentation/virt/kvm/api.rst. 1111 1111 */ 1112 1112 #define KVM_IRQFD_FLAG_RESAMPLE (1 << 1) 1113 1113

+1 -1

include/uapi/rdma/rdma_user_ioctl_cmds.h

··· 36 36 #include <linux/types.h> 37 37 #include <linux/ioctl.h> 38 38 39 - /* Documentation/ioctl/ioctl-number.rst */ 39 + /* Documentation/userspace-api/ioctl/ioctl-number.rst */ 40 40 #define RDMA_IOCTL_MAGIC 0x1b 41 41 #define RDMA_VERBS_IOCTL \ 42 42 _IOWR(RDMA_IOCTL_MAGIC, 1, struct ib_uverbs_ioctl_hdr)

+3

kernel/futex.c

··· 486 486 * The key words are stored in @key on success. 487 487 * 488 488 * For shared mappings (when @fshared), the key is: 489 + * 489 490 * ( inode->i_sequence, page->index, offset_within_page ) 491 + * 490 492 * [ also see get_inode_sequence_number() ] 491 493 * 492 494 * For private mappings (or when !@fshared), the key is: 495 + * 493 496 * ( current->mm, address, 0 ) 494 497 * 495 498 * This allows (cross process, where applicable) identification of the futex
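
Note on the futex key comment above: the two tuples can be pictured as one three-word record whose meaning depends on whether the mapping is shared. A simplified sketch, assuming a hypothetical flattened struct demo_futex_key and helper names (the kernel's own key type is laid out differently; this only mirrors the tuples in the comment):

```c
#include <stdint.h>

/* Hypothetical flattened key mirroring the two tuples in the comment. */
struct demo_futex_key {
	uint64_t word0;   /* inode sequence (shared) or mm identity (private) */
	uint64_t word1;   /* page index (shared) or address (private) */
	uint32_t offset;  /* offset within page (shared) or 0 (private) */
};

/* Shared mapping: ( inode->i_sequence, page->index, offset_within_page ) */
struct demo_futex_key demo_shared_key(uint64_t i_sequence, uint64_t page_index,
				      uint32_t offset_within_page)
{
	struct demo_futex_key key = { i_sequence, page_index, offset_within_page };
	return key;
}

/* Private mapping: ( current->mm, address, 0 ) */
struct demo_futex_key demo_private_key(uint64_t mm_id, uint64_t address)
{
	struct demo_futex_key key = { mm_id, address, 0 };
	return key;
}

/* Two waiters refer to the same futex when all three words match. */
int demo_key_equal(const struct demo_futex_key *a, const struct demo_futex_key *b)
{
	return a->word0 == b->word0 && a->word1 == b->word1 &&
	       a->offset == b->offset;
}
```

The shared form contains nothing process-specific, so two processes mapping the same file page at different virtual addresses arrive at equal keys. The private form includes the mm, so the same address in two different processes yields different keys.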

+1 -1

kernel/relay.c

··· 1 1 /* 2 2 * Public API and common code for kernel->userspace relay file support. 3 3 * 4 - * See Documentation/filesystems/relay.txt for an overview. 4 + * See Documentation/filesystems/relay.rst for an overview. 5 5 * 6 6 * Copyright (C) 2002-2005 - Tom Zanussi (zanussi@us.ibm.com), IBM Corp 7 7 * Copyright (C) 1999-2005 - Karim Yaghmour (karim@opersys.com)

+1 -1

lib/Kconfig

··· 433 433 434 434 See: 435 435 436 - Documentation/rbtree.txt 436 + Documentation/core-api/rbtree.rst 437 437 438 438 for more information. 439 439

+1 -1

lib/Kconfig.debug

··· 1515 1515 This code (~1k) is freed after boot. By then, the firewire stack 1516 1516 in charge of the OHCI-1394 controllers should be used instead. 1517 1517 1518 - See Documentation/debugging-via-ohci1394.txt for more information. 1518 + See Documentation/core-api/debugging-via-ohci1394.rst for more information. 1519 1519 1520 1520 source "samples/Kconfig" 1521 1521

+14 -13

lib/bitmap.c

··· 182 182 * 183 183 * In pictures, example for a big-endian 32-bit architecture: 184 184 * 185 - * @src: 186 - * 31 63 187 - * | | 188 - * 10000000 11000001 11110010 00010101 10000000 11000001 01110010 00010101 189 - * | | | | 190 - * 16 14 0 32 185 + * The @src bitmap is:: 191 186 * 192 - * if @cut is 3, and @first is 14, bits 14-16 in @src are cut and @dst is: 187 + * 31 63 188 + * | | 189 + * 10000000 11000001 11110010 00010101 10000000 11000001 01110010 00010101 190 + * | | | | 191 + * 16 14 0 32 193 192 * 194 - * 31 63 195 - * | | 196 - * 10110000 00011000 00110010 00010101 00010000 00011000 00101110 01000010 197 - * | | | 198 - * 14 (bit 17 0 32 199 - * from @src) 193 + * if @cut is 3, and @first is 14, bits 14-16 in @src are cut and @dst is:: 194 + * 195 + * 31 63 196 + * | | 197 + * 10110000 00011000 00110010 00010101 00010000 00011000 00101110 01000010 198 + * | | | 199 + * 14 (bit 17 0 32 200 + * from @src) 200 201 * 201 202 * Note that @dst and @src might overlap partially or entirely. 202 203 *
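
Note on the bitmap_cut() picture above: the same cut can be reproduced on a single 64-bit value. A minimal sketch, assuming a hypothetical helper demo_cut64() that is not the kernel implementation (it handles one word only and zero-fills the vacated top bits):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Remove `cut` bits starting at bit `first` from a 64-bit value and shift
 * the bits above the cut region down to close the gap.
 */
uint64_t demo_cut64(uint64_t src, unsigned int first, unsigned int cut)
{
	uint64_t keep_mask, low, high;

	assert(first < 64 && cut < 64 && first + cut <= 64);

	/* Mask covering bits 0 .. first-1, which stay where they are. */
	keep_mask = first ? (~0ULL >> (64 - first)) : 0;
	low = src & keep_mask;

	/* Everything from bit first+cut upwards moves down by `cut`. */
	high = (src >> cut) & ~keep_mask;

	return low | high;
}
```

Reading the @src picture as the 64-bit value 0x80C1721580C1F215 (bits 63..32 in the right-hand group), demo_cut64(src, 14, 3) gives 0x10182E42B0183215, which matches the @dst picture bit for bit.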

+2 -2

lib/kobject.c

··· 6 6 * Copyright (c) 2006-2007 Greg Kroah-Hartman <greg@kroah.com> 7 7 * Copyright (c) 2006-2007 Novell Inc. 8 8 * 9 - * Please see the file Documentation/kobject.txt for critical information 9 + * Please see the file Documentation/core-api/kobject.rst for critical information 10 10 * about using the kobject interface. 11 11 */ 12 12 ··· 670 670 kobject_name(kobj), kobj, __func__, kobj->parent); 671 671 672 672 if (t && !t->release) 673 - pr_debug("kobject: '%s' (%p): does not have a release() function, it is broken and must be fixed. See Documentation/kobject.txt.\n", 673 + pr_debug("kobject: '%s' (%p): does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst.\n", 674 674 kobject_name(kobj), kobj); 675 675 676 676 /* send "remove" if the caller did not do it but sent "add" */

+6 -6

mm/gup.c

··· 2845 2845 * the arguments here are identical. 2846 2846 * 2847 2847 * FOLL_PIN means that the pages must be released via unpin_user_page(). Please 2848 - * see Documentation/vm/pin_user_pages.rst for further details. 2848 + * see Documentation/core-api/pin_user_pages.rst for further details. 2849 2849 * 2850 - * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It 2850 + * This is intended for Case 1 (DIO) in Documentation/core-api/pin_user_pages.rst. It 2851 2851 * is NOT intended for Case 2 (RDMA: long-term pins). 2852 2852 */ 2853 2853 int pin_user_pages_fast(unsigned long start, int nr_pages, ··· 2885 2885 * the arguments here are identical. 2886 2886 * 2887 2887 * FOLL_PIN means that the pages must be released via unpin_user_page(). Please 2888 - * see Documentation/vm/pin_user_pages.rst for details. 2888 + * see Documentation/core-api/pin_user_pages.rst for details. 2889 2889 * 2890 - * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It 2890 + * This is intended for Case 1 (DIO) in Documentation/core-api/pin_user_pages.rst. It 2891 2891 * is NOT intended for Case 2 (RDMA: long-term pins). 2892 2892 */ 2893 2893 long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, ··· 2921 2921 * FOLL_PIN is set. 2922 2922 * 2923 2923 * FOLL_PIN means that the pages must be released via unpin_user_page(). Please 2924 - * see Documentation/vm/pin_user_pages.rst for details. 2924 + * see Documentation/core-api/pin_user_pages.rst for details. 2925 2925 * 2926 - * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It 2926 + * This is intended for Case 1 (DIO) in Documentation/core-api/pin_user_pages.rst. It 2927 2927 * is NOT intended for Case 2 (RDMA: long-term pins). 2928 2928 */ 2929 2929 long pin_user_pages(unsigned long start, unsigned long nr_pages,

+1 -1

samples/Kconfig

··· 171 171 172 172 config SAMPLE_ANDROID_BINDERFS 173 173 bool "Build Android binderfs example" 174 - depends on CONFIG_ANDROID_BINDERFS 174 + depends on ANDROID_BINDERFS 175 175 help 176 176 Builds a sample program to illustrate the use of the Android binderfs 177 177 filesystem.

+5 -1

samples/binderfs/Makefile

··· 1 1 # SPDX-License-Identifier: GPL-2.0-only 2 - obj-$(CONFIG_SAMPLE_ANDROID_BINDERFS) += binderfs_example.o 2 + ifndef CROSS_COMPILE 3 + ifdef CONFIG_SAMPLE_ANDROID_BINDERFS 4 + hostprogs := binderfs_example 5 + endif 6 + endif

+28 -13

scripts/kernel-doc

··· 213 213 my $type_constant2 = '\%([-_\w]+)'; 214 214 my $type_func = '(\w+)'; 215 215 my $type_param = '\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)'; 216 + my $type_param_ref = '([\!]?)\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)'; 216 217 my $type_fp_param = '\@(\w+)'; # Special RST handling for func ptr params 218 + my $type_fp_param2 = '\@(\w+->\S+)'; # Special RST handling for structs with func ptr params 217 219 my $type_env = '(\$\w+)'; 218 220 my $type_enum = '\&(enum\s*([_\w]+))'; 219 221 my $type_struct = '\&(struct\s*([_\w]+))'; ··· 238 236 [$type_typedef, "\\\\fI\$1\\\\fP"], 239 237 [$type_union, "\\\\fI\$1\\\\fP"], 240 238 [$type_param, "\\\\fI\$1\\\\fP"], 239 + [$type_param_ref, "\\\\fI\$1\$2\\\\fP"], 241 240 [$type_member, "\\\\fI\$1\$2\$3\\\\fP"], 242 241 [$type_fallback, "\\\\fI\$1\\\\fP"] 243 242 ); ··· 252 249 [$type_member_func, "\\:c\\:type\\:`\$1\$2\$3\\\$\\\$ <\$1>`"], 253 250 [$type_member, "\\:c\\:type\\:`\$1\$2\$3 <\$1>`"], 254 251 [$type_fp_param, "**\$1\\\$\\\$**"], 252 + [$type_fp_param2, "**\$1\\\$\\\$**"], 255 253 [$type_func, "\$1()"], 256 254 [$type_enum, "\\:c\\:type\\:`\$1 <\$2>`"], 257 255 [$type_struct, "\\:c\\:type\\:`\$1 <\$2>`"], ··· 260 256 [$type_union, "\\:c\\:type\\:`\$1 <\$2>`"], 261 257 # in rst this can refer to any type 262 258 [$type_fallback, "\\:c\\:type\\:`\$1`"], 263 - [$type_param, "**\$1**"] 259 + [$type_param_ref, "**\$1\$2**"] 264 260 ); 265 261 my $blankline_rst = "\n"; 266 262 ··· 331 327 332 328 # Parser states 333 329 use constant { 334 - STATE_NORMAL => 0, # normal code 335 - STATE_NAME => 1, # looking for function name 336 - STATE_BODY_MAYBE => 2, # body - or maybe more description 337 - STATE_BODY => 3, # the body of the comment 338 - STATE_PROTO => 4, # scanning prototype 339 - STATE_DOCBLOCK => 5, # documentation block 340 - STATE_INLINE => 6, # gathering documentation outside main block 330 + STATE_NORMAL => 0, # normal code 331 + STATE_NAME => 1, # looking for function name 332 + STATE_BODY_MAYBE => 2, # body - or 
maybe more description 333 + STATE_BODY => 3, # the body of the comment 334 + STATE_BODY_WITH_BLANK_LINE => 4, # the body, which has a blank line 335 + STATE_PROTO => 5, # scanning prototype 336 + STATE_DOCBLOCK => 6, # documentation block 337 + STATE_INLINE => 7, # gathering doc outside main block 341 338 }; 342 339 my $state; 343 340 my $in_doc_sect; ··· 1958 1953 } 1959 1954 } 1960 1955 1956 + if ($state == STATE_BODY_WITH_BLANK_LINE && /^\s*\*\s?\S/) { 1957 + dump_section($file, $section, $contents); 1958 + $section = $section_default; 1959 + $contents = ""; 1960 + } 1961 + 1961 1962 if (/$doc_sect/i) { # case insensitive for supported section names 1962 1963 $newsection = $1; 1963 1964 $newcontents = $2; ··· 2017 2006 $state = STATE_PROTO; 2018 2007 $brcount = 0; 2019 2008 } elsif (/$doc_content/) { 2020 - # miguel-style comment kludge, look for blank lines after 2021 - # @parameter line to signify start of description 2022 2009 if ($1 eq "") { 2023 - if ($section =~ m/^@/ || $section eq $section_context) { 2010 + if ($section eq $section_context) { 2024 2011 dump_section($file, $section, $contents); 2025 2012 $section = $section_default; 2026 2013 $contents = ""; 2027 2014 $new_start_line = $.; 2015 + $state = STATE_BODY; 2028 2016 } else { 2017 + if ($section ne $section_default) { 2018 + $state = STATE_BODY_WITH_BLANK_LINE; 2019 + } else { 2020 + $state = STATE_BODY; 2021 + } 2029 2022 $contents .= "\n"; 2030 2023 } 2031 - $state = STATE_BODY; 2032 2024 } elsif ($state == STATE_BODY_MAYBE) { 2033 2025 # Continued declaration purpose 2034 2026 chomp($declaration_purpose); ··· 2183 2169 process_normal(); 2184 2170 } elsif ($state == STATE_NAME) { 2185 2171 process_name($file, $_); 2186 - } elsif ($state == STATE_BODY || $state == STATE_BODY_MAYBE) { 2172 + } elsif ($state == STATE_BODY || $state == STATE_BODY_MAYBE || 2173 + $state == STATE_BODY_WITH_BLANK_LINE) { 2187 2174 process_body($file, $_); 2188 2175 } elsif ($state == STATE_INLINE) { # scanning for 
inline parameters 2189 2176 process_inline($file, $_);

+214 -75

scripts/sphinx-pre-install

··· 2 2 # SPDX-License-Identifier: GPL-2.0-or-later 3 3 use strict; 4 4 5 - # Copyright (c) 2017-2019 Mauro Carvalho Chehab <mchehab@kernel.org> 5 + # Copyright (c) 2017-2020 Mauro Carvalho Chehab <mchehab@kernel.org> 6 6 # 7 7 8 8 my $prefix = "./"; ··· 22 22 my $optional = 0; 23 23 my $need_symlink = 0; 24 24 my $need_sphinx = 0; 25 + my $need_venv = 0; 26 + my $need_virtualenv = 0; 25 27 my $rec_sphinx_upgrade = 0; 26 28 my $install = ""; 27 29 my $virtenv_dir = ""; 30 + my $python_cmd = ""; 28 31 my $min_version; 32 + my $cur_version; 33 + my $rec_version = "1.7.9"; # PDF won't build here 34 + my $min_pdf_version = "2.4.4"; # Min version where pdf builds 29 35 30 36 # 31 37 # Command line arguments ··· 148 142 } 149 143 } 150 144 145 + sub find_python_no_venv() 146 + { 147 + my $prog = shift; 148 + 149 + my $cur_dir = qx(pwd); 150 + $cur_dir =~ s/\s+$//; 151 + 152 + foreach my $dir (split(/:/, $ENV{PATH})) { 153 + next if ($dir =~ m,($cur_dir)/sphinx,); 154 + return "$dir/python3" if(-x "$dir/python3"); 155 + } 156 + foreach my $dir (split(/:/, $ENV{PATH})) { 157 + next if ($dir =~ m,($cur_dir)/sphinx,); 158 + return "$dir/python" if(-x "$dir/python"); 159 + } 160 + return "python"; 161 + } 162 + 151 163 sub check_program($$) 152 164 { 153 165 my $prog = shift; 154 166 my $is_optional = shift; 155 167 156 - return if findprog($prog); 168 + return $prog if findprog($prog); 157 169 158 170 add_package($prog, $is_optional); 159 171 } ··· 192 168 my $prog = shift; 193 169 my $is_optional = shift; 194 170 195 - my $err = system("python3 -c 'import $prog' 2>/dev/null /dev/null"); 196 - return if ($err == 0); 197 - my $err = system("python -c 'import $prog' 2>/dev/null /dev/null"); 171 + return if (!$python_cmd); 172 + 173 + my $err = system("$python_cmd -c 'import $prog' 2>/dev/null /dev/null"); 198 174 return if ($err == 0); 199 175 200 176 add_package($prog, $is_optional); ··· 249 225 return $fname; 250 226 } 251 227 252 - if ($virtualenv) { 253 - my $prog = 
findprog("virtualenv-3"); 254 - $prog = findprog("virtualenv-3.5") if (!$prog); 255 - 256 - check_program("virtualenv", 0) if (!$prog); 257 - $need_sphinx = 1; 258 - } else { 259 - add_package("python-sphinx", 0); 260 - } 261 - 262 228 return ""; 229 + } 230 + 231 + sub get_sphinx_version($) 232 + { 233 + my $cmd = shift; 234 + my $ver; 235 + 236 + open IN, "$cmd --version 2>&1 |"; 237 + while (<IN>) { 238 + if (m/^\s*sphinx-build\s+([\d\.]+)(\+\/[\da-f]+)?$/) { 239 + $ver=$1; 240 + last; 241 + } 242 + # Sphinx 1.2.x uses a different format 243 + if (m/^\s*Sphinx.*\s+([\d\.]+)$/) { 244 + $ver=$1; 245 + last; 246 + } 247 + } 248 + close IN; 249 + return $ver; 263 250 } 264 251 265 252 sub check_sphinx() 266 253 { 267 - my $rec_version; 268 - my $cur_version; 254 + my $default_version; 269 255 270 256 open IN, $conf or die "Can't open $conf"; 271 257 while (<IN>) { ··· 291 257 open IN, $requirement_file or die "Can't open $requirement_file"; 292 258 while (<IN>) { 293 259 if (m/^\s*Sphinx\s*==\s*([\d\.]+)$/) { 294 - $rec_version=$1; 260 + $default_version=$1; 295 261 last; 296 262 } 297 263 } 298 264 close IN; 299 265 300 - die "Can't get recommended sphinx version from $requirement_file" if (!$min_version); 266 + die "Can't get default sphinx version from $requirement_file" if (!$default_version); 301 267 302 - $virtenv_dir = $virtenv_prefix . $rec_version; 268 + $virtenv_dir = $virtenv_prefix . 
$default_version;

 	my $sphinx = get_sphinx_fname();
-	return if ($sphinx eq "");
-
-	open IN, "$sphinx --version 2>&1 |" or die "$sphinx returned an error";
-	while (<IN>) {
-		if (m/^\s*sphinx-build\s+([\d\.]+)(\+\/[\da-f]+)?$/) {
-			$cur_version=$1;
-			last;
-		}
-		# Sphinx 1.2.x uses a different format
-		if (m/^\s*Sphinx.*\s+([\d\.]+)$/) {
-			$cur_version=$1;
-			last;
-		}
+	if ($sphinx eq "") {
+		$need_sphinx = 1;
+		return;
 	}
-	close IN;
+
+	$cur_version = get_sphinx_version($sphinx);
+	die ("$sphinx returned an error") if (!$cur_version);

 	die "$sphinx didn't return its version" if (!$cur_version);

 	if ($cur_version lt $min_version) {
 		printf "ERROR: Sphinx version is %s. It should be >= %s (recommended >= %s)\n",
-		       $cur_version, $min_version, $rec_version;;
+		       $cur_version, $min_version, $default_version;
 		$need_sphinx = 1;
 		return;
 	}

 	if ($cur_version lt $rec_version) {
-		printf "Sphinx version %s\n", $cur_version;
-		print "Warning: It is recommended at least Sphinx version $rec_version.\n";
+		$rec_sphinx_upgrade = 1;
+		return;
+	}
+	if ($cur_version lt $min_pdf_version) {
 		$rec_sphinx_upgrade = 1;
 		return;
 	}
···
 	my %map = (
 		"python-sphinx" => "python3-sphinx",
 		"sphinx_rtd_theme" => "python3-sphinx-rtd-theme",
+		"ensurepip" => "python3-venv",
 		"virtualenv" => "virtualenv",
 		"dot" => "graphviz",
 		"convert" => "imagemagick",
···
 				   "fonts-dejavu", 2);

 		check_missing_file(["/usr/share/fonts/noto-cjk/NotoSansCJK-Regular.ttc",
-				    "/usr/share/fonts/opentype/noto/NotoSerifCJK-Regular.ttc"],
+				    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
+				    "/usr/share/fonts/opentype/noto/NotoSerifCJK-Regular.ttc"],
 				   "fonts-noto-cjk", 2);
 	}

···
 		"convert" => "ImageMagick",
 		"Pod::Usage" => "perl-Pod-Usage",
 		"xelatex" => "texlive-xetex-bin",
-		"rsvg-convert" => "rsvg-view",
 	);
+
+	# On Tumbleweed, this package is also named rsvg-convert
+	$map{"rsvg-convert"} = "rsvg-view" if (!($system_release =~ /Tumbleweed/));

 	my @suse_tex_pkgs = (
 		"texlive-babel-english",
···
 		"convert" => "ImageMagick",
 		"Pod::Usage" => "perl-Pod-Usage",
 		"xelatex" => "texlive",
-		"rsvg-convert" => "librsvg2-tools",
+		"rsvg-convert" => "librsvg2",
 	);

 	my @tex_pkgs = (
···

 	$map{"latexmk"} = "texlive-collection-basic";

+	my $packager_cmd;
+	my $noto_sans;
+	if ($system_release =~ /OpenMandriva/) {
+		$packager_cmd = "dnf install";
+		$noto_sans = "noto-sans-cjk-fonts";
+		@tex_pkgs = ( "texlive-collection-fontsextra" );
+	} else {
+		$packager_cmd = "urpmi";
+		$noto_sans = "google-noto-sans-cjk-ttc-fonts";
+	}
+
+
 	if ($pdf) {
-		check_missing_file(["/usr/share/fonts/google-noto-cjk/NotoSansCJK-Regular.ttc"],
-				   "google-noto-sans-cjk-ttc-fonts", 2);
+		check_missing_file(["/usr/share/fonts/google-noto-cjk/NotoSansCJK-Regular.ttc",
+				    "/usr/share/fonts/TTF/NotoSans-Regular.ttf"],
+				   $noto_sans, 2);
 	}

 	check_rpm_missing(\@tex_pkgs, 2) if ($pdf);
 	check_missing(\%map);

 	return if (!$need && !$optional);
-	printf("You should run:\n\n\tsudo urpmi $install\n");
+	printf("You should run:\n\n\tsudo $packager_cmd $install\n");
 }

 sub give_arch_linux_hints()
···
 			   "media-fonts/dejavu", 2) if ($pdf);

 	if ($pdf) {
-		check_missing_file(["/usr/share/fonts/noto-cjk/NotoSansCJKsc-Regular.otf"],
+		check_missing_file(["/usr/share/fonts/noto-cjk/NotoSansCJKsc-Regular.otf",
+				    "/usr/share/fonts/noto-cjk/NotoSerifCJK-Regular.ttc"],
 				   "media-fonts/noto-cjk", 2);
 	}

···
 	my $portage_imagemagick = "/etc/portage/package.use/imagemagick";
 	my $portage_cairo = "/etc/portage/package.use/graphviz";

-	if (qx(cat $portage_imagemagick) ne "$imagemagick\n") {
+	if (qx(grep imagemagick $portage_imagemagick 2>/dev/null) eq "") {
 		printf("\tsudo su -c 'echo \"$imagemagick\" > $portage_imagemagick'\n")
 	}
-	if (qx(cat $portage_cairo) ne "$cairo\n") {
+	if (qx(grep graphviz $portage_cairo 2>/dev/null) eq "") {
 		printf("\tsudo su -c 'echo \"$cairo\" > $portage_cairo'\n");
 	}

···
 		give_mageia_hints;
 		return;
 	}
+	if ($system_release =~ /OpenMandriva/) {
+		give_mageia_hints;
+		return;
+	}
 	if ($system_release =~ /Arch Linux/) {
 		give_arch_linux_hints;
 		return;
···

 sub deactivate_help()
 {
-	printf "\tIf you want to exit the virtualenv, you can use:\n";
+	printf "\nIf you want to exit the virtualenv, you can use:\n";
 	printf "\tdeactivate\n";
 }

 sub check_needs()
 {
-	# Check for needed programs/tools
+	# Check if Sphinx is already accessible from current environment
 	check_sphinx();

 	if ($system_release) {
-		print "Detected OS: $system_release.\n\n";
+		print "Detected OS: $system_release.\n";
 	} else {
-		print "Unknown OS\n\n";
+		print "Unknown OS\n";
+	}
+	printf "Sphinx version: %s\n\n", $cur_version if ($cur_version);
+
+	# Check python command line, trying first python3
+	$python_cmd = findprog("python3");
+	$python_cmd = check_program("python", 0) if (!$python_cmd);
+
+	# Check the type of virtual env, depending on Python version
+	if ($python_cmd) {
+		if ($virtualenv) {
+			my $tmp = qx($python_cmd --version 2>&1);
+			if ($tmp =~ m/(\d+\.)(\d+\.)/) {
+				if ($1 >= 3 && $2 >= 3) {
+					$need_venv = 1; # python 3.3 or upper
+				} else {
+					$need_virtualenv = 1;
+				}
+				if ($1 < 3) {
+					# Complain if it finds python2 (or worse)
+					printf "Warning: python$1 support is deprecated. Use it with caution!\n";
+				}
+			} else {
+				die "Warning: couldn't identify $python_cmd version!";
+			}
+		} else {
+			add_package("python-sphinx", 0);
+		}
 	}

-	print "To upgrade Sphinx, use:\n\n" if ($rec_sphinx_upgrade);
+	# Set virtualenv command line, if python < 3.3
+	my $virtualenv_cmd;
+	if ($need_virtualenv) {
+		$virtualenv_cmd = findprog("virtualenv-3");
+		$virtualenv_cmd = findprog("virtualenv-3.5") if (!$virtualenv_cmd);
+		if (!$virtualenv_cmd) {
+			check_program("virtualenv", 0);
+			$virtualenv_cmd = "virtualenv";
+		}
+	}

 	# Check for needed programs/tools
 	check_perl_module("Pod::Usage", 0);
···
 	check_program("rsvg-convert", 2) if ($pdf);
 	check_program("latexmk", 2) if ($pdf);

+	if ($need_sphinx || $rec_sphinx_upgrade) {
+		check_python_module("ensurepip", 0) if ($need_venv);
+	}
+
+	# Do distro-specific checks and output distro-install commands
 	check_distros();

+	if (!$python_cmd) {
+		if ($need == 1) {
+			die "Can't build as $need mandatory dependency is missing";
+		} elsif ($need) {
+			die "Can't build as $need mandatory dependencies are missing";
+		}
+	}
+
+	# Check if sphinx-build is called sphinx-build-3
 	if ($need_symlink) {
 		printf "\tsudo ln -sf %s /usr/bin/sphinx-build\n\n",
 		       which("sphinx-build-3");
 	}
+
+	# NOTE: if the system has a too old Sphinx version installed,
+	# it will recommend installing a newer version using virtualenv
+
 	if ($need_sphinx || $rec_sphinx_upgrade) {
 		my $min_activate = "$ENV{'PWD'}/${virtenv_prefix}${min_version}/bin/activate";
 		my @activates = glob "$ENV{'PWD'}/${virtenv_prefix}*/bin/activate";

+		if ($cur_version lt $rec_version) {
+			print "Warning: It is recommended at least Sphinx version $rec_version.\n";
+			print " If you want pdf, you need at least $min_pdf_version.\n";
+		}
+		if ($cur_version lt $min_pdf_version) {
+			print "Note: It is recommended at least Sphinx version $min_pdf_version if you need PDF support.\n";
+		}
 		@activates = sort {$b cmp $a} @activates;
+		my ($activate, $ver);
+		foreach my $f (@activates) {
+			next if ($f lt $min_activate);

-		if ($need_sphinx && scalar @activates > 0 && $activates[0] ge $min_activate) {
-			printf "\nNeed to activate a compatible Sphinx version on virtualenv with:\n";
-			printf "\t. $activates[0]\n";
-			deactivate_help();
-			exit (1);
+			my $sphinx_cmd = $f;
+			$sphinx_cmd =~ s/activate/sphinx-build/;
+			next if (! -f $sphinx_cmd);
+
+			$ver = get_sphinx_version($sphinx_cmd);
+			if ($need_sphinx && ($ver ge $min_version)) {
+				$activate = $f;
+				last;
+			} elsif ($ver gt $cur_version) {
+				$activate = $f;
+				last;
+			}
+		}
+		if ($activate ne "") {
+			if ($need_sphinx) {
+				printf "\nNeed to activate Sphinx (version $ver) on virtualenv with:\n";
+				printf "\t. $activate\n";
+				deactivate_help();
+				exit (1);
+			} else {
+				printf "\nYou may also use a newer Sphinx (version $ver) with:\n";
+				printf "\tdeactivate && . $activate\n";
+			}
 		} else {
 			my $rec_activate = "$virtenv_dir/bin/activate";
-			my $virtualenv = findprog("virtualenv-3");
-			my $rec_python3 = "";
-			$virtualenv = findprog("virtualenv-3.5") if (!$virtualenv);
-			$virtualenv = findprog("virtualenv") if (!$virtualenv);
-			$virtualenv = "virtualenv" if (!$virtualenv);

-			my $rel = "";
-			if (index($system_release, "Ubuntu") != -1) {
-				$rel = $1 if ($system_release =~ /Ubuntu\s+(\d+)[.]/);
-				if ($rel && $rel >= 16) {
-					$rec_python3 = " -p python3";
-				}
-			}
-			if (index($system_release, "Debian") != -1) {
-				$rel = $1 if ($system_release =~ /Debian\s+(\d+)/);
-				if ($rel && $rel >= 7) {
-					$rec_python3 = " -p python3";
-				}
-			}
+			print "To upgrade Sphinx, use:\n\n" if ($rec_sphinx_upgrade);

-			printf "\t$virtualenv$rec_python3 $virtenv_dir\n";
+			$python_cmd = find_python_no_venv();
+
+			if ($need_venv) {
+				printf "\t$python_cmd -m venv $virtenv_dir\n";
+			} else {
+				printf "\t$virtualenv_cmd $virtenv_dir\n";
+			}
 			printf "\t. $rec_activate\n";
 			printf "\tpip install -r $requirement_file\n";
 			deactivate_help();
···
 $system_release = catcheck("/etc/redhat-release") if !$system_release;
 $system_release = catcheck("/etc/lsb-release") if !$system_release;
 $system_release = catcheck("/etc/gentoo-release") if !$system_release;
+
+# This seems more common than LSB these days
+if (!$system_release) {
+	my %os_var;
+	if (open IN, "cat /etc/os-release|") {
+		while (<IN>) {
+			if (m/^([\w\d\_]+)=\"?([^\"]*)\"?\n/) {
+				$os_var{$1}=$2;
+			}
+		}
+		$system_release = $os_var{"NAME"};
+		if (defined($os_var{"VERSION_ID"})) {
+			$system_release .= " " . $os_var{"VERSION_ID"} if (defined($os_var{"VERSION_ID"}));
+		} else {
+			$system_release .= " " . $os_var{"VERSION"};
+		}
+	}
+}
 $system_release = catcheck("/etc/issue") if !$system_release;
 $system_release =~ s/\s+$//;
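The /etc/os-release fallback added in the last hunk above can be tried out in isolation. The following is a minimal Python sketch of the same logic, not part of the patch: the function names `parse_os_release` and `system_release` are hypothetical, and the regular expression is copied from the hunk on the assumption that it behaves the same way under Python's `re` module.

```python
import re

# Same pattern as in the hunk: KEY="value" or KEY=value, terminated by a newline.
_OS_RELEASE_LINE = re.compile(r'^([\w\d_]+)="?([^"]*)"?\n')


def parse_os_release(text):
    """Collect KEY=value pairs from os-release style text (hypothetical helper)."""
    fields = {}
    for line in text.splitlines(keepends=True):
        match = _OS_RELEASE_LINE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def system_release(fields):
    """Join NAME with VERSION_ID, falling back to VERSION as the script does."""
    name = fields.get("NAME", "")
    version = fields.get("VERSION_ID", fields.get("VERSION", ""))
    # The script strips trailing whitespace afterwards; strip() covers that case.
    return f"{name} {version}".strip()


if __name__ == "__main__":
    sample = 'NAME="openSUSE Tumbleweed"\nVERSION_ID="20200520"\n'
    print(system_release(parse_os_release(sample)))
```

As in the Perl code, a final line without a trailing newline is not matched, so the sketch keeps line endings when splitting.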

+1 -1

tools/include/linux/rbtree.h

···
   I know it's not the cleaner way, but in C (not in C++) to get
   performances and genericity...

-  See Documentation/rbtree.txt for documentation and samples.
+  See Documentation/core-api/rbtree.rst for documentation and samples.
 */

 #ifndef __TOOLS_LINUX_PERF_RBTREE_H

+1 -1

tools/include/linux/rbtree_augmented.h

···
  * rb_insert_augmented() and rb_erase_augmented() are intended to be public.
  * The rest are implementation details you are not expected to depend on.
  *
- * See Documentation/rbtree.txt for documentation and samples.
+ * See Documentation/core-api/rbtree.rst for documentation and samples.
  */

 struct rb_augment_callbacks {

+2 -2

tools/include/uapi/linux/kvm.h

···
 	 * ACPI gsi notion of irq.
 	 * For IA-64 (APIC model) IOAPIC0: irq 0-23; IOAPIC1: irq 24-47..
 	 * For X86 (standard AT mode) PIC0/1: irq 0-15. IOAPIC0: 0-23..
-	 * For ARM: See Documentation/virt/kvm/api.txt
+	 * For ARM: See Documentation/virt/kvm/api.rst
 	 */
 	union {
 		__u32 irq;
···
  *
  * KVM_IRQFD_FLAG_RESAMPLE indicates resamplefd is valid and specifies
  * the irqfd to operate in resampling mode for level triggered interrupt
- * emulation. See Documentation/virt/kvm/api.txt.
+ * emulation. See Documentation/virt/kvm/api.rst.
  */
 #define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)

+1 -1

virt/kvm/arm/vgic/vgic-mmio-v3.c

···
 	 * pending state of interrupt is latched in pending_latch variable.
 	 * Userspace will save and restore pending state and line_level
 	 * separately.
-	 * Refer to Documentation/virt/kvm/devices/arm-vgic-v3.txt
+	 * Refer to Documentation/virt/kvm/devices/arm-vgic-v3.rst
 	 * for handling of ISPENDR and ICPENDR.
 	 */
 	for (i = 0; i < len * 8; i++) {

+2 -2

virt/kvm/arm/vgic/vgic.h

···
 			      VGIC_AFFINITY_LEVEL(val, 3))

 /*
- * As per Documentation/virt/kvm/devices/arm-vgic-v3.txt,
+ * As per Documentation/virt/kvm/devices/arm-vgic-v3.rst,
  * below macros are defined for CPUREG encoding.
  */
 #define KVM_REG_ARM_VGIC_SYSREG_OP0_MASK 0x000000000000c000
···
 				KVM_REG_ARM_VGIC_SYSREG_OP2_MASK)

 /*
- * As per Documentation/virt/kvm/devices/arm-vgic-its.txt,
+ * As per Documentation/virt/kvm/devices/arm-vgic-its.rst,
  * below macros are defined for ITS table entry encoding.
  */
 #define KVM_ITS_CTE_VALID_SHIFT 63