
Merge tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Thomas Gleixner:
"RCU, LKMM and KCSAN updates collected by Paul McKenney.

RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs

- Lockdep-RCU updates reducing the need for __maybe_unused

- Tasks-RCU updates

- Miscellaneous fixes

- Documentation updates

- Torture-test updates

KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers

- fix to watchpoint encoding

LKMM:
- updates for documentation along with some updates to example-code
litmus tests"

* tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
srcu: Take early exit on memory-allocation failure
rcu/tree: Defer kvfree_rcu() allocation to a clean context
rcu: Do not report strict GPs for outgoing CPUs
rcu: Fix a typo in rcu_blocking_is_gp() header comment
rcu: Prevent lockdep-RCU splats on lock acquisition/release
rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
rcu,ftrace: Fix ftrace recursion
rcu/tree: Make struct kernel_param_ops definitions const
rcu/tree: Add a warning if CPU being onlined did not report QS already
rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
rcu: Fix single-CPU check in rcu_blocking_is_gp()
rcu: Implement rcu_segcblist_is_offloaded() config dependent
list.h: Update comment to explicitly note circular lists
rcu: Panic after fixed number of stalls
x86/smpboot: Move rcu_cpu_starting() earlier
rcu: Allow rcu_irq_enter_check_tick() from NMI
tools/memory-model: Label MP tests' producers and consumers
tools/memory-model: Use "buf" and "flag" for message-passing tests
tools/memory-model: Add types to litmus tests
tools/memory-model: Add a glossary of LKMM terms
...

+1825 -323
+39 -9
Documentation/RCU/Design/Requirements/Requirements.rst
··· 1929 1929 to allow the various kernel subsystems (including RCU) to respond 1930 1930 appropriately to a given CPU-hotplug operation. Most RCU operations may 1931 1931 be invoked from CPU-hotplug notifiers, including even synchronous 1932 - grace-period operations such as ``synchronize_rcu()`` and 1933 - ``synchronize_rcu_expedited()``. 1932 + grace-period operations such as (``synchronize_rcu()`` and 1933 + ``synchronize_rcu_expedited()``). However, these synchronous operations 1934 + do block and therefore cannot be invoked from notifiers that execute via 1935 + ``stop_machine()``, specifically those between the ``CPUHP_AP_OFFLINE`` 1936 + and ``CPUHP_AP_ONLINE`` states. 1934 1937 1935 - However, all-callback-wait operations such as ``rcu_barrier()`` are also 1936 - not supported, due to the fact that there are phases of CPU-hotplug 1937 - operations where the outgoing CPU's callbacks will not be invoked until 1938 - after the CPU-hotplug operation ends, which could also result in 1939 - deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations 1940 - during its execution, which results in another type of deadlock when 1941 - invoked from a CPU-hotplug notifier. 1938 + In addition, all-callback-wait operations such as ``rcu_barrier()`` may 1939 + not be invoked from any CPU-hotplug notifier. This restriction is due 1940 + to the fact that there are phases of CPU-hotplug operations where the 1941 + outgoing CPU's callbacks will not be invoked until after the CPU-hotplug 1942 + operation ends, which could also result in deadlock. Furthermore, 1943 + ``rcu_barrier()`` blocks CPU-hotplug operations during its execution, 1944 + which results in another type of deadlock when invoked from a CPU-hotplug 1945 + notifier. 1946 + 1947 + Finally, RCU must avoid deadlocks due to interaction between hotplug, 1948 + timers and grace period processing. 
It does so by maintaining its own set 1949 + of books that duplicate the centrally maintained ``cpu_online_mask``, 1950 + and also by reporting quiescent states explicitly when a CPU goes 1951 + offline. This explicit reporting of quiescent states avoids any need 1952 + for the force-quiescent-state loop (FQS) to report quiescent states for 1953 + offline CPUs. However, as a debugging measure, the FQS loop does splat 1954 + if offline CPUs block an RCU grace period for too long. 1955 + 1956 + An offline CPU's quiescent state will be reported either: 1957 + 1958 + 1. As the CPU goes offline using RCU's hotplug notifier (``rcu_report_dead()``). 1959 + 2. When grace period initialization (``rcu_gp_init()``) detects a 1960 + race either with CPU offlining or with a task unblocking on a leaf 1961 + ``rcu_node`` structure whose CPUs are all offline. 1962 + 1963 + The CPU-online path (``rcu_cpu_starting()``) should never need to report 1964 + a quiescent state for an offline CPU. However, as a debugging measure, 1965 + it does emit a warning if a quiescent state was not already reported 1966 + for that CPU. 1967 + 1968 + During the checking/modification of RCU's hotplug bookkeeping, the 1969 + corresponding CPU's leaf node lock is held. This avoids race conditions 1970 + between RCU's hotplug notifier hooks, the grace period initialization 1971 + code, and the FQS loop, all of which refer to or modify this bookkeeping. 1942 1972 1943 1973 Scheduler and RCU 1944 1974 ~~~~~~~~~~~~~~~~~
+7
Documentation/RCU/checklist.rst
··· 314 314 shared between readers and updaters. Additional primitives 315 315 are provided for this case, as discussed in lockdep.txt. 316 316 317 + One exception to this rule is when data is only ever added to 318 + the linked data structure, and is never removed during any 319 + time that readers might be accessing that structure. In such 320 + cases, READ_ONCE() may be used in place of rcu_dereference() 321 + and the read-side markers (rcu_read_lock() and rcu_read_unlock(), 322 + for example) may be omitted. 323 + 317 324 10. Conversely, if you are in an RCU read-side critical section, 318 325 and you don't hold the appropriate update-side lock, you -must- 319 326 use the "_rcu()" variants of the list macros. Failing to do so
+6
Documentation/RCU/rcu_dereference.rst
··· 28 28 for an example where the compiler can in fact deduce the exact 29 29 value of the pointer, and thus cause misordering. 30 30 31 + - In the special case where data is added but is never removed 32 + while readers are accessing the structure, READ_ONCE() may be used 33 + instead of rcu_dereference(). In this case, use of READ_ONCE() 34 + takes on the role of the lockless_dereference() primitive that 35 + was removed in v4.15. 36 + 31 37 - You are only permitted to use rcu_dereference on pointer values. 32 38 The compiler simply knows too much about integral values to 33 39 trust it to carry dependencies through integer operations.
+1 -2
Documentation/RCU/whatisRCU.rst
··· 497 497 In such cases, one uses call_rcu() rather than synchronize_rcu(). 498 498 The call_rcu() API is as follows:: 499 499 500 - void call_rcu(struct rcu_head * head, 501 - void (*func)(struct rcu_head *head)); 500 + void call_rcu(struct rcu_head *head, rcu_callback_t func); 502 501 503 502 This function invokes func(head) after a grace period has elapsed. 504 503 This invocation might happen from either softirq or process context,
+1 -1
Documentation/memory-barriers.txt
··· 1870 1870 1871 1871 These are for use with atomic RMW functions that do not imply memory 1872 1872 barriers, but where the code needs a memory barrier. Examples for atomic 1873 - RMW functions that do not imply are memory barrier are e.g. add, 1873 + RMW functions that do not imply a memory barrier are e.g. add, 1874 1874 subtract, (failed) conditional operations, _relaxed functions, 1875 1875 but not atomic_read or atomic_set. A common example where a memory 1876 1876 barrier may be required is when atomic ops are used for reference
+15 -1
arch/x86/kernel/cpu/aperfmperf.c
··· 14 14 #include <linux/cpufreq.h> 15 15 #include <linux/smp.h> 16 16 #include <linux/sched/isolation.h> 17 + #include <linux/rcupdate.h> 17 18 18 19 #include "cpu.h" 19 20 20 21 struct aperfmperf_sample { 21 22 unsigned int khz; 23 + atomic_t scfpending; 22 24 ktime_t time; 23 25 u64 aperf; 24 26 u64 mperf; ··· 64 62 s->aperf = aperf; 65 63 s->mperf = mperf; 66 64 s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); 65 + atomic_set_release(&s->scfpending, 0); 67 66 } 68 67 69 68 static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait) 70 69 { 71 70 s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu)); 71 + struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu); 72 72 73 73 /* Don't bother re-computing within the cache threshold time. */ 74 74 if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS) 75 75 return true; 76 76 77 - smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait); 77 + if (!atomic_xchg(&s->scfpending, 1) || wait) 78 + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait); 78 79 79 80 /* Return false if the previous iteration was too long ago. */ 80 81 return time_delta <= APERFMPERF_STALE_THRESHOLD_MS; ··· 93 88 94 89 if (!housekeeping_cpu(cpu, HK_FLAG_MISC)) 95 90 return 0; 91 + 92 + if (rcu_is_idle_cpu(cpu)) 93 + return 0; /* Idle CPUs are completely uninteresting. */ 96 94 97 95 aperfmperf_snapshot_cpu(cpu, ktime_get(), true); 98 96 return per_cpu(samples.khz, cpu); ··· 116 108 for_each_online_cpu(cpu) { 117 109 if (!housekeeping_cpu(cpu, HK_FLAG_MISC)) 118 110 continue; 111 + if (rcu_is_idle_cpu(cpu)) 112 + continue; /* Idle CPUs are completely uninteresting. 
*/ 119 113 if (!aperfmperf_snapshot_cpu(cpu, now, false)) 120 114 wait = true; 121 115 } ··· 128 118 129 119 unsigned int arch_freq_get_on_cpu(int cpu) 130 120 { 121 + struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu); 122 + 131 123 if (!cpu_khz) 132 124 return 0; 133 125 ··· 143 131 return per_cpu(samples.khz, cpu); 144 132 145 133 msleep(APERFMPERF_REFRESH_DELAY_MS); 134 + atomic_set(&s->scfpending, 1); 135 + smp_mb(); /* ->scfpending before smp_call_function_single(). */ 146 136 smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); 147 137 148 138 return per_cpu(samples.khz, cpu);
-2
arch/x86/kernel/cpu/mtrr/mtrr.c
··· 794 794 if (!use_intel() || mtrr_aps_delayed_init) 795 795 return; 796 796 797 - rcu_cpu_starting(smp_processor_id()); 798 - 799 797 /* 800 798 * Ideally we should hold mtrr_mutex here to avoid mtrr entries 801 799 * changed, but this routine will be called in cpu boot time,
+1
arch/x86/kernel/smpboot.c
··· 229 229 #endif 230 230 cpu_init_exception_handling(); 231 231 cpu_init(); 232 + rcu_cpu_starting(raw_smp_processor_id()); 232 233 x86_cpuinit.early_percpu_clock_init(); 233 234 preempt_disable(); 234 235 smp_callin();
+1
include/linux/kernel.h
··· 536 536 extern unsigned long panic_on_taint; 537 537 extern bool panic_on_taint_nousertaint; 538 538 extern int sysctl_panic_on_rcu_stall; 539 + extern int sysctl_max_rcu_stall_to_panic; 539 540 extern int sysctl_panic_on_stackoverflow; 540 541 541 542 extern bool crash_kexec_post_notifiers;
+1 -1
include/linux/list.h
··· 9 9 #include <linux/kernel.h> 10 10 11 11 /* 12 - * Simple doubly linked list implementation. 12 + * Circular doubly linked list implementation. 13 13 * 14 14 * Some of the internal functions ("__xxx") are useful when 15 15 * manipulating whole lists rather than single entries, as
+6
include/linux/lockdep.h
··· 375 375 376 376 #define lockdep_depth(tsk) (0) 377 377 378 + /* 379 + * Dummy forward declarations, allow users to write less ifdef-y code 380 + * and depend on dead code elimination. 381 + */ 382 + extern int lock_is_held(const void *); 383 + extern int lockdep_is_held(const void *); 378 384 #define lockdep_is_held_type(l, r) (1) 379 385 380 386 #define lockdep_assert_held(l) do { (void)(l); } while (0)
+6 -5
include/linux/rcupdate.h
··· 241 241 static inline bool rcu_lockdep_current_cpu_online(void) { return true; } 242 242 #endif /* #else #if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_PROVE_RCU) */ 243 243 244 + extern struct lockdep_map rcu_lock_map; 245 + extern struct lockdep_map rcu_bh_lock_map; 246 + extern struct lockdep_map rcu_sched_lock_map; 247 + extern struct lockdep_map rcu_callback_map; 248 + 244 249 #ifdef CONFIG_DEBUG_LOCK_ALLOC 245 250 246 251 static inline void rcu_lock_acquire(struct lockdep_map *map) ··· 258 253 lock_release(map, _THIS_IP_); 259 254 } 260 255 261 - extern struct lockdep_map rcu_lock_map; 262 - extern struct lockdep_map rcu_bh_lock_map; 263 - extern struct lockdep_map rcu_sched_lock_map; 264 - extern struct lockdep_map rcu_callback_map; 265 256 int debug_lockdep_rcu_enabled(void); 266 257 int rcu_read_lock_held(void); 267 258 int rcu_read_lock_bh_held(void); ··· 328 327 329 328 #else /* #ifdef CONFIG_PROVE_RCU */ 330 329 331 - #define RCU_LOCKDEP_WARN(c, s) do { } while (0) 330 + #define RCU_LOCKDEP_WARN(c, s) do { } while (0 && (c)) 332 331 #define rcu_sleep_check() do { } while (0) 333 332 334 333 #endif /* #else #ifdef CONFIG_PROVE_RCU */
+2 -2
include/linux/rcupdate_trace.h
··· 11 11 #include <linux/sched.h> 12 12 #include <linux/rcupdate.h> 13 13 14 - #ifdef CONFIG_DEBUG_LOCK_ALLOC 15 - 16 14 extern struct lockdep_map rcu_trace_lock_map; 15 + 16 + #ifdef CONFIG_DEBUG_LOCK_ALLOC 17 17 18 18 static inline int rcu_read_lock_trace_held(void) 19 19 {
+2
include/linux/rcutiny.h
··· 89 89 static inline void rcu_irq_exit(void) { } 90 90 static inline void rcu_irq_exit_preempt(void) { } 91 91 static inline void rcu_irq_exit_check_preempt(void) { } 92 + #define rcu_is_idle_cpu(cpu) \ 93 + (is_idle_task(current) && !in_nmi() && !in_irq() && !in_serving_softirq()) 92 94 static inline void exit_rcu(void) { } 93 95 static inline bool rcu_preempt_need_deferred_qs(struct task_struct *t) 94 96 {
+1
include/linux/rcutree.h
··· 50 50 void rcu_irq_exit_preempt(void); 51 51 void rcu_irq_enter_irqson(void); 52 52 void rcu_irq_exit_irqson(void); 53 + bool rcu_is_idle_cpu(int cpu); 53 54 54 55 #ifdef CONFIG_PROVE_RCU 55 56 void rcu_irq_exit_check_preempt(void);
-2
include/linux/sched/task.h
··· 47 47 extern union thread_union init_thread_union; 48 48 extern struct task_struct init_task; 49 49 50 - #ifdef CONFIG_PROVE_RCU 51 50 extern int lockdep_tasklist_lock_is_held(void); 52 - #endif /* #ifdef CONFIG_PROVE_RCU */ 53 51 54 52 extern asmlinkage void schedule_tail(struct task_struct *prev); 55 53 extern void init_idle(struct task_struct *idle, int cpu);
-12
include/net/sch_generic.h
··· 435 435 struct mutex proto_destroy_lock; /* Lock for proto_destroy hashtable. */ 436 436 }; 437 437 438 - #ifdef CONFIG_PROVE_LOCKING 439 438 static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain) 440 439 { 441 440 return lockdep_is_held(&chain->filter_chain_lock); ··· 444 445 { 445 446 return lockdep_is_held(&tp->lock); 446 447 } 447 - #else 448 - static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain) 449 - { 450 - return true; 451 - } 452 - 453 - static inline bool lockdep_tcf_proto_is_locked(struct tcf_proto *tp) 454 - { 455 - return true; 456 - } 457 - #endif /* #ifdef CONFIG_PROVE_LOCKING */ 458 448 459 449 #define tcf_chain_dereference(p, chain) \ 460 450 rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
-2
include/net/sock.h
··· 1566 1566 lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \ 1567 1567 } while (0) 1568 1568 1569 - #ifdef CONFIG_LOCKDEP 1570 1569 static inline bool lockdep_sock_is_held(const struct sock *sk) 1571 1570 { 1572 1571 return lockdep_is_held(&sk->sk_lock) || 1573 1572 lockdep_is_held(&sk->sk_lock.slock); 1574 1573 } 1575 - #endif 1576 1574 1577 1575 void lock_sock_nested(struct sock *sk, int subclass); 1578 1576
+11 -9
kernel/kcsan/encoding.h
··· 37 37 */ 38 38 #define WATCHPOINT_ADDR_BITS (BITS_PER_LONG-1 - WATCHPOINT_SIZE_BITS) 39 39 40 - /* 41 - * Masks to set/retrieve the encoded data. 42 - */ 43 - #define WATCHPOINT_WRITE_MASK BIT(BITS_PER_LONG-1) 44 - #define WATCHPOINT_SIZE_MASK \ 45 - GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-2 - WATCHPOINT_SIZE_BITS) 46 - #define WATCHPOINT_ADDR_MASK \ 47 - GENMASK(BITS_PER_LONG-3 - WATCHPOINT_SIZE_BITS, 0) 40 + /* Bitmasks for the encoded watchpoint access information. */ 41 + #define WATCHPOINT_WRITE_MASK BIT(BITS_PER_LONG-1) 42 + #define WATCHPOINT_SIZE_MASK GENMASK(BITS_PER_LONG-2, WATCHPOINT_ADDR_BITS) 43 + #define WATCHPOINT_ADDR_MASK GENMASK(WATCHPOINT_ADDR_BITS-1, 0) 44 + static_assert(WATCHPOINT_ADDR_MASK == (1UL << WATCHPOINT_ADDR_BITS) - 1); 45 + static_assert((WATCHPOINT_WRITE_MASK ^ WATCHPOINT_SIZE_MASK ^ WATCHPOINT_ADDR_MASK) == ~0UL); 48 46 49 47 static inline bool check_encodable(unsigned long addr, size_t size) 50 48 { 51 - return size <= MAX_ENCODABLE_SIZE; 49 + /* 50 + * While we can encode addrs<PAGE_SIZE, avoid crashing with a NULL 51 + * pointer deref inside KCSAN. 52 + */ 53 + return addr >= PAGE_SIZE && size <= MAX_ENCODABLE_SIZE; 52 54 } 53 55 54 56 static inline long
+3
kernel/kcsan/selftest.c
··· 33 33 unsigned long addr; 34 34 35 35 prandom_bytes(&addr, sizeof(addr)); 36 + if (addr < PAGE_SIZE) 37 + addr = PAGE_SIZE; 38 + 36 39 if (WARN_ON(!check_encodable(addr, size))) 37 40 return false; 38 41
+30 -6
kernel/locking/locktorture.c
··· 29 29 #include <linux/slab.h> 30 30 #include <linux/percpu-rwsem.h> 31 31 #include <linux/torture.h> 32 + #include <linux/reboot.h> 32 33 33 34 MODULE_LICENSE("GPL"); 34 35 MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>"); ··· 61 60 62 61 static bool lock_is_write_held; 63 62 static bool lock_is_read_held; 63 + static unsigned long last_lock_release; 64 64 65 65 struct lock_stress_stats { 66 66 long n_lock_fail; ··· 76 74 */ 77 75 struct lock_torture_ops { 78 76 void (*init)(void); 77 + void (*exit)(void); 79 78 int (*writelock)(void); 80 79 void (*write_delay)(struct torture_random_state *trsp); 81 80 void (*task_boost)(struct torture_random_state *trsp); ··· 93 90 int nrealwriters_stress; 94 91 int nrealreaders_stress; 95 92 bool debug_lock; 93 + bool init_called; 96 94 atomic_t n_lock_torture_errors; 97 95 struct lock_torture_ops *cur_ops; 98 96 struct lock_stress_stats *lwsa; /* writer statistics */ 99 97 struct lock_stress_stats *lrsa; /* reader statistics */ 100 98 }; 101 - static struct lock_torture_cxt cxt = { 0, 0, false, 99 + static struct lock_torture_cxt cxt = { 0, 0, false, false, 102 100 ATOMIC_INIT(0), 103 101 NULL, NULL}; 104 102 /* ··· 575 571 BUG_ON(percpu_init_rwsem(&pcpu_rwsem)); 576 572 } 577 573 574 + static void torture_percpu_rwsem_exit(void) 575 + { 576 + percpu_free_rwsem(&pcpu_rwsem); 577 + } 578 + 578 579 static int torture_percpu_rwsem_down_write(void) __acquires(pcpu_rwsem) 579 580 { 580 581 percpu_down_write(&pcpu_rwsem); ··· 604 595 605 596 static struct lock_torture_ops percpu_rwsem_lock_ops = { 606 597 .init = torture_percpu_rwsem_init, 598 + .exit = torture_percpu_rwsem_exit, 607 599 .writelock = torture_percpu_rwsem_down_write, 608 600 .write_delay = torture_rwsem_write_delay, 609 601 .task_boost = torture_boost_dummy, ··· 642 632 lwsp->n_lock_acquired++; 643 633 cxt.cur_ops->write_delay(&rand); 644 634 lock_is_write_held = false; 635 + WRITE_ONCE(last_lock_release, jiffies); 645 636 cxt.cur_ops->writeunlock(); 646 
637 647 638 stutter_wait("lock_torture_writer"); ··· 797 786 798 787 /* 799 788 * Indicates early cleanup, meaning that the test has not run, 800 - * such as when passing bogus args when loading the module. As 801 - * such, only perform the underlying torture-specific cleanups, 802 - * and avoid anything related to locktorture. 789 + * such as when passing bogus args when loading the module. 790 + * However cxt->cur_ops.init() may have been invoked, so beside 791 + * perform the underlying torture-specific cleanups, cur_ops.exit() 792 + * will be invoked if needed. 803 793 */ 804 794 if (!cxt.lwsa && !cxt.lrsa) 805 795 goto end; ··· 840 828 cxt.lrsa = NULL; 841 829 842 830 end: 831 + if (cxt.init_called) { 832 + if (cxt.cur_ops->exit) 833 + cxt.cur_ops->exit(); 834 + cxt.init_called = false; 835 + } 843 836 torture_cleanup_end(); 844 837 } 845 838 ··· 885 868 goto unwind; 886 869 } 887 870 888 - if (nwriters_stress == 0 && nreaders_stress == 0) { 871 + if (nwriters_stress == 0 && 872 + (!cxt.cur_ops->readlock || nreaders_stress == 0)) { 889 873 pr_alert("lock-torture: must run at least one locking thread\n"); 890 874 firsterr = -EINVAL; 891 875 goto unwind; 892 876 } 893 877 894 - if (cxt.cur_ops->init) 878 + if (cxt.cur_ops->init) { 895 879 cxt.cur_ops->init(); 880 + cxt.init_called = true; 881 + } 896 882 897 883 if (nwriters_stress >= 0) 898 884 cxt.nrealwriters_stress = nwriters_stress; ··· 1058 1038 unwind: 1059 1039 torture_init_end(); 1060 1040 lock_torture_cleanup(); 1041 + if (shutdown_secs) { 1042 + WARN_ON(!IS_MODULE(CONFIG_LOCK_TORTURE_TEST)); 1043 + kernel_power_off(); 1044 + } 1061 1045 return firsterr; 1062 1046 } 1063 1047
+11 -7
kernel/rcu/Kconfig
··· 221 221 Use this option to reduce OS jitter for aggressive HPC or 222 222 real-time workloads. It can also be used to offload RCU 223 223 callback invocation to energy-efficient CPUs in battery-powered 224 - asymmetric multiprocessors. 224 + asymmetric multiprocessors. The price of this reduced jitter 225 + is that the overhead of call_rcu() increases and that some 226 + workloads will incur significant increases in context-switch 227 + rates. 225 228 226 229 This option offloads callback invocation from the set of CPUs 227 230 specified at boot time by the rcu_nocbs parameter. For each 228 231 such CPU, a kthread ("rcuox/N") will be created to invoke 229 232 callbacks, where the "N" is the CPU being offloaded, and where 230 - the "p" for RCU-preempt (PREEMPTION kernels) and "s" for RCU-sched 231 - (!PREEMPTION kernels). Nothing prevents this kthread from running 232 - on the specified CPUs, but (1) the kthreads may be preempted 233 - between each callback, and (2) affinity or cgroups can be used 234 - to force the kthreads to run on whatever set of CPUs is desired. 233 + the "x" is "p" for RCU-preempt (PREEMPTION kernels) and "s" for 234 + RCU-sched (!PREEMPTION kernels). Nothing prevents this kthread 235 + from running on the specified CPUs, but (1) the kthreads may be 236 + preempted between each callback, and (2) affinity or cgroups can 237 + be used to force the kthreads to run on whatever set of CPUs is 238 + desired. 235 239 236 - Say Y here if you want to help to debug reduced OS jitter. 240 + Say Y here if you need reduced OS jitter, despite added overhead. 237 241 Say N here if you are unsure. 238 242 239 243 config TASKS_TRACE_RCU_READ_MB
+16
kernel/rcu/rcu.h
··· 533 533 static inline void rcu_bind_current_to_nocb(void) { } 534 534 #endif 535 535 536 + #if !defined(CONFIG_TINY_RCU) && defined(CONFIG_TASKS_RCU) 537 + void show_rcu_tasks_classic_gp_kthread(void); 538 + #else 539 + static inline void show_rcu_tasks_classic_gp_kthread(void) {} 540 + #endif 541 + #if !defined(CONFIG_TINY_RCU) && defined(CONFIG_TASKS_RUDE_RCU) 542 + void show_rcu_tasks_rude_gp_kthread(void); 543 + #else 544 + static inline void show_rcu_tasks_rude_gp_kthread(void) {} 545 + #endif 546 + #if !defined(CONFIG_TINY_RCU) && defined(CONFIG_TASKS_TRACE_RCU) 547 + void show_rcu_tasks_trace_gp_kthread(void); 548 + #else 549 + static inline void show_rcu_tasks_trace_gp_kthread(void) {} 550 + #endif 551 + 536 552 #endif /* __LINUX_RCU_H */
+1 -1
kernel/rcu/rcu_segcblist.h
··· 62 62 /* Is the specified rcu_segcblist offloaded? */ 63 63 static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp) 64 64 { 65 - return rsclp->offloaded; 65 + return IS_ENABLED(CONFIG_RCU_NOCB_CPU) && rsclp->offloaded; 66 66 } 67 67 68 68 /*
+35 -2
kernel/rcu/rcuscale.c
··· 38 38 #include <asm/byteorder.h> 39 39 #include <linux/torture.h> 40 40 #include <linux/vmalloc.h> 41 + #include <linux/rcupdate_trace.h> 41 42 42 43 #include "rcu.h" 43 44 ··· 293 292 .sync = synchronize_rcu_tasks, 294 293 .exp_sync = synchronize_rcu_tasks, 295 294 .name = "tasks" 295 + }; 296 + 297 + /* 298 + * Definitions for RCU-tasks-trace scalability testing. 299 + */ 300 + 301 + static int tasks_trace_scale_read_lock(void) 302 + { 303 + rcu_read_lock_trace(); 304 + return 0; 305 + } 306 + 307 + static void tasks_trace_scale_read_unlock(int idx) 308 + { 309 + rcu_read_unlock_trace(); 310 + } 311 + 312 + static struct rcu_scale_ops tasks_tracing_ops = { 313 + .ptype = RCU_TASKS_FLAVOR, 314 + .init = rcu_sync_scale_init, 315 + .readlock = tasks_trace_scale_read_lock, 316 + .readunlock = tasks_trace_scale_read_unlock, 317 + .get_gp_seq = rcu_no_completed, 318 + .gp_diff = rcu_seq_diff, 319 + .async = call_rcu_tasks_trace, 320 + .gp_barrier = rcu_barrier_tasks_trace, 321 + .sync = synchronize_rcu_tasks_trace, 322 + .exp_sync = synchronize_rcu_tasks_trace, 323 + .name = "tasks-tracing" 296 324 }; 297 325 298 326 static unsigned long rcuscale_seq_diff(unsigned long new, unsigned long old) ··· 784 754 long i; 785 755 int firsterr = 0; 786 756 static struct rcu_scale_ops *scale_ops[] = { 787 - &rcu_ops, &srcu_ops, &srcud_ops, &tasks_ops, 757 + &rcu_ops, &srcu_ops, &srcud_ops, &tasks_ops, &tasks_tracing_ops 788 758 }; 789 759 790 760 if (!torture_init_begin(scale_type, verbose)) ··· 802 772 for (i = 0; i < ARRAY_SIZE(scale_ops); i++) 803 773 pr_cont(" %s", scale_ops[i]->name); 804 774 pr_cont("\n"); 805 - WARN_ON(!IS_MODULE(CONFIG_RCU_SCALE_TEST)); 806 775 firsterr = -EINVAL; 807 776 cur_ops = NULL; 808 777 goto unwind; ··· 875 846 unwind: 876 847 torture_init_end(); 877 848 rcu_scale_cleanup(); 849 + if (shutdown) { 850 + WARN_ON(!IS_MODULE(CONFIG_RCU_SCALE_TEST)); 851 + kernel_power_off(); 852 + } 878 853 return firsterr; 879 854 } 880 855
+38 -14
kernel/rcu/rcutorture.c
··· 317 317 void (*cb_barrier)(void); 318 318 void (*fqs)(void); 319 319 void (*stats)(void); 320 + void (*gp_kthread_dbg)(void); 320 321 int (*stall_dur)(void); 321 322 int irq_capable; 322 323 int can_boost; ··· 467 466 .cb_barrier = rcu_barrier, 468 467 .fqs = rcu_force_quiescent_state, 469 468 .stats = NULL, 469 + .gp_kthread_dbg = show_rcu_gp_kthreads, 470 470 .stall_dur = rcu_jiffies_till_stall_check, 471 471 .irq_capable = 1, 472 472 .can_boost = rcu_can_boost(), ··· 695 693 .exp_sync = synchronize_rcu_mult_test, 696 694 .call = call_rcu_tasks, 697 695 .cb_barrier = rcu_barrier_tasks, 696 + .gp_kthread_dbg = show_rcu_tasks_classic_gp_kthread, 698 697 .fqs = NULL, 699 698 .stats = NULL, 700 699 .irq_capable = 1, ··· 765 762 .exp_sync = synchronize_rcu_tasks_rude, 766 763 .call = call_rcu_tasks_rude, 767 764 .cb_barrier = rcu_barrier_tasks_rude, 765 + .gp_kthread_dbg = show_rcu_tasks_rude_gp_kthread, 768 766 .fqs = NULL, 769 767 .stats = NULL, 770 768 .irq_capable = 1, ··· 804 800 .exp_sync = synchronize_rcu_tasks_trace, 805 801 .call = call_rcu_tasks_trace, 806 802 .cb_barrier = rcu_barrier_tasks_trace, 803 + .gp_kthread_dbg = show_rcu_tasks_trace_gp_kthread, 807 804 .fqs = NULL, 808 805 .stats = NULL, 809 806 .irq_capable = 1, ··· 917 912 oldstarttime = boost_starttime; 918 913 while (time_before(jiffies, oldstarttime)) { 919 914 schedule_timeout_interruptible(oldstarttime - jiffies); 920 - stutter_wait("rcu_torture_boost"); 915 + if (stutter_wait("rcu_torture_boost")) 916 + sched_set_fifo_low(current); 921 917 if (torture_must_stop()) 922 918 goto checkwait; 923 919 } ··· 938 932 jiffies); 939 933 call_rcu_time = jiffies; 940 934 } 941 - stutter_wait("rcu_torture_boost"); 935 + if (stutter_wait("rcu_torture_boost")) 936 + sched_set_fifo_low(current); 942 937 if (torture_must_stop()) 943 938 goto checkwait; 944 939 } ··· 971 964 } 972 965 973 966 /* Go do the stutter. 
*/ 974 - checkwait: stutter_wait("rcu_torture_boost"); 967 + checkwait: if (stutter_wait("rcu_torture_boost")) 968 + sched_set_fifo_low(current); 975 969 } while (!torture_must_stop()); 976 970 977 971 /* Clean up and exit. */ ··· 995 987 { 996 988 unsigned long fqs_resume_time; 997 989 int fqs_burst_remaining; 990 + int oldnice = task_nice(current); 998 991 999 992 VERBOSE_TOROUT_STRING("rcu_torture_fqs task started"); 1000 993 do { ··· 1011 1002 udelay(fqs_holdoff); 1012 1003 fqs_burst_remaining -= fqs_holdoff; 1013 1004 } 1014 - stutter_wait("rcu_torture_fqs"); 1005 + if (stutter_wait("rcu_torture_fqs")) 1006 + sched_set_normal(current, oldnice); 1015 1007 } while (!torture_must_stop()); 1016 1008 torture_kthread_stopping("rcu_torture_fqs"); 1017 1009 return 0; ··· 1032 1022 bool gp_cond1 = gp_cond, gp_exp1 = gp_exp, gp_normal1 = gp_normal; 1033 1023 bool gp_sync1 = gp_sync; 1034 1024 int i; 1025 + int oldnice = task_nice(current); 1035 1026 struct rcu_torture *rp; 1036 1027 struct rcu_torture *old_rp; 1037 1028 static DEFINE_TORTURE_RANDOM(rand); 1029 + bool stutter_waited; 1038 1030 int synctype[] = { RTWS_DEF_FREE, RTWS_EXP_SYNC, 1039 1031 RTWS_COND_GET, RTWS_SYNC }; 1040 1032 int nsynctypes = 0; ··· 1155 1143 !rcu_gp_is_normal(); 1156 1144 } 1157 1145 rcu_torture_writer_state = RTWS_STUTTER; 1158 - if (stutter_wait("rcu_torture_writer") && 1146 + stutter_waited = stutter_wait("rcu_torture_writer"); 1147 + if (stutter_waited && 1159 1148 !READ_ONCE(rcu_fwd_cb_nodelay) && 1160 1149 !cur_ops->slow_gps && 1161 1150 !torture_must_stop() && ··· 1168 1155 rcu_ftrace_dump(DUMP_ALL); 1169 1156 WARN(1, "%s: rtort_pipe_count: %d\n", __func__, rcu_tortures[i].rtort_pipe_count); 1170 1157 } 1158 + if (stutter_waited) 1159 + sched_set_normal(current, oldnice); 1171 1160 } while (!torture_must_stop()); 1172 1161 rcu_torture_current = NULL; // Let stats task know that we are done. 1173 1162 /* Reset expediting back to unexpedited. 
*/ ··· 1609 1594 sched_show_task(wtp); 1610 1595 splatted = true; 1611 1596 } 1612 - show_rcu_gp_kthreads(); 1597 + if (cur_ops->gp_kthread_dbg) 1598 + cur_ops->gp_kthread_dbg(); 1613 1599 rcu_ftrace_dump(DUMP_ALL); 1614 1600 } 1615 1601 rtcv_snap = rcu_torture_current_version; ··· 1929 1913 unsigned long stopat; 1930 1914 static DEFINE_TORTURE_RANDOM(trs); 1931 1915 1932 - if (cur_ops->call && cur_ops->sync && cur_ops->cb_barrier) { 1916 + if (!cur_ops->sync) 1917 + return; // Cannot do need_resched() forward progress testing without ->sync. 1918 + if (cur_ops->call && cur_ops->cb_barrier) { 1933 1919 init_rcu_head_on_stack(&fcs.rh); 1934 1920 selfpropcb = true; 1935 1921 } ··· 2121 2103 /* Carry out grace-period forward-progress testing. */ 2122 2104 static int rcu_torture_fwd_prog(void *args) 2123 2105 { 2106 + int oldnice = task_nice(current); 2124 2107 struct rcu_fwd *rfp = args; 2125 2108 int tested = 0; 2126 2109 int tested_tries = 0; ··· 2140 2121 rcu_torture_fwd_prog_cr(rfp); 2141 2122 2142 2123 /* Avoid slow periods, better to test when busy. */ 2143 - stutter_wait("rcu_torture_fwd_prog"); 2124 + if (stutter_wait("rcu_torture_fwd_prog")) 2125 + sched_set_normal(current, oldnice); 2144 2126 } while (!torture_must_stop()); 2145 2127 /* Short runs might not contain a valid forward-progress attempt. */ 2146 2128 WARN_ON(!tested && tested_tries >= 5); ··· 2157 2137 2158 2138 if (!fwd_progress) 2159 2139 return 0; /* Not requested, so don't do it. 
*/ 2160 - if (!cur_ops->stall_dur || cur_ops->stall_dur() <= 0 || 2161 - cur_ops == &rcu_busted_ops) { 2140 + if ((!cur_ops->sync && !cur_ops->call) || 2141 + !cur_ops->stall_dur || cur_ops->stall_dur() <= 0 || cur_ops == &rcu_busted_ops) { 2162 2142 VERBOSE_TOROUT_STRING("rcu_torture_fwd_prog_init: Disabled, unsupported by RCU flavor under test"); 2163 2143 return 0; 2164 2144 } ··· 2492 2472 return; 2493 2473 } 2494 2474 2495 - show_rcu_gp_kthreads(); 2475 + if (cur_ops->gp_kthread_dbg) 2476 + cur_ops->gp_kthread_dbg(); 2496 2477 rcu_torture_read_exit_cleanup(); 2497 2478 rcu_torture_barrier_cleanup(); 2498 2479 rcu_torture_fwd_prog_cleanup(); ··· 2505 2484 torture_stop_kthread(rcu_torture_reader, 2506 2485 reader_tasks[i]); 2507 2486 kfree(reader_tasks); 2487 + reader_tasks = NULL; 2508 2488 } 2509 2489 2510 2490 if (fakewriter_tasks) { 2511 - for (i = 0; i < nfakewriters; i++) { 2491 + for (i = 0; i < nfakewriters; i++) 2512 2492 torture_stop_kthread(rcu_torture_fakewriter, 2513 2493 fakewriter_tasks[i]); 2514 - } 2515 2494 kfree(fakewriter_tasks); 2516 2495 fakewriter_tasks = NULL; 2517 2496 } ··· 2668 2647 for (i = 0; i < ARRAY_SIZE(torture_ops); i++) 2669 2648 pr_cont(" %s", torture_ops[i]->name); 2670 2649 pr_cont("\n"); 2671 - WARN_ON(!IS_MODULE(CONFIG_RCU_TORTURE_TEST)); 2672 2650 firsterr = -EINVAL; 2673 2651 cur_ops = NULL; 2674 2652 goto unwind; ··· 2835 2815 unwind: 2836 2816 torture_init_end(); 2837 2817 rcu_torture_cleanup(); 2818 + if (shutdown_secs) { 2819 + WARN_ON(!IS_MODULE(CONFIG_RCU_TORTURE_TEST)); 2820 + kernel_power_off(); 2821 + } 2838 2822 return firsterr; 2839 2823 } 2840 2824
+10 -1
kernel/rcu/refscale.c
··· 658 658 for (i = 0; i < ARRAY_SIZE(scale_ops); i++) 659 659 pr_cont(" %s", scale_ops[i]->name); 660 660 pr_cont("\n"); 661 - WARN_ON(!IS_MODULE(CONFIG_RCU_REF_SCALE_TEST)); 662 661 firsterr = -EINVAL; 663 662 cur_ops = NULL; 664 663 goto unwind; ··· 680 681 // Reader tasks (default to ~75% of online CPUs). 681 682 if (nreaders < 0) 682 683 nreaders = (num_online_cpus() >> 1) + (num_online_cpus() >> 2); 684 + if (WARN_ONCE(loops <= 0, "%s: loops = %ld, adjusted to 1\n", __func__, loops)) 685 + loops = 1; 686 + if (WARN_ONCE(nreaders <= 0, "%s: nreaders = %d, adjusted to 1\n", __func__, nreaders)) 687 + nreaders = 1; 688 + if (WARN_ONCE(nruns <= 0, "%s: nruns = %d, adjusted to 1\n", __func__, nruns)) 689 + nruns = 1; 683 690 reader_tasks = kcalloc(nreaders, sizeof(reader_tasks[0]), 684 691 GFP_KERNEL); 685 692 if (!reader_tasks) { ··· 717 712 unwind: 718 713 torture_init_end(); 719 714 ref_scale_cleanup(); 715 + if (shutdown) { 716 + WARN_ON(!IS_MODULE(CONFIG_RCU_REF_SCALE_TEST)); 717 + kernel_power_off(); 718 + } 720 719 return firsterr; 721 720 } 722 721
+4 -2
kernel/rcu/srcutree.c
··· 177 177 INIT_DELAYED_WORK(&ssp->work, process_srcu); 178 178 if (!is_static) 179 179 ssp->sda = alloc_percpu(struct srcu_data); 180 + if (!ssp->sda) 181 + return -ENOMEM; 180 182 init_srcu_struct_nodes(ssp, is_static); 181 183 ssp->srcu_gp_seq_needed_exp = 0; 182 184 ssp->srcu_last_gp_end = ktime_get_mono_fast_ns(); 183 185 smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */ 184 - return ssp->sda ? 0 : -ENOMEM; 186 + return 0; 185 187 } 186 188 187 189 #ifdef CONFIG_DEBUG_LOCK_ALLOC ··· 908 906 { 909 907 struct rcu_synchronize rcu; 910 908 911 - RCU_LOCKDEP_WARN(lock_is_held(&ssp->dep_map) || 909 + RCU_LOCKDEP_WARN(lockdep_is_held(ssp) || 912 910 lock_is_held(&rcu_bh_lock_map) || 913 911 lock_is_held(&rcu_lock_map) || 914 912 lock_is_held(&rcu_sched_lock_map),
+21 -28
kernel/rcu/tasks.h
··· 290 290 ".C"[!!data_race(rtp->cbs_head)], 291 291 s); 292 292 } 293 - #endif /* #ifndef CONFIG_TINY_RCU */ 293 + #endif // #ifndef CONFIG_TINY_RCU 294 294 295 295 static void exit_tasks_rcu_finish_trace(struct task_struct *t); 296 296 ··· 335 335 336 336 // Start off with initial wait and slowly back off to 1 HZ wait. 337 337 fract = rtp->init_fract; 338 - if (fract > HZ) 339 - fract = HZ; 340 338 341 - for (;;) { 339 + while (!list_empty(&holdouts)) { 342 340 bool firstreport; 343 341 bool needreport; 344 342 int rtst; 345 343 346 - if (list_empty(&holdouts)) 347 - break; 348 - 349 344 /* Slowly back off waiting for holdouts */ 350 345 set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS); 351 - schedule_timeout_idle(HZ/fract); 346 + schedule_timeout_idle(fract); 352 347 353 - if (fract > 1) 354 - fract--; 348 + if (fract < HZ) 349 + fract++; 355 350 356 351 rtst = READ_ONCE(rcu_task_stall_timeout); 357 352 needreport = rtst > 0 && time_after(jiffies, lastreport + rtst); ··· 555 560 static int __init rcu_spawn_tasks_kthread(void) 556 561 { 557 562 rcu_tasks.gp_sleep = HZ / 10; 558 - rcu_tasks.init_fract = 10; 563 + rcu_tasks.init_fract = HZ / 10; 559 564 rcu_tasks.pregp_func = rcu_tasks_pregp_step; 560 565 rcu_tasks.pertask_func = rcu_tasks_pertask; 561 566 rcu_tasks.postscan_func = rcu_tasks_postscan; ··· 566 571 } 567 572 core_initcall(rcu_spawn_tasks_kthread); 568 573 569 - #ifndef CONFIG_TINY_RCU 570 - static void show_rcu_tasks_classic_gp_kthread(void) 574 + #if !defined(CONFIG_TINY_RCU) 575 + void show_rcu_tasks_classic_gp_kthread(void) 571 576 { 572 577 show_rcu_tasks_generic_gp_kthread(&rcu_tasks, ""); 573 578 } 574 - #endif /* #ifndef CONFIG_TINY_RCU */ 579 + EXPORT_SYMBOL_GPL(show_rcu_tasks_classic_gp_kthread); 580 + #endif // !defined(CONFIG_TINY_RCU) 575 581 576 582 /* Do the srcu_read_lock() for the above synchronize_srcu(). 
*/ 577 583 void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu) ··· 594 598 } 595 599 596 600 #else /* #ifdef CONFIG_TASKS_RCU */ 597 - static inline void show_rcu_tasks_classic_gp_kthread(void) { } 598 601 void exit_tasks_rcu_start(void) { } 599 602 void exit_tasks_rcu_finish(void) { exit_tasks_rcu_finish_trace(current); } 600 603 #endif /* #else #ifdef CONFIG_TASKS_RCU */ ··· 694 699 } 695 700 core_initcall(rcu_spawn_tasks_rude_kthread); 696 701 697 - #ifndef CONFIG_TINY_RCU 698 - static void show_rcu_tasks_rude_gp_kthread(void) 702 + #if !defined(CONFIG_TINY_RCU) 703 + void show_rcu_tasks_rude_gp_kthread(void) 699 704 { 700 705 show_rcu_tasks_generic_gp_kthread(&rcu_tasks_rude, ""); 701 706 } 702 - #endif /* #ifndef CONFIG_TINY_RCU */ 703 - 704 - #else /* #ifdef CONFIG_TASKS_RUDE_RCU */ 705 - static void show_rcu_tasks_rude_gp_kthread(void) {} 706 - #endif /* #else #ifdef CONFIG_TASKS_RUDE_RCU */ 707 + EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread); 708 + #endif // !defined(CONFIG_TINY_RCU) 709 + #endif /* #ifdef CONFIG_TASKS_RUDE_RCU */ 707 710 708 711 //////////////////////////////////////////////////////////////////////// 709 712 // ··· 1176 1183 { 1177 1184 if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) { 1178 1185 rcu_tasks_trace.gp_sleep = HZ / 10; 1179 - rcu_tasks_trace.init_fract = 10; 1186 + rcu_tasks_trace.init_fract = HZ / 10; 1180 1187 } else { 1181 1188 rcu_tasks_trace.gp_sleep = HZ / 200; 1182 1189 if (rcu_tasks_trace.gp_sleep <= 0) 1183 1190 rcu_tasks_trace.gp_sleep = 1; 1184 - rcu_tasks_trace.init_fract = HZ / 5; 1191 + rcu_tasks_trace.init_fract = HZ / 200; 1185 1192 if (rcu_tasks_trace.init_fract <= 0) 1186 1193 rcu_tasks_trace.init_fract = 1; 1187 1194 } ··· 1195 1202 } 1196 1203 core_initcall(rcu_spawn_tasks_trace_kthread); 1197 1204 1198 - #ifndef CONFIG_TINY_RCU 1199 - static void show_rcu_tasks_trace_gp_kthread(void) 1205 + #if !defined(CONFIG_TINY_RCU) 1206 + void show_rcu_tasks_trace_gp_kthread(void) 1200 1207 { 1201 1208 
char buf[64]; 1202 1209 ··· 1206 1213 data_race(n_heavy_reader_attempts)); 1207 1214 show_rcu_tasks_generic_gp_kthread(&rcu_tasks_trace, buf); 1208 1215 } 1209 - #endif /* #ifndef CONFIG_TINY_RCU */ 1216 + EXPORT_SYMBOL_GPL(show_rcu_tasks_trace_gp_kthread); 1217 + #endif // !defined(CONFIG_TINY_RCU) 1210 1218 1211 1219 #else /* #ifdef CONFIG_TASKS_TRACE_RCU */ 1212 1220 static void exit_tasks_rcu_finish_trace(struct task_struct *t) { } 1213 - static inline void show_rcu_tasks_trace_gp_kthread(void) {} 1214 1221 #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */ 1215 1222 1216 1223 #ifndef CONFIG_TINY_RCU
+132 -68
kernel/rcu/tree.c
··· 177 177 * per-CPU. Object size is equal to one page. This value 178 178 * can be changed at boot time. 179 179 */ 180 - static int rcu_min_cached_objs = 2; 180 + static int rcu_min_cached_objs = 5; 181 181 module_param(rcu_min_cached_objs, int, 0444); 182 182 183 183 /* Retrieve RCU kthreads priority for rcutorture */ ··· 339 339 static bool rcu_dynticks_in_eqs(int snap) 340 340 { 341 341 return !(snap & RCU_DYNTICK_CTRL_CTR); 342 + } 343 + 344 + /* Return true if the specified CPU is currently idle from an RCU viewpoint. */ 345 + bool rcu_is_idle_cpu(int cpu) 346 + { 347 + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); 348 + 349 + return rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp)); 342 350 } 343 351 344 352 /* ··· 554 546 return ret; 555 547 } 556 548 557 - static struct kernel_param_ops first_fqs_jiffies_ops = { 549 + static const struct kernel_param_ops first_fqs_jiffies_ops = { 558 550 .set = param_set_first_fqs_jiffies, 559 551 .get = param_get_ulong, 560 552 }; 561 553 562 - static struct kernel_param_ops next_fqs_jiffies_ops = { 554 + static const struct kernel_param_ops next_fqs_jiffies_ops = { 563 555 .set = param_set_next_fqs_jiffies, 564 556 .get = param_get_ulong, 565 557 }; ··· 936 928 { 937 929 struct rcu_data *rdp = this_cpu_ptr(&rcu_data); 938 930 939 - // Enabling the tick is unsafe in NMI handlers. 940 - if (WARN_ON_ONCE(in_nmi())) 931 + // If we're here from NMI there's nothing to do. 932 + if (in_nmi()) 941 933 return; 942 934 943 935 RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(), ··· 1101 1093 * CPU can safely enter RCU read-side critical sections. In other words, 1102 1094 * if the current CPU is not in its idle loop or is in an interrupt or 1103 1095 * NMI handler, return true. 1096 + * 1097 + * Make notrace because it can be called by the internal functions of 1098 + * ftrace, and making this notrace removes unnecessary recursion calls. 
1104 1099 */ 1105 - bool rcu_is_watching(void) 1100 + notrace bool rcu_is_watching(void) 1106 1101 { 1107 1102 bool ret; 1108 1103 ··· 1160 1149 preempt_disable_notrace(); 1161 1150 rdp = this_cpu_ptr(&rcu_data); 1162 1151 rnp = rdp->mynode; 1163 - if (rdp->grpmask & rcu_rnp_online_cpus(rnp)) 1152 + if (rdp->grpmask & rcu_rnp_online_cpus(rnp) || READ_ONCE(rnp->ofl_seq) & 0x1) 1164 1153 ret = true; 1165 1154 preempt_enable_notrace(); 1166 1155 return ret; ··· 1614 1603 { 1615 1604 bool ret = false; 1616 1605 bool need_qs; 1617 - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && 1618 - rcu_segcblist_is_offloaded(&rdp->cblist); 1606 + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); 1619 1607 1620 1608 raw_lockdep_assert_held_rcu_node(rnp); 1621 1609 ··· 1725 1715 */ 1726 1716 static bool rcu_gp_init(void) 1727 1717 { 1718 + unsigned long firstseq; 1728 1719 unsigned long flags; 1729 1720 unsigned long oldmask; 1730 1721 unsigned long mask; ··· 1769 1758 */ 1770 1759 rcu_state.gp_state = RCU_GP_ONOFF; 1771 1760 rcu_for_each_leaf_node(rnp) { 1761 + smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. 1762 + firstseq = READ_ONCE(rnp->ofl_seq); 1763 + if (firstseq & 0x1) 1764 + while (firstseq == READ_ONCE(rnp->ofl_seq)) 1765 + schedule_timeout_idle(1); // Can't wake unless RCU is watching. 1766 + smp_mb(); // Pair with barriers used when updating ->ofl_seq to even values. 1772 1767 raw_spin_lock(&rcu_state.ofl_lock); 1773 1768 raw_spin_lock_irq_rcu_node(rnp); 1774 1769 if (rnp->qsmaskinit == rnp->qsmaskinitnext && ··· 2065 2048 needgp = true; 2066 2049 } 2067 2050 /* Advance CBs to reduce false positives below. 
*/ 2068 - offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && 2069 - rcu_segcblist_is_offloaded(&rdp->cblist); 2051 + offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); 2070 2052 if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) { 2071 2053 WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); 2072 2054 WRITE_ONCE(rcu_state.gp_req_activity, jiffies); ··· 2264 2248 unsigned long flags; 2265 2249 unsigned long mask; 2266 2250 bool needwake = false; 2267 - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && 2268 - rcu_segcblist_is_offloaded(&rdp->cblist); 2251 + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); 2269 2252 struct rcu_node *rnp; 2270 2253 2271 2254 WARN_ON_ONCE(rdp->cpu != smp_processor_id()); ··· 2414 2399 if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) 2415 2400 return 0; 2416 2401 2402 + WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1); 2417 2403 /* Adjust any no-longer-needed kthreads. */ 2418 2404 rcu_boost_kthread_setaffinity(rnp, -1); 2419 2405 /* Do any needed no-CB deferred wakeups from this CPU. */ ··· 2433 2417 { 2434 2418 int div; 2435 2419 unsigned long flags; 2436 - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && 2437 - rcu_segcblist_is_offloaded(&rdp->cblist); 2420 + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); 2438 2421 struct rcu_head *rhp; 2439 2422 struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); 2440 2423 long bl, count; ··· 2690 2675 unsigned long flags; 2691 2676 struct rcu_data *rdp = raw_cpu_ptr(&rcu_data); 2692 2677 struct rcu_node *rnp = rdp->mynode; 2693 - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && 2694 - rcu_segcblist_is_offloaded(&rdp->cblist); 2678 + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); 2695 2679 2696 2680 if (cpu_is_offline(smp_processor_id())) 2697 2681 return; ··· 2992 2978 rcu_segcblist_n_cbs(&rdp->cblist)); 2993 2979 2994 2980 /* Go handle any RCU core processing required. 
*/ 2995 - if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) && 2996 - unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { 2981 + if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { 2997 2982 __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */ 2998 2983 } else { 2999 2984 __call_rcu_core(rdp, head, flags); ··· 3097 3084 * In order to save some per-cpu space the list is singular. 3098 3085 * Even though it is lockless an access has to be protected by the 3099 3086 * per-cpu lock. 3087 + * @page_cache_work: A work to refill the cache when it is empty 3088 + * @work_in_progress: Indicates that page_cache_work is running 3089 + * @hrtimer: A hrtimer for scheduling a page_cache_work 3100 3090 * @nr_bkv_objs: number of allocated objects at @bkvcache. 3101 3091 * 3102 3092 * This is a per-CPU structure. The reason that it is not included in ··· 3116 3100 bool monitor_todo; 3117 3101 bool initialized; 3118 3102 int count; 3103 + 3104 + struct work_struct page_cache_work; 3105 + atomic_t work_in_progress; 3106 + struct hrtimer hrtimer; 3107 + 3119 3108 struct llist_head bkvcache; 3120 3109 int nr_bkv_objs; 3121 3110 }; ··· 3238 3217 } 3239 3218 rcu_lock_release(&rcu_callback_map); 3240 3219 3241 - krcp = krc_this_cpu_lock(&flags); 3220 + raw_spin_lock_irqsave(&krcp->lock, flags); 3242 3221 if (put_cached_bnode(krcp, bkvhead[i])) 3243 3222 bkvhead[i] = NULL; 3244 - krc_this_cpu_unlock(krcp, flags); 3223 + raw_spin_unlock_irqrestore(&krcp->lock, flags); 3245 3224 3246 3225 if (bkvhead[i]) 3247 3226 free_page((unsigned long) bkvhead[i]); ··· 3368 3347 raw_spin_unlock_irqrestore(&krcp->lock, flags); 3369 3348 } 3370 3349 3350 + static enum hrtimer_restart 3351 + schedule_page_work_fn(struct hrtimer *t) 3352 + { 3353 + struct kfree_rcu_cpu *krcp = 3354 + container_of(t, struct kfree_rcu_cpu, hrtimer); 3355 + 3356 + queue_work(system_highpri_wq, &krcp->page_cache_work); 3357 + return HRTIMER_NORESTART; 3358 + } 3359 + 3360 + static void fill_page_cache_func(struct work_struct 
*work) 3361 + { 3362 + struct kvfree_rcu_bulk_data *bnode; 3363 + struct kfree_rcu_cpu *krcp = 3364 + container_of(work, struct kfree_rcu_cpu, 3365 + page_cache_work); 3366 + unsigned long flags; 3367 + bool pushed; 3368 + int i; 3369 + 3370 + for (i = 0; i < rcu_min_cached_objs; i++) { 3371 + bnode = (struct kvfree_rcu_bulk_data *) 3372 + __get_free_page(GFP_KERNEL | __GFP_NOWARN); 3373 + 3374 + if (bnode) { 3375 + raw_spin_lock_irqsave(&krcp->lock, flags); 3376 + pushed = put_cached_bnode(krcp, bnode); 3377 + raw_spin_unlock_irqrestore(&krcp->lock, flags); 3378 + 3379 + if (!pushed) { 3380 + free_page((unsigned long) bnode); 3381 + break; 3382 + } 3383 + } 3384 + } 3385 + 3386 + atomic_set(&krcp->work_in_progress, 0); 3387 + } 3388 + 3389 + static void 3390 + run_page_cache_worker(struct kfree_rcu_cpu *krcp) 3391 + { 3392 + if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING && 3393 + !atomic_xchg(&krcp->work_in_progress, 1)) { 3394 + hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC, 3395 + HRTIMER_MODE_REL); 3396 + krcp->hrtimer.function = schedule_page_work_fn; 3397 + hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL); 3398 + } 3399 + } 3400 + 3371 3401 static inline bool 3372 3402 kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) 3373 3403 { ··· 3435 3363 if (!krcp->bkvhead[idx] || 3436 3364 krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { 3437 3365 bnode = get_cached_bnode(krcp); 3438 - if (!bnode) { 3439 - /* 3440 - * To keep this path working on raw non-preemptible 3441 - * sections, prevent the optional entry into the 3442 - * allocator as it uses sleeping locks. In fact, even 3443 - * if the caller of kfree_rcu() is preemptible, this 3444 - * path still is not, as krcp->lock is a raw spinlock. 3445 - * With additional page pre-allocation in the works, 3446 - * hitting this return is going to be much less likely. 
3447 - */ 3448 - if (IS_ENABLED(CONFIG_PREEMPT_RT)) 3449 - return false; 3450 - 3451 - /* 3452 - * NOTE: For one argument of kvfree_rcu() we can 3453 - * drop the lock and get the page in sleepable 3454 - * context. That would allow to maintain an array 3455 - * for the CONFIG_PREEMPT_RT as well if no cached 3456 - * pages are available. 3457 - */ 3458 - bnode = (struct kvfree_rcu_bulk_data *) 3459 - __get_free_page(GFP_NOWAIT | __GFP_NOWARN); 3460 - } 3461 - 3462 3366 /* Switch to emergency path. */ 3463 - if (unlikely(!bnode)) 3367 + if (!bnode) 3464 3368 return false; 3465 3369 3466 3370 /* Initialize the new block. */ ··· 3500 3452 goto unlock_return; 3501 3453 } 3502 3454 3503 - /* 3504 - * Under high memory pressure GFP_NOWAIT can fail, 3505 - * in that case the emergency path is maintained. 3506 - */ 3507 3455 success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); 3508 3456 if (!success) { 3457 + run_page_cache_worker(krcp); 3458 + 3509 3459 if (head == NULL) 3510 3460 // Inline if kvfree_rcu(one_arg) call. 3511 3461 goto unlock_return; ··· 3613 3567 * During early boot, any blocking grace-period wait automatically 3614 3568 * implies a grace period. Later on, this is never the case for PREEMPTION. 3615 3569 * 3616 - * Howevr, because a context switch is a grace period for !PREEMPTION, any 3570 + * However, because a context switch is a grace period for !PREEMPTION, any 3617 3571 * blocking grace-period wait automatically implies a grace period if 3618 3572 * there is only one CPU online at any point time during execution of 3619 3573 * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to ··· 3629 3583 return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE; 3630 3584 might_sleep(); /* Check for RCU read-side critical section. 
*/ 3631 3585 preempt_disable(); 3632 - ret = num_online_cpus() <= 1; 3586 + /* 3587 + * If the rcu_state.n_online_cpus counter is equal to one, 3588 + * there is only one CPU, and that CPU sees all prior accesses 3589 + * made by any CPU that was online at the time of its access. 3590 + * Furthermore, if this counter is equal to one, its value cannot 3591 + * change until after the preempt_enable() below. 3592 + * 3593 + * Furthermore, if rcu_state.n_online_cpus is equal to one here, 3594 + * all later CPUs (both this one and any that come online later 3595 + * on) are guaranteed to see all accesses prior to this point 3596 + * in the code, without the need for additional memory barriers. 3597 + * Those memory barriers are provided by CPU-hotplug code. 3598 + */ 3599 + ret = READ_ONCE(rcu_state.n_online_cpus) <= 1; 3633 3600 preempt_enable(); 3634 3601 return ret; 3635 3602 } ··· 3687 3628 lock_is_held(&rcu_sched_lock_map), 3688 3629 "Illegal synchronize_rcu() in RCU read-side critical section"); 3689 3630 if (rcu_blocking_is_gp()) 3690 - return; 3631 + return; // Context allows vacuous grace periods. 3691 3632 if (rcu_gp_is_expedited()) 3692 3633 synchronize_rcu_expedited(); 3693 3634 else ··· 3766 3707 return 1; 3767 3708 3768 3709 /* Does this CPU have callbacks ready to invoke? */ 3769 - if (rcu_segcblist_ready_cbs(&rdp->cblist)) 3710 + if (!rcu_segcblist_is_offloaded(&rdp->cblist) && 3711 + rcu_segcblist_ready_cbs(&rdp->cblist)) 3770 3712 return 1; 3771 3713 3772 3714 /* Has RCU gone idle with this CPU needing another grace period? 
*/ 3773 3715 if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) && 3774 - (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) || 3775 - !rcu_segcblist_is_offloaded(&rdp->cblist)) && 3716 + !rcu_segcblist_is_offloaded(&rdp->cblist) && 3776 3717 !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) 3777 3718 return 1; 3778 3719 ··· 4028 3969 raw_spin_unlock_irqrestore_rcu_node(rnp, flags); 4029 3970 rcu_prepare_kthreads(cpu); 4030 3971 rcu_spawn_cpu_nocb_kthread(cpu); 3972 + WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus + 1); 4031 3973 4032 3974 return 0; 4033 3975 } ··· 4117 4057 4118 4058 rnp = rdp->mynode; 4119 4059 mask = rdp->grpmask; 4060 + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); 4061 + WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); 4062 + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). 4120 4063 raw_spin_lock_irqsave_rcu_node(rnp, flags); 4121 4064 WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask); 4122 4065 newcpu = !(rnp->expmaskinitnext & mask); ··· 4130 4067 rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */ 4131 4068 rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); 4132 4069 rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags); 4133 - if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */ 4070 + 4071 + /* An incoming CPU should never be blocking a grace period. */ 4072 + if (WARN_ON_ONCE(rnp->qsmask & mask)) { /* RCU waiting on incoming CPU? */ 4134 4073 rcu_disable_urgency_upon_qs(rdp); 4135 4074 /* Report QS -after- changing ->qsmaskinitnext! */ 4136 4075 rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); 4137 4076 } else { 4138 4077 raw_spin_unlock_irqrestore_rcu_node(rnp, flags); 4139 4078 } 4079 + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). 4080 + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); 4081 + WARN_ON_ONCE(rnp->ofl_seq & 0x1); 4140 4082 smp_mb(); /* Ensure RCU read-side usage follows above initialization. 
*/ 4141 4083 } 4142 4084 ··· 4168 4100 4169 4101 /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ 4170 4102 mask = rdp->grpmask; 4103 + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); 4104 + WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); 4105 + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). 4171 4106 raw_spin_lock(&rcu_state.ofl_lock); 4172 4107 raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */ 4173 4108 rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq); ··· 4183 4112 WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask); 4184 4113 raw_spin_unlock_irqrestore_rcu_node(rnp, flags); 4185 4114 raw_spin_unlock(&rcu_state.ofl_lock); 4115 + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). 4116 + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); 4117 + WARN_ON_ONCE(rnp->ofl_seq & 0x1); 4186 4118 4187 4119 rdp->cpu_started = false; 4188 4120 } ··· 4523 4449 4524 4450 for_each_possible_cpu(cpu) { 4525 4451 struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); 4526 - struct kvfree_rcu_bulk_data *bnode; 4527 4452 4528 4453 for (i = 0; i < KFREE_N_BATCHES; i++) { 4529 4454 INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); 4530 4455 krcp->krw_arr[i].krcp = krcp; 4531 4456 } 4532 4457 4533 - for (i = 0; i < rcu_min_cached_objs; i++) { 4534 - bnode = (struct kvfree_rcu_bulk_data *) 4535 - __get_free_page(GFP_NOWAIT | __GFP_NOWARN); 4536 - 4537 - if (bnode) 4538 - put_cached_bnode(krcp, bnode); 4539 - else 4540 - pr_err("Failed to preallocate for %d CPU!\n", cpu); 4541 - } 4542 - 4543 4458 INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor); 4459 + INIT_WORK(&krcp->page_cache_work, fill_page_cache_func); 4544 4460 krcp->initialized = true; 4545 4461 } 4546 4462 if (register_shrinker(&kfree_rcu_shrinker))
+2
kernel/rcu/tree.h
··· 56 56 /* Initialized from ->qsmaskinitnext at the */ 57 57 /* beginning of each grace period. */ 58 58 unsigned long qsmaskinitnext; 59 + unsigned long ofl_seq; /* CPU-hotplug operation sequence count. */ 59 60 /* Online CPUs for next grace period. */ 60 61 unsigned long expmask; /* CPUs or groups that need to check in */ 61 62 /* to allow the current expedited GP */ ··· 299 298 /* Hierarchy levels (+1 to */ 300 299 /* shut bogus gcc warning) */ 301 300 int ncpus; /* # CPUs seen so far. */ 301 + int n_online_cpus; /* # CPUs online for RCU. */ 302 302 303 303 /* The following fields are guarded by the root rcu_node's lock. */ 304 304
+1 -1
kernel/rcu/tree_plugin.h
··· 628 628 set_tsk_need_resched(current); 629 629 set_preempt_need_resched(); 630 630 if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && 631 - !rdp->defer_qs_iw_pending && exp) { 631 + !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) { 632 632 // Get scheduler to re-evaluate and call hooks. 633 633 // If !IRQ_WORK, FQS scan will eventually IPI. 634 634 init_irq_work(&rdp->defer_qs_iw,
+6
kernel/rcu/tree_stall.h
··· 13 13 14 14 /* panic() on RCU Stall sysctl. */ 15 15 int sysctl_panic_on_rcu_stall __read_mostly; 16 + int sysctl_max_rcu_stall_to_panic __read_mostly; 16 17 17 18 #ifdef CONFIG_PROVE_RCU 18 19 #define RCU_STALL_DELAY_DELTA (5 * HZ) ··· 107 106 /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ 108 107 static void panic_on_rcu_stall(void) 109 108 { 109 + static int cpu_stall; 110 + 111 + if (++cpu_stall < sysctl_max_rcu_stall_to_panic) 112 + return; 113 + 110 114 if (sysctl_panic_on_rcu_stall) 111 115 panic("RCU Stall\n"); 112 116 }
+39 -10
kernel/scftorture.c
··· 59 59 torture_param(int, onoff_interval, 0, "Time between CPU hotplugs (s), 0=disable"); 60 60 torture_param(int, shutdown_secs, 0, "Shutdown time (ms), <= zero to disable."); 61 61 torture_param(int, stat_interval, 60, "Number of seconds between stats printk()s."); 62 - torture_param(int, stutter_cpus, 5, "Number of jiffies to change CPUs under test, 0=disable"); 62 + torture_param(int, stutter, 5, "Number of jiffies to run/halt test, 0=disable"); 63 63 torture_param(bool, use_cpus_read_lock, 0, "Use cpus_read_lock() to exclude CPU hotplug."); 64 64 torture_param(int, verbose, 0, "Enable verbose debugging printk()s"); 65 + torture_param(int, weight_resched, -1, "Testing weight for resched_cpu() operations."); 65 66 torture_param(int, weight_single, -1, "Testing weight for single-CPU no-wait operations."); 66 67 torture_param(int, weight_single_wait, -1, "Testing weight for single-CPU operations."); 67 68 torture_param(int, weight_many, -1, "Testing weight for multi-CPU no-wait operations."); ··· 83 82 struct scf_statistics { 84 83 struct task_struct *task; 85 84 int cpu; 85 + long long n_resched; 86 86 long long n_single; 87 87 long long n_single_ofl; 88 88 long long n_single_wait; ··· 99 97 static DEFINE_PER_CPU(long long, scf_invoked_count); 100 98 101 99 // Data for random primitive selection 102 - #define SCF_PRIM_SINGLE 0 103 - #define SCF_PRIM_MANY 1 104 - #define SCF_PRIM_ALL 2 105 - #define SCF_NPRIMS (2 * 3) // Need wait and no-wait versions of each. 100 + #define SCF_PRIM_RESCHED 0 101 + #define SCF_PRIM_SINGLE 1 102 + #define SCF_PRIM_MANY 2 103 + #define SCF_PRIM_ALL 3 104 + #define SCF_NPRIMS 7 // Need wait and no-wait versions of each, 105 + // except for SCF_PRIM_RESCHED. 
106 106 107 107 static char *scf_prim_name[] = { 108 + "resched_cpu", 108 109 "smp_call_function_single", 109 110 "smp_call_function_many", 110 111 "smp_call_function", ··· 141 136 142 137 static DEFINE_TORTURE_RANDOM_PERCPU(scf_torture_rand); 143 138 139 + extern void resched_cpu(int cpu); // An alternative IPI vector. 140 + 144 141 // Print torture statistics. Caller must ensure serialization. 145 142 static void scf_torture_stats_print(void) 146 143 { ··· 155 148 for_each_possible_cpu(cpu) 156 149 invoked_count += data_race(per_cpu(scf_invoked_count, cpu)); 157 150 for (i = 0; i < nthreads; i++) { 151 + scfs.n_resched += scf_stats_p[i].n_resched; 158 152 scfs.n_single += scf_stats_p[i].n_single; 159 153 scfs.n_single_ofl += scf_stats_p[i].n_single_ofl; 160 154 scfs.n_single_wait += scf_stats_p[i].n_single_wait; ··· 168 160 if (atomic_read(&n_errs) || atomic_read(&n_mb_in_errs) || 169 161 atomic_read(&n_mb_out_errs) || atomic_read(&n_alloc_errs)) 170 162 bangstr = "!!! "; 171 - pr_alert("%s %sscf_invoked_count %s: %lld single: %lld/%lld single_ofl: %lld/%lld many: %lld/%lld all: %lld/%lld ", 172 - SCFTORT_FLAG, bangstr, isdone ? "VER" : "ver", invoked_count, 163 + pr_alert("%s %sscf_invoked_count %s: %lld resched: %lld single: %lld/%lld single_ofl: %lld/%lld many: %lld/%lld all: %lld/%lld ", 164 + SCFTORT_FLAG, bangstr, isdone ? 
"VER" : "ver", invoked_count, scfs.n_resched, 173 165 scfs.n_single, scfs.n_single_wait, scfs.n_single_ofl, scfs.n_single_wait_ofl, 174 166 scfs.n_many, scfs.n_many_wait, scfs.n_all, scfs.n_all_wait); 175 167 torture_onoff_stats(); ··· 322 314 } 323 315 } 324 316 switch (scfsp->scfs_prim) { 317 + case SCF_PRIM_RESCHED: 318 + if (IS_BUILTIN(CONFIG_SCF_TORTURE_TEST)) { 319 + cpu = torture_random(trsp) % nr_cpu_ids; 320 + scfp->n_resched++; 321 + resched_cpu(cpu); 322 + } 323 + break; 325 324 case SCF_PRIM_SINGLE: 326 325 cpu = torture_random(trsp) % nr_cpu_ids; 327 326 if (scfsp->scfs_wait) ··· 436 421 was_offline = false; 437 422 } 438 423 cond_resched(); 424 + stutter_wait("scftorture_invoker"); 439 425 } while (!torture_must_stop()); 440 426 441 427 VERBOSE_SCFTORTOUT("scftorture_invoker %d ended", scfp->cpu); ··· 449 433 scftorture_print_module_parms(const char *tag) 450 434 { 451 435 pr_alert(SCFTORT_FLAG 452 - "--- %s: verbose=%d holdoff=%d longwait=%d nthreads=%d onoff_holdoff=%d onoff_interval=%d shutdown_secs=%d stat_interval=%d stutter_cpus=%d use_cpus_read_lock=%d, weight_single=%d, weight_single_wait=%d, weight_many=%d, weight_many_wait=%d, weight_all=%d, weight_all_wait=%d\n", tag, 453 - verbose, holdoff, longwait, nthreads, onoff_holdoff, onoff_interval, shutdown, stat_interval, stutter_cpus, use_cpus_read_lock, weight_single, weight_single_wait, weight_many, weight_many_wait, weight_all, weight_all_wait); 436 + "--- %s: verbose=%d holdoff=%d longwait=%d nthreads=%d onoff_holdoff=%d onoff_interval=%d shutdown_secs=%d stat_interval=%d stutter=%d use_cpus_read_lock=%d, weight_resched=%d, weight_single=%d, weight_single_wait=%d, weight_many=%d, weight_many_wait=%d, weight_all=%d, weight_all_wait=%d\n", tag, 437 + verbose, holdoff, longwait, nthreads, onoff_holdoff, onoff_interval, shutdown, stat_interval, stutter, use_cpus_read_lock, weight_resched, weight_single, weight_single_wait, weight_many, weight_many_wait, weight_all, weight_all_wait); 454 438 } 
455 439 456 440 static void scf_cleanup_handler(void *unused) ··· 491 475 { 492 476 long i; 493 477 int firsterr = 0; 478 + unsigned long weight_resched1 = weight_resched; 494 479 unsigned long weight_single1 = weight_single; 495 480 unsigned long weight_single_wait1 = weight_single_wait; 496 481 unsigned long weight_many1 = weight_many; ··· 504 487 505 488 scftorture_print_module_parms("Start of test"); 506 489 507 - if (weight_single == -1 && weight_single_wait == -1 && 490 + if (weight_resched == -1 && weight_single == -1 && weight_single_wait == -1 && 508 491 weight_many == -1 && weight_many_wait == -1 && 509 492 weight_all == -1 && weight_all_wait == -1) { 493 + weight_resched1 = 2 * nr_cpu_ids; 510 494 weight_single1 = 2 * nr_cpu_ids; 511 495 weight_single_wait1 = 2 * nr_cpu_ids; 512 496 weight_many1 = 2; ··· 515 497 weight_all1 = 1; 516 498 weight_all_wait1 = 1; 517 499 } else { 500 + if (weight_resched == -1) 501 + weight_resched1 = 0; 518 502 if (weight_single == -1) 519 503 weight_single1 = 0; 520 504 if (weight_single_wait == -1) ··· 537 517 firsterr = -EINVAL; 538 518 goto unwind; 539 519 } 520 + if (IS_BUILTIN(CONFIG_SCF_TORTURE_TEST)) 521 + scf_sel_add(weight_resched1, SCF_PRIM_RESCHED, false); 522 + else if (weight_resched1) 523 + VERBOSE_SCFTORTOUT_ERRSTRING("built as module, weight_resched ignored"); 540 524 scf_sel_add(weight_single1, SCF_PRIM_SINGLE, false); 541 525 scf_sel_add(weight_single_wait1, SCF_PRIM_SINGLE, true); 542 526 scf_sel_add(weight_many1, SCF_PRIM_MANY, false); ··· 556 532 } 557 533 if (shutdown_secs > 0) { 558 534 firsterr = torture_shutdown_init(shutdown_secs, scf_torture_cleanup); 535 + if (firsterr) 536 + goto unwind; 537 + } 538 + if (stutter > 0) { 539 + firsterr = torture_stutter_init(stutter, stutter); 559 540 if (firsterr) 560 541 goto unwind; 561 542 }
+11
kernel/sysctl.c
··· 2650 2650 .extra2 = SYSCTL_ONE, 2651 2651 }, 2652 2652 #endif 2653 + #if defined(CONFIG_TREE_RCU) 2654 + { 2655 + .procname = "max_rcu_stall_to_panic", 2656 + .data = &sysctl_max_rcu_stall_to_panic, 2657 + .maxlen = sizeof(sysctl_max_rcu_stall_to_panic), 2658 + .mode = 0644, 2659 + .proc_handler = proc_dointvec_minmax, 2660 + .extra1 = SYSCTL_ONE, 2661 + .extra2 = SYSCTL_INT_MAX, 2662 + }, 2663 + #endif 2653 2664 #ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE 2654 2665 { 2655 2666 .procname = "stack_erasing",
+26 -8
kernel/torture.c
··· 602 602 */ 603 603 bool stutter_wait(const char *title) 604 604 { 605 - int spt; 605 + ktime_t delay; 606 + unsigned int i = 0; 606 607 bool ret = false; 608 + int spt; 607 609 608 610 cond_resched_tasks_rcu_qs(); 609 611 spt = READ_ONCE(stutter_pause_test); 610 612 for (; spt; spt = READ_ONCE(stutter_pause_test)) { 611 - ret = true; 613 + if (!ret) { 614 + sched_set_normal(current, MAX_NICE); 615 + ret = true; 616 + } 612 617 if (spt == 1) { 613 618 schedule_timeout_interruptible(1); 614 619 } else if (spt == 2) { 615 - while (READ_ONCE(stutter_pause_test)) 620 + while (READ_ONCE(stutter_pause_test)) { 621 + if (!(i++ & 0xffff)) { 622 + set_current_state(TASK_INTERRUPTIBLE); 623 + delay = 10 * NSEC_PER_USEC; 624 + schedule_hrtimeout(&delay, HRTIMER_MODE_REL); 625 + } 616 626 cond_resched(); 627 + } 617 628 } else { 618 629 schedule_timeout_interruptible(round_jiffies_relative(HZ)); 619 630 } ··· 640 629 */ 641 630 static int torture_stutter(void *arg) 642 631 { 632 + ktime_t delay; 633 + DEFINE_TORTURE_RANDOM(rand); 643 634 int wtime; 644 635 645 636 VERBOSE_TOROUT_STRING("torture_stutter task started"); 646 637 do { 647 638 if (!torture_must_stop() && stutter > 1) { 648 639 wtime = stutter; 649 - if (stutter > HZ + 1) { 640 + if (stutter > 2) { 650 641 WRITE_ONCE(stutter_pause_test, 1); 651 - wtime = stutter - HZ - 1; 652 - schedule_timeout_interruptible(wtime); 653 - wtime = HZ + 1; 642 + wtime = stutter - 3; 643 + delay = ktime_divns(NSEC_PER_SEC * wtime, HZ); 644 + delay += (torture_random(&rand) >> 3) % NSEC_PER_MSEC; 645 + set_current_state(TASK_INTERRUPTIBLE); 646 + schedule_hrtimeout(&delay, HRTIMER_MODE_REL); 647 + wtime = 2; 654 648 } 655 649 WRITE_ONCE(stutter_pause_test, 2); 656 - schedule_timeout_interruptible(wtime); 650 + delay = ktime_divns(NSEC_PER_SEC * wtime, HZ); 651 + set_current_state(TASK_INTERRUPTIBLE); 652 + schedule_hrtimeout(&delay, HRTIMER_MODE_REL); 657 653 } 658 654 WRITE_ONCE(stutter_pause_test, 0); 659 655 if 
(!torture_must_stop())
+2 -2
tools/include/nolibc/nolibc.h
··· 107 107 #endif 108 108 109 109 /* errno codes all ensure that they will not conflict with a valid pointer 110 - * because they all correspond to the highest addressable memry page. 110 + * because they all correspond to the highest addressable memory page. 111 111 */ 112 112 #define MAX_ERRNO 4095 113 113 ··· 231 231 #define DT_SOCK 12 232 232 233 233 /* all the *at functions */ 234 - #ifndef AT_FDWCD 234 + #ifndef AT_FDCWD 235 235 #define AT_FDCWD -100 236 236 #endif 237 237
+76
tools/memory-model/Documentation/README
··· 1 + It has been said that successful communication requires first identifying 2 + what your audience knows and then building a bridge from their current 3 + knowledge to what they need to know. Unfortunately, the expected 4 + Linux-kernel memory model (LKMM) audience might be anywhere from novice 5 + to expert both in kernel hacking and in understanding LKMM. 6 + 7 + This document therefore points out a number of places to start reading, 8 + depending on what you know and what you would like to learn. Please note 9 + that the documents later in this list assume that the reader understands 10 + the material provided by documents earlier in this list. 11 + 12 + o You are new to Linux-kernel concurrency: simple.txt 13 + 14 + o You have some background in Linux-kernel concurrency, and would 15 + like an overview of the types of low-level concurrency primitives 16 + that the Linux kernel provides: ordering.txt 17 + 18 + Here, "low level" means atomic operations to single variables. 19 + 20 + o You are familiar with the Linux-kernel concurrency primitives 21 + that you need, and just want to get started with LKMM litmus 22 + tests: litmus-tests.txt 23 + 24 + o You are familiar with Linux-kernel concurrency, and would 25 + like a detailed intuitive understanding of LKMM, including 26 + situations involving more than two threads: recipes.txt 27 + 28 + o You would like a detailed understanding of what your compiler can 29 + and cannot do to control dependencies: control-dependencies.txt 30 + 31 + o You are familiar with Linux-kernel concurrency and the use of 32 + LKMM, and would like a quick reference: cheatsheet.txt 33 + 34 + o You are familiar with Linux-kernel concurrency and the use 35 + of LKMM, and would like to learn about LKMM's requirements, 36 + rationale, and implementation: explanation.txt 37 + 38 + o You are interested in the publications related to LKMM, including 39 + hardware manuals, academic literature, standards-committee 40 + working papers, and LWN 
articles: references.txt 41 + 42 + 43 + ==================== 44 + DESCRIPTION OF FILES 45 + ==================== 46 + 47 + README 48 + This file. 49 + 50 + cheatsheet.txt 51 + Quick-reference guide to the Linux-kernel memory model. 52 + 53 + control-dependencies.txt 54 + Guide to preventing compiler optimizations from destroying 55 + your control dependencies. 56 + 57 + explanation.txt 58 + Detailed description of the memory model. 59 + 60 + litmus-tests.txt 61 + The format, features, capabilities, and limitations of the litmus 62 + tests that LKMM can evaluate. 63 + 64 + ordering.txt 65 + Overview of the Linux kernel's low-level memory-ordering 66 + primitives by category. 67 + 68 + recipes.txt 69 + Common memory-ordering patterns. 70 + 71 + references.txt 72 + Background information. 73 + 74 + simple.txt 75 + Starting point for someone new to Linux-kernel concurrency. 76 + And also a reminder of the simpler approaches to concurrency!
+258
tools/memory-model/Documentation/control-dependencies.txt
··· 1 + CONTROL DEPENDENCIES 2 + ==================== 3 + 4 + A major difficulty with control dependencies is that current compilers 5 + do not support them. One purpose of this document is therefore to 6 + help you prevent your compiler from breaking your code. However, 7 + control dependencies also pose other challenges, which leads to the 8 + second purpose of this document, namely to help you to avoid breaking 9 + your own code, even in the absence of help from your compiler. 10 + 11 + One such challenge is that control dependencies order only later stores. 12 + Therefore, a load-load control dependency will not preserve ordering 13 + unless a read memory barrier is provided. Consider the following code: 14 + 15 + q = READ_ONCE(a); 16 + if (q) 17 + p = READ_ONCE(b); 18 + 19 + This is not guaranteed to provide any ordering because some types of CPUs 20 + are permitted to predict the result of the load from "b". This prediction 21 + can cause other CPUs to see this load as having happened before the load 22 + from "a". This means that an explicit read barrier is required, for example 23 + as follows: 24 + 25 + q = READ_ONCE(a); 26 + if (q) { 27 + smp_rmb(); 28 + p = READ_ONCE(b); 29 + } 30 + 31 + However, stores are not speculated. This means that ordering is 32 + (usually) guaranteed for load-store control dependencies, as in the 33 + following example: 34 + 35 + q = READ_ONCE(a); 36 + if (q) 37 + WRITE_ONCE(b, 1); 38 + 39 + Control dependencies can pair with each other and with other types 40 + of ordering. But please note that neither the READ_ONCE() nor the 41 + WRITE_ONCE() are optional. Without the READ_ONCE(), the compiler might 42 + fuse the load from "a" with other loads. Without the WRITE_ONCE(), 43 + the compiler might fuse the store to "b" with other stores. Worse yet, 44 + the compiler might convert the store into a load and a check followed 45 + by a store, and this compiler-generated load would not be ordered by 46 + the control dependency. 
47 +
48 + Furthermore, if the compiler is able to prove that the value of variable
49 + "a" is always non-zero, it would be well within its rights to optimize
50 + the original example by eliminating the "if" statement as follows:
51 +
52 +         q = a;
53 +         b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
54 +
55 + So don't leave out either the READ_ONCE() or the WRITE_ONCE().
56 + In particular, although READ_ONCE() does force the compiler to emit a
57 + load, it does *not* force the compiler to actually use the loaded value.
58 +
59 + It is tempting to try to use control dependencies to enforce ordering on
60 + identical stores on both branches of the "if" statement as follows:
61 +
62 +         q = READ_ONCE(a);
63 +         if (q) {
64 +                 barrier();
65 +                 WRITE_ONCE(b, 1);
66 +                 do_something();
67 +         } else {
68 +                 barrier();
69 +                 WRITE_ONCE(b, 1);
70 +                 do_something_else();
71 +         }
72 +
73 + Unfortunately, current compilers will transform this as follows at high
74 + optimization levels:
75 +
76 +         q = READ_ONCE(a);
77 +         barrier();
78 +         WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
79 +         if (q) {
80 +                 /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
81 +                 do_something();
82 +         } else {
83 +                 /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
84 +                 do_something_else();
85 +         }
86 +
87 + Now there is no conditional between the load from "a" and the store to
88 + "b", which means that the CPU is within its rights to reorder them: The
89 + conditional is absolutely required, and must be present in the final
90 + assembly code, after all of the compiler and link-time optimizations
91 + have been applied.
Therefore, if you need ordering in this example, 92 + you must use explicit memory ordering, for example, smp_store_release(): 93 + 94 + q = READ_ONCE(a); 95 + if (q) { 96 + smp_store_release(&b, 1); 97 + do_something(); 98 + } else { 99 + smp_store_release(&b, 1); 100 + do_something_else(); 101 + } 102 + 103 + Without explicit memory ordering, control-dependency-based ordering is 104 + guaranteed only when the stores differ, for example: 105 + 106 + q = READ_ONCE(a); 107 + if (q) { 108 + WRITE_ONCE(b, 1); 109 + do_something(); 110 + } else { 111 + WRITE_ONCE(b, 2); 112 + do_something_else(); 113 + } 114 + 115 + The initial READ_ONCE() is still required to prevent the compiler from 116 + knowing too much about the value of "a". 117 + 118 + But please note that you need to be careful what you do with the local 119 + variable "q", otherwise the compiler might be able to guess the value 120 + and again remove the conditional branch that is absolutely required to 121 + preserve ordering. For example: 122 + 123 + q = READ_ONCE(a); 124 + if (q % MAX) { 125 + WRITE_ONCE(b, 1); 126 + do_something(); 127 + } else { 128 + WRITE_ONCE(b, 2); 129 + do_something_else(); 130 + } 131 + 132 + If MAX is compile-time defined to be 1, then the compiler knows that 133 + (q % MAX) must be equal to zero, regardless of the value of "q". 134 + The compiler is therefore within its rights to transform the above code 135 + into the following: 136 + 137 + q = READ_ONCE(a); 138 + WRITE_ONCE(b, 2); 139 + do_something_else(); 140 + 141 + Given this transformation, the CPU is not required to respect the ordering 142 + between the load from variable "a" and the store to variable "b". It is 143 + tempting to add a barrier(), but this does not help. The conditional 144 + is gone, and the barrier won't bring it back. 
Therefore, if you need
145 + to rely on control dependencies to produce this ordering, you should
146 + make sure that MAX is greater than one, perhaps as follows:
147 +
148 +         q = READ_ONCE(a);
149 +         BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
150 +         if (q % MAX) {
151 +                 WRITE_ONCE(b, 1);
152 +                 do_something();
153 +         } else {
154 +                 WRITE_ONCE(b, 2);
155 +                 do_something_else();
156 +         }
157 +
158 + Please note once again that each leg of the "if" statement absolutely
159 + must store different values to "b". As in previous examples, if the two
160 + values were identical, the compiler could pull this store outside of the
161 + "if" statement, destroying the control dependency's ordering properties.
162 +
163 + You must also be careful to avoid relying too much on boolean
164 + short-circuit evaluation. Consider this example:
165 +
166 +         q = READ_ONCE(a);
167 +         if (q || 1 > 0)
168 +                 WRITE_ONCE(b, 1);
169 +
170 + Because the first condition cannot fault and the second condition is
171 + always true, the compiler can transform this example as follows, again
172 + destroying the control dependency's ordering:
173 +
174 +         q = READ_ONCE(a);
175 +         WRITE_ONCE(b, 1);
176 +
177 + This is yet another example showing the importance of preventing the
178 + compiler from out-guessing your code. Again, although READ_ONCE() really
179 + does force the compiler to emit code for a given load, the compiler is
180 + within its rights to discard the loaded value.
181 +
182 + In addition, control dependencies apply only to the then-clause and
183 + else-clause of the "if" statement in question. In particular, they do
184 + not necessarily order the code following the entire "if" statement:
185 +
186 +         q = READ_ONCE(a);
187 +         if (q) {
188 +                 WRITE_ONCE(b, 1);
189 +         } else {
190 +                 WRITE_ONCE(b, 2);
191 +         }
192 +         WRITE_ONCE(c, 1); /* BUG: No ordering against the read from "a".
*/ 193 + 194 + It is tempting to argue that there in fact is ordering because the 195 + compiler cannot reorder volatile accesses and also cannot reorder 196 + the writes to "b" with the condition. Unfortunately for this line 197 + of reasoning, the compiler might compile the two writes to "b" as 198 + conditional-move instructions, as in this fanciful pseudo-assembly 199 + language: 200 + 201 + ld r1,a 202 + cmp r1,$0 203 + cmov,ne r4,$1 204 + cmov,eq r4,$2 205 + st r4,b 206 + st $1,c 207 + 208 + The control dependencies would then extend only to the pair of cmov 209 + instructions and the store depending on them. This means that a weakly 210 + ordered CPU would have no dependency of any sort between the load from 211 + "a" and the store to "c". In short, control dependencies provide ordering 212 + only to the stores in the then-clause and else-clause of the "if" statement 213 + in question (including functions invoked by those two clauses), and not 214 + to code following that "if" statement. 215 + 216 + 217 + In summary: 218 + 219 + (*) Control dependencies can order prior loads against later stores. 220 + However, they do *not* guarantee any other sort of ordering: 221 + Not prior loads against later loads, nor prior stores against 222 + later anything. If you need these other forms of ordering, use 223 + smp_load_acquire(), smp_store_release(), or, in the case of prior 224 + stores and later loads, smp_mb(). 225 + 226 + (*) If both legs of the "if" statement contain identical stores to 227 + the same variable, then you must explicitly order those stores, 228 + either by preceding both of them with smp_mb() or by using 229 + smp_store_release(). Please note that it is *not* sufficient to use 230 + barrier() at beginning and end of each leg of the "if" statement 231 + because, as shown by the example above, optimizing compilers can 232 + destroy the control dependency while respecting the letter of the 233 + barrier() law. 
234 + 235 + (*) Control dependencies require at least one run-time conditional 236 + between the prior load and the subsequent store, and this 237 + conditional must involve the prior load. If the compiler is able 238 + to optimize the conditional away, it will have also optimized 239 + away the ordering. Careful use of READ_ONCE() and WRITE_ONCE() 240 + can help to preserve the needed conditional. 241 + 242 + (*) Control dependencies require that the compiler avoid reordering the 243 + dependency into nonexistence. Careful use of READ_ONCE() or 244 + atomic{,64}_read() can help to preserve your control dependency. 245 + 246 + (*) Control dependencies apply only to the then-clause and else-clause 247 + of the "if" statement containing the control dependency, including 248 + any functions that these two clauses call. Control dependencies 249 + do *not* apply to code beyond the end of that "if" statement. 250 + 251 + (*) Control dependencies pair normally with other types of barriers. 252 + 253 + (*) Control dependencies do *not* provide multicopy atomicity. If you 254 + need all the CPUs to agree on the ordering of a given store against 255 + all other accesses, use smp_mb(). 256 + 257 + (*) Compilers do not understand control dependencies. It is therefore 258 + your job to ensure that they do not break your code.
+172
tools/memory-model/Documentation/glossary.txt
···
1 + This document contains brief definitions of LKMM-related terms. Like most
2 + glossaries, it is not intended to be read front to back (except perhaps
3 + as a way of confirming a diagnosis of OCD), but rather to be searched
4 + for specific terms.
5 +
6 +
7 + Address Dependency: When the address of a later memory access is computed
8 +         based on the value returned by an earlier load, an "address
9 +         dependency" extends from that load to that later access.
10 +         Address dependencies are quite common in RCU read-side critical
11 +         sections:
12 +
13 +         1 rcu_read_lock();
14 +         2 p = rcu_dereference(gp);
15 +         3 do_something(p->a);
16 +         4 rcu_read_unlock();
17 +
18 +         In this case, because the address of "p->a" on line 3 is computed
19 +         from the value returned by the rcu_dereference() on line 2, the
20 +         address dependency extends from that rcu_dereference() to that
21 +         "p->a". In rare cases, optimizing compilers can destroy address
22 +         dependencies. Please see Documentation/RCU/rcu_dereference.txt
23 +         for more information.
24 +
25 +         See also "Control Dependency" and "Data Dependency".
26 +
27 + Acquire: With respect to a lock, acquiring that lock, for example,
28 +         using spin_lock(). With respect to a non-lock shared variable,
29 +         a special operation that includes a load and which orders that
30 +         load before later memory references running on that same CPU.
31 +         An example special acquire operation is smp_load_acquire(),
32 +         but atomic_read_acquire() and atomic_xchg_acquire() also include
33 +         acquire loads.
34 +
35 +         When an acquire load returns the value stored by a release store
36 +         to that same variable, then all operations preceding that store
37 +         happen before any operations following that load acquire.
38 +
39 +         See also "Relaxed" and "Release".
40 + 41 + Coherence (co): When one CPU's store to a given variable overwrites 42 + either the value from another CPU's store or some later value, 43 + there is said to be a coherence link from the second CPU to 44 + the first. 45 + 46 + It is also possible to have a coherence link within a CPU, which 47 + is a "coherence internal" (coi) link. The term "coherence 48 + external" (coe) link is used when it is necessary to exclude 49 + the coi case. 50 + 51 + See also "From-reads" and "Reads-from". 52 + 53 + Control Dependency: When a later store's execution depends on a test 54 + of a value computed from a value returned by an earlier load, 55 + a "control dependency" extends from that load to that store. 56 + For example: 57 + 58 + 1 if (READ_ONCE(x)) 59 + 2 WRITE_ONCE(y, 1); 60 + 61 + Here, the control dependency extends from the READ_ONCE() on 62 + line 1 to the WRITE_ONCE() on line 2. Control dependencies are 63 + fragile, and can be easily destroyed by optimizing compilers. 64 + Please see control-dependencies.txt for more information. 65 + 66 + See also "Address Dependency" and "Data Dependency". 67 + 68 + Cycle: Memory-barrier pairing is restricted to a pair of CPUs, as the 69 + name suggests. And in a great many cases, a pair of CPUs is all 70 + that is required. In other cases, the notion of pairing must be 71 + extended to additional CPUs, and the result is called a "cycle". 72 + In a cycle, each CPU's ordering interacts with that of the next: 73 + 74 + CPU 0 CPU 1 CPU 2 75 + WRITE_ONCE(x, 1); WRITE_ONCE(y, 1); WRITE_ONCE(z, 1); 76 + smp_mb(); smp_mb(); smp_mb(); 77 + r0 = READ_ONCE(y); r1 = READ_ONCE(z); r2 = READ_ONCE(x); 78 + 79 + CPU 0's smp_mb() interacts with that of CPU 1, which interacts 80 + with that of CPU 2, which in turn interacts with that of CPU 0 81 + to complete the cycle. Because of the smp_mb() calls between 82 + each pair of memory accesses, the outcome where r0, r1, and r2 83 + are all equal to zero is forbidden by LKMM. 
84 +
85 +         See also "Pairing".
86 +
87 + Data Dependency: When the data written by a later store is computed based
88 +         on the value returned by an earlier load, a "data dependency"
89 +         extends from that load to that later store. For example:
90 +
91 +         1 r1 = READ_ONCE(x);
92 +         2 WRITE_ONCE(y, r1 + 1);
93 +
94 +         In this case, the data dependency extends from the READ_ONCE()
95 +         on line 1 to the WRITE_ONCE() on line 2. Data dependencies are
96 +         fragile and can be easily destroyed by optimizing compilers.
97 +         Because optimizing compilers put a great deal of effort into
98 +         working out what values integer variables might have, this is
99 +         especially true in cases where the dependency is carried through
100 +         an integer.
101 +
102 +         See also "Address Dependency" and "Control Dependency".
103 +
104 + From-Reads (fr): When one CPU's store to a given variable happened
105 +         too late to affect the value returned by another CPU's
106 +         load from that same variable, there is said to be a from-reads
107 +         link from the load to the store.
108 +
109 +         It is also possible to have a from-reads link within a CPU, which
110 +         is a "from-reads internal" (fri) link. The term "from-reads
111 +         external" (fre) link is used when it is necessary to exclude
112 +         the fri case.
113 +
114 +         See also "Coherence" and "Reads-from".
115 +
116 + Fully Ordered: An operation such as smp_mb() that orders all of
117 +         its CPU's prior accesses with all of that CPU's subsequent
118 +         accesses, or a marked access such as atomic_add_return()
119 +         that orders all of its CPU's prior accesses, itself, and
120 +         all of its CPU's subsequent accesses.
121 +
122 + Marked Access: An access to a variable that uses a special function or
123 +         macro such as "r1 = READ_ONCE(x)" or "smp_store_release(&a, 1)".
124 +
125 +         See also "Unmarked Access".
126 +
127 + Pairing: "Memory-barrier pairing" reflects the fact that synchronizing
128 +         data between two CPUs requires that both CPUs order their accesses.
129 +         Memory barriers thus tend to come in pairs, one executed by
130 +         one of the CPUs and the other by the other CPU. Of course,
131 +         pairing also occurs with other types of operations, so that a
132 +         smp_store_release() pairs with an smp_load_acquire() that reads
133 +         the value stored.
134 +
135 +         See also "Cycle".
136 +
137 + Reads-From (rf): When one CPU's load returns the value stored by some other
138 +         CPU, there is said to be a reads-from link from the second
139 +         CPU's store to the first CPU's load. Reads-from links have the
140 +         nice property that time must advance from the store to the load,
141 +         which means that algorithms using reads-from links can use lighter
142 +         weight ordering and synchronization compared to algorithms using
143 +         coherence and from-reads links.
144 +
145 +         It is also possible to have a reads-from link within a CPU, which
146 +         is a "reads-from internal" (rfi) link. The term "reads-from
147 +         external" (rfe) link is used when it is necessary to exclude
148 +         the rfi case.
149 +
150 +         See also "Coherence" and "From-reads".
151 +
152 + Relaxed: A marked access that does not imply ordering, for example, a
153 +         READ_ONCE(), WRITE_ONCE(), a non-value-returning read-modify-write
154 +         operation, or a value-returning read-modify-write operation whose
155 +         name ends in "_relaxed".
156 +
157 +         See also "Acquire" and "Release".
158 +
159 + Release: With respect to a lock, releasing that lock, for example,
160 +         using spin_unlock(). With respect to a non-lock shared variable,
161 +         a special operation that includes a store and which orders that
162 +         store after earlier memory references that ran on that same CPU.
163 +         An example special release store is smp_store_release(), but
164 +         atomic_set_release() and atomic_cmpxchg_release() also include
165 +         release stores.
166 +
167 +         See also "Acquire" and "Relaxed".
168 + 169 + Unmarked Access: An access to a variable that uses normal C-language 170 + syntax, for example, "a = b[2]"; 171 + 172 + See also "Marked Access".
+17
tools/memory-model/Documentation/litmus-tests.txt
··· 946 946 carrying a dependency, then the compiler can break that dependency 947 947 by substituting a constant of that value. 948 948 949 + Conversely, LKMM sometimes doesn't recognize that a particular 950 + optimization is not allowed, and as a result, thinks that a 951 + dependency is not present (because the optimization would break it). 952 + The memory model misses some pretty obvious control dependencies 953 + because of this limitation. A simple example is: 954 + 955 + r1 = READ_ONCE(x); 956 + if (r1 == 0) 957 + smp_mb(); 958 + WRITE_ONCE(y, 1); 959 + 960 + There is a control dependency from the READ_ONCE to the WRITE_ONCE, 961 + even when r1 is nonzero, but LKMM doesn't realize this and thinks 962 + that the write may execute before the read if r1 != 0. (Yes, that 963 + doesn't make sense if you think about it, but the memory model's 964 + intelligence is limited.) 965 + 949 966 2. Multiple access sizes for a single variable are not supported, 950 967 and neither are misaligned or partially overlapping accesses. 951 968
+556
tools/memory-model/Documentation/ordering.txt
··· 1 + This document gives an overview of the categories of memory-ordering 2 + operations provided by the Linux-kernel memory model (LKMM). 3 + 4 + 5 + Categories of Ordering 6 + ====================== 7 + 8 + This section lists LKMM's three top-level categories of memory-ordering 9 + operations in decreasing order of strength: 10 + 11 + 1. Barriers (also known as "fences"). A barrier orders some or 12 + all of the CPU's prior operations against some or all of its 13 + subsequent operations. 14 + 15 + 2. Ordered memory accesses. These operations order themselves 16 + against some or all of the CPU's prior accesses or some or all 17 + of the CPU's subsequent accesses, depending on the subcategory 18 + of the operation. 19 + 20 + 3. Unordered accesses, as the name indicates, have no ordering 21 + properties except to the extent that they interact with an 22 + operation in the previous categories. This being the real world, 23 + some of these "unordered" operations provide limited ordering 24 + in some special situations. 25 + 26 + Each of the above categories is described in more detail by one of the 27 + following sections. 28 + 29 + 30 + Barriers 31 + ======== 32 + 33 + Each of the following categories of barriers is described in its own 34 + subsection below: 35 + 36 + a. Full memory barriers. 37 + 38 + b. Read-modify-write (RMW) ordering augmentation barriers. 39 + 40 + c. Write memory barrier. 41 + 42 + d. Read memory barrier. 43 + 44 + e. Compiler barrier. 45 + 46 + Note well that many of these primitives generate absolutely no code 47 + in kernels built with CONFIG_SMP=n. Therefore, if you are writing 48 + a device driver, which must correctly order accesses to a physical 49 + device even in kernels built with CONFIG_SMP=n, please use the 50 + ordering primitives provided for that purpose. For example, instead of 51 + smp_mb(), use mb(). See the "Linux Kernel Device Drivers" book or the 52 + https://lwn.net/Articles/698014/ article for more information. 
53 + 54 + 55 + Full Memory Barriers 56 + -------------------- 57 + 58 + The Linux-kernel primitives that provide full ordering include: 59 + 60 + o The smp_mb() full memory barrier. 61 + 62 + o Value-returning RMW atomic operations whose names do not end in 63 + _acquire, _release, or _relaxed. 64 + 65 + o RCU's grace-period primitives. 66 + 67 + First, the smp_mb() full memory barrier orders all of the CPU's prior 68 + accesses against all subsequent accesses from the viewpoint of all CPUs. 69 + In other words, all CPUs will agree that any earlier action taken 70 + by that CPU happened before any later action taken by that same CPU. 71 + For example, consider the following: 72 + 73 + WRITE_ONCE(x, 1); 74 + smp_mb(); // Order store to x before load from y. 75 + r1 = READ_ONCE(y); 76 + 77 + All CPUs will agree that the store to "x" happened before the load 78 + from "y", as indicated by the comment. And yes, please comment your 79 + memory-ordering primitives. It is surprisingly hard to remember their 80 + purpose after even a few months. 81 + 82 + Second, some RMW atomic operations provide full ordering. These 83 + operations include value-returning RMW atomic operations (that is, those 84 + with non-void return types) whose names do not end in _acquire, _release, 85 + or _relaxed. Examples include atomic_add_return(), atomic_dec_and_test(), 86 + cmpxchg(), and xchg(). Note that conditional RMW atomic operations such 87 + as cmpxchg() are only guaranteed to provide ordering when they succeed. 88 + When RMW atomic operations provide full ordering, they partition the 89 + CPU's accesses into three groups: 90 + 91 + 1. All code that executed prior to the RMW atomic operation. 92 + 93 + 2. The RMW atomic operation itself. 94 + 95 + 3. All code that executed after the RMW atomic operation. 96 + 97 + All CPUs will agree that any operation in a given partition happened 98 + before any operation in a higher-numbered partition. 
99 + 100 + In contrast, non-value-returning RMW atomic operations (that is, those 101 + with void return types) do not guarantee any ordering whatsoever. Nor do 102 + value-returning RMW atomic operations whose names end in _relaxed. 103 + Examples of the former include atomic_inc() and atomic_dec(), 104 + while examples of the latter include atomic_cmpxchg_relaxed() and 105 + atomic_xchg_relaxed(). Similarly, value-returning non-RMW atomic 106 + operations such as atomic_read() do not guarantee full ordering, and 107 + are covered in the later section on unordered operations. 108 + 109 + Value-returning RMW atomic operations whose names end in _acquire or 110 + _release provide limited ordering, and will be described later in this 111 + document. 112 + 113 + Finally, RCU's grace-period primitives provide full ordering. These 114 + primitives include synchronize_rcu(), synchronize_rcu_expedited(), 115 + synchronize_srcu() and so on. However, these primitives have orders 116 + of magnitude greater overhead than smp_mb(), atomic_xchg(), and so on. 117 + Furthermore, RCU's grace-period primitives can only be invoked in 118 + sleepable contexts. Therefore, RCU's grace-period primitives are 119 + typically instead used to provide ordering against RCU read-side critical 120 + sections, as documented in their comment headers. But of course if you 121 + need a synchronize_rcu() to interact with readers, it costs you nothing 122 + to also rely on its additional full-memory-barrier semantics. Just please 123 + carefully comment this, otherwise your future self will hate you. 124 + 125 + 126 + RMW Ordering Augmentation Barriers 127 + ---------------------------------- 128 + 129 + As noted in the previous section, non-value-returning RMW operations 130 + such as atomic_inc() and atomic_dec() guarantee no ordering whatsoever. 131 + Nevertheless, a number of popular CPU families, including x86, provide 132 + full ordering for these primitives. 
One way to obtain full ordering on
133 + all architectures is to add a call to smp_mb():
134 +
135 +         WRITE_ONCE(x, 1);
136 +         atomic_inc(&my_counter);
137 +         smp_mb(); // Inefficient on x86!!!
138 +         r1 = READ_ONCE(y);
139 +
140 + This works, but the added smp_mb() adds needless overhead for
141 + x86, on which atomic_inc() provides full ordering all by itself.
142 + The smp_mb__after_atomic() primitive can be used instead:
143 +
144 +         WRITE_ONCE(x, 1);
145 +         atomic_inc(&my_counter);
146 +         smp_mb__after_atomic(); // Order store to x before load from y.
147 +         r1 = READ_ONCE(y);
148 +
149 + The smp_mb__after_atomic() primitive emits code only on CPUs whose
150 + atomic_inc() implementations do not guarantee full ordering, thus
151 + incurring no unnecessary overhead on x86. There are a number of
152 + variations on the smp_mb__*() theme:
153 +
154 + o       smp_mb__before_atomic(), which provides full ordering prior
155 +         to an unordered RMW atomic operation.
156 +
157 + o       smp_mb__after_atomic(), which, as shown above, provides full
158 +         ordering subsequent to an unordered RMW atomic operation.
159 +
160 + o       smp_mb__after_spinlock(), which provides full ordering subsequent
161 +         to a successful spinlock acquisition. Note that spin_lock() is
162 +         always successful but spin_trylock() might not be.
163 +
164 + o       smp_mb__after_srcu_read_unlock(), which provides full ordering
165 +         subsequent to an srcu_read_unlock().
166 +
167 + It is bad practice to place code between the smp_mb__*() primitive and the
168 + operation whose ordering it is augmenting. The reason is that the
169 + ordering of this intervening code will differ from one CPU architecture
170 + to another.
171 +
172 +
173 + Write Memory Barrier
174 + --------------------
175 +
176 + The Linux kernel's write memory barrier is smp_wmb().
If a CPU executes
177 + the following code:
178 +
179 + WRITE_ONCE(x, 1);
180 + smp_wmb();
181 + WRITE_ONCE(y, 1);
182 +
183 + Then any given CPU will see the write to "x" as having happened before
184 + the write to "y". However, you are usually better off using a release
185 + store, as described in the "Release Operations" section below.
186 +
187 + Note that smp_wmb() might fail to provide ordering for unmarked C-language
188 + stores because profile-driven optimization could determine that the
189 + value being overwritten is almost always equal to the new value. Such a
190 + compiler might then reasonably decide to transform "x = 1" and "y = 1"
191 + as follows:
192 +
193 + if (x != 1)
194 + x = 1;
195 + smp_wmb(); // BUG: does not order the reads!!!
196 + if (y != 1)
197 + y = 1;
198 +
199 + Therefore, if you need to use smp_wmb() with unmarked C-language writes,
200 + you will need to make sure that none of the compilers used to build
201 + the Linux kernel carry out this sort of transformation, both now and in
202 + the future.
203 +
204 +
205 + Read Memory Barrier
206 + -------------------
207 +
208 + The Linux kernel's read memory barrier is smp_rmb(). If a CPU executes
209 + the following code:
210 +
211 + r0 = READ_ONCE(y);
212 + smp_rmb();
213 + r1 = READ_ONCE(x);
214 +
215 + Then any given CPU will see the read from "y" as having preceded the read from
216 + "x". However, you are usually better off using an acquire load, as described
217 + in the "Acquire Operations" section below.
218 +
219 + Compiler Barrier
220 + ----------------
221 +
222 + The Linux kernel's compiler barrier is barrier(). This primitive
223 + prohibits compiler code-motion optimizations that might move memory
224 + references across the point in the code containing the barrier(), but
225 + does not constrain hardware memory ordering.
For example, this can be
226 + used to prevent the compiler from moving code across an infinite loop:
227 +
228 + WRITE_ONCE(x, 1);
229 + while (dontstop)
230 + barrier();
231 + r1 = READ_ONCE(y);
232 +
233 + Without the barrier(), the compiler would be within its rights to move the
234 + WRITE_ONCE() to follow the loop. This code motion could be problematic
235 + in the case where an interrupt handler terminates the loop. Another way
236 + to handle this is to use READ_ONCE() for the load of "dontstop".
237 +
238 + Note that the barriers discussed previously use barrier() or its low-level
239 + equivalent in their implementations.
240 +
241 +
242 + Ordered Memory Accesses
243 + =======================
244 +
245 + The Linux kernel provides a wide variety of ordered memory accesses:
246 +
247 + a. Release operations.
248 +
249 + b. Acquire operations.
250 +
251 + c. RCU read-side ordering.
252 +
253 + d. Control dependencies.
254 +
255 + Each of the above categories has its own section below.
256 +
257 +
258 + Release Operations
259 + ------------------
260 +
261 + Release operations include smp_store_release(), atomic_set_release(),
262 + rcu_assign_pointer(), and value-returning RMW operations whose names
263 + end in _release. These operations order their own store against all
264 + of the CPU's prior memory accesses. Release operations often provide
265 + improved readability and performance compared to explicit barriers.
266 + For example, use of smp_store_release() saves a line compared to the
267 + smp_wmb() example above:
268 +
269 + WRITE_ONCE(x, 1);
270 + smp_store_release(&y, 1);
271 +
272 + More important, smp_store_release() makes it easier to connect up the
273 + different pieces of the concurrent algorithm. The variable stored to
274 + by the smp_store_release(), in this case "y", will normally be used in
275 + an acquire operation in other parts of the concurrent algorithm.
276 + 277 + To see the performance advantages, suppose that the above example read 278 + from "x" instead of writing to it. Then an smp_wmb() could not guarantee 279 + ordering, and an smp_mb() would be needed instead: 280 + 281 + r1 = READ_ONCE(x); 282 + smp_mb(); 283 + WRITE_ONCE(y, 1); 284 + 285 + But smp_mb() often incurs much higher overhead than does 286 + smp_store_release(), which still provides the needed ordering of "x" 287 + against "y". On x86, the version using smp_store_release() might compile 288 + to a simple load instruction followed by a simple store instruction. 289 + In contrast, the smp_mb() compiles to an expensive instruction that 290 + provides the needed ordering. 291 + 292 + There is a wide variety of release operations: 293 + 294 + o Store operations, including not only the aforementioned 295 + smp_store_release(), but also atomic_set_release(), and 296 + atomic_long_set_release(). 297 + 298 + o RCU's rcu_assign_pointer() operation. This is the same as 299 + smp_store_release() except that: (1) It takes the pointer to 300 + be assigned to instead of a pointer to that pointer, (2) It 301 + is intended to be used in conjunction with rcu_dereference() 302 + and similar rather than smp_load_acquire(), and (3) It checks 303 + for an RCU-protected pointer in "sparse" runs. 304 + 305 + o Value-returning RMW operations whose names end in _release, 306 + such as atomic_fetch_add_release() and cmpxchg_release(). 307 + Note that release ordering is guaranteed only against the 308 + memory-store portion of the RMW operation, and not against the 309 + memory-load portion. Note also that conditional operations such 310 + as cmpxchg_release() are only guaranteed to provide ordering 311 + when they succeed. 312 + 313 + As mentioned earlier, release operations are often paired with acquire 314 + operations, which are the subject of the next section. 
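The release-store mapping described above can be experimented with outside the kernel using C11 atomics. The following is a userspace sketch, not the kernel API: atomic_store_explicit() with memory_order_release stands in for smp_store_release(), an acquire load stands in for smp_load_acquire(), and relaxed accesses stand in for WRITE_ONCE()/READ_ONCE(). The functions are run sequentially here, so the asserted result is deterministic.

```c
// Userspace analogue of the smp_store_release()/smp_load_acquire() pattern.
#include <stdatomic.h>

static atomic_int x;  /* the data */
static atomic_int y;  /* the flag */

/* Producer: the relaxed store to x is ordered before the release store to y. */
static void producer(void)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);  /* ~ WRITE_ONCE(x, 1) */
	atomic_store_explicit(&y, 1, memory_order_release);  /* ~ smp_store_release(&y, 1) */
}

/* Consumer: the acquire load of y orders the subsequent relaxed load of x. */
static int consumer(void)
{
	int r0 = atomic_load_explicit(&y, memory_order_acquire);  /* ~ smp_load_acquire(&y) */
	int r1 = atomic_load_explicit(&x, memory_order_relaxed);  /* ~ READ_ONCE(x) */

	return r0 ? r1 : -1;  /* if the flag was seen, the data must be visible */
}
```

On x86, a compiler typically emits plain load and store instructions for both the release store and the acquire load, consistent with the low-overhead claim above.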
315 + 316 + 317 + Acquire Operations 318 + ------------------ 319 + 320 + Acquire operations include smp_load_acquire(), atomic_read_acquire(), 321 + and value-returning RMW operations whose names end in _acquire. These 322 + operations order their own load against all of the CPU's subsequent 323 + memory accesses. Acquire operations often provide improved performance 324 + and readability compared to explicit barriers. For example, use of 325 + smp_load_acquire() saves a line compared to the smp_rmb() example above: 326 + 327 + r0 = smp_load_acquire(&y); 328 + r1 = READ_ONCE(x); 329 + 330 + As with smp_store_release(), this also makes it easier to connect 331 + the different pieces of the concurrent algorithm by looking for the 332 + smp_store_release() that stores to "y". In addition, smp_load_acquire() 333 + improves upon smp_rmb() by ordering against subsequent stores as well 334 + as against subsequent loads. 335 + 336 + There are a couple of categories of acquire operations: 337 + 338 + o Load operations, including not only the aforementioned 339 + smp_load_acquire(), but also atomic_read_acquire(), and 340 + atomic64_read_acquire(). 341 + 342 + o Value-returning RMW operations whose names end in _acquire, 343 + such as atomic_xchg_acquire() and atomic_cmpxchg_acquire(). 344 + Note that acquire ordering is guaranteed only against the 345 + memory-load portion of the RMW operation, and not against the 346 + memory-store portion. Note also that conditional operations 347 + such as atomic_cmpxchg_acquire() are only guaranteed to provide 348 + ordering when they succeed. 349 + 350 + Symmetry being what it is, acquire operations are often paired with the 351 + release operations covered earlier. 
For example, consider the following 352 + example, where task0() and task1() execute concurrently: 353 + 354 + void task0(void) 355 + { 356 + WRITE_ONCE(x, 1); 357 + smp_store_release(&y, 1); 358 + } 359 + 360 + void task1(void) 361 + { 362 + r0 = smp_load_acquire(&y); 363 + r1 = READ_ONCE(x); 364 + } 365 + 366 + If "x" and "y" are both initially zero, then either r0's final value 367 + will be zero or r1's final value will be one, thus providing the required 368 + ordering. 369 + 370 + 371 + RCU Read-Side Ordering 372 + ---------------------- 373 + 374 + This category includes read-side markers such as rcu_read_lock() 375 + and rcu_read_unlock() as well as pointer-traversal primitives such as 376 + rcu_dereference() and srcu_dereference(). 377 + 378 + Compared to locking primitives and RMW atomic operations, markers 379 + for RCU read-side critical sections incur very low overhead because 380 + they interact only with the corresponding grace-period primitives. 381 + For example, the rcu_read_lock() and rcu_read_unlock() markers interact 382 + with synchronize_rcu(), synchronize_rcu_expedited(), and call_rcu(). 383 + The way this works is that if a given call to synchronize_rcu() cannot 384 + prove that it started before a given call to rcu_read_lock(), then 385 + that synchronize_rcu() must block until the matching rcu_read_unlock() 386 + is reached. For more information, please see the synchronize_rcu() 387 + docbook header comment and the material in Documentation/RCU. 388 + 389 + RCU's pointer-traversal primitives, including rcu_dereference() and 390 + srcu_dereference(), order their load (which must be a pointer) against any 391 + of the CPU's subsequent memory accesses whose address has been calculated 392 + from the value loaded. There is said to be an *address dependency* 393 + from the value returned by the rcu_dereference() or srcu_dereference() 394 + to that subsequent memory access. 
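The pointer-publication idiom that rcu_assign_pointer() and rcu_dereference() implement can also be sketched with userspace C11 atomics. This is an analogue only: the release store stands in for rcu_assign_pointer(), and an acquire load stands in for rcu_dereference(), which in the kernel relies on the address dependency described above rather than on full acquire ordering.

```c
// Userspace analogue of RCU-style pointer publication.
#include <stdatomic.h>
#include <stdlib.h>

struct blob {
	int a;
};

static _Atomic(struct blob *) gp;  /* the published pointer, initially NULL */

/* Publisher: fully initialize the structure, then release-store the pointer. */
static void publish(void)
{
	struct blob *p = malloc(sizeof(*p));

	p->a = 42;                                            /* initialize before publication */
	atomic_store_explicit(&gp, p, memory_order_release);  /* ~ rcu_assign_pointer(gp, p) */
}

/* Reader: acquire-load the pointer, then dereference it. */
static int consume(void)
{
	struct blob *p = atomic_load_explicit(&gp, memory_order_acquire);  /* ~ rcu_dereference(gp) */

	return p ? p->a : -1;  /* a non-NULL pointer implies initialized fields */
}
```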
395 +
396 + A call to rcu_dereference() for a given RCU-protected pointer is
397 + usually paired with a call to rcu_assign_pointer() for that
398 + same pointer in much the same way that a call to smp_load_acquire() is
399 + paired with a call to smp_store_release(). Calls to rcu_dereference()
400 + and rcu_assign_pointer() are often buried in other APIs, for example,
401 + the RCU list API members defined in include/linux/rculist.h. For more
402 + information, please see the docbook headers in that file, the most
403 + recent LWN article on the RCU API (https://lwn.net/Articles/777036/),
404 + and of course the material in Documentation/RCU.
405 +
406 + If the pointer value is manipulated between the rcu_dereference()
407 + that returned it and a later dereference(), please read
408 + Documentation/RCU/rcu_dereference.rst. It can also be quite helpful to
409 + review uses in the Linux kernel.
410 +
411 +
412 + Control Dependencies
413 + --------------------
414 +
415 + A control dependency extends from a marked load (READ_ONCE() or stronger)
416 + through an "if" condition to a marked store (WRITE_ONCE() or stronger)
417 + that is executed only by one of the legs of that "if" statement.
418 + Control dependencies are so named because they are mediated by
419 + control-flow instructions such as comparisons and conditional branches.
420 +
421 + In short, you can use a control dependency to enforce ordering between
422 + a READ_ONCE() and a WRITE_ONCE() when there is an "if" condition
423 + between them. The canonical example is as follows:
424 +
425 + q = READ_ONCE(a);
426 + if (q)
427 + WRITE_ONCE(b, 1);
428 + 
429 + In this case, all CPUs would see the read from "a" as happening before
430 + the write to "b".
431 +
432 + However, control dependencies are easily destroyed by compiler
433 + optimizations, so any use of control dependencies must take into account
434 + all of the compilers used to build the Linux kernel.
Please see the
435 + "control-dependencies.txt" file for more information.
436 +
437 +
438 + Unordered Accesses
439 + ==================
440 +
441 + Each of these two categories of unordered accesses has a section below:
442 +
443 + a. Unordered marked operations.
444 +
445 + b. Unmarked C-language accesses.
446 +
447 +
448 + Unordered Marked Operations
449 + ---------------------------
450 +
451 + Unordered operations to different variables are just that, unordered.
452 + However, if a group of CPUs apply these operations to a single variable,
453 + all the CPUs will agree on the operation order. Of course, the ordering
454 + of unordered marked accesses can also be constrained using the mechanisms
455 + described earlier in this document.
456 +
457 + These operations come in three categories:
458 +
459 + o Marked writes, such as WRITE_ONCE() and atomic_set(). These
460 + primitives require the compiler to emit the corresponding store
461 + instructions in the expected execution order, thus suppressing
462 + a number of destructive optimizations. However, they provide no
463 + hardware ordering guarantees, and in fact many CPUs will happily
464 + reorder marked writes with each other or with other unordered
465 + operations, unless these operations are to the same variable.
466 +
467 + o Marked reads, such as READ_ONCE() and atomic_read(). These
468 + primitives require the compiler to emit the corresponding load
469 + instructions in the expected execution order, thus suppressing
470 + a number of destructive optimizations. However, they provide no
471 + hardware ordering guarantees, and in fact many CPUs will happily
472 + reorder marked reads with each other or with other unordered
473 + operations, unless these operations are to the same variable.
474 +
475 + o Unordered RMW atomic operations.
These are non-value-returning 476 + RMW atomic operations whose names do not end in _acquire or 477 + _release, and also value-returning RMW operations whose names 478 + end in _relaxed. Examples include atomic_add(), atomic_or(), 479 + and atomic64_fetch_xor_relaxed(). These operations do carry 480 + out the specified RMW operation atomically, for example, five 481 + concurrent atomic_inc() operations applied to a given variable 482 + will reliably increase the value of that variable by five. 483 + However, many CPUs will happily reorder these operations with 484 + each other or with other unordered operations. 485 + 486 + This category of operations can be efficiently ordered using 487 + smp_mb__before_atomic() and smp_mb__after_atomic(), as was 488 + discussed in the "RMW Ordering Augmentation Barriers" section. 489 + 490 + In short, these operations can be freely reordered unless they are all 491 + operating on a single variable or unless they are constrained by one of 492 + the operations called out earlier in this document. 493 + 494 + 495 + Unmarked C-Language Accesses 496 + ---------------------------- 497 + 498 + Unmarked C-language accesses are normal variable accesses to normal 499 + variables, that is, to variables that are not "volatile" and are not 500 + C11 atomic variables. These operations provide no ordering guarantees, 501 + and further do not guarantee "atomic" access. For example, the compiler 502 + might (and sometimes does) split a plain C-language store into multiple 503 + smaller stores. A load from that same variable running on some other 504 + CPU while such a store is executing might see a value that is a mashup 505 + of the old value and the new value. 506 + 507 + Unmarked C-language accesses are unordered, and are also subject to 508 + any number of compiler optimizations, many of which can break your 509 + concurrent code. 
It is possible to use unmarked C-language accesses for
510 + shared variables that are subject to concurrent access, but great care
511 + is required on an ongoing basis. The compiler-constraining barrier()
512 + primitive can be helpful, as can the various ordering primitives discussed
513 + in this document. It nevertheless bears repeating that use of unmarked
514 + C-language accesses requires careful attention to not just your code,
515 + but to all the compilers that might be used to build it. Such compilers
516 + might replace a series of loads with a single load, and might replace
517 + a series of stores with a single store. Some compilers will even split
518 + a single store into multiple smaller stores.
519 +
520 + But there are some ways of using unmarked C-language accesses for shared
521 + variables without such worries:
522 +
523 + o Guard all accesses to a given variable by a particular lock,
524 + so that there are never concurrent conflicting accesses to
525 + that variable. (There are "conflicting accesses" when
526 + (1) at least one of the concurrent accesses to a variable is an
527 + unmarked C-language access and (2) when at least one of those
528 + accesses is a write, whether marked or not.)
529 +
530 + o As above, but using other synchronization primitives such
531 + as reader-writer locks or sequence locks.
532 +
533 + o Use locking or other means to ensure that all concurrent accesses
534 + to a given variable are reads.
535 +
536 + o Restrict use of a given variable to statistics or heuristics
537 + where the occasional bogus value can be tolerated.
538 +
539 + o Declare the accessed variables as C11 atomics.
540 + https://lwn.net/Articles/691128/
541 +
542 + o Declare the accessed variables as "volatile".
543 +
544 + If you need to live more dangerously, please do take the time to
545 + understand the compilers. One place to start is these two LWN
546 + articles:
547 +
548 + Who's afraid of a big bad optimizing compiler?
549 + https://lwn.net/Articles/793253 550 + Calibrating your fear of big bad optimizing compilers 551 + https://lwn.net/Articles/799218 552 + 553 + Used properly, unmarked C-language accesses can reduce overhead on 554 + fastpaths. However, the price is great care and continual attention 555 + to your compiler as new versions come out and as new optimizations 556 + are enabled.
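The first of the safe-use rules listed above, guarding every access to a given variable with one lock, can be sketched in userspace C. This is an analogue only: pthread_mutex_lock() stands in for the kernel's spin_lock(), and the accesses to the counter itself are deliberately plain (unmarked) C-language accesses.

```c
// Sketch of lock-guarded plain accesses: no conflicting concurrent
// accesses to "counter" are possible, so no marking is needed.
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;  /* guarded by counter_lock; plain accesses only */

static void counter_add(long n)
{
	pthread_mutex_lock(&counter_lock);
	counter += n;  /* unmarked access, protected by the lock */
	pthread_mutex_unlock(&counter_lock);
}

static long counter_read(void)
{
	long v;

	pthread_mutex_lock(&counter_lock);
	v = counter;  /* unmarked access, protected by the lock */
	pthread_mutex_unlock(&counter_lock);
	return v;
}
```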
+2 -20
tools/memory-model/README
··· 161 161 DESCRIPTION OF FILES 162 162 ==================== 163 163 164 - Documentation/cheatsheet.txt 165 - Quick-reference guide to the Linux-kernel memory model. 166 - 167 - Documentation/explanation.txt 168 - Describes the memory model in detail. 169 - 170 - Documentation/litmus-tests.txt 171 - Describes the format, features, capabilities, and limitations 172 - of the litmus tests that LKMM can evaluate. 173 - 174 - Documentation/recipes.txt 175 - Lists common memory-ordering patterns. 176 - 177 - Documentation/references.txt 178 - Provides background reading. 179 - 180 - Documentation/simple.txt 181 - Starting point for someone new to Linux-kernel concurrency. 182 - And also for those needing a reminder of the simpler approaches 183 - to concurrency! 164 + Documentation/README 165 + Guide to the other documents in the Documentation/ directory. 184 166 185 167 linux-kernel.bell 186 168 Categorizes the relevant instructions, including memory
+3 -1
tools/memory-model/litmus-tests/CoRR+poonceonce+Once.litmus
··· 7 7 * reads from the same variable are ordered. 8 8 *) 9 9 10 - {} 10 + { 11 + int x; 12 + } 11 13 12 14 P0(int *x) 13 15 {
+3 -1
tools/memory-model/litmus-tests/CoRW+poonceonce+Once.litmus
··· 7 7 * a given variable and a later write to that same variable are ordered. 8 8 *) 9 9 10 - {} 10 + { 11 + int x; 12 + } 11 13 12 14 P0(int *x) 13 15 {
+3 -1
tools/memory-model/litmus-tests/CoWR+poonceonce+Once.litmus
··· 7 7 * given variable and a later read from that same variable are ordered. 8 8 *) 9 9 10 - {} 10 + { 11 + int x; 12 + } 11 13 12 14 P0(int *x) 13 15 {
+3 -1
tools/memory-model/litmus-tests/CoWW+poonceonce.litmus
··· 7 7 * writes to the same variable are ordered. 8 8 *) 9 9 10 - {} 10 + { 11 + int x; 12 + } 11 13 12 14 P0(int *x) 13 15 {
+4 -1
tools/memory-model/litmus-tests/IRIW+fencembonceonces+OnceOnce.litmus
··· 10 10 * process? This litmus test exercises LKMM's "propagation" rule. 11 11 *) 12 12 13 - {} 13 + { 14 + int x; 15 + int y; 16 + } 14 17 15 18 P0(int *x) 16 19 {
+4 -1
tools/memory-model/litmus-tests/IRIW+poonceonces+OnceOnce.litmus
··· 10 10 * different process? 11 11 *) 12 12 13 - {} 13 + { 14 + int x; 15 + int y; 16 + } 14 17 15 18 P0(int *x) 16 19 {
+6 -1
tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus
··· 7 7 * (in P0() and P1()) is visible to external process P2(). 8 8 *) 9 9 10 - {} 10 + { 11 + spinlock_t mylock; 12 + int x; 13 + int y; 14 + int z; 15 + } 11 16 12 17 P0(int *x, int *y, spinlock_t *mylock) 13 18 {
+5 -1
tools/memory-model/litmus-tests/ISA2+poonceonces.litmus
··· 9 9 * of the smp_load_acquire() invocations are replaced by READ_ONCE()? 10 10 *) 11 11 12 - {} 12 + { 13 + int x; 14 + int y; 15 + int z; 16 + } 13 17 14 18 P0(int *x, int *y) 15 19 {
+5 -1
tools/memory-model/litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus
··· 11 11 * (AKA non-rf) link, so release-acquire is all that is needed. 12 12 *) 13 13 14 - {} 14 + { 15 + int x; 16 + int y; 17 + int z; 18 + } 15 19 16 20 P0(int *x, int *y) 17 21 {
+4 -1
tools/memory-model/litmus-tests/LB+fencembonceonce+ctrlonceonce.litmus
··· 11 11 * another control dependency and order would still be maintained.) 12 12 *) 13 13 14 - {} 14 + { 15 + int x; 16 + int y; 17 + } 15 18 16 19 P0(int *x, int *y) 17 20 {
+4 -1
tools/memory-model/litmus-tests/LB+poacquireonce+pooncerelease.litmus
··· 8 8 * to the other? 9 9 *) 10 10 11 - {} 11 + { 12 + int x; 13 + int y; 14 + } 12 15 13 16 P0(int *x, int *y) 14 17 {
+4 -1
tools/memory-model/litmus-tests/LB+poonceonces.litmus
··· 7 7 * be prevented even with no explicit ordering? 8 8 *) 9 9 10 - {} 10 + { 11 + int x; 12 + int y; 13 + } 11 14 12 15 P0(int *x, int *y) 13 16 {
+13 -10
tools/memory-model/litmus-tests/MP+fencewmbonceonce+fencermbonceonce.litmus
··· 8 8 * is usually better to use smp_store_release() and smp_load_acquire(). 9 9 *) 10 10 11 - {} 12 - 13 - P0(int *x, int *y) 14 11 { 15 - WRITE_ONCE(*x, 1); 16 - smp_wmb(); 17 - WRITE_ONCE(*y, 1); 12 + int buf; 13 + int flag; 18 14 } 19 15 20 - P1(int *x, int *y) 16 + P0(int *buf, int *flag) // Producer 17 + { 18 + WRITE_ONCE(*buf, 1); 19 + smp_wmb(); 20 + WRITE_ONCE(*flag, 1); 21 + } 22 + 23 + P1(int *buf, int *flag) // Consumer 21 24 { 22 25 int r0; 23 26 int r1; 24 27 25 - r0 = READ_ONCE(*y); 28 + r0 = READ_ONCE(*flag); 26 29 smp_rmb(); 27 - r1 = READ_ONCE(*x); 30 + r1 = READ_ONCE(*buf); 28 31 } 29 32 30 - exists (1:r0=1 /\ 1:r1=0) 33 + exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)
+8 -7
tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus
··· 10 10 *) 11 11 12 12 { 13 - y=z; 14 - z=0; 13 + int *p=y; 14 + int x; 15 + int y=0; 15 16 } 16 17 17 - P0(int *x, int **y) 18 + P0(int *x, int **p) // Producer 18 19 { 19 20 WRITE_ONCE(*x, 1); 20 - rcu_assign_pointer(*y, x); 21 + rcu_assign_pointer(*p, x); 21 22 } 22 23 23 - P1(int *x, int **y) 24 + P1(int *x, int **p) // Consumer 24 25 { 25 26 int *r0; 26 27 int r1; 27 28 28 29 rcu_read_lock(); 29 - r0 = rcu_dereference(*y); 30 + r0 = rcu_dereference(*p); 30 31 r1 = READ_ONCE(*r0); 31 32 rcu_read_unlock(); 32 33 } 33 34 34 - exists (1:r0=x /\ 1:r1=0) 35 + exists (1:r0=x /\ 1:r1=0) (* Bad outcome. *)
+5 -3
tools/memory-model/litmus-tests/MP+polockmbonce+poacquiresilsil.litmus
··· 11 11 *) 12 12 13 13 { 14 + spinlock_t lo; 15 + int x; 14 16 } 15 17 16 - P0(spinlock_t *lo, int *x) 18 + P0(spinlock_t *lo, int *x) // Producer 17 19 { 18 20 spin_lock(lo); 19 21 smp_mb__after_spinlock(); ··· 23 21 spin_unlock(lo); 24 22 } 25 23 26 - P1(spinlock_t *lo, int *x) 24 + P1(spinlock_t *lo, int *x) // Consumer 27 25 { 28 26 int r1; 29 27 int r2; ··· 34 32 r3 = spin_is_locked(lo); 35 33 } 36 34 37 - exists (1:r1=1 /\ 1:r2=0 /\ 1:r3=1) 35 + exists (1:r1=1 /\ 1:r2=0 /\ 1:r3=1) (* Bad outcome. *)
+5 -3
tools/memory-model/litmus-tests/MP+polockonce+poacquiresilsil.litmus
··· 11 11 *) 12 12 13 13 { 14 + spinlock_t lo; 15 + int x; 14 16 } 15 17 16 - P0(spinlock_t *lo, int *x) 18 + P0(spinlock_t *lo, int *x) // Producer 17 19 { 18 20 spin_lock(lo); 19 21 WRITE_ONCE(*x, 1); 20 22 spin_unlock(lo); 21 23 } 22 24 23 - P1(spinlock_t *lo, int *x) 25 + P1(spinlock_t *lo, int *x) // Consumer 24 26 { 25 27 int r1; 26 28 int r2; ··· 33 31 r3 = spin_is_locked(lo); 34 32 } 35 33 36 - exists (1:r1=1 /\ 1:r2=0 /\ 1:r3=1) 34 + exists (1:r1=1 /\ 1:r2=0 /\ 1:r3=1) (* Bad outcome. *)
+13 -9
tools/memory-model/litmus-tests/MP+polocks.litmus
··· 11 11 * to see all prior accesses by those other CPUs. 12 12 *) 13 13 14 - {} 15 - 16 - P0(int *x, int *y, spinlock_t *mylock) 17 14 { 18 - WRITE_ONCE(*x, 1); 15 + spinlock_t mylock; 16 + int buf; 17 + int flag; 18 + } 19 + 20 + P0(int *buf, int *flag, spinlock_t *mylock) // Producer 21 + { 22 + WRITE_ONCE(*buf, 1); 19 23 spin_lock(mylock); 20 - WRITE_ONCE(*y, 1); 24 + WRITE_ONCE(*flag, 1); 21 25 spin_unlock(mylock); 22 26 } 23 27 24 - P1(int *x, int *y, spinlock_t *mylock) 28 + P1(int *buf, int *flag, spinlock_t *mylock) // Consumer 25 29 { 26 30 int r0; 27 31 int r1; 28 32 29 33 spin_lock(mylock); 30 - r0 = READ_ONCE(*y); 34 + r0 = READ_ONCE(*flag); 31 35 spin_unlock(mylock); 32 - r1 = READ_ONCE(*x); 36 + r1 = READ_ONCE(*buf); 33 37 } 34 38 35 - exists (1:r0=1 /\ 1:r1=0) 39 + exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)
+12 -9
tools/memory-model/litmus-tests/MP+poonceonces.litmus
··· 7 7 * no ordering at all? 8 8 *) 9 9 10 - {} 11 - 12 - P0(int *x, int *y) 13 10 { 14 - WRITE_ONCE(*x, 1); 15 - WRITE_ONCE(*y, 1); 11 + int buf; 12 + int flag; 16 13 } 17 14 18 - P1(int *x, int *y) 15 + P0(int *buf, int *flag) // Producer 16 + { 17 + WRITE_ONCE(*buf, 1); 18 + WRITE_ONCE(*flag, 1); 19 + } 20 + 21 + P1(int *buf, int *flag) // Consumer 19 22 { 20 23 int r0; 21 24 int r1; 22 25 23 - r0 = READ_ONCE(*y); 24 - r1 = READ_ONCE(*x); 26 + r0 = READ_ONCE(*flag); 27 + r1 = READ_ONCE(*buf); 25 28 } 26 29 27 - exists (1:r0=1 /\ 1:r1=0) 30 + exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)
+12 -9
tools/memory-model/litmus-tests/MP+pooncerelease+poacquireonce.litmus
··· 8 8 * pattern. 9 9 *) 10 10 11 - {} 12 - 13 - P0(int *x, int *y) 14 11 { 15 - WRITE_ONCE(*x, 1); 16 - smp_store_release(y, 1); 12 + int buf; 13 + int flag; 17 14 } 18 15 19 - P1(int *x, int *y) 16 + P0(int *buf, int *flag) // Producer 17 + { 18 + WRITE_ONCE(*buf, 1); 19 + smp_store_release(flag, 1); 20 + } 21 + 22 + P1(int *buf, int *flag) // Consumer 20 23 { 21 24 int r0; 22 25 int r1; 23 26 24 - r0 = smp_load_acquire(y); 25 - r1 = READ_ONCE(*x); 27 + r0 = smp_load_acquire(flag); 28 + r1 = READ_ONCE(*buf); 26 29 } 27 30 28 - exists (1:r0=1 /\ 1:r1=0) 31 + exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)
+12 -8
tools/memory-model/litmus-tests/MP+porevlocks.litmus
··· 11 11 * see all prior accesses by those other CPUs. 12 12 *) 13 13 14 - {} 14 + { 15 + spinlock_t mylock; 16 + int buf; 17 + int flag; 18 + } 15 19 16 - P0(int *x, int *y, spinlock_t *mylock) 20 + P0(int *buf, int *flag, spinlock_t *mylock) // Consumer 17 21 { 18 22 int r0; 19 23 int r1; 20 24 21 - r0 = READ_ONCE(*y); 25 + r0 = READ_ONCE(*flag); 22 26 spin_lock(mylock); 23 - r1 = READ_ONCE(*x); 27 + r1 = READ_ONCE(*buf); 24 28 spin_unlock(mylock); 25 29 } 26 30 27 - P1(int *x, int *y, spinlock_t *mylock) 31 + P1(int *buf, int *flag, spinlock_t *mylock) // Producer 28 32 { 29 33 spin_lock(mylock); 30 - WRITE_ONCE(*x, 1); 34 + WRITE_ONCE(*buf, 1); 31 35 spin_unlock(mylock); 32 - WRITE_ONCE(*y, 1); 36 + WRITE_ONCE(*flag, 1); 33 37 } 34 38 35 - exists (0:r0=1 /\ 0:r1=0) 39 + exists (0:r0=1 /\ 0:r1=0) (* Bad outcome. *)
+4 -1
tools/memory-model/litmus-tests/R+fencembonceonces.litmus
··· 9 9 * cause the resulting test to be allowed. 10 10 *) 11 11 12 - {} 12 + { 13 + int x; 14 + int y; 15 + } 13 16 14 17 P0(int *x, int *y) 15 18 {
+4 -1
tools/memory-model/litmus-tests/R+poonceonces.litmus
··· 8 8 * store propagation delays. 9 9 *) 10 10 11 - {} 11 + { 12 + int x; 13 + int y; 14 + } 12 15 13 16 P0(int *x, int *y) 14 17 {
+4 -1
tools/memory-model/litmus-tests/S+fencewmbonceonce+poacquireonce.litmus
··· 7 7 * store against a subsequent store? 8 8 *) 9 9 10 - {} 10 + { 11 + int x; 12 + int y; 13 + } 11 14 12 15 P0(int *x, int *y) 13 16 {
+4 -1
tools/memory-model/litmus-tests/S+poonceonces.litmus
··· 9 9 * READ_ONCE(), is ordering preserved? 10 10 *) 11 11 12 - {} 12 + { 13 + int x; 14 + int y; 15 + } 13 16 14 17 P0(int *x, int *y) 15 18 {
+4 -1
tools/memory-model/litmus-tests/SB+fencembonceonces.litmus
··· 9 9 * suffice, but not much else.) 10 10 *) 11 11 12 - {} 12 + { 13 + int x; 14 + int y; 15 + } 13 16 14 17 P0(int *x, int *y) 15 18 {
+4 -1
tools/memory-model/litmus-tests/SB+poonceonces.litmus
··· 8 8 * variable that the preceding process reads. 9 9 *) 10 10 11 - {} 11 + { 12 + int x; 13 + int y; 14 + } 12 15 13 16 P0(int *x, int *y) 14 17 {
+4 -1
tools/memory-model/litmus-tests/SB+rfionceonce-poonceonces.litmus
··· 6 6 * This litmus test demonstrates that LKMM is not fully multicopy atomic. 7 7 *) 8 8 9 - {} 9 + { 10 + int x; 11 + int y; 12 + } 10 13 11 14 P0(int *x, int *y) 12 15 {
+4 -1
tools/memory-model/litmus-tests/WRC+poonceonces+Once.litmus
··· 8 8 * test has no ordering at all. 9 9 *) 10 10 11 - {} 11 + { 12 + int x; 13 + int y; 14 + } 12 15 13 16 P0(int *x) 14 17 {
+4 -1
tools/memory-model/litmus-tests/WRC+pooncerelease+fencermbonceonce+Once.litmus
··· 10 10 * is A-cumulative in LKMM. 11 11 *) 12 12 13 - {} 13 + { 14 + int x; 15 + int y; 16 + } 14 17 15 18 P0(int *x) 16 19 {
+6 -1
tools/memory-model/litmus-tests/Z6.0+pooncelock+poonceLock+pombonce.litmus
··· 9 9 * by CPUs not holding that lock. 10 10 *) 11 11 12 - {} 12 + { 13 + spinlock_t mylock; 14 + int x; 15 + int y; 16 + int z; 17 + } 13 18 14 19 P0(int *x, int *y, spinlock_t *mylock) 15 20 {
+6 -1
tools/memory-model/litmus-tests/Z6.0+pooncelock+pooncelock+pombonce.litmus
··· 8 8 * seen as ordered by a third process not holding that lock. 9 9 *) 10 10 11 - {} 11 + { 12 + spinlock_t mylock; 13 + int x; 14 + int y; 15 + int z; 16 + } 12 17 13 18 P0(int *x, int *y, spinlock_t *mylock) 14 19 {
+5 -1
tools/memory-model/litmus-tests/Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus
··· 14 14 * involving locking.) 15 15 *) 16 16 17 - {} 17 + { 18 + int x; 19 + int y; 20 + int z; 21 + } 18 22 19 23 P0(int *x, int *y) 20 24 {
+2 -1
tools/testing/selftests/rcutorture/bin/console-badness.sh
··· 13 13 egrep 'Badness|WARNING:|Warn|BUG|===========|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state|rcu_.*kthread starved for|!!!' | 14 14 grep -v 'ODEBUG: ' | 15 15 grep -v 'This means that this is a DEBUG kernel and it is' | 16 - grep -v 'Warning: unable to open an initial console' 16 + grep -v 'Warning: unable to open an initial console' | 17 + grep -v 'NOHZ tick-stop error: Non-RCU local softirq work is pending, handler'
+1
tools/testing/selftests/rcutorture/bin/functions.sh
··· 169 169 # Output arguments for the qemu "-append" string based on CPU type 170 170 # and the TORTURE_QEMU_INTERACTIVE environment variable. 171 171 identify_qemu_append () { 172 + echo debug_boot_weak_hash 172 173 local console=ttyS0 173 174 case "$1" in 174 175 qemu-system-x86_64|qemu-system-i386)
+2 -3
tools/testing/selftests/rcutorture/bin/kvm-check-branches.sh
··· 52 52 KVM="`pwd`/tools/testing/selftests/rcutorture"; export KVM 53 53 PATH=${KVM}/bin:$PATH; export PATH 54 54 . functions.sh 55 - cpus="`identify_qemu_vcpus`" 56 - echo Using up to $cpus CPUs. 55 + echo Using all `identify_qemu_vcpus` CPUs. 57 56 58 57 # Each pass through this loop does one command-line argument. 59 58 for gitbr in $@ ··· 73 74 # Test the specified commit. 74 75 git checkout $i > $resdir/$ds/$idir/git-checkout.out 2>&1 75 76 echo git checkout return code: $? "(Commit $ntry: $i)" 76 - kvm.sh --cpus $cpus --duration 3 --trust-make > $resdir/$ds/$idir/kvm.sh.out 2>&1 77 + kvm.sh --allcpus --duration 3 --trust-make > $resdir/$ds/$idir/kvm.sh.out 2>&1 77 78 ret=$? 78 79 echo kvm.sh return code $ret for commit $i from branch $gitbr 79 80
+1 -1
tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuscale.sh
··· 32 32 awk ' 33 33 /-scale: .* gps: .* batches:/ { 34 34 ngps = $9; 35 - nbatches = $11; 35 + nbatches = 1; 36 36 } 37 37 38 38 /-scale: .*writer-duration/ {
+18 -1
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
··· 206 206 kruntime=`gawk 'BEGIN { print systime() - '"$kstarttime"' }' < /dev/null` 207 207 if test -z "$qemu_pid" || kill -0 "$qemu_pid" > /dev/null 2>&1 208 208 then 209 - if test $kruntime -ge $seconds -o -f "$TORTURE_STOPFILE" 209 + if test -n "$TORTURE_KCONFIG_GDB_ARG" 210 + then 211 + : 212 + elif test $kruntime -ge $seconds || test -f "$TORTURE_STOPFILE" 210 213 then 211 214 break; 212 215 fi ··· 225 222 then 226 223 echo "ps -fp $killpid" >> $resdir/Warnings 2>&1 227 224 ps -fp $killpid >> $resdir/Warnings 2>&1 225 + fi 226 + # Reduce probability of PID reuse by allowing a one-minute buffer 227 + if test $((kruntime + 60)) -lt $seconds && test -s "$resdir/../jitter_pids" 228 + then 229 + awk < "$resdir/../jitter_pids" ' 230 + NF > 0 { 231 + pidlist = pidlist " " $1; 232 + n++; 233 + } 234 + END { 235 + if (n > 0) { 236 + print "kill " pidlist; 237 + } 238 + }' | sh 228 239 fi 229 240 else 230 241 echo ' ---' `date`: "Kernel done"
+22 -7
tools/testing/selftests/rcutorture/bin/kvm.sh
··· 58 58 echo " --datestamp string"
59 59 echo " --defconfig string"
60 60 echo " --dryrun sched|script"
61 - echo " --duration minutes"
61 + echo " --duration minutes | <seconds>s | <hours>h | <days>d"
62 62 echo " --gdb"
63 63 echo " --help"
64 64 echo " --interactive"
··· 93 93 TORTURE_BOOT_IMAGE="$2"
94 94 shift
95 95 ;;
96 - --buildonly)
96 + --buildonly|--build-only)
97 97 TORTURE_BUILDONLY=1
98 98 ;;
99 99 --configs|--config)
··· 128 128 shift
129 129 ;;
130 130 --duration)
131 - checkarg --duration "(minutes)" $# "$2" '^[0-9]*$' '^error'
132 - dur=$(($2*60))
131 + checkarg --duration "(minutes)" $# "$2" '^[0-9][0-9]*\(s\|m\|h\|d\|\)$' '^error'
132 + mult=60
133 + if echo "$2" | grep -q 's$'
134 + then
135 + mult=1
136 + elif echo "$2" | grep -q 'h$'
137 + then
138 + mult=3600
139 + elif echo "$2" | grep -q 'd$'
140 + then
141 + mult=86400
142 + fi
143 + ts=`echo $2 | sed -e 's/[smhd]$//'`
144 + dur=$(($ts*mult))
133 145 shift
134 146 ;;
135 147 --gdb)
··· 160 148 jitter="$2"
161 149 shift
162 150 ;;
163 - --kconfig)
151 + --kconfig|--kconfigs)
164 152 checkarg --kconfig "(Kconfig options)" $# "$2" '^CONFIG_[A-Z0-9_]\+=\([ynm]\|[0-9]\+\)\( CONFIG_[A-Z0-9_]\+=\([ynm]\|[0-9]\+\)\)*$' '^error$'
165 153 TORTURE_KCONFIG_ARG="$2"
166 154 shift
··· 171 159 --kcsan)
172 160 TORTURE_KCONFIG_KCSAN_ARG="CONFIG_DEBUG_INFO=y CONFIG_KCSAN=y CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000 CONFIG_KCSAN_VERBOSE=y CONFIG_KCSAN_INTERRUPT_WATCHER=y"; export TORTURE_KCONFIG_KCSAN_ARG
173 161 ;;
174 - --kmake-arg)
162 + --kmake-arg|--kmake-args)
175 163 checkarg --kmake-arg "(kernel make arguments)" $# "$2" '.*' '^error$'
176 164 TORTURE_KMAKE_ARG="$2"
177 165 shift
··· 471 459 print "if test -n \"$needqemurun\""
472 460 print "then"
473 461 print "\techo ---- Starting kernels. `date` | tee -a " rd "log";
474 - for (j = 0; j < njitter; j++)
462 + print "\techo > " rd "jitter_pids"
463 + for (j = 0; j < njitter; j++) {
475 464 print "\tjitter.sh " j " " dur " " ja[2] " " ja[3] "&"
465 + print "\techo $! >> " rd "jitter_pids"
466 + }
476 467 print "\twait"
477 468 print "\techo ---- All kernel runs complete. `date` | tee -a " rd "log";
478 469 print "else"
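The new `--duration` suffix handling can be exercised in isolation. A minimal sketch of the same multiplier logic, with a hypothetical argument value standing in for `"$2"` (the default remains minutes when no suffix is given):

```shell
# Hypothetical --duration argument; real kvm.sh reads this from "$2".
arg=10h

# Same suffix-to-multiplier logic as the new kvm.sh hunk above.
mult=60                               # no suffix (or "m") means minutes
if echo "$arg" | grep -q 's$'
then
	mult=1                        # seconds
elif echo "$arg" | grep -q 'h$'
then
	mult=3600                     # hours
elif echo "$arg" | grep -q 'd$'
then
	mult=86400                    # days
fi
ts=`echo $arg | sed -e 's/[smhd]$//'` # strip the suffix, keep the digits
dur=$(($ts*mult))
echo $dur                             # 10h -> 36000 seconds
```

Note that the updated `checkarg` regexp (`'^[0-9][0-9]*\(s\|m\|h\|d\|\)$'`) admits a bare number, so old-style `--duration 10` still works and still means ten minutes.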
+1 -1
tools/testing/selftests/rcutorture/bin/parse-console.sh
··· 133 133 then 134 134 summary="$summary Warnings: $n_warn" 135 135 fi 136 - n_bugs=`egrep -c 'BUG|Oops:' $file` 136 + n_bugs=`egrep -c '\bBUG|Oops:' $file` 137 137 if test "$n_bugs" -ne 0 138 138 then 139 139 summary="$summary Bugs: $n_bugs"
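The added `\b` word boundary stops substrings such as the `BUG` inside `DEBUG` from being counted as bugs, while genuine `BUG:` splats and `Oops:` lines still match. A small check with hypothetical console lines (using `grep -E`, equivalent to the script's `egrep`):

```shell
# Three fake console lines: only the second and third are real hits.
# "CONFIG_DEBUG_INFO=y" contains "BUG" but not at a word boundary.
printf 'CONFIG_DEBUG_INFO=y\nBUG: sleeping function called\nOops: 0002\n' |
	grep -E -c '\bBUG|Oops:'
# Prints: 2
```

With the old pattern `'BUG|Oops:'`, the same input would have counted all three lines, inflating `n_bugs` in the summary.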
+2 -1
tools/testing/selftests/rcutorture/configs/rcu/SRCU-t
··· 4 4 CONFIG_PREEMPT=n 5 5 #CHECK#CONFIG_TINY_SRCU=y 6 6 CONFIG_RCU_TRACE=n 7 - CONFIG_DEBUG_LOCK_ALLOC=n 7 + CONFIG_DEBUG_LOCK_ALLOC=y 8 + CONFIG_PROVE_LOCKING=y 8 9 CONFIG_DEBUG_OBJECTS_RCU_HEAD=n 9 10 CONFIG_DEBUG_ATOMIC_SLEEP=y 10 11 #CHECK#CONFIG_PREEMPT_COUNT=y
+1 -2
tools/testing/selftests/rcutorture/configs/rcu/SRCU-u
··· 4 4 CONFIG_PREEMPT=n 5 5 #CHECK#CONFIG_TINY_SRCU=y 6 6 CONFIG_RCU_TRACE=n 7 - CONFIG_DEBUG_LOCK_ALLOC=y 8 - CONFIG_PROVE_LOCKING=y 7 + CONFIG_DEBUG_LOCK_ALLOC=n 9 8 CONFIG_DEBUG_OBJECTS_RCU_HEAD=n 10 9 CONFIG_PREEMPT_COUNT=n
+3 -3
tools/testing/selftests/rcutorture/configs/rcu/TRACE01
··· 4 4 CONFIG_PREEMPT_NONE=y 5 5 CONFIG_PREEMPT_VOLUNTARY=n 6 6 CONFIG_PREEMPT=n 7 - CONFIG_DEBUG_LOCK_ALLOC=y 8 - CONFIG_PROVE_LOCKING=y 9 - #CHECK#CONFIG_PROVE_RCU=y 7 + CONFIG_DEBUG_LOCK_ALLOC=n 8 + CONFIG_PROVE_LOCKING=n 9 + #CHECK#CONFIG_PROVE_RCU=n 10 10 CONFIG_TASKS_TRACE_RCU_READ_MB=y 11 11 CONFIG_RCU_EXPERT=y
+3 -3
tools/testing/selftests/rcutorture/configs/rcu/TRACE02
··· 4 4 CONFIG_PREEMPT_NONE=n 5 5 CONFIG_PREEMPT_VOLUNTARY=n 6 6 CONFIG_PREEMPT=y 7 - CONFIG_DEBUG_LOCK_ALLOC=n 8 - CONFIG_PROVE_LOCKING=n 9 - #CHECK#CONFIG_PROVE_RCU=n 7 + CONFIG_DEBUG_LOCK_ALLOC=y 8 + CONFIG_PROVE_LOCKING=y 9 + #CHECK#CONFIG_PROVE_RCU=y 10 10 CONFIG_TASKS_TRACE_RCU_READ_MB=n 11 11 CONFIG_RCU_EXPERT=y
+3
tools/testing/selftests/rcutorture/configs/rcuscale/CFcommon
··· 1 1 CONFIG_RCU_SCALE_TEST=y 2 2 CONFIG_PRINTK_TIME=y 3 + CONFIG_TASKS_RCU_GENERIC=y 4 + CONFIG_TASKS_RCU=y 5 + CONFIG_TASKS_TRACE_RCU=y
+15
tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01
··· 1 + CONFIG_SMP=y 2 + CONFIG_PREEMPT_NONE=y 3 + CONFIG_PREEMPT_VOLUNTARY=n 4 + CONFIG_PREEMPT=n 5 + CONFIG_HZ_PERIODIC=n 6 + CONFIG_NO_HZ_IDLE=y 7 + CONFIG_NO_HZ_FULL=n 8 + CONFIG_RCU_FAST_NO_HZ=n 9 + CONFIG_RCU_NOCB_CPU=n 10 + CONFIG_DEBUG_LOCK_ALLOC=n 11 + CONFIG_PROVE_LOCKING=n 12 + CONFIG_RCU_BOOST=n 13 + CONFIG_DEBUG_OBJECTS_RCU_HEAD=n 14 + CONFIG_RCU_EXPERT=y 15 + CONFIG_RCU_TRACE=y
+1
tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01.boot
··· 1 + rcuscale.scale_type=tasks-tracing