Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu

+1206 -284
+128 -16
Documentation/RCU/trace.txt
··· 1 1 CONFIG_RCU_TRACE debugfs Files and Formats 2 2 3 3 4 - The rcutree implementation of RCU provides debugfs trace output that 5 - summarizes counters and state. This information is useful for debugging 6 - RCU itself, and can sometimes also help to debug abuses of RCU. 7 - The following sections describe the debugfs files and formats. 4 + The rcutree and rcutiny implementations of RCU provide debugfs trace 5 + output that summarizes counters and state. This information is useful for 6 + debugging RCU itself, and can sometimes also help to debug abuses of RCU. 7 + The following sections describe the debugfs files and formats, first 8 + for rcutree and next for rcutiny. 8 9 9 10 10 - Hierarchical RCU debugfs Files and Formats 11 + CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU debugfs Files and Formats 11 12 12 - This implementation of RCU provides three debugfs files under the 13 + These implementations of RCU provide five debugfs files under the 13 14 top-level directory RCU: rcu/rcudata (which displays fields in struct 14 - rcu_data), rcu/rcugp (which displays grace-period counters), and 15 - rcu/rcuhier (which displays the struct rcu_node hierarchy). 15 + rcu/rcudata.csv (which is a .csv spreadsheet version of 16 + rcu/rcudata), rcu/rcugp (which displays grace-period counters), 17 + rcu/rcuhier (which displays the struct rcu_node hierarchy), and 18 + rcu/rcu_pending (which displays counts of the reasons that the 19 + rcu_pending() function decided that there was core RCU work to do). 16 20 17 21 The output of "cat rcu/rcudata" looks as follows: 18 22 ··· 134 130 been registered in absence of CPU-hotplug activity. 135 131 136 132 o "co" is the number of RCU callbacks that have been orphaned due to 137 - this CPU going offline. 133 + this CPU going offline. These orphaned callbacks have been moved 134 + to an arbitrarily chosen online CPU.
138 135 139 136 o "ca" is the number of RCU callbacks that have been adopted due to 140 137 other CPUs going offline. Note that ci+co-ca+ql is the number of ··· 173 168 174 169 The output of "cat rcu/rcuhier" looks as follows, with very long lines: 175 170 176 - c=6902 g=6903 s=2 jfq=3 j=72c7 nfqs=13142/nfqsng=0(13142) fqlh=6 oqlen=0 171 + c=6902 g=6903 s=2 jfq=3 j=72c7 nfqs=13142/nfqsng=0(13142) fqlh=6 177 172 1/1 .>. 0:127 ^0 178 173 3/3 .>. 0:35 ^0 0/0 .>. 36:71 ^1 0/0 .>. 72:107 ^2 0/0 .>. 108:127 ^3 179 174 3/3f .>. 0:5 ^0 2/3 .>. 6:11 ^1 0/0 .>. 12:17 ^2 0/0 .>. 18:23 ^3 0/0 .>. 24:29 ^4 0/0 .>. 30:35 ^5 0/0 .>. 36:41 ^0 0/0 .>. 42:47 ^1 0/0 .>. 48:53 ^2 0/0 .>. 54:59 ^3 0/0 .>. 60:65 ^4 0/0 .>. 66:71 ^5 0/0 .>. 72:77 ^0 0/0 .>. 78:83 ^1 0/0 .>. 84:89 ^2 0/0 .>. 90:95 ^3 0/0 .>. 96:101 ^4 0/0 .>. 102:107 ^5 0/0 .>. 108:113 ^0 0/0 .>. 114:119 ^1 0/0 .>. 120:125 ^2 0/0 .>. 126:127 ^3 180 175 rcu_bh: 181 - c=-226 g=-226 s=1 jfq=-5701 j=72c7 nfqs=88/nfqsng=0(88) fqlh=0 oqlen=0 176 + c=-226 g=-226 s=1 jfq=-5701 j=72c7 nfqs=88/nfqsng=0(88) fqlh=0 182 177 0/1 .>. 0:127 ^0 183 178 0/3 .>. 0:35 ^0 0/0 .>. 36:71 ^1 0/0 .>. 72:107 ^2 0/0 .>. 108:127 ^3 184 179 0/3f .>. 0:5 ^0 0/3 .>. 6:11 ^1 0/0 .>. 12:17 ^2 0/0 .>. 18:23 ^3 0/0 .>. 24:29 ^4 0/0 .>. 30:35 ^5 0/0 .>. 36:41 ^0 0/0 .>. 42:47 ^1 0/0 .>. 48:53 ^2 0/0 .>. 54:59 ^3 0/0 .>. 60:65 ^4 0/0 .>. 66:71 ^5 0/0 .>. 72:77 ^0 0/0 .>. 78:83 ^1 0/0 .>. 84:89 ^2 0/0 .>. 90:95 ^3 0/0 .>. 96:101 ^4 0/0 .>. 102:107 ^5 0/0 .>. 108:113 ^0 0/0 .>. 114:119 ^1 0/0 .>. 120:125 ^2 0/0 .>. 126:127 ^3 ··· 216 211 o "fqlh" is the number of calls to force_quiescent_state() that 217 212 exited immediately (without even being counted in nfqs above) 218 213 due to contention on ->fqslock. 219 - 220 - o "oqlen" is the number of callbacks on the "orphan" callback 221 - list. 
RCU callbacks are placed on this list by CPUs going 222 - offline, and are "adopted" either by the CPU helping the outgoing 223 - CPU or by the next rcu_barrier*() call, whichever comes first. 224 214 225 215 o Each element of the form "1/1 0:127 ^0" represents one struct 226 216 rcu_node. Each line represents one level of the hierarchy, from ··· 326 326 readers will note that the rcu "nn" number for a given CPU very 327 327 closely matches the rcu_bh "np" number for that same CPU. This 328 328 is due to short-circuit evaluation in rcu_pending(). 329 + 330 + 331 + CONFIG_TINY_RCU and CONFIG_TINY_PREEMPT_RCU debugfs Files and Formats 332 + 333 + These implementations of RCU provide a single debugfs file under the 334 + top-level directory RCU, namely rcu/rcudata, which displays fields in 335 + rcu_bh_ctrlblk, rcu_sched_ctrlblk and, for CONFIG_TINY_PREEMPT_RCU, 336 + rcu_preempt_ctrlblk. 337 + 338 + The output of "cat rcu/rcudata" is as follows: 339 + 340 + rcu_preempt: qlen=24 gp=1097669 g197/p197/c197 tasks=... 341 + ttb=. btg=no ntb=184 neb=0 nnb=183 j=01f7 bt=0274 342 + normal balk: nt=1097669 gt=0 bt=371 b=0 ny=25073378 nos=0 343 + exp balk: bt=0 nos=0 344 + rcu_sched: qlen: 0 345 + rcu_bh: qlen: 0 346 + 347 + This is split into rcu_preempt, rcu_sched, and rcu_bh sections, with the 348 + rcu_preempt section appearing only in CONFIG_TINY_PREEMPT_RCU builds. 349 + The last three lines of the rcu_preempt section appear only in 350 + CONFIG_RCU_BOOST kernel builds. The fields are as follows: 351 + 352 + o "qlen" is the number of RCU callbacks currently waiting either 353 + for an RCU grace period or waiting to be invoked. This is the 354 + only field present for rcu_sched and rcu_bh, due to the 355 + short-circuiting of grace periods in those two cases. 356 + 357 + o "gp" is the number of grace periods that have completed.
358 + 359 + o "g197/p197/c197" displays the grace-period state, with the 360 + "g" number being the number of grace periods that have started 361 + (mod 256), the "p" number being the number of grace periods 362 + that the CPU has responded to (also mod 256), and the "c" 363 + number being the number of grace periods that have completed 364 + (once again mod 256). 365 + 366 + Why have both "gp" and "g"? Because the data flowing into 367 + "gp" is only present in a CONFIG_RCU_TRACE kernel. 368 + 369 + o "tasks" is a set of bits. The first bit is "T" if there are 370 + currently tasks that have recently blocked within an RCU 371 + read-side critical section, the second bit is "N" if any of the 372 + aforementioned tasks are blocking the current RCU grace period, 373 + and the third bit is "E" if any of the aforementioned tasks are 374 + blocking the current expedited grace period. Each bit is "." 375 + if the corresponding condition does not hold. 376 + 377 + o "ttb" is a single bit. It is "B" if any of the blocked tasks 378 + need to be priority boosted and "." otherwise. 379 + 380 + o "btg" indicates whether boosting has been carried out during 381 + the current grace period, with "exp" indicating that boosting 382 + is in progress for an expedited grace period, "no" indicating 383 + that boosting has not yet started for a normal grace period, 384 + "begun" indicating that boosting has begun for a normal grace 385 + period, and "done" indicating that boosting has completed for 386 + a normal grace period. 387 + 388 + o "ntb" is the total number of tasks subjected to RCU priority boosting 389 + since boot. 390 + 391 + o "neb" is the number of expedited grace periods that have had 392 + to resort to RCU priority boosting since boot. 393 + 394 + o "nnb" is the number of normal grace periods that have had 395 + to resort to RCU priority boosting since boot. 396 + 397 + o "j" is the low-order 12 bits of the jiffies counter in hexadecimal.
398 + 399 + o "bt" is the low-order 12 bits of the value that the jiffies counter 400 + will have at the next time that boosting is scheduled to begin. 401 + 402 + o In the line beginning with "normal balk", the fields are as follows: 403 + 404 + o "nt" is the number of times that the system balked from 405 + boosting because there were no blocked tasks to boost. 406 + Note that the system will balk from boosting even if the 407 + grace period is overdue when the currently running task 408 + is looping within an RCU read-side critical section. 409 + There is no point in boosting in this case, because 410 + boosting a running task won't make it run any faster. 411 + 412 + o "gt" is the number of times that the system balked 413 + from boosting because, although there were blocked tasks, 414 + none of them were preventing the current grace period 415 + from completing. 416 + 417 + o "bt" is the number of times that the system balked 418 + from boosting because boosting was already in progress. 419 + 420 + o "b" is the number of times that the system balked from 421 + boosting because boosting had already completed for 422 + the grace period in question. 423 + 424 + o "ny" is the number of times that the system balked from 425 + boosting because it was not yet time to start boosting 426 + the grace period in question. 427 + 428 + o "nos" is the number of times that the system balked from 429 + boosting for inexplicable ("not otherwise specified") 430 + reasons. This can actually happen due to races involving 431 + increments of the jiffies counter. 432 + 433 + o In the line beginning with "exp balk", the fields are as follows: 434 + 435 + o "bt" is the number of times that the system balked from 436 + boosting because there were no blocked tasks to boost. 437 + 438 + o "nos" is the number of times that the system balked from 439 + boosting for inexplicable ("not otherwise specified") 440 + reasons.
+8 -1
include/linux/init_task.h
··· 83 83 */ 84 84 # define CAP_INIT_BSET CAP_FULL_SET 85 85 86 + #ifdef CONFIG_RCU_BOOST 87 + #define INIT_TASK_RCU_BOOST() \ 88 + .rcu_boost_mutex = NULL, 89 + #else 90 + #define INIT_TASK_RCU_BOOST() 91 + #endif 86 92 #ifdef CONFIG_TREE_PREEMPT_RCU 87 93 #define INIT_TASK_RCU_TREE_PREEMPT() \ 88 94 .rcu_blocked_node = NULL, ··· 100 94 .rcu_read_lock_nesting = 0, \ 101 95 .rcu_read_unlock_special = 0, \ 102 96 .rcu_node_entry = LIST_HEAD_INIT(tsk.rcu_node_entry), \ 103 - INIT_TASK_RCU_TREE_PREEMPT() 97 + INIT_TASK_RCU_TREE_PREEMPT() \ 98 + INIT_TASK_RCU_BOOST() 104 99 #else 105 100 #define INIT_TASK_RCU_PREEMPT(tsk) 106 101 #endif
-5
include/linux/rculist.h
··· 241 241 #define list_first_entry_rcu(ptr, type, member) \ 242 242 list_entry_rcu((ptr)->next, type, member) 243 243 244 - #define __list_for_each_rcu(pos, head) \ 245 - for (pos = rcu_dereference_raw(list_next_rcu(head)); \ 246 - pos != (head); \ 247 - pos = rcu_dereference_raw(list_next_rcu((pos))) 248 - 249 244 /** 250 245 * list_for_each_entry_rcu - iterate over rcu list of given type 251 246 * @pos: the type * to use as a loop cursor.
+2 -2
include/linux/rcupdate.h
··· 47 47 extern int rcutorture_runnable; /* for sysctl */ 48 48 #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */ 49 49 50 + #define UINT_CMP_GE(a, b) (UINT_MAX / 2 >= (a) - (b)) 51 + #define UINT_CMP_LT(a, b) (UINT_MAX / 2 < (a) - (b)) 50 52 #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) 51 53 #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) 52 54 ··· 68 66 extern void synchronize_sched(void); 69 67 extern void rcu_barrier_bh(void); 70 68 extern void rcu_barrier_sched(void); 71 - extern void synchronize_sched_expedited(void); 72 69 extern int sched_expedited_torture_stats(char *page); 73 70 74 71 static inline void __rcu_read_lock_bh(void) ··· 119 118 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ 120 119 121 120 /* Internal to kernel */ 122 - extern void rcu_init(void); 123 121 extern void rcu_sched_qs(int cpu); 124 122 extern void rcu_bh_qs(int cpu); 125 123 extern void rcu_check_callbacks(int cpu, int user);
+8 -5
include/linux/rcutiny.h
··· 27 27 28 28 #include <linux/cache.h> 29 29 30 - #define rcu_init_sched() do { } while (0) 30 + static inline void rcu_init(void) 31 + { 32 + } 31 33 32 34 #ifdef CONFIG_TINY_RCU 33 35 ··· 56 54 } 57 55 58 56 static inline void synchronize_rcu_bh_expedited(void) 57 + { 58 + synchronize_sched(); 59 + } 60 + 61 + static inline void synchronize_sched_expedited(void) 59 62 { 60 63 synchronize_sched(); 61 64 } ··· 132 125 } 133 126 134 127 #ifdef CONFIG_DEBUG_LOCK_ALLOC 135 - 136 128 extern int rcu_scheduler_active __read_mostly; 137 129 extern void rcu_scheduler_starting(void); 138 - 139 130 #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ 140 - 141 131 static inline void rcu_scheduler_starting(void) 142 132 { 143 133 } 144 - 145 134 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */ 146 135 147 136 #endif /* __LINUX_RCUTINY_H */
+2
include/linux/rcutree.h
··· 30 30 #ifndef __LINUX_RCUTREE_H 31 31 #define __LINUX_RCUTREE_H 32 32 33 + extern void rcu_init(void); 33 34 extern void rcu_note_context_switch(int cpu); 34 35 extern int rcu_needs_cpu(int cpu); 35 36 extern void rcu_cpu_stall_reset(void); ··· 48 47 #endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */ 49 48 50 49 extern void synchronize_rcu_bh(void); 50 + extern void synchronize_sched_expedited(void); 51 51 extern void synchronize_rcu_expedited(void); 52 52 53 53 static inline void synchronize_rcu_bh_expedited(void)
+9 -2
include/linux/sched.h
··· 1229 1229 #ifdef CONFIG_TREE_PREEMPT_RCU 1230 1230 struct rcu_node *rcu_blocked_node; 1231 1231 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ 1232 + #ifdef CONFIG_RCU_BOOST 1233 + struct rt_mutex *rcu_boost_mutex; 1234 + #endif /* #ifdef CONFIG_RCU_BOOST */ 1232 1235 1233 1236 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) 1234 1237 struct sched_info sched_info; ··· 1762 1759 #ifdef CONFIG_PREEMPT_RCU 1763 1760 1764 1761 #define RCU_READ_UNLOCK_BLOCKED (1 << 0) /* blocked while in RCU read-side. */ 1765 - #define RCU_READ_UNLOCK_NEED_QS (1 << 1) /* RCU core needs CPU response. */ 1762 + #define RCU_READ_UNLOCK_BOOSTED (1 << 1) /* boosted while in RCU read-side. */ 1763 + #define RCU_READ_UNLOCK_NEED_QS (1 << 2) /* RCU core needs CPU response. */ 1766 1764 1767 1765 static inline void rcu_copy_process(struct task_struct *p) 1768 1766 { ··· 1771 1767 p->rcu_read_unlock_special = 0; 1772 1768 #ifdef CONFIG_TREE_PREEMPT_RCU 1773 1769 p->rcu_blocked_node = NULL; 1774 - #endif 1770 + #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ 1771 + #ifdef CONFIG_RCU_BOOST 1772 + p->rcu_boost_mutex = NULL; 1773 + #endif /* #ifdef CONFIG_RCU_BOOST */ 1775 1774 INIT_LIST_HEAD(&p->rcu_node_entry); 1776 1775 } 1777 1776
+54 -1
init/Kconfig
··· 393 393 394 394 config RCU_TRACE 395 395 bool "Enable tracing for RCU" 396 - depends on TREE_RCU || TREE_PREEMPT_RCU 397 396 help 398 397 This option provides tracing in RCU which presents stats 399 398 in debugfs for debugging RCU implementation. ··· 457 458 This option provides tracing for the TREE_RCU and 458 459 TREE_PREEMPT_RCU implementations, permitting Makefile to 459 460 trivially select kernel/rcutree_trace.c. 461 + 462 + config RCU_BOOST 463 + bool "Enable RCU priority boosting" 464 + depends on RT_MUTEXES && TINY_PREEMPT_RCU 465 + default n 466 + help 467 + This option boosts the priority of preempted RCU readers that 468 + block the current preemptible RCU grace period for too long. 469 + This option also prevents heavy loads from blocking RCU 470 + callback invocation for all flavors of RCU. 471 + 472 + Say Y here if you are working with real-time apps or heavy loads. 473 + Say N here if you are unsure. 474 + 475 + config RCU_BOOST_PRIO 476 + int "Real-time priority to boost RCU readers to" 477 + range 1 99 478 + depends on RCU_BOOST 479 + default 1 480 + help 481 + This option specifies the real-time priority to which preempted 482 + RCU readers are to be boosted. If you are working with CPU-bound 483 + real-time applications, you should specify a priority higher than 484 + the highest-priority CPU-bound application. 485 + 486 + Specify the real-time priority, or take the default if unsure. 487 + 488 + config RCU_BOOST_DELAY 489 + int "Milliseconds to delay boosting after RCU grace-period start" 490 + range 0 3000 491 + depends on RCU_BOOST 492 + default 500 493 + help 494 + This option specifies the time to wait after the beginning of 495 + a given grace period before priority-boosting preempted RCU 496 + readers blocking that grace period. Note that any RCU reader 497 + blocking an expedited RCU grace period is boosted immediately. 498 + 499 + Accept the default if unsure.
500 + 501 + config SRCU_SYNCHRONIZE_DELAY 502 + int "Microseconds to delay before waiting for readers" 503 + range 0 20 504 + default 10 505 + help 506 + This option controls how long SRCU delays before entering its 507 + loop waiting on SRCU readers. The purpose of this loop is 508 + to avoid the unconditional context-switch penalty that would 509 + otherwise be incurred if there was an active SRCU reader, 510 + in a manner similar to adaptive locking schemes. This should 511 + be set to be a bit longer than the common-case SRCU read-side 512 + critical-section overhead. 513 + 514 + Accept the default if unsure. 460 515 461 516 endmenu # "RCU Subsystem" 462 517
+70 -35
kernel/rcutiny.c
··· 36 36 #include <linux/time.h> 37 37 #include <linux/cpu.h> 38 38 39 - /* Global control variables for rcupdate callback mechanism. */ 40 - struct rcu_ctrlblk { 41 - struct rcu_head *rcucblist; /* List of pending callbacks (CBs). */ 42 - struct rcu_head **donetail; /* ->next pointer of last "done" CB. */ 43 - struct rcu_head **curtail; /* ->next pointer of last CB. */ 44 - }; 45 - 46 - /* Definition for rcupdate control block. */ 47 - static struct rcu_ctrlblk rcu_sched_ctrlblk = { 48 - .donetail = &rcu_sched_ctrlblk.rcucblist, 49 - .curtail = &rcu_sched_ctrlblk.rcucblist, 50 - }; 51 - 52 - static struct rcu_ctrlblk rcu_bh_ctrlblk = { 53 - .donetail = &rcu_bh_ctrlblk.rcucblist, 54 - .curtail = &rcu_bh_ctrlblk.rcucblist, 55 - }; 56 - 57 - #ifdef CONFIG_DEBUG_LOCK_ALLOC 58 - int rcu_scheduler_active __read_mostly; 59 - EXPORT_SYMBOL_GPL(rcu_scheduler_active); 60 - #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ 39 + /* Controls for rcu_kthread() kthread, replacing RCU_SOFTIRQ used previously. */ 40 + static struct task_struct *rcu_kthread_task; 41 + static DECLARE_WAIT_QUEUE_HEAD(rcu_kthread_wq); 42 + static unsigned long have_rcu_kthread_work; 43 + static void invoke_rcu_kthread(void); 61 44 62 45 /* Forward declarations for rcutiny_plugin.h. 
*/ 63 - static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp); 46 + struct rcu_ctrlblk; 47 + static void rcu_process_callbacks(struct rcu_ctrlblk *rcp); 48 + static int rcu_kthread(void *arg); 64 49 static void __call_rcu(struct rcu_head *head, 65 50 void (*func)(struct rcu_head *rcu), 66 51 struct rcu_ctrlblk *rcp); ··· 108 123 { 109 124 if (rcu_qsctr_help(&rcu_sched_ctrlblk) + 110 125 rcu_qsctr_help(&rcu_bh_ctrlblk)) 111 - raise_softirq(RCU_SOFTIRQ); 126 + invoke_rcu_kthread(); 112 127 } 113 128 114 129 /* ··· 117 132 void rcu_bh_qs(int cpu) 118 133 { 119 134 if (rcu_qsctr_help(&rcu_bh_ctrlblk)) 120 - raise_softirq(RCU_SOFTIRQ); 135 + invoke_rcu_kthread(); 121 136 } 122 137 123 138 /* ··· 137 152 } 138 153 139 154 /* 140 - * Helper function for rcu_process_callbacks() that operates on the 141 - * specified rcu_ctrlkblk structure. 155 + * Invoke the RCU callbacks on the specified rcu_ctrlblk structure 156 + * whose grace period has elapsed. 142 157 */ 143 - static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp) 158 + static void rcu_process_callbacks(struct rcu_ctrlblk *rcp) 144 159 { 145 160 struct rcu_head *next, *list; 146 161 unsigned long flags; 162 + RCU_TRACE(int cb_count = 0); 147 163 148 164 /* If no RCU callbacks ready to invoke, just return. */ 149 165 if (&rcp->rcucblist == rcp->donetail) ··· 166 180 next = list->next; 167 181 prefetch(next); 168 182 debug_rcu_head_unqueue(list); 183 + local_bh_disable(); 169 184 list->func(list); 185 + local_bh_enable(); 170 186 list = next; 187 + RCU_TRACE(cb_count++); 171 188 } 189 + RCU_TRACE(rcu_trace_sub_qlen(rcp, cb_count)); 172 190 } 173 191 174 192 /* 175 - * Invoke any callbacks whose grace period has completed. 193 + * This kthread invokes RCU callbacks whose grace periods have 194 + * elapsed. It is awakened as needed, and takes the place of the 195 + * RCU_SOFTIRQ that was used previously for this purpose.
196 + * This is a kthread, but it is never stopped, at least not until 197 + * the system goes down. 176 198 */ 177 - static void rcu_process_callbacks(struct softirq_action *unused) 199 + static int rcu_kthread(void *arg) 178 200 { 179 - __rcu_process_callbacks(&rcu_sched_ctrlblk); 180 - __rcu_process_callbacks(&rcu_bh_ctrlblk); 181 - rcu_preempt_process_callbacks(); 201 + unsigned long work; 202 + unsigned long morework; 203 + unsigned long flags; 204 + 205 + for (;;) { 206 + wait_event(rcu_kthread_wq, have_rcu_kthread_work != 0); 207 + morework = rcu_boost(); 208 + local_irq_save(flags); 209 + work = have_rcu_kthread_work; 210 + have_rcu_kthread_work = morework; 211 + local_irq_restore(flags); 212 + if (work) { 213 + rcu_process_callbacks(&rcu_sched_ctrlblk); 214 + rcu_process_callbacks(&rcu_bh_ctrlblk); 215 + rcu_preempt_process_callbacks(); 216 + } 217 + schedule_timeout_interruptible(1); /* Leave CPU for others. */ 218 + } 219 + 220 + return 0; /* Not reached, but needed to shut gcc up. */ 221 + } 222 + 223 + /* 224 + * Wake up rcu_kthread() to process callbacks now eligible for invocation 225 + * or to boost readers. 226 + */ 227 + static void invoke_rcu_kthread(void) 228 + { 229 + unsigned long flags; 230 + 231 + local_irq_save(flags); 232 + have_rcu_kthread_work = 1; 233 + wake_up(&rcu_kthread_wq); 234 + local_irq_restore(flags); 182 235 } 183 236 184 237 /* ··· 255 230 local_irq_save(flags); 256 231 *rcp->curtail = head; 257 232 rcp->curtail = &head->next; 233 + RCU_TRACE(rcp->qlen++); 258 234 local_irq_restore(flags); 259 235 } 260 236 ··· 308 282 } 309 283 EXPORT_SYMBOL_GPL(rcu_barrier_sched); 310 284 311 - void __init rcu_init(void) 285 + /* 286 + * Spawn the kthread that invokes RCU callbacks. 
287 + */ 288 + static int __init rcu_spawn_kthreads(void) 312 289 { 313 - open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); 290 + struct sched_param sp; 291 + 292 + rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread"); 293 + sp.sched_priority = RCU_BOOST_PRIO; 294 + sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp); 295 + return 0; 314 296 } 297 + early_initcall(rcu_spawn_kthreads);
+419 -14
kernel/rcutiny_plugin.h
··· 22 22 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> 23 23 */ 24 24 25 + #include <linux/kthread.h> 26 + #include <linux/debugfs.h> 27 + #include <linux/seq_file.h> 28 + 29 + #ifdef CONFIG_RCU_TRACE 30 + #define RCU_TRACE(stmt) stmt 31 + #else /* #ifdef CONFIG_RCU_TRACE */ 32 + #define RCU_TRACE(stmt) 33 + #endif /* #else #ifdef CONFIG_RCU_TRACE */ 34 + 35 + /* Global control variables for rcupdate callback mechanism. */ 36 + struct rcu_ctrlblk { 37 + struct rcu_head *rcucblist; /* List of pending callbacks (CBs). */ 38 + struct rcu_head **donetail; /* ->next pointer of last "done" CB. */ 39 + struct rcu_head **curtail; /* ->next pointer of last CB. */ 40 + RCU_TRACE(long qlen); /* Number of pending CBs. */ 41 + }; 42 + 43 + /* Definition for rcupdate control block. */ 44 + static struct rcu_ctrlblk rcu_sched_ctrlblk = { 45 + .donetail = &rcu_sched_ctrlblk.rcucblist, 46 + .curtail = &rcu_sched_ctrlblk.rcucblist, 47 + }; 48 + 49 + static struct rcu_ctrlblk rcu_bh_ctrlblk = { 50 + .donetail = &rcu_bh_ctrlblk.rcucblist, 51 + .curtail = &rcu_bh_ctrlblk.rcucblist, 52 + }; 53 + 54 + #ifdef CONFIG_DEBUG_LOCK_ALLOC 55 + int rcu_scheduler_active __read_mostly; 56 + EXPORT_SYMBOL_GPL(rcu_scheduler_active); 57 + #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ 58 + 25 59 #ifdef CONFIG_TINY_PREEMPT_RCU 26 60 27 61 #include <linux/delay.h> ··· 80 46 struct list_head *gp_tasks; 81 47 /* Pointer to the first task blocking the */ 82 48 /* current grace period, or NULL if there */ 83 - /* is not such task. */ 49 + /* is no such task. */ 84 50 struct list_head *exp_tasks; 85 51 /* Pointer to first task blocking the */ 86 52 /* current expedited grace period, or NULL */ 87 53 /* if there is no such task. If there */ 88 54 /* is no current expedited grace period, */ 89 55 /* then there cannot be any such task. 
*/ 56 + #ifdef CONFIG_RCU_BOOST 57 + struct list_head *boost_tasks; 58 + /* Pointer to first task that needs to be */ 59 + /* priority-boosted, or NULL if no priority */ 60 + /* boosting is needed. If there is no */ 61 + /* current or expedited grace period, there */ 62 + /* can be no such task. */ 63 + #endif /* #ifdef CONFIG_RCU_BOOST */ 90 64 u8 gpnum; /* Current grace period. */ 91 65 u8 gpcpu; /* Last grace period blocked by the CPU. */ 92 66 u8 completed; /* Last grace period completed. */ 93 67 /* If all three are equal, RCU is idle. */ 68 + #ifdef CONFIG_RCU_BOOST 69 + s8 boosted_this_gp; /* Has boosting already happened? */ 70 + unsigned long boost_time; /* When to start boosting (jiffies) */ 71 + #endif /* #ifdef CONFIG_RCU_BOOST */ 72 + #ifdef CONFIG_RCU_TRACE 73 + unsigned long n_grace_periods; 74 + #ifdef CONFIG_RCU_BOOST 75 + unsigned long n_tasks_boosted; 76 + unsigned long n_exp_boosts; 77 + unsigned long n_normal_boosts; 78 + unsigned long n_normal_balk_blkd_tasks; 79 + unsigned long n_normal_balk_gp_tasks; 80 + unsigned long n_normal_balk_boost_tasks; 81 + unsigned long n_normal_balk_boosted; 82 + unsigned long n_normal_balk_notyet; 83 + unsigned long n_normal_balk_nos; 84 + unsigned long n_exp_balk_blkd_tasks; 85 + unsigned long n_exp_balk_nos; 86 + #endif /* #ifdef CONFIG_RCU_BOOST */ 87 + #endif /* #ifdef CONFIG_RCU_TRACE */ 94 88 }; 95 89 96 90 static struct rcu_preempt_ctrlblk rcu_preempt_ctrlblk = { ··· 184 122 } 185 123 186 124 /* 125 + * Advance a ->blkd_tasks-list pointer to the next entry, instead 126 + * returning NULL if at the end of the list. 
127 + */ 128 + static struct list_head *rcu_next_node_entry(struct task_struct *t) 129 + { 130 + struct list_head *np; 131 + 132 + np = t->rcu_node_entry.next; 133 + if (np == &rcu_preempt_ctrlblk.blkd_tasks) 134 + np = NULL; 135 + return np; 136 + } 137 + 138 + #ifdef CONFIG_RCU_TRACE 139 + 140 + #ifdef CONFIG_RCU_BOOST 141 + static void rcu_initiate_boost_trace(void); 142 + static void rcu_initiate_exp_boost_trace(void); 143 + #endif /* #ifdef CONFIG_RCU_BOOST */ 144 + 145 + /* 146 + * Dump additional statistice for TINY_PREEMPT_RCU. 147 + */ 148 + static void show_tiny_preempt_stats(struct seq_file *m) 149 + { 150 + seq_printf(m, "rcu_preempt: qlen=%ld gp=%lu g%u/p%u/c%u tasks=%c%c%c\n", 151 + rcu_preempt_ctrlblk.rcb.qlen, 152 + rcu_preempt_ctrlblk.n_grace_periods, 153 + rcu_preempt_ctrlblk.gpnum, 154 + rcu_preempt_ctrlblk.gpcpu, 155 + rcu_preempt_ctrlblk.completed, 156 + "T."[list_empty(&rcu_preempt_ctrlblk.blkd_tasks)], 157 + "N."[!rcu_preempt_ctrlblk.gp_tasks], 158 + "E."[!rcu_preempt_ctrlblk.exp_tasks]); 159 + #ifdef CONFIG_RCU_BOOST 160 + seq_printf(m, " ttb=%c btg=", 161 + "B."[!rcu_preempt_ctrlblk.boost_tasks]); 162 + switch (rcu_preempt_ctrlblk.boosted_this_gp) { 163 + case -1: 164 + seq_puts(m, "exp"); 165 + break; 166 + case 0: 167 + seq_puts(m, "no"); 168 + break; 169 + case 1: 170 + seq_puts(m, "begun"); 171 + break; 172 + case 2: 173 + seq_puts(m, "done"); 174 + break; 175 + default: 176 + seq_printf(m, "?%d?", rcu_preempt_ctrlblk.boosted_this_gp); 177 + } 178 + seq_printf(m, " ntb=%lu neb=%lu nnb=%lu j=%04x bt=%04x\n", 179 + rcu_preempt_ctrlblk.n_tasks_boosted, 180 + rcu_preempt_ctrlblk.n_exp_boosts, 181 + rcu_preempt_ctrlblk.n_normal_boosts, 182 + (int)(jiffies & 0xffff), 183 + (int)(rcu_preempt_ctrlblk.boost_time & 0xffff)); 184 + seq_printf(m, " %s: nt=%lu gt=%lu bt=%lu b=%lu ny=%lu nos=%lu\n", 185 + "normal balk", 186 + rcu_preempt_ctrlblk.n_normal_balk_blkd_tasks, 187 + rcu_preempt_ctrlblk.n_normal_balk_gp_tasks, 188 + 
rcu_preempt_ctrlblk.n_normal_balk_boost_tasks, 189 + rcu_preempt_ctrlblk.n_normal_balk_boosted, 190 + rcu_preempt_ctrlblk.n_normal_balk_notyet, 191 + rcu_preempt_ctrlblk.n_normal_balk_nos); 192 + seq_printf(m, " exp balk: bt=%lu nos=%lu\n", 193 + rcu_preempt_ctrlblk.n_exp_balk_blkd_tasks, 194 + rcu_preempt_ctrlblk.n_exp_balk_nos); 195 + #endif /* #ifdef CONFIG_RCU_BOOST */ 196 + } 197 + 198 + #endif /* #ifdef CONFIG_RCU_TRACE */ 199 + 200 + #ifdef CONFIG_RCU_BOOST 201 + 202 + #include "rtmutex_common.h" 203 + 204 + /* 205 + * Carry out RCU priority boosting on the task indicated by ->boost_tasks, 206 + * and advance ->boost_tasks to the next task in the ->blkd_tasks list. 207 + */ 208 + static int rcu_boost(void) 209 + { 210 + unsigned long flags; 211 + struct rt_mutex mtx; 212 + struct list_head *np; 213 + struct task_struct *t; 214 + 215 + if (rcu_preempt_ctrlblk.boost_tasks == NULL) 216 + return 0; /* Nothing to boost. */ 217 + raw_local_irq_save(flags); 218 + rcu_preempt_ctrlblk.boosted_this_gp++; 219 + t = container_of(rcu_preempt_ctrlblk.boost_tasks, struct task_struct, 220 + rcu_node_entry); 221 + np = rcu_next_node_entry(t); 222 + rt_mutex_init_proxy_locked(&mtx, t); 223 + t->rcu_boost_mutex = &mtx; 224 + t->rcu_read_unlock_special |= RCU_READ_UNLOCK_BOOSTED; 225 + raw_local_irq_restore(flags); 226 + rt_mutex_lock(&mtx); 227 + RCU_TRACE(rcu_preempt_ctrlblk.n_tasks_boosted++); 228 + rcu_preempt_ctrlblk.boosted_this_gp++; 229 + rt_mutex_unlock(&mtx); 230 + return rcu_preempt_ctrlblk.boost_tasks != NULL; 231 + } 232 + 233 + /* 234 + * Check to see if it is now time to start boosting RCU readers blocking 235 + * the current grace period, and, if so, tell the rcu_kthread_task to 236 + * start boosting them. If there is an expedited boost in progress, 237 + * we wait for it to complete. 238 + * 239 + * If there are no blocked readers blocking the current grace period, 240 + * return 0 to let the caller know, otherwise return 1. 
Note that this 241 + * return value is independent of whether or not boosting was done. 242 + */ 243 + static int rcu_initiate_boost(void) 244 + { 245 + if (!rcu_preempt_blocked_readers_cgp()) { 246 + RCU_TRACE(rcu_preempt_ctrlblk.n_normal_balk_blkd_tasks++); 247 + return 0; 248 + } 249 + if (rcu_preempt_ctrlblk.gp_tasks != NULL && 250 + rcu_preempt_ctrlblk.boost_tasks == NULL && 251 + rcu_preempt_ctrlblk.boosted_this_gp == 0 && 252 + ULONG_CMP_GE(jiffies, rcu_preempt_ctrlblk.boost_time)) { 253 + rcu_preempt_ctrlblk.boost_tasks = rcu_preempt_ctrlblk.gp_tasks; 254 + invoke_rcu_kthread(); 255 + RCU_TRACE(rcu_preempt_ctrlblk.n_normal_boosts++); 256 + } else 257 + RCU_TRACE(rcu_initiate_boost_trace()); 258 + return 1; 259 + } 260 + 261 + /* 262 + * Initiate boosting for an expedited grace period. 263 + */ 264 + static void rcu_initiate_expedited_boost(void) 265 + { 266 + unsigned long flags; 267 + 268 + raw_local_irq_save(flags); 269 + if (!list_empty(&rcu_preempt_ctrlblk.blkd_tasks)) { 270 + rcu_preempt_ctrlblk.boost_tasks = 271 + rcu_preempt_ctrlblk.blkd_tasks.next; 272 + rcu_preempt_ctrlblk.boosted_this_gp = -1; 273 + invoke_rcu_kthread(); 274 + RCU_TRACE(rcu_preempt_ctrlblk.n_exp_boosts++); 275 + } else 276 + RCU_TRACE(rcu_initiate_exp_boost_trace()); 277 + raw_local_irq_restore(flags); 278 + } 279 + 280 + #define RCU_BOOST_DELAY_JIFFIES DIV_ROUND_UP(CONFIG_RCU_BOOST_DELAY * HZ, 1000); 281 + 282 + /* 283 + * Do priority-boost accounting for the start of a new grace period. 284 + */ 285 + static void rcu_preempt_boost_start_gp(void) 286 + { 287 + rcu_preempt_ctrlblk.boost_time = jiffies + RCU_BOOST_DELAY_JIFFIES; 288 + if (rcu_preempt_ctrlblk.boosted_this_gp > 0) 289 + rcu_preempt_ctrlblk.boosted_this_gp = 0; 290 + } 291 + 292 + #else /* #ifdef CONFIG_RCU_BOOST */ 293 + 294 + /* 295 + * If there is no RCU priority boosting, we don't boost. 
296 + */ 297 + static int rcu_boost(void) 298 + { 299 + return 0; 300 + } 301 + 302 + /* 303 + * If there is no RCU priority boosting, we don't initiate boosting, 304 + * but we do indicate whether there are blocked readers blocking the 305 + * current grace period. 306 + */ 307 + static int rcu_initiate_boost(void) 308 + { 309 + return rcu_preempt_blocked_readers_cgp(); 310 + } 311 + 312 + /* 313 + * If there is no RCU priority boosting, we don't initiate expedited boosting. 314 + */ 315 + static void rcu_initiate_expedited_boost(void) 316 + { 317 + } 318 + 319 + /* 320 + * If there is no RCU priority boosting, nothing to do at grace-period start. 321 + */ 322 + static void rcu_preempt_boost_start_gp(void) 323 + { 324 + } 325 + 326 + #endif /* else #ifdef CONFIG_RCU_BOOST */ 327 + 328 + /* 187 329 * Record a preemptible-RCU quiescent state for the specified CPU. Note 188 330 * that this just means that the task currently running on the CPU is 189 331 * in a quiescent state. There might be any number of tasks blocked ··· 414 148 rcu_preempt_ctrlblk.gpcpu = rcu_preempt_ctrlblk.gpnum; 415 149 current->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_NEED_QS; 416 150 151 + /* If there is no GP then there is nothing more to do. */ 152 + if (!rcu_preempt_gp_in_progress()) 153 + return; 417 154 /* 418 - * If there is no GP, or if blocked readers are still blocking GP, 419 - * then there is nothing more to do. 155 + * Check up on boosting. If there are no readers blocking the 156 + * current grace period, leave. 420 157 */ 421 - if (!rcu_preempt_gp_in_progress() || rcu_preempt_blocked_readers_cgp()) 158 + if (rcu_initiate_boost()) 422 159 return; 423 160 424 161 /* Advance callbacks. */ ··· 433 164 if (!rcu_preempt_blocked_readers_any()) 434 165 rcu_preempt_ctrlblk.rcb.donetail = rcu_preempt_ctrlblk.nexttail; 435 166 436 - /* If there are done callbacks, make RCU_SOFTIRQ process them. */ 167 + /* If there are done callbacks, cause them to be invoked. 
*/ 437 168 if (*rcu_preempt_ctrlblk.rcb.donetail != NULL) 438 - raise_softirq(RCU_SOFTIRQ); 169 + invoke_rcu_kthread(); 439 170 } 440 171 441 172 /* ··· 447 178 448 179 /* Official start of GP. */ 449 180 rcu_preempt_ctrlblk.gpnum++; 181 + RCU_TRACE(rcu_preempt_ctrlblk.n_grace_periods++); 450 182 451 183 /* Any blocked RCU readers block new GP. */ 452 184 if (rcu_preempt_blocked_readers_any()) 453 185 rcu_preempt_ctrlblk.gp_tasks = 454 186 rcu_preempt_ctrlblk.blkd_tasks.next; 187 + 188 + /* Set up for RCU priority boosting. */ 189 + rcu_preempt_boost_start_gp(); 455 190 456 191 /* If there is no running reader, CPU is done with GP. */ 457 192 if (!rcu_preempt_running_reader()) ··· 577 304 */ 578 305 empty = !rcu_preempt_blocked_readers_cgp(); 579 306 empty_exp = rcu_preempt_ctrlblk.exp_tasks == NULL; 580 - np = t->rcu_node_entry.next; 581 - if (np == &rcu_preempt_ctrlblk.blkd_tasks) 582 - np = NULL; 307 + np = rcu_next_node_entry(t); 583 308 list_del(&t->rcu_node_entry); 584 309 if (&t->rcu_node_entry == rcu_preempt_ctrlblk.gp_tasks) 585 310 rcu_preempt_ctrlblk.gp_tasks = np; 586 311 if (&t->rcu_node_entry == rcu_preempt_ctrlblk.exp_tasks) 587 312 rcu_preempt_ctrlblk.exp_tasks = np; 313 + #ifdef CONFIG_RCU_BOOST 314 + if (&t->rcu_node_entry == rcu_preempt_ctrlblk.boost_tasks) 315 + rcu_preempt_ctrlblk.boost_tasks = np; 316 + #endif /* #ifdef CONFIG_RCU_BOOST */ 588 317 INIT_LIST_HEAD(&t->rcu_node_entry); 589 318 590 319 /* ··· 606 331 if (!empty_exp && rcu_preempt_ctrlblk.exp_tasks == NULL) 607 332 rcu_report_exp_done(); 608 333 } 334 + #ifdef CONFIG_RCU_BOOST 335 + /* Unboost self if was boosted. 
*/ 336 + if (special & RCU_READ_UNLOCK_BOOSTED) { 337 + t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BOOSTED; 338 + rt_mutex_unlock(t->rcu_boost_mutex); 339 + t->rcu_boost_mutex = NULL; 340 + } 341 + #endif /* #ifdef CONFIG_RCU_BOOST */ 609 342 local_irq_restore(flags); 610 343 } 611 344 ··· 657 374 rcu_preempt_cpu_qs(); 658 375 if (&rcu_preempt_ctrlblk.rcb.rcucblist != 659 376 rcu_preempt_ctrlblk.rcb.donetail) 660 - raise_softirq(RCU_SOFTIRQ); 377 + invoke_rcu_kthread(); 661 378 if (rcu_preempt_gp_in_progress() && 662 379 rcu_cpu_blocking_cur_gp() && 663 380 rcu_preempt_running_reader()) ··· 666 383 667 384 /* 668 385 * TINY_PREEMPT_RCU has an extra callback-list tail pointer to 669 - * update, so this is invoked from __rcu_process_callbacks() to 386 + * update, so this is invoked from rcu_process_callbacks() to 670 387 * handle that case. Of course, it is invoked for all flavors of 671 388 * RCU, but RCU callbacks can appear only on one of the lists, and 672 389 * neither ->nexttail nor ->donetail can possibly be NULL, so there ··· 683 400 */ 684 401 static void rcu_preempt_process_callbacks(void) 685 402 { 686 - __rcu_process_callbacks(&rcu_preempt_ctrlblk.rcb); 403 + rcu_process_callbacks(&rcu_preempt_ctrlblk.rcb); 687 404 } 688 405 689 406 /* ··· 700 417 local_irq_save(flags); 701 418 *rcu_preempt_ctrlblk.nexttail = head; 702 419 rcu_preempt_ctrlblk.nexttail = &head->next; 420 + RCU_TRACE(rcu_preempt_ctrlblk.rcb.qlen++); 703 421 rcu_preempt_start_gp(); /* checks to see if GP needed. */ 704 422 local_irq_restore(flags); 705 423 } ··· 816 532 817 533 /* Wait for tail of ->blkd_tasks list to drain. 
*/ 818 534 if (rcu_preempted_readers_exp()) 535 + rcu_initiate_expedited_boost(); 819 536 wait_event(sync_rcu_preempt_exp_wq, 820 537 !rcu_preempted_readers_exp()); 821 538 ··· 857 572 858 573 #else /* #ifdef CONFIG_TINY_PREEMPT_RCU */ 859 574 575 + #ifdef CONFIG_RCU_TRACE 576 + 577 + /* 578 + * Because preemptible RCU does not exist, it is not necessary to 579 + * dump out its statistics. 580 + */ 581 + static void show_tiny_preempt_stats(struct seq_file *m) 582 + { 583 + } 584 + 585 + #endif /* #ifdef CONFIG_RCU_TRACE */ 586 + 587 + /* 588 + * Because preemptible RCU does not exist, it is never necessary to 589 + * boost preempted RCU readers. 590 + */ 591 + static int rcu_boost(void) 592 + { 593 + return 0; 594 + } 595 + 860 596 /* 861 597 * Because preemptible RCU does not exist, it never has any callbacks 862 598 * to check. ··· 905 599 #endif /* #else #ifdef CONFIG_TINY_PREEMPT_RCU */ 906 600 907 601 #ifdef CONFIG_DEBUG_LOCK_ALLOC 908 - 909 602 #include <linux/kernel_stat.h> 910 603 911 604 /* 912 605 * During boot, we forgive RCU lockdep issues. After this function is 913 606 * invoked, we start taking RCU lockdep issues seriously. 
914 607 */ 915 - void rcu_scheduler_starting(void) 608 + void __init rcu_scheduler_starting(void) 916 609 { 917 610 WARN_ON(nr_context_switches() > 0); 918 611 rcu_scheduler_active = 1; 919 612 } 920 613 921 614 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ 615 + 616 + #ifdef CONFIG_RCU_BOOST 617 + #define RCU_BOOST_PRIO CONFIG_RCU_BOOST_PRIO 618 + #else /* #ifdef CONFIG_RCU_BOOST */ 619 + #define RCU_BOOST_PRIO 1 620 + #endif /* #else #ifdef CONFIG_RCU_BOOST */ 621 + 622 + #ifdef CONFIG_RCU_TRACE 623 + 624 + #ifdef CONFIG_RCU_BOOST 625 + 626 + static void rcu_initiate_boost_trace(void) 627 + { 628 + if (rcu_preempt_ctrlblk.gp_tasks == NULL) 629 + rcu_preempt_ctrlblk.n_normal_balk_gp_tasks++; 630 + else if (rcu_preempt_ctrlblk.boost_tasks != NULL) 631 + rcu_preempt_ctrlblk.n_normal_balk_boost_tasks++; 632 + else if (rcu_preempt_ctrlblk.boosted_this_gp != 0) 633 + rcu_preempt_ctrlblk.n_normal_balk_boosted++; 634 + else if (!ULONG_CMP_GE(jiffies, rcu_preempt_ctrlblk.boost_time)) 635 + rcu_preempt_ctrlblk.n_normal_balk_notyet++; 636 + else 637 + rcu_preempt_ctrlblk.n_normal_balk_nos++; 638 + } 639 + 640 + static void rcu_initiate_exp_boost_trace(void) 641 + { 642 + if (list_empty(&rcu_preempt_ctrlblk.blkd_tasks)) 643 + rcu_preempt_ctrlblk.n_exp_balk_blkd_tasks++; 644 + else 645 + rcu_preempt_ctrlblk.n_exp_balk_nos++; 646 + } 647 + 648 + #endif /* #ifdef CONFIG_RCU_BOOST */ 649 + 650 + static void rcu_trace_sub_qlen(struct rcu_ctrlblk *rcp, int n) 651 + { 652 + unsigned long flags; 653 + 654 + raw_local_irq_save(flags); 655 + rcp->qlen -= n; 656 + raw_local_irq_restore(flags); 657 + } 658 + 659 + /* 660 + * Dump statistics for TINY_RCU, such as they are. 
661 + */ 662 + static int show_tiny_stats(struct seq_file *m, void *unused) 663 + { 664 + show_tiny_preempt_stats(m); 665 + seq_printf(m, "rcu_sched: qlen: %ld\n", rcu_sched_ctrlblk.qlen); 666 + seq_printf(m, "rcu_bh: qlen: %ld\n", rcu_bh_ctrlblk.qlen); 667 + return 0; 668 + } 669 + 670 + static int show_tiny_stats_open(struct inode *inode, struct file *file) 671 + { 672 + return single_open(file, show_tiny_stats, NULL); 673 + } 674 + 675 + static const struct file_operations show_tiny_stats_fops = { 676 + .owner = THIS_MODULE, 677 + .open = show_tiny_stats_open, 678 + .read = seq_read, 679 + .llseek = seq_lseek, 680 + .release = single_release, 681 + }; 682 + 683 + static struct dentry *rcudir; 684 + 685 + static int __init rcutiny_trace_init(void) 686 + { 687 + struct dentry *retval; 688 + 689 + rcudir = debugfs_create_dir("rcu", NULL); 690 + if (!rcudir) 691 + goto free_out; 692 + retval = debugfs_create_file("rcudata", 0444, rcudir, 693 + NULL, &show_tiny_stats_fops); 694 + if (!retval) 695 + goto free_out; 696 + return 0; 697 + free_out: 698 + debugfs_remove_recursive(rcudir); 699 + return 1; 700 + } 701 + 702 + static void __exit rcutiny_trace_cleanup(void) 703 + { 704 + debugfs_remove_recursive(rcudir); 705 + } 706 + 707 + module_init(rcutiny_trace_init); 708 + module_exit(rcutiny_trace_cleanup); 709 + 710 + MODULE_AUTHOR("Paul E. McKenney"); 711 + MODULE_DESCRIPTION("Read-Copy Update tracing for tiny implementation"); 712 + MODULE_LICENSE("GPL"); 713 + 714 + #endif /* #ifdef CONFIG_RCU_TRACE */
+259 -11
kernel/rcutorture.c
··· 47 47 #include <linux/srcu.h> 48 48 #include <linux/slab.h> 49 49 #include <asm/byteorder.h> 50 + #include <linux/sched.h> 50 51 51 52 MODULE_LICENSE("GPL"); 52 53 MODULE_AUTHOR("Paul E. McKenney <paulmck@us.ibm.com> and " ··· 65 64 static int fqs_duration = 0; /* Duration of bursts (us), 0 to disable. */ 66 65 static int fqs_holdoff = 0; /* Hold time within burst (us). */ 67 66 static int fqs_stutter = 3; /* Wait time between bursts (s). */ 67 + static int test_boost = 1; /* Test RCU prio boost: 0=no, 1=maybe, 2=yes. */ 68 + static int test_boost_interval = 7; /* Interval between boost tests, seconds. */ 69 + static int test_boost_duration = 4; /* Duration of each boost test, seconds. */ 68 70 static char *torture_type = "rcu"; /* What RCU implementation to torture. */ 69 71 70 72 module_param(nreaders, int, 0444); ··· 92 88 MODULE_PARM_DESC(fqs_holdoff, "Holdoff time within fqs bursts (us)"); 93 89 module_param(fqs_stutter, int, 0444); 94 90 MODULE_PARM_DESC(fqs_stutter, "Wait time between fqs bursts (s)"); 91 + module_param(test_boost, int, 0444); 92 + MODULE_PARM_DESC(test_boost, "Test RCU prio boost: 0=no, 1=maybe, 2=yes."); 93 + module_param(test_boost_interval, int, 0444); 94 + MODULE_PARM_DESC(test_boost_interval, "Interval between boost tests, seconds."); 95 + module_param(test_boost_duration, int, 0444); 96 + MODULE_PARM_DESC(test_boost_duration, "Duration of each boost test, seconds."); 95 97 module_param(torture_type, charp, 0444); 96 98 MODULE_PARM_DESC(torture_type, "Type of RCU to torture (rcu, rcu_bh, srcu)"); 97 99 ··· 119 109 static struct task_struct *shuffler_task; 120 110 static struct task_struct *stutter_task; 121 111 static struct task_struct *fqs_task; 112 + static struct task_struct *boost_tasks[NR_CPUS]; 122 113 123 114 #define RCU_TORTURE_PIPE_LEN 10 124 115 ··· 145 134 static atomic_t n_rcu_torture_free; 146 135 static atomic_t n_rcu_torture_mberror; 147 136 static atomic_t n_rcu_torture_error; 137 + static long 
n_rcu_torture_boost_ktrerror; 138 + static long n_rcu_torture_boost_rterror; 139 + static long n_rcu_torture_boost_allocerror; 140 + static long n_rcu_torture_boost_afferror; 141 + static long n_rcu_torture_boost_failure; 142 + static long n_rcu_torture_boosts; 148 143 static long n_rcu_torture_timers; 149 144 static struct list_head rcu_torture_removed; 150 145 static cpumask_var_t shuffle_tmp_mask; ··· 163 146 #define RCUTORTURE_RUNNABLE_INIT 0 164 147 #endif 165 148 int rcutorture_runnable = RCUTORTURE_RUNNABLE_INIT; 149 + 150 + #ifdef CONFIG_RCU_BOOST 151 + #define rcu_can_boost() 1 152 + #else /* #ifdef CONFIG_RCU_BOOST */ 153 + #define rcu_can_boost() 0 154 + #endif /* #else #ifdef CONFIG_RCU_BOOST */ 155 + 156 + static unsigned long boost_starttime; /* jiffies of next boost test start. */ 157 + DEFINE_MUTEX(boost_mutex); /* protect setting boost_starttime */ 158 + /* and boost task create/destroy. */ 166 159 167 160 /* Mediate rmmod and system shutdown. Concurrent rmmod & shutdown illegal! */ 168 161 ··· 304 277 void (*fqs)(void); 305 278 int (*stats)(char *page); 306 279 int irq_capable; 280 + int can_boost; 307 281 char *name; 308 282 }; 309 283 ··· 394 366 .fqs = rcu_force_quiescent_state, 395 367 .stats = NULL, 396 368 .irq_capable = 1, 369 + .can_boost = rcu_can_boost(), 397 370 .name = "rcu" 398 371 }; 399 372 ··· 437 408 .fqs = rcu_force_quiescent_state, 438 409 .stats = NULL, 439 410 .irq_capable = 1, 411 + .can_boost = rcu_can_boost(), 440 412 .name = "rcu_sync" 441 413 }; 442 414 ··· 454 424 .fqs = rcu_force_quiescent_state, 455 425 .stats = NULL, 456 426 .irq_capable = 1, 427 + .can_boost = rcu_can_boost(), 457 428 .name = "rcu_expedited" 458 429 }; 459 430 ··· 715 684 }; 716 685 717 686 /* 687 + * RCU torture priority-boost testing. Runs one real-time thread per 688 + * CPU for moderate bursts, repeatedly registering RCU callbacks and 689 + * spinning waiting for them to be invoked. 
If a given callback takes 690 + * too long to be invoked, we assume that priority inversion has occurred. 691 + */ 692 + 693 + struct rcu_boost_inflight { 694 + struct rcu_head rcu; 695 + int inflight; 696 + }; 697 + 698 + static void rcu_torture_boost_cb(struct rcu_head *head) 699 + { 700 + struct rcu_boost_inflight *rbip = 701 + container_of(head, struct rcu_boost_inflight, rcu); 702 + 703 + smp_mb(); /* Ensure RCU-core accesses precede clearing ->inflight */ 704 + rbip->inflight = 0; 705 + } 706 + 707 + static int rcu_torture_boost(void *arg) 708 + { 709 + unsigned long call_rcu_time; 710 + unsigned long endtime; 711 + unsigned long oldstarttime; 712 + struct rcu_boost_inflight rbi = { .inflight = 0 }; 713 + struct sched_param sp; 714 + 715 + VERBOSE_PRINTK_STRING("rcu_torture_boost started"); 716 + 717 + /* Set real-time priority. */ 718 + sp.sched_priority = 1; 719 + if (sched_setscheduler(current, SCHED_FIFO, &sp) < 0) { 720 + VERBOSE_PRINTK_STRING("rcu_torture_boost RT prio failed!"); 721 + n_rcu_torture_boost_rterror++; 722 + } 723 + 724 + /* Each pass through the following loop does one boost-test cycle. */ 725 + do { 726 + /* Wait for the next test interval. */ 727 + oldstarttime = boost_starttime; 728 + while (jiffies - oldstarttime > ULONG_MAX / 2) { 729 + schedule_timeout_uninterruptible(1); 730 + rcu_stutter_wait("rcu_torture_boost"); 731 + if (kthread_should_stop() || 732 + fullstop != FULLSTOP_DONTSTOP) 733 + goto checkwait; 734 + } 735 + 736 + /* Do one boost-test interval. */ 737 + endtime = oldstarttime + test_boost_duration * HZ; 738 + call_rcu_time = jiffies; 739 + while (jiffies - endtime > ULONG_MAX / 2) { 740 + /* If we don't have a callback in flight, post one. */ 741 + if (!rbi.inflight) { 742 + smp_mb(); /* RCU core before ->inflight = 1. 
*/ 743 + rbi.inflight = 1; 744 + call_rcu(&rbi.rcu, rcu_torture_boost_cb); 745 + if (jiffies - call_rcu_time > 746 + test_boost_duration * HZ - HZ / 2) { 747 + VERBOSE_PRINTK_STRING("rcu_torture_boost boosting failed"); 748 + n_rcu_torture_boost_failure++; 749 + } 750 + call_rcu_time = jiffies; 751 + } 752 + cond_resched(); 753 + rcu_stutter_wait("rcu_torture_boost"); 754 + if (kthread_should_stop() || 755 + fullstop != FULLSTOP_DONTSTOP) 756 + goto checkwait; 757 + } 758 + 759 + /* 760 + * Set the start time of the next test interval. 761 + * Yes, this is vulnerable to long delays, but such 762 + * delays simply cause a false negative for the next 763 + * interval. Besides, we are running at RT priority, 764 + * so delays should be relatively rare. 765 + */ 766 + while (oldstarttime == boost_starttime) { 767 + if (mutex_trylock(&boost_mutex)) { 768 + boost_starttime = jiffies + 769 + test_boost_interval * HZ; 770 + n_rcu_torture_boosts++; 771 + mutex_unlock(&boost_mutex); 772 + break; 773 + } 774 + schedule_timeout_uninterruptible(1); 775 + } 776 + 777 + /* Go do the stutter. */ 778 + checkwait: rcu_stutter_wait("rcu_torture_boost"); 779 + } while (!kthread_should_stop() && fullstop == FULLSTOP_DONTSTOP); 780 + 781 + /* Clean up and exit. */ 782 + VERBOSE_PRINTK_STRING("rcu_torture_boost task stopping"); 783 + rcutorture_shutdown_absorb("rcu_torture_boost"); 784 + while (!kthread_should_stop() || rbi.inflight) 785 + schedule_timeout_uninterruptible(1); 786 + smp_mb(); /* order accesses to ->inflight before stack-frame death. */ 787 + return 0; 788 + } 789 + 790 + /* 718 791 * RCU torture force-quiescent-state kthread. Repeatedly induces 719 792 * bursts of calls to force_quiescent_state(), increasing the probability 720 793 * of occurrence of some important types of race conditions. 
··· 1068 933 cnt += sprintf(&page[cnt], "%s%s ", torture_type, TORTURE_FLAG); 1069 934 cnt += sprintf(&page[cnt], 1070 935 "rtc: %p ver: %ld tfle: %d rta: %d rtaf: %d rtf: %d " 1071 - "rtmbe: %d nt: %ld", 936 + "rtmbe: %d rtbke: %ld rtbre: %ld rtbae: %ld rtbafe: %ld " 937 + "rtbf: %ld rtb: %ld nt: %ld", 1072 938 rcu_torture_current, 1073 939 rcu_torture_current_version, 1074 940 list_empty(&rcu_torture_freelist), ··· 1077 941 atomic_read(&n_rcu_torture_alloc_fail), 1078 942 atomic_read(&n_rcu_torture_free), 1079 943 atomic_read(&n_rcu_torture_mberror), 944 + n_rcu_torture_boost_ktrerror, 945 + n_rcu_torture_boost_rterror, 946 + n_rcu_torture_boost_allocerror, 947 + n_rcu_torture_boost_afferror, 948 + n_rcu_torture_boost_failure, 949 + n_rcu_torture_boosts, 1080 950 n_rcu_torture_timers); 1081 - if (atomic_read(&n_rcu_torture_mberror) != 0) 951 + if (atomic_read(&n_rcu_torture_mberror) != 0 || 952 + n_rcu_torture_boost_ktrerror != 0 || 953 + n_rcu_torture_boost_rterror != 0 || 954 + n_rcu_torture_boost_allocerror != 0 || 955 + n_rcu_torture_boost_afferror != 0 || 956 + n_rcu_torture_boost_failure != 0) 1082 957 cnt += sprintf(&page[cnt], " !!!"); 1083 958 cnt += sprintf(&page[cnt], "\n%s%s ", torture_type, TORTURE_FLAG); 1084 959 if (i > 1) { ··· 1241 1094 } 1242 1095 1243 1096 static inline void 1244 - rcu_torture_print_module_parms(char *tag) 1097 + rcu_torture_print_module_parms(struct rcu_torture_ops *cur_ops, char *tag) 1245 1098 { 1246 1099 printk(KERN_ALERT "%s" TORTURE_FLAG 1247 1100 "--- %s: nreaders=%d nfakewriters=%d " 1248 1101 "stat_interval=%d verbose=%d test_no_idle_hz=%d " 1249 1102 "shuffle_interval=%d stutter=%d irqreader=%d " 1250 - "fqs_duration=%d fqs_holdoff=%d fqs_stutter=%d\n", 1103 + "fqs_duration=%d fqs_holdoff=%d fqs_stutter=%d " 1104 + "test_boost=%d/%d test_boost_interval=%d " 1105 + "test_boost_duration=%d\n", 1251 1106 torture_type, tag, nrealreaders, nfakewriters, 1252 1107 stat_interval, verbose, test_no_idle_hz, shuffle_interval, 
1253 - stutter, irqreader, fqs_duration, fqs_holdoff, fqs_stutter); 1108 + stutter, irqreader, fqs_duration, fqs_holdoff, fqs_stutter, 1109 + test_boost, cur_ops->can_boost, 1110 + test_boost_interval, test_boost_duration); 1254 1111 } 1255 1112 1256 - static struct notifier_block rcutorture_nb = { 1113 + static struct notifier_block rcutorture_shutdown_nb = { 1257 1114 .notifier_call = rcutorture_shutdown_notify, 1115 + }; 1116 + 1117 + static void rcutorture_booster_cleanup(int cpu) 1118 + { 1119 + struct task_struct *t; 1120 + 1121 + if (boost_tasks[cpu] == NULL) 1122 + return; 1123 + mutex_lock(&boost_mutex); 1124 + VERBOSE_PRINTK_STRING("Stopping rcu_torture_boost task"); 1125 + t = boost_tasks[cpu]; 1126 + boost_tasks[cpu] = NULL; 1127 + mutex_unlock(&boost_mutex); 1128 + 1129 + /* This must be outside of the mutex, otherwise deadlock! */ 1130 + kthread_stop(t); 1131 + } 1132 + 1133 + static int rcutorture_booster_init(int cpu) 1134 + { 1135 + int retval; 1136 + 1137 + if (boost_tasks[cpu] != NULL) 1138 + return 0; /* Already created, nothing more to do. */ 1139 + 1140 + /* Don't allow time recalculation while creating a new task. 
*/ 1141 + mutex_lock(&boost_mutex); 1142 + VERBOSE_PRINTK_STRING("Creating rcu_torture_boost task"); 1143 + boost_tasks[cpu] = kthread_create(rcu_torture_boost, NULL, 1144 + "rcu_torture_boost"); 1145 + if (IS_ERR(boost_tasks[cpu])) { 1146 + retval = PTR_ERR(boost_tasks[cpu]); 1147 + VERBOSE_PRINTK_STRING("rcu_torture_boost task create failed"); 1148 + n_rcu_torture_boost_ktrerror++; 1149 + boost_tasks[cpu] = NULL; 1150 + mutex_unlock(&boost_mutex); 1151 + return retval; 1152 + } 1153 + kthread_bind(boost_tasks[cpu], cpu); 1154 + wake_up_process(boost_tasks[cpu]); 1155 + mutex_unlock(&boost_mutex); 1156 + return 0; 1157 + } 1158 + 1159 + static int rcutorture_cpu_notify(struct notifier_block *self, 1160 + unsigned long action, void *hcpu) 1161 + { 1162 + long cpu = (long)hcpu; 1163 + 1164 + switch (action) { 1165 + case CPU_ONLINE: 1166 + case CPU_DOWN_FAILED: 1167 + (void)rcutorture_booster_init(cpu); 1168 + break; 1169 + case CPU_DOWN_PREPARE: 1170 + rcutorture_booster_cleanup(cpu); 1171 + break; 1172 + default: 1173 + break; 1174 + } 1175 + return NOTIFY_OK; 1176 + } 1177 + 1178 + static struct notifier_block rcutorture_cpu_nb = { 1179 + .notifier_call = rcutorture_cpu_notify, 1258 1180 }; 1259 1181 1260 1182 static void ··· 1343 1127 } 1344 1128 fullstop = FULLSTOP_RMMOD; 1345 1129 mutex_unlock(&fullstop_mutex); 1346 - unregister_reboot_notifier(&rcutorture_nb); 1130 + unregister_reboot_notifier(&rcutorture_shutdown_nb); 1347 1131 if (stutter_task) { 1348 1132 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stutter task"); 1349 1133 kthread_stop(stutter_task); ··· 1400 1184 kthread_stop(fqs_task); 1401 1185 } 1402 1186 fqs_task = NULL; 1187 + if ((test_boost == 1 && cur_ops->can_boost) || 1188 + test_boost == 2) { 1189 + unregister_cpu_notifier(&rcutorture_cpu_nb); 1190 + for_each_possible_cpu(i) 1191 + rcutorture_booster_cleanup(i); 1192 + } 1403 1193 1404 1194 /* Wait for all RCU callbacks to fire. 
*/ 1405 1195 ··· 1417 1195 if (cur_ops->cleanup) 1418 1196 cur_ops->cleanup(); 1419 1197 if (atomic_read(&n_rcu_torture_error)) 1420 - rcu_torture_print_module_parms("End of test: FAILURE"); 1198 + rcu_torture_print_module_parms(cur_ops, "End of test: FAILURE"); 1421 1199 else 1422 - rcu_torture_print_module_parms("End of test: SUCCESS"); 1200 + rcu_torture_print_module_parms(cur_ops, "End of test: SUCCESS"); 1423 1201 } 1424 1202 1425 1203 static int __init ··· 1464 1242 nrealreaders = nreaders; 1465 1243 else 1466 1244 nrealreaders = 2 * num_online_cpus(); 1467 - rcu_torture_print_module_parms("Start of test"); 1245 + rcu_torture_print_module_parms(cur_ops, "Start of test"); 1468 1246 fullstop = FULLSTOP_DONTSTOP; 1469 1247 1470 1248 /* Set up the freelist. */ ··· 1485 1263 atomic_set(&n_rcu_torture_free, 0); 1486 1264 atomic_set(&n_rcu_torture_mberror, 0); 1487 1265 atomic_set(&n_rcu_torture_error, 0); 1266 + n_rcu_torture_boost_ktrerror = 0; 1267 + n_rcu_torture_boost_rterror = 0; 1268 + n_rcu_torture_boost_allocerror = 0; 1269 + n_rcu_torture_boost_afferror = 0; 1270 + n_rcu_torture_boost_failure = 0; 1271 + n_rcu_torture_boosts = 0; 1488 1272 for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) 1489 1273 atomic_set(&rcu_torture_wcount[i], 0); 1490 1274 for_each_possible_cpu(cpu) { ··· 1604 1376 goto unwind; 1605 1377 } 1606 1378 } 1607 - register_reboot_notifier(&rcutorture_nb); 1379 + if (test_boost_interval < 1) 1380 + test_boost_interval = 1; 1381 + if (test_boost_duration < 2) 1382 + test_boost_duration = 2; 1383 + if ((test_boost == 1 && cur_ops->can_boost) || 1384 + test_boost == 2) { 1385 + int retval; 1386 + 1387 + boost_starttime = jiffies + test_boost_interval * HZ; 1388 + register_cpu_notifier(&rcutorture_cpu_nb); 1389 + for_each_possible_cpu(i) { 1390 + if (cpu_is_offline(i)) 1391 + continue; /* Heuristic: CPU can go offline. 
*/ 1392 + retval = rcutorture_booster_init(i); 1393 + if (retval < 0) { 1394 + firsterr = retval; 1395 + goto unwind; 1396 + } 1397 + } 1398 + } 1399 + register_reboot_notifier(&rcutorture_shutdown_nb); 1608 1400 mutex_unlock(&fullstop_mutex); 1609 1401 return 0; 1610 1402
+75 -81
kernel/rcutree.c
··· 67 67 .gpnum = -300, \ 68 68 .completed = -300, \ 69 69 .onofflock = __RAW_SPIN_LOCK_UNLOCKED(&structname.onofflock), \ 70 - .orphan_cbs_list = NULL, \ 71 - .orphan_cbs_tail = &structname.orphan_cbs_list, \ 72 - .orphan_qlen = 0, \ 73 70 .fqslock = __RAW_SPIN_LOCK_UNLOCKED(&structname.fqslock), \ 74 71 .n_force_qs = 0, \ 75 72 .n_force_qs_ngp = 0, \ ··· 617 620 static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_data *rdp) 618 621 { 619 622 if (rdp->gpnum != rnp->gpnum) { 620 - rdp->qs_pending = 1; 621 - rdp->passed_quiesc = 0; 623 + /* 624 + * If the current grace period is waiting for this CPU, 625 + * set up to detect a quiescent state, otherwise don't 626 + * go looking for one. 627 + */ 622 628 rdp->gpnum = rnp->gpnum; 629 + if (rnp->qsmask & rdp->grpmask) { 630 + rdp->qs_pending = 1; 631 + rdp->passed_quiesc = 0; 632 + } else 633 + rdp->qs_pending = 0; 623 634 } 624 635 } 625 636 ··· 686 681 687 682 /* Remember that we saw this grace-period completion. */ 688 683 rdp->completed = rnp->completed; 684 + 685 + /* 686 + * If we were in an extended quiescent state, we may have 687 + * missed some grace periods that others CPUs handled on 688 + * our behalf. Catch up with this state to avoid noting 689 + * spurious new grace periods. If another grace period 690 + * has started, then rnp->gpnum will have advanced, so 691 + * we will detect this later on. 692 + */ 693 + if (ULONG_CMP_LT(rdp->gpnum, rdp->completed)) 694 + rdp->gpnum = rdp->completed; 695 + 696 + /* 697 + * If RCU does not need a quiescent state from this CPU, 698 + * then make sure that this CPU doesn't go looking for one. 699 + */ 700 + if ((rnp->qsmask & rdp->grpmask) == 0) 701 + rdp->qs_pending = 0; 689 702 } 690 703 } 691 704 ··· 1007 984 #ifdef CONFIG_HOTPLUG_CPU 1008 985 1009 986 /* 1010 - * Move a dying CPU's RCU callbacks to the ->orphan_cbs_list for the 1011 - * specified flavor of RCU. 
The callbacks will be adopted by the next 1012 - * _rcu_barrier() invocation or by the CPU_DEAD notifier, whichever 1013 - * comes first. Because this is invoked from the CPU_DYING notifier, 1014 - * irqs are already disabled. 987 + * Move a dying CPU's RCU callbacks to online CPU's callback list. 988 + * Synchronization is not required because this function executes 989 + * in stop_machine() context. 1015 990 */ 1016 - static void rcu_send_cbs_to_orphanage(struct rcu_state *rsp) 991 + static void rcu_send_cbs_to_online(struct rcu_state *rsp) 1017 992 { 1018 993 int i; 994 + /* current DYING CPU is cleared in the cpu_online_mask */ 995 + int receive_cpu = cpumask_any(cpu_online_mask); 1019 996 struct rcu_data *rdp = this_cpu_ptr(rsp->rda); 997 + struct rcu_data *receive_rdp = per_cpu_ptr(rsp->rda, receive_cpu); 1020 998 1021 999 if (rdp->nxtlist == NULL) 1022 1000 return; /* irqs disabled, so comparison is stable. */ 1023 - raw_spin_lock(&rsp->onofflock); /* irqs already disabled. */ 1024 - *rsp->orphan_cbs_tail = rdp->nxtlist; 1025 - rsp->orphan_cbs_tail = rdp->nxttail[RCU_NEXT_TAIL]; 1001 + 1002 + *receive_rdp->nxttail[RCU_NEXT_TAIL] = rdp->nxtlist; 1003 + receive_rdp->nxttail[RCU_NEXT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL]; 1004 + receive_rdp->qlen += rdp->qlen; 1005 + receive_rdp->n_cbs_adopted += rdp->qlen; 1006 + rdp->n_cbs_orphaned += rdp->qlen; 1007 + 1026 1008 rdp->nxtlist = NULL; 1027 1009 for (i = 0; i < RCU_NEXT_SIZE; i++) 1028 1010 rdp->nxttail[i] = &rdp->nxtlist; 1029 - rsp->orphan_qlen += rdp->qlen; 1030 - rdp->n_cbs_orphaned += rdp->qlen; 1031 1011 rdp->qlen = 0; 1032 - raw_spin_unlock(&rsp->onofflock); /* irqs remain disabled. */ 1033 - } 1034 - 1035 - /* 1036 - * Adopt previously orphaned RCU callbacks. 
1037 - */ 1038 - static void rcu_adopt_orphan_cbs(struct rcu_state *rsp) 1039 - { 1040 - unsigned long flags; 1041 - struct rcu_data *rdp; 1042 - 1043 - raw_spin_lock_irqsave(&rsp->onofflock, flags); 1044 - rdp = this_cpu_ptr(rsp->rda); 1045 - if (rsp->orphan_cbs_list == NULL) { 1046 - raw_spin_unlock_irqrestore(&rsp->onofflock, flags); 1047 - return; 1048 - } 1049 - *rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_cbs_list; 1050 - rdp->nxttail[RCU_NEXT_TAIL] = rsp->orphan_cbs_tail; 1051 - rdp->qlen += rsp->orphan_qlen; 1052 - rdp->n_cbs_adopted += rsp->orphan_qlen; 1053 - rsp->orphan_cbs_list = NULL; 1054 - rsp->orphan_cbs_tail = &rsp->orphan_cbs_list; 1055 - rsp->orphan_qlen = 0; 1056 - raw_spin_unlock_irqrestore(&rsp->onofflock, flags); 1057 1012 } 1058 1013 1059 1014 /* ··· 1082 1081 raw_spin_unlock_irqrestore(&rnp->lock, flags); 1083 1082 if (need_report & RCU_OFL_TASKS_EXP_GP) 1084 1083 rcu_report_exp_rnp(rsp, rnp); 1085 - 1086 - rcu_adopt_orphan_cbs(rsp); 1087 1084 } 1088 1085 1089 1086 /* ··· 1099 1100 1100 1101 #else /* #ifdef CONFIG_HOTPLUG_CPU */ 1101 1102 1102 - static void rcu_send_cbs_to_orphanage(struct rcu_state *rsp) 1103 - { 1104 - } 1105 - 1106 - static void rcu_adopt_orphan_cbs(struct rcu_state *rsp) 1103 + static void rcu_send_cbs_to_online(struct rcu_state *rsp) 1107 1104 { 1108 1105 } 1109 1106 ··· 1435 1440 */ 1436 1441 local_irq_save(flags); 1437 1442 rdp = this_cpu_ptr(rsp->rda); 1438 - rcu_process_gp_end(rsp, rdp); 1439 - check_for_new_grace_period(rsp, rdp); 1440 1443 1441 1444 /* Add the callback to our list. */ 1442 1445 *rdp->nxttail[RCU_NEXT_TAIL] = head; 1443 1446 rdp->nxttail[RCU_NEXT_TAIL] = &head->next; 1444 - 1445 - /* Start a new grace period if one not already started. */ 1446 - if (!rcu_gp_in_progress(rsp)) { 1447 - unsigned long nestflag; 1448 - struct rcu_node *rnp_root = rcu_get_root(rsp); 1449 - 1450 - raw_spin_lock_irqsave(&rnp_root->lock, nestflag); 1451 - rcu_start_gp(rsp, nestflag); /* releases rnp_root->lock. 
*/ 1452 - } 1453 1447 1454 1448 /* 1455 1449 * Force the grace period if too many callbacks or too long waiting. ··· 1448 1464 * is the only one waiting for a grace period to complete. 1449 1465 */ 1450 1466 if (unlikely(++rdp->qlen > rdp->qlen_last_fqs_check + qhimark)) { 1451 - rdp->blimit = LONG_MAX; 1452 - if (rsp->n_force_qs == rdp->n_force_qs_snap && 1453 - *rdp->nxttail[RCU_DONE_TAIL] != head) 1454 - force_quiescent_state(rsp, 0); 1455 - rdp->n_force_qs_snap = rsp->n_force_qs; 1456 - rdp->qlen_last_fqs_check = rdp->qlen; 1467 + 1468 + /* Are we ignoring a completed grace period? */ 1469 + rcu_process_gp_end(rsp, rdp); 1470 + check_for_new_grace_period(rsp, rdp); 1471 + 1472 + /* Start a new grace period if one not already started. */ 1473 + if (!rcu_gp_in_progress(rsp)) { 1474 + unsigned long nestflag; 1475 + struct rcu_node *rnp_root = rcu_get_root(rsp); 1476 + 1477 + raw_spin_lock_irqsave(&rnp_root->lock, nestflag); 1478 + rcu_start_gp(rsp, nestflag); /* rlses rnp_root->lock */ 1479 + } else { 1480 + /* Give the grace period a kick. */ 1481 + rdp->blimit = LONG_MAX; 1482 + if (rsp->n_force_qs == rdp->n_force_qs_snap && 1483 + *rdp->nxttail[RCU_DONE_TAIL] != head) 1484 + force_quiescent_state(rsp, 0); 1485 + rdp->n_force_qs_snap = rsp->n_force_qs; 1486 + rdp->qlen_last_fqs_check = rdp->qlen; 1487 + } 1457 1488 } else if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) 1458 1489 force_quiescent_state(rsp, 1); 1459 1490 local_irq_restore(flags); ··· 1698 1699 * decrement rcu_barrier_cpu_count -- otherwise the first CPU 1699 1700 * might complete its grace period before all of the other CPUs 1700 1701 * did their increment, causing this function to return too 1701 - * early. 1702 + * early. Note that on_each_cpu() disables irqs, which prevents 1703 + * any CPUs from coming online or going offline until each online 1704 + * CPU has queued its RCU-barrier callback. 
1702 1705 */ 1703 1706 atomic_set(&rcu_barrier_cpu_count, 1); 1704 - preempt_disable(); /* stop CPU_DYING from filling orphan_cbs_list */ 1705 - rcu_adopt_orphan_cbs(rsp); 1706 1707 on_each_cpu(rcu_barrier_func, (void *)call_rcu_func, 1); 1707 - preempt_enable(); /* CPU_DYING can again fill orphan_cbs_list */ 1708 1708 if (atomic_dec_and_test(&rcu_barrier_cpu_count)) 1709 1709 complete(&rcu_barrier_completion); 1710 1710 wait_for_completion(&rcu_barrier_completion); ··· 1829 1831 case CPU_DYING: 1830 1832 case CPU_DYING_FROZEN: 1831 1833 /* 1832 - * preempt_disable() in _rcu_barrier() prevents stop_machine(), 1833 - * so when "on_each_cpu(rcu_barrier_func, (void *)type, 1);" 1834 - * returns, all online cpus have queued rcu_barrier_func(). 1835 - * The dying CPU clears its cpu_online_mask bit and 1836 - * moves all of its RCU callbacks to ->orphan_cbs_list 1837 - * in the context of stop_machine(), so subsequent calls 1838 - * to _rcu_barrier() will adopt these callbacks and only 1839 - * then queue rcu_barrier_func() on all remaining CPUs. 1834 + * The whole machine is "stopped" except this CPU, so we can 1835 + * touch any data without introducing corruption. We send the 1836 + * dying CPU's callbacks to an arbitrarily chosen online CPU. 1840 1837 */ 1841 - rcu_send_cbs_to_orphanage(&rcu_bh_state); 1842 - rcu_send_cbs_to_orphanage(&rcu_sched_state); 1843 - rcu_preempt_send_cbs_to_orphanage(); 1838 + rcu_send_cbs_to_online(&rcu_bh_state); 1839 + rcu_send_cbs_to_online(&rcu_sched_state); 1840 + rcu_preempt_send_cbs_to_online(); 1844 1841 break; 1845 1842 case CPU_DEAD: 1846 1843 case CPU_DEAD_FROZEN: ··· 1873 1880 { 1874 1881 int i; 1875 1882 1876 - for (i = NUM_RCU_LVLS - 1; i >= 0; i--) 1883 + for (i = NUM_RCU_LVLS - 1; i > 0; i--) 1877 1884 rsp->levelspread[i] = CONFIG_RCU_FANOUT; 1885 + rsp->levelspread[0] = RCU_FANOUT_LEAF; 1878 1886 } 1879 1887 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */ 1880 1888 static void __init rcu_init_levelspread(struct rcu_state *rsp)
+28 -31
kernel/rcutree.h
···
/*
 * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
 * In theory, it should be possible to add more levels straightforwardly.
- * In practice, this has not been tested, so there is probably some
- * bug somewhere.
+ * In practice, this did work well going from three levels to four.
+ * Of course, your mileage may vary.
 */
#define MAX_RCU_LVLS 4
-#define RCU_FANOUT	      (CONFIG_RCU_FANOUT)
-#define RCU_FANOUT_SQ	      (RCU_FANOUT * RCU_FANOUT)
-#define RCU_FANOUT_CUBE	      (RCU_FANOUT_SQ * RCU_FANOUT)
-#define RCU_FANOUT_FOURTH     (RCU_FANOUT_CUBE * RCU_FANOUT)
+#if CONFIG_RCU_FANOUT > 16
+#define RCU_FANOUT_LEAF       16
+#else /* #if CONFIG_RCU_FANOUT > 16 */
+#define RCU_FANOUT_LEAF       (CONFIG_RCU_FANOUT)
+#endif /* #else #if CONFIG_RCU_FANOUT > 16 */
+#define RCU_FANOUT_1	      (RCU_FANOUT_LEAF)
+#define RCU_FANOUT_2	      (RCU_FANOUT_1 * CONFIG_RCU_FANOUT)
+#define RCU_FANOUT_3	      (RCU_FANOUT_2 * CONFIG_RCU_FANOUT)
+#define RCU_FANOUT_4	      (RCU_FANOUT_3 * CONFIG_RCU_FANOUT)

-#if NR_CPUS <= RCU_FANOUT
+#if NR_CPUS <= RCU_FANOUT_1
#  define NUM_RCU_LVLS	      1
#  define NUM_RCU_LVL_0	      1
#  define NUM_RCU_LVL_1	      (NR_CPUS)
#  define NUM_RCU_LVL_2	      0
#  define NUM_RCU_LVL_3	      0
#  define NUM_RCU_LVL_4	      0
-#elif NR_CPUS <= RCU_FANOUT_SQ
+#elif NR_CPUS <= RCU_FANOUT_2
#  define NUM_RCU_LVLS	      2
#  define NUM_RCU_LVL_0	      1
-#  define NUM_RCU_LVL_1	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
+#  define NUM_RCU_LVL_1	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
#  define NUM_RCU_LVL_2	      (NR_CPUS)
#  define NUM_RCU_LVL_3	      0
#  define NUM_RCU_LVL_4	      0
-#elif NR_CPUS <= RCU_FANOUT_CUBE
+#elif NR_CPUS <= RCU_FANOUT_3
#  define NUM_RCU_LVLS	      3
#  define NUM_RCU_LVL_0	      1
-#  define NUM_RCU_LVL_1	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_SQ)
-#  define NUM_RCU_LVL_2	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
-#  define NUM_RCU_LVL_3	      NR_CPUS
+#  define NUM_RCU_LVL_1	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+#  define NUM_RCU_LVL_2	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+#  define NUM_RCU_LVL_3	      (NR_CPUS)
#  define NUM_RCU_LVL_4	      0
-#elif NR_CPUS <= RCU_FANOUT_FOURTH
+#elif NR_CPUS <= RCU_FANOUT_4
#  define NUM_RCU_LVLS	      4
#  define NUM_RCU_LVL_0	      1
-#  define NUM_RCU_LVL_1	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_CUBE)
-#  define NUM_RCU_LVL_2	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_SQ)
-#  define NUM_RCU_LVL_3	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
-#  define NUM_RCU_LVL_4	      NR_CPUS
+#  define NUM_RCU_LVL_1	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+#  define NUM_RCU_LVL_2	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+#  define NUM_RCU_LVL_3	      DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+#  define NUM_RCU_LVL_4	      (NR_CPUS)
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
-#endif /* #if (NR_CPUS) <= RCU_FANOUT */
+#endif /* #if (NR_CPUS) <= RCU_FANOUT_1 */

#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
···
	long qlen_last_fqs_check;
					/* qlen at last check for QS forcing */
	unsigned long n_cbs_invoked;	/* count of RCU cbs invoked. */
-	unsigned long n_cbs_orphaned;	/* RCU cbs sent to orphanage. */
-	unsigned long n_cbs_adopted;	/* RCU cbs adopted from orphanage. */
+	unsigned long n_cbs_orphaned;	/* RCU cbs orphaned by dying CPU */
+	unsigned long n_cbs_adopted;	/* RCU cbs adopted from dying CPU */
	unsigned long n_force_qs_snap;
					/* did other CPU force QS recently? */
	long blimit;			/* Upper limit on a processed batch */
···
	/* End of fields guarded by root rcu_node's lock. */

	raw_spinlock_t onofflock;	/* exclude on/offline and */
-					/*  starting new GP.  Also */
-					/*  protects the following */
-					/*  orphan_cbs fields. */
-	struct rcu_head *orphan_cbs_list; /* list of rcu_head structs */
-					/*  orphaned by all CPUs in */
-					/*  a given leaf rcu_node */
-					/*  going offline. */
-	struct rcu_head **orphan_cbs_tail; /* And tail pointer. */
-	long orphan_qlen;		/* Number of orphaned cbs. */
+					/*  starting new GP. */
	raw_spinlock_t fqslock;		/* Only one task forcing */
					/*  quiescent states. */
	unsigned long jiffies_force_qs;	/* Time at which to invoke */
···
static int rcu_preempt_pending(int cpu);
static int rcu_preempt_needs_cpu(int cpu);
static void __cpuinit rcu_preempt_init_percpu_data(int cpu);
-static void rcu_preempt_send_cbs_to_orphanage(void);
+static void rcu_preempt_send_cbs_to_online(void);
static void __init __rcu_init_preempt(void);
static void rcu_needs_cpu_flush(void);
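The capped-leaf fanout arithmetic in the hunk above can be sanity-checked in user space. This is a sketch with invented helper names (rcu_fanout_k, rcu_num_nodes_at), mirroring RCU_FANOUT_LEAF's cap of 16 and the DIV_ROUND_UP() level sizing; for example, NR_CPUS=4096 with CONFIG_RCU_FANOUT=64 yields a three-level tree of 1 + 4 + 256 rcu_node structures:

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* CPUs covered by one rcu_node "k" levels above the CPUs, i.e.
 * RCU_FANOUT_k: the leaf level is capped at 16 (RCU_FANOUT_LEAF),
 * while every interior level multiplies by the full fanout. */
static long rcu_fanout_k(int fanout, int k)
{
	long cap = fanout > 16 ? 16 : fanout;	/* RCU_FANOUT_LEAF */

	while (--k > 0)
		cap *= fanout;
	return cap;
}

/* Node count at a level whose nodes each cover RCU_FANOUT_k CPUs;
 * with NUM_RCU_LVLS == lvls, NUM_RCU_LVL_i is this with k = lvls - i. */
static long rcu_num_nodes_at(int nr_cpus, int fanout, int k)
{
	return DIV_ROUND_UP((long)nr_cpus, rcu_fanout_k(fanout, k));
}
```

The point of the cap is that with the old RCU_FANOUT_SQ sizing, a fanout of 64 would also put 64 CPUs on each leaf; holding the leaf at 16 bounds contention on each leaf rcu_node's lock while interior levels keep the full fanout.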
+131 -4
kernel/rcutree_plugin.h
···
 */

#include <linux/delay.h>
+#include <linux/stop_machine.h>

/*
 * Check the RCU kernel configuration parameters and print informative
···
}

/*
- * Move preemptable RCU's callbacks to ->orphan_cbs_list.
+ * Move preemptable RCU's callbacks from dying CPU to other online CPU.
 */
-static void rcu_preempt_send_cbs_to_orphanage(void)
+static void rcu_preempt_send_cbs_to_online(void)
{
-	rcu_send_cbs_to_orphanage(&rcu_preempt_state);
+	rcu_send_cbs_to_online(&rcu_preempt_state);
}

/*
···
/*
 * Because there is no preemptable RCU, there are no callbacks to move.
 */
-static void rcu_preempt_send_cbs_to_orphanage(void)
+static void rcu_preempt_send_cbs_to_online(void)
{
}

···
}

#endif /* #else #ifdef CONFIG_TREE_PREEMPT_RCU */
+
+#ifndef CONFIG_SMP
+
+void synchronize_sched_expedited(void)
+{
+	cond_resched();
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#else /* #ifndef CONFIG_SMP */
+
+static atomic_t sync_sched_expedited_started = ATOMIC_INIT(0);
+static atomic_t sync_sched_expedited_done = ATOMIC_INIT(0);
+
+static int synchronize_sched_expedited_cpu_stop(void *data)
+{
+	/*
+	 * There must be a full memory barrier on each affected CPU
+	 * between the time that try_stop_cpus() is called and the
+	 * time that it returns.
+	 *
+	 * In the current initial implementation of cpu_stop, the
+	 * above condition is already met when the control reaches
+	 * this point and the following smp_mb() is not strictly
+	 * necessary.  Do smp_mb() anyway for documentation and
+	 * robustness against future implementation changes.
+	 */
+	smp_mb(); /* See above comment block. */
+	return 0;
+}
+
+/*
+ * Wait for an rcu-sched grace period to elapse, but use "big hammer"
+ * approach to force grace period to end quickly.  This consumes
+ * significant time on all CPUs, and is thus not recommended for
+ * any sort of common-case code.
+ *
+ * Note that it is illegal to call this function while holding any
+ * lock that is acquired by a CPU-hotplug notifier.  Failing to
+ * observe this restriction will result in deadlock.
+ *
+ * This implementation can be thought of as an application of ticket
+ * locking to RCU, with sync_sched_expedited_started and
+ * sync_sched_expedited_done taking on the roles of the halves
+ * of the ticket-lock word.  Each task atomically increments
+ * sync_sched_expedited_started upon entry, snapshotting the old value,
+ * then attempts to stop all the CPUs.  If this succeeds, then each
+ * CPU will have executed a context switch, resulting in an RCU-sched
+ * grace period.  We are then done, so we use atomic_cmpxchg() to
+ * update sync_sched_expedited_done to match our snapshot -- but
+ * only if someone else has not already advanced past our snapshot.
+ *
+ * On the other hand, if try_stop_cpus() fails, we check the value
+ * of sync_sched_expedited_done.  If it has advanced past our
+ * initial snapshot, then someone else must have forced a grace period
+ * some time after we took our snapshot.  In this case, our work is
+ * done for us, and we can simply return.  Otherwise, we try again,
+ * but keep our initial snapshot for purposes of checking for someone
+ * doing our work for us.
+ *
+ * If we fail too many times in a row, we fall back to synchronize_sched().
+ */
+void synchronize_sched_expedited(void)
+{
+	int firstsnap, s, snap, trycount = 0;
+
+	/* Note that atomic_inc_return() implies full memory barrier. */
+	firstsnap = snap = atomic_inc_return(&sync_sched_expedited_started);
+	get_online_cpus();
+
+	/*
+	 * Each pass through the following loop attempts to force a
+	 * context switch on each CPU.
+	 */
+	while (try_stop_cpus(cpu_online_mask,
+			     synchronize_sched_expedited_cpu_stop,
+			     NULL) == -EAGAIN) {
+		put_online_cpus();
+
+		/* No joy, try again later.  Or just synchronize_sched(). */
+		if (trycount++ < 10)
+			udelay(trycount * num_online_cpus());
+		else {
+			synchronize_sched();
+			return;
+		}
+
+		/* Check to see if someone else did our work for us. */
+		s = atomic_read(&sync_sched_expedited_done);
+		if (UINT_CMP_GE((unsigned)s, (unsigned)firstsnap)) {
+			smp_mb(); /* ensure test happens before caller kfree */
+			return;
+		}
+
+		/*
+		 * Refetching sync_sched_expedited_started allows later
+		 * callers to piggyback on our grace period.  We subtract
+		 * 1 to get the same token that the last incrementer got.
+		 * We retry after they started, so our grace period works
+		 * for them, and they started after our first try, so their
+		 * grace period works for us.
+		 */
+		get_online_cpus();
+		snap = atomic_read(&sync_sched_expedited_started) - 1;
+		smp_mb(); /* ensure read is before try_stop_cpus(). */
+	}
+
+	/*
+	 * Everyone up to our most recent fetch is covered by our grace
+	 * period.  Update the counter, but only if our work is still
+	 * relevant -- which it won't be if someone who started later
+	 * than we did beat us to the punch.
+	 */
+	do {
+		s = atomic_read(&sync_sched_expedited_done);
+		if (UINT_CMP_GE((unsigned)s, (unsigned)snap)) {
+			smp_mb(); /* ensure test happens before caller kfree */
+			break;
+		}
+	} while (atomic_cmpxchg(&sync_sched_expedited_done, s, snap) != s);
+
+	put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
+
+#endif /* #else #ifndef CONFIG_SMP */

#if !defined(CONFIG_RCU_FAST_NO_HZ)
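The started/done counter pair added above behaves like the two halves of a ticket lock. A single-threaded user-space sketch of just the snapshot logic — names are invented, try_stop_cpus() is reduced to a flag, and the real code's udelay()/retry loop and synchronize_sched() fallback are elided:

```c
#include <assert.h>
#include <stdatomic.h>

/* Wraparound-safe "a has reached b", like the kernel's UINT_CMP_GE(). */
#define UINT_CMP_GE(a, b) ((int)((a) - (b)) >= 0)

static atomic_uint started;	/* plays sync_sched_expedited_started */
static atomic_uint done;	/* plays sync_sched_expedited_done */

/*
 * One expedited request.  stop_ok stands in for try_stop_cpus()
 * succeeding; because the retry loop is elided, a failed stop that
 * nobody else covered falls through as if a retry then succeeded.
 * Returns 1 if this caller's own "grace period" did the work, 0 if a
 * concurrent caller's completed grace period already covered it.
 */
static int expedited(int stop_ok)
{
	unsigned int s, snap;

	/* Take a ticket: our snapshot of the started counter. */
	snap = atomic_fetch_add(&started, 1) + 1;
	if (!stop_ok) {
		/* Did someone else's grace period reach our ticket? */
		s = atomic_load(&done);
		if (UINT_CMP_GE(s, snap))
			return 0;
	}
	/* Advance done to our ticket, but never move it backwards. */
	do {
		s = atomic_load(&done);
		if (UINT_CMP_GE(s, snap))
			return 1;
	} while (!atomic_compare_exchange_weak(&done, &s, snap));
	return 1;
}

/* Reset both counters, then run one request -- convenience for testing. */
static int expedited_from(int stop_ok, unsigned int s0, unsigned int d0)
{
	atomic_store(&started, s0);
	atomic_store(&done, d0);
	return expedited(stop_ok);
}
```

The signed-difference comparison in UINT_CMP_GE() is what lets both counters wrap around without ever misordering two tickets taken less than half the counter space apart.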
+6 -6
kernel/rcutree_trace.c
···

	gpnum = rsp->gpnum;
	seq_printf(m, "c=%lu g=%lu s=%d jfq=%ld j=%x "
-		   "nfqs=%lu/nfqsng=%lu(%lu) fqlh=%lu oqlen=%ld\n",
+		   "nfqs=%lu/nfqsng=%lu(%lu) fqlh=%lu\n",
		   rsp->completed, gpnum, rsp->signaled,
		   (long)(rsp->jiffies_force_qs - jiffies),
		   (int)(jiffies & 0xffff),
		   rsp->n_force_qs, rsp->n_force_qs_ngp,
		   rsp->n_force_qs - rsp->n_force_qs_ngp,
-		   rsp->n_force_qs_lh, rsp->orphan_qlen);
+		   rsp->n_force_qs_lh);
	for (rnp = &rsp->node[0]; rnp - &rsp->node[0] < NUM_RCU_NODES; rnp++) {
		if (rnp->level != level) {
			seq_puts(m, "\n");
···

static struct dentry *rcudir;

-static int __init rcuclassic_trace_init(void)
+static int __init rcutree_trace_init(void)
{
	struct dentry *retval;

···
	return 1;
}

-static void __exit rcuclassic_trace_cleanup(void)
+static void __exit rcutree_trace_cleanup(void)
{
	debugfs_remove_recursive(rcudir);
}


-module_init(rcuclassic_trace_init);
-module_exit(rcuclassic_trace_cleanup);
+module_init(rcutree_trace_init);
+module_exit(rcutree_trace_cleanup);

MODULE_AUTHOR("Paul E. McKenney");
MODULE_DESCRIPTION("Read-Copy Update tracing for hierarchical implementation");
-69
kernel/sched.c
···
};
#endif /* CONFIG_CGROUP_CPUACCT */

-#ifndef CONFIG_SMP
-
-void synchronize_sched_expedited(void)
-{
-	barrier();
-}
-EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
-
-#else /* #ifndef CONFIG_SMP */
-
-static atomic_t synchronize_sched_expedited_count = ATOMIC_INIT(0);
-
-static int synchronize_sched_expedited_cpu_stop(void *data)
-{
-	/*
-	 * There must be a full memory barrier on each affected CPU
-	 * between the time that try_stop_cpus() is called and the
-	 * time that it returns.
-	 *
-	 * In the current initial implementation of cpu_stop, the
-	 * above condition is already met when the control reaches
-	 * this point and the following smp_mb() is not strictly
-	 * necessary.  Do smp_mb() anyway for documentation and
-	 * robustness against future implementation changes.
-	 */
-	smp_mb(); /* See above comment block. */
-	return 0;
-}
-
-/*
- * Wait for an rcu-sched grace period to elapse, but use "big hammer"
- * approach to force grace period to end quickly.  This consumes
- * significant time on all CPUs, and is thus not recommended for
- * any sort of common-case code.
- *
- * Note that it is illegal to call this function while holding any
- * lock that is acquired by a CPU-hotplug notifier.  Failing to
- * observe this restriction will result in deadlock.
- */
-void synchronize_sched_expedited(void)
-{
-	int snap, trycount = 0;
-
-	smp_mb();  /* ensure prior mod happens before capturing snap. */
-	snap = atomic_read(&synchronize_sched_expedited_count) + 1;
-	get_online_cpus();
-	while (try_stop_cpus(cpu_online_mask,
-			     synchronize_sched_expedited_cpu_stop,
-			     NULL) == -EAGAIN) {
-		put_online_cpus();
-		if (trycount++ < 10)
-			udelay(trycount * num_online_cpus());
-		else {
-			synchronize_sched();
-			return;
-		}
-		if (atomic_read(&synchronize_sched_expedited_count) - snap > 0) {
-			smp_mb(); /* ensure test happens before caller kfree */
-			return;
-		}
-		get_online_cpus();
-	}
-	atomic_inc(&synchronize_sched_expedited_count);
-	smp_mb__after_atomic_inc();  /* ensure post-GP actions seen after GP. */
-	put_online_cpus();
-}
-EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
-
-#endif /* #else #ifndef CONFIG_SMP */
+7 -1
kernel/srcu.c
···
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/smp.h>
+#include <linux/delay.h>
#include <linux/srcu.h>

static int init_srcu_struct_fields(struct srcu_struct *sp)
···
	 * all srcu_read_lock() calls using the old counters have completed.
	 * Their corresponding critical sections might well be still
	 * executing, but the srcu_read_lock() primitives themselves
-	 * will have finished executing.
+	 * will have finished executing.  We initially give readers
+	 * an arbitrarily chosen 10 microseconds to get out of their
+	 * SRCU read-side critical sections, then loop waiting 1/HZ
+	 * seconds per iteration.
	 */

+	if (srcu_readers_active_idx(sp, idx))
+		udelay(CONFIG_SRCU_SYNCHRONIZE_DELAY);
	while (srcu_readers_active_idx(sp, idx))
		schedule_timeout_interruptible(1);
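The srcu.c change above turns a purely tick-granular polling loop into a two-phase wait: one short busy delay for the common case where readers exit quickly, then coarse per-tick sleeps. A user-space sketch of that shape — wait_two_phase and the demo_* helpers are invented names, and the delay values here are placeholders rather than CONFIG_SRCU_SYNCHRONIZE_DELAY:

```c
#include <assert.h>
#include <time.h>

/*
 * Two-phase wait shaped like the synchronize_srcu() change above:
 * if the condition still holds, pause once briefly (the udelay()),
 * then fall back to coarse per-tick sleeps.  cond() returns nonzero
 * while we must keep waiting; the return value counts coarse sleeps.
 */
static int wait_two_phase(int (*cond)(void), long brief_us, long tick_ms)
{
	struct timespec ts;
	int coarse = 0;

	if (cond()) {				/* phase 1: one brief delay */
		ts.tv_sec = 0;
		ts.tv_nsec = brief_us * 1000L;
		nanosleep(&ts, NULL);
	}
	while (cond()) {			/* phase 2: tick-granular sleeps */
		ts.tv_sec = tick_ms / 1000;
		ts.tv_nsec = (tick_ms % 1000) * 1000000L;
		nanosleep(&ts, NULL);
		coarse++;
	}
	return coarse;
}

/* Demo condition: reports "readers still active" for the first n calls. */
static int demo_remaining;
static int demo_cond(void)
{
	return demo_remaining-- > 0;
}

static int demo_run(int n)
{
	demo_remaining = n;
	return wait_two_phase(demo_cond, 100, 1);
}
```

The payoff is the same as in the SRCU case: a waiter whose readers drain within the brief delay never pays a full tick of latency, while long waits cost no more CPU than before.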