Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"The main changes are:

- Lockless wakeup support for futexes and IPC message queues
(Davidlohr Bueso, Peter Zijlstra)

- Replace spinlocks with atomics in thread_group_cputimer(), to
improve scalability (Jason Low)

- NUMA balancing improvements (Rik van Riel)

- SCHED_DEADLINE improvements (Wanpeng Li)

- Clean up and reorganize preemption helpers (Frederic Weisbecker)

- Decouple page fault disabling machinery from the preemption
counter, to improve debuggability and robustness (David
Hildenbrand)

- SCHED_DEADLINE documentation updates (Luca Abeni)

- topology CPU masks cleanups (Bartosz Golaszewski)

- /proc/sched_debug improvements (Srikar Dronamraju)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
sched/deadline: Remove needless parameter in dl_runtime_exceeded()
sched: Remove superfluous resetting of the p->dl_throttled flag
sched/deadline: Drop duplicate init_sched_dl_class() declaration
sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
sched/deadline: Make init_sched_dl_class() __init
sched/deadline: Optimize pull_dl_task()
sched/preempt: Add static_key() to preempt_notifiers
sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
sched/debug: Properly format runnable tasks in /proc/sched_debug
sched/numa: Only consider less busy nodes as numa balancing destinations
Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
sched/fair: Prevent throttling in early pick_next_task_fair()
preempt: Reorganize the notrace definitions a bit
preempt: Use preempt_schedule_context() as the official tracing preemption point
sched: Make preempt_schedule_context() function-tracing safe
x86: Remove cpu_sibling_mask() and cpu_core_mask()
x86: Replace cpu_**_mask() with topology_**_cpumask()
...

+1444 -974
+27 -10
Documentation/cputopology.txt
···
 
 Export CPU topology info via sysfs. Items (attributes) are similar
-to /proc/cpuinfo.
+to /proc/cpuinfo output of some architectures:
 
 1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:
 
···
 4) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
 
 	internal kernel map of cpuX's hardware threads within the same
-	core as cpuX
+	core as cpuX.
 
-5) /sys/devices/system/cpu/cpuX/topology/core_siblings:
+5) /sys/devices/system/cpu/cpuX/topology/thread_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	core as cpuX.
+
+6) /sys/devices/system/cpu/cpuX/topology/core_siblings:
 
 	internal kernel map of cpuX's hardware threads within the same
 	physical_package_id.
 
-6) /sys/devices/system/cpu/cpuX/topology/book_siblings:
+7) /sys/devices/system/cpu/cpuX/topology/core_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	physical_package_id.
+
+8) /sys/devices/system/cpu/cpuX/topology/book_siblings:
 
 	internal kernel map of cpuX's hardware threads within the same
 	book_id.
 
+9) /sys/devices/system/cpu/cpuX/topology/book_siblings_list:
+
+	human-readable list of cpuX's hardware threads within the same
+	book_id.
+
 To implement it in an architecture-neutral way, a new source file,
-drivers/base/topology.c, is to export the 4 or 6 attributes. The two book
+drivers/base/topology.c, is to export the 6 or 9 attributes. The three book
 related sysfs files will only be created if CONFIG_SCHED_BOOK is selected.
 
 For an architecture to support this feature, it must define some of
···
 #define topology_physical_package_id(cpu)
 #define topology_core_id(cpu)
 #define topology_book_id(cpu)
-#define topology_thread_cpumask(cpu)
+#define topology_sibling_cpumask(cpu)
 #define topology_core_cpumask(cpu)
 #define topology_book_cpumask(cpu)
 
-The type of **_id is int.
-The type of siblings is (const) struct cpumask *.
+The type of **_id macros is int.
+The type of **_cpumask macros is (const) struct cpumask *. The latter
+correspond with appropriate **_siblings sysfs attributes (except for
+topology_sibling_cpumask() which corresponds with thread_siblings).
 
 To be consistent on all architectures, include/linux/topology.h
 provides default definitions for any of the above macros that are
 not defined by include/asm-XXX/topology.h:
 1) physical_package_id: -1
 2) core_id: 0
-3) thread_siblings: just the given CPU
-4) core_siblings: just the given CPU
+3) sibling_cpumask: just the given CPU
+4) core_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
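The new *_siblings_list attributes expose the same masks in the kernel's human-readable cpulist notation (comma-separated CPU ids and ranges, e.g. "0-3,8"). As a rough illustration of how user space might consume them — a sketch, not kernel code; the helper name is this example's own:

```python
def parse_cpulist(s):
    """Parse a cpulist string such as "0-3,8" into a sorted list of CPU ids."""
    cpus = []
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

# e.g. the contents of /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
print(parse_cpulist("0-3,8"))   # [0, 1, 2, 3, 8]
```

The binary *_siblings attributes carry the same information as a hex bitmask; the _list variants exist precisely so tools can avoid this kind of mask decoding.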
+154 -30
Documentation/scheduler/sched-deadline.txt
···
 1. Overview
 2. Scheduling algorithm
 3. Scheduling Real-Time Tasks
+  3.1 Definitions
+  3.2 Schedulability Analysis for Uniprocessor Systems
+  3.3 Schedulability Analysis for Multiprocessor Systems
+  3.4 Relationship with SCHED_DEADLINE Parameters
 4. Bandwidth management
   4.1 System-wide settings
   4.2 Task interface
···
 "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
 "runtime" microseconds of execution time every "period" microseconds, and
 these "runtime" microseconds are available within "deadline" microseconds
-from the beginning of the period. In order to implement this behaviour,
+from the beginning of the period. In order to implement this behavior,
 every time the task wakes up, the scheduler computes a "scheduling deadline"
 consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
 scheduled using EDF[1] on these scheduling deadlines (the task with the
···
 "admission control" strategy (see Section "4. Bandwidth management") is used
 (clearly, if the system is overloaded this guarantee cannot be respected).
 
-Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
+Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
 that each task runs for at most its runtime every period, avoiding any
 interference between different tasks (bandwidth isolation), while the EDF[1]
 algorithm selects the task with the earliest scheduling deadline as the one
···
 In more details, the CBS algorithm assigns scheduling deadlines to
 tasks in the following way:
 
- - Each SCHED_DEADLINE task is characterised by the "runtime",
+ - Each SCHED_DEADLINE task is characterized by the "runtime",
   "deadline", and "period" parameters;
 
 - The state of the task is described by a "scheduling deadline", and
···
 
 then, if the scheduling deadline is smaller than the current time, or
 this condition is verified, the scheduling deadline and the
-remaining runtime are re-initialised as
+remaining runtime are re-initialized as
 
     scheduling deadline = current time + deadline
     remaining runtime = runtime
···
 suited for periodic or sporadic real-time tasks that need guarantees on their
 timing behavior, e.g., multimedia, streaming, control applications, etc.
 
+3.1 Definitions
+------------------------
+
 A typical real-time task is composed of a repetition of computation phases
 (task instances, or jobs) which are activated on a periodic or sporadic
 fashion.
-Each job J_j (where J_j is the j^th job of the task) is characterised by an
+Each job J_j (where J_j is the j^th job of the task) is characterized by an
 arrival time r_j (the time when the job starts), an amount of computation
 time c_j needed to finish the job, and a job absolute deadline d_j, which
 is the time within which the job should be finished. The maximum execution
-time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
+time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
 A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
 sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
 d_j = r_j + D, where D is the task's relative deadline.
-The utilisation of a real-time task is defined as the ratio between its
+Summing up, a real-time task can be described as
+	Task = (WCET, D, P)
+
+The utilization of a real-time task is defined as the ratio between its
 WCET and its period (or minimum inter-arrival time), and represents
 the fraction of CPU time needed to execute the task.
 
-If the total utilisation sum_i(WCET_i/P_i) is larger than M (with M equal
+If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
 to the number of CPUs), then the scheduler is unable to respect all the
 deadlines.
-Note that total utilisation is defined as the sum of the utilisations
+Note that total utilization is defined as the sum of the utilizations
 WCET_i/P_i over all the real-time tasks in the system. When considering
 multiple real-time tasks, the parameters of the i-th task are indicated
 with the "_i" suffix.
-Moreover, if the total utilisation is larger than M, then we risk starving
+Moreover, if the total utilization is larger than M, then we risk starving
 non- real-time tasks by real-time tasks.
-If, instead, the total utilisation is smaller than M, then non real-time
+If, instead, the total utilization is smaller than M, then non real-time
 tasks will not be starved and the system might be able to respect all the
 deadlines.
 As a matter of fact, in this case it is possible to provide an upper bound
···
 More precisely, it can be proven that using a global EDF scheduler the
 maximum tardiness of each task is smaller or equal than
 	((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
-where WCET_max = max_i{WCET_i} is the maximum WCET, WCET_min=min_i{WCET_i}
-is the minimum WCET, and U_max = max_i{WCET_i/P_i} is the maximum utilisation.
+where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
+is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
+utilization[12].
+
+3.2 Schedulability Analysis for Uniprocessor Systems
+------------------------
 
 If M=1 (uniprocessor system), or in case of partitioned scheduling (each
 real-time task is statically assigned to one and only one CPU), it is
 possible to formally check if all the deadlines are respected.
 If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
-of all the tasks executing on a CPU if and only if the total utilisation
+of all the tasks executing on a CPU if and only if the total utilization
 of the tasks running on such a CPU is smaller or equal than 1.
 If D_i != P_i for some task, then it is possible to define the density of
-a task as C_i/min{D_i,T_i}, and EDF is able to respect all the deadlines
-of all the tasks running on a CPU if the sum sum_i C_i/min{D_i,T_i} of the
-densities of the tasks running on such a CPU is smaller or equal than 1
-(notice that this condition is only sufficient, and not necessary).
+a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
+of all the tasks running on a CPU if the sum of the densities of the tasks
+running on such a CPU is smaller or equal than 1:
+	sum(WCET_i / min{D_i, P_i}) <= 1
+It is important to notice that this condition is only sufficient, and not
+necessary: there are task sets that are schedulable, but do not respect the
+condition. For example, consider the task set {Task_1,Task_2} composed by
+Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
+EDF is clearly able to schedule the two tasks without missing any deadline
+(Task_1 is scheduled as soon as it is released, and finishes just in time
+to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
+its response time cannot be larger than 50ms + 10ms = 60ms) even if
+	50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
+Of course it is possible to test the exact schedulability of tasks with
+D_i != P_i (checking a condition that is both sufficient and necessary),
+but this cannot be done by comparing the total utilization or density with
+a constant. Instead, the so called "processor demand" approach can be used,
+computing the total amount of CPU time h(t) needed by all the tasks to
+respect all of their deadlines in a time interval of size t, and comparing
+such a time with the interval size t. If h(t) is smaller than t (that is,
+the amount of time needed by the tasks in a time interval of size t is
+smaller than the size of the interval) for all the possible values of t, then
+EDF is able to schedule the tasks respecting all of their deadlines. Since
+performing this check for all possible values of t is impossible, it has been
+proven[4,5,6] that it is sufficient to perform the test for values of t
+between 0 and a maximum value L. The cited papers contain all of the
+mathematical details and explain how to compute h(t) and L.
+In any case, this kind of analysis is too complex as well as too
+time-consuming to be performed on-line. Hence, as explained in Section
+4 Linux uses an admission test based on the tasks' utilizations.
+
+3.3 Schedulability Analysis for Multiprocessor Systems
+------------------------
 
 On multiprocessor systems with global EDF scheduling (non partitioned
 systems), a sufficient test for schedulability can not be based on the
-utilisations (it can be shown that task sets with utilisations slightly
-larger than 1 can miss deadlines regardless of the number of CPUs M).
-However, as previously stated, enforcing that the total utilisation is smaller
-than M is enough to guarantee that non real-time tasks are not starved and
-that the tardiness of real-time tasks has an upper bound.
+utilizations or densities: it can be shown that even if D_i = P_i task
+sets with utilizations slightly larger than 1 can miss deadlines regardless
+of the number of CPUs.
 
-SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
-the jobs' deadlines of a task are respected. In order to do this, a task
-must be scheduled by setting:
+Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
+CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
+and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
+arbitrarily small worst case execution time (indicated as "e" here) and a
+period smaller than the one of the first task. Hence, if all the tasks
+activate at the same time t, global EDF schedules these M tasks first
+(because their absolute deadlines are equal to t + P - 1, hence they are
+smaller than the absolute deadline of Task_1, which is t + P). As a
+result, Task_1 can be scheduled only at time t + e, and will finish at
+time t + e + P, after its absolute deadline. The total utilization of the
+task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
+values of e this can become very close to 1. This is known as "Dhall's
+effect"[7]. Note: the example in the original paper by Dhall has been
+slightly simplified here (for example, Dhall more correctly computed
+lim_{e->0}U).
+
+More complex schedulability tests for global EDF have been developed in
+real-time literature[8,9], but they are not based on a simple comparison
+between total utilization (or density) and a fixed constant. If all tasks
+have D_i = P_i, a sufficient schedulability condition can be expressed in
+a simple way:
+	sum(WCET_i / P_i) <= M - (M - 1) · U_max
+where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
+M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
+just confirms the Dhall's effect. A more complete survey of the literature
+about schedulability tests for multi-processor real-time scheduling can be
+found in [11].
+
+As seen, enforcing that the total utilization is smaller than M does not
+guarantee that global EDF schedules the tasks without missing any deadline
+(in other words, global EDF is not an optimal scheduling algorithm). However,
+a total utilization smaller than M is enough to guarantee that non real-time
+tasks are not starved and that the tardiness of real-time tasks has an upper
+bound[12] (as previously noted). Different bounds on the maximum tardiness
+experienced by real-time tasks have been developed in various papers[13,14],
+but the theoretical result that is important for SCHED_DEADLINE is that if
+the total utilization is smaller or equal than M then the response times of
+the tasks are limited.
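The two closed-form tests added by this documentation update — the uniprocessor density bound and the global-EDF utilization bound — lend themselves to a quick numeric check. A rough sketch in plain Python (nothing from the kernel tree; the function names are this example's own), using the document's (WCET, D, P) task triples:

```python
def density_test(tasks):
    """Sufficient (not necessary) uniprocessor EDF test:
    sum(WCET_i / min{D_i, P_i}) <= 1 guarantees schedulability."""
    return sum(w / min(d, p) for (w, d, p) in tasks) <= 1.0

def gedf_utilization_test(tasks, m):
    """Sufficient global-EDF test for implicit deadlines (D_i = P_i) on m CPUs:
    sum(WCET_i / P_i) <= M - (M - 1) * U_max, with U_max = max{WCET_i / P_i}."""
    u = [w / p for (w, d, p) in tasks]
    return sum(u) <= m - (m - 1) * max(u)

# The documentation's counterexample: schedulable by EDF on one CPU, yet the
# density sum is 50/50 + 10/100 = 1.1 > 1, so the sufficient test rejects it.
tasks = [(50, 50, 100), (10, 100, 100)]
print(density_test(tasks))  # False, even though the set is schedulable
```

A Dhall-style set such as [(10, 10, 10), (1, 9, 9), (1, 9, 9)] on 2 CPUs fails the global-EDF bound (U_max = 1 shrinks the right-hand side to 1), matching the discussion above.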
+
+3.4 Relationship with SCHED_DEADLINE Parameters
+------------------------
+
+Finally, it is important to understand the relationship between the
+SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
+deadline and period) and the real-time task parameters (WCET, D, P)
+described in this section. Note that the tasks' temporal constraints are
+represented by its absolute deadlines d_j = r_j + D described above, while
+SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
+Section 2).
+If an admission test is used to guarantee that the scheduling deadlines
+are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
+guaranteeing that all the jobs' deadlines of a task are respected.
+In order to do this, a task must be scheduled by setting:
 
  - runtime >= WCET
  - deadline = D
  - period <= P
 
-IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
+IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
 and the absolute deadlines (d_j) coincide, so a proper admission control
 allows to respect the jobs' absolute deadlines for this task (this is what is
 called "hard schedulability property" and is an extension of Lemma 1 of [2]).
···
 Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
 Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
+4 - J. Y. Leung and M.L. Merril. A Note on Preemptive Scheduling of
+    Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
+    no. 3, pp. 115-118, 1980.
+5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
+    Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
+    11th IEEE Real-time Systems Symposium, 1990.
+6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
+    Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
+    One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
+    1990.
+7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
+    research, vol. 26, no. 1, pp 127-140, 1978.
+8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
+    Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
+9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
+    IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
+    pp 760-768, 2005.
+10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
+     Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
+     vol. 25, no. 2-3, pp. 187-205, 2003.
+11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
+     Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
+     http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
+12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
+     Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
+     no. 2, pp 133-189, 2008.
+13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
+     Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
+     the 26th IEEE Real-Time Systems Symposium, 2005.
+14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
+     Global EDF. Proceedings of the 22nd Euromicro Conference on
+     Real-Time Systems, 2010.
+
 
 4. Bandwidth management
 =======================
···
 no guarantee can be given on the actual scheduling of the -deadline tasks.
 
 As already stated in Section 3, a necessary condition to be respected to
-correctly schedule a set of real-time tasks is that the total utilisation
+correctly schedule a set of real-time tasks is that the total utilization
 is smaller than M. When talking about -deadline tasks, this requires that
 the sum of the ratio between runtime and period for all tasks is smaller
-than M. Notice that the ratio runtime/period is equivalent to the utilisation
+than M. Notice that the ratio runtime/period is equivalent to the utilization
 of a "traditional" real-time task, and is also often referred to as
 "bandwidth".
 The interface used to control the CPU bandwidth that can be allocated
···
 The system wide settings are configured under the /proc virtual file system.
 
 For now the -rt knobs are used for -deadline admission control and the
--deadline runtime is accounted against the -rt runtime. We realise that this
+-deadline runtime is accounted against the -rt runtime. We realize that this
 isn't entirely desirable; however, it is better to have a small interface for
 now, and be able to change it easily later. The ideal situation (see 5.) is to
 run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
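The bandwidth-based admission control described above boils down to comparing the sum of runtime/period over all -deadline tasks with the capacity the -rt knobs leave available on M CPUs. A hypothetical sketch of that check — the function name is this example's own, and the 950000/1000000 µs defaults are an assumption of the sketch, not stated in the excerpt above:

```python
def dl_admission_ok(tasks_us, m, rt_runtime_us=950000, rt_period_us=1000000):
    """Admission check in the spirit of Section 4: the requested -deadline
    bandwidth sum(runtime_i / period_i) must not exceed the bandwidth the
    -rt knobs make available on m CPUs.

    tasks_us: list of (runtime, period) pairs in microseconds."""
    requested = sum(runtime / period for runtime, period in tasks_us)
    available = m * (rt_runtime_us / rt_period_us)
    return requested <= available

# Two tasks each needing 10ms of budget every 100ms on one CPU: 0.2 total
# bandwidth, comfortably under the assumed 95% cap.
print(dl_admission_ok([(10000, 100000), (10000, 100000)], m=1))  # True
```

The real kernel performs this accounting internally at sched_setattr() time; the sketch only mirrors the arithmetic the text describes.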
+2 -3
arch/alpha/mm/fault.c
···
 #include <linux/smp.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
-
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 
 extern void die_if_kernel(char *,struct pt_regs *,long, unsigned long *);
 
···
 
 	/* If we're in an interrupt context, or have no user context,
 	   we must not take the fault.  */
-	if (!mm || in_atomic())
+	if (!mm || faulthandler_disabled())
 		goto no_context;
 
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
+5 -5
arch/arc/include/asm/futex.h
···
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
 		return -EFAULT;
 
-	pagefault_disable();	/* implies preempt_disable() */
+	pagefault_disable();
 
 	switch (op) {
 	case FUTEX_OP_SET:
···
 		ret = -ENOSYS;
 	}
 
-	pagefault_enable();	/* subsumes preempt_enable() */
+	pagefault_enable();
 
 	if (!ret) {
 		switch (cmp) {
···
 	return ret;
 }
 
-/* Compare-xchg with preemption disabled.
+/* Compare-xchg with pagefaults disabled.
  *  Notes:
  *  -Best-Effort: Exchg happens only if compare succeeds.
  *    If compare fails, returns; leaving retry/looping to upper layers
···
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
 		return -EFAULT;
 
-	pagefault_disable();	/* implies preempt_disable() */
+	pagefault_disable();
 
 	/* TBD : can use llock/scond */
 	__asm__ __volatile__(
···
 	: "r"(oldval), "r"(newval), "r"(uaddr), "ir"(-EFAULT)
 	: "cc", "memory");
 
-	pagefault_enable();	/* subsumes preempt_enable() */
+	pagefault_enable();
 
 	*uval = val;
 	return val;
+1 -1
arch/arc/mm/fault.c
···
 	/*
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;
 
 	if (user_mode(regs))
+11 -2
arch/arm/include/asm/futex.h
···
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
+	preempt_disable();
 	__asm__ __volatile__("@futex_atomic_cmpxchg_inatomic\n"
 	"1:	" TUSER(ldr) "	%1, [%4]\n"
 	"	teq	%1, %2\n"
···
 	: "cc", "memory");
 
 	*uval = val;
+	preempt_enable();
+
 	return ret;
 }
 
···
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
-	pagefault_disable();	/* implies preempt_disable() */
+#ifndef CONFIG_SMP
+	preempt_disable();
+#endif
+	pagefault_disable();
 
 	switch (op) {
 	case FUTEX_OP_SET:
···
 		ret = -ENOSYS;
 	}
 
-	pagefault_enable();	/* subsumes preempt_enable() */
+	pagefault_enable();
+#ifndef CONFIG_SMP
+	preempt_enable();
+#endif
 
 	if (!ret) {
 		switch (cmp) {
+1 -1
arch/arm/include/asm/topology.h
···
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].socket_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
-#define topology_thread_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
 
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
+1 -1
arch/arm/mm/fault.c
···
 	/*
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;
 
 	if (user_mode(regs))
+3
arch/arm/mm/highmem.c
···
 	void *kmap;
 	int type;
 
+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
···
 		kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)]));
 	}
 	pagefault_enable();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__kunmap_atomic);
 
···
 	int idx, type;
 	struct page *page = pfn_to_page(pfn);
 
+	preempt_disable();
 	pagefault_disable();
 	if (!PageHighMem(page))
 		return page_address(page);
+2 -2
arch/arm64/include/asm/futex.h
···
 	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
 		return -EFAULT;
 
-	pagefault_disable();	/* implies preempt_disable() */
+	pagefault_disable();
 
 	switch (op) {
 	case FUTEX_OP_SET:
···
 		ret = -ENOSYS;
 	}
 
-	pagefault_enable();	/* subsumes preempt_enable() */
+	pagefault_enable();
 
 	if (!ret) {
 		switch (cmp) {
+1 -1
arch/arm64/include/asm/topology.h
···
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
-#define topology_thread_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
 
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
+1 -1
arch/arm64/mm/fault.c
···
 	/*
 	 * If we're in an interrupt or have no user context, we must not take
 	 * the fault.
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;
 
 	if (user_mode(regs))
+8 -4
arch/avr32/include/asm/uaccess.h
···
  * @x:   Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
···
  * @x:   Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
···
  * @x:   Value to copy to user space.
  * @ptr: Destination address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
  *
  * This macro copies a single simple value from kernel space to user
  * space. It supports simple types like char and int, but not larger
···
  * @x:   Variable to store result.
  * @ptr: Source address, in user space.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
  *
  * This macro copies a single simple variable from user space to kernel
  * space. It supports simple types like char and int, but not larger
+2 -2
arch/avr32/mm/fault.c
···
 #include <linux/pagemap.h>
 #include <linux/kdebug.h>
 #include <linux/kprobes.h>
+#include <linux/uaccess.h>
 
 #include <asm/mmu_context.h>
 #include <asm/sysreg.h>
 #include <asm/tlb.h>
-#include <asm/uaccess.h>
 
 #ifdef CONFIG_KPROBES
 static inline int notify_page_fault(struct pt_regs *regs, int trap)
···
 	 * If we're in an interrupt or have no user context, we must
 	 * not take the fault...
 	 */
-	if (in_atomic() || !mm || regs->sr & SYSREG_BIT(GM))
+	if (faulthandler_disabled() || !mm || regs->sr & SYSREG_BIT(GM))
 		goto no_context;
 
 	local_irq_enable();
+3 -3
arch/cris/mm/fault.c
···
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/wait.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <arch/system.h>
 
 extern int find_fixup_code(struct pt_regs *);
···
 	info.si_code = SEGV_MAPERR;
 
 	/*
-	 * If we're in an interrupt or "atomic" operation or have no
+	 * If we're in an interrupt, have pagefaults disabled or have no
 	 * user context, we must not take the fault.
 	 */
 
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;
 
 	if (user_mode(regs))
+2 -2
arch/frv/mm/fault.c
···
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
 #include <linux/hardirq.h>
+#include <linux/uaccess.h>
 
 #include <asm/pgtable.h>
-#include <asm/uaccess.h>
 #include <asm/gdb-stub.h>
 
 /*****************************************************************************/
···
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (faulthandler_disabled() || !mm)
 		goto no_context;
 
 	if (user_mode(__frame))
+2
arch/frv/mm/highmem.c
··· 42 42 unsigned long paddr; 43 43 int type; 44 44 45 + preempt_disable(); 45 46 pagefault_disable(); 46 47 type = kmap_atomic_idx_push(); 47 48 paddr = page_to_phys(page); ··· 86 85 } 87 86 kmap_atomic_idx_pop(); 88 87 pagefault_enable(); 88 + preempt_enable(); 89 89 } 90 90 EXPORT_SYMBOL(__kunmap_atomic);
+2 -1
arch/hexagon/include/asm/uaccess.h
··· 36 36 * @addr: User space pointer to start of block to check 37 37 * @size: Size of block to check 38 38 * 39 - * Context: User context only. This function may sleep. 39 + * Context: User context only. This function may sleep if pagefaults are 40 + * enabled. 40 41 * 41 42 * Checks if a pointer to a block of memory in user space is valid. 42 43 *
+1 -1
arch/ia64/include/asm/topology.h
··· 53 53 #define topology_physical_package_id(cpu) (cpu_data(cpu)->socket_id) 54 54 #define topology_core_id(cpu) (cpu_data(cpu)->core_id) 55 55 #define topology_core_cpumask(cpu) (&cpu_core_map[cpu]) 56 - #define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu)) 56 + #define topology_sibling_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu)) 57 57 #endif 58 58 59 59 extern void arch_fix_phys_package_id(int num, u32 slot);
+2 -2
arch/ia64/mm/fault.c
··· 11 11 #include <linux/kprobes.h> 12 12 #include <linux/kdebug.h> 13 13 #include <linux/prefetch.h> 14 + #include <linux/uaccess.h> 14 15 15 16 #include <asm/pgtable.h> 16 17 #include <asm/processor.h> 17 - #include <asm/uaccess.h> 18 18 19 19 extern int die(char *, struct pt_regs *, long); 20 20 ··· 96 96 /* 97 97 * If we're in an interrupt or have no user context, we must not take the fault.. 98 98 */ 99 - if (in_atomic() || !mm) 99 + if (faulthandler_disabled() || !mm) 100 100 goto no_context; 101 101 102 102 #ifdef CONFIG_VIRTUAL_MEM_MAP
+20 -10
arch/m32r/include/asm/uaccess.h
··· 91 91 * @addr: User space pointer to start of block to check 92 92 * @size: Size of block to check 93 93 * 94 - * Context: User context only. This function may sleep. 94 + * Context: User context only. This function may sleep if pagefaults are 95 + * enabled. 95 96 * 96 97 * Checks if a pointer to a block of memory in user space is valid. 97 98 * ··· 156 155 * @x: Variable to store result. 157 156 * @ptr: Source address, in user space. 158 157 * 159 - * Context: User context only. This function may sleep. 158 + * Context: User context only. This function may sleep if pagefaults are 159 + * enabled. 160 160 * 161 161 * This macro copies a single simple variable from user space to kernel 162 162 * space. It supports simple types like char and int, but not larger ··· 177 175 * @x: Value to copy to user space. 178 176 * @ptr: Destination address, in user space. 179 177 * 180 - * Context: User context only. This function may sleep. 178 + * Context: User context only. This function may sleep if pagefaults are 179 + * enabled. 181 180 * 182 181 * This macro copies a single simple value from kernel space to user 183 182 * space. It supports simple types like char and int, but not larger ··· 197 194 * @x: Variable to store result. 198 195 * @ptr: Source address, in user space. 199 196 * 200 - * Context: User context only. This function may sleep. 197 + * Context: User context only. This function may sleep if pagefaults are 198 + * enabled. 201 199 * 202 200 * This macro copies a single simple variable from user space to kernel 203 201 * space. It supports simple types like char and int, but not larger ··· 278 274 * @x: Value to copy to user space. 279 275 * @ptr: Destination address, in user space. 280 276 * 281 - * Context: User context only. This function may sleep. 277 + * Context: User context only. This function may sleep if pagefaults are 278 + * enabled. 282 279 * 283 280 * This macro copies a single simple value from kernel space to user 284 281 * space. It supports simple types like char and int, but not larger ··· 573 568 * @from: Source address, in kernel space. 574 569 * @n: Number of bytes to copy. 575 570 * 576 - * Context: User context only. This function may sleep. 571 + * Context: User context only. This function may sleep if pagefaults are 572 + * enabled. 577 573 * 578 574 * Copy data from kernel space to user space. Caller must check 579 575 * the specified block with access_ok() before calling this function. ··· 594 588 * @from: Source address, in kernel space. 595 589 * @n: Number of bytes to copy. 596 590 * 597 - * Context: User context only. This function may sleep. 591 + * Context: User context only. This function may sleep if pagefaults are 592 + * enabled. 598 593 * 599 594 * Copy data from kernel space to user space. 600 595 * ··· 613 606 * @from: Source address, in user space. 614 607 * @n: Number of bytes to copy. 615 608 * 616 - * Context: User context only. This function may sleep. 609 + * Context: User context only. This function may sleep if pagefaults are 610 + * enabled. 617 611 * 618 612 * Copy data from user space to kernel space. Caller must check 619 613 * the specified block with access_ok() before calling this function. ··· 634 626 * @from: Source address, in user space. 635 627 * @n: Number of bytes to copy. 636 628 * 637 - * Context: User context only. This function may sleep. 629 + * Context: User context only. This function may sleep if pagefaults are 630 + * enabled. 638 631 * 639 632 * Copy data from user space to kernel space. 640 633 * ··· 686 677 * strlen_user: - Get the size of a string in user space. 687 678 * @str: The string to measure. 688 679 * 689 - * Context: User context only. This function may sleep. 680 + * Context: User context only. This function may sleep if pagefaults are 681 + * enabled. 690 682 * 691 683 * Get the size of a NUL-terminated string in user space. 692 684 *
+4 -4
arch/m32r/mm/fault.c
··· 24 24 #include <linux/vt_kern.h> /* For unblank_screen() */ 25 25 #include <linux/highmem.h> 26 26 #include <linux/module.h> 27 + #include <linux/uaccess.h> 27 28 28 29 #include <asm/m32r.h> 29 - #include <asm/uaccess.h> 30 30 #include <asm/hardirq.h> 31 31 #include <asm/mmu_context.h> 32 32 #include <asm/tlbflush.h> ··· 111 111 mm = tsk->mm; 112 112 113 113 /* 114 - * If we're in an interrupt or have no user context or are running in an 115 - * atomic region then we must not take the fault.. 114 + * If we're in an interrupt or have no user context or have pagefaults 115 + * disabled then we must not take the fault. 116 116 */ 117 - if (in_atomic() || !mm) 117 + if (faulthandler_disabled() || !mm) 118 118 goto bad_area_nosemaphore; 119 119 120 120 if (error_code & ACE_USERMODE)
-3
arch/m68k/include/asm/irqflags.h
··· 2 2 #define _M68K_IRQFLAGS_H 3 3 4 4 #include <linux/types.h> 5 - #ifdef CONFIG_MMU 6 - #include <linux/preempt_mask.h> 7 - #endif 8 5 #include <linux/preempt.h> 9 6 #include <asm/thread_info.h> 10 7 #include <asm/entry.h>
+2 -2
arch/m68k/mm/fault.c
··· 10 10 #include <linux/ptrace.h> 11 11 #include <linux/interrupt.h> 12 12 #include <linux/module.h> 13 + #include <linux/uaccess.h> 13 14 14 15 #include <asm/setup.h> 15 16 #include <asm/traps.h> 16 - #include <asm/uaccess.h> 17 17 #include <asm/pgalloc.h> 18 18 19 19 extern void die_if_kernel(char *, struct pt_regs *, long); ··· 81 81 * If we're in an interrupt or have no user 82 82 * context, we must not take the fault.. 83 83 */ 84 - if (in_atomic() || !mm) 84 + if (faulthandler_disabled() || !mm) 85 85 goto no_context; 86 86 87 87 if (user_mode(regs))
+1 -1
arch/metag/mm/fault.c
··· 105 105 106 106 mm = tsk->mm; 107 107 108 - if (in_atomic() || !mm) 108 + if (faulthandler_disabled() || !mm) 109 109 goto no_context; 110 110 111 111 if (user_mode(regs))
+3 -1
arch/metag/mm/highmem.c
··· 43 43 unsigned long vaddr; 44 44 int type; 45 45 46 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 46 + preempt_disable(); 47 47 pagefault_disable(); 48 48 if (!PageHighMem(page)) 49 49 return page_address(page); ··· 82 82 } 83 83 84 84 pagefault_enable(); 85 + preempt_enable(); 85 86 } 86 87 EXPORT_SYMBOL(__kunmap_atomic); 87 88 ··· 96 95 unsigned long vaddr; 97 96 int type; 98 97 98 + preempt_disable(); 99 99 pagefault_disable(); 100 100 101 101 type = kmap_atomic_idx_push();
+4 -2
arch/microblaze/include/asm/uaccess.h
··· 178 178 * @x: Variable to store result. 179 179 * @ptr: Source address, in user space. 180 180 * 181 - * Context: User context only. This function may sleep. 181 + * Context: User context only. This function may sleep if pagefaults are 182 + * enabled. 182 183 * 183 184 * This macro copies a single simple variable from user space to kernel 184 185 * space. It supports simple types like char and int, but not larger ··· 291 290 * @x: Value to copy to user space. 292 291 * @ptr: Destination address, in user space. 293 292 * 294 - * Context: User context only. This function may sleep. 293 + * Context: User context only. This function may sleep if pagefaults are 294 + * enabled. 295 295 * 296 296 * This macro copies a single simple value from kernel space to user 297 297 * space. It supports simple types like char and int, but not larger
+4 -4
arch/microblaze/mm/fault.c
··· 107 107 if ((error_code & 0x13) == 0x13 || (error_code & 0x11) == 0x11) 108 108 is_write = 0; 109 109 110 - if (unlikely(in_atomic() || !mm)) { 110 + if (unlikely(faulthandler_disabled() || !mm)) { 111 111 if (kernel_mode(regs)) 112 112 goto bad_area_nosemaphore; 113 113 114 - /* in_atomic() in user mode is really bad, 114 + /* faulthandler_disabled() in user mode is really bad, 115 115 as is current->mm == NULL. */ 116 - pr_emerg("Page fault in user mode with in_atomic(), mm = %p\n", 117 - mm); 116 + pr_emerg("Page fault in user mode with faulthandler_disabled(), mm = %p\n", 117 + mm); 118 118 pr_emerg("r15 = %lx MSR = %lx\n", 119 119 regs->r15, regs->msr); 120 120 die("Weird page fault", regs, SIGSEGV);
+3 -1
arch/microblaze/mm/highmem.c
··· 37 37 unsigned long vaddr; 38 38 int idx, type; 39 39 40 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 40 + preempt_disable(); 41 41 pagefault_disable(); 42 42 if (!PageHighMem(page)) 43 43 return page_address(page); ··· 63 63 64 64 if (vaddr < __fix_to_virt(FIX_KMAP_END)) { 65 65 pagefault_enable(); 66 + preempt_enable(); 66 67 return; 67 68 } 68 69 ··· 85 84 #endif 86 85 kmap_atomic_idx_pop(); 87 86 pagefault_enable(); 87 + preempt_enable(); 88 88 } 89 89 EXPORT_SYMBOL(__kunmap_atomic);
+1 -1
arch/mips/include/asm/topology.h
··· 15 15 #define topology_physical_package_id(cpu) (cpu_data[cpu].package) 16 16 #define topology_core_id(cpu) (cpu_data[cpu].core) 17 17 #define topology_core_cpumask(cpu) (&cpu_core_map[cpu]) 18 - #define topology_thread_cpumask(cpu) (&cpu_sibling_map[cpu]) 18 + #define topology_sibling_cpumask(cpu) (&cpu_sibling_map[cpu]) 19 19 #endif 20 20 21 21 #endif /* __ASM_TOPOLOGY_H */
+30 -15
arch/mips/include/asm/uaccess.h
··· 103 103 * @addr: User space pointer to start of block to check 104 104 * @size: Size of block to check 105 105 * 106 - * Context: User context only. This function may sleep. 106 + * Context: User context only. This function may sleep if pagefaults are 107 + * enabled. 107 108 * 108 109 * Checks if a pointer to a block of memory in user space is valid. 109 110 * ··· 139 138 * @x: Value to copy to user space. 140 139 * @ptr: Destination address, in user space. 141 140 * 142 - * Context: User context only. This function may sleep. 141 + * Context: User context only. This function may sleep if pagefaults are 142 + * enabled. 143 143 * 144 144 * This macro copies a single simple value from kernel space to user 145 145 * space. It supports simple types like char and int, but not larger ··· 159 157 * @x: Variable to store result. 160 158 * @ptr: Source address, in user space. 161 159 * 162 - * Context: User context only. This function may sleep. 160 + * Context: User context only. This function may sleep if pagefaults are 161 + * enabled. 163 162 * 164 163 * This macro copies a single simple variable from user space to kernel 165 164 * space. It supports simple types like char and int, but not larger ··· 180 177 * @x: Value to copy to user space. 181 178 * @ptr: Destination address, in user space. 182 179 * 183 - * Context: User context only. This function may sleep. 180 + * Context: User context only. This function may sleep if pagefaults are 181 + * enabled. 184 182 * 185 183 * This macro copies a single simple value from kernel space to user 186 184 * space. It supports simple types like char and int, but not larger ··· 203 199 * @x: Variable to store result. 204 200 * @ptr: Source address, in user space. 205 201 * 206 - * Context: User context only. This function may sleep. 202 + * Context: User context only. This function may sleep if pagefaults are 203 + * enabled. 207 204 * 208 205 * This macro copies a single simple variable from user space to kernel 209 206 * space. It supports simple types like char and int, but not larger ··· 503 498 * @x: Value to copy to user space. 504 499 * @ptr: Destination address, in user space. 505 500 * 506 - * Context: User context only. This function may sleep. 501 + * Context: User context only. This function may sleep if pagefaults are 502 + * enabled. 507 503 * 508 504 * This macro copies a single simple value from kernel space to user 509 505 * space. It supports simple types like char and int, but not larger ··· 523 517 * @x: Variable to store result. 524 518 * @ptr: Source address, in user space. 525 519 * 526 - * Context: User context only. This function may sleep. 520 + * Context: User context only. This function may sleep if pagefaults are 521 + * enabled. 527 522 * 528 523 * This macro copies a single simple variable from user space to kernel 529 524 * space. It supports simple types like char and int, but not larger ··· 544 537 * @x: Value to copy to user space. 545 538 * @ptr: Destination address, in user space. 546 539 * 547 - * Context: User context only. This function may sleep. 540 + * Context: User context only. This function may sleep if pagefaults are 541 + * enabled. 548 542 * 549 543 * This macro copies a single simple value from kernel space to user 550 544 * space. It supports simple types like char and int, but not larger ··· 567 559 * @x: Variable to store result. 568 560 * @ptr: Source address, in user space. 569 561 * 570 - * Context: User context only. This function may sleep. 562 + * Context: User context only. This function may sleep if pagefaults are 563 + * enabled. 571 564 * 572 565 * This macro copies a single simple variable from user space to kernel 573 566 * space. It supports simple types like char and int, but not larger ··· 824 815 * @from: Source address, in kernel space. 825 816 * @n: Number of bytes to copy. 826 817 * 827 - * Context: User context only. This function may sleep. 818 + * Context: User context only. This function may sleep if pagefaults are 819 + * enabled. 828 820 * 829 821 * Copy data from kernel space to user space. Caller must check 830 822 * the specified block with access_ok() before calling this function. ··· 898 888 * @from: Source address, in kernel space. 899 889 * @n: Number of bytes to copy. 900 890 * 901 - * Context: User context only. This function may sleep. 891 + * Context: User context only. This function may sleep if pagefaults are 892 + * enabled. 902 893 * 903 894 * Copy data from kernel space to user space. 904 895 * ··· 1086 1075 * @from: Source address, in user space. 1087 1076 * @n: Number of bytes to copy. 1088 1077 * 1089 - * Context: User context only. This function may sleep. 1078 + * Context: User context only. This function may sleep if pagefaults are 1079 + * enabled. 1090 1080 * 1091 1081 * Copy data from user space to kernel space. Caller must check 1092 1082 * the specified block with access_ok() before calling this function. ··· 1119 1107 * @from: Source address, in user space. 1120 1108 * @n: Number of bytes to copy. 1121 1109 * 1122 - * Context: User context only. This function may sleep. 1110 + * Context: User context only. This function may sleep if pagefaults are 1111 + * enabled. 1123 1112 * 1124 1113 * Copy data from user space to kernel space. 1125 1114 * ··· 1342 1329 * strlen_user: - Get the size of a string in user space. 1343 1330 * @str: The string to measure. 1344 1331 * 1345 - * Context: User context only. This function may sleep. 1332 + * Context: User context only. This function may sleep if pagefaults are 1333 + * enabled. 1346 1334 * 1347 1335 * Get the size of a NUL-terminated string in user space. 1348 1336 * ··· 1412 1398 * strnlen_user: - Get the size of a string in user space. 1413 1399 * @str: The string to measure. 1414 1400 * 1415 - * Context: User context only. This function may sleep. 1401 + * Context: User context only. This function may sleep if pagefaults are 1402 + * enabled. 1416 1403 * 1417 1404 * Get the size of a NUL-terminated string in user space. 1418 1405 *
+2 -7
arch/mips/kernel/signal-common.h
··· 28 28 extern int fpcsr_pending(unsigned int __user *fpcsr); 29 29 30 30 /* Make sure we will not lose FPU ownership */ 31 - #ifdef CONFIG_PREEMPT 32 - #define lock_fpu_owner() preempt_disable() 33 - #define unlock_fpu_owner() preempt_enable() 34 - #else 35 - #define lock_fpu_owner() pagefault_disable() 36 - #define unlock_fpu_owner() pagefault_enable() 37 - #endif 31 + #define lock_fpu_owner() ({ preempt_disable(); pagefault_disable(); }) 32 + #define unlock_fpu_owner() ({ pagefault_enable(); preempt_enable(); }) 38 33 39 34 #endif /* __SIGNAL_COMMON_H */
+2 -2
arch/mips/mm/fault.c
··· 21 21 #include <linux/module.h> 22 22 #include <linux/kprobes.h> 23 23 #include <linux/perf_event.h> 24 + #include <linux/uaccess.h> 24 25 25 26 #include <asm/branch.h> 26 27 #include <asm/mmu_context.h> 27 - #include <asm/uaccess.h> 28 28 #include <asm/ptrace.h> 29 29 #include <asm/highmem.h> /* For VMALLOC_END */ 30 30 #include <linux/kdebug.h> ··· 94 94 * If we're in an interrupt or have no user 95 95 * context, we must not take the fault.. 96 96 */ 97 - if (in_atomic() || !mm) 97 + if (faulthandler_disabled() || !mm) 98 98 goto bad_area_nosemaphore; 99 99 100 100 if (user_mode(regs))
+4 -1
arch/mips/mm/highmem.c
··· 47 47 unsigned long vaddr; 48 48 int idx, type; 49 49 50 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 50 + preempt_disable(); 51 51 pagefault_disable(); 52 52 if (!PageHighMem(page)) 53 53 return page_address(page); ··· 72 72 73 73 if (vaddr < FIXADDR_START) { // FIXME 74 74 pagefault_enable(); 75 + preempt_enable(); 75 76 return; 76 77 } 77 78 ··· 93 92 #endif 94 93 kmap_atomic_idx_pop(); 95 94 pagefault_enable(); 95 + preempt_enable(); 96 96 } 97 97 EXPORT_SYMBOL(__kunmap_atomic); 98 98 ··· 106 104 unsigned long vaddr; 107 105 int idx, type; 108 106 107 + preempt_disable(); 109 108 pagefault_disable(); 110 109 111 110 type = kmap_atomic_idx_push();
+2
arch/mips/mm/init.c
··· 90 90 91 91 BUG_ON(Page_dcache_dirty(page)); 92 92 93 + preempt_disable(); 93 94 pagefault_disable(); 94 95 idx = (addr >> PAGE_SHIFT) & (FIX_N_COLOURS - 1); 95 96 idx += in_interrupt() ? FIX_N_COLOURS : 0; ··· 153 152 write_c0_entryhi(old_ctx); 154 153 local_irq_restore(flags); 155 154 pagefault_enable(); 155 + preempt_enable(); 156 156 } 157 157 158 158 void copy_user_highpage(struct page *to, struct page *from,
+3
arch/mn10300/include/asm/highmem.h
··· 75 75 unsigned long vaddr; 76 76 int idx, type; 77 77 78 + preempt_disable(); 78 79 pagefault_disable(); 79 80 if (page < highmem_start_page) 80 81 return page_address(page); ··· 99 98 100 99 if (vaddr < FIXADDR_START) { /* FIXME */ 101 100 pagefault_enable(); 101 + preempt_enable(); 102 102 return; 103 103 } 104 104 ··· 124 122 125 123 kmap_atomic_idx_pop(); 126 124 pagefault_enable(); 125 + preempt_enable(); 127 126 } 128 127 #endif /* __KERNEL__ */ 129 128
+2 -2
arch/mn10300/mm/fault.c
··· 23 23 #include <linux/interrupt.h> 24 24 #include <linux/init.h> 25 25 #include <linux/vt_kern.h> /* For unblank_screen() */ 26 + #include <linux/uaccess.h> 26 27 27 - #include <asm/uaccess.h> 28 28 #include <asm/pgalloc.h> 29 29 #include <asm/hardirq.h> 30 30 #include <asm/cpu-regs.h> ··· 168 168 * If we're in an interrupt or have no user 169 169 * context, we must not take the fault.. 170 170 */ 171 - if (in_atomic() || !mm) 171 + if (faulthandler_disabled() || !mm) 172 172 goto no_context; 173 173 174 174 if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+1 -1
arch/nios2/mm/fault.c
··· 77 77 * If we're in an interrupt or have no user 78 78 * context, we must not take the fault.. 79 79 */ 80 - if (in_atomic() || !mm) 80 + if (faulthandler_disabled() || !mm) 81 81 goto bad_area_nosemaphore; 82 82 83 83 if (user_mode(regs))
+2
arch/parisc/include/asm/cacheflush.h
··· 142 142 143 143 static inline void *kmap_atomic(struct page *page) 144 144 { 145 + preempt_disable(); 145 146 pagefault_disable(); 146 147 return page_address(page); 147 148 } ··· 151 150 { 152 151 flush_kernel_dcache_page_addr(addr); 153 152 pagefault_enable(); 153 + preempt_enable(); 154 154 } 155 155 156 156 #define kmap_atomic_prot(page, prot) kmap_atomic(page)
+2 -2
arch/parisc/kernel/traps.c
··· 26 26 #include <linux/console.h> 27 27 #include <linux/bug.h> 28 28 #include <linux/ratelimit.h> 29 + #include <linux/uaccess.h> 29 30 30 31 #include <asm/assembly.h> 31 - #include <asm/uaccess.h> 32 32 #include <asm/io.h> 33 33 #include <asm/irq.h> 34 34 #include <asm/traps.h> ··· 800 800 * unless pagefault_disable() was called before. 801 801 */ 802 802 803 - if (fault_space == 0 && !in_atomic()) 803 + if (fault_space == 0 && !faulthandler_disabled()) 804 804 { 805 805 pdc_chassis_send_status(PDC_CHASSIS_DIRECT_PANIC); 806 806 parisc_terminate("Kernel Fault", regs, code, fault_address);
+2 -2
arch/parisc/mm/fault.c
··· 15 15 #include <linux/sched.h> 16 16 #include <linux/interrupt.h> 17 17 #include <linux/module.h> 18 + #include <linux/uaccess.h> 18 19 19 - #include <asm/uaccess.h> 20 20 #include <asm/traps.h> 21 21 22 22 /* Various important other fields */ ··· 207 207 int fault; 208 208 unsigned int flags; 209 209 210 - if (in_atomic()) 210 + if (pagefault_disabled()) 211 211 goto no_context; 212 212 213 213 tsk = current;
+1 -1
arch/powerpc/include/asm/topology.h
··· 87 87 #include <asm/smp.h> 88 88 89 89 #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) 90 - #define topology_thread_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) 90 + #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) 91 91 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) 92 92 #define topology_core_id(cpu) (cpu_to_core_id(cpu)) 93 93 #endif
+6 -5
arch/powerpc/lib/vmx-helper.c
··· 27 27 if (in_interrupt()) 28 28 return 0; 29 29 30 - /* This acts as preempt_disable() as well and will make 31 - * enable_kernel_altivec(). We need to disable page faults 32 - * as they can call schedule and thus make us lose the VMX 33 - * context. So on page faults, we just fail which will cause 34 - * a fallback to the normal non-vmx copy. 30 + preempt_disable(); 31 + /* 32 + * We need to disable page faults as they can call schedule and 33 + * thus make us lose the VMX context. So on page faults, we just 34 + * fail which will cause a fallback to the normal non-vmx copy. 35 35 */ 36 36 pagefault_disable(); 37 37 ··· 47 47 int exit_vmx_usercopy(void) 48 48 { 49 49 pagefault_enable(); 50 + preempt_enable(); 50 51 return 0; 51 52 } 52 53
+5 -4
arch/powerpc/mm/fault.c
··· 33 33 #include <linux/ratelimit.h> 34 34 #include <linux/context_tracking.h> 35 35 #include <linux/hugetlb.h> 36 + #include <linux/uaccess.h> 36 37 37 38 #include <asm/firmware.h> 38 39 #include <asm/page.h> 39 40 #include <asm/pgtable.h> 40 41 #include <asm/mmu.h> 41 42 #include <asm/mmu_context.h> 42 - #include <asm/uaccess.h> 43 43 #include <asm/tlbflush.h> 44 44 #include <asm/siginfo.h> 45 45 #include <asm/debug.h> ··· 272 272 if (!arch_irq_disabled_regs(regs)) 273 273 local_irq_enable(); 274 274 275 - if (in_atomic() || mm == NULL) { 275 + if (faulthandler_disabled() || mm == NULL) { 276 276 if (!user_mode(regs)) { 277 277 rc = SIGSEGV; 278 278 goto bail; 279 279 } 280 - /* in_atomic() in user mode is really bad, 280 + /* faulthandler_disabled() in user mode is really bad, 281 281 as is current->mm == NULL. */ 282 282 printk(KERN_EMERG "Page fault in user mode with " 283 - "in_atomic() = %d mm = %p\n", in_atomic(), mm); 283 + "faulthandler_disabled() = %d mm = %p\n", 284 + faulthandler_disabled(), mm); 284 285 printk(KERN_EMERG "NIP = %lx MSR = %lx\n", 285 286 regs->nip, regs->msr); 286 287 die("Weird page fault", regs, SIGSEGV);
+3 -1
arch/powerpc/mm/highmem.c
··· 34 34 unsigned long vaddr; 35 35 int idx, type; 36 36 37 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 37 + preempt_disable(); 38 38 pagefault_disable(); 39 39 if (!PageHighMem(page)) 40 40 return page_address(page); ··· 59 59 60 60 if (vaddr < __fix_to_virt(FIX_KMAP_END)) { 61 61 pagefault_enable(); 62 + preempt_enable(); 62 63 return; 63 64 } 64 65 ··· 83 82 84 83 kmap_atomic_idx_pop(); 85 84 pagefault_enable(); 85 + preempt_enable(); 86 86 } 87 87 EXPORT_SYMBOL(__kunmap_atomic);
+1 -1
arch/powerpc/mm/tlb_nohash.c
··· 217 217 static int mm_is_core_local(struct mm_struct *mm) 218 218 { 219 219 return cpumask_subset(mm_cpumask(mm), 220 - topology_thread_cpumask(smp_processor_id())); 220 + topology_sibling_cpumask(smp_processor_id())); 221 221 } 222 222 223 223 struct tlb_flush_param {
+2 -1
arch/s390/include/asm/topology.h
··· 22 22 23 23 #define topology_physical_package_id(cpu) (per_cpu(cpu_topology, cpu).socket_id) 24 24 #define topology_thread_id(cpu) (per_cpu(cpu_topology, cpu).thread_id) 25 - #define topology_thread_cpumask(cpu) (&per_cpu(cpu_topology, cpu).thread_mask) 25 + #define topology_sibling_cpumask(cpu) \ 26 + (&per_cpu(cpu_topology, cpu).thread_mask) 26 27 #define topology_core_id(cpu) (per_cpu(cpu_topology, cpu).core_id) 27 28 #define topology_core_cpumask(cpu) (&per_cpu(cpu_topology, cpu).core_mask) 28 29 #define topology_book_id(cpu) (per_cpu(cpu_topology, cpu).book_id)
+10 -5
arch/s390/include/asm/uaccess.h
··· 98 98 * @from: Source address, in user space. 99 99 * @n: Number of bytes to copy. 100 100 * 101 - * Context: User context only. This function may sleep. 101 + * Context: User context only. This function may sleep if pagefaults are 102 + * enabled. 102 103 * 103 104 * Copy data from user space to kernel space. Caller must check 104 105 * the specified block with access_ok() before calling this function. ··· 119 118 * @from: Source address, in kernel space. 120 119 * @n: Number of bytes to copy. 121 120 * 122 - * Context: User context only. This function may sleep. 121 + * Context: User context only. This function may sleep if pagefaults are 122 + * enabled. 123 123 * 124 124 * Copy data from kernel space to user space. Caller must check 125 125 * the specified block with access_ok() before calling this function. ··· 266 264 * @from: Source address, in kernel space. 267 265 * @n: Number of bytes to copy. 268 266 * 269 - * Context: User context only. This function may sleep. 267 + * Context: User context only. This function may sleep if pagefaults are 268 + * enabled. 270 269 * 271 270 * Copy data from kernel space to user space. 272 271 * ··· 293 290 * @from: Source address, in user space. 294 291 * @n: Number of bytes to copy. 295 292 * 296 - * Context: User context only. This function may sleep. 293 + * Context: User context only. This function may sleep if pagefaults are 294 + * enabled. 297 295 * 298 296 * Copy data from user space to kernel space. 299 297 * ··· 352 348 * strlen_user: - Get the size of a string in user space. 353 349 * @str: The string to measure. 354 350 * 355 - * Context: User context only. This function may sleep. 351 + * Context: User context only. This function may sleep if pagefaults are 352 + * enabled. 356 353 * 357 354 * Get the size of a NUL-terminated string in user space. 358 355 *
+1 -1
arch/s390/mm/fault.c
··· 399 399 * user context. 400 400 */ 401 401 fault = VM_FAULT_BADCONTEXT; 402 - if (unlikely(!user_space_fault(regs) || in_atomic() || !mm)) 402 + if (unlikely(!user_space_fault(regs) || faulthandler_disabled() || !mm)) 403 403 goto out; 404 404 405 405 address = trans_exc_code & __FAIL_ADDR_MASK;
+10 -5
arch/score/include/asm/uaccess.h
··· 36 36 * @addr: User space pointer to start of block to check 37 37 * @size: Size of block to check 38 38 * 39 - * Context: User context only. This function may sleep. 39 + * Context: User context only. This function may sleep if pagefaults are 40 + * enabled. 40 41 * 41 42 * Checks if a pointer to a block of memory in user space is valid. 42 43 * ··· 62 61 * @x: Value to copy to user space. 63 62 * @ptr: Destination address, in user space. 64 63 * 65 - * Context: User context only. This function may sleep. 64 + * Context: User context only. This function may sleep if pagefaults are 65 + * enabled. 66 66 * 67 67 * This macro copies a single simple value from kernel space to user 68 68 * space. It supports simple types like char and int, but not larger ··· 81 79 * @x: Variable to store result. 82 80 * @ptr: Source address, in user space. 83 81 * 84 - * Context: User context only. This function may sleep. 82 + * Context: User context only. This function may sleep if pagefaults are 83 + * enabled. 85 84 * 86 85 * This macro copies a single simple variable from user space to kernel 87 86 * space. It supports simple types like char and int, but not larger ··· 101 98 * @x: Value to copy to user space. 102 99 * @ptr: Destination address, in user space. 103 100 * 104 - * Context: User context only. This function may sleep. 101 + * Context: User context only. This function may sleep if pagefaults are 102 + * enabled. 105 103 * 106 104 * This macro copies a single simple value from kernel space to user 107 105 * space. It supports simple types like char and int, but not larger ··· 123 119 * @x: Variable to store result. 124 120 * @ptr: Source address, in user space. 125 121 * 126 - * Context: User context only. This function may sleep. 122 + * Context: User context only. This function may sleep if pagefaults are 123 + * enabled. 127 124 * 128 125 * This macro copies a single simple variable from user space to kernel 129 126 * space. It supports simple types like char and int, but not larger
+2 -1
arch/score/mm/fault.c
··· 34 34 #include <linux/string.h> 35 35 #include <linux/types.h> 36 36 #include <linux/ptrace.h> 37 + #include <linux/uaccess.h> 37 38 38 39 /* 39 40 * This routine handles page faults. It determines the address, ··· 74 73 * If we're in an interrupt or have no user 75 74 * context, we must not take the fault.. 76 75 */ 77 - if (in_atomic() || !mm) 76 + if (pagefault_disabled() || !mm) 78 77 goto bad_area_nosemaphore; 79 78 80 79 if (user_mode(regs))
+3 -2
arch/sh/mm/fault.c
··· 17 17 #include <linux/kprobes.h> 18 18 #include <linux/perf_event.h> 19 19 #include <linux/kdebug.h> 20 + #include <linux/uaccess.h> 20 21 #include <asm/io_trapped.h> 21 22 #include <asm/mmu_context.h> 22 23 #include <asm/tlbflush.h> ··· 439 438 440 439 /* 441 440 * If we're in an interrupt, have no user context or are running 442 - * in an atomic region then we must not take the fault: 441 + * with pagefaults disabled then we must not take the fault: 443 442 */ 444 - if (unlikely(in_atomic() || !mm)) { 443 + if (unlikely(faulthandler_disabled() || !mm)) { 445 444 bad_area_nosemaphore(regs, error_code, address); 446 445 return; 447 446 }
+1 -1
arch/sparc/include/asm/topology_64.h
··· 41 41 #define topology_physical_package_id(cpu) (cpu_data(cpu).proc_id) 42 42 #define topology_core_id(cpu) (cpu_data(cpu).core_id) 43 43 #define topology_core_cpumask(cpu) (&cpu_core_sib_map[cpu]) 44 - #define topology_thread_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu)) 44 + #define topology_sibling_cpumask(cpu) (&per_cpu(cpu_sibling_map, cpu)) 45 45 #endif /* CONFIG_SMP */ 46 46 47 47 extern cpumask_t cpu_core_map[NR_CPUS];
+2 -2
arch/sparc/mm/fault_32.c
··· 21 21 #include <linux/perf_event.h> 22 22 #include <linux/interrupt.h> 23 23 #include <linux/kdebug.h> 24 + #include <linux/uaccess.h> 24 25 25 26 #include <asm/page.h> 26 27 #include <asm/pgtable.h> ··· 30 29 #include <asm/setup.h> 31 30 #include <asm/smp.h> 32 31 #include <asm/traps.h> 33 - #include <asm/uaccess.h> 34 32 35 33 #include "mm_32.h" 36 34 ··· 196 196 * If we're in an interrupt or have no user 197 197 * context, we must not take the fault.. 198 198 */ 199 - if (in_atomic() || !mm) 199 + if (pagefault_disabled() || !mm) 200 200 goto no_context; 201 201 202 202 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+2 -2
arch/sparc/mm/fault_64.c
··· 22 22 #include <linux/kdebug.h> 23 23 #include <linux/percpu.h> 24 24 #include <linux/context_tracking.h> 25 + #include <linux/uaccess.h> 25 26 26 27 #include <asm/page.h> 27 28 #include <asm/pgtable.h> 28 29 #include <asm/openprom.h> 29 30 #include <asm/oplib.h> 30 - #include <asm/uaccess.h> 31 31 #include <asm/asi.h> 32 32 #include <asm/lsu.h> 33 33 #include <asm/sections.h> ··· 330 330 * If we're in an interrupt or have no user 331 331 * context, we must not take the fault.. 332 332 */ 333 - if (in_atomic() || !mm) 333 + if (faulthandler_disabled() || !mm) 334 334 goto intr_or_no_mm; 335 335 336 336 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+3 -1
arch/sparc/mm/highmem.c
··· 53 53 unsigned long vaddr; 54 54 long idx, type; 55 55 56 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 56 + preempt_disable(); 57 57 pagefault_disable(); 58 58 if (!PageHighMem(page)) 59 59 return page_address(page); ··· 91 91 92 92 if (vaddr < FIXADDR_START) { // FIXME 93 93 pagefault_enable(); 94 + preempt_enable(); 94 95 return; 95 96 } 96 97 ··· 127 126 128 127 kmap_atomic_idx_pop(); 129 128 pagefault_enable(); 129 + preempt_enable(); 130 130 } 131 131 EXPORT_SYMBOL(__kunmap_atomic);
+1 -1
arch/sparc/mm/init_64.c
··· 2738 2738 struct mm_struct *mm = current->mm; 2739 2739 struct tsb_config *tp; 2740 2740 2741 - if (in_atomic() || !mm) { 2741 + if (faulthandler_disabled() || !mm) { 2742 2742 const struct exception_table_entry *entry; 2743 2743 2744 2744 entry = search_exception_tables(regs->tpc);
+1 -1
arch/tile/include/asm/topology.h
··· 55 55 #define topology_physical_package_id(cpu) ((void)(cpu), 0) 56 56 #define topology_core_id(cpu) (cpu) 57 57 #define topology_core_cpumask(cpu) ((void)(cpu), cpu_online_mask) 58 - #define topology_thread_cpumask(cpu) cpumask_of(cpu) 58 + #define topology_sibling_cpumask(cpu) cpumask_of(cpu) 59 59 #endif 60 60 61 61 #endif /* _ASM_TILE_TOPOLOGY_H */
+12 -6
arch/tile/include/asm/uaccess.h
··· 78 78 * @addr: User space pointer to start of block to check 79 79 * @size: Size of block to check 80 80 * 81 - * Context: User context only. This function may sleep. 81 + * Context: User context only. This function may sleep if pagefaults are 82 + * enabled. 82 83 * 83 84 * Checks if a pointer to a block of memory in user space is valid. 84 85 * ··· 193 192 * @x: Variable to store result. 194 193 * @ptr: Source address, in user space. 195 194 * 196 - * Context: User context only. This function may sleep. 195 + * Context: User context only. This function may sleep if pagefaults are 196 + * enabled. 197 197 * 198 198 * This macro copies a single simple variable from user space to kernel 199 199 * space. It supports simple types like char and int, but not larger ··· 276 274 * @x: Value to copy to user space. 277 275 * @ptr: Destination address, in user space. 278 276 * 279 - * Context: User context only. This function may sleep. 277 + * Context: User context only. This function may sleep if pagefaults are 278 + * enabled. 280 279 * 281 280 * This macro copies a single simple value from kernel space to user 282 281 * space. It supports simple types like char and int, but not larger ··· 333 330 * @from: Source address, in kernel space. 334 331 * @n: Number of bytes to copy. 335 332 * 336 - * Context: User context only. This function may sleep. 333 + * Context: User context only. This function may sleep if pagefaults are 334 + * enabled. 337 335 * 338 336 * Copy data from kernel space to user space. Caller must check 339 337 * the specified block with access_ok() before calling this function. ··· 370 366 * @from: Source address, in user space. 371 367 * @n: Number of bytes to copy. 372 368 * 373 - * Context: User context only. This function may sleep. 369 + * Context: User context only. This function may sleep if pagefaults are 370 + * enabled. 374 371 * 375 372 * Copy data from user space to kernel space. Caller must check 376 373 * the specified block with access_ok() before calling this function. ··· 442 437 * @from: Source address, in user space. 443 438 * @n: Number of bytes to copy. 444 439 * 445 - * Context: User context only. This function may sleep. 440 + * Context: User context only. This function may sleep if pagefaults are 441 + * enabled. 446 442 * 447 443 * Copy data from user space to user space. Caller must check 448 444 * the specified blocks with access_ok() before calling this function.
+2 -2
arch/tile/mm/fault.c
··· 354 354 355 355 /* 356 356 * If we're in an interrupt, have no user context or are running in an 357 - * atomic region then we must not take the fault. 357 + * region with pagefaults disabled then we must not take the fault. 358 358 */ 359 - if (in_atomic() || !mm) { 359 + if (pagefault_disabled() || !mm) { 360 360 vma = NULL; /* happy compiler */ 361 361 goto bad_area_nosemaphore; 362 362 }
+2 -1
arch/tile/mm/highmem.c
··· 201 201 int idx, type; 202 202 pte_t *pte; 203 203 204 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 204 + preempt_disable(); 205 205 pagefault_disable(); 206 206 207 207 /* Avoid icache flushes by disallowing atomic executable mappings. */ ··· 259 259 } 260 260 261 261 pagefault_enable(); 262 + preempt_enable(); 262 263 } 263 264 EXPORT_SYMBOL(__kunmap_atomic); 264 265
+3 -2
arch/um/kernel/trap.c
··· 7 7 #include <linux/sched.h> 8 8 #include <linux/hardirq.h> 9 9 #include <linux/module.h> 10 + #include <linux/uaccess.h> 10 11 #include <asm/current.h> 11 12 #include <asm/pgtable.h> 12 13 #include <asm/tlbflush.h> ··· 36 35 *code_out = SEGV_MAPERR; 37 36 38 37 /* 39 - * If the fault was during atomic operation, don't take the fault, just 38 + * If the fault was with pagefaults disabled, don't take the fault, just 40 39 * fail. 41 40 */ 42 - if (in_atomic()) 41 + if (faulthandler_disabled()) 43 42 goto out_nosemaphore; 44 43 45 44 if (is_user)
+1 -1
arch/unicore32/mm/fault.c
··· 218 218 * If we're in an interrupt or have no user 219 219 * context, we must not take the fault.. 220 220 */ 221 - if (in_atomic() || !mm) 221 + if (faulthandler_disabled() || !mm) 222 222 goto no_context; 223 223 224 224 if (user_mode(regs))
+3 -5
arch/x86/include/asm/preempt.h
··· 99 99 extern asmlinkage void ___preempt_schedule(void); 100 100 # define __preempt_schedule() asm ("call ___preempt_schedule") 101 101 extern asmlinkage void preempt_schedule(void); 102 - # ifdef CONFIG_CONTEXT_TRACKING 103 - extern asmlinkage void ___preempt_schedule_context(void); 104 - # define __preempt_schedule_context() asm ("call ___preempt_schedule_context") 105 - extern asmlinkage void preempt_schedule_context(void); 106 - # endif 102 + extern asmlinkage void ___preempt_schedule_notrace(void); 103 + # define __preempt_schedule_notrace() asm ("call ___preempt_schedule_notrace") 104 + extern asmlinkage void preempt_schedule_notrace(void); 107 105 #endif 108 106 109 107 #endif /* __ASM_PREEMPT_H */
-10
arch/x86/include/asm/smp.h
··· 37 37 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id); 38 38 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number); 39 39 40 - static inline struct cpumask *cpu_sibling_mask(int cpu) 41 - { 42 - return per_cpu(cpu_sibling_map, cpu); 43 - } 44 - 45 - static inline struct cpumask *cpu_core_mask(int cpu) 46 - { 47 - return per_cpu(cpu_core_map, cpu); 48 - } 49 - 50 40 static inline struct cpumask *cpu_llc_shared_mask(int cpu) 51 41 { 52 42 return per_cpu(cpu_llc_shared_map, cpu);
+1 -1
arch/x86/include/asm/topology.h
··· 124 124 125 125 #ifdef ENABLE_TOPO_DEFINES 126 126 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) 127 - #define topology_thread_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) 127 + #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) 128 128 #endif 129 129 130 130 static inline void arch_fix_phys_package_id(int num, u32 slot)
+10 -5
arch/x86/include/asm/uaccess.h
··· 74 74 * @addr: User space pointer to start of block to check 75 75 * @size: Size of block to check 76 76 * 77 - * Context: User context only. This function may sleep. 77 + * Context: User context only. This function may sleep if pagefaults are 78 + * enabled. 78 79 * 79 80 * Checks if a pointer to a block of memory in user space is valid. 80 81 * ··· 146 145 * @x: Variable to store result. 147 146 * @ptr: Source address, in user space. 148 147 * 149 - * Context: User context only. This function may sleep. 148 + * Context: User context only. This function may sleep if pagefaults are 149 + * enabled. 150 150 * 151 151 * This macro copies a single simple variable from user space to kernel 152 152 * space. It supports simple types like char and int, but not larger ··· 242 240 * @x: Value to copy to user space. 243 241 * @ptr: Destination address, in user space. 244 242 * 245 - * Context: User context only. This function may sleep. 243 + * Context: User context only. This function may sleep if pagefaults are 244 + * enabled. 246 245 * 247 246 * This macro copies a single simple value from kernel space to user 248 247 * space. It supports simple types like char and int, but not larger ··· 458 455 * @x: Variable to store result. 459 456 * @ptr: Source address, in user space. 460 457 * 461 - * Context: User context only. This function may sleep. 458 + * Context: User context only. This function may sleep if pagefaults are 459 + * enabled. 462 460 * 463 461 * This macro copies a single simple variable from user space to kernel 464 462 * space. It supports simple types like char and int, but not larger ··· 483 479 * @x: Value to copy to user space. 484 480 * @ptr: Destination address, in user space. 485 481 * 486 - * Context: User context only. This function may sleep. 482 + * Context: User context only. This function may sleep if pagefaults are 483 + * enabled. 487 484 * 488 485 * This macro copies a single simple value from kernel space to user 489 486 * space. It supports simple types like char and int, but not larger
+4 -2
arch/x86/include/asm/uaccess_32.h
··· 70 70 * @from: Source address, in kernel space. 71 71 * @n: Number of bytes to copy. 72 72 * 73 - * Context: User context only. This function may sleep. 73 + * Context: User context only. This function may sleep if pagefaults are 74 + * enabled. 74 75 * 75 76 * Copy data from kernel space to user space. Caller must check 76 77 * the specified block with access_ok() before calling this function. ··· 118 117 * @from: Source address, in user space. 119 118 * @n: Number of bytes to copy. 120 119 * 121 - * Context: User context only. This function may sleep. 120 + * Context: User context only. This function may sleep if pagefaults are 121 + * enabled. 122 122 * 123 123 * Copy data from user space to kernel space. Caller must check 124 124 * the specified block with access_ok() before calling this function.
+3 -3
arch/x86/kernel/cpu/perf_event_intel.c
··· 2576 2576 if (!(x86_pmu.flags & PMU_FL_NO_HT_SHARING)) { 2577 2577 void **onln = &cpuc->kfree_on_online[X86_PERF_KFREE_SHARED]; 2578 2578 2579 - for_each_cpu(i, topology_thread_cpumask(cpu)) { 2579 + for_each_cpu(i, topology_sibling_cpumask(cpu)) { 2580 2580 struct intel_shared_regs *pc; 2581 2581 2582 2582 pc = per_cpu(cpu_hw_events, i).shared_regs; ··· 2594 2594 cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR]; 2595 2595 2596 2596 if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) { 2597 - for_each_cpu(i, topology_thread_cpumask(cpu)) { 2597 + for_each_cpu(i, topology_sibling_cpumask(cpu)) { 2598 2598 struct intel_excl_cntrs *c; 2599 2599 2600 2600 c = per_cpu(cpu_hw_events, i).excl_cntrs; ··· 3362 3362 if (!(x86_pmu.flags & PMU_FL_EXCL_ENABLED)) 3363 3363 return 0; 3364 3364 3365 - w = cpumask_weight(topology_thread_cpumask(cpu)); 3365 + w = cpumask_weight(topology_sibling_cpumask(cpu)); 3366 3366 if (w > 1) { 3367 3367 pr_info("PMU erratum BJ122, BV98, HSD29 worked around, HT is on\n"); 3368 3368 return 0;
+2 -1
arch/x86/kernel/cpu/proc.c
··· 12 12 { 13 13 #ifdef CONFIG_SMP 14 14 seq_printf(m, "physical id\t: %d\n", c->phys_proc_id); 15 - seq_printf(m, "siblings\t: %d\n", cpumask_weight(cpu_core_mask(cpu))); 15 + seq_printf(m, "siblings\t: %d\n", 16 + cpumask_weight(topology_core_cpumask(cpu))); 16 17 seq_printf(m, "core id\t\t: %d\n", c->cpu_core_id); 17 18 seq_printf(m, "cpu cores\t: %d\n", c->booted_cores); 18 19 seq_printf(m, "apicid\t\t: %d\n", c->apicid);
+1 -3
arch/x86/kernel/i386_ksyms_32.c
··· 40 40 41 41 #ifdef CONFIG_PREEMPT 42 42 EXPORT_SYMBOL(___preempt_schedule); 43 - #ifdef CONFIG_CONTEXT_TRACKING 44 - EXPORT_SYMBOL(___preempt_schedule_context); 45 - #endif 43 + EXPORT_SYMBOL(___preempt_schedule_notrace); 46 44 #endif
+3 -4
arch/x86/kernel/process.c
··· 445 445 } 446 446 447 447 /* 448 - * MONITOR/MWAIT with no hints, used for default default C1 state. 449 - * This invokes MWAIT with interrutps enabled and no flags, 450 - * which is backwards compatible with the original MWAIT implementation. 448 + * MONITOR/MWAIT with no hints, used for default C1 state. This invokes MWAIT 449 + * with interrupts enabled and no flags, which is backwards compatible with the 450 + * original MWAIT implementation. 451 451 */ 452 - 453 452 static void mwait_idle(void) 454 453 { 455 454 if (!current_set_polling_and_test()) {
+22 -20
arch/x86/kernel/smpboot.c
··· 314 314 cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2)); 315 315 } 316 316 317 - #define link_mask(_m, c1, c2) \ 317 + #define link_mask(mfunc, c1, c2) \ 318 318 do { \ 319 - cpumask_set_cpu((c1), cpu_##_m##_mask(c2)); \ 320 - cpumask_set_cpu((c2), cpu_##_m##_mask(c1)); \ 319 + cpumask_set_cpu((c1), mfunc(c2)); \ 320 + cpumask_set_cpu((c2), mfunc(c1)); \ 321 321 } while (0) 322 322 323 323 static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o) ··· 398 398 cpumask_set_cpu(cpu, cpu_sibling_setup_mask); 399 399 400 400 if (!has_mp) { 401 - cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); 401 + cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu)); 402 402 cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu)); 403 - cpumask_set_cpu(cpu, cpu_core_mask(cpu)); 403 + cpumask_set_cpu(cpu, topology_core_cpumask(cpu)); 404 404 c->booted_cores = 1; 405 405 return; 406 406 } ··· 409 409 o = &cpu_data(i); 410 410 411 411 if ((i == cpu) || (has_smt && match_smt(c, o))) 412 - link_mask(sibling, cpu, i); 412 + link_mask(topology_sibling_cpumask, cpu, i); 413 413 414 414 if ((i == cpu) || (has_mp && match_llc(c, o))) 415 - link_mask(llc_shared, cpu, i); 415 + link_mask(cpu_llc_shared_mask, cpu, i); 416 416 417 417 } 418 418 419 419 /* 420 420 * This needs a separate iteration over the cpus because we rely on all 421 - * cpu_sibling_mask links to be set-up. 421 + * topology_sibling_cpumask links to be set-up. 422 422 */ 423 423 for_each_cpu(i, cpu_sibling_setup_mask) { 424 424 o = &cpu_data(i); 425 425 426 426 if ((i == cpu) || (has_mp && match_die(c, o))) { 427 - link_mask(core, cpu, i); 427 + link_mask(topology_core_cpumask, cpu, i); 428 428 429 429 /* 430 430 * Does this new cpu bringup a new core? 431 431 */ 432 - if (cpumask_weight(cpu_sibling_mask(cpu)) == 1) { 432 + if (cpumask_weight( 433 + topology_sibling_cpumask(cpu)) == 1) { 433 434 /* 434 435 * for each core in package, increment 435 436 * the booted_cores for this new cpu 436 437 */ 437 - if (cpumask_first(cpu_sibling_mask(i)) == i) 438 + if (cpumask_first( 439 + topology_sibling_cpumask(i)) == i) 438 440 c->booted_cores++; 439 441 /* 440 442 * increment the core count for all ··· 1011 1009 physid_set_mask_of_physid(boot_cpu_physical_apicid, &phys_cpu_present_map); 1012 1010 else 1013 1011 physid_set_mask_of_physid(0, &phys_cpu_present_map); 1014 - cpumask_set_cpu(0, cpu_sibling_mask(0)); 1015 - cpumask_set_cpu(0, cpu_core_mask(0)); 1012 + cpumask_set_cpu(0, topology_sibling_cpumask(0)); 1013 + cpumask_set_cpu(0, topology_core_cpumask(0)); 1016 1014 } 1017 1015 1018 1016 enum { ··· 1295 1293 int sibling; 1296 1294 struct cpuinfo_x86 *c = &cpu_data(cpu); 1297 1295 1298 - for_each_cpu(sibling, cpu_core_mask(cpu)) { 1299 - cpumask_clear_cpu(cpu, cpu_core_mask(sibling)); 1296 + for_each_cpu(sibling, topology_core_cpumask(cpu)) { 1297 + cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); 1300 1298 /*/ 1301 1299 * last thread sibling in this cpu core going down 1302 1300 */ 1303 - if (cpumask_weight(cpu_sibling_mask(cpu)) == 1) 1301 + if (cpumask_weight(topology_sibling_cpumask(cpu)) == 1) 1304 1302 cpu_data(sibling).booted_cores--; 1305 1303 } 1306 1304 1307 - for_each_cpu(sibling, cpu_sibling_mask(cpu)) 1308 - cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling)); 1305 + for_each_cpu(sibling, topology_sibling_cpumask(cpu)) 1306 + cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling)); 1309 1307 for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) 1310 1308 cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); 1311 1309 cpumask_clear(cpu_llc_shared_mask(cpu)); 1312 - cpumask_clear(cpu_sibling_mask(cpu)); 1313 - cpumask_clear(cpu_core_mask(cpu)); 1310 + cpumask_clear(topology_sibling_cpumask(cpu)); 1311 + cpumask_clear(topology_core_cpumask(cpu)); 1314 1312 c->phys_proc_id = 0; 1315 1313 c->cpu_core_id = 0; 1316 1314 cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
+1 -1
arch/x86/kernel/tsc_sync.c
··· 113 113 */ 114 114 static inline unsigned int loop_timeout(int cpu) 115 115 { 116 - return (cpumask_weight(cpu_core_mask(cpu)) > 1) ? 2 : 20; 116 + return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20; 117 117 } 118 118 119 119 /*
+1 -3
arch/x86/kernel/x8664_ksyms_64.c
··· 75 75 76 76 #ifdef CONFIG_PREEMPT 77 77 EXPORT_SYMBOL(___preempt_schedule); 78 - #ifdef CONFIG_CONTEXT_TRACKING 79 - EXPORT_SYMBOL(___preempt_schedule_context); 80 - #endif 78 + EXPORT_SYMBOL(___preempt_schedule_notrace); 81 79 #endif
+1 -3
arch/x86/lib/thunk_32.S
··· 38 38 39 39 #ifdef CONFIG_PREEMPT 40 40 THUNK ___preempt_schedule, preempt_schedule 41 - #ifdef CONFIG_CONTEXT_TRACKING 42 - THUNK ___preempt_schedule_context, preempt_schedule_context 43 - #endif 41 + THUNK ___preempt_schedule_notrace, preempt_schedule_notrace 44 42 #endif 45 43
+1 -3
arch/x86/lib/thunk_64.S
··· 49 49 50 50 #ifdef CONFIG_PREEMPT 51 51 THUNK ___preempt_schedule, preempt_schedule 52 - #ifdef CONFIG_CONTEXT_TRACKING 53 - THUNK ___preempt_schedule_context, preempt_schedule_context 54 - #endif 52 + THUNK ___preempt_schedule_notrace, preempt_schedule_notrace 55 53 #endif 56 54 57 55 #if defined(CONFIG_TRACE_IRQFLAGS) \
+4 -2
arch/x86/lib/usercopy_32.c
··· 647 647 * @from: Source address, in kernel space. 648 648 * @n: Number of bytes to copy. 649 649 * 650 - * Context: User context only. This function may sleep. 650 + * Context: User context only. This function may sleep if pagefaults are 651 + * enabled. 651 652 * 652 653 * Copy data from kernel space to user space. 653 654 * ··· 669 668 * @from: Source address, in user space. 670 669 * @n: Number of bytes to copy. 671 670 * 672 - * Context: User context only. This function may sleep. 671 + * Context: User context only. This function may sleep if pagefaults are 672 + * enabled. 673 673 * 674 674 * Copy data from user space to kernel space. 675 675 *
+3 -2
arch/x86/mm/fault.c
··· 13 13 #include <linux/hugetlb.h> /* hstate_index_to_shift */ 14 14 #include <linux/prefetch.h> /* prefetchw */ 15 15 #include <linux/context_tracking.h> /* exception_enter(), ... */ 16 + #include <linux/uaccess.h> /* faulthandler_disabled() */ 16 17 17 18 #include <asm/traps.h> /* dotraplinkage, ... */ 18 19 #include <asm/pgalloc.h> /* pgd_*(), ... */ ··· 1127 1126 1128 1127 /* 1129 1128 * If we're in an interrupt, have no user context or are running 1130 - * in an atomic region then we must not take the fault: 1129 + * in a region with pagefaults disabled then we must not take the fault 1131 1130 */ 1132 - if (unlikely(in_atomic() || !mm)) { 1131 + if (unlikely(faulthandler_disabled() || !mm)) { 1133 1132 bad_area_nosemaphore(regs, error_code, address); 1134 1133 return; 1135 1134 }
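Across these fault-handler hunks the pattern is the same: the old `in_atomic() || !mm` test becomes `faulthandler_disabled() || !mm` (or `pagefault_disabled()` where only the explicit disable matters), because the pagefault-disable depth now lives in its own counter rather than in the preempt count. A toy userspace sketch of the two predicates, using simulated counters instead of the kernel's real per-task/per-CPU state:

```c
#include <assert.h>

/* Simulated state (not kernel internals): after this series the
 * pagefault-disable depth is tracked separately from preemption. */
static int irq_depth;        /* stands in for in_interrupt() */
static int pagefault_depth;  /* stands in for the per-task disable count */

static void pagefault_disable(void) { pagefault_depth++; }
static void pagefault_enable(void)  { pagefault_depth--; }

/* pagefault_disabled(): pagefaults were explicitly switched off */
static int pagefault_disabled(void)
{
	return pagefault_depth > 0;
}

/* faulthandler_disabled(): the fault handler must not sleep, either
 * because pagefaults are off or because we are in interrupt context */
static int faulthandler_disabled(void)
{
	return irq_depth > 0 || pagefault_disabled();
}
```

The point of the split is visible in the model: interrupt context makes `faulthandler_disabled()` true without `pagefault_disabled()` being set, which is why call sites that only care about the explicit disable (e.g. the i915 relocation path above) use the narrower predicate.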
+2 -1
arch/x86/mm/highmem_32.c
··· 35 35 unsigned long vaddr; 36 36 int idx, type; 37 37 38 - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ 38 + preempt_disable(); 39 39 pagefault_disable(); 40 40 41 41 if (!PageHighMem(page)) ··· 100 100 #endif 101 101 102 102 pagefault_enable(); 103 + preempt_enable(); 103 104 } 104 105 EXPORT_SYMBOL(__kunmap_atomic); 105 106
+2
arch/x86/mm/iomap_32.c
··· 59 59 unsigned long vaddr; 60 60 int idx, type; 61 61 62 + preempt_disable(); 62 63 pagefault_disable(); 63 64 64 65 type = kmap_atomic_idx_push(); ··· 118 117 } 119 118 120 119 pagefault_enable(); 120 + preempt_enable(); 121 121 } 122 122 EXPORT_SYMBOL_GPL(iounmap_atomic);
+2 -2
arch/xtensa/mm/fault.c
··· 15 15 #include <linux/mm.h> 16 16 #include <linux/module.h> 17 17 #include <linux/hardirq.h> 18 + #include <linux/uaccess.h> 18 19 #include <asm/mmu_context.h> 19 20 #include <asm/cacheflush.h> 20 21 #include <asm/hardirq.h> 21 - #include <asm/uaccess.h> 22 22 #include <asm/pgalloc.h> 23 23 24 24 DEFINE_PER_CPU(unsigned long, asid_cache) = ASID_USER_FIRST; ··· 57 57 /* If we're in an interrupt or have no user 58 58 * context, we must not take the fault.. 59 59 */ 60 - if (in_atomic() || !mm) { 60 + if (faulthandler_disabled() || !mm) { 61 61 bad_page_fault(regs, address, SIGSEGV); 62 62 return; 63 63 }
+2
arch/xtensa/mm/highmem.c
··· 42 42 enum fixed_addresses idx; 43 43 unsigned long vaddr; 44 44 45 + preempt_disable(); 45 46 pagefault_disable(); 46 47 if (!PageHighMem(page)) 47 48 return page_address(page); ··· 80 79 } 81 80 82 81 pagefault_enable(); 82 + preempt_enable(); 83 83 } 84 84 EXPORT_SYMBOL(__kunmap_atomic); 85 85
+1 -1
block/blk-mq-cpumap.c
··· 24 24 { 25 25 unsigned int ret; 26 26 27 - ret = cpumask_first(topology_thread_cpumask(cpu)); 27 + ret = cpumask_first(topology_sibling_cpumask(cpu)); 28 28 if (ret < nr_cpu_ids) 29 29 return ret; 30 30
+1 -1
drivers/acpi/acpi_pad.c
··· 105 105 mutex_lock(&round_robin_lock); 106 106 cpumask_clear(tmp); 107 107 for_each_cpu(cpu, pad_busy_cpus) 108 - cpumask_or(tmp, tmp, topology_thread_cpumask(cpu)); 108 + cpumask_or(tmp, tmp, topology_sibling_cpumask(cpu)); 109 109 cpumask_andnot(tmp, cpu_online_mask, tmp); 110 110 /* avoid HT sibilings if possible */ 111 111 if (cpumask_empty(tmp))
+1 -1
drivers/base/topology.c
··· 61 61 define_id_show_func(core_id); 62 62 static DEVICE_ATTR_RO(core_id); 63 63 64 - define_siblings_show_func(thread_siblings, thread_cpumask); 64 + define_siblings_show_func(thread_siblings, sibling_cpumask); 65 65 static DEVICE_ATTR_RO(thread_siblings); 66 66 static DEVICE_ATTR_RO(thread_siblings_list); 67 67
+3 -2
drivers/cpufreq/acpi-cpufreq.c
··· 699 699 dmi_check_system(sw_any_bug_dmi_table); 700 700 if (bios_with_sw_any_bug && !policy_is_shared(policy)) { 701 701 policy->shared_type = CPUFREQ_SHARED_TYPE_ALL; 702 - cpumask_copy(policy->cpus, cpu_core_mask(cpu)); 702 + cpumask_copy(policy->cpus, topology_core_cpumask(cpu)); 703 703 } 704 704 705 705 if (check_amd_hwpstate_cpu(cpu) && !acpi_pstate_strict) { 706 706 cpumask_clear(policy->cpus); 707 707 cpumask_set_cpu(cpu, policy->cpus); 708 - cpumask_copy(data->freqdomain_cpus, cpu_sibling_mask(cpu)); 708 + cpumask_copy(data->freqdomain_cpus, 709 + topology_sibling_cpumask(cpu)); 709 710 policy->shared_type = CPUFREQ_SHARED_TYPE_HW; 710 711 pr_info_once(PFX "overriding BIOS provided _PSD data\n"); 711 712 }
+1 -1
drivers/cpufreq/p4-clockmod.c
··· 172 172 unsigned int i; 173 173 174 174 #ifdef CONFIG_SMP 175 - cpumask_copy(policy->cpus, cpu_sibling_mask(policy->cpu)); 175 + cpumask_copy(policy->cpus, topology_sibling_cpumask(policy->cpu)); 176 176 #endif 177 177 178 178 /* Errata workaround */
+3 -10
drivers/cpufreq/powernow-k8.c
··· 57 57 58 58 static struct cpufreq_driver cpufreq_amd64_driver; 59 59 60 - #ifndef CONFIG_SMP 61 - static inline const struct cpumask *cpu_core_mask(int cpu) 62 - { 63 - return cpumask_of(0); 64 - } 65 - #endif 66 - 67 60 /* Return a frequency in MHz, given an input fid */ 68 61 static u32 find_freq_from_fid(u32 fid) 69 62 { ··· 613 620 614 621 pr_debug("cfid 0x%x, cvid 0x%x\n", data->currfid, data->currvid); 615 622 data->powernow_table = powernow_table; 616 - if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu) 623 + if (cpumask_first(topology_core_cpumask(data->cpu)) == data->cpu) 617 624 print_basics(data); 618 625 619 626 for (j = 0; j < data->numps; j++) ··· 777 784 CPUFREQ_TABLE_END; 778 785 data->powernow_table = powernow_table; 779 786 780 - if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu) 787 + if (cpumask_first(topology_core_cpumask(data->cpu)) == data->cpu) 781 788 print_basics(data); 782 789 783 790 /* notify BIOS that we exist */ ··· 1083 1090 if (rc != 0) 1084 1091 goto err_out_exit_acpi; 1085 1092 1086 - cpumask_copy(pol->cpus, cpu_core_mask(pol->cpu)); 1093 + cpumask_copy(pol->cpus, topology_core_cpumask(pol->cpu)); 1087 1094 data->available_cores = pol->cpus; 1088 1095 1089 1096 /* min/max the cpu is capable of */
+1 -1
drivers/cpufreq/speedstep-ich.c
··· 292 292 293 293 /* only run on CPU to be set, or on its sibling */ 294 294 #ifdef CONFIG_SMP 295 - cpumask_copy(policy->cpus, cpu_sibling_mask(policy->cpu)); 295 + cpumask_copy(policy->cpus, topology_sibling_cpumask(policy->cpu)); 296 296 #endif 297 297 policy_cpu = cpumask_any_and(policy->cpus, cpu_online_mask); 298 298
+7 -1
drivers/crypto/vmx/aes.c
··· 78 78 int ret; 79 79 struct p8_aes_ctx *ctx = crypto_tfm_ctx(tfm); 80 80 81 + preempt_disable(); 81 82 pagefault_disable(); 82 83 enable_kernel_altivec(); 83 84 ret = aes_p8_set_encrypt_key(key, keylen * 8, &ctx->enc_key); 84 85 ret += aes_p8_set_decrypt_key(key, keylen * 8, &ctx->dec_key); 85 86 pagefault_enable(); 86 - 87 + preempt_enable(); 88 + 87 89 ret += crypto_cipher_setkey(ctx->fallback, key, keylen); 88 90 return ret; 89 91 } ··· 97 95 if (in_interrupt()) { 98 96 crypto_cipher_encrypt_one(ctx->fallback, dst, src); 99 97 } else { 98 + preempt_disable(); 100 99 pagefault_disable(); 101 100 enable_kernel_altivec(); 102 101 aes_p8_encrypt(src, dst, &ctx->enc_key); 103 102 pagefault_enable(); 103 + preempt_enable(); 104 104 } 105 105 } 106 106 ··· 113 109 if (in_interrupt()) { 114 110 crypto_cipher_decrypt_one(ctx->fallback, dst, src); 115 111 } else { 112 + preempt_disable(); 116 113 pagefault_disable(); 117 114 enable_kernel_altivec(); 118 115 aes_p8_decrypt(src, dst, &ctx->dec_key); 119 116 pagefault_enable(); 117 + preempt_enable(); 120 118 } 121 119 } 122 120
+6
drivers/crypto/vmx/aes_cbc.c
··· 79 79 int ret; 80 80 struct p8_aes_cbc_ctx *ctx = crypto_tfm_ctx(tfm); 81 81 82 + preempt_disable(); 82 83 pagefault_disable(); 83 84 enable_kernel_altivec(); 84 85 ret = aes_p8_set_encrypt_key(key, keylen * 8, &ctx->enc_key); 85 86 ret += aes_p8_set_decrypt_key(key, keylen * 8, &ctx->dec_key); 86 87 pagefault_enable(); 88 + preempt_enable(); 87 89 88 90 ret += crypto_blkcipher_setkey(ctx->fallback, key, keylen); 89 91 return ret; ··· 108 106 if (in_interrupt()) { 109 107 ret = crypto_blkcipher_encrypt(&fallback_desc, dst, src, nbytes); 110 108 } else { 109 + preempt_disable(); 111 110 pagefault_disable(); 112 111 enable_kernel_altivec(); 113 112 ··· 122 119 } 123 120 124 121 pagefault_enable(); 122 + preempt_enable(); 125 123 } 126 124 127 125 return ret; ··· 145 141 if (in_interrupt()) { 146 142 ret = crypto_blkcipher_decrypt(&fallback_desc, dst, src, nbytes); 147 143 } else { 144 + preempt_disable(); 148 145 pagefault_disable(); 149 146 enable_kernel_altivec(); 150 147 ··· 159 154 } 160 155 161 156 pagefault_enable(); 157 + preempt_enable(); 162 158 } 163 159 164 160 return ret;
+8
drivers/crypto/vmx/ghash.c
··· 114 114 if (keylen != GHASH_KEY_LEN) 115 115 return -EINVAL; 116 116 117 + preempt_disable(); 117 118 pagefault_disable(); 118 119 enable_kernel_altivec(); 119 120 enable_kernel_fp(); 120 121 gcm_init_p8(ctx->htable, (const u64 *) key); 121 122 pagefault_enable(); 123 + preempt_enable(); 122 124 return crypto_shash_setkey(ctx->fallback, key, keylen); 123 125 } 124 126 ··· 142 140 } 143 141 memcpy(dctx->buffer + dctx->bytes, src, 144 142 GHASH_DIGEST_SIZE - dctx->bytes); 143 + preempt_disable(); 145 144 pagefault_disable(); 146 145 enable_kernel_altivec(); 147 146 enable_kernel_fp(); 148 147 gcm_ghash_p8(dctx->shash, ctx->htable, dctx->buffer, 149 148 GHASH_DIGEST_SIZE); 150 149 pagefault_enable(); 150 + preempt_enable(); 151 151 src += GHASH_DIGEST_SIZE - dctx->bytes; 152 152 srclen -= GHASH_DIGEST_SIZE - dctx->bytes; 153 153 dctx->bytes = 0; 154 154 } 155 155 len = srclen & ~(GHASH_DIGEST_SIZE - 1); 156 156 if (len) { 157 + preempt_disable(); 157 158 pagefault_disable(); 158 159 enable_kernel_altivec(); 159 160 enable_kernel_fp(); 160 161 gcm_ghash_p8(dctx->shash, ctx->htable, src, len); 161 162 pagefault_enable(); 163 + preempt_enable(); 162 164 src += len; 163 165 srclen -= len; 164 166 } ··· 186 180 if (dctx->bytes) { 187 181 for (i = dctx->bytes; i < GHASH_DIGEST_SIZE; i++) 188 182 dctx->buffer[i] = 0; 183 + preempt_disable(); 189 184 pagefault_disable(); 190 185 enable_kernel_altivec(); 191 186 enable_kernel_fp(); 192 187 gcm_ghash_p8(dctx->shash, ctx->htable, dctx->buffer, 193 188 GHASH_DIGEST_SIZE); 194 189 pagefault_enable(); 190 + preempt_enable(); 195 191 dctx->bytes = 0; 196 192 } 197 193 memcpy(out, dctx->shash, GHASH_DIGEST_SIZE);
+2 -1
drivers/gpu/drm/i915/i915_gem_execbuffer.c
··· 32 32 #include "i915_trace.h" 33 33 #include "intel_drv.h" 34 34 #include <linux/dma_remapping.h> 35 + #include <linux/uaccess.h> 35 36 36 37 #define __EXEC_OBJECT_HAS_PIN (1<<31) 37 38 #define __EXEC_OBJECT_HAS_FENCE (1<<30) ··· 466 465 } 467 466 468 467 /* We can't wait for rendering with pagefaults disabled */ 469 - if (obj->active && in_atomic()) 468 + if (obj->active && pagefault_disabled()) 470 469 return -EFAULT; 471 470 472 471 if (use_cpu_reloc(obj))
+2 -1
drivers/hwmon/coretemp.c
··· 63 63 #define TO_ATTR_NO(cpu) (TO_CORE_ID(cpu) + BASE_SYSFS_ATTR_NO) 64 64 65 65 #ifdef CONFIG_SMP 66 - #define for_each_sibling(i, cpu) for_each_cpu(i, cpu_sibling_mask(cpu)) 66 + #define for_each_sibling(i, cpu) \ 67 + for_each_cpu(i, topology_sibling_cpumask(cpu)) 67 68 #else 68 69 #define for_each_sibling(i, cpu) for (i = 0; false; ) 69 70 #endif
+1 -1
drivers/net/ethernet/sfc/efx.c
··· 1304 1304 if (!cpumask_test_cpu(cpu, thread_mask)) { 1305 1305 ++count; 1306 1306 cpumask_or(thread_mask, thread_mask, 1307 - topology_thread_cpumask(cpu)); 1307 + topology_sibling_cpumask(cpu)); 1308 1308 } 1309 1309 } 1310 1310
+1 -1
drivers/staging/lustre/lustre/libcfs/linux/linux-cpu.c
··· 87 87 /* return cpumask of HTs in the same core */ 88 88 static void cfs_cpu_ht_siblings(int cpu, cpumask_t *mask) 89 89 { 90 - cpumask_copy(mask, topology_thread_cpumask(cpu)); 90 + cpumask_copy(mask, topology_sibling_cpumask(cpu)); 91 91 } 92 92 93 93 static void cfs_node_to_cpumask(int node, cpumask_t *mask)
+2 -2
drivers/staging/lustre/lustre/ptlrpc/service.c
··· 557 557 * there are. 558 558 */ 559 559 /* weight is # of HTs */ 560 - if (cpumask_weight(topology_thread_cpumask(0)) > 1) { 560 + if (cpumask_weight(topology_sibling_cpumask(0)) > 1) { 561 561 /* depress thread factor for hyper-thread */ 562 562 factor = factor - (factor >> 1) + (factor >> 3); 563 563 } ··· 2768 2768 2769 2769 init_waitqueue_head(&ptlrpc_hr.hr_waitq); 2770 2770 2771 - weight = cpumask_weight(topology_thread_cpumask(0)); 2771 + weight = cpumask_weight(topology_sibling_cpumask(0)); 2772 2772 2773 2773 cfs_percpt_for_each(hrp, i, ptlrpc_hr.hr_partitions) { 2774 2774 hrp->hrp_cpt = i;
+5 -2
include/asm-generic/futex.h
··· 8 8 #ifndef CONFIG_SMP 9 9 /* 10 10 * The following implementation only for uniprocessor machines. 11 - * For UP, it's relies on the fact that pagefault_disable() also disables 12 - * preemption to ensure mutual exclusion. 11 + * It relies on preempt_disable() ensuring mutual exclusion. 13 12 * 14 13 */ 15 14 ··· 37 38 if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) 38 39 oparg = 1 << oparg; 39 40 41 + preempt_disable(); 40 42 pagefault_disable(); 41 43 42 44 ret = -EFAULT; ··· 72 72 73 73 out_pagefault_enable: 74 74 pagefault_enable(); 75 + preempt_enable(); 75 76 76 77 if (ret == 0) { 77 78 switch (cmp) { ··· 107 106 { 108 107 u32 val; 109 108 109 + preempt_disable(); 110 110 if (unlikely(get_user(val, uaddr) != 0)) 111 111 return -EFAULT; 112 112 ··· 115 113 return -EFAULT; 116 114 117 115 *uval = val; 116 + preempt_enable(); 118 117 119 118 return 0; 120 119 }
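The asm-generic futex hunk above swaps the old implicit guarantee ("pagefault_disable() also disables preemption") for an explicit preempt_disable()/preempt_enable() pair on the UP path. A hypothetical userspace sketch of the shape of that path, with a simulated counter and plain memory accesses standing in for get_user()/put_user():

```c
#include <assert.h>

/* Simulated preempt count (not the kernel's): on UP, mutual exclusion
 * for the futex word comes from holding this across the access. */
static int preempt_depth;

static int cmpxchg_inatomic_model(unsigned int *uval, unsigned int *uaddr,
				  unsigned int oldval, unsigned int newval)
{
	unsigned int val;

	preempt_depth++;		/* explicit preempt_disable() */
	val = *uaddr;			/* get_user() stand-in */
	if (val == oldval)
		*uaddr = newval;	/* put_user() stand-in */
	*uval = val;
	preempt_depth--;		/* preempt_enable() */
	return 0;
}
```

One thing the sketch makes easy to see: every return path has to drop the count it took, so early-error returns between the disable and the enable are exactly where an unbalanced preempt count would leak.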
+2 -5
include/asm-generic/preempt.h
··· 79 79 #ifdef CONFIG_PREEMPT 80 80 extern asmlinkage void preempt_schedule(void); 81 81 #define __preempt_schedule() preempt_schedule() 82 - 83 - #ifdef CONFIG_CONTEXT_TRACKING 84 - extern asmlinkage void preempt_schedule_context(void); 85 - #define __preempt_schedule_context() preempt_schedule_context() 86 - #endif 82 + extern asmlinkage void preempt_schedule_notrace(void); 83 + #define __preempt_schedule_notrace() preempt_schedule_notrace() 87 84 #endif /* CONFIG_PREEMPT */ 88 85 89 86 #endif /* __ASM_PREEMPT_H */
-1
include/linux/bottom_half.h
··· 2 2 #define _LINUX_BH_H 3 3 4 4 #include <linux/preempt.h> 5 - #include <linux/preempt_mask.h> 6 5 7 6 #ifdef CONFIG_TRACE_IRQFLAGS 8 7 extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+1 -1
include/linux/hardirq.h
··· 1 1 #ifndef LINUX_HARDIRQ_H 2 2 #define LINUX_HARDIRQ_H 3 3 4 - #include <linux/preempt_mask.h> 4 + #include <linux/preempt.h> 5 5 #include <linux/lockdep.h> 6 6 #include <linux/ftrace_irq.h> 7 7 #include <linux/vtime.h>
+2
include/linux/highmem.h
··· 65 65 66 66 static inline void *kmap_atomic(struct page *page) 67 67 { 68 + preempt_disable(); 68 69 pagefault_disable(); 69 70 return page_address(page); 70 71 } ··· 74 73 static inline void __kunmap_atomic(void *addr) 75 74 { 76 75 pagefault_enable(); 76 + preempt_enable(); 77 77 } 78 78 79 79 #define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn))
+2 -3
include/linux/init_task.h
··· 50 50 .cpu_timers = INIT_CPU_TIMERS(sig.cpu_timers), \ 51 51 .rlim = INIT_RLIMITS, \ 52 52 .cputimer = { \ 53 - .cputime = INIT_CPUTIME, \ 54 - .running = 0, \ 55 - .lock = __RAW_SPIN_LOCK_UNLOCKED(sig.cputimer.lock), \ 53 + .cputime_atomic = INIT_CPUTIME_ATOMIC, \ 54 + .running = 0, \ 56 55 }, \ 57 56 .cred_guard_mutex = \ 58 57 __MUTEX_INITIALIZER(sig.cred_guard_mutex), \
+2
include/linux/io-mapping.h
··· 141 141 io_mapping_map_atomic_wc(struct io_mapping *mapping, 142 142 unsigned long offset) 143 143 { 144 + preempt_disable(); 144 145 pagefault_disable(); 145 146 return ((char __force __iomem *) mapping) + offset; 146 147 } ··· 150 149 io_mapping_unmap_atomic(void __iomem *vaddr) 151 150 { 152 151 pagefault_enable(); 152 + preempt_enable(); 153 153 } 154 154 155 155 /* Non-atomic map/unmap */
+2 -1
include/linux/kernel.h
··· 244 244 245 245 #if defined(CONFIG_MMU) && \ 246 246 (defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)) 247 - void might_fault(void); 247 + #define might_fault() __might_fault(__FILE__, __LINE__) 248 + void __might_fault(const char *file, int line); 248 249 #else 249 250 static inline void might_fault(void) { } 250 251 #endif
+5
include/linux/lglock.h
··· 52 52 static struct lglock name = { .lock = &name ## _lock } 53 53 54 54 void lg_lock_init(struct lglock *lg, char *name); 55 + 55 56 void lg_local_lock(struct lglock *lg); 56 57 void lg_local_unlock(struct lglock *lg); 57 58 void lg_local_lock_cpu(struct lglock *lg, int cpu); 58 59 void lg_local_unlock_cpu(struct lglock *lg, int cpu); 60 + 61 + void lg_double_lock(struct lglock *lg, int cpu1, int cpu2); 62 + void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2); 63 + 59 64 void lg_global_lock(struct lglock *lg); 60 65 void lg_global_unlock(struct lglock *lg); 61 66
+137 -24
include/linux/preempt.h
··· 10 10 #include <linux/list.h> 11 11 12 12 /* 13 - * We use the MSB mostly because its available; see <linux/preempt_mask.h> for 14 - * the other bits -- can't include that header due to inclusion hell. 13 + * We put the hardirq and softirq counter into the preemption 14 + * counter. The bitmask has the following meaning: 15 + * 16 + * - bits 0-7 are the preemption count (max preemption depth: 256) 17 + * - bits 8-15 are the softirq count (max # of softirqs: 256) 18 + * 19 + * The hardirq count could in theory be the same as the number of 20 + * interrupts in the system, but we run all interrupt handlers with 21 + * interrupts disabled, so we cannot have nesting interrupts. Though 22 + * there are a few palaeontologic drivers which reenable interrupts in 23 + * the handler, so we need more than one bit here. 24 + * 25 + * PREEMPT_MASK: 0x000000ff 26 + * SOFTIRQ_MASK: 0x0000ff00 27 + * HARDIRQ_MASK: 0x000f0000 28 + * NMI_MASK: 0x00100000 29 + * PREEMPT_ACTIVE: 0x00200000 30 + * PREEMPT_NEED_RESCHED: 0x80000000 15 31 */ 32 + #define PREEMPT_BITS 8 33 + #define SOFTIRQ_BITS 8 34 + #define HARDIRQ_BITS 4 35 + #define NMI_BITS 1 36 + 37 + #define PREEMPT_SHIFT 0 38 + #define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) 39 + #define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) 40 + #define NMI_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS) 41 + 42 + #define __IRQ_MASK(x) ((1UL << (x))-1) 43 + 44 + #define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) 45 + #define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) 46 + #define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) 47 + #define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT) 48 + 49 + #define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) 50 + #define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) 51 + #define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) 52 + #define NMI_OFFSET (1UL << NMI_SHIFT) 53 + 54 + #define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET) 55 + 56 + #define PREEMPT_ACTIVE_BITS 1 57 + #define 
PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS) 58 + #define PREEMPT_ACTIVE (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << PREEMPT_ACTIVE_SHIFT) 59 + 60 + /* We use the MSB mostly because its available */ 16 61 #define PREEMPT_NEED_RESCHED 0x80000000 17 62 63 + /* preempt_count() and related functions, depends on PREEMPT_NEED_RESCHED */ 18 64 #include <asm/preempt.h> 65 + 66 + #define hardirq_count() (preempt_count() & HARDIRQ_MASK) 67 + #define softirq_count() (preempt_count() & SOFTIRQ_MASK) 68 + #define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \ 69 + | NMI_MASK)) 70 + 71 + /* 72 + * Are we doing bottom half or hardware interrupt processing? 73 + * Are we in a softirq context? Interrupt context? 74 + * in_softirq - Are we currently processing softirq or have bh disabled? 75 + * in_serving_softirq - Are we currently processing softirq? 76 + */ 77 + #define in_irq() (hardirq_count()) 78 + #define in_softirq() (softirq_count()) 79 + #define in_interrupt() (irq_count()) 80 + #define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET) 81 + 82 + /* 83 + * Are we in NMI context? 84 + */ 85 + #define in_nmi() (preempt_count() & NMI_MASK) 86 + 87 + #if defined(CONFIG_PREEMPT_COUNT) 88 + # define PREEMPT_DISABLE_OFFSET 1 89 + #else 90 + # define PREEMPT_DISABLE_OFFSET 0 91 + #endif 92 + 93 + /* 94 + * The preempt_count offset needed for things like: 95 + * 96 + * spin_lock_bh() 97 + * 98 + * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and 99 + * softirqs, such that unlock sequences of: 100 + * 101 + * spin_unlock(); 102 + * local_bh_enable(); 103 + * 104 + * Work as expected. 105 + */ 106 + #define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_DISABLE_OFFSET) 107 + 108 + /* 109 + * Are we running in atomic context? WARNING: this macro cannot 110 + * always detect atomic context; in particular, it cannot know about 111 + * held spinlocks in non-preemptible kernels. 
Thus it should not be 112 + * used in the general case to determine whether sleeping is possible. 113 + * Do not use in_atomic() in driver code. 114 + */ 115 + #define in_atomic() (preempt_count() != 0) 116 + 117 + /* 118 + * Check whether we were atomic before we did preempt_disable(): 119 + * (used by the scheduler) 120 + */ 121 + #define in_atomic_preempt_off() \ 122 + ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_DISABLE_OFFSET) 19 123 20 124 #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) 21 125 extern void preempt_count_add(int val); ··· 137 33 #define preempt_count_inc() preempt_count_add(1) 138 34 #define preempt_count_dec() preempt_count_sub(1) 139 35 36 + #define preempt_active_enter() \ 37 + do { \ 38 + preempt_count_add(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); \ 39 + barrier(); \ 40 + } while (0) 41 + 42 + #define preempt_active_exit() \ 43 + do { \ 44 + barrier(); \ 45 + preempt_count_sub(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); \ 46 + } while (0) 47 + 140 48 #ifdef CONFIG_PREEMPT_COUNT 141 49 142 50 #define preempt_disable() \ ··· 165 49 166 50 #define preempt_enable_no_resched() sched_preempt_enable_no_resched() 167 51 52 + #define preemptible() (preempt_count() == 0 && !irqs_disabled()) 53 + 168 54 #ifdef CONFIG_PREEMPT 169 55 #define preempt_enable() \ 170 56 do { \ ··· 175 57 __preempt_schedule(); \ 176 58 } while (0) 177 59 60 + #define preempt_enable_notrace() \ 61 + do { \ 62 + barrier(); \ 63 + if (unlikely(__preempt_count_dec_and_test())) \ 64 + __preempt_schedule_notrace(); \ 65 + } while (0) 66 + 178 67 #define preempt_check_resched() \ 179 68 do { \ 180 69 if (should_resched()) \ 181 70 __preempt_schedule(); \ 182 71 } while (0) 183 72 184 - #else 73 + #else /* !CONFIG_PREEMPT */ 185 74 #define preempt_enable() \ 186 75 do { \ 187 76 barrier(); \ 188 77 preempt_count_dec(); \ 189 78 } while (0) 79 + 80 + #define preempt_enable_notrace() \ 81 + do { \ 82 + barrier(); \ 83 + __preempt_count_dec(); \ 84 + } while (0) 
85 + 190 86 #define preempt_check_resched() do { } while (0) 191 - #endif 87 + #endif /* CONFIG_PREEMPT */ 192 88 193 89 #define preempt_disable_notrace() \ 194 90 do { \ ··· 215 83 barrier(); \ 216 84 __preempt_count_dec(); \ 217 85 } while (0) 218 - 219 - #ifdef CONFIG_PREEMPT 220 - 221 - #ifndef CONFIG_CONTEXT_TRACKING 222 - #define __preempt_schedule_context() __preempt_schedule() 223 - #endif 224 - 225 - #define preempt_enable_notrace() \ 226 - do { \ 227 - barrier(); \ 228 - if (unlikely(__preempt_count_dec_and_test())) \ 229 - __preempt_schedule_context(); \ 230 - } while (0) 231 - #else 232 - #define preempt_enable_notrace() \ 233 - do { \ 234 - barrier(); \ 235 - __preempt_count_dec(); \ 236 - } while (0) 237 - #endif 238 86 239 87 #else /* !CONFIG_PREEMPT_COUNT */ 240 88 ··· 233 121 #define preempt_disable_notrace() barrier() 234 122 #define preempt_enable_no_resched_notrace() barrier() 235 123 #define preempt_enable_notrace() barrier() 124 + #define preemptible() 0 236 125 237 126 #endif /* CONFIG_PREEMPT_COUNT */ 238 127
-117
include/linux/preempt_mask.h
··· 1 - #ifndef LINUX_PREEMPT_MASK_H 2 - #define LINUX_PREEMPT_MASK_H 3 - 4 - #include <linux/preempt.h> 5 - 6 - /* 7 - * We put the hardirq and softirq counter into the preemption 8 - * counter. The bitmask has the following meaning: 9 - * 10 - * - bits 0-7 are the preemption count (max preemption depth: 256) 11 - * - bits 8-15 are the softirq count (max # of softirqs: 256) 12 - * 13 - * The hardirq count could in theory be the same as the number of 14 - * interrupts in the system, but we run all interrupt handlers with 15 - * interrupts disabled, so we cannot have nesting interrupts. Though 16 - * there are a few palaeontologic drivers which reenable interrupts in 17 - * the handler, so we need more than one bit here. 18 - * 19 - * PREEMPT_MASK: 0x000000ff 20 - * SOFTIRQ_MASK: 0x0000ff00 21 - * HARDIRQ_MASK: 0x000f0000 22 - * NMI_MASK: 0x00100000 23 - * PREEMPT_ACTIVE: 0x00200000 24 - */ 25 - #define PREEMPT_BITS 8 26 - #define SOFTIRQ_BITS 8 27 - #define HARDIRQ_BITS 4 28 - #define NMI_BITS 1 29 - 30 - #define PREEMPT_SHIFT 0 31 - #define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) 32 - #define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) 33 - #define NMI_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS) 34 - 35 - #define __IRQ_MASK(x) ((1UL << (x))-1) 36 - 37 - #define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) 38 - #define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) 39 - #define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) 40 - #define NMI_MASK (__IRQ_MASK(NMI_BITS) << NMI_SHIFT) 41 - 42 - #define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) 43 - #define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) 44 - #define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) 45 - #define NMI_OFFSET (1UL << NMI_SHIFT) 46 - 47 - #define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET) 48 - 49 - #define PREEMPT_ACTIVE_BITS 1 50 - #define PREEMPT_ACTIVE_SHIFT (NMI_SHIFT + NMI_BITS) 51 - #define PREEMPT_ACTIVE (__IRQ_MASK(PREEMPT_ACTIVE_BITS) << PREEMPT_ACTIVE_SHIFT) 52 - 53 - 
#define hardirq_count() (preempt_count() & HARDIRQ_MASK) 54 - #define softirq_count() (preempt_count() & SOFTIRQ_MASK) 55 - #define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK \ 56 - | NMI_MASK)) 57 - 58 - /* 59 - * Are we doing bottom half or hardware interrupt processing? 60 - * Are we in a softirq context? Interrupt context? 61 - * in_softirq - Are we currently processing softirq or have bh disabled? 62 - * in_serving_softirq - Are we currently processing softirq? 63 - */ 64 - #define in_irq() (hardirq_count()) 65 - #define in_softirq() (softirq_count()) 66 - #define in_interrupt() (irq_count()) 67 - #define in_serving_softirq() (softirq_count() & SOFTIRQ_OFFSET) 68 - 69 - /* 70 - * Are we in NMI context? 71 - */ 72 - #define in_nmi() (preempt_count() & NMI_MASK) 73 - 74 - #if defined(CONFIG_PREEMPT_COUNT) 75 - # define PREEMPT_CHECK_OFFSET 1 76 - #else 77 - # define PREEMPT_CHECK_OFFSET 0 78 - #endif 79 - 80 - /* 81 - * The preempt_count offset needed for things like: 82 - * 83 - * spin_lock_bh() 84 - * 85 - * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and 86 - * softirqs, such that unlock sequences of: 87 - * 88 - * spin_unlock(); 89 - * local_bh_enable(); 90 - * 91 - * Work as expected. 92 - */ 93 - #define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET) 94 - 95 - /* 96 - * Are we running in atomic context? WARNING: this macro cannot 97 - * always detect atomic context; in particular, it cannot know about 98 - * held spinlocks in non-preemptible kernels. Thus it should not be 99 - * used in the general case to determine whether sleeping is possible. 100 - * Do not use in_atomic() in driver code. 
101 - */ 102 - #define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) 103 - 104 - /* 105 - * Check whether we were atomic before we did preempt_disable(): 106 - * (used by the scheduler, *after* releasing the kernel lock) 107 - */ 108 - #define in_atomic_preempt_off() \ 109 - ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET) 110 - 111 - #ifdef CONFIG_PREEMPT_COUNT 112 - # define preemptible() (preempt_count() == 0 && !irqs_disabled()) 113 - #else 114 - # define preemptible() 0 115 - #endif 116 - 117 - #endif /* LINUX_PREEMPT_MASK_H */
+91 -27
include/linux/sched.h
··· 25 25 #include <linux/errno.h> 26 26 #include <linux/nodemask.h> 27 27 #include <linux/mm_types.h> 28 - #include <linux/preempt_mask.h> 28 + #include <linux/preempt.h> 29 29 30 30 #include <asm/page.h> 31 31 #include <asm/ptrace.h> ··· 174 174 extern void get_iowait_load(unsigned long *nr_waiters, unsigned long *load); 175 175 176 176 extern void calc_global_load(unsigned long ticks); 177 + 178 + #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) 177 179 extern void update_cpu_load_nohz(void); 180 + #else 181 + static inline void update_cpu_load_nohz(void) { } 182 + #endif 178 183 179 184 extern unsigned long get_parent_ip(unsigned long addr); 180 185 ··· 219 214 #define TASK_WAKEKILL 128 220 215 #define TASK_WAKING 256 221 216 #define TASK_PARKED 512 222 - #define TASK_STATE_MAX 1024 217 + #define TASK_NOLOAD 1024 218 + #define TASK_STATE_MAX 2048 223 219 224 - #define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP" 220 + #define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN" 225 221 226 222 extern char ___assert_task_state[1 - 2*!!( 227 223 sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)]; ··· 231 225 #define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE) 232 226 #define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED) 233 227 #define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED) 228 + 229 + #define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD) 234 230 235 231 /* Convenience macros for the sake of wake_up */ 236 232 #define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE) ··· 249 241 ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0) 250 242 #define task_contributes_to_load(task) \ 251 243 ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \ 252 - (task->flags & PF_FROZEN) == 0) 244 + (task->flags & PF_FROZEN) == 0 && \ 245 + (task->state & TASK_NOLOAD) == 0) 253 246 254 247 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP 255 248 ··· 577 568 .sum_exec_runtime = 0, \ 578 569 } 579 570 571 + /* 572 + * This is the atomic variant of task_cputime, which can be used for 
573 + * storing and updating task_cputime statistics without locking. 574 + */ 575 + struct task_cputime_atomic { 576 + atomic64_t utime; 577 + atomic64_t stime; 578 + atomic64_t sum_exec_runtime; 579 + }; 580 + 581 + #define INIT_CPUTIME_ATOMIC \ 582 + (struct task_cputime_atomic) { \ 583 + .utime = ATOMIC64_INIT(0), \ 584 + .stime = ATOMIC64_INIT(0), \ 585 + .sum_exec_runtime = ATOMIC64_INIT(0), \ 586 + } 587 + 580 588 #ifdef CONFIG_PREEMPT_COUNT 581 589 #define PREEMPT_DISABLED (1 + PREEMPT_ENABLED) 582 590 #else ··· 611 585 612 586 /** 613 587 * struct thread_group_cputimer - thread group interval timer counts 614 - * @cputime: thread group interval timers. 588 + * @cputime_atomic: atomic thread group interval timers. 615 589 * @running: non-zero when there are timers running and 616 590 * @cputime receives updates. 617 - * @lock: lock for fields in this struct. 618 591 * 619 592 * This structure contains the version of task_cputime, above, that is 620 593 * used for thread group CPU timer calculations. 621 594 */ 622 595 struct thread_group_cputimer { 623 - struct task_cputime cputime; 596 + struct task_cputime_atomic cputime_atomic; 624 597 int running; 625 - raw_spinlock_t lock; 626 598 }; 627 599 628 600 #include <linux/rwsem.h> ··· 923 899 */ 924 900 #define SCHED_CAPACITY_SHIFT 10 925 901 #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) 902 + 903 + /* 904 + * Wake-queues are lists of tasks with a pending wakeup, whose 905 + * callers have already marked the task as woken internally, 906 + * and can thus carry on. A common use case is being able to 907 + * do the wakeups once the corresponding user lock as been 908 + * released. 909 + * 910 + * We hold reference to each task in the list across the wakeup, 911 + * thus guaranteeing that the memory is still valid by the time 912 + * the actual wakeups are performed in wake_up_q(). 
913 + * 914 + * One per task suffices, because there's never a need for a task to be 915 + * in two wake queues simultaneously; it is forbidden to abandon a task 916 + * in a wake queue (a call to wake_up_q() _must_ follow), so if a task is 917 + * already in a wake queue, the wakeup will happen soon and the second 918 + * waker can just skip it. 919 + * 920 + * The WAKE_Q macro declares and initializes the list head. 921 + * wake_up_q() does NOT reinitialize the list; it's expected to be 922 + * called near the end of a function, where the fact that the queue is 923 + * not used again will be easy to see by inspection. 924 + * 925 + * Note that this can cause spurious wakeups. schedule() callers 926 + * must ensure the call is done inside a loop, confirming that the 927 + * wakeup condition has in fact occurred. 928 + */ 929 + struct wake_q_node { 930 + struct wake_q_node *next; 931 + }; 932 + 933 + struct wake_q_head { 934 + struct wake_q_node *first; 935 + struct wake_q_node **lastp; 936 + }; 937 + 938 + #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01) 939 + 940 + #define WAKE_Q(name) \ 941 + struct wake_q_head name = { WAKE_Q_TAIL, &name.first } 942 + 943 + extern void wake_q_add(struct wake_q_head *head, 944 + struct task_struct *task); 945 + extern void wake_up_q(struct wake_q_head *head); 926 946 927 947 /* 928 948 * sched-domains (multiprocessor balancing) declarations: ··· 1403 1335 int rcu_read_lock_nesting; 1404 1336 union rcu_special rcu_read_unlock_special; 1405 1337 struct list_head rcu_node_entry; 1406 - #endif /* #ifdef CONFIG_PREEMPT_RCU */ 1407 - #ifdef CONFIG_PREEMPT_RCU 1408 1338 struct rcu_node *rcu_blocked_node; 1409 1339 #endif /* #ifdef CONFIG_PREEMPT_RCU */ 1410 1340 #ifdef CONFIG_TASKS_RCU ··· 1433 1367 int exit_state; 1434 1368 int exit_code, exit_signal; 1435 1369 int pdeath_signal; /* The signal sent when the parent dies */ 1436 - unsigned int jobctl; /* JOBCTL_*, siglock protected */ 1370 + unsigned long jobctl; /* JOBCTL_*, siglock 
protected */ 1437 1371 1438 1372 /* Used for emulating ABI behavior of previous Linux versions */ 1439 1373 unsigned int personality; ··· 1578 1512 1579 1513 /* Protection of the PI data structures: */ 1580 1514 raw_spinlock_t pi_lock; 1515 + 1516 + struct wake_q_node wake_q; 1581 1517 1582 1518 #ifdef CONFIG_RT_MUTEXES 1583 1519 /* PI waiters blocked on a rt_mutex held by this task */ ··· 1794 1726 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP 1795 1727 unsigned long task_state_change; 1796 1728 #endif 1729 + int pagefault_disabled; 1797 1730 }; 1798 1731 1799 1732 /* Future-safe accessor for struct task_struct's cpus_allowed. */ ··· 2148 2079 #define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */ 2149 2080 #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ 2150 2081 2151 - #define JOBCTL_STOP_DEQUEUED (1 << JOBCTL_STOP_DEQUEUED_BIT) 2152 - #define JOBCTL_STOP_PENDING (1 << JOBCTL_STOP_PENDING_BIT) 2153 - #define JOBCTL_STOP_CONSUME (1 << JOBCTL_STOP_CONSUME_BIT) 2154 - #define JOBCTL_TRAP_STOP (1 << JOBCTL_TRAP_STOP_BIT) 2155 - #define JOBCTL_TRAP_NOTIFY (1 << JOBCTL_TRAP_NOTIFY_BIT) 2156 - #define JOBCTL_TRAPPING (1 << JOBCTL_TRAPPING_BIT) 2157 - #define JOBCTL_LISTENING (1 << JOBCTL_LISTENING_BIT) 2082 + #define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) 2083 + #define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) 2084 + #define JOBCTL_STOP_CONSUME (1UL << JOBCTL_STOP_CONSUME_BIT) 2085 + #define JOBCTL_TRAP_STOP (1UL << JOBCTL_TRAP_STOP_BIT) 2086 + #define JOBCTL_TRAP_NOTIFY (1UL << JOBCTL_TRAP_NOTIFY_BIT) 2087 + #define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT) 2088 + #define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) 2158 2089 2159 2090 #define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) 2160 2091 #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK) 2161 2092 2162 2093 extern bool task_set_jobctl_pending(struct task_struct *task, 2163 - unsigned int mask); 2094 + unsigned long mask); 2164 2095 
extern void task_clear_jobctl_trapping(struct task_struct *task); 2165 2096 extern void task_clear_jobctl_pending(struct task_struct *task, 2166 - unsigned int mask); 2097 + unsigned long mask); 2167 2098 2168 2099 static inline void rcu_copy_process(struct task_struct *p) 2169 2100 { ··· 3033 2964 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times); 3034 2965 void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times); 3035 2966 3036 - static inline void thread_group_cputime_init(struct signal_struct *sig) 3037 - { 3038 - raw_spin_lock_init(&sig->cputimer.lock); 3039 - } 3040 - 3041 2967 /* 3042 2968 * Reevaluate whether the task has signals pending delivery. 3043 2969 * Wake the task if so. ··· 3146 3082 static inline unsigned long task_rlimit(const struct task_struct *tsk, 3147 3083 unsigned int limit) 3148 3084 { 3149 - return ACCESS_ONCE(tsk->signal->rlim[limit].rlim_cur); 3085 + return READ_ONCE(tsk->signal->rlim[limit].rlim_cur); 3150 3086 } 3151 3087 3152 3088 static inline unsigned long task_rlimit_max(const struct task_struct *tsk, 3153 3089 unsigned int limit) 3154 3090 { 3155 - return ACCESS_ONCE(tsk->signal->rlim[limit].rlim_max); 3091 + return READ_ONCE(tsk->signal->rlim[limit].rlim_max); 3156 3092 } 3157 3093 3158 3094 static inline unsigned long rlimit(unsigned int limit)
+3 -3
include/linux/topology.h
··· 191 191 #ifndef topology_core_id 192 192 #define topology_core_id(cpu) ((void)(cpu), 0) 193 193 #endif 194 - #ifndef topology_thread_cpumask 195 - #define topology_thread_cpumask(cpu) cpumask_of(cpu) 194 + #ifndef topology_sibling_cpumask 195 + #define topology_sibling_cpumask(cpu) cpumask_of(cpu) 196 196 #endif 197 197 #ifndef topology_core_cpumask 198 198 #define topology_core_cpumask(cpu) cpumask_of(cpu) ··· 201 201 #ifdef CONFIG_SCHED_SMT 202 202 static inline const struct cpumask *cpu_smt_mask(int cpu) 203 203 { 204 - return topology_thread_cpumask(cpu); 204 + return topology_sibling_cpumask(cpu); 205 205 } 206 206 #endif 207 207
+35 -13
include/linux/uaccess.h
··· 1 1 #ifndef __LINUX_UACCESS_H__ 2 2 #define __LINUX_UACCESS_H__ 3 3 4 - #include <linux/preempt.h> 4 + #include <linux/sched.h> 5 5 #include <asm/uaccess.h> 6 6 7 + static __always_inline void pagefault_disabled_inc(void) 8 + { 9 + current->pagefault_disabled++; 10 + } 11 + 12 + static __always_inline void pagefault_disabled_dec(void) 13 + { 14 + current->pagefault_disabled--; 15 + WARN_ON(current->pagefault_disabled < 0); 16 + } 17 + 7 18 /* 8 - * These routines enable/disable the pagefault handler in that 9 - * it will not take any locks and go straight to the fixup table. 19 + * These routines enable/disable the pagefault handler. If disabled, it will 20 + * not take any locks and go straight to the fixup table. 10 21 * 11 - * They have great resemblance to the preempt_disable/enable calls 12 - * and in fact they are identical; this is because currently there is 13 - * no other way to make the pagefault handlers do this. So we do 14 - * disable preemption but we don't necessarily care about that. 22 + * User access methods will not sleep when called from a pagefault_disabled() 23 + * environment. 15 24 */ 16 25 static inline void pagefault_disable(void) 17 26 { 18 - preempt_count_inc(); 27 + pagefault_disabled_inc(); 19 28 /* 20 29 * make sure to have issued the store before a pagefault 21 30 * can hit. ··· 34 25 35 26 static inline void pagefault_enable(void) 36 27 { 37 - #ifndef CONFIG_PREEMPT 38 28 /* 39 29 * make sure to issue those last loads/stores before enabling 40 30 * the pagefault handler again. 41 31 */ 42 32 barrier(); 43 - preempt_count_dec(); 44 - #else 45 - preempt_enable(); 46 - #endif 33 + pagefault_disabled_dec(); 47 34 } 35 + 36 + /* 37 + * Is the pagefault handler disabled? If so, user access methods will not sleep. 38 + */ 39 + #define pagefault_disabled() (current->pagefault_disabled != 0) 40 + 41 + /* 42 + * The pagefault handler is in general disabled by pagefault_disable() or 43 + * when in irq context (via in_atomic()). 
44 + * 45 + * This function should only be used by the fault handlers. Other users should 46 + * stick to pagefault_disabled(). 47 + * Please NEVER use preempt_disable() to disable the fault handler. With 48 + * !CONFIG_PREEMPT_COUNT, this is like a NOP. So the handler won't be disabled. 49 + * in_atomic() will report different values based on !CONFIG_PREEMPT_COUNT. 50 + */ 51 + #define faulthandler_disabled() (pagefault_disabled() || in_atomic()) 48 52 49 53 #ifndef ARCH_HAS_NOCACHE_UACCESS 50 54
+10 -7
include/linux/wait.h
··· 969 969 * on that signal. 970 970 */ 971 971 static inline int 972 - wait_on_bit(void *word, int bit, unsigned mode) 972 + wait_on_bit(unsigned long *word, int bit, unsigned mode) 973 973 { 974 974 might_sleep(); 975 975 if (!test_bit(bit, word)) ··· 994 994 * on that signal. 995 995 */ 996 996 static inline int 997 - wait_on_bit_io(void *word, int bit, unsigned mode) 997 + wait_on_bit_io(unsigned long *word, int bit, unsigned mode) 998 998 { 999 999 might_sleep(); 1000 1000 if (!test_bit(bit, word)) ··· 1020 1020 * received a signal and the mode permitted wakeup on that signal. 1021 1021 */ 1022 1022 static inline int 1023 - wait_on_bit_timeout(void *word, int bit, unsigned mode, unsigned long timeout) 1023 + wait_on_bit_timeout(unsigned long *word, int bit, unsigned mode, 1024 + unsigned long timeout) 1024 1025 { 1025 1026 might_sleep(); 1026 1027 if (!test_bit(bit, word)) ··· 1048 1047 * on that signal. 1049 1048 */ 1050 1049 static inline int 1051 - wait_on_bit_action(void *word, int bit, wait_bit_action_f *action, unsigned mode) 1050 + wait_on_bit_action(unsigned long *word, int bit, wait_bit_action_f *action, 1051 + unsigned mode) 1052 1052 { 1053 1053 might_sleep(); 1054 1054 if (!test_bit(bit, word)) ··· 1077 1075 * the @mode allows that signal to wake the process. 1078 1076 */ 1079 1077 static inline int 1080 - wait_on_bit_lock(void *word, int bit, unsigned mode) 1078 + wait_on_bit_lock(unsigned long *word, int bit, unsigned mode) 1081 1079 { 1082 1080 might_sleep(); 1083 1081 if (!test_and_set_bit(bit, word)) ··· 1101 1099 * the @mode allows that signal to wake the process. 1102 1100 */ 1103 1101 static inline int 1104 - wait_on_bit_lock_io(void *word, int bit, unsigned mode) 1102 + wait_on_bit_lock_io(unsigned long *word, int bit, unsigned mode) 1105 1103 { 1106 1104 might_sleep(); 1107 1105 if (!test_and_set_bit(bit, word)) ··· 1127 1125 * the @mode allows that signal to wake the process. 
1128 1126 */ 1129 1127 static inline int 1130 - wait_on_bit_lock_action(void *word, int bit, wait_bit_action_f *action, unsigned mode) 1128 + wait_on_bit_lock_action(unsigned long *word, int bit, wait_bit_action_f *action, 1129 + unsigned mode) 1131 1130 { 1132 1131 might_sleep(); 1133 1132 if (!test_and_set_bit(bit, word))
+2 -1
include/trace/events/sched.h
··· 147 147 __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|", 148 148 { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" }, 149 149 { 16, "Z" }, { 32, "X" }, { 64, "x" }, 150 - { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R", 150 + { 128, "K" }, { 256, "W" }, { 512, "P" }, 151 + { 1024, "N" }) : "R", 151 152 __entry->prev_state & TASK_STATE_MAX ? "+" : "", 152 153 __entry->next_comm, __entry->next_pid, __entry->next_prio) 153 154 );
+33 -21
ipc/mqueue.c
··· 47 47 #define RECV 1 48 48 49 49 #define STATE_NONE 0 50 - #define STATE_PENDING 1 51 - #define STATE_READY 2 50 + #define STATE_READY 1 52 51 53 52 struct posix_msg_tree_node { 54 53 struct rb_node rb_node; ··· 570 571 wq_add(info, sr, ewp); 571 572 572 573 for (;;) { 573 - set_current_state(TASK_INTERRUPTIBLE); 574 + __set_current_state(TASK_INTERRUPTIBLE); 574 575 575 576 spin_unlock(&info->lock); 576 577 time = schedule_hrtimeout_range_clock(timeout, 0, 577 578 HRTIMER_MODE_ABS, CLOCK_REALTIME); 578 - 579 - while (ewp->state == STATE_PENDING) 580 - cpu_relax(); 581 579 582 580 if (ewp->state == STATE_READY) { 583 581 retval = 0; ··· 903 907 * list of waiting receivers. A sender checks that list before adding the new 904 908 * message into the message array. If there is a waiting receiver, then it 905 909 * bypasses the message array and directly hands the message over to the 906 - * receiver. 907 - * The receiver accepts the message and returns without grabbing the queue 908 - * spinlock. Therefore an intermediate STATE_PENDING state and memory barriers 909 - * are necessary. The same algorithm is used for sysv semaphores, see 910 - * ipc/sem.c for more details. 910 + * receiver. The receiver accepts the message and returns without grabbing the 911 + * queue spinlock: 912 + * 913 + * - Set pointer to message. 914 + * - Queue the receiver task for later wakeup (without the info->lock). 915 + * - Update its state to STATE_READY. Now the receiver can continue. 916 + * - Wake up the process after the lock is dropped. Should the process wake up 917 + * before this wakeup (due to a timeout or a signal) it will either see 918 + * STATE_READY and continue or acquire the lock to check the state again. 911 919 * 912 920 * The same algorithm is used for senders. 913 921 */ ··· 919 919 /* pipelined_send() - send a message directly to the task waiting in 920 920 * sys_mq_timedreceive() (without inserting message into a queue). 
921 921 */ 922 - static inline void pipelined_send(struct mqueue_inode_info *info, 922 + static inline void pipelined_send(struct wake_q_head *wake_q, 923 + struct mqueue_inode_info *info, 923 924 struct msg_msg *message, 924 925 struct ext_wait_queue *receiver) 925 926 { 926 927 receiver->msg = message; 927 928 list_del(&receiver->list); 928 - receiver->state = STATE_PENDING; 929 - wake_up_process(receiver->task); 930 - smp_wmb(); 929 + wake_q_add(wake_q, receiver->task); 930 + /* 931 + * Rely on the implicit cmpxchg barrier from wake_q_add such 932 + * that we can ensure that updating receiver->state is the last 933 + * write operation: As once set, the receiver can continue, 934 + * and if we don't have the reference count from the wake_q, 935 + * yet, at that point we can later have a use-after-free 936 + * condition and bogus wakeup. 937 + */ 931 938 receiver->state = STATE_READY; 932 939 } 933 940 934 941 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend() 935 942 * gets its message and put to the queue (we have one free place for sure). 
*/ 936 - static inline void pipelined_receive(struct mqueue_inode_info *info) 943 + static inline void pipelined_receive(struct wake_q_head *wake_q, 944 + struct mqueue_inode_info *info) 937 945 { 938 946 struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND); 939 947 ··· 952 944 } 953 945 if (msg_insert(sender->msg, info)) 954 946 return; 947 + 955 948 list_del(&sender->list); 956 - sender->state = STATE_PENDING; 957 - wake_up_process(sender->task); 958 - smp_wmb(); 949 + wake_q_add(wake_q, sender->task); 959 950 sender->state = STATE_READY; 960 951 } 961 952 ··· 972 965 struct timespec ts; 973 966 struct posix_msg_tree_node *new_leaf = NULL; 974 967 int ret = 0; 968 + WAKE_Q(wake_q); 975 969 976 970 if (u_abs_timeout) { 977 971 int res = prepare_timeout(u_abs_timeout, &expires, &ts); ··· 1057 1049 } else { 1058 1050 receiver = wq_get_first_waiter(info, RECV); 1059 1051 if (receiver) { 1060 - pipelined_send(info, msg_ptr, receiver); 1052 + pipelined_send(&wake_q, info, msg_ptr, receiver); 1061 1053 } else { 1062 1054 /* adds message to the queue */ 1063 1055 ret = msg_insert(msg_ptr, info); ··· 1070 1062 } 1071 1063 out_unlock: 1072 1064 spin_unlock(&info->lock); 1065 + wake_up_q(&wake_q); 1073 1066 out_free: 1074 1067 if (ret) 1075 1068 free_msg(msg_ptr); ··· 1158 1149 msg_ptr = wait.msg; 1159 1150 } 1160 1151 } else { 1152 + WAKE_Q(wake_q); 1153 + 1161 1154 msg_ptr = msg_get(info); 1162 1155 1163 1156 inode->i_atime = inode->i_mtime = inode->i_ctime = 1164 1157 CURRENT_TIME; 1165 1158 1166 1159 /* There is now free space in queue. */ 1167 - pipelined_receive(info); 1160 + pipelined_receive(&wake_q, info); 1168 1161 spin_unlock(&info->lock); 1162 + wake_up_q(&wake_q); 1169 1163 ret = 0; 1170 1164 } 1171 1165 if (ret == 0) {
+4 -4
kernel/fork.c
··· 1091 1091 { 1092 1092 unsigned long cpu_limit; 1093 1093 1094 - /* Thread group counters. */ 1095 - thread_group_cputime_init(sig); 1096 - 1097 - cpu_limit = ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur); 1094 + cpu_limit = READ_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur); 1098 1095 if (cpu_limit != RLIM_INFINITY) { 1099 1096 sig->cputime_expires.prof_exp = secs_to_cputime(cpu_limit); 1100 1097 sig->cputimer.running = 1; ··· 1393 1396 p->hardirq_context = 0; 1394 1397 p->softirq_context = 0; 1395 1398 #endif 1399 + 1400 + p->pagefault_disabled = 0; 1401 + 1396 1402 #ifdef CONFIG_LOCKDEP 1397 1403 p->lockdep_depth = 0; /* no locks held yet */ 1398 1404 p->curr_chain_key = 0;
+17 -16
kernel/futex.c
··· 1090 1090 1091 1091 /* 1092 1092 * The hash bucket lock must be held when this is called. 1093 - * Afterwards, the futex_q must not be accessed. 1093 + * Afterwards, the futex_q must not be accessed. Callers 1094 + * must ensure to later call wake_up_q() for the actual 1095 + * wakeups to occur. 1094 1096 */ 1095 - static void wake_futex(struct futex_q *q) 1097 + static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) 1096 1098 { 1097 1099 struct task_struct *p = q->task; 1098 1100 ··· 1102 1100 return; 1103 1101 1104 1102 /* 1105 - * We set q->lock_ptr = NULL _before_ we wake up the task. If 1106 - * a non-futex wake up happens on another CPU then the task 1107 - * might exit and p would dereference a non-existing task 1108 - * struct. Prevent this by holding a reference on p across the 1109 - * wake up. 1103 + * Queue the task for later wakeup for after we've released 1104 + * the hb->lock. wake_q_add() grabs reference to p. 1110 1105 */ 1111 - get_task_struct(p); 1112 - 1106 + wake_q_add(wake_q, p); 1113 1107 __unqueue_futex(q); 1114 1108 /* 1115 1109 * The waiting task can free the futex_q as soon as ··· 1115 1117 */ 1116 1118 smp_wmb(); 1117 1119 q->lock_ptr = NULL; 1118 - 1119 - wake_up_state(p, TASK_NORMAL); 1120 - put_task_struct(p); 1121 1120 } 1122 1121 1123 1122 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this) ··· 1212 1217 struct futex_q *this, *next; 1213 1218 union futex_key key = FUTEX_KEY_INIT; 1214 1219 int ret; 1220 + WAKE_Q(wake_q); 1215 1221 1216 1222 if (!bitset) 1217 1223 return -EINVAL; ··· 1240 1244 if (!(this->bitset & bitset)) 1241 1245 continue; 1242 1246 1243 - wake_futex(this); 1247 + mark_wake_futex(&wake_q, this); 1244 1248 if (++ret >= nr_wake) 1245 1249 break; 1246 1250 } 1247 1251 } 1248 1252 1249 1253 spin_unlock(&hb->lock); 1254 + wake_up_q(&wake_q); 1250 1255 out_put_key: 1251 1256 put_futex_key(&key); 1252 1257 out: ··· 1266 1269 struct futex_hash_bucket *hb1, *hb2; 1267 1270 
struct futex_q *this, *next; 1268 1271 int ret, op_ret; 1272 + WAKE_Q(wake_q); 1269 1273 1270 1274 retry: 1271 1275 ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ); ··· 1318 1320 ret = -EINVAL; 1319 1321 goto out_unlock; 1320 1322 } 1321 - wake_futex(this); 1323 + mark_wake_futex(&wake_q, this); 1322 1324 if (++ret >= nr_wake) 1323 1325 break; 1324 1326 } ··· 1332 1334 ret = -EINVAL; 1333 1335 goto out_unlock; 1334 1336 } 1335 - wake_futex(this); 1337 + mark_wake_futex(&wake_q, this); 1336 1338 if (++op_ret >= nr_wake2) 1337 1339 break; 1338 1340 } ··· 1342 1344 1343 1345 out_unlock: 1344 1346 double_unlock_hb(hb1, hb2); 1347 + wake_up_q(&wake_q); 1345 1348 out_put_keys: 1346 1349 put_futex_key(&key2); 1347 1350 out_put_key1: ··· 1502 1503 struct futex_pi_state *pi_state = NULL; 1503 1504 struct futex_hash_bucket *hb1, *hb2; 1504 1505 struct futex_q *this, *next; 1506 + WAKE_Q(wake_q); 1505 1507 1506 1508 if (requeue_pi) { 1507 1509 /* ··· 1679 1679 * woken by futex_unlock_pi(). 1680 1680 */ 1681 1681 if (++task_count <= nr_wake && !requeue_pi) { 1682 - wake_futex(this); 1682 + mark_wake_futex(&wake_q, this); 1683 1683 continue; 1684 1684 } 1685 1685 ··· 1719 1719 out_unlock: 1720 1720 free_pi_state(pi_state); 1721 1721 double_unlock_hb(hb1, hb2); 1722 + wake_up_q(&wake_q); 1722 1723 hb_waiters_dec(hb2); 1723 1724 1724 1725 /*
+22
kernel/locking/lglock.c
··· 60 60 } 61 61 EXPORT_SYMBOL(lg_local_unlock_cpu); 62 62 63 + void lg_double_lock(struct lglock *lg, int cpu1, int cpu2) 64 + { 65 + BUG_ON(cpu1 == cpu2); 66 + 67 + /* lock in cpu order, just like lg_global_lock */ 68 + if (cpu2 < cpu1) 69 + swap(cpu1, cpu2); 70 + 71 + preempt_disable(); 72 + lock_acquire_shared(&lg->lock_dep_map, 0, 0, NULL, _RET_IP_); 73 + arch_spin_lock(per_cpu_ptr(lg->lock, cpu1)); 74 + arch_spin_lock(per_cpu_ptr(lg->lock, cpu2)); 75 + } 76 + 77 + void lg_double_unlock(struct lglock *lg, int cpu1, int cpu2) 78 + { 79 + lock_release(&lg->lock_dep_map, 1, _RET_IP_); 80 + arch_spin_unlock(per_cpu_ptr(lg->lock, cpu1)); 81 + arch_spin_unlock(per_cpu_ptr(lg->lock, cpu2)); 82 + preempt_enable(); 83 + } 84 + 63 85 void lg_global_lock(struct lglock *lg) 64 86 { 65 87 int i;
+1 -1
kernel/sched/Makefile
··· 11 11 CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer 12 12 endif 13 13 14 - obj-y += core.o proc.o clock.o cputime.o 14 + obj-y += core.o loadavg.o clock.o cputime.o 15 15 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o 16 16 obj-y += wait.o completion.o idle.o 17 17 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
+1 -5
kernel/sched/auto_group.c
··· 1 - #ifdef CONFIG_SCHED_AUTOGROUP 2 - 3 1 #include "sched.h" 4 2 5 3 #include <linux/proc_fs.h> ··· 139 141 140 142 p->signal->autogroup = autogroup_kref_get(ag); 141 143 142 - if (!ACCESS_ONCE(sysctl_sched_autogroup_enabled)) 144 + if (!READ_ONCE(sysctl_sched_autogroup_enabled)) 143 145 goto out; 144 146 145 147 for_each_thread(p, t) ··· 247 249 return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id); 248 250 } 249 251 #endif /* CONFIG_SCHED_DEBUG */ 250 - 251 - #endif /* CONFIG_SCHED_AUTOGROUP */
+1 -1
kernel/sched/auto_group.h
··· 29 29 static inline struct task_group * 30 30 autogroup_task_group(struct task_struct *p, struct task_group *tg) 31 31 { 32 - int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled); 32 + int enabled = READ_ONCE(sysctl_sched_autogroup_enabled); 33 33 34 34 if (enabled && task_wants_autogroup(p, tg)) 35 35 return p->signal->autogroup->tg;
+97 -39
kernel/sched/core.c
··· 511 511 static bool set_nr_if_polling(struct task_struct *p) 512 512 { 513 513 struct thread_info *ti = task_thread_info(p); 514 - typeof(ti->flags) old, val = ACCESS_ONCE(ti->flags); 514 + typeof(ti->flags) old, val = READ_ONCE(ti->flags); 515 515 516 516 for (;;) { 517 517 if (!(val & _TIF_POLLING_NRFLAG)) ··· 540 540 } 541 541 #endif 542 542 #endif 543 + 544 + void wake_q_add(struct wake_q_head *head, struct task_struct *task) 545 + { 546 + struct wake_q_node *node = &task->wake_q; 547 + 548 + /* 549 + * Atomically grab the task, if ->wake_q is !nil already it means 550 + * its already queued (either by us or someone else) and will get the 551 + * wakeup due to that. 552 + * 553 + * This cmpxchg() implies a full barrier, which pairs with the write 554 + * barrier implied by the wakeup in wake_up_list(). 555 + */ 556 + if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL)) 557 + return; 558 + 559 + get_task_struct(task); 560 + 561 + /* 562 + * The head is context local, there can be no concurrency. 563 + */ 564 + *head->lastp = node; 565 + head->lastp = &node->next; 566 + } 567 + 568 + void wake_up_q(struct wake_q_head *head) 569 + { 570 + struct wake_q_node *node = head->first; 571 + 572 + while (node != WAKE_Q_TAIL) { 573 + struct task_struct *task; 574 + 575 + task = container_of(node, struct task_struct, wake_q); 576 + BUG_ON(!task); 577 + /* task can safely be re-inserted now */ 578 + node = node->next; 579 + task->wake_q.next = NULL; 580 + 581 + /* 582 + * wake_up_process() implies a wmb() to pair with the queueing 583 + * in wake_q_add() so as not to miss wakeups. 584 + */ 585 + wake_up_process(task); 586 + put_task_struct(task); 587 + } 588 + } 543 589 544 590 /* 545 591 * resched_curr - mark rq's current task 'to be rescheduled now'. 
··· 2151 2105 2152 2106 #ifdef CONFIG_PREEMPT_NOTIFIERS 2153 2107 2108 + static struct static_key preempt_notifier_key = STATIC_KEY_INIT_FALSE; 2109 + 2154 2110 /** 2155 2111 * preempt_notifier_register - tell me when current is being preempted & rescheduled 2156 2112 * @notifier: notifier struct to register 2157 2113 */ 2158 2114 void preempt_notifier_register(struct preempt_notifier *notifier) 2159 2115 { 2116 + static_key_slow_inc(&preempt_notifier_key); 2160 2117 hlist_add_head(&notifier->link, &current->preempt_notifiers); 2161 2118 } 2162 2119 EXPORT_SYMBOL_GPL(preempt_notifier_register); ··· 2168 2119 * preempt_notifier_unregister - no longer interested in preemption notifications 2169 2120 * @notifier: notifier struct to unregister 2170 2121 * 2171 - * This is safe to call from within a preemption notifier. 2122 + * This is *not* safe to call from within a preemption notifier. 2172 2123 */ 2173 2124 void preempt_notifier_unregister(struct preempt_notifier *notifier) 2174 2125 { 2175 2126 hlist_del(&notifier->link); 2127 + static_key_slow_dec(&preempt_notifier_key); 2176 2128 } 2177 2129 EXPORT_SYMBOL_GPL(preempt_notifier_unregister); 2178 2130 2179 - static void fire_sched_in_preempt_notifiers(struct task_struct *curr) 2131 + static void __fire_sched_in_preempt_notifiers(struct task_struct *curr) 2180 2132 { 2181 2133 struct preempt_notifier *notifier; 2182 2134 ··· 2185 2135 notifier->ops->sched_in(notifier, raw_smp_processor_id()); 2186 2136 } 2187 2137 2138 + static __always_inline void fire_sched_in_preempt_notifiers(struct task_struct *curr) 2139 + { 2140 + if (static_key_false(&preempt_notifier_key)) 2141 + __fire_sched_in_preempt_notifiers(curr); 2142 + } 2143 + 2188 2144 static void 2189 - fire_sched_out_preempt_notifiers(struct task_struct *curr, 2190 - struct task_struct *next) 2145 + __fire_sched_out_preempt_notifiers(struct task_struct *curr, 2146 + struct task_struct *next) 2191 2147 { 2192 2148 struct preempt_notifier *notifier; 2193 2149 ··· 
2201 2145 notifier->ops->sched_out(notifier, next); 2202 2146 } 2203 2147 2148 + static __always_inline void 2149 + fire_sched_out_preempt_notifiers(struct task_struct *curr, 2150 + struct task_struct *next) 2151 + { 2152 + if (static_key_false(&preempt_notifier_key)) 2153 + __fire_sched_out_preempt_notifiers(curr, next); 2154 + } 2155 + 2204 2156 #else /* !CONFIG_PREEMPT_NOTIFIERS */ 2205 2157 2206 - static void fire_sched_in_preempt_notifiers(struct task_struct *curr) 2158 + static inline void fire_sched_in_preempt_notifiers(struct task_struct *curr) 2207 2159 { 2208 2160 } 2209 2161 2210 - static void 2162 + static inline void 2211 2163 fire_sched_out_preempt_notifiers(struct task_struct *curr, 2212 2164 struct task_struct *next) 2213 2165 { ··· 2461 2397 2462 2398 void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) 2463 2399 { 2464 - struct rq *this = this_rq(); 2465 - *nr_waiters = atomic_read(&this->nr_iowait); 2466 - *load = this->cpu_load[0]; 2400 + struct rq *rq = this_rq(); 2401 + *nr_waiters = atomic_read(&rq->nr_iowait); 2402 + *load = rq->load.weight; 2467 2403 } 2468 2404 2469 2405 #ifdef CONFIG_SMP ··· 2561 2497 update_rq_clock(rq); 2562 2498 curr->sched_class->task_tick(rq, curr, 0); 2563 2499 update_cpu_load_active(rq); 2500 + calc_global_load_tick(rq); 2564 2501 raw_spin_unlock(&rq->lock); 2565 2502 2566 2503 perf_event_task_tick(); ··· 2590 2525 u64 scheduler_tick_max_deferment(void) 2591 2526 { 2592 2527 struct rq *rq = this_rq(); 2593 - unsigned long next, now = ACCESS_ONCE(jiffies); 2528 + unsigned long next, now = READ_ONCE(jiffies); 2594 2529 2595 2530 next = rq->last_sched_tick + HZ; 2596 2531 ··· 2791 2726 * - return from syscall or exception to user-space 2792 2727 * - return from interrupt-handler to user-space 2793 2728 * 2794 - * WARNING: all callers must re-check need_resched() afterward and reschedule 2795 - * accordingly in case an event triggered the need for rescheduling (such as 2796 - * an interrupt waking up a 
task) while preemption was disabled in __schedule(). 2729 + * WARNING: must be called with preemption disabled! 2797 2730 */ 2798 2731 static void __sched __schedule(void) 2799 2732 { ··· 2800 2737 struct rq *rq; 2801 2738 int cpu; 2802 2739 2803 - preempt_disable(); 2804 2740 cpu = smp_processor_id(); 2805 2741 rq = cpu_rq(cpu); 2806 2742 rcu_note_context_switch(); ··· 2863 2801 raw_spin_unlock_irq(&rq->lock); 2864 2802 2865 2803 post_schedule(rq); 2866 - 2867 - sched_preempt_enable_no_resched(); 2868 2804 } 2869 2805 2870 2806 static inline void sched_submit_work(struct task_struct *tsk) ··· 2883 2823 2884 2824 sched_submit_work(tsk); 2885 2825 do { 2826 + preempt_disable(); 2886 2827 __schedule(); 2828 + sched_preempt_enable_no_resched(); 2887 2829 } while (need_resched()); 2888 2830 } 2889 2831 EXPORT_SYMBOL(schedule); ··· 2924 2862 static void __sched notrace preempt_schedule_common(void) 2925 2863 { 2926 2864 do { 2927 - __preempt_count_add(PREEMPT_ACTIVE); 2865 + preempt_active_enter(); 2928 2866 __schedule(); 2929 - __preempt_count_sub(PREEMPT_ACTIVE); 2867 + preempt_active_exit(); 2930 2868 2931 2869 /* 2932 2870 * Check again in case we missed a preemption opportunity 2933 2871 * between schedule and now. 2934 2872 */ 2935 - barrier(); 2936 2873 } while (need_resched()); 2937 2874 } 2938 2875 ··· 2955 2894 NOKPROBE_SYMBOL(preempt_schedule); 2956 2895 EXPORT_SYMBOL(preempt_schedule); 2957 2896 2958 - #ifdef CONFIG_CONTEXT_TRACKING 2959 2897 /** 2960 - * preempt_schedule_context - preempt_schedule called by tracing 2898 + * preempt_schedule_notrace - preempt_schedule called by tracing 2961 2899 * 2962 2900 * The tracing infrastructure uses preempt_enable_notrace to prevent 2963 2901 * recursion and tracing preempt enabling caused by the tracing ··· 2969 2909 * instead of preempt_schedule() to exit user context if needed before 2970 2910 * calling the scheduler. 
2971 2911 */ 2972 - asmlinkage __visible void __sched notrace preempt_schedule_context(void) 2912 + asmlinkage __visible void __sched notrace preempt_schedule_notrace(void) 2973 2913 { 2974 2914 enum ctx_state prev_ctx; 2975 2915 ··· 2977 2917 return; 2978 2918 2979 2919 do { 2980 - __preempt_count_add(PREEMPT_ACTIVE); 2920 + /* 2921 + * Use raw __prempt_count() ops that don't call function. 2922 + * We can't call functions before disabling preemption which 2923 + * disarm preemption tracing recursions. 2924 + */ 2925 + __preempt_count_add(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); 2926 + barrier(); 2981 2927 /* 2982 2928 * Needs preempt disabled in case user_exit() is traced 2983 2929 * and the tracer calls preempt_enable_notrace() causing ··· 2993 2927 __schedule(); 2994 2928 exception_exit(prev_ctx); 2995 2929 2996 - __preempt_count_sub(PREEMPT_ACTIVE); 2997 2930 barrier(); 2931 + __preempt_count_sub(PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET); 2998 2932 } while (need_resched()); 2999 2933 } 3000 - EXPORT_SYMBOL_GPL(preempt_schedule_context); 3001 - #endif /* CONFIG_CONTEXT_TRACKING */ 2934 + EXPORT_SYMBOL_GPL(preempt_schedule_notrace); 3002 2935 3003 2936 #endif /* CONFIG_PREEMPT */ 3004 2937 ··· 3017 2952 prev_state = exception_enter(); 3018 2953 3019 2954 do { 3020 - __preempt_count_add(PREEMPT_ACTIVE); 2955 + preempt_active_enter(); 3021 2956 local_irq_enable(); 3022 2957 __schedule(); 3023 2958 local_irq_disable(); 3024 - __preempt_count_sub(PREEMPT_ACTIVE); 3025 - 3026 - /* 3027 - * Check again in case we missed a preemption opportunity 3028 - * between schedule and now. 
3029 - */ 3030 - barrier(); 2959 + preempt_active_exit(); 3031 2960 } while (need_resched()); 3032 2961 3033 2962 exception_exit(prev_state); ··· 3099 3040 if (!dl_prio(p->normal_prio) || 3100 3041 (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) { 3101 3042 p->dl.dl_boosted = 1; 3102 - p->dl.dl_throttled = 0; 3103 3043 enqueue_flag = ENQUEUE_REPLENISH; 3104 3044 } else 3105 3045 p->dl.dl_boosted = 0; ··· 5372 5314 .priority = CPU_PRI_MIGRATION, 5373 5315 }; 5374 5316 5375 - static void __cpuinit set_cpu_rq_start_time(void) 5317 + static void set_cpu_rq_start_time(void) 5376 5318 { 5377 5319 int cpu = smp_processor_id(); 5378 5320 struct rq *rq = cpu_rq(cpu); ··· 7792 7734 return rt_runtime_us; 7793 7735 } 7794 7736 7795 - static int sched_group_set_rt_period(struct task_group *tg, long rt_period_us) 7737 + static int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us) 7796 7738 { 7797 7739 u64 rt_runtime, rt_period; 7798 7740 7799 - rt_period = (u64)rt_period_us * NSEC_PER_USEC; 7741 + rt_period = rt_period_us * NSEC_PER_USEC; 7800 7742 rt_runtime = tg->rt_bandwidth.rt_runtime; 7801 7743 7802 7744 return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
+1 -1
kernel/sched/cputime.c
··· 567 567 { 568 568 cputime_t old; 569 569 570 - while (new > (old = ACCESS_ONCE(*counter))) 570 + while (new > (old = READ_ONCE(*counter))) 571 571 cmpxchg_cputime(counter, old, new); 572 572 } 573 573
+45 -6
kernel/sched/deadline.c
··· 640 640 } 641 641 642 642 static 643 - int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se) 643 + int dl_runtime_exceeded(struct sched_dl_entity *dl_se) 644 644 { 645 645 return (dl_se->runtime <= 0); 646 646 } ··· 684 684 sched_rt_avg_update(rq, delta_exec); 685 685 686 686 dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec; 687 - if (dl_runtime_exceeded(rq, dl_se)) { 687 + if (dl_runtime_exceeded(dl_se)) { 688 688 dl_se->dl_throttled = 1; 689 689 __dequeue_task_dl(rq, curr, 0); 690 690 if (unlikely(!start_dl_timer(dl_se, curr->dl.dl_boosted))) ··· 995 995 rq = cpu_rq(cpu); 996 996 997 997 rcu_read_lock(); 998 - curr = ACCESS_ONCE(rq->curr); /* unlocked access */ 998 + curr = READ_ONCE(rq->curr); /* unlocked access */ 999 999 1000 1000 /* 1001 1001 * If we are dealing with a -deadline task, we must ··· 1012 1012 (p->nr_cpus_allowed > 1)) { 1013 1013 int target = find_later_rq(p); 1014 1014 1015 - if (target != -1) 1015 + if (target != -1 && 1016 + dl_time_before(p->dl.deadline, 1017 + cpu_rq(target)->dl.earliest_dl.curr)) 1016 1018 cpu = target; 1017 1019 } 1018 1020 rcu_read_unlock(); ··· 1232 1230 return NULL; 1233 1231 } 1234 1232 1233 + /* 1234 + * Return the earliest pushable rq's task, which is suitable to be executed 1235 + * on the CPU, NULL otherwise: 1236 + */ 1237 + static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu) 1238 + { 1239 + struct rb_node *next_node = rq->dl.pushable_dl_tasks_leftmost; 1240 + struct task_struct *p = NULL; 1241 + 1242 + if (!has_pushable_dl_tasks(rq)) 1243 + return NULL; 1244 + 1245 + next_node: 1246 + if (next_node) { 1247 + p = rb_entry(next_node, struct task_struct, pushable_dl_tasks); 1248 + 1249 + if (pick_dl_task(rq, p, cpu)) 1250 + return p; 1251 + 1252 + next_node = rb_next(next_node); 1253 + goto next_node; 1254 + } 1255 + 1256 + return NULL; 1257 + } 1258 + 1235 1259 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); 1236 1260 1237 1261 static int 
find_later_rq(struct task_struct *task) ··· 1360 1332 break; 1361 1333 1362 1334 later_rq = cpu_rq(cpu); 1335 + 1336 + if (!dl_time_before(task->dl.deadline, 1337 + later_rq->dl.earliest_dl.curr)) { 1338 + /* 1339 + * Target rq has tasks of equal or earlier deadline, 1340 + * retrying does not release any lock and is unlikely 1341 + * to yield a different result. 1342 + */ 1343 + later_rq = NULL; 1344 + break; 1345 + } 1363 1346 1364 1347 /* Retry if something changed. */ 1365 1348 if (double_lock_balance(rq, later_rq)) { ··· 1553 1514 if (src_rq->dl.dl_nr_running <= 1) 1554 1515 goto skip; 1555 1516 1556 - p = pick_next_earliest_dl_task(src_rq, this_cpu); 1517 + p = pick_earliest_pushable_dl_task(src_rq, this_cpu); 1557 1518 1558 1519 /* 1559 1520 * We found a task to be pulled if: ··· 1698 1659 cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu); 1699 1660 } 1700 1661 1701 - void init_sched_dl_class(void) 1662 + void __init init_sched_dl_class(void) 1702 1663 { 1703 1664 unsigned int i; 1704 1665
+7 -4
kernel/sched/debug.c
··· 132 132 p->prio); 133 133 #ifdef CONFIG_SCHEDSTATS 134 134 SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", 135 - SPLIT_NS(p->se.vruntime), 135 + SPLIT_NS(p->se.statistics.wait_sum), 136 136 SPLIT_NS(p->se.sum_exec_runtime), 137 137 SPLIT_NS(p->se.statistics.sum_sleep_runtime)); 138 138 #else 139 - SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld", 140 - 0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L); 139 + SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld", 140 + 0LL, 0L, 141 + SPLIT_NS(p->se.sum_exec_runtime), 142 + 0LL, 0L); 141 143 #endif 142 144 #ifdef CONFIG_NUMA_BALANCING 143 145 SEQ_printf(m, " %d", task_node(p)); ··· 158 156 SEQ_printf(m, 159 157 "\nrunnable tasks:\n" 160 158 " task PID tree-key switches prio" 161 - " exec-runtime sum-exec sum-sleep\n" 159 + " wait-time sum-exec sum-sleep\n" 162 160 "------------------------------------------------------" 163 161 "----------------------------------------------------\n"); 164 162 ··· 584 582 nr_switches = p->nvcsw + p->nivcsw; 585 583 586 584 #ifdef CONFIG_SCHEDSTATS 585 + PN(se.statistics.sum_sleep_runtime); 587 586 PN(se.statistics.wait_start); 588 587 PN(se.statistics.sleep_start); 589 588 PN(se.statistics.block_start);
+297 -77
kernel/sched/fair.c
··· 141 141 * 142 142 * This idea comes from the SD scheduler of Con Kolivas: 143 143 */ 144 - static int get_update_sysctl_factor(void) 144 + static unsigned int get_update_sysctl_factor(void) 145 145 { 146 - unsigned int cpus = min_t(int, num_online_cpus(), 8); 146 + unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8); 147 147 unsigned int factor; 148 148 149 149 switch (sysctl_sched_tunable_scaling) { ··· 576 576 loff_t *ppos) 577 577 { 578 578 int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); 579 - int factor = get_update_sysctl_factor(); 579 + unsigned int factor = get_update_sysctl_factor(); 580 580 581 581 if (ret || !write) 582 582 return ret; ··· 834 834 835 835 static unsigned int task_scan_min(struct task_struct *p) 836 836 { 837 - unsigned int scan_size = ACCESS_ONCE(sysctl_numa_balancing_scan_size); 837 + unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size); 838 838 unsigned int scan, floor; 839 839 unsigned int windows = 1; 840 840 ··· 1198 1198 static bool load_too_imbalanced(long src_load, long dst_load, 1199 1199 struct task_numa_env *env) 1200 1200 { 1201 + long imb, old_imb; 1202 + long orig_src_load, orig_dst_load; 1201 1203 long src_capacity, dst_capacity; 1202 - long orig_src_load; 1203 - long load_a, load_b; 1204 - long moved_load; 1205 - long imb; 1206 1204 1207 1205 /* 1208 1206 * The load is corrected for the CPU capacity available on each node. ··· 1213 1215 dst_capacity = env->dst_stats.compute_capacity; 1214 1216 1215 1217 /* We care about the slope of the imbalance, not the direction. */ 1216 - load_a = dst_load; 1217 - load_b = src_load; 1218 - if (load_a < load_b) 1219 - swap(load_a, load_b); 1218 + if (dst_load < src_load) 1219 + swap(dst_load, src_load); 1220 1220 1221 1221 /* Is the difference below the threshold? 
*/ 1222 - imb = load_a * src_capacity * 100 - 1223 - load_b * dst_capacity * env->imbalance_pct; 1222 + imb = dst_load * src_capacity * 100 - 1223 + src_load * dst_capacity * env->imbalance_pct; 1224 1224 if (imb <= 0) 1225 1225 return false; 1226 1226 1227 1227 /* 1228 1228 * The imbalance is above the allowed threshold. 1229 - * Allow a move that brings us closer to a balanced situation, 1230 - * without moving things past the point of balance. 1229 + * Compare it with the old imbalance. 1231 1230 */ 1232 1231 orig_src_load = env->src_stats.load; 1232 + orig_dst_load = env->dst_stats.load; 1233 1233 1234 - /* 1235 - * In a task swap, there will be one load moving from src to dst, 1236 - * and another moving back. This is the net sum of both moves. 1237 - * A simple task move will always have a positive value. 1238 - * Allow the move if it brings the system closer to a balanced 1239 - * situation, without crossing over the balance point. 1240 - */ 1241 - moved_load = orig_src_load - src_load; 1234 + if (orig_dst_load < orig_src_load) 1235 + swap(orig_dst_load, orig_src_load); 1242 1236 1243 - if (moved_load > 0) 1244 - /* Moving src -> dst. Did we overshoot balance? */ 1245 - return src_load * dst_capacity < dst_load * src_capacity; 1246 - else 1247 - /* Moving dst -> src. Did we overshoot balance? */ 1248 - return dst_load * src_capacity < src_load * dst_capacity; 1237 + old_imb = orig_dst_load * src_capacity * 100 - 1238 + orig_src_load * dst_capacity * env->imbalance_pct; 1239 + 1240 + /* Would this change make things worse? */ 1241 + return (imb > old_imb); 1249 1242 } 1250 1243 1251 1244 /* ··· 1398 1409 } 1399 1410 } 1400 1411 1412 + /* Only move tasks to a NUMA node less busy than the current node. 
*/ 1413 + static bool numa_has_capacity(struct task_numa_env *env) 1414 + { 1415 + struct numa_stats *src = &env->src_stats; 1416 + struct numa_stats *dst = &env->dst_stats; 1417 + 1418 + if (src->has_free_capacity && !dst->has_free_capacity) 1419 + return false; 1420 + 1421 + /* 1422 + * Only consider a task move if the source has a higher load 1423 + * than the destination, corrected for CPU capacity on each node. 1424 + * 1425 + * src->load dst->load 1426 + * --------------------- vs --------------------- 1427 + * src->compute_capacity dst->compute_capacity 1428 + */ 1429 + if (src->load * dst->compute_capacity > 1430 + dst->load * src->compute_capacity) 1431 + return true; 1432 + 1433 + return false; 1434 + } 1435 + 1401 1436 static int task_numa_migrate(struct task_struct *p) 1402 1437 { 1403 1438 struct task_numa_env env = { ··· 1476 1463 update_numa_stats(&env.dst_stats, env.dst_nid); 1477 1464 1478 1465 /* Try to find a spot on the preferred nid. */ 1479 - task_numa_find_cpu(&env, taskimp, groupimp); 1466 + if (numa_has_capacity(&env)) 1467 + task_numa_find_cpu(&env, taskimp, groupimp); 1480 1468 1481 1469 /* 1482 1470 * Look at other nodes in these cases: ··· 1508 1494 env.dist = dist; 1509 1495 env.dst_nid = nid; 1510 1496 update_numa_stats(&env.dst_stats, env.dst_nid); 1511 - task_numa_find_cpu(&env, taskimp, groupimp); 1497 + if (numa_has_capacity(&env)) 1498 + task_numa_find_cpu(&env, taskimp, groupimp); 1512 1499 } 1513 1500 } 1514 1501 ··· 1809 1794 u64 runtime, period; 1810 1795 spinlock_t *group_lock = NULL; 1811 1796 1812 - seq = ACCESS_ONCE(p->mm->numa_scan_seq); 1797 + /* 1798 + * The p->mm->numa_scan_seq field gets updated without 1799 + * exclusive access. 
Use READ_ONCE() here to ensure 1800 + * that the field is read in a single access: 1801 + */ 1802 + seq = READ_ONCE(p->mm->numa_scan_seq); 1813 1803 if (p->numa_scan_seq == seq) 1814 1804 return; 1815 1805 p->numa_scan_seq = seq; ··· 1958 1938 } 1959 1939 1960 1940 rcu_read_lock(); 1961 - tsk = ACCESS_ONCE(cpu_rq(cpu)->curr); 1941 + tsk = READ_ONCE(cpu_rq(cpu)->curr); 1962 1942 1963 1943 if (!cpupid_match_pid(tsk, cpupid)) 1964 1944 goto no_join; ··· 2127 2107 2128 2108 static void reset_ptenuma_scan(struct task_struct *p) 2129 2109 { 2130 - ACCESS_ONCE(p->mm->numa_scan_seq)++; 2110 + /* 2111 + * We only did a read acquisition of the mmap sem, so 2112 + * p->mm->numa_scan_seq is written to without exclusive access 2113 + * and the update is not guaranteed to be atomic. That's not 2114 + * much of an issue though, since this is just used for 2115 + * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not 2116 + * expensive, to avoid any form of compiler optimizations: 2117 + */ 2118 + WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); 2131 2119 p->mm->numa_scan_offset = 0; 2132 2120 } 2133 2121 ··· 4351 4323 } 4352 4324 4353 4325 #ifdef CONFIG_SMP 4326 + 4327 + /* 4328 + * per rq 'load' arrray crap; XXX kill this. 4329 + */ 4330 + 4331 + /* 4332 + * The exact cpuload at various idx values, calculated at every tick would be 4333 + * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load 4334 + * 4335 + * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called 4336 + * on nth tick when cpu may be busy, then we have: 4337 + * load = ((2^idx - 1) / 2^idx)^(n-1) * load 4338 + * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load 4339 + * 4340 + * decay_load_missed() below does efficient calculation of 4341 + * load = ((2^idx - 1) / 2^idx)^(n-1) * load 4342 + * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load 4343 + * 4344 + * The calculation is approximated on a 128 point scale. 
4345 + * degrade_zero_ticks is the number of ticks after which load at any 4346 + * particular idx is approximated to be zero. 4347 + * degrade_factor is a precomputed table, a row for each load idx. 4348 + * Each column corresponds to degradation factor for a power of two ticks, 4349 + * based on 128 point scale. 4350 + * Example: 4351 + * row 2, col 3 (=12) says that the degradation at load idx 2 after 4352 + * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8). 4353 + * 4354 + * With this power of 2 load factors, we can degrade the load n times 4355 + * by looking at 1 bits in n and doing as many mult/shift instead of 4356 + * n mult/shifts needed by the exact degradation. 4357 + */ 4358 + #define DEGRADE_SHIFT 7 4359 + static const unsigned char 4360 + degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128}; 4361 + static const unsigned char 4362 + degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = { 4363 + {0, 0, 0, 0, 0, 0, 0, 0}, 4364 + {64, 32, 8, 0, 0, 0, 0, 0}, 4365 + {96, 72, 40, 12, 1, 0, 0}, 4366 + {112, 98, 75, 43, 15, 1, 0}, 4367 + {120, 112, 98, 76, 45, 16, 2} }; 4368 + 4369 + /* 4370 + * Update cpu_load for any missed ticks, due to tickless idle. The backlog 4371 + * would be when CPU is idle and so we just decay the old load without 4372 + * adding any new load. 4373 + */ 4374 + static unsigned long 4375 + decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) 4376 + { 4377 + int j = 0; 4378 + 4379 + if (!missed_updates) 4380 + return load; 4381 + 4382 + if (missed_updates >= degrade_zero_ticks[idx]) 4383 + return 0; 4384 + 4385 + if (idx == 1) 4386 + return load >> missed_updates; 4387 + 4388 + while (missed_updates) { 4389 + if (missed_updates % 2) 4390 + load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT; 4391 + 4392 + missed_updates >>= 1; 4393 + j++; 4394 + } 4395 + return load; 4396 + } 4397 + 4398 + /* 4399 + * Update rq->cpu_load[] statistics. 
This function is usually called every 4400 + * scheduler tick (TICK_NSEC). With tickless idle this will not be called 4401 + * every tick. We fix it up based on jiffies. 4402 + */ 4403 + static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, 4404 + unsigned long pending_updates) 4405 + { 4406 + int i, scale; 4407 + 4408 + this_rq->nr_load_updates++; 4409 + 4410 + /* Update our load: */ 4411 + this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ 4412 + for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { 4413 + unsigned long old_load, new_load; 4414 + 4415 + /* scale is effectively 1 << i now, and >> i divides by scale */ 4416 + 4417 + old_load = this_rq->cpu_load[i]; 4418 + old_load = decay_load_missed(old_load, pending_updates - 1, i); 4419 + new_load = this_load; 4420 + /* 4421 + * Round up the averaging division if load is increasing. This 4422 + * prevents us from getting stuck on 9 if the load is 10, for 4423 + * example. 4424 + */ 4425 + if (new_load > old_load) 4426 + new_load += scale - 1; 4427 + 4428 + this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i; 4429 + } 4430 + 4431 + sched_avg_update(this_rq); 4432 + } 4433 + 4434 + #ifdef CONFIG_NO_HZ_COMMON 4435 + /* 4436 + * There is no sane way to deal with nohz on smp when using jiffies because the 4437 + * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading 4438 + * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}. 4439 + * 4440 + * Therefore we cannot use the delta approach from the regular tick since that 4441 + * would seriously skew the load calculation. However we'll make do for those 4442 + * updates happening while idle (nohz_idle_balance) or coming out of idle 4443 + * (tick_nohz_idle_exit). 4444 + * 4445 + * This means we might still be one tick off for nohz periods. 4446 + */ 4447 + 4448 + /* 4449 + * Called from nohz_idle_balance() to update the load ratings before doing the 4450 + * idle balance. 
4451 + */ 4452 + static void update_idle_cpu_load(struct rq *this_rq) 4453 + { 4454 + unsigned long curr_jiffies = READ_ONCE(jiffies); 4455 + unsigned long load = this_rq->cfs.runnable_load_avg; 4456 + unsigned long pending_updates; 4457 + 4458 + /* 4459 + * bail if there's load or we're actually up-to-date. 4460 + */ 4461 + if (load || curr_jiffies == this_rq->last_load_update_tick) 4462 + return; 4463 + 4464 + pending_updates = curr_jiffies - this_rq->last_load_update_tick; 4465 + this_rq->last_load_update_tick = curr_jiffies; 4466 + 4467 + __update_cpu_load(this_rq, load, pending_updates); 4468 + } 4469 + 4470 + /* 4471 + * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed. 4472 + */ 4473 + void update_cpu_load_nohz(void) 4474 + { 4475 + struct rq *this_rq = this_rq(); 4476 + unsigned long curr_jiffies = READ_ONCE(jiffies); 4477 + unsigned long pending_updates; 4478 + 4479 + if (curr_jiffies == this_rq->last_load_update_tick) 4480 + return; 4481 + 4482 + raw_spin_lock(&this_rq->lock); 4483 + pending_updates = curr_jiffies - this_rq->last_load_update_tick; 4484 + if (pending_updates) { 4485 + this_rq->last_load_update_tick = curr_jiffies; 4486 + /* 4487 + * We were idle, this means load 0, the current load might be 4488 + * !0 due to remote wakeups and the sort. 4489 + */ 4490 + __update_cpu_load(this_rq, 0, pending_updates); 4491 + } 4492 + raw_spin_unlock(&this_rq->lock); 4493 + } 4494 + #endif /* CONFIG_NO_HZ */ 4495 + 4496 + /* 4497 + * Called from scheduler_tick() 4498 + */ 4499 + void update_cpu_load_active(struct rq *this_rq) 4500 + { 4501 + unsigned long load = this_rq->cfs.runnable_load_avg; 4502 + /* 4503 + * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). 
4504 + */ 4505 + this_rq->last_load_update_tick = jiffies; 4506 + __update_cpu_load(this_rq, load, 1); 4507 + } 4508 + 4354 4509 /* Used instead of source_load when we know the type == 0 */ 4355 4510 static unsigned long weighted_cpuload(const int cpu) 4356 4511 { ··· 4586 4375 static unsigned long cpu_avg_load_per_task(int cpu) 4587 4376 { 4588 4377 struct rq *rq = cpu_rq(cpu); 4589 - unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); 4378 + unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running); 4590 4379 unsigned long load_avg = rq->cfs.runnable_load_avg; 4591 4380 4592 4381 if (nr_running) ··· 5337 5126 * entity, update_curr() will update its vruntime, otherwise 5338 5127 * forget we've ever seen it. 5339 5128 */ 5340 - if (curr && curr->on_rq) 5341 - update_curr(cfs_rq); 5342 - else 5343 - curr = NULL; 5129 + if (curr) { 5130 + if (curr->on_rq) 5131 + update_curr(cfs_rq); 5132 + else 5133 + curr = NULL; 5344 5134 5345 - /* 5346 - * This call to check_cfs_rq_runtime() will do the throttle and 5347 - * dequeue its entity in the parent(s). Therefore the 'simple' 5348 - * nr_running test will indeed be correct. 5349 - */ 5350 - if (unlikely(check_cfs_rq_runtime(cfs_rq))) 5351 - goto simple; 5135 + /* 5136 + * This call to check_cfs_rq_runtime() will do the 5137 + * throttle and dequeue its entity in the parent(s). 5138 + * Therefore the 'simple' nr_running test will indeed 5139 + * be correct. 5140 + */ 5141 + if (unlikely(check_cfs_rq_runtime(cfs_rq))) 5142 + goto simple; 5143 + } 5352 5144 5353 5145 se = pick_next_entity(cfs_rq, curr); 5354 5146 cfs_rq = group_cfs_rq(se); ··· 5681 5467 } 5682 5468 5683 5469 #ifdef CONFIG_NUMA_BALANCING 5684 - /* Returns true if the destination node has incurred more faults */ 5470 + /* 5471 + * Returns true if the destination node is the preferred node. 5472 + * Needs to match fbq_classify_rq(): if there is a runnable task 5473 + * that is not on its preferred node, we should identify it. 
5474 + */ 5685 5475 static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env) 5686 5476 { 5687 5477 struct numa_group *numa_group = rcu_dereference(p->numa_group); 5478 + unsigned long src_faults, dst_faults; 5688 5479 int src_nid, dst_nid; 5689 5480 5690 5481 if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults || ··· 5703 5484 if (src_nid == dst_nid) 5704 5485 return false; 5705 5486 5706 - if (numa_group) { 5707 - /* Task is already in the group's interleave set. */ 5708 - if (node_isset(src_nid, numa_group->active_nodes)) 5709 - return false; 5710 - 5711 - /* Task is moving into the group's interleave set. */ 5712 - if (node_isset(dst_nid, numa_group->active_nodes)) 5713 - return true; 5714 - 5715 - return group_faults(p, dst_nid) > group_faults(p, src_nid); 5716 - } 5717 - 5718 5487 /* Encourage migration to the preferred node. */ 5719 5488 if (dst_nid == p->numa_preferred_nid) 5720 5489 return true; 5721 5490 5722 - return task_faults(p, dst_nid) > task_faults(p, src_nid); 5491 + /* Migrating away from the preferred node is bad. */ 5492 + if (src_nid == p->numa_preferred_nid) 5493 + return false; 5494 + 5495 + if (numa_group) { 5496 + src_faults = group_faults(p, src_nid); 5497 + dst_faults = group_faults(p, dst_nid); 5498 + } else { 5499 + src_faults = task_faults(p, src_nid); 5500 + dst_faults = task_faults(p, dst_nid); 5501 + } 5502 + 5503 + return dst_faults > src_faults; 5723 5504 } 5724 5505 5725 5506 5726 5507 static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env) 5727 5508 { 5728 5509 struct numa_group *numa_group = rcu_dereference(p->numa_group); 5510 + unsigned long src_faults, dst_faults; 5729 5511 int src_nid, dst_nid; 5730 5512 5731 5513 if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER)) ··· 5741 5521 if (src_nid == dst_nid) 5742 5522 return false; 5743 5523 5744 - if (numa_group) { 5745 - /* Task is moving within/into the group's interleave set. 
*/ 5746 - if (node_isset(dst_nid, numa_group->active_nodes)) 5747 - return false; 5748 - 5749 - /* Task is moving out of the group's interleave set. */ 5750 - if (node_isset(src_nid, numa_group->active_nodes)) 5751 - return true; 5752 - 5753 - return group_faults(p, dst_nid) < group_faults(p, src_nid); 5754 - } 5755 - 5756 - /* Migrating away from the preferred node is always bad. */ 5524 + /* Migrating away from the preferred node is bad. */ 5757 5525 if (src_nid == p->numa_preferred_nid) 5758 5526 return true; 5759 5527 5760 - return task_faults(p, dst_nid) < task_faults(p, src_nid); 5528 + /* Encourage migration to the preferred node. */ 5529 + if (dst_nid == p->numa_preferred_nid) 5530 + return false; 5531 + 5532 + if (numa_group) { 5533 + src_faults = group_faults(p, src_nid); 5534 + dst_faults = group_faults(p, dst_nid); 5535 + } else { 5536 + src_faults = task_faults(p, src_nid); 5537 + dst_faults = task_faults(p, dst_nid); 5538 + } 5539 + 5540 + return dst_faults < src_faults; 5761 5541 } 5762 5542 5763 5543 #else ··· 6257 6037 * Since we're reading these variables without serialization make sure 6258 6038 * we read them once before doing sanity checks on them. 6259 6039 */ 6260 - age_stamp = ACCESS_ONCE(rq->age_stamp); 6261 - avg = ACCESS_ONCE(rq->rt_avg); 6040 + age_stamp = READ_ONCE(rq->age_stamp); 6041 + avg = READ_ONCE(rq->rt_avg); 6262 6042 delta = __rq_clock_broken(rq) - age_stamp; 6263 6043 6264 6044 if (unlikely(delta < 0))
kernel/sched/{proc.c → loadavg.c} (+23 -213)
··· 1 1 /* 2 - * kernel/sched/proc.c 2 + * kernel/sched/loadavg.c 3 3 * 4 - * Kernel load calculations, forked from sched/core.c 4 + * This file contains the magic bits required to compute the global loadavg 5 + * figure. Its a silly number but people think its important. We go through 6 + * great pains to make it work on big machines and tickless kernels. 5 7 */ 6 8 7 9 #include <linux/export.h> ··· 83 81 long nr_active, delta = 0; 84 82 85 83 nr_active = this_rq->nr_running; 86 - nr_active += (long) this_rq->nr_uninterruptible; 84 + nr_active += (long)this_rq->nr_uninterruptible; 87 85 88 86 if (nr_active != this_rq->calc_load_active) { 89 87 delta = nr_active - this_rq->calc_load_active; ··· 188 186 delta = calc_load_fold_active(this_rq); 189 187 if (delta) { 190 188 int idx = calc_load_write_idx(); 189 + 191 190 atomic_long_add(delta, &calc_load_idle[idx]); 192 191 } 193 192 } ··· 244 241 { 245 242 unsigned long result = 1UL << frac_bits; 246 243 247 - if (n) for (;;) { 248 - if (n & 1) { 249 - result *= x; 250 - result += 1UL << (frac_bits - 1); 251 - result >>= frac_bits; 244 + if (n) { 245 + for (;;) { 246 + if (n & 1) { 247 + result *= x; 248 + result += 1UL << (frac_bits - 1); 249 + result >>= frac_bits; 250 + } 251 + n >>= 1; 252 + if (!n) 253 + break; 254 + x *= x; 255 + x += 1UL << (frac_bits - 1); 256 + x >>= frac_bits; 252 257 } 253 - n >>= 1; 254 - if (!n) 255 - break; 256 - x *= x; 257 - x += 1UL << (frac_bits - 1); 258 - x >>= frac_bits; 259 258 } 260 259 261 260 return result; ··· 290 285 calc_load_n(unsigned long load, unsigned long exp, 291 286 unsigned long active, unsigned int n) 292 287 { 293 - 294 288 return calc_load(load, fixed_power_int(exp, FSHIFT, n), active); 295 289 } 296 290 ··· 343 339 /* 344 340 * calc_load - update the avenrun load estimates 10 ticks after the 345 341 * CPUs have updated calc_load_tasks. 342 + * 343 + * Called from the global timer code. 
346 344 */ 347 345 void calc_global_load(unsigned long ticks) 348 346 { ··· 376 370 } 377 371 378 372 /* 379 - * Called from update_cpu_load() to periodically update this CPU's 373 + * Called from scheduler_tick() to periodically update this CPU's 380 374 * active count. 381 375 */ 382 - static void calc_load_account_active(struct rq *this_rq) 376 + void calc_global_load_tick(struct rq *this_rq) 383 377 { 384 378 long delta; 385 379 ··· 391 385 atomic_long_add(delta, &calc_load_tasks); 392 386 393 387 this_rq->calc_load_update += LOAD_FREQ; 394 - } 395 - 396 - /* 397 - * End of global load-average stuff 398 - */ 399 - 400 - /* 401 - * The exact cpuload at various idx values, calculated at every tick would be 402 - * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load 403 - * 404 - * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called 405 - * on nth tick when cpu may be busy, then we have: 406 - * load = ((2^idx - 1) / 2^idx)^(n-1) * load 407 - * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load 408 - * 409 - * decay_load_missed() below does efficient calculation of 410 - * load = ((2^idx - 1) / 2^idx)^(n-1) * load 411 - * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load 412 - * 413 - * The calculation is approximated on a 128 point scale. 414 - * degrade_zero_ticks is the number of ticks after which load at any 415 - * particular idx is approximated to be zero. 416 - * degrade_factor is a precomputed table, a row for each load idx. 417 - * Each column corresponds to degradation factor for a power of two ticks, 418 - * based on 128 point scale. 419 - * Example: 420 - * row 2, col 3 (=12) says that the degradation at load idx 2 after 421 - * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8). 
422 - * 423 - * With this power of 2 load factors, we can degrade the load n times 424 - * by looking at 1 bits in n and doing as many mult/shift instead of 425 - * n mult/shifts needed by the exact degradation. 426 - */ 427 - #define DEGRADE_SHIFT 7 428 - static const unsigned char 429 - degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128}; 430 - static const unsigned char 431 - degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = { 432 - {0, 0, 0, 0, 0, 0, 0, 0}, 433 - {64, 32, 8, 0, 0, 0, 0, 0}, 434 - {96, 72, 40, 12, 1, 0, 0}, 435 - {112, 98, 75, 43, 15, 1, 0}, 436 - {120, 112, 98, 76, 45, 16, 2} }; 437 - 438 - /* 439 - * Update cpu_load for any missed ticks, due to tickless idle. The backlog 440 - * would be when CPU is idle and so we just decay the old load without 441 - * adding any new load. 442 - */ 443 - static unsigned long 444 - decay_load_missed(unsigned long load, unsigned long missed_updates, int idx) 445 - { 446 - int j = 0; 447 - 448 - if (!missed_updates) 449 - return load; 450 - 451 - if (missed_updates >= degrade_zero_ticks[idx]) 452 - return 0; 453 - 454 - if (idx == 1) 455 - return load >> missed_updates; 456 - 457 - while (missed_updates) { 458 - if (missed_updates % 2) 459 - load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT; 460 - 461 - missed_updates >>= 1; 462 - j++; 463 - } 464 - return load; 465 - } 466 - 467 - /* 468 - * Update rq->cpu_load[] statistics. This function is usually called every 469 - * scheduler tick (TICK_NSEC). With tickless idle this will not be called 470 - * every tick. We fix it up based on jiffies. 
471 - */ 472 - static void __update_cpu_load(struct rq *this_rq, unsigned long this_load, 473 - unsigned long pending_updates) 474 - { 475 - int i, scale; 476 - 477 - this_rq->nr_load_updates++; 478 - 479 - /* Update our load: */ 480 - this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ 481 - for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { 482 - unsigned long old_load, new_load; 483 - 484 - /* scale is effectively 1 << i now, and >> i divides by scale */ 485 - 486 - old_load = this_rq->cpu_load[i]; 487 - old_load = decay_load_missed(old_load, pending_updates - 1, i); 488 - new_load = this_load; 489 - /* 490 - * Round up the averaging division if load is increasing. This 491 - * prevents us from getting stuck on 9 if the load is 10, for 492 - * example. 493 - */ 494 - if (new_load > old_load) 495 - new_load += scale - 1; 496 - 497 - this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i; 498 - } 499 - 500 - sched_avg_update(this_rq); 501 - } 502 - 503 - #ifdef CONFIG_SMP 504 - static inline unsigned long get_rq_runnable_load(struct rq *rq) 505 - { 506 - return rq->cfs.runnable_load_avg; 507 - } 508 - #else 509 - static inline unsigned long get_rq_runnable_load(struct rq *rq) 510 - { 511 - return rq->load.weight; 512 - } 513 - #endif 514 - 515 - #ifdef CONFIG_NO_HZ_COMMON 516 - /* 517 - * There is no sane way to deal with nohz on smp when using jiffies because the 518 - * cpu doing the jiffies update might drift wrt the cpu doing the jiffy reading 519 - * causing off-by-one errors in observed deltas; {0,2} instead of {1,1}. 520 - * 521 - * Therefore we cannot use the delta approach from the regular tick since that 522 - * would seriously skew the load calculation. However we'll make do for those 523 - * updates happening while idle (nohz_idle_balance) or coming out of idle 524 - * (tick_nohz_idle_exit). 525 - * 526 - * This means we might still be one tick off for nohz periods. 
527 - */ 528 - 529 - /* 530 - * Called from nohz_idle_balance() to update the load ratings before doing the 531 - * idle balance. 532 - */ 533 - void update_idle_cpu_load(struct rq *this_rq) 534 - { 535 - unsigned long curr_jiffies = ACCESS_ONCE(jiffies); 536 - unsigned long load = get_rq_runnable_load(this_rq); 537 - unsigned long pending_updates; 538 - 539 - /* 540 - * bail if there's load or we're actually up-to-date. 541 - */ 542 - if (load || curr_jiffies == this_rq->last_load_update_tick) 543 - return; 544 - 545 - pending_updates = curr_jiffies - this_rq->last_load_update_tick; 546 - this_rq->last_load_update_tick = curr_jiffies; 547 - 548 - __update_cpu_load(this_rq, load, pending_updates); 549 - } 550 - 551 - /* 552 - * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed. 553 - */ 554 - void update_cpu_load_nohz(void) 555 - { 556 - struct rq *this_rq = this_rq(); 557 - unsigned long curr_jiffies = ACCESS_ONCE(jiffies); 558 - unsigned long pending_updates; 559 - 560 - if (curr_jiffies == this_rq->last_load_update_tick) 561 - return; 562 - 563 - raw_spin_lock(&this_rq->lock); 564 - pending_updates = curr_jiffies - this_rq->last_load_update_tick; 565 - if (pending_updates) { 566 - this_rq->last_load_update_tick = curr_jiffies; 567 - /* 568 - * We were idle, this means load 0, the current load might be 569 - * !0 due to remote wakeups and the sort. 570 - */ 571 - __update_cpu_load(this_rq, 0, pending_updates); 572 - } 573 - raw_spin_unlock(&this_rq->lock); 574 - } 575 - #endif /* CONFIG_NO_HZ */ 576 - 577 - /* 578 - * Called from scheduler_tick() 579 - */ 580 - void update_cpu_load_active(struct rq *this_rq) 581 - { 582 - unsigned long load = get_rq_runnable_load(this_rq); 583 - /* 584 - * See the mess around update_idle_cpu_load() / update_cpu_load_nohz(). 585 - */ 586 - this_rq->last_load_update_tick = jiffies; 587 - __update_cpu_load(this_rq, load, 1); 588 - 589 - calc_load_account_active(this_rq); 590 388 }
kernel/sched/rt.c (+1 -1)
···
 	rq = cpu_rq(cpu);

 	rcu_read_lock();
-	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+	curr = READ_ONCE(rq->curr); /* unlocked access */

 	/*
 	 * If the current task on @p's runqueue is an RT task, then
kernel/sched/sched.h (+7 -4)
···
 extern unsigned long calc_load_update;
 extern atomic_long_t calc_load_tasks;

+extern void calc_global_load_tick(struct rq *this_rq);
 extern long calc_load_fold_active(struct rq *this_rq);
+
+#ifdef CONFIG_SMP
 extern void update_cpu_load_active(struct rq *this_rq);
+#else
+static inline void update_cpu_load_active(struct rq *this_rq) { }
+#endif

 /*
  * Helpers for converting nanosecond timing to jiffy resolution
···
 static inline u64 __rq_clock_broken(struct rq *rq)
 {
-	return ACCESS_ONCE(rq->clock);
+	return READ_ONCE(rq->clock);
 }

 static inline u64 rq_clock(struct rq *rq)
···
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
-extern void init_sched_dl_class(void);

 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
···
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);

 unsigned long to_ratio(u64 period, u64 runtime);
-
-extern void update_idle_cpu_load(struct rq *this_rq);

 extern void init_task_runnable_average(struct task_struct *p);
kernel/sched/stats.h (+5 -10)
···
 {
 	struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

-	if (!cputimer->running)
+	/* Check if cputimer isn't running. This is accessed without locking. */
+	if (!READ_ONCE(cputimer->running))
 		return false;

 	/*
···
 	if (!cputimer_running(tsk))
 		return;

-	raw_spin_lock(&cputimer->lock);
-	cputimer->cputime.utime += cputime;
-	raw_spin_unlock(&cputimer->lock);
+	atomic64_add(cputime, &cputimer->cputime_atomic.utime);
 }

 /**
···
 	if (!cputimer_running(tsk))
 		return;

-	raw_spin_lock(&cputimer->lock);
-	cputimer->cputime.stime += cputime;
-	raw_spin_unlock(&cputimer->lock);
+	atomic64_add(cputime, &cputimer->cputime_atomic.stime);
 }

 /**
···
 	if (!cputimer_running(tsk))
 		return;

-	raw_spin_lock(&cputimer->lock);
-	cputimer->cputime.sum_exec_runtime += ns;
-	raw_spin_unlock(&cputimer->lock);
+	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
kernel/sched/wait.c (+2 -2)
···

 __sched int bit_wait_timeout(struct wait_bit_key *word)
 {
-	unsigned long now = ACCESS_ONCE(jiffies);
+	unsigned long now = READ_ONCE(jiffies);
 	if (signal_pending_state(current->state, current))
 		return 1;
 	if (time_after_eq(now, word->timeout))
···

 __sched int bit_wait_io_timeout(struct wait_bit_key *word)
 {
-	unsigned long now = ACCESS_ONCE(jiffies);
+	unsigned long now = READ_ONCE(jiffies);
 	if (signal_pending_state(current->state, current))
 		return 1;
 	if (time_after_eq(now, word->timeout))
kernel/signal.c (+3 -3)
···
  * RETURNS:
  * %true if @mask is set, %false if made noop because @task was dying.
  */
-bool task_set_jobctl_pending(struct task_struct *task, unsigned int mask)
+bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask)
 {
 	BUG_ON(mask & ~(JOBCTL_PENDING_MASK | JOBCTL_STOP_CONSUME |
 			JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
···
  * CONTEXT:
  * Must be called with @task->sighand->siglock held.
  */
-void task_clear_jobctl_pending(struct task_struct *task, unsigned int mask)
+void task_clear_jobctl_pending(struct task_struct *task, unsigned long mask)
 {
 	BUG_ON(mask & ~JOBCTL_PENDING_MASK);

···
 	struct signal_struct *sig = current->signal;

 	if (!(current->jobctl & JOBCTL_STOP_PENDING)) {
-		unsigned int gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
+		unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
 		struct task_struct *t;

 		/* signr will be recorded in task->jobctl for retries */
kernel/stop_machine.c (+5 -37)
···
 	return err;
 }

-struct irq_cpu_stop_queue_work_info {
-	int cpu1;
-	int cpu2;
-	struct cpu_stop_work *work1;
-	struct cpu_stop_work *work2;
-};
-
-/*
- * This function is always run with irqs and preemption disabled.
- * This guarantees that both work1 and work2 get queued, before
- * our local migrate thread gets the chance to preempt us.
- */
-static void irq_cpu_stop_queue_work(void *arg)
-{
-	struct irq_cpu_stop_queue_work_info *info = arg;
-	cpu_stop_queue_work(info->cpu1, info->work1);
-	cpu_stop_queue_work(info->cpu2, info->work2);
-}
-
 /**
  * stop_two_cpus - stops two cpus
  * @cpu1: the cpu to stop
···
 {
 	struct cpu_stop_done done;
 	struct cpu_stop_work work1, work2;
-	struct irq_cpu_stop_queue_work_info call_args;
 	struct multi_stop_data msdata;

 	preempt_disable();
···
 		.fn = multi_cpu_stop,
 		.arg = &msdata,
 		.done = &done
-	};
-
-	call_args = (struct irq_cpu_stop_queue_work_info){
-		.cpu1 = cpu1,
-		.cpu2 = cpu2,
-		.work1 = &work1,
-		.work2 = &work2,
 	};

 	cpu_stop_init_done(&done, 2);
···
 		return -ENOENT;
 	}

-	lg_local_lock(&stop_cpus_lock);
-	/*
-	 * Queuing needs to be done by the lowest numbered CPU, to ensure
-	 * that works are always queued in the same order on every CPU.
-	 * This prevents deadlocks.
-	 */
-	smp_call_function_single(min(cpu1, cpu2),
-				 &irq_cpu_stop_queue_work,
-				 &call_args, 1);
-	lg_local_unlock(&stop_cpus_lock);
+	lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
+	cpu_stop_queue_work(cpu1, &work1);
+	cpu_stop_queue_work(cpu2, &work2);
+	lg_double_unlock(&stop_cpus_lock, cpu1, cpu2);
+
 	preempt_enable();

 	wait_for_completion(&done.completion);
kernel/time/posix-cpu-timers.c (+54 -33)
···
 	return 0;
 }

-static void update_gt_cputime(struct task_cputime *a, struct task_cputime *b)
+/*
+ * Set cputime to sum_cputime if sum_cputime > cputime. Use cmpxchg
+ * to avoid race conditions with concurrent updates to cputime.
+ */
+static inline void __update_gt_cputime(atomic64_t *cputime, u64 sum_cputime)
 {
-	if (b->utime > a->utime)
-		a->utime = b->utime;
+	u64 curr_cputime;
+retry:
+	curr_cputime = atomic64_read(cputime);
+	if (sum_cputime > curr_cputime) {
+		if (atomic64_cmpxchg(cputime, curr_cputime, sum_cputime) != curr_cputime)
+			goto retry;
+	}
+}

-	if (b->stime > a->stime)
-		a->stime = b->stime;
+static void update_gt_cputime(struct task_cputime_atomic *cputime_atomic, struct task_cputime *sum)
+{
+	__update_gt_cputime(&cputime_atomic->utime, sum->utime);
+	__update_gt_cputime(&cputime_atomic->stime, sum->stime);
+	__update_gt_cputime(&cputime_atomic->sum_exec_runtime, sum->sum_exec_runtime);
+}

-	if (b->sum_exec_runtime > a->sum_exec_runtime)
-		a->sum_exec_runtime = b->sum_exec_runtime;
+/* Sample task_cputime_atomic values in "atomic_timers", store results in "times". */
+static inline void sample_cputime_atomic(struct task_cputime *times,
+					 struct task_cputime_atomic *atomic_times)
+{
+	times->utime = atomic64_read(&atomic_times->utime);
+	times->stime = atomic64_read(&atomic_times->stime);
+	times->sum_exec_runtime = atomic64_read(&atomic_times->sum_exec_runtime);
 }

 void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;
 	struct task_cputime sum;
-	unsigned long flags;

-	if (!cputimer->running) {
+	/* Check if cputimer isn't running. This is accessed without locking. */
+	if (!READ_ONCE(cputimer->running)) {
 		/*
 		 * The POSIX timer interface allows for absolute time expiry
 		 * values through the TIMER_ABSTIME flag, therefore we have
-		 * to synchronize the timer to the clock every time we start
-		 * it.
+		 * to synchronize the timer to the clock every time we start it.
 		 */
 		thread_group_cputime(tsk, &sum);
-		raw_spin_lock_irqsave(&cputimer->lock, flags);
-		cputimer->running = 1;
-		update_gt_cputime(&cputimer->cputime, &sum);
-	} else
-		raw_spin_lock_irqsave(&cputimer->lock, flags);
-	*times = cputimer->cputime;
-	raw_spin_unlock_irqrestore(&cputimer->lock, flags);
+		update_gt_cputime(&cputimer->cputime_atomic, &sum);
+
+		/*
+		 * We're setting cputimer->running without a lock. Ensure
+		 * this only gets written to in one operation. We set
+		 * running after update_gt_cputime() as a small optimization,
+		 * but barriers are not required because update_gt_cputime()
+		 * can handle concurrent updates.
+		 */
+		WRITE_ONCE(cputimer->running, 1);
+	}
+	sample_cputime_atomic(times, &cputimer->cputime_atomic);
 }

 /*
···
 	if (!task_cputime_zero(&tsk->cputime_expires))
 		return false;

-	if (tsk->signal->cputimer.running)
+	/* Check if cputimer is running. This is accessed without locking. */
+	if (READ_ONCE(tsk->signal->cputimer.running))
 		return false;

 	return true;
···
 	/*
 	 * Check for the special case thread timers.
 	 */
-	soft = ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_cur);
+	soft = READ_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_cur);
 	if (soft != RLIM_INFINITY) {
 		unsigned long hard =
-			ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_max);
+			READ_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_max);

 		if (hard != RLIM_INFINITY &&
 		    tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
···
 	}
 }

-static void stop_process_timers(struct signal_struct *sig)
+static inline void stop_process_timers(struct signal_struct *sig)
 {
 	struct thread_group_cputimer *cputimer = &sig->cputimer;
-	unsigned long flags;

-	raw_spin_lock_irqsave(&cputimer->lock, flags);
-	cputimer->running = 0;
-	raw_spin_unlock_irqrestore(&cputimer->lock, flags);
+	/* Turn off cputimer->running. This is done without locking. */
+	WRITE_ONCE(cputimer->running, 0);
 }

 static u32 onecputick;
···
 			 SIGPROF);
 	check_cpu_itimer(tsk, &sig->it[CPUCLOCK_VIRT], &virt_expires, utime,
 			 SIGVTALRM);
-	soft = ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur);
+	soft = READ_ONCE(sig->rlim[RLIMIT_CPU].rlim_cur);
 	if (soft != RLIM_INFINITY) {
 		unsigned long psecs = cputime_to_secs(ptime);
 		unsigned long hard =
-			ACCESS_ONCE(sig->rlim[RLIMIT_CPU].rlim_max);
+			READ_ONCE(sig->rlim[RLIMIT_CPU].rlim_max);
 		cputime_t x;
 		if (psecs >= hard) {
 			/*
···
 	}

 	sig = tsk->signal;
-	if (sig->cputimer.running) {
+	/* Check if cputimer is running. This is accessed without locking. */
+	if (READ_ONCE(sig->cputimer.running)) {
 		struct task_cputime group_sample;

-		raw_spin_lock(&sig->cputimer.lock);
-		group_sample = sig->cputimer.cputime;
-		raw_spin_unlock(&sig->cputimer.lock);
+		sample_cputime_atomic(&group_sample, &sig->cputimer.cputime_atomic);

 		if (task_cputime_expired(&group_sample, &sig->cputime_expires))
 			return 1;
···
 	 * If there are any active process wide timers (POSIX 1.b, itimers,
 	 * RLIMIT_CPU) cputimer must be running.
 	 */
-	if (tsk->signal->cputimer.running)
+	if (READ_ONCE(tsk->signal->cputimer.running))
 		check_process_timers(tsk, &firing);

 	/*
lib/cpu_rmap.c (+1 -1)
···
 	/* Update distances based on topology */
 	for_each_cpu(cpu, update_mask) {
 		if (cpu_rmap_copy_neigh(rmap, cpu,
-					topology_thread_cpumask(cpu), 1))
+					topology_sibling_cpumask(cpu), 1))
 			continue;
 		if (cpu_rmap_copy_neigh(rmap, cpu,
 					topology_core_cpumask(cpu), 2))
lib/radix-tree.c (+1 -1)
···
 #include <linux/string.h>
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
-#include <linux/preempt_mask.h>	/* in_interrupt() */
+#include <linux/preempt.h>		/* in_interrupt() */


 /*
lib/strnlen_user.c (+4 -2)
···
  * @str: The string to measure.
  * @count: Maximum count (including NUL character)
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
···
  * strlen_user: - Get the size of a user string INCLUDING final NUL.
  * @str: The string to measure.
  *
- * Context: User context only. This function may sleep.
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
  *
  * Get the size of a NUL-terminated string in user space.
  *
mm/memory.c (+6 -12)
···
 }

 #if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
-void might_fault(void)
+void __might_fault(const char *file, int line)
 {
 	/*
 	 * Some code (nfs/sunrpc) uses socket ops on kernel memory while
···
 	 */
 	if (segment_eq(get_fs(), KERNEL_DS))
 		return;
-
-	/*
-	 * it would be nicer only to annotate paths which are not under
-	 * pagefault_disable, however that requires a larger audit and
-	 * providing helpers like get_user_atomic.
-	 */
-	if (in_atomic())
+	if (pagefault_disabled())
 		return;
-
-	__might_sleep(__FILE__, __LINE__, 0);
-
+	__might_sleep(file, line, 0);
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
 	if (current->mm)
 		might_lock_read(&current->mm->mmap_sem);
+#endif
 }
-EXPORT_SYMBOL(might_fault);
+EXPORT_SYMBOL(__might_fault);
 #endif

 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)