Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

posix_cpu_timer: Reduce unnecessary sighand lock contention

It was found while running a database workload on large systems that
significant time was spent trying to acquire the sighand lock.

The issue was that whenever an itimer expired, many threads ended up
simultaneously trying to send the signal. Most of the time nothing
happened after acquiring the sighand lock, because another thread
had already sent the signal and updated the "next expire" time.
The fastpath_timer_check() didn't help much, since the "next expire"
time was only updated after threads had already exited
fastpath_timer_check().

This patch addresses the problem by having the thread_group_cputimer
structure maintain a boolean that signifies when a thread in the group
is already checking for process-wide timers, and by adding logic to the
fastpath to check that boolean.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: George Spelvin <linux@horizon.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: hideaki.kimura@hpe.com
Cc: terry.rudd@hpe.com
Cc: scott.norton@hpe.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1444849677-29330-5-git-send-email-jason.low2@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Authored by Jason Low, committed by Thomas Gleixner
c8d75aa4 d5c373eb

3 files changed, 28 insertions(+), 2 deletions(-)
include/linux/init_task.h (+1)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -60,6 +60,7 @@
 	.cputimer	= { \
 		.cputime_atomic	= INIT_CPUTIME_ATOMIC, \
 		.running	= false, \
+		.checking_timer = false, \
 	}, \
 	INIT_PREV_CPUTIME(sig) \
 	.cred_guard_mutex = \
include/linux/sched.h (+3)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -619,6 +619,8 @@
  * @cputime_atomic: atomic thread group interval timers.
  * @running: true when there are timers running and
  *	@cputime_atomic receives updates.
+ * @checking_timer: true when a thread in the group is in the
+ *	process of checking for thread group timers.
  *
  * This structure contains the version of task_cputime, above, that is
  * used for thread group CPU timer calculations.
@@ -626,6 +628,7 @@
 struct thread_group_cputimer {
 	struct task_cputime_atomic cputime_atomic;
 	bool running;
+	bool checking_timer;
 };
 
 #include <linux/rwsem.h>
kernel/time/posix-cpu-timers.c (+24 -2)

--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -975,6 +975,12 @@
 	if (!READ_ONCE(tsk->signal->cputimer.running))
 		return;
 
+	/*
+	 * Signify that a thread is checking for process timers.
+	 * Write access to this field is protected by the sighand lock.
+	 */
+	sig->cputimer.checking_timer = true;
+
 	/*
 	 * Collect the current process totals.
 	 */
@@ -1029,4 +1035,6 @@
 	sig->cputime_expires.sched_exp = sched_expires;
 	if (task_cputime_zero(&sig->cputime_expires))
 		stop_process_timers(sig);
+
+	sig->cputimer.checking_timer = false;
 }
@@ -1142,8 +1150,22 @@
 	}
 
 	sig = tsk->signal;
-	/* Check if cputimer is running. This is accessed without locking. */
-	if (READ_ONCE(sig->cputimer.running)) {
+	/*
+	 * Check if thread group timers expired when the cputimer is
+	 * running and no other thread in the group is already checking
+	 * for thread group cputimers. These fields are read without the
+	 * sighand lock. However, this is fine because this is meant to
+	 * be a fastpath heuristic to determine whether we should try to
+	 * acquire the sighand lock to check/handle timers.
+	 *
+	 * In the worst case scenario, if 'running' or 'checking_timer' gets
+	 * set but the current thread doesn't see the change yet, we'll wait
+	 * until the next thread in the group gets a scheduler interrupt to
+	 * handle the timer. This isn't an issue in practice because these
+	 * types of delays with signals actually getting sent are expected.
+	 */
+	if (READ_ONCE(sig->cputimer.running) &&
+	    !READ_ONCE(sig->cputimer.checking_timer)) {
 		struct task_cputime group_sample;
 
 		sample_cputime_atomic(&group_sample, &sig->cputimer.cputime_atomic);