sched/core: Call __schedule() from do_idle() without enabling preemption

I finally got around to creating trampolines for dynamically allocated
ftrace_ops by using synchronize_rcu_tasks(). For users of the ftrace
function hook callbacks, like perf, that allocate the ftrace_ops
descriptor via kmalloc() and friends, ftrace was not able to optimize
the functions being traced to use a trampoline, because the trampoline
would also need to be allocated dynamically. The problem is that such a
trampoline cannot be freed when CONFIG_PREEMPT is set, as there's no
way to tell if a task was preempted on the trampoline. That was before
Paul McKenney implemented synchronize_rcu_tasks(), which makes sure all
tasks (except idle) have scheduled out or have entered user space.

While testing this, I triggered this bug:

BUG: unable to handle kernel paging request at ffffffffa0230077
...
RIP: 0010:0xffffffffa0230077
...
Call Trace:
schedule+0x5/0xe0
schedule_preempt_disabled+0x18/0x30
do_idle+0x172/0x220

What happened was that the idle task was preempted on the trampoline.
As synchronize_rcu_tasks() ignores the idle thread, there's nothing
that lets ftrace know that the idle task was preempted on a trampoline.
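
The failure mode above can be illustrated with a small userspace model.
This is only a sketch, not kernel code: struct model_task and both
helper functions are hypothetical names invented for illustration, and
the wait condition is reduced to a single flag per task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task record; not the kernel's task_struct. */
struct model_task {
	bool is_idle;          /* a per-CPU idle task */
	bool on_trampoline;    /* preempted while executing trampoline code */
	bool passed_quiescent; /* scheduled out voluntarily or entered user space */
};

/*
 * Model of the wait condition: the grace period is over once every
 * non-idle task has passed a quiescent state. Idle tasks are skipped,
 * which is the assumption that the bug violates.
 */
static bool model_grace_period_done(const struct model_task *tasks, size_t n)
{
	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].is_idle)
			continue;
		if (!tasks[i].passed_quiescent)
			return false;
	}
	return true;
}

/* True if some task would resume inside the trampoline after it is freed. */
static bool model_free_is_unsafe(const struct model_task *tasks, size_t n)
{
	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].on_trampoline)
			return true;
	}
	return false;
}
```

In this model, an idle task with on_trampoline set lets
model_grace_period_done() return true while model_free_is_unsafe() also
returns true, which corresponds to the paging request fault in the
trace.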

The idle task shouldn't need to ever enable preemption. The idle task
is simply a loop that calls schedule or places the cpu into idle mode.
In fact, having preemption enabled is inefficient, because an interrupt
can arrive when idle is just about to call schedule anyway, which would
cause schedule to be called twice: once when the interrupt returns to
normal context, and then again in the normal path that the idle loop is
running in. The second call is pointless, as the task had already
scheduled.
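
The double call can be counted in a toy userspace model. This is a
sketch under simplifying assumptions: struct idle_model and the model_*
functions are hypothetical, and "preemption on interrupt return" is
reduced to one direct call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state for one pass through the end of the idle loop. */
struct idle_model {
	bool preempt_enabled;
	bool need_resched;
	int schedule_calls;
};

static void model_schedule(struct idle_model *m)
{
	assert(m != NULL);
	m->schedule_calls++;
	m->need_resched = false;
}

/* An interrupt arrives; on return it preempts only if preemption is on. */
static void model_interrupt(struct idle_model *m)
{
	assert(m != NULL);
	if (m->preempt_enabled && m->need_resched)
		model_schedule(m);
}

/*
 * Idle leaves its loop because need_resched is set. Returns how many
 * times the scheduler was entered before idle is done.
 */
static int model_idle_exit(bool enables_preemption, bool irq_in_window)
{
	struct idle_model m = {
		.preempt_enabled = false,
		.need_resched = true,
		.schedule_calls = 0,
	};

	if (enables_preemption)
		m.preempt_enabled = true;  /* the brief window opens */
	if (irq_in_window)
		model_interrupt(&m);
	model_schedule(&m);                /* the idle loop's own call */
	m.preempt_enabled = false;

	return m.schedule_calls;
}
```

With the window open and an interrupt landing in it,
model_idle_exit(true, true) counts two scheduler entries; with
preemption kept disabled, model_idle_exit(false, true) counts one.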

The only reason schedule_preempt_disabled() enables preemption is to be
able to call sched_submit_work(), which requires preemption to be
enabled. As this is a nop when the task is in the RUNNING state, and
idle is always in the running state, there's no reason that idle needs
to enable preemption. But that means it cannot use
schedule_preempt_disabled(), as other callers of that function require
calling sched_submit_work().
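
The state argument can be sketched in userspace as follows. This is not
the kernel's sched_submit_work(): the struct, the MODEL_* constants, and
the "plugged I/O" work item are assumptions made for illustration; the
only detail taken from the kernel is that the running state is zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_TASK_RUNNING       0 /* the kernel's TASK_RUNNING is also 0 */
#define MODEL_TASK_INTERRUPTIBLE 1 /* stand-in for any sleeping state */

/* Hypothetical task record with one kind of deferred work. */
struct submit_model_task {
	long state;
	bool has_plugged_io; /* assumed example of work to submit */
	int flushes;
};

/* Model of the early return: a runnable task has nothing to submit. */
static void model_submit_work(struct submit_model_task *t)
{
	assert(t != NULL);

	if (t->state == MODEL_TASK_RUNNING)
		return;
	if (t->has_plugged_io) {
		t->flushes++;
		t->has_plugged_io = false;
	}
}

/*
 * Model of a guard for a scheduler entry point that skips the submit
 * step: it is only valid for tasks in the running state.
 */
static bool model_schedule_idle_would_warn(const struct submit_model_task *t)
{
	assert(t != NULL);
	return t->state != MODEL_TASK_RUNNING;
}
```

For a task in the running state the submit step does nothing, so
skipping it loses nothing; for any other state the guard reports
misuse.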

Adding a new function local to kernel/sched/ that allows idle to call
the scheduler without enabling preemption fixes the
synchronize_rcu_tasks() issue and removes the pointless spurious
schedule calls caused by interrupts happening in the brief window where
preemption is enabled just before idle calls schedule.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>

authored by Steven Rostedt (VMware) and committed by Ingo Molnar 8663effb 2ea659a9

Changed files
+28 -1
+25
kernel/sched/core.c
@@ -3502,6 +3502,31 @@
 }
 EXPORT_SYMBOL(schedule);
 
+/*
+ * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
+ * state (have scheduled out non-voluntarily) by making sure that all
+ * tasks have either left the run queue or have gone into user space.
+ * As idle tasks do not do either, they must not ever be preempted
+ * (schedule out non-voluntarily).
+ *
+ * schedule_idle() is similar to schedule_preempt_disable() except that it
+ * never enables preemption because it does not call sched_submit_work().
+ */
+void __sched schedule_idle(void)
+{
+	/*
+	 * As this skips calling sched_submit_work(), which the idle task does
+	 * regardless because that function is a nop when the task is in a
+	 * TASK_RUNNING state, make sure this isn't used someplace that the
+	 * current task can be in any other state. Note, idle is always in the
+	 * TASK_RUNNING state.
+	 */
+	WARN_ON_ONCE(current->state);
+	do {
+		__schedule(false);
+	} while (need_resched());
+}
+
 #ifdef CONFIG_CONTEXT_TRACKING
 asmlinkage __visible void __sched schedule_user(void)
 {
+1 -1
kernel/sched/idle.c
@@ -265,7 +265,7 @@
 	smp_mb__after_atomic();
 
 	sched_ttwu_pending();
-	schedule_preempt_disabled();
+	schedule_idle();
 
 	if (unlikely(klp_patch_pending(current)))
 		klp_update_patch_state(current);
+2
kernel/sched/sched.h
@@ -1467,6 +1467,8 @@
 }
 #endif
 
+extern void schedule_idle(void);
+
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);