Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched_ext: Documentation: scheduler: Document extensible scheduler class

Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
and pointers to the examples.

v6: - Add paragraph explaining debug dump.

v5: - Updated to reflect /sys/kernel interface change. Kconfig options
added.

v4: - README improved, reformatted in markdown and renamed to README.md.

v3: - Added tools/sched_ext/README.

- Dropped _example prefix from scheduler names.

v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all
of them are addressed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>

+580
+1
Documentation/scheduler/index.rst
···
     sched-nice-design
     sched-rt-group
     sched-stats
+    sched-ext
     sched-debug
 
     text_files
+314
Documentation/scheduler/sched-ext.rst
···
+==========================
+Extensible Scheduler Class
+==========================
+
+sched_ext is a scheduler class whose behavior can be defined by a set of BPF
+programs - the BPF scheduler.
+
+* sched_ext exports a full scheduling interface so that any scheduling
+  algorithm can be implemented on top.
+
+* The BPF scheduler can group CPUs however it sees fit and schedule them
+  together, as tasks aren't tied to specific CPUs at the time of wakeup.
+
+* The BPF scheduler can be turned on and off dynamically anytime.
+
+* The system integrity is maintained no matter what the BPF scheduler does.
+  The default scheduling behavior is restored anytime an error is detected,
+  a runnable task stalls, or on invoking the SysRq key sequence
+  :kbd:`SysRq-S`.
+
+* When the BPF scheduler triggers an error, debug information is dumped to
+  aid debugging. The debug dump is passed to and printed out by the
+  scheduler binary. The debug dump can also be accessed through the
+  ``sched_ext_dump`` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
+  triggers a debug dump. This doesn't terminate the BPF scheduler and can
+  only be read through the tracepoint.
+
+Switching to and from sched_ext
+===============================
+
+``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
+``tools/sched_ext`` contains the example schedulers. The following config
+options should be enabled to use sched_ext:
+
+.. code-block:: none
+
+    CONFIG_BPF=y
+    CONFIG_SCHED_CLASS_EXT=y
+    CONFIG_BPF_SYSCALL=y
+    CONFIG_BPF_JIT=y
+    CONFIG_DEBUG_INFO_BTF=y
+    CONFIG_BPF_JIT_ALWAYS_ON=y
+    CONFIG_BPF_JIT_DEFAULT_ON=y
+    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
+    CONFIG_PAHOLE_HAS_BTF_TAG=y
+
+sched_ext is used only when the BPF scheduler is loaded and running.
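The config option list in this part of the patch can be checked mechanically. What follows is a minimal userspace sketch in plain C, assuming the kernel configuration text has already been read into a string (for example from a `.config` file). `config_has_line()` and `count_missing_opts()` are hypothetical helper names for illustration only; they are not part of sched_ext or of any kernel tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Options that sched-ext.rst lists as needed to use sched_ext. */
static const char *const scx_required_opts[] = {
	"CONFIG_BPF=y",
	"CONFIG_SCHED_CLASS_EXT=y",
	"CONFIG_BPF_SYSCALL=y",
	"CONFIG_BPF_JIT=y",
	"CONFIG_DEBUG_INFO_BTF=y",
	"CONFIG_BPF_JIT_ALWAYS_ON=y",
	"CONFIG_BPF_JIT_DEFAULT_ON=y",
	"CONFIG_PAHOLE_HAS_SPLIT_BTF=y",
	"CONFIG_PAHOLE_HAS_BTF_TAG=y",
};

/* Return true if `line` appears as a complete line in `config`. */
bool config_has_line(const char *config, const char *line)
{
	size_t len = strlen(line);
	const char *p = config;

	assert(len > 0);
	while ((p = strstr(p, line)) != NULL) {
		bool at_start = (p == config) || (p[-1] == '\n');
		bool at_end = (p[len] == '\0') || (p[len] == '\n');

		if (at_start && at_end)
			return true;
		p++;	/* keep scanning past a partial match */
	}
	return false;
}

/* Count how many of the listed options are absent from `config`. */
size_t count_missing_opts(const char *config)
{
	size_t n = sizeof(scx_required_opts) / sizeof(scx_required_opts[0]);
	size_t missing = 0;

	for (size_t i = 0; i < n; i++)
		if (!config_has_line(config, scx_required_opts[i]))
			missing++;
	return missing;
}
```

Matching whole lines keeps a commented-out entry such as `# CONFIG_BPF=y` from being counted as enabled.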
+
+If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
+treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
+loaded. On load, such tasks will be switched to and scheduled by sched_ext.
+
+The BPF scheduler can choose to schedule all normal and lower class tasks by
+calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this
+case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
+``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
+this mode can be selected with the ``-a`` option.
+
+Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
+detection of any internal error including stalled runnable tasks aborts the
+BPF scheduler and reverts all tasks back to CFS.
+
+.. code-block:: none
+
+    # make -j16 -C tools/sched_ext
+    # tools/sched_ext/scx_simple
+    local=0 global=3
+    local=5 global=24
+    local=9 global=44
+    local=13 global=56
+    local=17 global=72
+    ^CEXIT: BPF scheduler unregistered
+
+The current status of the BPF scheduler can be determined as follows:
+
+.. code-block:: none
+
+    # cat /sys/kernel/sched_ext/state
+    enabled
+    # cat /sys/kernel/sched_ext/root/ops
+    simple
+
+``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
+detailed information:
+
+.. code-block:: none
+
+    # tools/sched_ext/scx_show_state.py
+    ops           : simple
+    enabled       : 1
+    switching_all : 1
+    switched_all  : 1
+    enable_state  : enabled (2)
+    bypass_depth  : 0
+    nr_rejected   : 0
+
+If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
+be determined as follows:
+
+.. code-block:: none
+
+    # grep ext /proc/self/sched
+    ext.enabled : 1
+
+The Basics
+==========
+
+Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
+programs that implement ``struct sched_ext_ops``. The only mandatory field
+is ``ops.name`` which must be a valid BPF object name. All operations are
+optional. The following modified excerpt is from
+``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
+
+.. code-block:: c
+
+    /*
+     * Decide which CPU a task should be migrated to before being
+     * enqueued (either at wakeup, fork time, or exec time). If an
+     * idle core is found by the default ops.select_cpu() implementation,
+     * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
+     * ops.enqueue() callback.
+     *
+     * Note that this implementation has exactly the same behavior as the
+     * default ops.select_cpu implementation. The behavior of the scheduler
+     * would be exactly the same if the implementation just didn't define
+     * the simple_select_cpu() struct_ops prog.
+     */
+    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
+                       s32 prev_cpu, u64 wake_flags)
+    {
+    	s32 cpu;
+    	/* Need to initialize or the BPF verifier will reject the program */
+    	bool direct = false;
+
+    	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
+
+    	if (direct)
+    		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+
+    	return cpu;
+    }
+
+    /*
+     * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
+     * callback will only be invoked if we failed to find a core to dispatch
+     * to in ops.select_cpu() above.
+     *
+     * Note that this implementation has exactly the same behavior as the
+     * default ops.enqueue implementation, which just dispatches the task
+     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly the
+     * same if the implementation just didn't define the simple_enqueue
+     * struct_ops prog.
+     */
+    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
+    {
+    	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+    }
+
+    s32 BPF_STRUCT_OPS(simple_init)
+    {
+    	/*
+    	 * All SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH tasks should
+    	 * use sched_ext.
+    	 */
+    	scx_bpf_switch_all();
+    	return 0;
+    }
+
+    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
+    {
+    	exit_type = ei->type;
+    }
+
+    SEC(".struct_ops")
+    struct sched_ext_ops simple_ops = {
+    	.select_cpu		= (void *)simple_select_cpu,
+    	.enqueue		= (void *)simple_enqueue,
+    	.init			= (void *)simple_init,
+    	.exit			= (void *)simple_exit,
+    	.name			= "simple",
+    };
+
+Dispatch Queues
+---------------
+
+To match the impedance between the scheduler core and the BPF scheduler,
+sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
+priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
+and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
+an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
+``scx_bpf_destroy_dsq()``.
+
+A CPU always executes a task from its local DSQ. A task is "dispatched" to a
+DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
+local DSQ.
+
+When a CPU is looking for the next task to run, if the local DSQ is not
+empty, the first task is picked. Otherwise, the CPU tries to consume the
+global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
+is invoked.
+
+Scheduling Cycle
+----------------
+
+The following briefly shows how a waking task is scheduled and executed.
+
+1. When a task is waking up, ``ops.select_cpu()`` is the first operation
+   invoked. This serves two purposes. First, it provides a CPU selection
+   optimization hint. Second, it wakes up the selected CPU if idle.
+
+   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
+   binding. The actual decision is made at the last step of scheduling.
+   However, there is a small performance gain if the CPU
+   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
+
+   A side-effect of selecting a CPU is waking it up from idle. While a BPF
+   scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
+   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
+
+   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
+   calling ``scx_bpf_dispatch()``. If the task is dispatched to
+   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
+   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
+   Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
+   ``ops.enqueue()`` callback to be skipped.
+
+   Note that the scheduler core will ignore an invalid CPU selection, for
+   example, if it's outside the allowed cpumask of the task.
+
+2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
+   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
+   can make one of the following decisions:
+
+   * Immediately dispatch the task to either the global or local DSQ by
+     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+     ``SCX_DSQ_LOCAL``, respectively.
+
+   * Immediately dispatch the task to a custom DSQ by calling
+     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
+
+   * Queue the task on the BPF side.
+
+3. When a CPU is ready to schedule, it first looks at its local DSQ. If
+   empty, it then looks at the global DSQ. If there still isn't a task to
+   run, ``ops.dispatch()`` is invoked which can use the following two
+   functions to populate the local DSQ.
+
+   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
+     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
+     currently can't be called with BPF locks held, this is being worked on
+     and will be supported. ``scx_bpf_dispatch()`` schedules the dispatch
+     rather than performing it immediately. There can be up to
+     ``ops.dispatch_max_batch`` pending tasks.
+
+   * ``scx_bpf_consume()`` transfers a task from the specified non-local DSQ
+     to the dispatching DSQ. This function cannot be called with any BPF
+     locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
+     before trying to consume the specified DSQ.
+
+4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
+   the CPU runs the first one. If empty, the following steps are taken:
+
+   * Try to consume the global DSQ. If successful, run the task.
+
+   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
+
+   * If the previous task is an SCX task and still runnable, keep executing
+     it (see ``SCX_OPS_ENQ_LAST``).
+
+   * Go idle.
+
+Note that the BPF scheduler can always choose to dispatch tasks immediately
+in ``ops.enqueue()`` as illustrated in the above simple example. If only the
+built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
+a task is never queued on the BPF scheduler and both the local and global
+DSQs are consumed automatically.
+
+``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
+``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
+``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
+dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
+function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
+more information.
+
+Where to Look
+=============
+
+* ``include/linux/sched/ext.h`` defines the core data structures, ops table
+  and constants.
+
+* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
+  The functions prefixed with ``scx_bpf_`` can be called from the BPF
+  scheduler.
+
+* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
+
+  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
+    custom DSQ.
+
+  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
+    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
+
+ABI Instability
+===============
+
+The APIs provided by sched_ext to BPF scheduler programs have no stability
+guarantees. This includes the ops table callbacks and constants defined in
+``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
+``kernel/sched/ext.c``.
+
+While we will attempt to provide a relatively stable API surface when
+possible, it is subject to change without warning between kernel
+versions.
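The lookup order described in steps 3 and 4 of the Scheduling Cycle section of sched-ext.rst can be modeled in plain userspace C. This is only a toy sketch under stated assumptions and is not how the kernel implements it: every `toy_` name is hypothetical, DSQs are reduced to small integer FIFOs, "consuming" the global DSQ is modeled as popping from it directly, and the `max_rounds` retry bound is an assumption added so the loop always terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DSQ_CAP 16
#define TOY_IDLE (-1)

/* Toy FIFO of task IDs standing in for a DSQ; not a kernel type. */
struct toy_dsq {
	int tasks[TOY_DSQ_CAP];
	size_t head;
	size_t len;
};

bool toy_dsq_push(struct toy_dsq *q, int task)
{
	if (q->len == TOY_DSQ_CAP)
		return false;
	q->tasks[(q->head + q->len) % TOY_DSQ_CAP] = task;
	q->len++;
	return true;
}

bool toy_dsq_pop(struct toy_dsq *q, int *task)
{
	if (q->len == 0)
		return false;
	*task = q->tasks[q->head];
	q->head = (q->head + 1) % TOY_DSQ_CAP;
	q->len--;
	return true;
}

/* Hypothetical stand-in for ops.dispatch(); returns true if it dispatched. */
typedef bool (*toy_dispatch_fn)(struct toy_dsq *local, struct toy_dsq *global);

/* Scheduler-private queue, modeling tasks "queued on the BPF side". */
struct toy_dsq toy_bpf_queue;

/* Example dispatch: move one privately queued task to the local DSQ. */
bool toy_dispatch_one(struct toy_dsq *local, struct toy_dsq *global)
{
	int task;

	(void)global;
	if (!toy_dsq_pop(&toy_bpf_queue, &task))
		return false;
	return toy_dsq_push(local, task);
}

/* Local DSQ first, then the global DSQ. */
static bool toy_take(struct toy_dsq *local, struct toy_dsq *global, int *task)
{
	return toy_dsq_pop(local, task) || toy_dsq_pop(global, task);
}

/*
 * Pick the next task ID to run. `prev` is the previous task if it is still
 * runnable, or TOY_IDLE otherwise. Returns TOY_IDLE when the CPU would idle.
 */
int toy_pick_next(struct toy_dsq *local, struct toy_dsq *global,
		  toy_dispatch_fn dispatch, int prev, int max_rounds)
{
	int task;

	assert(max_rounds >= 0);

	/* Step 3: look at the local DSQ, then the global DSQ. */
	if (toy_take(local, global, &task))
		return task;

	for (int i = 0; i < max_rounds; i++) {
		bool dispatched = dispatch && dispatch(local, global);

		/* Step 4: run from the local DSQ, else try the global DSQ. */
		if (toy_take(local, global, &task))
			return task;
		/* Retry only if the dispatch callback made progress. */
		if (!dispatched)
			break;
	}

	/* Keep running the previous task if runnable; otherwise go idle. */
	return prev;
}
```

With a task in each queue, successive calls drain the local DSQ, then the global DSQ, then whatever the dispatch callback provides, and only then fall back to the previous task or idle.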
+2
include/linux/sched/ext.h
···
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
  * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
  * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
  * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+1
kernel/Kconfig.preempt
···
 	  similar to struct sched_class.
 
 	  For more information:
+	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
+2
kernel/sched/ext.c
···
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
  * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
  * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
  * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+2
kernel/sched/ext.h
···
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
  * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
  * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
  * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+258
tools/sched_ext/README.md
···
+SCHED_EXT EXAMPLE SCHEDULERS
+============================
+
+# Introduction
+
+This directory contains a number of example sched_ext schedulers. These
+schedulers are meant to provide examples of different types of schedulers
+that can be built using sched_ext, and illustrate how various features of
+sched_ext can be used.
+
+Some of the examples are performant, production-ready schedulers. That is, for
+the correct workload and with the correct tuning, they may be deployed in a
+production environment with acceptable or possibly even improved performance.
+Others are just examples that, in practice, would not provide acceptable
+performance (though they could be improved to get there).
+
+This README describes these example schedulers, including the types of
+workloads or scenarios they're designed to accommodate, and whether or
+not they're production ready. For more details on any of these schedulers,
+please see the header comment in their .bpf.c file.
+
+
+# Compiling the examples
+
+There are a few toolchain dependencies for compiling the example schedulers.
+
+## Toolchain dependencies
+
+1. clang >= 16.0.0
+
+The schedulers are BPF programs, and therefore must be compiled with clang. gcc
+is actively working on adding a BPF backend compiler as well, but is still
+missing some features such as BTF type tags which are necessary for using
+kptrs.
+
+2. pahole >= 1.25
+
+You may need pahole in order to generate BTF from DWARF.
+
+3. rust >= 1.70.0
+
+Rust schedulers use features present in the rust toolchain >= 1.70.0. You
+should be able to use the stable build from rustup, but if that doesn't
+work, try using the rustup nightly build.
+
+There are other requirements as well, such as make, but these are the main /
+non-trivial ones.
+
+## Compiling the kernel
+
+In order to run a sched_ext scheduler, you'll have to run a kernel compiled
+with the patches in this repository, and with a minimum set of necessary
+Kconfig options:
+
+```
+CONFIG_BPF=y
+CONFIG_SCHED_CLASS_EXT=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_JIT=y
+CONFIG_DEBUG_INFO_BTF=y
+```
+
+It's also recommended that you include the following Kconfig options:
+
+```
+CONFIG_BPF_JIT_ALWAYS_ON=y
+CONFIG_BPF_JIT_DEFAULT_ON=y
+CONFIG_PAHOLE_HAS_SPLIT_BTF=y
+CONFIG_PAHOLE_HAS_BTF_TAG=y
+```
+
+There is a `Kconfig` file in this directory whose contents you can append to
+your local `.config` file, as long as there are no conflicts with any existing
+options in the file.
+
+## Getting a vmlinux.h file
+
+You may notice that most of the example schedulers include a "vmlinux.h" file.
+This is a large, auto-generated header file that contains all of the types
+defined in some vmlinux binary that was compiled with
+[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
+options specified above).
+
+The header file is created using `bpftool`, by passing it a vmlinux binary
+compiled with BTF as follows:
+
+```bash
+$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
+```
+
+`bpftool` analyzes all of the BTF encodings in the binary, and produces a
+header file that can be included by BPF programs to access those types. For
+example, using vmlinux.h allows a scheduler to access fields defined directly
+in vmlinux as follows:
+
+```c
+#include "vmlinux.h"
+// vmlinux.h is also implicitly included by scx_common.bpf.h.
+#include "scx_common.bpf.h"
+
+/*
+ * vmlinux.h provides definitions for struct task_struct and
+ * struct scx_enable_args.
+ */
+void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
+		    struct scx_enable_args *args)
+{
+	bpf_printk("Task %s enabled in example scheduler", p->comm);
+}
+
+// vmlinux.h provides the definition for struct sched_ext_ops.
+SEC(".struct_ops.link")
+struct sched_ext_ops example_ops = {
+	.enable	= (void *)example_enable,
+	.name	= "example",
+};
+```
+
+The scheduler build system will generate this vmlinux.h file as part of the
+scheduler build pipeline. It looks for a vmlinux file in the following
+dependency order:
+
+1. If the O= environment variable is defined, at `$O/vmlinux`
+2. If the KBUILD_OUTPUT= environment variable is defined, at
+   `$KBUILD_OUTPUT/vmlinux`
+3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
+   compiling the schedulers)
+4. `/sys/kernel/btf/vmlinux`
+5. `/boot/vmlinux-$(uname -r)`
+
+In other words, if you have compiled a kernel in your local repo, its vmlinux
+file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
+the kernel you're currently running on. This means that if you're running on a
+kernel with sched_ext support, you may not need to compile a local kernel at
+all.
+
+### Aside on CO-RE
+
+One of the cooler features of BPF is that it supports
+[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
+Everywhere). This feature allows you to reference fields inside of structs with
+types defined internal to the kernel, and not have to recompile if you load the
+BPF program on a different kernel with the field at a different offset. In our
+example above, we print out a task name with `p->comm`. CO-RE would perform
+relocations for that access when the program is loaded to ensure that it's
+referencing the correct offset for the currently running kernel.
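The vmlinux lookup order listed under "Getting a vmlinux.h file" can be pictured as a first-match search. The real logic lives in the scheduler build system, not in C, so the following is only an illustrative sketch: `find_vmlinux()`, the `path_exists_fn` callback and the two `fake_exists_*` helpers are hypothetical names, and the caller is assumed to pass in the values of `O`, `KBUILD_OUTPUT` and `uname -r`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Caller-supplied existence check, so the search can be tested offline. */
typedef bool (*path_exists_fn)(const char *path);

/* Fake checks for illustration: everything exists / only the sysfs BTF. */
bool fake_exists_all(const char *path)
{
	(void)path;
	return true;
}

bool fake_exists_sysfs_only(const char *path)
{
	return strcmp(path, "/sys/kernel/btf/vmlinux") == 0;
}

/*
 * Write the first existing candidate into `buf`, trying candidates in the
 * order the README lists them. `o_dir` and `kbuild_output` may be NULL when
 * the corresponding variable is unset. Returns `buf`, or NULL if none exists.
 */
const char *find_vmlinux(const char *o_dir, const char *kbuild_output,
			 const char *uname_r, path_exists_fn exists,
			 char *buf, size_t buflen)
{
	assert(exists && buf && buflen > 0);

	for (int step = 1; step <= 5; step++) {
		int n = -1;

		switch (step) {
		case 1:
			if (o_dir)
				n = snprintf(buf, buflen, "%s/vmlinux", o_dir);
			break;
		case 2:
			if (kbuild_output)
				n = snprintf(buf, buflen, "%s/vmlinux",
					     kbuild_output);
			break;
		case 3:
			n = snprintf(buf, buflen, "../../vmlinux");
			break;
		case 4:
			n = snprintf(buf, buflen, "/sys/kernel/btf/vmlinux");
			break;
		case 5:
			if (uname_r)
				n = snprintf(buf, buflen, "/boot/vmlinux-%s",
					     uname_r);
			break;
		}

		/* Skip unset variables and paths that did not fit in buf. */
		if (n < 0 || (size_t)n >= buflen)
			continue;
		if (exists(buf))
			return buf;
	}
	return NULL;
}
```

Passing the existence check as a callback keeps the ordering logic separate from the filesystem, which makes it easy to see that a locally built vmlinux takes priority over the running kernel's BTF.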
+
+## Compiling the schedulers
+
+Once you have your toolchain set up, and a vmlinux that can be used to generate
+a full vmlinux.h file, you can compile the schedulers using `make`:
+
+```bash
+$ make -j$(nproc)
+```
+
+# Example schedulers
+
+This directory contains the following example schedulers. These schedulers are
+for testing and demonstrating different aspects of sched_ext. While some may be
+useful in limited scenarios, they are not intended to be practical.
+
+For more scheduler implementations, tools and documentation, visit
+https://github.com/sched-ext/scx.
+
+## scx_simple
+
+A simple scheduler that provides an example of a minimal sched_ext scheduler.
+scx_simple can be run in either global weighted vtime mode, or FIFO mode.
+
+Though very simple, in limited scenarios, this scheduler can perform reasonably
+well on single-socket systems with a unified L3 cache.
+
+## scx_qmap
+
+Another simple, yet slightly more complex scheduler that provides an example of
+a basic weighted FIFO queuing policy. It also provides examples of some common
+useful BPF features, such as sleepable per-task storage allocation in the
+`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
+enqueue tasks. It also illustrates how core-sched support could be implemented.
+
+## scx_central
+
+A "central" scheduler where scheduling decisions are made from a single CPU.
+This scheduler illustrates how scheduling decisions can be dispatched from a
+single CPU, allowing other cores to run with infinite slices, without timer
+ticks, and without having to incur the overhead of making scheduling decisions.
+
+The approach demonstrated by this scheduler may be useful for any workload that
+benefits from minimizing scheduling overhead and timer ticks. An example of
+where this could be particularly useful is running VMs, where running with
+infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
+vmexits.
+
+
+# Troubleshooting
+
+There are a number of common issues that you may run into when building the
+schedulers. We'll go over some of the common ones here.
+
+## Build Failures
+
+### Old version of clang
+
+```
+error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
+        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
+                       ^~~~~~~~~~~~~~~~~~~~
+1 error generated.
+```
+
+This means you built the kernel or the schedulers with an older version of
+clang than what's supported (i.e. older than 16.0.0). To remediate this:
+
+1. `which clang` to make sure you're using a sufficiently new version of clang.
+
+2. `make fullclean` in the root path of the repository.
+
+3. Rebuild the kernel, and then your example schedulers.
+
+The schedulers are also cleaned if you invoke `make mrproper` in the root
+directory of the tree.
+
+### Stale kernel build / incomplete vmlinux.h file
+
+As described above, you'll need a `vmlinux.h` file that was generated from a
+vmlinux built with BTF, and with sched_ext support enabled. If you don't,
+you'll see errors such as the following which indicate that a type being
+referenced in a scheduler is unknown:
+
+```
+/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'
+
+const struct scx_exit_info *ei)
+
+^
+```
+
+In order to resolve this, please follow the steps above in
+[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) in order to ensure your
+schedulers are using a vmlinux.h file that includes the requisite types.
+
+## Misc
+
+### llvm: [OFF]
+
+You may see the following output when building the schedulers:
+
+```
+Auto-detecting system features:
+... clang-bpf-co-re: [ on ]
+... llvm: [ OFF ]
+... libcap: [ on ]
+... libbfd: [ on ]
+```
+
+Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.