			Static Keys
			-----------

By: Jason Baron <jbaron@redhat.com>

0) Abstract

Static keys allow the inclusion of seldom used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example:

	struct static_key key = STATIC_KEY_INIT_FALSE;

	...

	if (static_key_false(&key))
		do unlikely code
	else
		do likely code

	...
	static_key_slow_inc(&key);
	...
	static_key_slow_dec(&key);
	...

The static_key_false() branch will be generated into the code with as little
impact on the likely code path as possible.


1) Motivation


Currently, tracepoints are implemented using a conditional branch. The
conditional check requires checking a global variable for each tracepoint.
Although the overhead of this check is small, it increases when the memory
cache comes under pressure (memory cache lines for these global variables may
be shared with other memory accesses). As we increase the number of tracepoints
in the kernel, this overhead may become more of an issue. In addition,
tracepoints are often dormant (disabled) and provide no direct kernel
functionality. Thus, it is highly desirable to reduce their impact as much as
possible. Although tracepoints are the original motivation for this work, other
kernel code paths should be able to make use of the static keys facility.


2) Solution


gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:

http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html

Using the 'asm goto', we can create branches that are either taken or not taken
by default, without the need to check memory.
Then, at run-time, we can patch
the branch site to change the branch direction.

For example, if we have a simple branch that is disabled by default:

	if (static_key_false(&key))
		printk("I am the true branch\n");

Thus, by default the 'printk' will not be emitted, and the code generated will
consist of a single atomic 'no-op' instruction (5 bytes on x86) in the
straight-line code path. When the branch is 'flipped', we will patch the
'no-op' in the straight-line code path with a 'jump' instruction to the
out-of-line true branch. Thus, changing branch direction is expensive but
branch selection is basically 'free'. That is the basic tradeoff of this
optimization.

This low-level patching mechanism is called 'jump label patching', and it
gives the basis for the static keys facility.

3) Static key label API, usage and examples:


In order to make use of this optimization you must first define a key:

	struct static_key key;

Which is initialized as:

	struct static_key key = STATIC_KEY_INIT_TRUE;

or:

	struct static_key key = STATIC_KEY_INIT_FALSE;

If the key is not initialized, it defaults to false. The 'struct static_key'
must be a 'global'. That is, it can't be allocated on the stack or dynamically
allocated at run-time.

The key is then used in code as:

	if (static_key_false(&key))
		do unlikely code
	else
		do likely code

Or:

	if (static_key_true(&key))
		do likely code
	else
		do unlikely code

A key that is initialized via 'STATIC_KEY_INIT_FALSE' must be used in a
'static_key_false()' construct. Likewise, a key initialized via
'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct.
A109109+single key can be used in many branches, but all the branches must match the110110+way that the key has been initialized.111111+112112+The branch(es) can then be switched via:113113+114114+ static_key_slow_inc(&key);115115+ ...116116+ static_key_slow_dec(&key);117117+118118+Thus, 'static_key_slow_inc()' means 'make the branch true', and119119+'static_key_slow_dec()' means 'make the the branch false' with appropriate120120+reference counting. For example, if the key is initialized true, a121121+static_key_slow_dec(), will switch the branch to false. And a subsequent122122+static_key_slow_inc(), will change the branch back to true. Likewise, if the123123+key is initialized false, a 'static_key_slow_inc()', will change the branch to124124+true. And then a 'static_key_slow_dec()', will again make the branch false.125125+126126+An example usage in the kernel is the implementation of tracepoints:127127+128128+ static inline void trace_##name(proto) \129129+ { \130130+ if (static_key_false(&__tracepoint_##name.key)) \131131+ __DO_TRACE(&__tracepoint_##name, \132132+ TP_PROTO(data_proto), \133133+ TP_ARGS(data_args), \134134+ TP_CONDITION(cond)); \135135+ }136136+137137+Tracepoints are disabled by default, and can be placed in performance critical138138+pieces of the kernel. Thus, by using a static key, the tracepoints can have139139+absolutely minimal impact when not in use.140140+141141+142142+4) Architecture level code patching interface, 'jump labels'143143+144144+145145+There are a few functions and macros that architectures must implement in order146146+to take advantage of this optimization. 
If there is no architecture support, we
simply fall back to a traditional load, test, and jump sequence.

* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig

* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h

* __always_inline bool arch_static_branch(struct static_key *key), see:
	arch/x86/include/asm/jump_label.h

* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
	see: arch/x86/kernel/jump_label.c

* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
	see: arch/x86/kernel/jump_label.c

* struct jump_entry, see: arch/x86/include/asm/jump_label.h


5) Static keys / jump label analysis, results (x86_64):


As an example, let's add the following branch to 'getppid()', such that the
system call now looks like:

SYSCALL_DEFINE0(getppid)
{
	int pid;

+	if (static_key_false(&key))
+		printk("I am the true branch\n");

	rcu_read_lock();
	pid = task_tgid_vnr(rcu_dereference(current->real_parent));
	rcu_read_unlock();

	return pid;
}

The resulting instructions with jump labels generated by GCC are:

ffffffff81044290 <sys_getppid>:
ffffffff81044290:	55			push   %rbp
ffffffff81044291:	48 89 e5		mov    %rsp,%rbp
ffffffff81044294:	e9 00 00 00 00		jmpq   ffffffff81044299 <sys_getppid+0x9>
ffffffff81044299:	65 48 8b 04 25 c0 b6	mov    %gs:0xb6c0,%rax
ffffffff810442a0:	00 00
ffffffff810442a2:	48 8b 80 80 02 00 00	mov    0x280(%rax),%rax
ffffffff810442a9:	48 8b 80 b0 02 00 00	mov    0x2b0(%rax),%rax
ffffffff810442b0:	48 8b b8 e8 02 00 00	mov    0x2e8(%rax),%rdi
ffffffff810442b7:	e8 f4 d9 00 00		callq  ffffffff81051cb0 <pid_vnr>
ffffffff810442bc:	5d			pop    %rbp
ffffffff810442bd:	48 98			cltq
ffffffff810442bf:	c3			retq
ffffffff810442c0:	48 c7 c7 e3 54 98 81	mov    $0xffffffff819854e3,%rdi
ffffffff810442c7:	31 c0			xor    %eax,%eax
ffffffff810442c9:	e8 71 13 6d 00		callq  ffffffff8171563f <printk>
ffffffff810442ce:	eb c9			jmp    ffffffff81044299 <sys_getppid+0x9>

Without the jump label optimization it looks like:

ffffffff810441f0 <sys_getppid>:
ffffffff810441f0:	8b 05 8a 52 d8 00	mov    0xd8528a(%rip),%eax	# ffffffff81dc9480 <key>
ffffffff810441f6:	55			push   %rbp
ffffffff810441f7:	48 89 e5		mov    %rsp,%rbp
ffffffff810441fa:	85 c0			test   %eax,%eax
ffffffff810441fc:	75 27			jne    ffffffff81044225 <sys_getppid+0x35>
ffffffff810441fe:	65 48 8b 04 25 c0 b6	mov    %gs:0xb6c0,%rax
ffffffff81044205:	00 00
ffffffff81044207:	48 8b 80 80 02 00 00	mov    0x280(%rax),%rax
ffffffff8104420e:	48 8b 80 b0 02 00 00	mov    0x2b0(%rax),%rax
ffffffff81044215:	48 8b b8 e8 02 00 00	mov    0x2e8(%rax),%rdi
ffffffff8104421c:	e8 2f da 00 00		callq  ffffffff81051c50 <pid_vnr>
ffffffff81044221:	5d			pop    %rbp
ffffffff81044222:	48 98			cltq
ffffffff81044224:	c3			retq
ffffffff81044225:	48 c7 c7 13 53 98 81	mov    $0xffffffff81985313,%rdi
ffffffff8104422c:	31 c0			xor    %eax,%eax
ffffffff8104422e:	e8 60 0f 6d 00		callq  ffffffff81715193 <printk>
ffffffff81044233:	eb c9			jmp    ffffffff810441fe <sys_getppid+0xe>
ffffffff81044235:	66 66 2e 0f 1f 84 00	data32 nopw %cs:0x0(%rax,%rax,1)
ffffffff8104423c:	00 00 00 00

Thus, the disabled jump label case adds a 'mov', a 'test' and a 'jne'
instruction, whereas the jump label case just has a 'no-op' or a 'jmp 0'. (The
'jmp 0' is patched to a 5-byte atomic no-op instruction at boot time.) Thus,
the disabled jump label case adds:

6 (mov) + 2 (test) + 2 (jne) = 10 bytes; 10 - 5 (the 5-byte 'jmp 0') = 5
additional bytes.

If we then include the padding bytes, the jump label code saves 16 total bytes
of instruction memory for this small function. In this case the non-jump label
function is 80 bytes long. Thus, we have saved 20% of the instruction
footprint. We can in fact improve this even further, since the 5-byte no-op
really can be a 2-byte no-op, because we can reach the branch with a 2-byte
jmp. However, we have not yet implemented optimal no-op sizes (they are
currently hard-coded).

Since there are a number of static key API uses in the scheduler paths,
'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
performance improvement. Testing done on 3.3.0-rc2:

jump label disabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )

       1.601607384 seconds time elapsed                                          ( +-  0.07% )

jump label enabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )

       1.579384366 seconds time elapsed

The percentage of saved branches is .7%, and we've saved 12% on
'branch-misses'. This is where we would expect to get the most savings, since
this optimization is about reducing the number of branches. In addition, we've
saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time.
arch/Kconfig (+20, -9):

 	If in doubt, say "N".

 config JUMP_LABEL
-	bool "Optimize trace point call sites"
+	bool "Optimize very unlikely/likely branches"
 	depends on HAVE_ARCH_JUMP_LABEL
 	help
-	 If it is detected that the compiler has support for "asm goto",
-	 the kernel will compile trace point locations with just a
-	 nop instruction. When trace points are enabled, the nop will
-	 be converted to a jump to the trace function. This technique
-	 lowers overhead and stress on the branch prediction of the
-	 processor.
+	 This option enables a transparent branch optimization that
+	 makes certain almost-always-true or almost-always-false branch
+	 conditions even cheaper to execute within the kernel.

-	 On i386, options added to the compiler flags may increase
-	 the size of the kernel slightly.
+	 Certain performance-sensitive kernel code, such as trace points,
+	 scheduler functionality, networking code and KVM have such
+	 branches and include support for this optimization technique.
+
+	 If it is detected that the compiler has support for "asm goto",
+	 the kernel will compile such branches with just a nop
+	 instruction. When the condition flag is toggled to true, the
+	 nop will be converted to a jump instruction to execute the
+	 conditional block of instructions.
+
+	 This technique lowers overhead and stress on the branch prediction
+	 of the processor and generally makes the kernel faster. The update
+	 of the condition is slower, but those are always very rare.
+
+	 ( On 32-bit x86, the necessary options added to the compiler
+	   flags may increase the size of the kernel slightly. )

 config OPTPROBES
 	def_bool y
arch/x86/kvm/mmu_audit.c:

 }

 static bool mmu_audit;
-static struct jump_label_key mmu_audit_key;
+static struct static_key mmu_audit_key;

 static void __kvm_mmu_audit(struct kvm_vcpu *vcpu, int point)
 {
...

 static inline void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point)
 {
-	if (static_branch((&mmu_audit_key)))
+	if (static_key_false((&mmu_audit_key)))
 		__kvm_mmu_audit(vcpu, point);
 }
...
 	if (mmu_audit)
 		return;

-	jump_label_inc(&mmu_audit_key);
+	static_key_slow_inc(&mmu_audit_key);
 	mmu_audit = true;
 }
...
 	if (!mmu_audit)
 		return;

-	jump_label_dec(&mmu_audit_key);
+	static_key_slow_dec(&mmu_audit_key);
 	mmu_audit = false;
 }
include/linux/jump_label.h (+100, -39):

  *
  * Jump labels provide an interface to generate dynamic branches using
  * self-modifying code. Assuming toolchain and architecture support the result
- * of a "if (static_branch(&key))" statement is a unconditional branch (which
+ * of a "if (static_key_false(&key))" statement is a unconditional branch (which
  * defaults to false - and the true block is placed out of line).
  *
- * However at runtime we can change the 'static' branch target using
- * jump_label_{inc,dec}(). These function as a 'reference' count on the key
+ * However at runtime we can change the branch target using
+ * static_key_slow_{inc,dec}(). These function as a 'reference' count on the key
  * object and for as long as there are references all branches referring to
  * that particular key will point to the (out of line) true block.
  *
- * Since this relies on modifying code the jump_label_{inc,dec}() functions
+ * Since this relies on modifying code the static_key_slow_{inc,dec}() functions
  * must be considered absolute slow paths (machine wide synchronization etc.).
  * OTOH, since the affected branches are unconditional their runtime overhead
  * will be absolutely minimal, esp. in the default (off) case where the total
...
  *
  * When the control is directly exposed to userspace it is prudent to delay the
  * decrement to avoid high frequency code modifications which can (and do)
- * cause significant performance degradation. Struct jump_label_key_deferred and
- * jump_label_dec_deferred() provide for this.
+ * cause significant performance degradation. Struct static_key_deferred and
+ * static_key_slow_dec_deferred() provide for this.
  *
  * Lacking toolchain and or architecture support, it falls back to a simple
  * conditional branch.
- */
+ *
+ * struct static_key my_key = STATIC_KEY_INIT_TRUE;
+ *
+ * if (static_key_true(&my_key)) {
+ * }
+ *
+ * will result in the true case being in-line and starts the key with a single
+ * reference. Mixing static_key_true() and static_key_false() on the same key is not
+ * allowed.
+ *
+ * Not initializing the key (static data is initialized to 0s anyway) is the
+ * same as using STATIC_KEY_INIT_FALSE and static_key_false() is
+ * equivalent with static_branch().
+ *
+*/

 #include <linux/types.h>
 #include <linux/compiler.h>
...
 #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)

-struct jump_label_key {
+struct static_key {
 	atomic_t enabled;
+/* Set lsb bit to 1 if branch is default true, 0 ot */
 	struct jump_entry *entries;
 #ifdef CONFIG_MODULES
-	struct jump_label_mod *next;
+	struct static_key_mod *next;
 #endif
 };

-struct jump_label_key_deferred {
-	struct jump_label_key key;
+struct static_key_deferred {
+	struct static_key key;
 	unsigned long timeout;
 	struct delayed_work work;
 };
...

 #ifdef HAVE_JUMP_LABEL

-#ifdef CONFIG_MODULES
-#define JUMP_LABEL_INIT {ATOMIC_INIT(0), NULL, NULL}
-#else
-#define JUMP_LABEL_INIT {ATOMIC_INIT(0), NULL}
-#endif
+#define JUMP_LABEL_TRUE_BRANCH 1UL

-static __always_inline bool static_branch(struct jump_label_key *key)
+static
+inline struct jump_entry *jump_label_get_entries(struct static_key *key)
+{
+	return (struct jump_entry *)((unsigned long)key->entries
+						& ~JUMP_LABEL_TRUE_BRANCH);
+}
+
+static inline bool jump_label_get_branch_default(struct static_key *key)
+{
+	if ((unsigned long)key->entries & JUMP_LABEL_TRUE_BRANCH)
+		return true;
+	return false;
+}
+
+static __always_inline bool static_key_false(struct static_key *key)
+{
+	return arch_static_branch(key);
+}
+
+static __always_inline bool static_key_true(struct static_key *key)
+{
+	return !static_key_false(key);
+}
+
+/* Deprecated. Please use 'static_key_false() instead. */
+static __always_inline bool static_branch(struct static_key *key)
 {
 	return arch_static_branch(key);
 }
...
 extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
 extern int jump_label_text_reserved(void *start, void *end);
-extern void jump_label_inc(struct jump_label_key *key);
-extern void jump_label_dec(struct jump_label_key *key);
-extern void jump_label_dec_deferred(struct jump_label_key_deferred *key);
-extern bool jump_label_enabled(struct jump_label_key *key);
+extern void static_key_slow_inc(struct static_key *key);
+extern void static_key_slow_dec(struct static_key *key);
+extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
+extern bool static_key_enabled(struct static_key *key);
 extern void jump_label_apply_nops(struct module *mod);
-extern void jump_label_rate_limit(struct jump_label_key_deferred *key,
-		unsigned long rl);
+extern void
+jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
+
+#define STATIC_KEY_INIT_TRUE ((struct static_key) \
+	{ .enabled = ATOMIC_INIT(1), .entries = (void *)1 })
+#define STATIC_KEY_INIT_FALSE ((struct static_key) \
+	{ .enabled = ATOMIC_INIT(0), .entries = (void *)0 })

 #else /* !HAVE_JUMP_LABEL */

 #include <linux/atomic.h>

-#define JUMP_LABEL_INIT {ATOMIC_INIT(0)}
-
-struct jump_label_key {
+struct static_key {
 	atomic_t enabled;
 };
...
 {
 }

-struct jump_label_key_deferred {
-	struct jump_label_key key;
+struct static_key_deferred {
+	struct static_key key;
 };

-static __always_inline bool static_branch(struct jump_label_key *key)
+static __always_inline bool static_key_false(struct static_key *key)
 {
-	if (unlikely(atomic_read(&key->enabled)))
+	if (unlikely(atomic_read(&key->enabled)) > 0)
 		return true;
 	return false;
 }

-static inline void jump_label_inc(struct jump_label_key *key)
+static __always_inline bool static_key_true(struct static_key *key)
+{
+	if (likely(atomic_read(&key->enabled)) > 0)
+		return true;
+	return false;
+}
+
+/* Deprecated. Please use 'static_key_false() instead. */
+static __always_inline bool static_branch(struct static_key *key)
+{
+	if (unlikely(atomic_read(&key->enabled)) > 0)
+		return true;
+	return false;
+}
+
+static inline void static_key_slow_inc(struct static_key *key)
 {
 	atomic_inc(&key->enabled);
 }

-static inline void jump_label_dec(struct jump_label_key *key)
+static inline void static_key_slow_dec(struct static_key *key)
 {
 	atomic_dec(&key->enabled);
 }

-static inline void jump_label_dec_deferred(struct jump_label_key_deferred *key)
+static inline void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
-	jump_label_dec(&key->key);
+	static_key_slow_dec(&key->key);
 }

 static inline int jump_label_text_reserved(void *start, void *end)
...
 static inline void jump_label_lock(void) {}
 static inline void jump_label_unlock(void) {}

-static inline bool jump_label_enabled(struct jump_label_key *key)
+static inline bool static_key_enabled(struct static_key *key)
 {
-	return !!atomic_read(&key->enabled);
+	return (atomic_read(&key->enabled) > 0);
 }

 static inline int jump_label_apply_nops(struct module *mod)
...
 	return 0;
 }

-static inline void jump_label_rate_limit(struct jump_label_key_deferred *key,
+static inline void
+jump_label_rate_limit(struct static_key_deferred *key,
 		unsigned long rl)
 {
 }
+
+#define STATIC_KEY_INIT_TRUE ((struct static_key) \
+	{ .enabled = ATOMIC_INIT(1) })
+#define STATIC_KEY_INIT_FALSE ((struct static_key) \
+	{ .enabled = ATOMIC_INIT(0) })
+
 #endif /* HAVE_JUMP_LABEL */

-#define jump_label_key_enabled ((struct jump_label_key){ .enabled = ATOMIC_INIT(1), })
-#define jump_label_key_disabled ((struct jump_label_key){ .enabled = ATOMIC_INIT(0), })
+#define STATIC_KEY_INIT STATIC_KEY_INIT_FALSE
+#define jump_label_enabled static_key_enabled

 #endif /* _LINUX_JUMP_LABEL_H */