
x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions

According to:

https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html

the use of asm pseudo directives in an asm template can cause the
compiler to wrongly estimate the size of the generated code.

The LOCK_PREFIX macro expands to several asm pseudo directives, so
its use in atomic locking instructions makes the instruction-length
estimate fail badly (a specially instrumented compiler reports the
estimated length of these asm templates as 6 instructions).

This incorrect estimate in turn leads to suboptimal inlining
decisions, suboptimal instruction scheduling and suboptimal code
block alignment for functions that use these locking primitives.

Use asm_inline instead:

https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html

which is a feature that makes GCC treat the inline assembly as tiny
for size-estimation purposes, where it would otherwise consider it
huge.

For code size estimation, the asm is then counted as the minimum
size of one instruction, regardless of how many instructions the
compiler thinks it contains.

bloat-o-meter reports the following code size increase
(x86_64 defconfig, gcc-14.2.1):

add/remove: 82/283 grow/shrink: 870/372 up/down: 76272/-43618 (32654)
Total: Before=22770320, After=22802974, chg +0.14%

with top grows (>500 bytes):

Function old new delta
----------------------------------------------------------------
copy_process 6465 10191 +3726
balance_dirty_pages_ratelimited_flags 237 2949 +2712
icl_plane_update_noarm 5800 7969 +2169
samsung_input_mapping 3375 5170 +1795
ext4_do_update_inode.isra - 1526 +1526
__schedule 2416 3472 +1056
__i915_vma_resource_unhold - 946 +946
sched_mm_cid_after_execve 175 1097 +922
__do_sys_membarrier - 862 +862
filemap_fault 2666 3462 +796
nl80211_send_wiphy 11185 11874 +689
samsung_input_mapping.cold 900 1500 +600
virtio_gpu_queue_fenced_ctrl_buffer 839 1410 +571
ilk_update_pipe_csc 1201 1735 +534
enable_step - 525 +525
icl_color_commit_noarm 1334 1847 +513
tg3_read_bc_ver - 501 +501

and top shrinks (>500 bytes):

Function old new delta
----------------------------------------------------------------
nl80211_send_iftype_data 580 - -580
samsung_gamepad_input_mapping.isra.cold 604 - -604
virtio_gpu_queue_ctrl_sgs 724 - -724
tg3_get_invariants 9218 8376 -842
__i915_vma_resource_unhold.part 899 - -899
ext4_mark_iloc_dirty 1735 106 -1629
samsung_gamepad_input_mapping.isra 2046 - -2046
icl_program_input_csc 2203 - -2203
copy_mm 2242 - -2242
balance_dirty_pages 2657 - -2657

These code size changes can be grouped into 4 groups:

a) some functions now include once-called functions in full or
in part. These are:

Function old new delta
----------------------------------------------------------------
copy_process 6465 10191 +3726
balance_dirty_pages_ratelimited_flags 237 2949 +2712
icl_plane_update_noarm 5800 7969 +2169
samsung_input_mapping 3375 5170 +1795
ext4_do_update_inode.isra - 1526 +1526

that now include:

Function old new delta
----------------------------------------------------------------
copy_mm 2242 - -2242
balance_dirty_pages 2657 - -2657
icl_program_input_csc 2203 - -2203
samsung_gamepad_input_mapping.isra 2046 - -2046
ext4_mark_iloc_dirty 1735 106 -1629

b) ISRA [interprocedural scalar replacement of aggregates: an
interprocedural pass that removes unused function return values
(turning functions whose return value is never used into void
functions) and removes unused function parameters. It can also
replace an aggregate parameter with a set of parameters representing
parts of the original, turning arguments passed by reference into
ones that pass the value directly.]

Top grows and shrinks of this group are listed below:

Function old new delta
----------------------------------------------------------------
ext4_do_update_inode.isra - 1526 +1526
nfs4_begin_drain_session.isra - 249 +249
nfs4_end_drain_session.isra - 168 +168
__guc_action_register_multi_lrc_v70.isra 335 500 +165
__i915_gem_free_objects.isra - 144 +144
...
membarrier_register_private_expedited.isra 108 - -108
syncobj_eventfd_entry_func.isra 445 314 -131
__ext4_sb_bread_gfp.isra 140 - -140
class_preempt_notrace_destructor.isra 145 - -145
p9_fid_put.isra 151 - -151
__mm_cid_try_get.isra 238 - -238
membarrier_global_expedited.isra 294 - -294
mm_cid_get.isra 295 - -295
samsung_gamepad_input_mapping.isra.cold 604 - -604
samsung_gamepad_input_mapping.isra 2046 - -2046

c) different hot/cold split points that just move code around:

Top grows and shrinks of this group are listed below:

Function old new delta
----------------------------------------------------------------
samsung_input_mapping.cold 900 1500 +600
__i915_request_reset.cold 311 389 +78
nfs_update_inode.cold 77 153 +76
__do_sys_swapon.cold 404 455 +51
copy_process.cold - 45 +45
tg3_get_invariants.cold 73 115 +42
...
hibernate.cold 671 643 -28
copy_mm.cold 31 - -31
software_resume.cold 249 207 -42
io_poll_wake.cold 106 54 -52
samsung_gamepad_input_mapping.isra.cold 604 - -604

d) full inlining of small functions with a locking insn (~150 cases).
These account for most of the code size increase, because the removed
function body is now inlined in multiple places. E.g.:

0000000000a50e10 <release_devnum>:
a50e10: 48 63 07 movslq (%rdi),%rax
a50e13: 85 c0 test %eax,%eax
a50e15: 7e 10 jle a50e27 <release_devnum+0x17>
a50e17: 48 8b 4f 50 mov 0x50(%rdi),%rcx
a50e1b: f0 48 0f b3 41 50 lock btr %rax,0x50(%rcx)
a50e21: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
a50e27: e9 00 00 00 00 jmp a50e2c <release_devnum+0x1c>
a50e28: R_X86_64_PLT32 __x86_return_thunk-0x4
a50e2c: 0f 1f 40 00 nopl 0x0(%rax)

is now fully inlined into its callers. This is desirable due to the
per-function call overhead of CPU bug mitigations such as retpolines.

FTR a) with -Os (where generated code size really matters), the
x86_64 defconfig kernel text shrinks by 24,388 bytes, a 0.1% code
size decrease:

text data bss dec hex filename
23883860 4617284 814212 29315356 1bf511c vmlinux-old.o
23859472 4615404 814212 29289088 1beea80 vmlinux-new.o

FTR b) clang recognizes "asm inline", but there was no difference in
code sizes:

text data bss dec hex filename
27577163 4503078 807732 32887973 1f5d4a5 vmlinux-clang-patched.o
27577181 4503078 807732 32887991 1f5d4b7 vmlinux-clang-unpatched.o

The performance impact of the patch was assessed by recompiling the
Fedora 41 6.13.5 kernel and running lmbench with the old and new
kernels. The most noticeable changes were, from:

Process fork+exit: 270.0952 microseconds
Process fork+execve: 2620.3333 microseconds
Process fork+/bin/sh -c: 6781.0000 microseconds
File /usr/tmp/XXX write bandwidth: 1780350 KB/sec
Pagefaults on /usr/tmp/XXX: 0.3875 microseconds

to:

Process fork+exit: 298.6842 microseconds
Process fork+execve: 1662.7500 microseconds
Process fork+/bin/sh -c: 2127.6667 microseconds
File /usr/tmp/XXX write bandwidth: 1950077 KB/sec
Pagefaults on /usr/tmp/XXX: 0.1958 microseconds

and from:

Socket bandwidth using localhost
0.000001 2.52 MB/sec
0.000064 163.02 MB/sec
0.000128 321.70 MB/sec
0.000256 630.06 MB/sec
0.000512 1207.07 MB/sec
0.001024 2004.06 MB/sec
0.001437 2475.43 MB/sec
10.000000 5817.34 MB/sec

Avg xfer: 3.2KB, 41.8KB in 1.2230 millisecs, 34.15 MB/sec
AF_UNIX sock stream bandwidth: 9850.01 MB/sec
Pipe bandwidth: 4631.28 MB/sec

to:

Socket bandwidth using localhost
0.000001 3.13 MB/sec
0.000064 187.08 MB/sec
0.000128 324.12 MB/sec
0.000256 618.51 MB/sec
0.000512 1137.13 MB/sec
0.001024 1962.95 MB/sec
0.001437 2458.27 MB/sec
10.000000 6168.08 MB/sec

Avg xfer: 3.2KB, 41.8KB in 1.0060 millisecs, 41.52 MB/sec
AF_UNIX sock stream bandwidth: 9921.68 MB/sec
Pipe bandwidth: 4649.96 MB/sec

[ mingo: Prettified the changelog a bit. ]

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20250309170955.48919-1-ubizjak@gmail.com

Commit: faa6f77b f685a96b

38 insertions(+), 38 deletions(-)
arch/x86/include/asm/atomic.h (+7 -7):

@@ arch_atomic_add @@
-	asm volatile(LOCK_PREFIX "addl %1,%0"
+	asm_inline volatile(LOCK_PREFIX "addl %1, %0"
@@ arch_atomic_sub @@
-	asm volatile(LOCK_PREFIX "subl %1,%0"
+	asm_inline volatile(LOCK_PREFIX "subl %1, %0"
@@ arch_atomic_inc @@
-	asm volatile(LOCK_PREFIX "incl %0"
+	asm_inline volatile(LOCK_PREFIX "incl %0"
@@ arch_atomic_dec @@
-	asm volatile(LOCK_PREFIX "decl %0"
+	asm_inline volatile(LOCK_PREFIX "decl %0"
@@ arch_atomic_and @@
-	asm volatile(LOCK_PREFIX "andl %1,%0"
+	asm_inline volatile(LOCK_PREFIX "andl %1, %0"
@@ arch_atomic_or @@
-	asm volatile(LOCK_PREFIX "orl %1,%0"
+	asm_inline volatile(LOCK_PREFIX "orl %1, %0"
@@ arch_atomic_xor @@
-	asm volatile(LOCK_PREFIX "xorl %1,%0"
+	asm_inline volatile(LOCK_PREFIX "xorl %1, %0"
arch/x86/include/asm/atomic64_64.h (+7 -7):

@@ arch_atomic64_add @@
-	asm volatile(LOCK_PREFIX "addq %1,%0"
+	asm_inline volatile(LOCK_PREFIX "addq %1, %0"
@@ arch_atomic64_sub @@
-	asm volatile(LOCK_PREFIX "subq %1,%0"
+	asm_inline volatile(LOCK_PREFIX "subq %1, %0"
@@ arch_atomic64_inc @@
-	asm volatile(LOCK_PREFIX "incq %0"
+	asm_inline volatile(LOCK_PREFIX "incq %0"
@@ arch_atomic64_dec @@
-	asm volatile(LOCK_PREFIX "decq %0"
+	asm_inline volatile(LOCK_PREFIX "decq %0"
@@ arch_atomic64_and @@
-	asm volatile(LOCK_PREFIX "andq %1,%0"
+	asm_inline volatile(LOCK_PREFIX "andq %1, %0"
@@ arch_atomic64_or @@
-	asm volatile(LOCK_PREFIX "orq %1,%0"
+	asm_inline volatile(LOCK_PREFIX "orq %1, %0"
@@ arch_atomic64_xor @@
-	asm volatile(LOCK_PREFIX "xorq %1,%0"
+	asm_inline volatile(LOCK_PREFIX "xorq %1, %0"
arch/x86/include/asm/bitops.h (+7 -7):

@@ arch_set_bit @@
-		asm volatile(LOCK_PREFIX "orb %b1,%0"
+		asm_inline volatile(LOCK_PREFIX "orb %b1,%0"
-		asm volatile(LOCK_PREFIX __ASM_SIZE(bts) " %1,%0"
+		asm_inline volatile(LOCK_PREFIX __ASM_SIZE(bts) " %1,%0"
@@ arch_clear_bit @@
-		asm volatile(LOCK_PREFIX "andb %b1,%0"
+		asm_inline volatile(LOCK_PREFIX "andb %b1,%0"
-		asm volatile(LOCK_PREFIX __ASM_SIZE(btr) " %1,%0"
+		asm_inline volatile(LOCK_PREFIX __ASM_SIZE(btr) " %1,%0"
@@ arch_xor_unlock_is_negative_byte @@
-	asm volatile(LOCK_PREFIX "xorb %2,%1"
+	asm_inline volatile(LOCK_PREFIX "xorb %2,%1"
@@ arch_change_bit @@
-		asm volatile(LOCK_PREFIX "xorb %b1,%0"
+		asm_inline volatile(LOCK_PREFIX "xorb %b1,%0"
-		asm volatile(LOCK_PREFIX __ASM_SIZE(btc) " %1,%0"
+		asm_inline volatile(LOCK_PREFIX __ASM_SIZE(btc) " %1,%0"
arch/x86/include/asm/cmpxchg.h (+12 -12):

@@ __xchg_op @@
-		asm volatile (lock #op "b %b0, %1\n"	\
+		asm_inline volatile (lock #op "b %b0, %1"	\
-		asm volatile (lock #op "w %w0, %1\n"	\
+		asm_inline volatile (lock #op "w %w0, %1"	\
-		asm volatile (lock #op "l %0, %1\n"	\
+		asm_inline volatile (lock #op "l %0, %1"	\
-		asm volatile (lock #op "q %q0, %1\n"	\
+		asm_inline volatile (lock #op "q %q0, %1"	\
@@ __raw_cmpxchg @@
-		asm volatile(lock "cmpxchgb %2,%1"	\
+		asm_inline volatile(lock "cmpxchgb %2, %1"	\
-		asm volatile(lock "cmpxchgw %2,%1"	\
+		asm_inline volatile(lock "cmpxchgw %2, %1"	\
-		asm volatile(lock "cmpxchgl %2,%1"	\
+		asm_inline volatile(lock "cmpxchgl %2, %1"	\
-		asm volatile(lock "cmpxchgq %2,%1"	\
+		asm_inline volatile(lock "cmpxchgq %2, %1"	\
@@ __raw_try_cmpxchg @@
-		asm volatile(lock "cmpxchgb %[new], %[ptr]"	\
+		asm_inline volatile(lock "cmpxchgb %[new], %[ptr]"	\
-		asm volatile(lock "cmpxchgw %[new], %[ptr]"	\
+		asm_inline volatile(lock "cmpxchgw %[new], %[ptr]"	\
-		asm volatile(lock "cmpxchgl %[new], %[ptr]"	\
+		asm_inline volatile(lock "cmpxchgl %[new], %[ptr]"	\
-		asm volatile(lock "cmpxchgq %[new], %[ptr]"	\
+		asm_inline volatile(lock "cmpxchgq %[new], %[ptr]"	\
arch/x86/include/asm/cmpxchg_32.h (+2 -2):

@@ __arch_cmpxchg64 @@
-	asm volatile(_lock "cmpxchg8b %[ptr]"		\
+	asm_inline volatile(_lock "cmpxchg8b %[ptr]"	\
@@ __arch_try_cmpxchg64 @@
-	asm volatile(_lock "cmpxchg8b %[ptr]"		\
+	asm_inline volatile(_lock "cmpxchg8b %[ptr]"	\
arch/x86/include/asm/cmpxchg_64.h (+2 -2):

@@ __arch_cmpxchg128 @@
-	asm volatile(_lock "cmpxchg16b %[ptr]"		\
+	asm_inline volatile(_lock "cmpxchg16b %[ptr]"	\
@@ __arch_try_cmpxchg128 @@
-	asm volatile(_lock "cmpxchg16b %[ptr]"		\
+	asm_inline volatile(_lock "cmpxchg16b %[ptr]"	\
arch/x86/include/asm/rmwcc.h (+1 -1):

@@ __GEN_RMWcc @@
-	asm volatile (fullop CC_SET(cc)			\
+	asm_inline volatile (fullop CC_SET(cc)		\