commits

dlserver can get dequeued during a dlserver pick_task due to the delayed
dequeue feature, and this can lead to issues with the dlserver logic, as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused, which can lead to a double enqueue of
dlserver.

Double enqueue of dlserver could happen for a couple of reasons:

Case 1
------

The delayed dequeue feature[1] can cause dlserver to be stopped during a
pick initiated by dlserver:
  __pick_next_task
   pick_task_dl -> server_pick_task
    pick_task_fair
     pick_next_entity (if (sched_delayed))
      dequeue_entities
       dl_server_stop

server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.

Case 2
------
A race condition between a task dequeue on one CPU and the same task's
enqueue on that CPU by a remote CPU, while the lock is released, causes a
dlserver double enqueue.

One CPU would be in schedule() and releasing the RQ-lock:

current->state = TASK_INTERRUPTIBLE;
        schedule();
          deactivate_task()
            dl_server_stop();
          pick_next_task()
            pick_next_task_fair()
              sched_balance_newidle()
                rq_unlock(this_rq)

at which point another CPU can take our RQ-lock and do:

try_to_wake_up()
  ttwu_queue()
    rq_lock()
    ...
    activate_task()
      dl_server_start() --> first enqueue
    wakeup_preempt() := check_preempt_wakeup_fair()
      update_curr()
        update_curr_task()
          if (current->dl_server)
            dl_server_update()
              enqueue_dl_entity() --> second enqueue

This bug was not apparent because the enqueue in dl_server_start doesn't
usually happen, thanks to the defer logic. But as a side effect of the
first case (dequeue during dlserver pick), dl_throttled and dl_yielded will
be set, and this messes up the time accounting of dlserver, which then
leads to an enqueue in dl_server_start.

Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dl_server_stop.
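
As a rough illustration of the explicit-flag idea, here is a minimal
user-space sketch. It is a toy model and not the kernel code: the struct and
every toy_* name are hypothetical, and the runqueue is reduced to a counter.
The point it tries to show is that the accounting path bails out when the
server is stopped, so a later start cannot stack a second enqueue on top of
an unintended one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a deadline-server entity; not the kernel's struct. */
struct toy_dl_server {
	bool active;	/* explicit status flag: set on start, cleared on stop */
	int enqueued;	/* number of times the entity is on the toy runqueue */
};

/* Hypothetical enqueue helper; a double enqueue trips the assert. */
static void toy_enqueue(struct toy_dl_server *s)
{
	assert(s->enqueued == 0);
	s->enqueued++;
}

static void toy_server_start(struct toy_dl_server *s)
{
	if (s->active)
		return;		/* already running: nothing to enqueue */
	s->active = true;
	toy_enqueue(s);
}

static void toy_server_stop(struct toy_dl_server *s)
{
	if (!s->active)
		return;
	s->enqueued = 0;
	s->active = false;
}

/* Runtime exhausted: leave the runqueue but stay active. */
static void toy_server_throttle(struct toy_dl_server *s)
{
	if (s->active)
		s->enqueued = 0;
}

/* Accounting/replenish path: re-enqueue only if the server is active. */
static void toy_server_update(struct toy_dl_server *s)
{
	if (!s->active)
		return;		/* stopped during the pick: leave it alone */
	if (s->enqueued == 0)
		toy_enqueue(s);
}
```

Without the `active` check in toy_server_update(), an update on a stopped
server would enqueue it, and the next toy_server_start() would enqueue it a
second time.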

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org

1y ago

Linus Torvalds

c25ca0c2

Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1y ago

Geert Uytterhoeven

9151299e

irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing

1y ago

Thomas Gleixner

79124056

modpost: Add .irqentry.text to OTHER_SECTIONS

1y ago

Linus Torvalds

88862eeb

Merge tag 'trace-printf-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

1y ago

Lukas Bulwahn

caf4bdb5

MAINTAINERS: fix typo in I2C OF COMPONENT PROBER

1y ago

Benjamin Szőke

c0cd2941

arc: rename aux.h to arc_aux.h

1y ago

Linus Torvalds

35f301dd

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

- Fix a bug in the BPF verifier to track changes to packet data
property for global functions (Eduard Zingerman)

- Fix a theoretical BPF prog_array use-after-free in RCU handling of
__uprobe_perf_func (Jann Horn)

- Fix BPF tracing to have an explicit list of tracepoints and their
arguments which need to be annotated as PTR_MAYBE_NULL (Kumar
Kartikeya Dwivedi)

- Fix a logic bug in the bpf_remove_insns code where a potential error
would have been wrongly propagated (Anton Protopopov)

- Avoid deadlock scenarios caused by nested kprobe and fentry BPF
programs (Priya Bala Govindasamy)

- Fix a bug in BPF verifier which was missing a size check for
BTF-based context access (Kumar Kartikeya Dwivedi)

- Fix a crash found by syzbot through an invalid BPF prog_array access
in perf_event_detach_bpf_prog (Jiri Olsa)

- Fix several BPF sockmap bugs including a race causing a refcount
imbalance upon element replace (Michal Luczaj)

- Fix a use-after-free from mismatching BPF program/attachment RCU
flavors (Jann Horn)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
selftests/bpf: Add tests for raw_tp NULL args
bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
selftests/bpf: Add test for narrow ctx load for pointer args
bpf: Check size for BTF-based ctx access of pointer members
selftests/bpf: extend changes_pkt_data with cases w/o subprograms
bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
bpf: fix potential error return
selftests/bpf: validate that tail call invalidates packet pointers
bpf: consider that tail calls invalidate packet pointers
selftests/bpf: freplace tests for tracking of changes_packet_data
bpf: check changes_pkt_data property for extension programs
selftests/bpf: test for changing packet data from global functions
bpf: track changes_pkt_data property for global functions
bpf: refactor bpf_helper_changes_pkt_data to use helper number
bpf: add find_containing_subprog() utility function
bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
...

1y ago

liuderong

f103396a

scsi: ufs: core: Update compl_time_stamp_local_clock after completing a cqe

1y ago

Sean Christopherson

1201f226

KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init

Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initialization to
avoid the overhead of CPUID during runtime. The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves do not depend on XCR0 or XSS values, i.e.
are constant for a given CPU, and thus can be cached during module load.
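
The caching pattern can be sketched in plain C as below. This is only an
illustration under stated assumptions, not KVM's implementation: all toy_*
names are hypothetical, toy_slow_query() stands in for executing a
CPUID.0xD sub-leaf and returns made-up sizes and offsets, the bound of 19
sub-leaves is assumed, and only the standard (non-compacted) XSAVE layout is
modeled.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_XSTATE 19	/* sub-leaves 0..18; assumed bound for the sketch */

struct toy_xstate {
	uint32_t size;
	uint32_t offset;
};

static struct toy_xstate toy_cache[TOY_NR_XSTATE];
static unsigned int toy_slow_queries;	/* counts simulated CPUID executions */

/* Hypothetical stand-in for CPUID.0xD.<i>; the values are made up. */
static struct toy_xstate toy_slow_query(unsigned int i)
{
	struct toy_xstate x = { 64u * i, 576u + 64u * i };

	toy_slow_queries++;
	return x;
}

/* Run once at init: pay the slow query a single time per sub-leaf. */
static void toy_cache_init(void)
{
	for (unsigned int i = 2; i < TOY_NR_XSTATE; i++)
		toy_cache[i] = toy_slow_query(i);
}

/* Hot path: largest offset + size over enabled features, from cache only. */
static uint32_t toy_required_size(uint64_t features)
{
	uint32_t ret = 576;	/* 512-byte legacy area + 64-byte XSAVE header */

	for (unsigned int i = 2; i < TOY_NR_XSTATE; i++) {
		uint32_t end;

		if (!(features & (1ULL << i)))
			continue;
		end = toy_cache[i].offset + toy_cache[i].size;
		if (end > ret)
			ret = end;
	}
	return ret;
}
```

After toy_cache_init() has run, repeated calls to toy_required_size() never
increment toy_slow_queries, which is the property the commit relies on to
keep CPUID out of the nested transition path.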

On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values. The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.

As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:

SKX 11650
ICX 22350
EMR 28850

Using cached values, the latency drops to:

SKX 6850
ICX 9000
EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR. The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leaves is significantly slower).

SKX:
CPUID.0xD.2 = 348 cycles
CPUID.0xD.3 = 400 cycles
CPUID.0xD.4 = 276 cycles
CPUID.0xD.5 = 236 cycles
<other sub-leaves are similar>

EMR:
CPUID.0xD.2 = 1138 cycles
CPUID.0xD.3 = 1362 cycles
CPUID.0xD.4 = 1068 cycles
CPUID.0xD.5 = 910 cycles
CPUID.0xD.6 = 914 cycles
CPUID.0xD.7 = 1350 cycles
CPUID.0xD.8 = 734 cycles
CPUID.0xD.9 = 766 cycles
CPUID.0xD.10 = 732 cycles
CPUID.0xD.11 = 718 cycles
CPUID.0xD.12 = 734 cycles
CPUID.0xD.13 = 1700 cycles
CPUID.0xD.14 = 1126 cycles
CPUID.0xD.15 = 898 cycles
CPUID.0xD.16 = 716 cycles
CPUID.0xD.17 = 748 cycles
CPUID.0xD.18 = 776 cycles

Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandatory
intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2. That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1y ago

Michael Neuling

ea6398a5

RISC-V: KVM: Fix csr_write -> csr_set for HVIEN PMU overflow bit

1y ago

Peter Zijlstra

76f2f783

sched/eevdf: More PELT vs DELAYED_DEQUEUE

1y ago

Linus Torvalds

84262262

Merge tag 'x86_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1y ago

Thomas Gleixner

76031d95

clocksource: Make negative motion detection more robust

1y ago

Thomas Gleixner

9d9f204b

genirq/proc: Add missing space separator back

1y ago

Linus Torvalds

f788b5ef

Merge tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1y ago

Dr. David Alan Gilbert

f69e6375

printf: Remove unused 'bprintf'

1y ago

Chen-Yu Tsai

0d40daa1

of: base: Document prefix argument for of_get_next_child_with_prefix()

1y ago

Linus Torvalds

a0e3919a

Merge tag 'usb-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

1y ago

Priya Bala Govindasamy

c83508da

bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs

1y ago

John Garry

6918141d

scsi: scsi_debug: Fix hrtimer support for ndelay

1y ago

Paolo Bonzini

3154bddf

Merge tag 'kvmarm-fixes-6.13-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

1y ago

Vincent Guittot

c1f43c34

sched/fair: Fix sched_can_stop_tick() for fair tasks

1y ago

Linus Torvalds

553c89ec

Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

1y ago

Sean Christopherson

49207766

x86/CPU/AMD: WARN when setting EFER.AUTOIBRS if and only if the WRMSR fails

1y ago

Linus Torvalds

e70140ba

Get rid of 'remove_new' relic from platform driver struct

1y ago

Stefan Wahren

ee3878b8

irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND

1y ago

Linux 6.13-rc3 v6.13-rc3

78d4f34e

Linus Torvalds

Merge tag 'arc-6.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

42a19aa1

Linus Torvalds

Merge tag 'efi-fixes-for-v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

7031a38a

Linus Torvalds

ARC: build: Try to guess GCC variant of cross compiler

824927e8

Leon Romanovsky

Merge tag 'i2c-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

151167d8

Linus Torvalds

efi/esrt: remove esre_attribute::store()

145ac100

Jiri Slaby (SUSE)

ARC: bpf: Correct conditional check in 'check_jmp_32'

7dd9eb6b

Hardevsinh Palaniya

Merge tag 'edac_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

dccbe204

Linus Torvalds

Merge tag 'i2c-host-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current

5b6b08af

Wolfram Sang

efivarfs: Fix error on non-existent file

2ab0837c

James Bottomley

ARC: dts: Replace deprecated snps,nr-gpios property for snps,dw-apb-gpio-port devices

4d93ffe6

Uwe Kleine-König

Merge tag 'irq_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

f7c7a1ba

Linus Torvalds

EDAC/amd64: Simplify ECC check on unified memory controllers

74736734

Borislav Petkov (AMD)

Linux 6.13-rc2 v6.13-rc2

fac04efc

Linus Torvalds

i2c: riic: Always round-up when calculating bus period

de6b4379

Geert Uytterhoeven

efi/zboot: Limit compression options to GZIP and ZSTD

0b2c29fb

Ard Biesheuvel

ARC: build: Use __force to suppress per-CPU cmpxchg warnings

1e8af9f0

Paul E. McKenney

Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

acd855a9

Linus Torvalds

irqchip/gic-v3: Work around insecure GIC integrations

773c05f4

Marc Zyngier

Merge tag 'kbuild-fixes-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

0b6809a7

Linus Torvalds

i2c: nomadik: Add missing sentinel to match table

5751eee5

Geert Uytterhoeven

Linux 6.13-rc1 v6.13-rc1

40384c84

Linus Torvalds

ARC: fix reference of dependency for PAE40 config

dd2b2302

Lukas Bulwahn

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

81576a9a

Linus Torvalds

sched/dlserver: Fix dlserver time accounting

c7f7e9c7

Vineeth Pillai (Google)

irqchip/gic: Correct declaration of *percpu_base pointer in union gic_base

a1855f1b

Uros Bizjak

Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

eadaac4d

Linus Torvalds

kbuild: deb-pkg: fix build error with O=

d8d326d6

Masahiro Yamada

i2c: pnx: Fix timeout in wait functions

7363f2d4

Vladimir Riabchun

Merge tag 'i2c-for-6.13-rc1-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

a14bf463

Linus Torvalds

ARC: build: disallow invalid PAE40 + 4K page config

8871331b

Vineet Gupta

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

2d8308bf

Linus Torvalds

Merge tag 'kvm-riscv-fixes-6.13-1' of https://github.com/kvm-riscv/linux into HEAD

3522c419

Paolo Bonzini

sched/dlserver: Fix dlserver double enqueue

b53127db

Vineeth Pillai (Google)

Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

c25ca0c2

Linus Torvalds

irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing

9151299e

Geert Uytterhoeven

modpost: Add .irqentry.text to OTHER_SECTIONS

79124056

Thomas Gleixner

Merge tag 'trace-printf-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

88862eeb

Linus Torvalds

MAINTAINERS: fix typo in I2C OF COMPONENT PROBER

caf4bdb5

Lukas Bulwahn

arc: rename aux.h to arc_aux.h

c0cd2941

Benjamin Szőke

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

35f301dd

Linus Torvalds

scsi: ufs: core: Update compl_time_stamp_local_clock after completing a cqe

f103396a

liuderong

KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init

1201f226

Sean Christopherson

RISC-V: KVM: Fix csr_write -> csr_set for HVIEN PMU overflow bit

ea6398a5

Michael Neuling

sched/eevdf: More PELT vs DELAYED_DEQUEUE

76f2f783

Peter Zijlstra

Merge tag 'x86_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

84262262

Linus Torvalds

clocksource: Make negative motion detection more robust

Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a
24-bit wide clocksource.

It turns out that the calculated maximal idle time, which limits idle
sleeps to prevent clocksource wrap arounds, is close to the point where the
negative motion detection triggers.

max_idle_ns: 597268854 ns
negative motion tripping point: 671088640 ns

If the idle wakeup is delayed beyond that point, the clocksource
advances far enough to trigger the negative motion detection. This
prevents the clock from advancing, and in the worst case the system stalls
completely if the consecutive sleeps based on the stale clock are
delayed as well.

Cure this by calculating a more robust cut-off value for negative motion,
which covers 87.5% of the actual clocksource counter width. Compare the
delta against this value to catch negative motion. This is specifically for
clock sources with a small counter width as their wrap around time is close
to the half counter width. For clock sources with wide counters this is not
a problem because the maximum idle time is far from the half counter width
due to the math overflow protection constraints.

For the case at hand this results in a tripping point of 1174405120ns.
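
The two tripping points quoted in this message can be reproduced with a few
lines of arithmetic. The sketch below is not the kernel's code: the function
names are hypothetical, and the 80 ns per cycle conversion is an assumption
inferred from the numbers above (671088640 ns divided by 2^23 cycles). The
kernel derives its cut-off from the clocksource mask, so its exact value may
differ by a few cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Half of the counter range, in cycles (the old tripping point). */
static uint64_t half_width_cycles(unsigned int bits)
{
	assert(bits >= 4 && bits < 64);
	return 1ULL << (bits - 1);
}

/* 87.5% of the counter range: 1/2 + 1/4 + 1/8 of 2^bits cycles. */
static uint64_t cutoff_cycles(unsigned int bits)
{
	assert(bits >= 4 && bits < 64);
	return (1ULL << (bits - 1)) + (1ULL << (bits - 2)) +
	       (1ULL << (bits - 3));
}

/* Convert cycles to nanoseconds with a fixed, assumed cycle time. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t ns_per_cycle)
{
	return cycles * ns_per_cycle;
}
```

For a 24-bit counter at the assumed 80 ns per cycle,
cycles_to_ns(half_width_cycles(24), 80) gives 671088640 and
cycles_to_ns(cutoff_cycles(24), 80) gives 1174405120, matching the values
in the text.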

Note that this cannot prevent issues when the delay exceeds the 87.5%
margin, but that's not different from the previous unchecked version which
allowed arbitrary time jumps.

Systems with small counter width are prone to invalid results, but this
problem is unlikely to be seen on real hardware. If such a system
completely stalls for more than half a second, then there are other more
urgent problems than the counter wrapping around.

Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx
Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net

76031d95

Thomas Gleixner

genirq/proc: Add missing space separator back

9d9f204b

Thomas Gleixner

Merge tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

f788b5ef

Linus Torvalds

printf: Remove unused 'bprintf'

f69e6375

Dr. David Alan Gilbert

of: base: Document prefix argument for of_get_next_child_with_prefix()

0d40daa1

Chen-Yu Tsai

Merge tag 'usb-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

a0e3919a

Linus Torvalds

bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs

BPF program types like kprobe and fentry can cause deadlocks in certain
situations. If a function takes a lock and one of these bpf programs is
hooked to some point in the function's critical section, and if the
bpf program tries to call the same function and take the same lock, it will
lead to a deadlock. These situations have been reported in the following
bug reports.

In percpu_freelist -
Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/
In bpf_lru_list -
Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/

Similar bugs have been reported by syzbot.
In queue_stack_maps -
Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/
Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/
In lpm_trie -
Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/
In ringbuf -
Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/

Prevent kprobe and fentry bpf programs from attaching to these critical
sections by removing CC_FLAGS_FTRACE for percpu_freelist.o,
bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, ringbuf.o files.

The bugs reported by syzbot are due to tracepoint bpf programs being
called in the critical sections. This patch does not aim to fix deadlocks
caused by tracepoint programs. However, it does prevent deadlocks from
occurring in similar situations due to kprobe and fentry programs.

Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>