commits

When offlining and onlining CPUs the overall reported idle and iowait
times as reported by /proc/stat jump backward and forward:

cpu 132 0 176 225249 47 6 6 21 0 0
cpu0 80 0 115 112575 33 3 4 18 0 0
cpu1 52 0 60 112673 13 3 1 2 0 0

cpu 133 0 177 226681 47 6 6 21 0 0
cpu0 80 0 116 113387 33 3 4 18 0 0

cpu 133 0 178 114431 33 6 6 21 0 0 <---- jump backward
cpu0 80 0 116 114247 33 3 4 18 0 0
cpu1 52 0 61 183 0 3 1 2 0 0 <---- idle + iowait start with 0

cpu 133 0 178 228956 47 6 6 21 0 0 <---- jump forward
cpu0 81 0 117 114929 33 3 4 18 0 0

Reason for this is that get_idle_time() in fs/proc/stat.c has different
sources for both values depending on if a CPU is online or offline:

- if a CPU is online the values may be taken from its per cpu
tick_cpu_sched structure

- if a CPU is offline the values are taken from its per cpu cpustat
structure

The problem is that the per cpu tick_cpu_sched structure is set to zero on
CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.

Therefore when a CPU is brought offline and online afterwards both its idle
and iowait sleeptime will be zero, causing a jump backward in total system
idle and iowait sleeptime. In a similar way if a CPU is then brought
offline again the total idle and iowait sleeptimes will jump forward.

It looks like this behavior was introduced with commit 4b0c0f294f60
("tick: Cleanup NOHZ per cpu data on cpu down").

This was only noticed now on s390, since we switched to generic idle time
reporting with commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").

Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
members of the per-cpu tick_sched structure on CPU hotplug.

Fixes: 4b0c0f294f60 ("tick: Cleanup NOHZ per cpu data on cpu down")
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com

2y ago

Kent Overstreet

d826cc57

bcachefs: logged_ops_format.h

2y ago

Linus Torvalds

2368fcf3

Merge tag 'header_cleanup-2024-01-20' of https://evilpiepirate.org/git/bcachefs

2y ago

Michael Ellerman

18f14afe

powerpc/64s: Increase default stack size to 32KB

2y ago

Thomas Gleixner

80fe58cc

Merge tag 'timers-v6.8-rc1' of http://git.linaro.org/people/daniel.lezcano/linux into timers/core

2y ago

Kent Overstreet

8d52ba60

bcachefs: reflink_format.h

2y ago

Linus Torvalds

7a396820

Merge tag 'v6.8-rc-part2-smb-client' of git://git.samba.org/sfrench/cifs-2.6

2y ago

Leonardo Bras

5f4c01f1

spinlock: Fix failing build for PREEMPT_RT

2y ago

Michael Ellerman

d2441d3e

MAINTAINERS: powerpc: Add Aneesh & Naveen

2y ago

Anna-Maria Behnsen

da65f29d

timers: Fix nextevt calculation when no timers are pending

2y ago

Arnd Bergmann

c0c4579d

clocksource/drivers/ep93xx: Fix error handling during probe

2y ago

Kent Overstreet

b2fa1b63

bcachefs; extents_format.h

2y ago

Linus Torvalds

65163d16

Merge tag 'dmaengine-fix-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine

Pull dmaengine updates from Vinod Koul:
"New support:
- Loongson LS2X APB DMA controller
- sf-pdma: mpfs-pdma support
- Qualcomm X1E80100 GPI dma controller support

Updates:
- Xilinx XDMA updates to support interleaved DMA transfers
- TI PSIL threads for AM62P and J722S and cfg register regions
description
- axi-dmac Improving the cyclic DMA transfers
- Tegra Support dma-channel-mask property
- Remaining platform remove callback returning void conversions

Driver fixes for:
- Xilinx xdma driver operator precedence and initialization fix
- Excess kernel-doc warning fix in imx-sdma xilinx xdma drivers
- format-overflow warning fix for rz-dmac, sh usb dmac drivers
- 'output may be truncated' fix for shdma, fsl-qdma and dw-edma
drivers"

* tag 'dmaengine-fix-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (58 commits)
dmaengine: dw-edma: increase size of 'name' in debugfs code
dmaengine: fsl-qdma: increase size of 'irq_name'
dmaengine: shdma: increase size of 'dev_id'
dmaengine: xilinx: xdma: Fix kernel-doc warnings
dmaengine: usb-dmac: Avoid format-overflow warning
dmaengine: sh: rz-dmac: Avoid format-overflow warning
dmaengine: imx-sdma: fix Excess kernel-doc warnings
dmaengine: xilinx: xdma: Fix initialization location of desc in xdma_channel_isr()
dmaengine: xilinx: xdma: Fix operator precedence in xdma_prep_interleaved_dma()
dmaengine: xilinx: xdma: statify xdma_prep_interleaved_dma
dmaengine: xilinx: xdma: Workaround truncation compilation error
dmaengine: pl330: issue_pending waits until WFP state
dmaengine: xilinx: xdma: Implement interleaved DMA transfers
dmaengine: xilinx: xdma: Prepare the introduction of interleaved DMA transfers
dmaengine: xilinx: xdma: Add transfer error reporting
dmaengine: xilinx: xdma: Add error checking in xdma_channel_isr()
dmaengine: xilinx: xdma: Rework xdma_terminate_all()
dmaengine: xilinx: xdma: Ease dma_pool alignment requirements
dmaengine: xilinx: xdma: Add necessary macro definitions
dmaengine: xilinx: xdma: Get rid of unused code
...

2y ago

Shyam Prasad N

78e727e5

cifs: update iface_last_update on each query-and-update

2y ago

Kent Overstreet

1e2f2d31

Kill sched.h dependency on rcupdate.h

2y ago

Haren Myneni

0cf72f7f

powerpc/pseries/vas: Migration suspend waits for no in-progress open windows

The hypervisor returns migration failure if all VAS windows are not
closed. During pre-migration stage, vas_migration_handler() sets
migration_in_progress flag and closes all windows from the list.
The allocate VAS window routine checks the migration flag, setup
the window and then add it to the list. So there is possibility of
the migration handler missing the window that is still in the
process of setup.

t1: Allocate and open VAS t2: Migration event
window

lock vas_pseries_mutex
If migration_in_progress set
unlock vas_pseries_mutex
return
open window HCALL
unlock vas_pseries_mutex
Modify window HCALL lock vas_pseries_mutex
setup window migration_in_progress=true
Closes all windows from the list
// May miss windows that are
// not in the list
unlock vas_pseries_mutex
lock vas_pseries_mutex return
if nr_closed_windows == 0
// No DLPAR CPU or migration
add window to the list
// Window will be added to the
// list after the setup is completed
unlock vas_pseries_mutex
return
unlock vas_pseries_mutex
Close VAS window
// due to DLPAR CPU or migration
return -EBUSY

This patch resolves the issue with the following steps:
- Set the migration_in_progress flag without holding mutex.
- Introduce nr_open_wins_progress counter in VAS capabilities
struct
- This counter tracks the number of open windows are still in
progress
- The allocate setup window thread closes windows if the migration
is set and decrements nr_open_window_progress counter
- The migration handler waits for no in-progress open windows.

The code flow with the fix is as follows:

t1: Allocate and open VAS t2: Migration event
window

lock vas_pseries_mutex
If migration_in_progress set
unlock vas_pseries_mutex
return
open window HCALL
nr_open_wins_progress++
// Window opened, but not
// added to the list yet
unlock vas_pseries_mutex
Modify window HCALL migration_in_progress=true
setup window lock vas_pseries_mutex
Closes all windows from the list
While nr_open_wins_progress {
unlock vas_pseries_mutex
lock vas_pseries_mutex sleep
if nr_closed_windows == 0 // Wait if any open window in
or migration is not started // progress. The open window
// No DLPAR CPU or migration // thread closes the window without
add window to the list // adding to the list and return if
nr_open_wins_progress-- // the migration is in progress.
unlock vas_pseries_mutex
return
Close VAS window
nr_open_wins_progress--
unlock vas_pseries_mutex
return -EBUSY lock vas_pseries_mutex
}
unlock vas_pseries_mutex
return

Fixes: 37e6764895ef ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231125235104.3405008-1-haren@linux.ibm.com

2y ago

Thomas Gleixner

bb8caad5

timers: Rework idle logic

2y ago

Randy Dunlap

0515c734

clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings

2y ago

Kent Overstreet

0560eb9a

bcachefs: ec_format.h

2y ago

Linus Torvalds

80fc600f

Merge tag 'coccinelle-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux

2y ago

Vinod Koul

cb95a4fa

dmaengine: dw-edma: increase size of 'name' in debugfs code

2y ago

Shyam Prasad N

f591062b

cifs: handle servers that still advertise multichannel after disabling

2y ago

Kent Overstreet

e717ceb5

kill unnecessary thread_info.h include

2y ago

Naveen N Rao

4b3338aa

powerpc/ftrace: Fix stack teardown in ftrace_no_trace

2y ago

Anna-Maria Behnsen

7a39a508

timers: Use already existing function for forwarding timer base

2y ago

Tony Lindgren

b99a212a

clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings

2y ago

Kent Overstreet

c6c4ff65

bcachefs: subvolume_format.h

2y ago

Aurelien Jarno

31e97d7c

media: solo6x10: replace max(a, min(b, c)) by clamp(b, a, c)

2y ago

Julia Lawall

ff82e84e

coccinelle: device_attr_show: simplify patch case

2y ago

Vinod Koul

6386f6c9

dmaengine: fsl-qdma: increase size of 'irq_name'

2y ago

Shyam Prasad N

ce09f8d8

cifs: new mount option called retrans

2y ago

Kent Overstreet

30094208

Kill unnecessary kernel.h include

2y ago

Nicholas Piggin

dc158d23

KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers

2y ago

Anna-Maria Behnsen

1e490484

timers: Split out forward timer base functionality

2y ago

Joshua Yeong

6a902b11

clocksource/timer-riscv: Add riscv_clock_shutdown callback

2y ago

Kent Overstreet

8fed323b

bcachefs: snapshot_format.h

2y ago

Linus Torvalds

978ffcbf

execve: open the executable file before doing anything else

2y ago

Li Zhijian

68ea60a7

coccinelle: device_attr_show: Adapt to the latest Documentation/filesystems/sysfs.rst

2y ago

Vinod Koul

40429024

dmaengine: shdma: increase size of 'dev_id'

2y ago

Shyam Prasad N

49fe25ce

cifs: reschedule periodic query for server interfaces

2y ago

Kent Overstreet

2b010a69

preempt.h: Kill dependency on list.h

2y ago

Timothy Pearson

5e1d824f

powerpc: Don't clobber f0/vs0 during fp|altivec register save

During floating point and vector save to thread data f0/vs0 are
clobbered by the FPSCR/VSCR store routine. This has been obvserved to
lead to userspace register corruption and application data corruption
with io-uring.

Fix it by restoring f0/vs0 after FPSCR/VSCR store has completed for
all the FP, altivec, VMX register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with
dual POWER9 DD2.2 CPUs.

Additional detail (mpe):

Typically save_fpu() is called from __giveup_fpu() which saves the FP
regs and also *turns off FP* in the tasks MSR, meaning the kernel will
reload the FP regs from the thread struct before letting the task use FP
again. So in that case save_fpu() is free to clobber f0 because the FP
regs no longer hold live values for the task.

There is another case though, which is the path via:
sys_clone()
...
copy_process()
dup_task_struct()
arch_dup_task_struct()
flush_all_to_thread()
save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(),
leaving the regs live means the parent process doesn't have to take a
fault after the fork to get its FP regs back. The optimisation was added
in commit 8792468da5e1 ("powerpc: Add the ability to save FPU without
giving it up").

That path does clobber f0, but f0 is volatile across function calls,
and typically programs reach copy_process() from userspace via a syscall
wrapper function. So in normal usage f0 being clobbered across a
syscall doesn't cause visible data corruption.

But there is now a new path, because io-uring can call copy_process()
via create_io_thread() from the signal handling path. That's OK if the
signal is handled as part of syscall return, but it's not OK if the
signal is handled due to some other interrupt.

That path is:

interrupt_return_srr_user()
interrupt_exit_user_prepare()
interrupt_exit_user_prepare_main()
do_notify_resume()
get_signal()
task_work_run()
create_worker_cb()
create_io_worker()
copy_process()
dup_task_struct()
arch_dup_task_struct()
flush_all_to_thread()
save_all()
if (tsk->thread.regs->msr & MSR_FP)
save_fpu()
# f0 is clobbered and potentially live in userspace

Note the above discussion applies equally to save_altivec().

Fixes: 8792468da5e1 ("powerpc: Add the ability to save FPU without giving it up")
Cc: stable@vger.kernel.org # v4.6+
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
[mpe: Reword change log to describe exact path of corruption & other minor tweaks]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/1921539696.48534988.1700407082933.JavaMail.zimbra@raptorengineeringinc.com