
Merge tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"Updates for timekeeping, timers and clockevent/source drivers:

Core:

- Yet another round of improvements to make the clocksource watchdog
more robust:

- Relax the clocksource-watchdog skew criteria to match the NTP
criteria.

- Temporarily skip the watchdog when high memory latencies are
detected which can lead to false-positives.

- Provide an option to enable TSC skew detection even on systems
where TSC is marked as reliable.

Sigh!

- Initialize the restart block in the nanosleep syscalls to be
directed to the no restart function instead of doing a partial
setup on entry.

This prevents an erroneous restart_syscall() invocation from
corrupting user space data. While such a situation is clearly a
user space bug, preventing this is a correctness issue and caters
to the principle of least surprise.

- Ignore the hrtimer slack for realtime tasks in schedule_hrtimeout()
to align it with the nanosleep semantics.

Drivers:

- The obligatory new driver bindings for Mediatek, Rockchip and
RISC-V variants.

- Add support for the C3STOP misfeature to the RISC-V timer to handle
the case where the timer stops in deeper idle state.

- Set up a static key in the RISC-V timer correctly before first use.

- The usual small improvements and fixes all over the place"

* tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
clocksource/drivers/timer-sun4i: Add CLOCK_EVT_FEAT_DYNIRQ
clocksource/drivers/em_sti: Mark driver as non-removable
clocksource/drivers/sh_tmu: Mark driver as non-removable
clocksource/drivers/riscv: Patch riscv_clock_next_event() jump before first use
clocksource/drivers/timer-microchip-pit64b: Add delay timer
clocksource/drivers/timer-microchip-pit64b: Select driver only on ARM
dt-bindings: timer: sifive,clint: add compatibles for T-Head's C9xx
dt-bindings: timer: mediatek,mtk-timer: add MT8365
clocksource/drivers/riscv: Get rid of clocksource_arch_init() callback
clocksource/drivers/sh_cmt: Mark driver as non-removable
clocksource/drivers/timer-microchip-pit64b: Drop obsolete dependency on COMPILE_TEST
clocksource/drivers/riscv: Increase the clock source rating
clocksource/drivers/timer-riscv: Set CLOCK_EVT_FEAT_C3STOP based on DT
dt-bindings: timer: Add bindings for the RISC-V timer device
RISC-V: time: initialize hrtimer based broadcast clock event device
dt-bindings: timer: rk-timer: Add rktimer for rv1126
time/debug: Fix memory leak with using debugfs_lookup()
clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
posix-timers: Use atomic64_try_cmpxchg() in __update_gt_cputime()
clocksource: Verify HPET and PMTMR when TSC unverified
...

+252 -77
+10
Documentation/admin-guide/kernel-parameters.txt
···
 			in situations with strict latency requirements (where
 			interruptions from clocksource watchdog are not
 			acceptable).
+			[x86] recalibrate: force recalibration against a HW timer
+			(HPET or PM timer) on systems whose TSC frequency was
+			obtained from HW or FW using either an MSR or CPUID(0x15).
+			Warn if the difference is more than 500 ppm.
+			[x86] watchdog: Use TSC as the watchdog clocksource with
+			which to check other HW timers (HPET or PM timer), but
+			only on systems where TSC has been deemed trustworthy.
+			This will be suppressed by an earlier tsc=nowatchdog and
+			can be overridden by a later tsc=nowatchdog.  A console
+			message will flag any such suppression or overriding.

 	tsc_early_khz=	[X86] Skip early TSC calibration and use the given
 			value instead. Useful when the early TSC frequency discovery
+1
Documentation/devicetree/bindings/timer/mediatek,mtk-timer.txt
···

 For those SoCs that use CPUX
 * "mediatek,mt6795-systimer" for MT6795 compatible timers (CPUX)
+* "mediatek,mt8365-systimer" for MT8365 compatible timers (CPUX)

 - reg: Should contain location and length for timer register.
 - clocks: Should contain system clock.
+52
Documentation/devicetree/bindings/timer/riscv,timer.yaml
···
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/timer/riscv,timer.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: RISC-V timer
+
+maintainers:
+  - Anup Patel <anup@brainfault.org>
+
+description: |+
+  RISC-V platforms always have a RISC-V timer device for the supervisor-mode
+  based on the time CSR defined by the RISC-V privileged specification. The
+  timer interrupts of this device are configured using the RISC-V SBI Time
+  extension or the RISC-V Sstc extension.
+
+  The clock frequency of RISC-V timer device is specified via the
+  "timebase-frequency" DT property of "/cpus" DT node which is described
+  in Documentation/devicetree/bindings/riscv/cpus.yaml
+
+properties:
+  compatible:
+    enum:
+      - riscv,timer
+
+  interrupts-extended:
+    minItems: 1
+    maxItems: 4096   # Should be enough?
+
+  riscv,timer-cannot-wake-cpu:
+    type: boolean
+    description:
+      If present, the timer interrupt cannot wake up the CPU from one or
+      more suspend/idle states.
+
+additionalProperties: false
+
+required:
+  - compatible
+  - interrupts-extended
+
+examples:
+  - |
+    timer {
+      compatible = "riscv,timer";
+      interrupts-extended = <&cpu1intc 5>,
+                            <&cpu2intc 5>,
+                            <&cpu3intc 5>,
+                            <&cpu4intc 5>;
+    };
+...
+1
Documentation/devicetree/bindings/timer/rockchip,rk-timer.yaml
···
   - items:
       - enum:
           - rockchip,rv1108-timer
+          - rockchip,rv1126-timer
           - rockchip,rk3036-timer
           - rockchip,rk3128-timer
           - rockchip,rk3188-timer
+8
Documentation/devicetree/bindings/timer/sifive,clint.yaml
···
   property of "/cpus" DT node. The "timebase-frequency" DT property is
   described in Documentation/devicetree/bindings/riscv/cpus.yaml

+  T-Head C906/C910 CPU cores include an implementation of CLINT too, however
+  their implementation lacks a memory-mapped MTIME register, thus not
+  compatible with SiFive ones.
+
 properties:
   compatible:
     oneOf:
···
           - starfive,jh7100-clint
           - canaan,k210-clint
       - const: sifive,clint0
+  - items:
+      - enum:
+          - allwinner,sun20i-d1-clint
+      - const: thead,c900-clint
   - items:
       - const: sifive,clint0
       - const: riscv,clint0
-1
arch/riscv/Kconfig
···

 config RISCV
 	def_bool y
-	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_HAS_BINFMT_FLAT
+2 -8
arch/riscv/kernel/time.c
···
  */

 #include <linux/of_clk.h>
+#include <linux/clockchips.h>
 #include <linux/clocksource.h>
 #include <linux/delay.h>
 #include <asm/sbi.h>
···

 	of_clk_init(NULL);
 	timer_probe();
-}

-void clocksource_arch_init(struct clocksource *cs)
-{
-#ifdef CONFIG_GENERIC_GETTIMEOFDAY
-	cs->vdso_clock_mode = VDSO_CLOCKMODE_ARCHTIMER;
-#else
-	cs->vdso_clock_mode = VDSO_CLOCKMODE_NONE;
-#endif
+	tick_setup_hrtimer_broadcast();
 }
+1
arch/x86/include/asm/time.h
···
 extern void hpet_time_init(void);
 extern void time_init(void);
 extern bool pit_timer_init(void);
+extern bool tsc_clocksource_watchdog_disabled(void);

 extern struct clock_event_device *global_clock_event;
+2
arch/x86/kernel/hpet.c
···
 	if (!hpet_counting())
 		goto out_nohpet;

+	if (tsc_clocksource_watchdog_disabled())
+		clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
 	clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);

 	if (id & HPET_ID_LEGSUP) {
+50 -5
arch/x86/kernel/tsc.c
···

 int tsc_clocksource_reliable;

+static int __read_mostly tsc_force_recalibrate;
+
 static u32 art_to_tsc_numerator;
 static u32 art_to_tsc_denominator;
 static u64 art_to_tsc_offset;
···

 static int no_sched_irq_time;
 static int no_tsc_watchdog;
+static int tsc_as_watchdog;

 static int __init tsc_setup(char *str)
 {
···
 		no_sched_irq_time = 1;
 	if (!strcmp(str, "unstable"))
 		mark_tsc_unstable("boot parameter");
-	if (!strcmp(str, "nowatchdog"))
+	if (!strcmp(str, "nowatchdog")) {
 		no_tsc_watchdog = 1;
+		if (tsc_as_watchdog)
+			pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
+				 __func__);
+		tsc_as_watchdog = 0;
+	}
+	if (!strcmp(str, "recalibrate"))
+		tsc_force_recalibrate = 1;
+	if (!strcmp(str, "watchdog")) {
+		if (no_tsc_watchdog)
+			pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
+				 __func__);
+		else
+			tsc_as_watchdog = 1;
+	}
 	return 1;
 }
···
 		clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
 }

+bool tsc_clocksource_watchdog_disabled(void)
+{
+	return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
+	       tsc_as_watchdog && !no_tsc_watchdog;
+}
+
 static void __init check_system_tsc_reliable(void)
 {
 #if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
···
 	else
 		freq = calc_pmtimer_ref(delta, ref_start, ref_stop);

+	/* Will hit this only if tsc_force_recalibrate has been set */
+	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+
+		/* Warn if the deviation exceeds 500 ppm */
+		if (abs(tsc_khz - freq) > (tsc_khz >> 11)) {
+			pr_warn("Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!\n");
+			pr_info("Previous calibrated TSC freq:\t %lu.%03lu MHz\n",
+				(unsigned long)tsc_khz / 1000,
+				(unsigned long)tsc_khz % 1000);
+		}
+
+		pr_info("TSC freq recalibrated by [%s]:\t %lu.%03lu MHz\n",
+			hpet ? "HPET" : "PM_TIMER",
+			(unsigned long)freq / 1000,
+			(unsigned long)freq % 1000);
+
+		return;
+	}
+
 	/* Make sure we're within 1% */
 	if (abs(tsc_khz - freq) > tsc_khz/100)
 		goto out;
···
 	if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
 		return 0;

-	if (tsc_unstable)
-		goto unreg;
+	if (tsc_unstable) {
+		clocksource_unregister(&clocksource_tsc_early);
+		return 0;
+	}

 	if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
 		clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;
···
 		if (boot_cpu_has(X86_FEATURE_ART))
 			art_related_clocksource = &clocksource_tsc;
 		clocksource_register_khz(&clocksource_tsc, tsc_khz);
-unreg:
 		clocksource_unregister(&clocksource_tsc_early);
-		return 0;
+
+		if (!tsc_force_recalibrate)
+			return 0;
 	}

 	schedule_delayed_work(&tsc_irqwork, 0);
+1 -1
drivers/clocksource/Kconfig
···

 config MICROCHIP_PIT64B
 	bool "Microchip PIT64B support"
-	depends on OF || COMPILE_TEST
+	depends on OF && ARM
 	select TIMER_OF
 	help
 	  This option enables Microchip PIT64B timer for Atmel
+4 -2
drivers/clocksource/acpi_pm.c
···
 #include <linux/pci.h>
 #include <linux/delay.h>
 #include <asm/io.h>
+#include <asm/time.h>

 /*
  * The I/O port the PMTMR resides at.
···
 		return -ENODEV;
 	}

-	return clocksource_register_hz(&clocksource_acpi_pm,
-				       PMTMR_TICKS_PER_SEC);
+	if (tsc_clocksource_watchdog_disabled())
+		clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
+	return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
 }

 /* We use fs_initcall because we want the PCI fixups to have run
+1 -6
drivers/clocksource/em_sti.c
···
 	return 0;
 }

-static int em_sti_remove(struct platform_device *pdev)
-{
-	return -EBUSY; /* cannot unregister clockevent and clocksource */
-}
-
 static const struct of_device_id em_sti_dt_ids[] = {
 	{ .compatible = "renesas,em-sti", },
 	{},
···

 static struct platform_driver em_sti_device_driver = {
 	.probe		= em_sti_probe,
-	.remove		= em_sti_remove,
 	.driver		= {
 		.name	= "em_sti",
 		.of_match_table = em_sti_dt_ids,
+		.suppress_bind_attrs = true,
 	}
 };
+1 -6
drivers/clocksource/sh_cmt.c
···
 	return 0;
 }

-static int sh_cmt_remove(struct platform_device *pdev)
-{
-	return -EBUSY; /* cannot unregister clockevent and clocksource */
-}
-
 static struct platform_driver sh_cmt_device_driver = {
 	.probe		= sh_cmt_probe,
-	.remove		= sh_cmt_remove,
 	.driver		= {
 		.name	= "sh_cmt",
 		.of_match_table = of_match_ptr(sh_cmt_of_table),
+		.suppress_bind_attrs = true,
 	},
 	.id_table	= sh_cmt_id_table,
 };
+1 -6
drivers/clocksource/sh_tmu.c
···
 	return 0;
 }

-static int sh_tmu_remove(struct platform_device *pdev)
-{
-	return -EBUSY; /* cannot unregister clockevent and clocksource */
-}
-
 static const struct platform_device_id sh_tmu_id_table[] = {
 	{ "sh-tmu", SH_TMU },
 	{ "sh-tmu-sh3", SH_TMU_SH3 },
···

 static struct platform_driver sh_tmu_device_driver = {
 	.probe		= sh_tmu_probe,
-	.remove		= sh_tmu_remove,
 	.driver		= {
 		.name	= "sh_tmu",
 		.of_match_table = of_match_ptr(sh_tmu_of_table),
+		.suppress_bind_attrs = true,
 	},
 	.id_table	= sh_tmu_id_table,
 };
+12
drivers/clocksource/timer-microchip-pit64b.c
···

 #include <linux/clk.h>
 #include <linux/clockchips.h>
+#include <linux/delay.h>
 #include <linux/interrupt.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
···
 static void __iomem *mchp_pit64b_cs_base;
 /* Default cycles for clockevent timer. */
 static u64 mchp_pit64b_ce_cycles;
+/* Delay timer. */
+static struct delay_timer mchp_pit64b_dt;

 static inline u64 mchp_pit64b_cnt_read(void __iomem *base)
 {
···
 }

 static u64 notrace mchp_pit64b_sched_read_clk(void)
+{
+	return mchp_pit64b_cnt_read(mchp_pit64b_cs_base);
+}
+
+static unsigned long notrace mchp_pit64b_dt_read(void)
 {
 	return mchp_pit64b_cnt_read(mchp_pit64b_cs_base);
 }
···
 	}

 	sched_clock_register(mchp_pit64b_sched_read_clk, 64, clk_rate);
+
+	mchp_pit64b_dt.read_current_timer = mchp_pit64b_dt_read;
+	mchp_pit64b_dt.freq = clk_rate;
+	register_current_timer_delay(&mchp_pit64b_dt);

 	return 0;
 }
+21 -6
drivers/clocksource/timer-riscv.c
···
 #include <asm/timex.h>

 static DEFINE_STATIC_KEY_FALSE(riscv_sstc_available);
+static bool riscv_timer_cannot_wake_cpu;

 static int riscv_clock_next_event(unsigned long delta,
 				  struct clock_event_device *ce)
···

 static struct clocksource riscv_clocksource = {
 	.name		= "riscv_clocksource",
-	.rating		= 300,
+	.rating		= 400,
 	.mask		= CLOCKSOURCE_MASK(64),
 	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
 	.read		= riscv_clocksource_rdtime,
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+	.vdso_clock_mode = VDSO_CLOCKMODE_ARCHTIMER,
+#else
+	.vdso_clock_mode = VDSO_CLOCKMODE_NONE,
+#endif
 };

 static int riscv_timer_starting_cpu(unsigned int cpu)
···
 	ce->cpumask = cpumask_of(cpu);
 	ce->irq = riscv_clock_event_irq;
+	if (riscv_timer_cannot_wake_cpu)
+		ce->features |= CLOCK_EVT_FEAT_C3STOP;
 	clockevents_config_and_register(ce, riscv_timebase, 100, 0x7fffffff);

 	enable_percpu_irq(riscv_clock_event_irq,
···
 	if (cpuid != smp_processor_id())
 		return 0;

+	child = of_find_compatible_node(NULL, NULL, "riscv,timer");
+	if (child) {
+		riscv_timer_cannot_wake_cpu = of_property_read_bool(child,
+					"riscv,timer-cannot-wake-cpu");
+		of_node_put(child);
+	}
+
 	domain = NULL;
 	child = of_get_compatible_child(n, "riscv,cpu-intc");
 	if (!child) {
···
 		return error;
 	}

+	if (riscv_isa_extension_available(NULL, SSTC)) {
+		pr_info("Timer interrupt in S-mode is available via sstc extension\n");
+		static_branch_enable(&riscv_sstc_available);
+	}
+
 	error = cpuhp_setup_state(CPUHP_AP_RISCV_TIMER_STARTING,
 			 "clockevents/riscv/timer:starting",
 			 riscv_timer_starting_cpu, riscv_timer_dying_cpu);
 	if (error)
 		pr_err("cpu hp setup state failed for RISCV timer [%d]\n",
 		       error);
-
-	if (riscv_isa_extension_available(NULL, SSTC)) {
-		pr_info("Timer interrupt in S-mode is available via sstc extension\n");
-		static_branch_enable(&riscv_sstc_available);
-	}

 	return error;
 }
+2 -1
drivers/clocksource/timer-sun4i.c
···
 	.clkevt = {
 		.name = "sun4i_tick",
 		.rating = 350,
-		.features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
+		.features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT |
+			    CLOCK_EVT_FEAT_DYNIRQ,
 		.set_state_shutdown = sun4i_clkevt_shutdown,
 		.set_state_periodic = sun4i_clkevt_set_periodic,
 		.set_state_oneshot = sun4i_clkevt_set_oneshot,
-1
include/linux/bits.h
···
 #include <vdso/bits.h>
 #include <asm/bitsperlong.h>

-#define BIT_ULL(nr)		(ULL(1) << (nr))
 #define BIT_MASK(nr)		(UL(1) << ((nr) % BITS_PER_LONG))
 #define BIT_WORD(nr)		((nr) / BITS_PER_LONG)
 #define BIT_ULL_MASK(nr)	(ULL(1) << ((nr) % BITS_PER_LONG_LONG))
+1
include/vdso/bits.h
···
 #include <vdso/const.h>

 #define BIT(nr)			(UL(1) << (nr))
+#define BIT_ULL(nr)		(ULL(1) << (nr))

 #endif /* __VDSO_BITS_H */
+5 -1
kernel/time/Kconfig
···
 	int "Clocksource watchdog maximum allowable skew (in μs)"
 	depends on CLOCKSOURCE_WATCHDOG
 	range 50 1000
-	default 100
+	default 125
 	help
 	  Specify the maximum amount of allowable watchdog skew in
 	  microseconds before reporting the clocksource to be unstable.
+	  The default is based on a half-second clocksource watchdog
+	  interval and NTP's maximum frequency drift of 500 parts
+	  per million.  If the clocksource is good enough for NTP,
+	  it is good enough for the clocksource watchdog!

 endmenu
 endif
+51 -21
kernel/time/clocksource.c
···
 static u64 suspend_start;

 /*
+ * Interval: 0.5sec.
+ */
+#define WATCHDOG_INTERVAL (HZ >> 1)
+
+/*
  * Threshold: 0.0312s, when doubled: 0.0625s.
  * Also a default for cs->uncertainty_margin when registering clocks.
  */
···
 * clocksource surrounding a read of the clocksource being validated.
 * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
 * a lower bound for cs->uncertainty_margin values when registering clocks.
+ *
+ * The default of 500 parts per million is based on NTP's limits.
+ * If a clocksource is good enough for NTP, it is good enough for us!
 */
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
 #define MAX_SKEW_USEC	CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
 #else
-#define MAX_SKEW_USEC	100
+#define MAX_SKEW_USEC	(125 * WATCHDOG_INTERVAL / HZ)
 #endif

 #define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
···
 static int clocksource_watchdog_kthread(void *data);
 static void __clocksource_change_rating(struct clocksource *cs, int rating);
-
-/*
- * Interval: 0.5sec.
- */
-#define WATCHDOG_INTERVAL (HZ >> 1)

 static void clocksource_watchdog_work(struct work_struct *work)
 {
···
 		goto skip_test;
 	}

-	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
-		smp_processor_id(), watchdog->name, wd_delay, nretries);
+	pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n",
+		smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name);
 	return WD_READ_UNSTABLE;

 skip_test:
···
 }
 EXPORT_SYMBOL_GPL(clocksource_verify_percpu);

+static inline void clocksource_reset_watchdog(void)
+{
+	struct clocksource *cs;
+
+	list_for_each_entry(cs, &watchdog_list, wd_list)
+		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+
 static void clocksource_watchdog(struct timer_list *unused)
 {
 	u64 csnow, wdnow, cslast, wdlast, delta;
···
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
 	enum wd_read_status read_ret;
+	unsigned long extra_wait = 0;
 	u32 md;

 	spin_lock(&watchdog_lock);
···

 		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);

-		if (read_ret != WD_READ_SUCCESS) {
-			if (read_ret == WD_READ_UNSTABLE)
-				/* Clock readout unreliable, so give it up. */
-				__clocksource_unstable(cs);
+		if (read_ret == WD_READ_UNSTABLE) {
+			/* Clock readout unreliable, so give it up. */
+			__clocksource_unstable(cs);
 			continue;
+		}
+
+		/*
+		 * When WD_READ_SKIP is returned, it means the system is likely
+		 * under very heavy load, where the latency of reading
+		 * watchdog/clocksource is very big, and affect the accuracy of
+		 * watchdog check. So give system some space and suspend the
+		 * watchdog check for 5 minutes.
+		 */
+		if (read_ret == WD_READ_SKIP) {
+			/*
+			 * As the watchdog timer will be suspended, and
+			 * cs->last could keep unchanged for 5 minutes, reset
+			 * the counters.
+			 */
+			clocksource_reset_watchdog();
+			extra_wait = HZ * 300;
+			break;
 		}

 		/* Clocksource initialized ? */
···
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {
+			u64 cs_wd_msec;
+			u64 wd_msec;
+			u32 wd_rem;
+
 			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
 				smp_processor_id(), cs->name);
 			pr_warn("                      '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
 				watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
 			pr_warn("                      '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
 				cs->name, cs_nsec, csnow, cslast, cs->mask);
+			cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
+			wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
+			pr_warn("                      Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
+				cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
 			if (curr_clocksource == cs)
 				pr_warn("                      '%s' is current clocksource.\n", cs->name);
 			else if (curr_clocksource)
···
 	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
 	 */
 	if (!timer_pending(&watchdog_timer)) {
-		watchdog_timer.expires += WATCHDOG_INTERVAL;
+		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
 		add_timer_on(&watchdog_timer, next_cpu);
 	}
 out:
···
 		return;
 	del_timer(&watchdog_timer);
 	watchdog_running = 0;
-}
-
-static inline void clocksource_reset_watchdog(void)
-{
-	struct clocksource *cs;
-
-	list_for_each_entry(cs, &watchdog_list, wd_list)
-		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
 }

 static void clocksource_resume_watchdog(void)
+14 -4
kernel/time/hrtimer.c
···
 	u64 slack;

 	slack = current->timer_slack_ns;
-	if (dl_task(current) || rt_task(current))
+	if (rt_task(current))
 		slack = 0;

 	hrtimer_init_sleeper_on_stack(&t, clockid, mode);
···
 	if (!timespec64_valid(&tu))
 		return -EINVAL;

+	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;
 	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
···
 	if (!timespec64_valid(&tu))
 		return -EINVAL;

+	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;
 	return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
···
 /**
  * schedule_hrtimeout_range_clock - sleep until timeout
  * @expires:	timeout value (ktime_t)
- * @delta:	slack in expires timeout (ktime_t)
+ * @delta:	slack in expires timeout (ktime_t) for SCHED_OTHER tasks
  * @mode:	timer mode
  * @clock_id:	timer clock to be used
  */
···
 		return -EINTR;
 	}

+	/*
+	 * Override any slack passed by the user if under
+	 * rt constraints.
+	 */
+	if (rt_task(current))
+		delta = 0;
+
 	hrtimer_init_sleeper_on_stack(&t, clock_id, mode);
 	hrtimer_set_expires_range_ns(&t.timer, *expires, delta);
 	hrtimer_sleeper_start_expires(&t, mode);
···
 /**
  * schedule_hrtimeout_range - sleep until timeout
  * @expires:	timeout value (ktime_t)
- * @delta:	slack in expires timeout (ktime_t)
+ * @delta:	slack in expires timeout (ktime_t) for SCHED_OTHER tasks
  * @mode:	timer mode
  *
  * Make the current task sleep until the given expiry time has
···
 * the current task state has been set (see set_current_state()).
 *
 * The @delta argument gives the kernel the freedom to schedule the
- * actual wakeup to a time that is both power and performance friendly.
+ * actual wakeup to a time that is both power and performance friendly
+ * for regular (non RT/DL) tasks.
 * The kernel give the normal best effort behavior for "@expires+@delta",
 * but may decide to fire the timer earlier, but no earlier than @expires.
 *
+6 -7
kernel/time/posix-cpu-timers.c
···
  */
 static inline void __update_gt_cputime(atomic64_t *cputime, u64 sum_cputime)
 {
-	u64 curr_cputime;
-retry:
-	curr_cputime = atomic64_read(cputime);
-	if (sum_cputime > curr_cputime) {
-		if (atomic64_cmpxchg(cputime, curr_cputime, sum_cputime) != curr_cputime)
-			goto retry;
-	}
+	u64 curr_cputime = atomic64_read(cputime);
+
+	do {
+		if (sum_cputime <= curr_cputime)
+			return;
+	} while (!atomic64_try_cmpxchg(cputime, &curr_cputime, sum_cputime));
 }

 static void update_gt_cputime(struct task_cputime_atomic *cputime_atomic,
+2
kernel/time/posix-stubs.c
···
 		return -EINVAL;
 	if (flags & TIMER_ABSTIME)
 		rmtp = NULL;
+	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;
 	texp = timespec64_to_ktime(t);
···
 		return -EINVAL;
 	if (flags & TIMER_ABSTIME)
 		rmtp = NULL;
+	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;
 	texp = timespec64_to_ktime(t);
+2
kernel/time/posix-timers.c
···
 		return -EINVAL;
 	if (flags & TIMER_ABSTIME)
 		rmtp = NULL;
+	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
 	current->restart_block.nanosleep.rmtp = rmtp;

···
 		return -EINVAL;
 	if (flags & TIMER_ABSTIME)
 		rmtp = NULL;
+	current->restart_block.fn = do_no_restart_syscall;
 	current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
 	current->restart_block.nanosleep.compat_rmtp = rmtp;

+1 -1
kernel/time/test_udelay.c
···
 static void __exit udelay_test_exit(void)
 {
 	mutex_lock(&udelay_test_lock);
-	debugfs_remove(debugfs_lookup(DEBUGFS_FILENAME, NULL));
+	debugfs_lookup_and_remove(DEBUGFS_FILENAME, NULL);
 	mutex_unlock(&udelay_test_lock);
 }