EXP clocksource: Forgive repeated long-latency watchdog clocksource reads

Currently, the clocksource watchdog reacts to repeated long-latency
clocksource reads by marking that clocksource unstable on the theory that
these long-latency reads are a sign of a serious problem. And this theory
does in fact have real-world support in the form of firmware issues [1].

However, it is also possible to trigger this using stress-ng on what
the stress-ng man page terms "poorly designed hardware" [2]. And it
is not necessarily a bad thing for the kernel to diagnose cases where
high-stress workloads are being run on hardware that is not designed
for this sort of use.

Nevertheless, it is quite possible that real-world use will result in
some situation requiring that high-stress workloads run on hardware
not designed to accommodate them, and also requiring that the kernel
refrain from marking clocksources unstable.

Therefore, provide an out-of-tree patch that reacts to this situation
by leaving the clocksource alone, but using the old 62.5-millisecond
skew-detection threshold in response to persistent long-latency reads.
In addition, the offending clocksource is marked for re-initialization
in this case, which both restarts that clocksource with a clean bill of
health and avoids false-positive skew reports on later watchdog checks.
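
The decision described above can be modeled roughly as follows. This is
a userspace sketch only, not the code in this patch: the names cs_model,
wd_check(), wd_verdict, and the 100-microsecond normal threshold are
all hypothetical, and the sketch assumes that skew exceeding the old
62.5-millisecond threshold is still treated as instability even when
the reads were slow.

```c
#include <stdbool.h>
#include <stdlib.h>

/* The old skew-detection threshold named in the changelog: 62.5 ms. */
#define OLD_SKEW_THRESHOLD_NS 62500000LL

/* Hypothetical verdicts for one watchdog check. */
enum wd_verdict { WD_OK, WD_FORGIVEN, WD_UNSTABLE };

/* Hypothetical per-clocksource state for this model. */
struct cs_model {
	long long tight_threshold_ns;	/* assumed normal (tighter) threshold */
	bool needs_reinit;		/* set when a slow read is forgiven */
};

/*
 * Compare the interval seen by the clocksource with the interval seen
 * by the watchdog.  If the reads were persistently slow, fall back to
 * the old 62.5 ms threshold and mark the clocksource for
 * re-initialization instead of marking it unstable.
 */
static enum wd_verdict wd_check(struct cs_model *cs, long long cs_delta_ns,
				long long wd_delta_ns, bool long_latency)
{
	long long skew = llabs(cs_delta_ns - wd_delta_ns);
	long long limit = long_latency ? OLD_SKEW_THRESHOLD_NS
				       : cs->tight_threshold_ns;

	if (skew > limit)
		return WD_UNSTABLE;
	if (long_latency) {
		cs->needs_reinit = true;
		return WD_FORGIVEN;
	}
	return WD_OK;
}
```

In this model, a 200-microsecond skew is reported as unstable after
normal reads, but is forgiven (with re-initialization requested) after
persistently slow reads.
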
Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2 files changed