timers/nohz: Last resort update jiffies on nohz_full IRQ entry

When at least one CPU runs in nohz_full mode, a dedicated timekeeper
CPU is guaranteed to stay online and to never stop its tick.  However,
this timekeeper could be spinning with interrupts disabled for an
extended period of time, for example, when in the midst of a stop-machine
operation.

This means that jiffies are no longer being updated, which in turn means
that a nohz_full CPU can endlessly program its next tick in the past,
which results in a tick storm.

This situation can arise as follows:

0) CPU 0 is the timekeeper and CPU 1 is a nohz_full CPU.

1) A stop-machine callback is queued.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3), tick storm!

7) The nohz_full CPU 1 makes no forward progress until such time as
   integer overflow causes the tick's expiration to actually be in
   the future, which after the better part of an hour finally allows
   CPU 1 to reach MULTI_STOP_DISABLE_IRQ.
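
Steps 4) through 6) can be illustrated with a small userspace model.
This is only a sketch, not kernel code: the helper names, the HZ=250
tick period and the minimum delta value are assumptions made for the
illustration. It shows that while last_jiffies_update stays stale, every
reprogramming attempt falls back to the minimum delta.

```c
#include <stdint.h>

#define TICK_NSEC      4000000ULL /* assumed: HZ=250, one jiffy is 4 ms */
#define MIN_DELTA_NSEC 1000ULL    /* assumed minimum clockevent delta */

/* Step 4: the next tick is based on last_jiffies_update, not on "now". */
static uint64_t next_tick_expiry(uint64_t last_jiffies_update_ns)
{
	return last_jiffies_update_ns + TICK_NSEC;
}

/*
 * Step 5: simplified model of turning an expiry into a device delta.
 * An expiry that is not in the future gets the minimum delta.
 */
static uint64_t program_delta(uint64_t expires_ns, uint64_t now_ns)
{
	uint64_t delta;

	if (expires_ns <= now_ns)
		return MIN_DELTA_NSEC;

	delta = expires_ns - now_ns;
	return delta < MIN_DELTA_NSEC ? MIN_DELTA_NSEC : delta;
}

/*
 * Step 6: count how many consecutive ticks get programmed with the
 * minimum delta while last_jiffies_update never advances. The loop is
 * bounded by max_iters so the model terminates.
 */
static unsigned int storm_ticks(uint64_t last_jiffies_update_ns,
				uint64_t now_ns, unsigned int max_iters)
{
	unsigned int n = 0;

	while (n < max_iters) {
		uint64_t expires = next_tick_expiry(last_jiffies_update_ns);
		uint64_t delta = program_delta(expires, now_ns);

		if (delta != MIN_DELTA_NSEC)
			break;

		now_ns += delta; /* the tick fires almost immediately */
		n++;
	}
	return n;
}
```

With last_jiffies_update two jiffies behind, storm_ticks() runs until
its iteration bound; with an up-to-date base it returns 0.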

Solve this by unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the unconditional call to ktime_get() to
incur negligible overhead.
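
The idea can be sketched in userspace as follows. This is a simplified
model rather than the actual change: struct timekeeping_model and both
helpers are hypothetical names, and the HZ=250 tick period is an
assumption. On IRQ entry the model catches jiffies up to the current
time, so that a tick based on last_jiffies_update lands in the future.

```c
#include <stdint.h>

#define TICK_NSEC 4000000ULL /* assumed: HZ=250, one jiffy is 4 ms */

/* Hypothetical stand-in for the kernel's jiffies bookkeeping. */
struct timekeeping_model {
	uint64_t jiffies;
	uint64_t last_jiffies_update_ns;
};

/*
 * Catch jiffies up if at least one full tick period has elapsed since
 * the last update. Returns the number of jiffies that were added.
 */
static uint64_t update_jiffies_if_stale(struct timekeeping_model *tk,
					uint64_t now_ns)
{
	uint64_t elapsed, ticks;

	if (now_ns < tk->last_jiffies_update_ns)
		return 0;

	elapsed = now_ns - tk->last_jiffies_update_ns;
	if (elapsed < TICK_NSEC)
		return 0; /* value is fresh, nothing to do */

	ticks = elapsed / TICK_NSEC;
	tk->jiffies += ticks;
	tk->last_jiffies_update_ns += ticks * TICK_NSEC;
	return ticks;
}

/*
 * Model of IRQ entry on a nohz_full CPU: refresh jiffies as a last
 * resort, then return the expiry of a tick one jiffy forward.
 */
static uint64_t nohz_full_irq_enter(struct timekeeping_model *tk,
				    uint64_t now_ns)
{
	update_jiffies_if_stale(tk, now_ns);
	return tk->last_jiffies_update_ns + TICK_NSEC;
}
```

In this model the returned expiry is always later than now_ns, so the
minimum delta fallback from step 5) is no longer reached.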

Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2 files changed