The X86 entry, exception and interrupt code rework

This all started about 6 months ago with the attempt to move the POSIX CPU
timer heavy lifting out of the timer interrupt code and just have lockless
quick checks in that code path. Trivial 5 patches.

This unearthed an inconsistency in the KVM handling of task work and the
review requested to move all of this into generic code so other
architectures can share.

Valid request and solved with another 25 patches but those unearthed
inconsistencies vs. RCU and instrumentation.

Digging into this made it obvious that there are quite some inconsistencies
vs. instrumentation in general. The int3 text poke handling in particular
was completely unprotected and, with the batched update of trace events,
even more likely to be exposed to endless int3 recursion.

In parallel the RCU implications of instrumenting fragile entry code came
up in several discussions.

The conclusion of the X86 maintainer team was to go all the way and make
the protection against any form of instrumentation of fragile and dangerous
code paths enforceable and verifiable by tooling.

A first batch of preparatory work hit mainline with commit d5f744f9a2ac.

The (almost) full solution introduced a new code section '.noinstr.text'
into which all code which needs to be protected from instrumentation of all
sorts goes. Any call into instrumentable code out of this section has
to be annotated. objtool has support to validate this. Kprobes now excludes
this section fully which also prevents BPF from fiddling with it and all
'noinstr' annotated functions also keep ftrace off. The section, kprobes
and objtool changes are already merged.

The major changes coming with this are:

    - Preparatory cleanups

    - Annotating of relevant functions to move them into the .noinstr.text
      section or enforcing inlining by marking them __always_inline so the
      compiler cannot misplace or instrument them.

    - Splitting and simplifying the idtentry macro maze so that it is now
      clearly separated into simple exception entries and the more
      interesting ones which use interrupt stacks and have the paranoid
      handling vs. CR3 and GS.

    - Move quite some of the low level ASM functionality into C code:

       - enter_from and exit to user space handling. The ASM code now calls
         into C after doing the really necessary ASM handling and the return
         path goes back out without bells and whistles in ASM.

       - exception entry/exit got the equivalent treatment

       - move all IRQ tracepoints from ASM to C so they can be placed as
         appropriate which is especially important for the int3 recursion
         issue.

    - Consolidate the declaration and definition of entry points between 32
      and 64 bit. They share a common header and macros now.

    - Remove the extra device interrupt entry maze and just use the regular
      exception entry code.

    - All ASM entry points except NMI are now generated from the shared header
      file and the corresponding macros in the 32 and 64 bit entry ASM.

    - The C code entry points are consolidated as well with the help of
      DEFINE_IDTENTRY*() macros. This makes it possible to ensure at one
      central point that all corresponding entry points share the same
      semantics. The actual function body for most entry points is in an
      instrumentable and sane state.

      There are special macros for the more sensitive entry points,
      e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
      They allow putting the whole entry instrumentation and RCU handling
      into safe places instead of the previous "pray that it is correct"
      approach.

    - The INT3 text poke handling is now completely isolated and the
      recursion issue banned. Aside from the entry rework this required
      other isolation work, e.g. the ability to force inline bsearch.

    - Prevent #DB on fragile entry code, entry relevant memory and disable
      it on NMI, #MC entry, which allowed to get rid of the nested #DB IST
      stack shifting hackery.

    - A few other cleanups and enhancements which have been made possible
      through this and already merged changes, e.g. consolidating and
      further restricting the IDT code so the IDT table becomes RO after
      init, which removes yet another popular attack vector.

    - About 680 lines of ASM maze are gone.

There are a few open issues:

   - An escape out of the noinstr section in the MCE handler which needs
     some more thought, but given that MCE is a complete trainwreck by
     design and the probability to survive it is low, this was not high on
     the priority list.

   - Paravirtualization

     When PV is enabled then objtool complains about a bunch of indirect
     calls out of the noinstr section. There are a few straightforward
     ways to fix this, but the other issues vs. general correctness were
     more pressing than parawitz.

   - KVM

     KVM is inconsistent as well. Patches have been posted, but they have
     not yet been commented on or picked up by the KVM folks.

   - IDLE

     Pretty much the same problems can be found in the low level idle code
     especially the parts where RCU stopped watching. This was beyond the
     scope of the more obvious and exposable problems and is on the todo
     list.

The lesson learned from this brain melting exercise to morph the evolved
code base into something which can be validated and understood is that once
again the violation of the most important engineering principle
"correctness first" has caused quite a few people to spend valuable time on
problems which could have been avoided in the first place. The "features
first" tinkering mindset really has to stop.

With that I want to say thanks to everyone involved in contributing to this
effort. Special thanks go to the following people (alphabetical order):

   Alexandre Chartre
   Andy Lutomirski
   Borislav Petkov
   Brian Gerst
   Frederic Weisbecker
   Josh Poimboeuf
   Juergen Gross
   Lai Jiangshan
   Marco Elver
   Paolo Bonzini
   Paul McKenney
   Peter Zijlstra
   Vitaly Kuznetsov
   Will Deacon

x86/entry: Force rcu_irq_enter() when in idle task

The idea of conditionally calling into rcu_irq_enter() only when RCU is
not watching turned out to be not completely thought through.

Paul noticed occasional premature end of grace periods in RCU torture
testing. Bisection led to the commit which made the invocation of
rcu_irq_enter() conditional on !rcu_is_watching().

It turned out that this conditional breaks RCU assumptions about the idle
task when the scheduler tick happens to be a nested interrupt. Nested
interrupts can happen when the first interrupt invokes softirq processing
on return which enables interrupts.

If that nested tick interrupt does not invoke rcu_irq_enter() then RCU's
irq-nesting checks will believe that this interrupt came directly from
idle, which will cause RCU to report a quiescent state. Because this
interrupt instead came from a softirq handler which might have been
executing an RCU read-side critical section, this can cause the grace
period to end prematurely.

Change the condition from !rcu_is_watching() to is_idle_task(current) which
enforces that interrupts in the idle task unconditionally invoke
rcu_irq_enter() independent of the RCU state.

This is also correct vs. user mode entries in NOHZ full scenarios because
user mode entries bring RCU out of EQS and force the RCU irq nesting state
accounting to nested. As only the first interrupt can enter from user mode,
a nested tick interrupt will enter from kernel mode, and as the nesting
state accounting is forced to nested, it will not do anything stupid even
if rcu_irq_enter() has not been invoked.

Fixes: 3eeec3858488 ("x86/entry: Provide idtentry_entry/exit_cond_rcu()")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/87wo4cxubv.fsf@nanos.tec.linutronix.de

1 file changed