refs/tags/x86_urgent_for_5.8_rc3 - linux/kernel/git/paulmck/linux-rcu

tag	2b0ed5e00c9fa0550b926e1d865f0fa936759b8f
tagger	Borislav Petkov <bp@suse.de>	Sun Jun 28 16:39:34 2020 +0200
object	bb5570ad3b54e7930997aec76ab68256d5236d94

* AMD Memory bandwidth counter width fix, by Babu Moger.

* Use the proper length type in the 32-bit truncate() syscall variant,
by Jiri Slaby.

* Reinit IA32_FEAT_CTL during wakeup to fix the case where after
resume, VMXON would #GP due to VMX not being properly enabled, by Sean
Christopherson.

* Fix a static checker warning in the resctrl code, by Dan Carpenter.

* Add a CR4 pinning mask for bits which cannot change after boot, by
Kees Cook.

* Align the start of the loop of __clear_user() to 16 bytes, to improve
performance on AMD zen1 and zen2 microarchitectures, by Matt Fleming.

commit	bb5570ad3b54e7930997aec76ab68256d5236d94
author	Matt Fleming <matt@codeblueprint.co.uk>	Thu Jun 18 11:20:02 2020 +0100
committer	Borislav Petkov <bp@suse.de>	Fri Jun 19 18:32:11 2020 +0200
tree	ec6baa21752c942f30902609460f8c9d52cedcc9
parent	a13b9d0b97211579ea63b96c606de79b963c0f47

x86/asm/64: Align start of __clear_user() loop to 16-bytes

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issue was discovered in the
SUSE kernel with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16 bytes makes it fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user(), which is used heavily when reading from /dev/zero.
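
The alignment argument can be checked with a little address arithmetic.
The following is a minimal sketch, not code from this patch: the helper
names align_up() and straddles() are hypothetical, and any addresses
passed to them are made-up illustrations rather than the real layout of
__clear_user().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of align.
 * align must be a non-zero power of two (16 or 64 here). */
static inline uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Hypothetical helper: true if the byte range [start, start + len)
 * crosses a multiple of window, i.e. its first and last byte fall
 * into different window-sized blocks. */
static inline bool straddles(uint64_t start, uint64_t len, uint64_t window)
{
	assert(window != 0 && (window & (window - 1)) == 0);
	if (len == 0)
		return false;
	return (start / window) != ((start + len - 1) / window);
}
```

For example, a 10-byte loop starting at the illustrative offset 0x3a
crosses the 64-byte boundary at 0x40, whereas the same loop placed at
align_up(0x3a, 16) == 0x40 sits entirely inside one 16-byte window. Any
loop of at most 16 bytes that starts on a 16-byte boundary cannot
straddle either kind of boundary.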

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                                        5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                                    revert-1153933703d9+               align16+
  Time mean95-pread_z100k       9.9195 (   0.00%)      5.9856 (  39.66%)      5.9938 (  39.58%)
  Time mean95-pread_z10k        1.1378 (   0.00%)      0.7450 (  34.52%)      0.7467 (  34.38%)
  Time mean95-pread_z1k         0.2623 (   0.00%)      0.2251 (  14.18%)      0.2252 (  14.15%)
  Time mean95-pread_zw100k      9.9974 (   0.00%)      6.0648 (  39.34%)      6.0756 (  39.23%)
  Time mean95-read_z100k        9.8940 (   0.00%)      5.9885 (  39.47%)      5.9994 (  39.36%)
  Time mean95-read_z10k         1.1394 (   0.00%)      0.7483 (  34.33%)      0.7482 (  34.33%)
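
The percentages in parentheses appear to be the reduction in time
relative to the first (baseline) column. A small sketch of that
arithmetic, assuming this interpretation; improvement_pct() is a
hypothetical name and not part of libmicro or the reporting tool:

```c
/* Hypothetical helper: relative time reduction in percent,
 * where baseline and measured are times in the same unit.
 * A positive result means measured is faster than baseline. */
static inline double improvement_pct(double baseline, double measured)
{
	return (baseline - measured) / baseline * 100.0;
}
```

With the pread_z100k row, improvement_pct(9.9195, 5.9856) evaluates to
roughly 39.66, which matches the figure shown for the revert column.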

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk

arch/x86/lib/usercopy_64.c

1 file changed
