Commit Graph

5398 Commits

Author SHA1 Message Date
Peter Zijlstra (Intel)
d45476d983 x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
The RETPOLINE_AMD name is unfortunate since it isn't necessarily
AMD only, in fact Hygon also uses it. Furthermore it will likely be
sufficient for some Intel processors. Therefore rename the thing to
RETPOLINE_LFENCE to better describe what it is.

Add the spectre_v2=retpoline,lfence option as an alias to
spectre_v2=retpoline,amd to preserve existing setups. However, the output
of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed.

  [ bp: Fix typos, massage. ]

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21 10:21:28 +01:00
Jue Wang
8ca97812c3 x86/mce: Work around an erratum on fast string copy instructions
A rare kernel panic scenario can happen when the following conditions
are met due to an erratum on fast string copy instructions:

1) An uncorrected error.
2) That error must be in first cache line of a page.
3) Kernel must execute page_copy from the page immediately before that
page.

The fast string copy instructions ("REP; MOVS*") could consume an
uncorrectable memory error in the cache line _right after_ the desired
region to copy and raise an MCE.

Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string
copy and will avoid such spurious machine checks. However, that is less
preferable due to the permanent performance impact. Considering memory
poison is rare, it's desirable to keep fast string copy enabled until an
MCE is seen.

Intel has confirmed the following:
1. The CPU erratum of fast string copy only applies to Skylake,
Cascade Lake and Cooper Lake generations.

Directly return from the MCE handler:
2. Will result in complete execution of the "REP; MOVS*" with no data
loss or corruption.
3. Will not result in another MCE firing on the next poisoned cache line
due to "REP; MOVS*".
4. Will resume execution from a correct point in code.
5. Will result in the same instruction that triggered the MCE firing a
second MCE immediately for any other software recoverable data fetch
errors.
6. Is not safe without disabling the fast string copy, as the next fast
string copy of the same buffer on the same CPU would result in a PANIC
MCE.

This should mitigate the erratum completely with the only caveat that
the fast string copy is disabled on the affected hyper thread thus
performance degradation.

This is still better than the OS crashing on MCEs raised on an
irrelevant process due to "REP; MOVS*' accesses in a kernel context,
e.g., copy_page.

Tested:

Injected errors on 1st cache line of 8 anonymous pages of process
'proc1' and observed MCE consumption from 'proc2' with no panic
(directly returned).

Without the fix, the host panicked within a few minutes on a
random 'proc2' process due to kernel access from copy_page.

  [ bp: Fix comment style + touch ups, zap an unlikely(), improve the
    quirk function's readability. ]

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20220218013209.2436006-1-juew@google.com
2022-02-19 14:26:42 +01:00
Reinette Chatre
e5733d8c89 x86/sgx: Fix missing poison handling in reclaimer
The SGX reclaimer code lacks page poison handling in its main
free path. This can lead to avoidable machine checks if a
poisoned page is freed and reallocated instead of being
isolated.

A troublesome scenario is:
 1. Machine check (#MC) occurs (asynchronous, !MF_ACTION_REQUIRED)
 2. arch_memory_failure() is eventually called
 3. (SGX) page->poison set to 1
 4. Page is reclaimed
 5. Page added to normal free lists by sgx_reclaim_pages()
    ^ This is the bug (poison pages should be isolated on the
    sgx_poison_page_list instead)
 6. Page is reallocated by some innocent enclave, a second (synchronous)
    in-kernel #MC is induced, probably during EADD instruction.
    ^ This is the fallout from the bug

(6) is unfortunate and can be avoided by replacing the open coded
enclave page freeing code in the reclaimer with sgx_free_epc_page()
to obtain support for poison page handling that includes placing the
poisoned page on the correct list.

Fixes: d6d261bded ("x86/sgx: Add new sgx_epc_page flag bit to mark free pages")
Fixes: 992801ae92 ("x86/sgx: Initial poison handling for dirty and free pages")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com
2022-02-17 10:24:50 -08:00
Mario Limonciello
08f253ec37 x86/cpu: Clear SME feature flag when not in use
Currently, the SME CPU feature flag is reflective of whether the CPU
supports the feature but not whether it has been activated by the
kernel.

Change this around to clear the SME feature flag if the kernel is not
using it so userspace can determine if it is available and in use
from /proc/cpuinfo.

As the feature flag is cleared on systems where SME isn't active, use
CPUID 0x8000001f to confirm SME availability before calling
native_wbinvd().

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20220216034446.2430634-1-mario.limonciello@amd.com
2022-02-16 19:45:53 +01:00
Frederic Weisbecker
04d4e665a6 sched/isolation: Use single feature type while referring to housekeeping cpumask
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-16 15:57:55 +01:00
Borislav Petkov
f11445ba7a x86/mce: Use arch atomic and bit helpers
The arch helpers do not have explicit KASAN instrumentation. Use them in
noinstr code.

Inline a couple more functions with single call sites, while at it:

mce_severity_amd_smca() has a single call-site which is noinstr so force
the inlining and fix:

  vmlinux.o: warning: objtool: mce_severity_amd.constprop.0()+0xca: call to \
	  mce_severity_amd_smca() leaves .noinstr.text section

Always inline mca_msr_reg():

     text    data     bss     dec     hex filename
  16065240        128031326       36405368        180501934       ac23dae vmlinux.before
  16065240        128031294       36405368        180501902       ac23d8e vmlinux.after

and mce_no_way_out() as the latter one is used only once, to fix:

  vmlinux.o: warning: objtool: mce_read_aux()+0x53: call to mca_msr_reg() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_machine_check()+0xc9: call to mce_no_way_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20220204083015.17317-4-bp@alien8.de
2022-02-13 22:08:27 +01:00
Reinette Chatre
8795359e35 x86/sgx: Silence softlockup detection when releasing large enclaves
Vijay reported that the "unclobbered_vdso_oversubscribed" selftest
triggers the softlockup detector.

Actual SGX systems have 128GB of enclave memory or more.  The
"unclobbered_vdso_oversubscribed" selftest creates one enclave which
consumes all of the enclave memory on the system. Tearing down such a
large enclave takes around a minute, most of it in the loop where
the EREMOVE instruction is applied to each individual 4k enclave page.

Spending one minute in a loop triggers the softlockup detector.

Add a cond_resched() to give other tasks a chance to run and placate
the softlockup detector.

Cc: stable@vger.kernel.org
Fixes: 1728ab54b4 ("x86/sgx: Add a page reclaimer")
Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>  (kselftest as sanity check)
Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com
2022-02-10 15:58:14 -08:00
Tony Luck
822ccfade5 x86/cpu: Read/save PPIN MSR during initialization
Currently, the PPIN (Protected Processor Inventory Number) MSR is read
by every CPU that processes a machine check, CMCI, or just polls machine
check banks from a periodic timer. This is not a "fast" MSR, so this
adds to overhead of processing errors.

Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the
PPIN during initialization. Use this copy in mce_setup() instead of
reading the MSR.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com
2022-02-01 16:29:26 +01:00
Tony Luck
00a2f23eef x86/cpu: X86_FEATURE_INTEL_PPIN finally has a CPUID bit
After nine generations of adding to model specific list of CPUs that
support PPIN (Protected Processor Inventory Number) Intel allocated
a CPUID bit to enumerate the MSRs.

CPUID(EAX=7, ECX=1).EBX bit 0 enumerates presence of MSR_PPIN_CTL and
MSR_PPIN. Add it to the "scattered" CPUID bits and add an entry to the
ppin_cpuids[] x86_match_cpu() array to catch Intel CPUs that implement
it.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-3-tony.luck@intel.com
2022-02-01 16:15:19 +01:00
Tony Luck
0dcab41d34 x86/cpu: Merge Intel and AMD ppin_init() functions
The code to decide whether a system supports the PPIN (Protected
Processor Inventory Number) MSR was cloned from the Intel
implementation. Apart from the X86_FEATURE bit and the MSR numbers it is
identical.

Merge the two functions into common x86 code, but use x86_match_cpu()
instead of the switch (c->x86_model) that was used by the old Intel
code.

No functional change.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
2022-02-01 12:56:23 +01:00
Greg Kroah-Hartman
7f99cb5e60 x86/CPU/AMD: Use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the AMD mce sysfs code to use default_groups field which has
been the preferred way since

  aa30f47cf6 ("kobject: Add support for default attribute groups to kobj_type")

so that the obsolete default_attrs field can be removed soon.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220106103537.3663852-1-gregkh@linuxfoundation.org
2022-02-01 12:41:24 +01:00
Tony Luck
e464121f2d x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
Missed adding the Icelake-D CPU to the list. It uses the same MSRs
to control and read the inventory number as all the other models.

Fixes: dc6b025de9 ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN")
Reported-by: Ailin Xu <ailin.xu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220121174743.1875294-2-tony.luck@intel.com
2022-01-25 18:40:30 +01:00
Yazen Ghannam
1f52b0aba6 x86/MCE/AMD: Allow thresholding interface updates after init
Changes to the AMD Thresholding sysfs code prevents sysfs writes from
updating the underlying registers once CPU init is completed, i.e.
"threshold_banks" is set.

Allow the registers to be updated if the thresholding interface is
already initialized or if in the init path. Use the "set_lvt_off" value
to indicate if running in the init path, since this value is only set
during init.

Fixes: a037f3ca0e ("x86/mce/amd: Make threshold bank setting hotplug robust")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220117161328.19148-1-yazen.ghannam@amd.com
2022-01-23 20:50:18 +01:00
Linus Torvalds
cb3f09f9af hyperv-next for 5.17
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmHhw7oTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXrjSB/979LV4Dn1PMcFYsSdlFEMeHcjzJdw/
 kFnLPXMaPJyfg6QPuf83jxzw9uxw8fcePMdVq/FFBtmVV9fJMAv62B8jaGS1p58c
 WnAg+7zsTN+xEoJn+tskSSon8BNMWVrl41zP3K4Ged+5j8UEBk62GB8Orz1qkpwL
 fTh3/+xAvczJeD4zZb1dAm4WnmcQJ4vhg45p07jX6owvnwQAikMFl45aSW54I5o8
 vAxGzFgdsZ2NtExnRNKh3b3DozA8JUE89KckBSZnDtq4rH8Fyy6Wij56Hc6v6Cml
 SUohiNbHX7hsNwit/lxL8wuF97IiA0pQSABobEg3rxfTghTUep51LlaN
 =/m4A
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20220114' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - More patches for Hyper-V isolation VM support (Tianyu Lan)

 - Bug fixes and clean-up patches from various people

* tag 'hyperv-next-signed-20220114' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  scsi: storvsc: Fix storvsc_queuecommand() memory leak
  x86/hyperv: Properly deal with empty cpumasks in hyperv_flush_tlb_multi()
  Drivers: hv: vmbus: Initialize request offers message for Isolation VM
  scsi: storvsc: Fix unsigned comparison to zero
  swiotlb: Add CONFIG_HAS_IOMEM check around swiotlb_mem_remap()
  x86/hyperv: Fix definition of hv_ghcb_pg variable
  Drivers: hv: Fix definition of hypercall input & output arg variables
  net: netvsc: Add Isolation VM support for netvsc driver
  scsi: storvsc: Add Isolation VM support for storvsc driver
  hyper-v: Enable swiotlb bounce buffer for Isolation VM
  x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
  swiotlb: Add swiotlb bounce buffer remap function for HV IVM
2022-01-16 15:53:00 +02:00
Linus Torvalds
64ad946152 - Get rid of all the .fixup sections because this generates
misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
 LIVEPATCH as the backtrace misses the function which is being fixed up.
 
 - Add Straight Light Speculation mitigation support which uses a new
 compiler switch -mharden-sls= which sticks an INT3 after a RET or an
 indirect branch in order to block speculation after them. Reportedly,
 CPUs do speculate behind such insns.
 
 - The usual set of cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHfKA0ACgkQEsHwGGHe
 VUqLJg/2I2X2xXr5filJVaK+sQgmvDzk67DKnbxRBW2xcPF+B5sSW5yhe3G5UPW7
 SJVdhQ3gHcTiliGGlBf/VE7KXbqxFN0vO4/VFHZm78r43g7OrXTxz6WXXQRJ1n67
 U3YwRH3b6cqXZNFMs+X4bJt6qsGJM1kdTTZ2as4aERnaFr5AOAfQvfKbyhxLe/XA
 3SakfYISVKCBQ2RkTfpMpwmqlsatGFhTC5IrvuDQ83dDsM7O+Dx1J6Gu3fwjKmie
 iVzPOjCh+xTpZQp/SIZmt7MzoduZvpSym4YVyHvEnMiexQT4AmyaRthWqrhnEXY/
 qOvj8/XIqxmix8EaooGqRIK0Y2ZegxkPckNFzaeC3lsWohwMIGIhNXwHNEeuhNyH
 yvNGAW9Cq6NeDRgz5MRUXcimYw4P4oQKYLObS1WqFZhNMqm4sNtoEAYpai/lPYfs
 zUDckgXF2AoPOsSqy3hFAVaGovAgzfDaJVzkt0Lk4kzzjX2WQiNLhmiior460w+K
 0l2Iej58IajSp3MkWmFH368Jo8YfUVmkjbbpsmjsBppA08e1xamJB7RmswI/Ezj6
 s5re6UioCD+UYdjWx41kgbvYdvIkkZ2RLrktoZd/hqHrOLWEIiwEbyFO2nRFJIAh
 YjvPkB1p7iNuAeYcP1x9Ft9GNYVIsUlJ+hK86wtFCqy+abV+zQ==
 =R52z
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Get rid of all the .fixup sections because this generates
   misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
   LIVEPATCH as the backtrace misses the function which is being fixed
   up.

 - Add Straight Line Speculation mitigation support which uses a new
   compiler switch -mharden-sls= which sticks an INT3 after a RET or an
   indirect branch in order to block speculation after them. Reportedly,
   CPUs do speculate behind such insns.

 - The usual set of cleanups and improvements

* tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  x86/entry_32: Fix segment exceptions
  objtool: Remove .fixup handling
  x86: Remove .fixup section
  x86/word-at-a-time: Remove .fixup usage
  x86/usercopy: Remove .fixup usage
  x86/usercopy_32: Simplify __copy_user_intel_nocache()
  x86/sgx: Remove .fixup usage
  x86/checksum_32: Remove .fixup usage
  x86/vmx: Remove .fixup usage
  x86/kvm: Remove .fixup usage
  x86/segment: Remove .fixup usage
  x86/fpu: Remove .fixup usage
  x86/xen: Remove .fixup usage
  x86/uaccess: Remove .fixup usage
  x86/futex: Remove .fixup usage
  x86/msr: Remove .fixup usage
  x86/extable: Extend extable functionality
  x86/entry_32: Remove .fixup usage
  x86/entry_64: Remove .fixup usage
  x86/copy_mc_64: Remove .fixup usage
  ...
2022-01-12 16:31:19 -08:00
Linus Torvalds
b35b6d4d71 Power management updates for 5.17-rc1
- Add new P-state driver for AMD processors (Huang Rui).
 
  - Fix initialization of min and max frequency QoS requests in the
    cpufreq core (Rafael Wysocki).
 
  - Fix EPP handling on Alder Lake in intel_pstate (Srinivas Pandruvada).
 
  - Make intel_pstate update cpuinfo.max_freq when notified of HWP
    capabilities changes and drop a redundant function call from that
    driver (Rafael Wysocki).
 
  - Improve IRQ support in the Qcom cpufreq driver (Ard Biesheuvel,
    Stephen Boyd, Vladimir Zapolskiy).
 
  - Fix double devm_remap() in the Mediatek cpufreq driver (Hector Yuan).
 
  - Introduce thermal pressure helpers for cpufreq CPU cooling (Lukasz
    Luba).
 
  - Make cpufreq use default_groups in kobj_type (Greg Kroah-Hartman).
 
  - Make cpuidle use default_groups in kobj_type (Greg Kroah-Hartman).
 
  - Fix two comments in cpuidle code (Jason Wang, Yang Li).
 
  - Allow model-specific normal EPB value to be used in the intel_epb
    sysfs attribute handling code (Srinivas Pandruvada).
 
  - Simplify locking in pm_runtime_put_suppliers() (Rafael Wysocki).
 
  - Add safety net to supplier device release in the runtime PM core
    code (Rafael Wysocki).
 
  - Capture device status before disabling runtime PM for it (Rafael
    Wysocki).
 
  - Add new macros for declaring PM operations to allow drivers to
    avoid guarding them with CONFIG_PM #ifdefs or __maybe_unused and
    update some drivers to use these macros (Paul Cercueil).
 
  - Allow ACPI hardware signature to be honoured during restore from
    hibernation (David Woodhouse).
 
  - Update outdated operating performance points (OPP) documentation
    (Tang Yizhou).
 
  - Reduce log severity for informative message regarding frequency
    transition failures in devfreq (Tzung-Bi Shih).
 
  - Add DRAM frequency controller devfreq driver for Allwinner sunXi
    SoCs (Samuel Holland).
 
  - Add missing COMMON_CLK dependency to sun8i devfreq driver (Arnd
    Bergmann).
 
  - Add support for new layout of Psys PowerLimit Register on SPR to
    the Intel RAPL power capping driver (Zhang Rui).
 
  - Fix typo in a comment in idle_inject.c (Jason Wang).
 
  - Remove unused function definition from the DTPM (Dynamit Thermal
    Power Management) power capping framework (Daniel Lezcano).
 
  - Reduce DTPM trace verbosity (Daniel Lezcano).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmHcgkgSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxs34P/3kFhRk7qrwEekx6F11im6caLKT9+Qap
 PuGVqfTbK7TupVQDVGFBEjTjgKY7Ph7Fcr4bqn6wvNOp96cjXyOSk/c1fcpS3Bpr
 b1PYsFsb9diNKE462sGGYClyCT3X5qQqtpxzOl3g4I1PWKTC1mKFm4Jm2m6S6cFq
 DKhsgYKFzQSZNb1wJM4JjHS9c3BRygqp4nfEAmifu5b9tLZf7stWnFHhbGq63M9m
 OwHOrEEnzhf4pOXGZTvIXeczgE6IcuDdlGkIg7XMHnmKSNvj1HqhEgi2lfSRb98z
 5eI4S6JymCJGVK+gr8iVCq1iJ+LKqV3YPXRqvI35/+NqIKYxMt2ZivQQf5s3aQLe
 26gUulD3O6Pz5tMlwcDElD4/tcClfg35PCD/VzpRR8TAo8vLBb63kZ5v6+HM34ZJ
 6QbLTNZJTnGmEqxMccUxP+HhZz8ssqpLAC+R2sE5yXbNpIZq8CbPiGb65RGiX3SG
 CmRKqH/xQVNKBYP0ChjmUyhKcBxOnx1Xu8AhsN7gRAy0aht7j7OdjTnJuGiX6gu3
 Q5WxvVvkekyfhuFQ5TST9y/fzvMJWzeaA6GhVIr6RoBmshNQGTb0H4HXARxS3Ah5
 qjd7ao7BFLa898FCHaHIpmFWp0wF5iljwCJQVP3I2qUpPvDJxEtsxc4CF/AZzyNR
 VudoFqLoIV5C
 =1egI
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The most signigicant change here is the addition of a new cpufreq
  'P-state' driver for AMD processors as a better replacement for the
  venerable acpi-cpufreq driver.

  There are also other cpufreq updates (in the core, intel_pstate, ARM
  drivers), PM core updates (mostly related to adding new macros for
  declaring PM operations which should make the lives of driver
  developers somewhat easier), and a bunch of assorted fixes and
  cleanups.

  Summary:

   - Add new P-state driver for AMD processors (Huang Rui).

   - Fix initialization of min and max frequency QoS requests in the
     cpufreq core (Rafael Wysocki).

   - Fix EPP handling on Alder Lake in intel_pstate (Srinivas
     Pandruvada).

   - Make intel_pstate update cpuinfo.max_freq when notified of HWP
     capabilities changes and drop a redundant function call from that
     driver (Rafael Wysocki).

   - Improve IRQ support in the Qcom cpufreq driver (Ard Biesheuvel,
     Stephen Boyd, Vladimir Zapolskiy).

   - Fix double devm_remap() in the Mediatek cpufreq driver (Hector
     Yuan).

   - Introduce thermal pressure helpers for cpufreq CPU cooling (Lukasz
     Luba).

   - Make cpufreq use default_groups in kobj_type (Greg Kroah-Hartman).

   - Make cpuidle use default_groups in kobj_type (Greg Kroah-Hartman).

   - Fix two comments in cpuidle code (Jason Wang, Yang Li).

   - Allow model-specific normal EPB value to be used in the intel_epb
     sysfs attribute handling code (Srinivas Pandruvada).

   - Simplify locking in pm_runtime_put_suppliers() (Rafael Wysocki).

   - Add safety net to supplier device release in the runtime PM core
     code (Rafael Wysocki).

   - Capture device status before disabling runtime PM for it (Rafael
     Wysocki).

   - Add new macros for declaring PM operations to allow drivers to
     avoid guarding them with CONFIG_PM #ifdefs or __maybe_unused and
     update some drivers to use these macros (Paul Cercueil).

   - Allow ACPI hardware signature to be honoured during restore from
     hibernation (David Woodhouse).

   - Update outdated operating performance points (OPP) documentation
     (Tang Yizhou).

   - Reduce log severity for informative message regarding frequency
     transition failures in devfreq (Tzung-Bi Shih).

   - Add DRAM frequency controller devfreq driver for Allwinner sunXi
     SoCs (Samuel Holland).

   - Add missing COMMON_CLK dependency to sun8i devfreq driver (Arnd
     Bergmann).

   - Add support for new layout of Psys PowerLimit Register on SPR to
     the Intel RAPL power capping driver (Zhang Rui).

   - Fix typo in a comment in idle_inject.c (Jason Wang).

   - Remove unused function definition from the DTPM (Dynamit Thermal
     Power Management) power capping framework (Daniel Lezcano).

   - Reduce DTPM trace verbosity (Daniel Lezcano)"

* tag 'pm-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (53 commits)
  x86, sched: Fix undefined reference to init_freq_invariance_cppc() build error
  cpufreq: amd-pstate: Fix Kconfig dependencies for AMD P-State
  cpufreq: amd-pstate: Fix struct amd_cpudata kernel-doc comment
  cpuidle: use default_groups in kobj_type
  x86: intel_epb: Allow model specific normal EPB value
  MAINTAINERS: Add AMD P-State driver maintainer entry
  Documentation: amd-pstate: Add AMD P-State driver introduction
  cpufreq: amd-pstate: Add AMD P-State performance attributes
  cpufreq: amd-pstate: Add AMD P-State frequencies attributes
  cpufreq: amd-pstate: Add boost mode support for AMD P-State
  cpufreq: amd-pstate: Add trace for AMD P-State module
  cpufreq: amd-pstate: Introduce the support for the processors with shared memory solution
  cpufreq: amd-pstate: Add fast switch function for AMD P-State
  cpufreq: amd-pstate: Introduce a new AMD P-State driver to support future processors
  ACPI: CPPC: Add CPPC enable register function
  ACPI: CPPC: Check present CPUs for determining _CPC is valid
  ACPI: CPPC: Implement support for SystemIO registers
  x86/msr: Add AMD CPPC MSR definitions
  x86/cpufeatures: Add AMD Collaborative Processor Performance Control feature flag
  cpufreq: use default_groups in kobj_type
  ...
2022-01-10 20:34:00 -08:00
Linus Torvalds
d93aebbd76 Merge branch 'random-5.17-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
 "These a bit more numerous than usual for the RNG, due to folks
  resubmitting patches that had been pending prior and generally renewed
  interest.

  There are a few categories of patches in here:

   1) Dominik Brodowski and I traded a series back and forth for a some
      weeks that fixed numerous issues related to seeds being provided
      at extremely early boot by the firmware, before other parts of the
      kernel or of the RNG have been initialized, both fixing some
      crashes and addressing correctness around early boot randomness.
      One of these is marked for stable.

   2) I replaced the RNG's usage of SHA-1 with BLAKE2s in the entropy
      extractor, and made the construction a bit safer and more
      standard. This was sort of a long overdue low hanging fruit, as we
      were supposed to have phased out SHA-1 usage quite some time ago
      (even if all we needed here was non-invertibility). Along the way
      it also made extraction 131% faster. This required a bit of
      Kconfig and symbol plumbing to make things work well with the
      crypto libraries, which is one of the reasons why I'm sending you
      this pull early in the cycle.

   3) I got rid of a truly superfluous call to RDRAND in the hot path,
      which resulted in a whopping 370% increase in performance.

   4) Sebastian Andrzej Siewior sent some patches regarding PREEMPT_RT,
      the full series of which wasn't ready yet, but the first two
      preparatory cleanups were good on their own. One of them touches
      files in kernel/irq/, which is the other reason why I'm sending
      you this pull early in the cycle.

   5) Other assorted correctness fixes from Eric Biggers, Jann Horn,
      Mark Brown, Dominik Brodowski, and myself"

* 'random-5.17-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  random: don't reset crng_init_cnt on urandom_read()
  random: avoid superfluous call to RDRAND in CRNG extraction
  random: early initialization of ChaCha constants
  random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs
  random: harmonize "crng init done" messages
  random: mix bootloader randomness into pool
  random: do not throw away excess input to crng_fast_load
  random: do not re-init if crng_reseed completes before primary init
  random: fix crash on multiple early calls to add_bootloader_randomness()
  random: do not sign extend bytes for rotation when mixing
  random: use BLAKE2s instead of SHA1 in extraction
  lib/crypto: blake2s: include as built-in
  random: fix data race on crng init time
  random: fix data race on crng_node_pool
  irq: remove unused flags argument from __handle_irq_event_percpu()
  random: remove unused irq_flags argument from add_interrupt_randomness()
  random: document add_hwgenerator_randomness() with other input functions
  MAINTAINERS: add git tree for random.c
2022-01-10 11:52:16 -08:00
Linus Torvalds
7e740ae635 - First part of a series to move the AMD address translation code from
arch/x86/ to amd64_edac as that is its only user anyway
 
 - Some MCE error injection improvements to the AMD side
 
 - Reorganization of the #MC handler code and the facilities it calls to
 make it noinstr-safe
 
 - Add support for new AMD MCA bank types and non-uniform banks layout
 
 - The usual set of cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcGZ4ACgkQEsHwGGHe
 VUr6Zw//WBvNvfV/akQGsvVo94G0DaF+buYB+Tl1p0goMd7QfKA5iHxjB1alEJC2
 dTchIr7pjiiE3nr4svuWLLQZamx8kMwQNqipioBHXg3YThj0wD4PbUOC9TlIceBR
 3yxVbvwlD7Y7sb2PII6IMlagzTiIeW0/ps29DHFr5vqDBvEanNdAHoV/h2vQi+76
 Ma96psIxzTMSk11yGB6l9k66EASCdDGBU7sODjup7wuQmuRaQ/1oJAWY0wIJvJez
 frjpaz/YKmlTwTf9bxoJbky2FkeBsD4yXXUGwjDgMq0EyUUaeSbvaQkm8gSHX9Yr
 VDDv1WvT6QIw6x7Wc4skS8lWmZghNBbAHOoNS31BPJ2IDmFWkF5Q2bNEuHrtU4EC
 0mkNeyN6x48L/F8j/1aE/tm+SjiGexZX4zhi6MNWReTV140I1zqQq/r7CCu5+MEa
 PAB1YH/96k2dMPT6mbFrRIFJmkDuBuZOAkuwYWEjO/XjPl2SGBGj1jKolWW3qjRR
 Po7vBJnDt7wgigWFh6+R4rJv+fh87XfB7B2wEOt4Yn37jUkK6dNRIy0zFmDaC1J2
 bHgsJbWC+Sgs1G57gnYABJYzLj7RRdDyCu1/UUVyBBP7/WfZJw0kjABE7p3AaYTd
 15JV1L0c/Ypuv05LJf40LkyF2F5w2fnP5QM2Rr8U4xW/GumEyWs=
 =8Hu7
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:
 "A relatively big amount of movements in RAS-land this time around:

   - First part of a series to move the AMD address translation code
     from arch/x86/ to amd64_edac as that is its only user anyway

   - Some MCE error injection improvements to the AMD side

   - Reorganization of the #MC handler code and the facilities it calls
     to make it noinstr-safe

   - Add support for new AMD MCA bank types and non-uniform banks layout

   - The usual set of cleanups and fixes"

* tag 'ras_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/mce: Reduce number of machine checks taken during recovery
  x86/mce/inject: Avoid out-of-bounds write when setting flags
  x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration
  x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
  x86/mce: Check regs before accessing it
  x86/mce: Mark mce_start() noinstr
  x86/mce: Mark mce_timed_out() noinstr
  x86/mce: Move the tainting outside of the noinstr region
  x86/mce: Mark mce_read_aux() noinstr
  x86/mce: Mark mce_end() noinstr
  x86/mce: Mark mce_panic() noinstr
  x86/mce: Prevent severity computation from being instrumented
  x86/mce: Allow instrumentation during task work queueing
  x86/mce: Remove noinstr annotation from mce_setup()
  x86/mce: Use mce_rdmsrl() in severity checking code
  x86/mce: Remove function-local cpus variables
  x86/mce: Do not use memset to clear the banks bitmaps
  x86/mce/inject: Set the valid bit in MCA_STATUS before error injection
  x86/mce/inject: Check if a bank is populated before injecting
  x86/mce: Get rid of cpu_missing
  ...
2022-01-10 11:43:09 -08:00
Linus Torvalds
25f8c7785e - Enable the short string copies for CPUs which support them, in
copy_user_enhanced_fast_string()
 
 - Avoid writing MSR_CSTAR on Intel due to TDX guests raising a #VE trap
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcFRcACgkQEsHwGGHe
 VUrVYRAAg8hJS/aIMnqr+CDX+iOlx2hxJ2TA2bA45NwWc1A4VTt9kwRB0+NIKkjj
 F3uJbidZjxSch9Oza6O5KyjJK8QtOfqxyYcx8TLjSleqJRoJWxl1Ub1/yAfKIX/0
 QsqXVc/OuMzgwVGYLUwGSWifJOWMYKy03vSczmXK74zp9vZ56fdot8rOhDm3Xb/R
 QSfT5nKlgCvxbvAqgFfbXKoEu/EqT43sTXq4o1C6yDX/G6JOGe6nXZIAvIVm3iKZ
 utOqO+tBOmektF/yg3EHZL/7paFgtfETcI1YpmPYqKhG3KvvZgm7yyU6SqrcctSx
 vMSPTcgcuZl2I5OF+eesUGfGGhHSfSPBAhkxpCTOb6lHf73PYRC3BnQtlQkQt6g/
 UOtm3fQwrVJcKlMu7nem46iDCgbSyvASFa5ZyuOGcrAiFLhJzQNRDlXLpxp/q615
 yOYTRgj4YS6vomzc6bL3zNCcF5aJUwAPNVghe3l2zwKXetoOPvtWX8sKlYjiN3GW
 DTtEi117IAiWkosDIYY+aFNxLeOqxpNMcOkwd5eHHdpR3rkeFkjOtBctll/eHzPi
 NYx++cV5yYW0z4S2uRr6o4k4hdgAQU/p7xhdO28Z+yzWpmXQ//79HhiOf2nNd1iI
 dpQAx9roo8vbR3JYLxGYFuJrZsHna+/f6Gqf5teUy7SjVL5M95U=
 =zbYM
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:

 - Enable the short string copies for CPUs which support them, in
   copy_user_enhanced_fast_string()

 - Avoid writing MSR_CSTAR on Intel due to TDX guests raising a #VE trap

* tag 'x86_cpu_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib: Add fast-short-rep-movs check to copy_user_enhanced_fast_string()
  x86/cpu: Don't write CSTAR MSR on Intel CPUs
2022-01-10 10:09:22 -08:00
Linus Torvalds
4a692ae360 - Flush *all* mappings from the TLB after switching to the trampoline
pagetable to prevent any stale entries' presence
 
 - Flush global mappings from the TLB, in addition to the CR3-write,
 after switching off of the trampoline_pgd during boot to clear the
 identity mappings
 
 - Prevent instrumentation issues resulting from the above changes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcEGIACgkQEsHwGGHe
 VUp14RAAgo6BbW9J82Pyl55egIhcQDdGsa16Gdm9S/AFIIW/NhwYo9ydrgtzr/70
 3XKpJYX7nH7PUKYRmoca/m3NnzUU+wnjSGS1XMyB3bJvn2/8S1qeuwBty2VP2dYM
 iS2eGRLjVjbMWwQUSK7tPJa5wi11zUqLIyCe3t0YiWso6TK7xKaVJTQ3/19Xc+/a
 zVQ5VpmzglUTxA6xGCvTDn5IUViUb8QmIuw7Ty6QtQEoI6T3qQvPkdJNXOxDcHNy
 9gDGf4O+5YlPCxYsNEkWDDa02zSZ2aWFSq76b98VyMiOK0xts+ktnAwq6oes+as9
 ZLIipOu5aIkj8te7he0FelyvPhZAVzrFvvmMf1U+EV3PqbyVkabhk5SBeP5v8CZy
 bM4eYNuJ2FLvFpUCC9zQ/MNVQ6ZtxN15rrrsTqk46KLPBHmHp/Aj9W/DP4zpCcNg
 Wwh4xbnGNIN8jZBiBJG6R6q7oM/lZt/loEicxm2QFZHtAIYMsiUmE99HnIREjUHd
 +0mwo2rHniie9zh6GoybX8OcbZCLYGdfe3iPvlO9fQpyDTn8IUIlnruDlUiTBMDM
 fX4J2dynh7xXRH1WW+MwxDv4n400+C08SG9zTD0qPCbGhYwNscMlZhA2JN6mlPep
 spuRPOzzwUUxqjXkDloeDDJNUQ8r032OB2LMhWSbLApJrJM9/QA=
 =cM+z
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Borislav Petkov:

 - Flush *all* mappings from the TLB after switching to the trampoline
   pagetable to prevent any stale entries' presence

 - Flush global mappings from the TLB, in addition to the CR3-write,
   after switching off of the trampoline_pgd during boot to clear the
   identity mappings

 - Prevent instrumentation issues resulting from the above changes

* tag 'x86_mm_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Prevent early boot triple-faults with instrumentation
  x86/mm: Include spinlock_t definition in pgtable.
  x86/mm: Flush global TLB when switching to trampoline page-table
  x86/mm/64: Flush global TLB on boot and AP bringup
  x86/realmode: Add comment for Global bit usage in trampoline_pgd
  x86/mm: Add missing <asm/cpufeatures.h> dependency to <asm/page_64.h>
2022-01-10 09:51:38 -08:00
Linus Torvalds
bfed6efb8e - Add support for handling hw errors in SGX pages: poisoning, recovering
from poison memory and error injection into SGX pages
 
 - A bunch of changes to the SGX selftests to simplify and allow of SGX
 features testing without the need of a whole SGX software stack
 
 - Add a sysfs attribute which is supposed to show the amount of SGX
 memory in a NUMA node, similar to what /proc/meminfo is to normal
 memory
 
 - The usual bunch of fixes and cleanups too
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcDQMACgkQEsHwGGHe
 VUq42xAAjWM0AFpIxgUBpbE0swV3ZMulnndl3/vA5XN+9Yn7Q52+AFyPRE0s7Zam
 Ap+cInh2Il7d/sv54rZ4x/j7+TH4i7s8fWPVU/XiPALQuOuw0/B1wJJ+jmMiPFiU
 3jr7DkUPyWjWTHduMY/tk+xMOpkx1XsxJheYnKvsKVW+fjJ0vPuftAZtfu2z2VOh
 3JLcp5cAXPxW0UK9gdoF5bCBQhBu0NRguTbhHhbByAixQO2GyVSKLSRovUdj0a+y
 QRrQ6hgcvpTOsVHJoWJ7yIX4SBzQTe9Bg6dT9DghOxE4Sc2GH89hu7wRztGawBJO
 nLyzWgiW9ttjQutDpBvZANNVcFAPAdtDWczrzZpREbrGKkzT+kOBnIIL1LWITWOy
 2YWTO3ytW0KNIK85GzMjSVOKRMgaHJeBaGuYZ7Z0kb3GuUPJ9zRlaRxNapKQFuzA
 0PGoA4IDT+2Afy7VYBBNUA2d/WverFQuXKusSxK6b5zJ173o5/DXL2q0d3gn/j8Z
 hhxJUJyVOsfRXSG4NKrj4se4FiA0n/RL4oyUZR9iJ8kWzzZTd0eZTAn468bpGIp5
 yiOlPOLgsmu0xzVmAtG1+4d2+S2x+Ec5YE0sP1V/JLNciYk3Ebp7UyfnS3tn33Xc
 cpdWjELvD1LJVpMEURnbjRrwU6OiiAekYJCP/9lmK9zfOGpwRHc=
 =vFTM
 -----END PGP SIGNATURE-----

Merge tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SGX updates from Borislav Petkov:

 - Add support for handling hw errors in SGX pages: poisoning,
   recovering from poison memory and error injection into SGX pages

 - A bunch of changes to the SGX selftests to simplify and allow of SGX
   features testing without the need of a whole SGX software stack

 - Add a sysfs attribute which is supposed to show the amount of SGX
   memory in a NUMA node, similar to what /proc/meminfo is to normal
   memory

 - The usual bunch of fixes and cleanups too

* tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/sgx: Fix NULL pointer dereference on non-SGX systems
  selftests/sgx: Fix corrupted cpuid macro invocation
  x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node
  x86/sgx: Fix minor documentation issues
  selftests/sgx: Add test for multiple TCS entry
  selftests/sgx: Enable multiple thread support
  selftests/sgx: Add page permission and exception test
  selftests/sgx: Rename test properties in preparation for more enclave tests
  selftests/sgx: Provide per-op parameter structs for the test enclave
  selftests/sgx: Add a new kselftest: Unclobbered_vdso_oversubscribed
  selftests/sgx: Move setup_test_encl() to each TEST_F()
  selftests/sgx: Encpsulate the test enclave creation
  selftests/sgx: Dump segments and /proc/self/maps only on failure
  selftests/sgx: Create a heap for the test enclave
  selftests/sgx: Make data measurement for an enclave segment optional
  selftests/sgx: Assign source for each segment
  selftests/sgx: Fix a benign linker warning
  x86/sgx: Add check for SGX pages to ghes_do_memory_failure()
  x86/sgx: Add hook to error injection address validation
  x86/sgx: Hook arch_memory_failure() into mainline code
  ...
2022-01-10 09:44:09 -08:00
Dave Hansen
2056e2989b x86/sgx: Fix NULL pointer dereference on non-SGX systems
== Problem ==

Nathan Chancellor reported an oops when aceessing the
'sgx_total_bytes' sysfs file:

	https://lore.kernel.org/all/YbzhBrimHGGpddDM@archlinux-ax161/

The sysfs output code accesses the sgx_numa_nodes[] array
unconditionally.  However, this array is allocated during SGX
initialization, which only occurs on systems where SGX is
supported.

If the sysfs file is accessed on systems without SGX support,
sgx_numa_nodes[] is NULL and an oops occurs.

== Solution ==

To fix this, hide the entire nodeX/x86/ attribute group on
systems without SGX support using the ->is_visible attribute
group callback.

Unfortunately, SGX is initialized via a device_initcall() which
occurs _after_ the ->is_visible() callback.  Instead of moving
SGX initialization earlier, call sysfs_update_group() during
SGX initialization to update the group visiblility.

This update requires moving the SGX sysfs code earlier in
sgx/main.c.  There are no code changes other than the addition of
arch_update_sysfs_visibility() and a minor whitespace fixup to
arch_node_attr_is_visible() which checkpatch caught.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-sgx@vger.kernel.org
Cc: x86@kernel.org
Fixes: 50468e4313 ("x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20220104171527.5E8416A8@davehans-spike.ostc.intel.com
2022-01-07 08:47:23 -08:00
Sebastian Andrzej Siewior
703f7066f4 random: remove unused irq_flags argument from add_interrupt_randomness()
Since commit
   ee3e00e9e7 ("random: use registers from interrupted code for CPU's w/o a cycle counter")

the irq_flags argument is no longer used.

Remove unused irq_flags.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: linux-hyperv@vger.kernel.org
Cc: x86@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-01-07 00:25:25 +01:00
Srinivas Pandruvada
4ecc933b7d x86: intel_epb: Allow model specific normal EPB value
The current EPB "normal" is defined as 6 and set whenever power-up EPB
value is 0. This setting resulted in the desired out of box power and
performance for several CPU generations. But this value is not suitable
for AlderLake mobile CPUs, as this resulted in higher uncore power.
Since EPB is model specific, this is not unreasonable to have different
behavior.

Allow a capability where "normal" EPB can be redefined. For AlderLake
mobile CPUs this desired normal value is 7.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-01-04 16:37:23 +01:00
Zhang Zixun
de768416b2 x86/mce/inject: Avoid out-of-bounds write when setting flags
A contrived zero-length write, for example, by using write(2):

  ...
  ret = write(fd, str, 0);
  ...

to the "flags" file causes:

  BUG: KASAN: stack-out-of-bounds in flags_write
  Write of size 1 at addr ffff888019be7ddf by task writefile/3787

  CPU: 4 PID: 3787 Comm: writefile Not tainted 5.16.0-rc7+ #12
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014

due to accessing buf one char before its start.

Prevent such out-of-bounds access.

  [ bp: Productize into a proper patch. Link below is the next best
    thing because the original mail didn't get archived on lore. ]

Fixes: 0451d14d05 ("EDAC, mce_amd_inj: Modify flags attribute to use string arguments")
Signed-off-by: Zhang Zixun <zhang133010@icloud.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/YcnePfF1OOqoQwrX@zn.tnic/
2021-12-28 11:45:36 +01:00
Yazen Ghannam
91f75eb481 x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration
AMD systems currently lay out MCA bank types such that the type of bank
number "i" is either the same across all CPUs or is Reserved/Read-as-Zero.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     CS
    3      SMU    RAZ

Future AMD systems will lay out MCA bank types such that the type of
bank number "i" may be different across CPUs.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     NBIO
    3      SMU    RAZ

Change the structures that cache MCA bank types to be per-CPU and update
smca_get_bank_type() to handle this change.

Move some SMCA-specific structures to amd.c from mce.h, since they no
longer need to be global.

Break out the "count" for bank types from struct smca_hwid, since this
should provide a per-CPU count rather than a system-wide count.

Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The
values in this array should not change at runtime.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
2021-12-22 17:22:09 +01:00
Yazen Ghannam
5176a93ab2 x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
Add HWID and McaType values for new SMCA bank types, and add their error
descriptions to edac_mce_amd.

The "PHY" bank types all have the same error descriptions, and the NBIF
and SHUB bank types have the same error descriptions. So reuse the same
arrays where appropriate.

  [ bp: Remove useless comments over hwid types. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-2-yazen.ghannam@amd.com
2021-12-22 17:19:18 +01:00
Borislav Petkov
b64dfcde1c x86/mm: Prevent early boot triple-faults with instrumentation
Commit in Fixes added a global TLB flush on the early boot path, after
the kernel switches off of the trampoline page table.

Compiler profiling options enabled with GCOV_PROFILE add additional
measurement code on clang which needs to be initialized prior to
use. The global flush in x86_64_start_kernel() happens before those
initializations can happen, leading to accessing invalid memory.
GCOV_PROFILE builds with gcc are still ok so this is clang-specific.

The second issue this fixes is with KASAN: for a similar reason,
kasan_early_init() needs to have happened before KASAN-instrumented
functions are called.

Therefore, reorder the flush to happen after the KASAN early init
and prevent the compilers from adding profiling instrumentation to
native_write_cr4().

Fixes: f154f29085 ("x86/mm/64: Flush global TLB on boot and AP bringup")
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Carel Si <beibei.si@intel.com>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Link: https://lore.kernel.org/r/20211209144141.GC25654@xsang-OptiPlex-9020
2021-12-22 11:51:20 +01:00
Tianyu Lan
062a5c4260 hyper-v: Enable swiotlb bounce buffer for Isolation VM
hyperv Isolation VM requires bounce buffer support to copy
data from/to encrypted memory and so enable swiotlb force
mode to use swiotlb bounce buffer for DMA transaction.

In Isolation VM with AMD SEV, the bounce buffer needs to be
accessed via extra address space which is above shared_gpa_boundary
(E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
The access physical address will be original physical address +
shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
spec is called virtual top of memory(vTOM). Memory addresses below
vTOM are automatically treated as private while memory above
vTOM is treated as shared.

Swiotlb bounce buffer code calls set_memory_decrypted()
to mark bounce buffer visible to host and map it in extra
address space via memremap. Populate the shared_gpa_boundary
(vTOM) via swiotlb_unencrypted_base variable.

The map function memremap() can't work in the early place
(e.g ms_hyperv_init_platform()) and so call swiotlb_update_mem_
attributes() in the hyperv_init().

Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211213071407.314309-4-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-12-20 18:01:09 +00:00
Borislav Petkov
1acd85feba x86/mce: Check regs before accessing it
Commit in Fixes accesses pt_regs before checking whether it is NULL or
not. Make sure the NULL pointer check happens first.

Fixes: 0a5b288e85 ("x86/mce: Prevent severity computation from being instrumented")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211217102029.GA29708@kili
2021-12-20 11:41:02 +01:00
Borislav Petkov
e3d72e8eee x86/mce: Mark mce_start() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x4ae: call to __const_udelay() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-13-bp@alien8.de
2021-12-13 14:14:05 +01:00
Borislav Petkov
edb3d07e24 x86/mce: Mark mce_timed_out() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x482: call to mce_timed_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-12-bp@alien8.de
2021-12-13 14:13:54 +01:00
Borislav Petkov
75581a203e x86/mce: Move the tainting outside of the noinstr region
add_taint() is yet another external facility which the #MC handler
calls. Move that tainting call into the instrumentation-allowed part of
the handler.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x617: call to add_taint() leaves .noinstr.text section

While at it, allow instrumentation around the mce_log() call.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x690: call to mce_log() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-11-bp@alien8.de
2021-12-13 14:13:35 +01:00
Borislav Petkov
db6c996d6c x86/mce: Mark mce_read_aux() noinstr
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de
2021-12-13 14:13:23 +01:00
Borislav Petkov
b4813539d3 x86/mce: Mark mce_end() noinstr
It is called by the #MC handler which is noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de
2021-12-13 14:13:12 +01:00
Borislav Petkov
3c7ce80a81 x86/mce: Mark mce_panic() noinstr
And allow instrumentation inside it because it does calls to other
facilities which will not be tagged noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de
2021-12-13 14:13:01 +01:00
Borislav Petkov
0a5b288e85 x86/mce: Prevent severity computation from being instrumented
Mark all the MCE severity computation logic noinstr and allow
instrumentation when it "calls out".

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc5d: call to mce_severity() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-7-bp@alien8.de
2021-12-13 14:12:48 +01:00
Borislav Petkov
4fbce464db x86/mce: Allow instrumentation during task work queueing
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de
2021-12-13 14:12:35 +01:00
Borislav Petkov
487d654db3 x86/mce: Remove noinstr annotation from mce_setup()
Instead, sandwitch around the call which is done in noinstr context and
mark the caller - mce_gather_info() - as noinstr.

Also, document what the whole instrumentation strategy with #MC is going
to be in the future and where it all is supposed to be going to.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-5-bp@alien8.de
2021-12-13 14:12:21 +01:00
Borislav Petkov
88f66a4235 x86/mce: Use mce_rdmsrl() in severity checking code
MCA has its own special MSR accessors. Use them.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-4-bp@alien8.de
2021-12-13 14:12:08 +01:00
Borislav Petkov
ad669ec16a x86/mce: Remove function-local cpus variables
Use num_online_cpus() directly.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-3-bp@alien8.de
2021-12-13 14:11:53 +01:00
Borislav Petkov
cd5e0d1fc9 x86/mce: Do not use memset to clear the banks bitmaps
The bitmap is a single unsigned long so no need for the function call.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-2-bp@alien8.de
2021-12-13 14:11:22 +01:00
Peter Zijlstra
5ce8e39f55 x86/sgx: Remove .fixup usage
Create EX_TYPE_FAULT_SGX which does as EX_TYPE_FAULT does, except adds
this extra bit that SGX really fancies having.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.961246679@infradead.org
2021-12-11 09:09:49 +01:00
Colin Ian King
df0114f1f8 x86/resctrl: Remove redundant assignment to variable chunks
The variable chunks is being shifted right and re-assinged the shifted
value which is then returned. Since chunks is not being read afterwards
the assignment is redundant and the >>= operator can be replaced with a
shift >> operator instead.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lkml.kernel.org/r/20211207223735.35173-1-colin.i.king@gmail.com
2021-12-09 09:57:16 -08:00
Jarkko Sakkinen
50468e4313 x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node
== Problem ==

The amount of SGX memory on a system is determined by the BIOS and it
varies wildly between systems.  It can be as small as dozens of MB's
and as large as many GB's on servers.  Just like how applications need
to know how much regular RAM is available, enclave builders need to
know how much SGX memory an enclave can consume.

== Solution ==

Introduce a new sysfs file:

	/sys/devices/system/node/nodeX/x86/sgx_total_bytes

to enumerate the amount of SGX memory available in each NUMA node.
This serves the same function for SGX as /proc/meminfo or
/sys/devices/system/node/nodeX/meminfo does for normal RAM.

'sgx_total_bytes' is needed today to help drive the SGX selftests.
SGX-specific swap code is exercised by creating overcommitted enclaves
which are larger than the physical SGX memory on the system.  They
currently use a CPUID-based approach which can diverge from the actual
amount of SGX memory available.  'sgx_total_bytes' ensures that the
selftests can work efficiently and do not attempt stupid things like
creating a 100,000 MB enclave on a system with 128 MB of SGX memory.

== Implementation Details ==

Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
arch specific attribute group, and add an attribute for the amount of
SGX memory in bytes to each NUMA node:

== ABI Design Discussion ==

As opposed to the per-node ABI, a single, global ABI was considered.
However, this would prevent enclaves from being able to size
themselves so that they fit on a single NUMA node.  Essentially, a
single value would rule out NUMA optimizations for enclaves.

Create a new "x86/" directory inside each "nodeX/" sysfs directory.
'sgx_total_bytes' is expected to be the first of at least a few
sgx-specific files to be placed in the new directory.  Just scanning
/proc/meminfo, these are the no-brainers that we have for RAM, but we
need for SGX:

	MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
	MemFree:        yyyy kB // sgx_free_bytes
	SwapTotal:      zzzz kB // sgx_swapped_bytes

So, at *least* three.  I think we will eventually end up needing
something more along the lines of a dozen.  A new directory (as
opposed to being in the nodeX/ "root") directory avoids cluttering the
root with several "sgx_*" files.

Place the new file in a new "nodeX/x86/" directory because SGX is
highly x86-specific.  It is very unlikely that any other architecture
(or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
as opposed to "x86/" was also considered.  But, there is a real chance
this can get used for other arch-specific purposes.

[ dhansen: rewrite changelog ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
2021-12-09 07:02:22 -08:00
Smita Koralahalli
1e56279a49 x86/mce/inject: Set the valid bit in MCA_STATUS before error injection
MCA handlers check the valid bit in each status register
(MCA_STATUS[Val]) and continue processing the error only if the valid
bit is set.

Set the valid bit unconditionally in the corresponding MCA_STATUS
register and correct any Val=0 injections made by the user as such
errors will get ignored and such injections will be largely pointless.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211104215846.254012-3-Smita.KoralahalliChannabasappa@amd.com
2021-12-08 12:01:01 +01:00
Smita Koralahalli
e48d008bd1 x86/mce/inject: Check if a bank is populated before injecting
The MCA_IPID register uniquely identifies a bank's type on Scalable MCA
(SMCA) systems. When an MCA bank is not populated, the MCA_IPID register
will read as zero and writes to it will be ignored.

On a hw-type error injection (injection which writes the actual MCA
registers in an attempt to cause a real MCE) check the value of this
register before trying to inject the error.

Do not impose any limitations on a sw injection and allow the user to
test out all the decoding paths without relying on the available hardware,
as its purpose is to just test the code.

 [ bp: Heavily massage. ]

Link: https://lkml.kernel.org/r/20211019233641.140275-2-Smita.KoralahalliChannabasappa@amd.com
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211104215846.254012-2-Smita.KoralahalliChannabasappa@amd.com
2021-12-08 12:00:56 +01:00
Andi Kleen
9c7e2634f6 x86/cpu: Don't write CSTAR MSR on Intel CPUs
Intel CPUs do not support SYSCALL in 32-bit mode, but the kernel
initializes MSR_CSTAR unconditionally. That MSR write is normally
ignored by the CPU, but in a TDX guest it raises a #VE trap.

Exclude Intel CPUs from the MSR_CSTAR initialization.

[ tglx: Fixed the subject line and removed the redundant comment. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211119035803.4012145-1-sathyanarayanan.kuppuswamy@linux.intel.com
2021-11-25 00:40:34 +01:00
Linus Torvalds
40c93d7fff Two X86 fixes:
- Move the command line preparation and the early command line parsing
    earlier so that the command line parameters which affect
    early_reserve_memory(), e.g. efi=nosftreserve, are taken into
    account. This was broken when the invocation of early_reserve_memory()
    was moved recently.
 
  - Use an atomic type for the SGX page accounting, which is read and
    written lockless, to plug various race conditions related to it.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGaYPoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTM7D/9bivpPDiNzfjUV7kNKx6aTUwPjdFer
 G0RuuDZqkpJm9j7+51VnQNFssIfAFtzKMJn/DuGVoXF0ERxXMEhJVHiSTeOlCjJU
 u1760qFYlAQ1mwvKVNLk2SenWuNZwwgUneY3VvvS4qYsSq7PsbYlekuddPeX0Nws
 AJ1llOoCoBkm5vNZ5c3/CmhY6iPSRQQkDmbA11cZZUyWl2uouSk21+ax24IDCvW3
 E8Aq9QqB6ND2uukB32kQ7Wp7/UZ4inJHTUXF9UF/8P+N1ftDWeKDjQz6y9U19Tsd
 ivuMr6NqqAos/Fpo9PhlGns07C8HeKGf4ronnt9cUMqjzYWfdS+pRT+0pQR+vIPa
 M8+jyHQplzeOX9/nKOkpV+u0tYP2zgx8e7yeu5Sion8TqsKqiNOy9+D0D2utUDmw
 1x3DzuzGx/mK2OX5gjGSx4ZbS4u0DIAWnF8vB9YfgEfcnqpxr6KdbrY0bLatIbKv
 ip9mh0rRYeTkTZ4FGmvy3hFgAmadCODWxva/7AhzbWVZoM+AShwnTDsipkRaaj3V
 nMdgcVix8qVDg9YIAn9ziZbxkXKQUXFJn7lZj3KBeWjKcV2svA89S/9YL6JTaSeW
 TJ4X6wK8EoApKhEasZhufXBNAl9EmQlBS1k9pHiIjKVuRgBGlzMuEhzvrqZM2+rA
 KaUQSwBN6Ij6Dg==
 =CJUK
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

 - Move the command line preparation and the early command line parsing
   earlier so that the command line parameters which affect
   early_reserve_memory(), e.g. efi=nosftreserve, are taken into
   account. This was broken when the invocation of
   early_reserve_memory() was moved recently.

 - Use an atomic type for the SGX page accounting, which is read and
   written locklessly, to plug various race conditions related to it.

* tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Fix free page accounting
  x86/boot: Pull up cmdline preparation and early param parsing
2021-11-21 11:25:19 -08:00
Ingo Molnar
5c16f7ee03 Merge branch 'x86/urgent' into x86/sgx, to resolve conflict
Conflicts:
	arch/x86/kernel/cpu/sgx/main.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-11-19 09:31:55 +01:00
Zhaolong Zhang
2322b532ad x86/mce: Get rid of cpu_missing
Get rid of cpu_missing because

  7bb39313cd ("x86/mce: Make mce_timed_out() identify holdout CPUs")

provides a more detailed message about which CPUs are missing.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211109112345.2673403-1-zhangzl2013@126.com
2021-11-17 15:32:31 +01:00
Reinette Chatre
ac5d272a0a x86/sgx: Fix free page accounting
The SGX driver maintains a single global free page counter,
sgx_nr_free_pages, that reflects the number of free pages available
across all NUMA nodes. Correspondingly, a list of free pages is
associated with each NUMA node and sgx_nr_free_pages is updated
every time a page is added or removed from any of the free page
lists. The main usage of sgx_nr_free_pages is by the reclaimer
that runs when it (sgx_nr_free_pages) goes below a watermark
to ensure that there are always some free pages available to, for
example, support efficient page faults.

With sgx_nr_free_pages accessed and modified from a few places
it is essential to ensure that these accesses are done safely but
this is not the case. sgx_nr_free_pages is read without any
protection and updated with inconsistent protection by any one
of the spin locks associated with the individual NUMA nodes.
For example:

      CPU_A                                 CPU_B
      -----                                 -----
 spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
 ...                                   ...
 sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;

 spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);

Since sgx_nr_free_pages may be protected by different spin locks
while being modified from different CPUs, the following scenario
is possible:

      CPU_A                                CPU_B
      -----                                -----
{sgx_nr_free_pages = 100}
 spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
 sgx_nr_free_pages--;                  sgx_nr_free_pages--;
 /* LOAD sgx_nr_free_pages = 100 */    /* LOAD sgx_nr_free_pages = 100 */
 /* sgx_nr_free_pages--          */    /* sgx_nr_free_pages--          */
 /* STORE sgx_nr_free_pages = 99 */    /* STORE sgx_nr_free_pages = 99 */
 spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);

In the above scenario, sgx_nr_free_pages is decremented from two CPUs
but instead of sgx_nr_free_pages ending with a value that is two less
than it started with, it was only decremented by one while the number
of free pages were actually reduced by two. The consequence of
sgx_nr_free_pages not being protected is that its value may not
accurately reflect the actual number of free pages on the system,
impacting the availability of free pages in support of many flows.

The problematic scenario is when the reclaimer does not run because it
believes there to be sufficient free pages while any attempt to allocate
a page fails because there are no free pages available. In the SGX driver
the reclaimer's watermark is only 32 pages so after encountering the
above example scenario 32 times a user space hang is possible when there
are no more free pages because of repeated page faults caused by no
free pages made available.

The following flow was encountered:
asm_exc_page_fault
 ...
   sgx_vma_fault()
     sgx_encl_load_page()
       sgx_encl_eldu() // Encrypted page needs to be loaded from backing
                       // storage into newly allocated SGX memory page
         sgx_alloc_epc_page() // Allocate a page of SGX memory
           __sgx_alloc_epc_page() // Fails, no free SGX memory
           ...
           if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
             wake_up(&ksgxd_waitq);
           return -EBUSY; // Return -EBUSY giving reclaimer time to run
       return -EBUSY;
     return -EBUSY;
   return VM_FAULT_NOPAGE;

The reclaimer is triggered in above flow with the following code:

static bool sgx_should_reclaim(unsigned long watermark)
{
        return sgx_nr_free_pages < watermark &&
               !list_empty(&sgx_active_page_list);
}

In the problematic scenario there were no free pages available yet the
value of sgx_nr_free_pages was above the watermark. The allocation of
SGX memory thus always failed because of a lack of free pages while no
free pages were made available because the reclaimer is never started
because of sgx_nr_free_pages' incorrect value. The consequence was that
user space kept encountering VM_FAULT_NOPAGE that caused the same
address to be accessed repeatedly with the same result.

Change the global free page counter to an atomic type that
ensures simultaneous updates are done safely. While doing so, move
the updating of the variable outside of the spin lock critical
section to which it does not belong.

Cc: stable@vger.kernel.org
Fixes: 901ddbb9ec ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com
2021-11-16 11:17:43 -08:00
Tony Luck
a495cbdffa x86/sgx: Add SGX infrastructure to recover from poison
Provide a recovery function sgx_memory_failure(). If the poison was
consumed synchronously then send a SIGBUS. Note that the virtual
address of the access is not included with the SIGBUS as is the case
for poison outside of SGX enclaves. This doesn't matter as addresses
of code/data inside an enclave is of little to no use to code executing
outside the (now dead) enclave.

Poison found in a free page results in the page being moved from the
free list to the per-node poison page list.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-5-tony.luck@intel.com
2021-11-15 11:13:16 -08:00
Tony Luck
992801ae92 x86/sgx: Initial poison handling for dirty and free pages
A memory controller patrol scrubber can report poison in a page
that isn't currently being used.

Add "poison" field in the sgx_epc_page that can be set for an
sgx_epc_page. Check for it:
1) When sanitizing dirty pages
2) When freeing epc pages

Poison is a new field separated from flags to avoid having to make all
updates to flags atomic, or integrate poison state changes into some
other locking scheme to protect flags (Currently just sgx_reclaimer_lock
which protects the SGX_EPC_PAGE_RECLAIMER_TRACKED bit in page->flags).

In both cases place the poisoned page on a per-node list of poisoned
epc pages to make sure it will not be reallocated.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-4-tony.luck@intel.com
2021-11-15 11:13:16 -08:00
Tony Luck
40e0e7843e x86/sgx: Add infrastructure to identify SGX EPC pages
X86 machine check architecture reports a physical address when there
is a memory error. Handling that error requires a method to determine
whether the physical address reported is in any of the areas reserved
for EPC pages by BIOS.

SGX EPC pages do not have Linux "struct page" associated with them.

Keep track of the mapping from ranges of EPC pages to the sections
that contain them using an xarray. N.B. adds CONFIG_XARRAY_MULTI to
the SGX dependecies. So "select" that in arch/x86/Kconfig for X86/SGX.

Create a function arch_is_platform_page() that simply reports whether an
address is an EPC page for use elsewhere in the kernel. The ACPI error
injection code needs this function and is typically built as a module,
so export it.

Note that arch_is_platform_page() will be slower than other similar
"what type is this page" functions that can simply check bits in the
"struct page".  If there is some future performance critical user of
this function it may need to be implemented in a more efficient way.

Note also that the current implementation of xarray allocates a few
hundred kilobytes for this usage on a system with 4GB of SGX EPC memory
configured. This isn't ideal, but worth it for the code simplicity.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-3-tony.luck@intel.com
2021-11-15 11:13:16 -08:00
Tony Luck
d6d261bded x86/sgx: Add new sgx_epc_page flag bit to mark free pages
SGX EPC pages go through the following life cycle:

        DIRTY ---> FREE ---> IN-USE --\
                    ^                 |
                    \-----------------/

Recovery action for poison for a DIRTY or FREE page is simple. Just
make sure never to allocate the page. IN-USE pages need some extra
handling.

Add a new flag bit SGX_EPC_PAGE_IS_FREE that is set when a page
is added to a free list and cleared when the page is allocated.

Notes:

1) These transitions are made while holding the node->lock so that
   future code that checks the flags while holding the node->lock
   can be sure that if the SGX_EPC_PAGE_IS_FREE bit is set, then the
   page is on the free list.

2) Initially while the pages are on the dirty list the
   SGX_EPC_PAGE_IS_FREE bit is cleared.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-2-tony.luck@intel.com
2021-11-15 11:13:16 -08:00
Sean Christopherson
f3e613e72f x86/hyperv: Move required MSRs check to initial platform probing
Explicitly check for MSR_HYPERCALL and MSR_VP_INDEX support when probing
for running as a Hyper-V guest instead of waiting until hyperv_init() to
detect the bogus configuration.  Add messages to give the admin a heads
up that they are likely running on a broken virtual machine setup.

At best, silently disabling Hyper-V is confusing and difficult to debug,
e.g. the kernel _says_ it's using all these fancy Hyper-V features, but
always falls back to the native versions.  At worst, the half baked setup
will crash/hang the kernel.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20211104182239.1302956-3-seanjc@google.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-11-15 12:37:08 +00:00
Yazen Ghannam
0b746e8c1e x86/MCE/AMD, EDAC/amd64: Move address translation to AMD64 EDAC
The address translation code used for current AMD systems is
non-architectural. So move it to EDAC.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211028175728.121452-2-yazen.ghannam@amd.com
2021-11-15 12:36:32 +01:00
Linus Torvalds
1654e95ee3 - Add the model number of a new, Raptor Lake CPU, to intel-family.h
- Do not log spurious corrected MCEs on SKL too, due to an erratum
 
 - Clarify the path of paravirt ops patches upstream
 
 - Add an optimization to avoid writing out AMX components to sigframes
 when former are in init state
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGQ3CgACgkQEsHwGGHe
 VUoLAA/+NXRvcBHYkLaByT9f4OI6B79HzyguIBSfipYiw8ir0H7uEdV5FUCCUgCz
 egBRVFpOsXWt1teeuu6ViO+WBHncUxG/ryZ0ka35lri/3kuVYnugZExWDs4MrGR5
 vehRXehOxYNRaYc3oLYjubSbxqF1nWz3WWfGfhiBKk0jT/S1T9tX6lsRXlKsJCgj
 M4x5aqBWP8HTbFQfqjdHwagNitmSKzgjZvMcC4UWcql33ZCycbjvRdrAzBtw7WRI
 UBvgxWVmeMoagu5fqEOoph1oSoFxWuFrweFUjnxJmT6uZrTsfF7BVgXkxdG6eYUy
 2Xogcd4bPDBiRgbs0vPEog1tyyrKHOQ6p1pvksySKMPq6ULcSZ6hBpEZRpgr6Y9u
 0jB3P6weQgCckx5Hd+iwvX1a+GvEuHSEqAE+j160wFyrsBS5Cir3P1WqthWaPd5I
 3nH3h955PokUHPUioUhdf+8cfuP6h6K0nz1gdYI8GR8+fJHhEceT+pLLeyIxj/VM
 yr+bq+V7D6Cg62w3z3s9Dzg2XKpxStu1R9L1N/K8MtIGf6Uc7paL6xR27XxhmBp5
 Y6bGZw0mxxFhp6AEsFWo3rwLL9Dl5DmFcfgUHHpPK5VP0pVWp48Uapx2Hi2/JzAo
 c1o4UkPQa/EZJBPTklmGkS1JNp/2TsEL4Fw7sew+j7DWtsJpCfk=
 =Ge2T
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add the model number of a new, Raptor Lake CPU, to intel-family.h

 - Do not log spurious corrected MCEs on SKL too, due to an erratum

 - Clarify the path of paravirt ops patches upstream

 - Add an optimization to avoid writing out AMX components to sigframes
   when former are in init state

* tag 'x86_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add Raptor Lake to Intel family
  x86/mce: Add errata workaround for Skylake SKX37
  MAINTAINERS: Add some information to PARAVIRT_OPS entry
  x86/fpu: Optimize out sigframe xfeatures when in init state
2021-11-14 09:29:03 -08:00
Dave Jones
e629fc1407 x86/mce: Add errata workaround for Skylake SKX37
Errata SKX37 is word-for-word identical to the other errata listed in
this workaround.   I happened to notice this after investigating a CMCI
storm on a Skylake host.  While I can't confirm this was the root cause,
spurious corrected errors does sound like a likely suspect.

Fixes: 2976908e41 ("x86/mce: Do not log spurious corrected mce errors")
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20211029205759.GA7385@codemonkey.org.uk
2021-11-12 11:43:35 -08:00
Linus Torvalds
95faf6ba65 Driver core changes for 5.16-rc1
Here is the big set of driver core changes for 5.16-rc1.
 
 All of these have been in linux-next for a while now with no reported
 problems.
 
 Included in here are:
 	- big update and cleanup of the sysfs abi documentation files
 	  and scripts from Mauro.  We are almost at the place where we
 	  can properly check that the running kernel's sysfs abi is
 	  documented fully.
 	- firmware loader updates
 	- dyndbg updates
 	- kernfs cleanups and fixes from Christoph
 	- device property updates
 	- component fix
 	- other minor driver core cleanups and fixes
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYYPbjQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ync9gCfXKMUI1GAnCfJWAwTdTcd18q5akoAoMw32/AH
 0yh5TjAWFyFd7xz5d7qs
 =itsC
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the big set of driver core changes for 5.16-rc1.

  All of these have been in linux-next for a while now with no reported
  problems.

  Included in here are:

   - big update and cleanup of the sysfs abi documentation files and
     scripts from Mauro. We are almost at the place where we can
     properly check that the running kernel's sysfs abi is documented
     fully.

   - firmware loader updates

   - dyndbg updates

   - kernfs cleanups and fixes from Christoph

   - device property updates

   - component fix

   - other minor driver core cleanups and fixes"

* tag 'driver-core-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (122 commits)
  device property: Drop redundant NULL checks
  x86/build: Tuck away built-in firmware under FW_LOADER
  vmlinux.lds.h: wrap built-in firmware support under FW_LOADER
  firmware_loader: move struct builtin_fw to the only place used
  x86/microcode: Use the firmware_loader built-in API
  firmware_loader: remove old DECLARE_BUILTIN_FIRMWARE()
  firmware_loader: formalize built-in firmware API
  component: do not leave master devres group open after bind
  dyndbg: refine verbosity 1-4 summary-detail
  gpiolib: acpi: Replace custom code with device_match_acpi_handle()
  i2c: acpi: Replace custom function with device_match_acpi_handle()
  driver core: Provide device_match_acpi_handle() helper
  dyndbg: fix spurious vNpr_info change
  dyndbg: no vpr-info on empty queries
  dyndbg: vpr-info on remove-module complete, not starting
  device property: Add missed header in fwnode.h
  Documentation: dyndbg: Improve cli param examples
  dyndbg: Remove support for ddebug_query param
  dyndbg: make dyndbg a known cli param
  dyndbg: show module in vpr-info in dd-exec-queries
  ...
2021-11-04 08:32:38 -07:00
Dave Hansen
30d02551ba x86/fpu: Optimize out sigframe xfeatures when in init state
tl;dr: AMX state is ~8k.  Signal frames can have space for this
~8k and each signal entry writes out all 8k even if it is zeros.
Skip writing zeros for AMX to speed up signal delivery by about
4% overall when AMX is in its init state.

This is a user-visible change to the sigframe ABI.

== Hardware XSAVE Background ==

XSAVE state components may be tracked by the processor as being
in their initial configuration.  Software can detect which
features are in this configuration by looking at the XSTATE_BV
field in an XSAVE buffer or with the XGETBV(1) instruction.

Both the XSAVE and XSAVEOPT instructions enumerate features s
being in the initial configuration via the XSTATE_BV field in the
XSAVE header,  However, XSAVEOPT declines to actually write
features in their initial configuration to the buffer.  XSAVE
writes the feature unconditionally, regardless of whether it is
in the initial configuration or not.

Basically, XSAVE users never need to inspect XSTATE_BV to
determine if the feature has been written to the buffer.
XSAVEOPT users *do* need to inspect XSTATE_BV.  They might also
need to clear out the buffer if they want to make an isolated
change to the state, like modifying one register.

== Software Signal / XSAVE Background ==

Signal frames have historically been written with XSAVE itself.
Each state is written in its entirety, regardless of being in its
initial configuration.

In other words, the signal frame ABI uses the XSAVE behavior, not
the XSAVEOPT behavior.

== Problem ==

This means that any application which has acquired permission to
use AMX via ARCH_REQ_XCOMP_PERM will write 8k of state to the
signal frame.  This 8k write will occur even when AMX was in its
initial configuration and software *knows* this because of
XSTATE_BV.

This problem also exists to a lesser degree with AVX-512 and its
2k of state.  However, AVX-512 use does not require
ARCH_REQ_XCOMP_PERM and is more likely to have existing users
which would be impacted by any change in behavior.

== Solution ==

Stop writing out AMX xfeatures which are in their initial state
to the signal frame.  This effectively makes the signal frame
XSAVE buffer look as if it were written with a combination of
XSAVEOPT and XSAVE behavior.  Userspace which handles XSAVEOPT-
style buffers should be able to handle this naturally.

For now, include only the AMX xfeatures: XTILE and XTILEDATA in
this new behavior.  These require new ABI to use anyway, which
makes their users very unlikely to be broken.  This XSAVEOPT-like
behavior should be expected for all future dynamic xfeatures.  It
may also be extended to legacy features like AVX-512 in the
future.

Only attempt this optimization on systems with dynamic features.
Disable dynamic feature support (XFD) if XGETBV1 is unavailable
by adding a CPUID dependency.

This has been measured to reduce the *overall* cycle cost of
signal delivery by about 4%.

Fixes: 2308ee57d9 ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "Chang S. Bae" <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/r/20211102224750.FA412E26@davehans-spike.ostc.intel.com
2021-11-03 22:42:35 +01:00
Linus Torvalds
56d3375448 drm for 5.16-rc1
core:
 - improve dma_fence, lease and resv documentation
 - shmem-helpers: allocate WC pages on x86, use vmf_insert_pin
 - sched fixes/improvements
 - allow empty drm leases
 - add dma resv iterator
 - add more DP 2.0 headers
 - DP MST helper improvements for DP2.0
 
 dma-buf:
 - avoid warnings, remove fence trace macros
 
 bridge:
 - new helper to get rid of panels
 - probe improvements for it66121
 - enable DSI EOTP for anx7625
 
 fbdev:
 - efifb: release runtime PM on destroy
 
 ttm:
 - kerneldoc switch
 - helper to clear all DMA mappings
 - pool shrinker optimizaton
 - remove ttm_tt_destroy_common
 - update ttm_move_memcpy for async use
 
 panel:
 - add new panel-edp driver
 
 amdgpu:
  - Initial DP 2.0 support
  - Initial USB4 DP tunnelling support
  - Aldebaran MCE support
  - Modifier support for DCC image stores for GFX 10.3
  - Display rework for better FP code handling
  - Yellow Carp/Cyan Skillfish updates
  - Cyan Skillfish display support
  - convert vega/navi to IP discovery asic enumeration
  - validate IP discovery table
  - RAS improvements
  - Lots of fixes
 
  i915:
  - DG1 PCI IDs + LMEM discovery/placement
  - DG1 GuC submission by default
  - ADL-S PCI IDs updated + enabled by default
  - ADL-P (XE_LPD) fixed and updates
  - DG2 display fixes
  - PXP protected object support for Gen12 integrated
  - expose multi-LRC submission interface for GuC
  - export logical engine instance to user
  - Disable engine bonding on Gen12+
  - PSR cleanup
  - PSR2 selective fetch by default
  - DP 2.0 prep work
  - VESA vendor block + MSO use of it
  - FBC refactor
  - try again to fix fast-narrow vs slow-wide eDP training
  - use THP when IOMMU enabled
  - LMEM backup/restore for suspend/resume
  - locking simplification
  - GuC major reworking
  - async flip VT-D workaround changes
  - DP link training improvements
  - misc display refactorings
 
 bochs:
 - new PCI ID
 
 rcar-du:
 - Non-contiguious buffer import support for rcar-du
 - r8a779a0 support prep
 
 omapdrm:
 - COMPILE_TEST fixes
 
 sti:
 - COMPILE_TEST fixes
 
 msm:
 - fence ordering improvements
 - eDP support in DP sub-driver
 - dpu irq handling cleanup
 - CRC support for making igt happy
 - NO_CONNECTOR bridge support
 - dsi: 14nm phy support for msm8953
 - mdp5: msm8x53, sdm450, sdm632 support
 
 stm:
 - layer alpha + zpo support
 
 v3d:
 - fix Vulkan CTS failure
 - support multiple sync objects
 
 gud:
 - add R8/RGB332/RGB888 pixel formats
 
 vc4:
 - convert to new bridge helpers
 
 vgem:
 - use shmem helpers
 
 virtio:
 - support mapping exported vram
 
 zte:
 - remove obsolete driver
 
 rockchip:
 - use bridge attach no connector for LVDS/RGB
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmGByPYACgkQDHTzWXnE
 hr6fxA//cXUvTHlEtF7UJDBRAYv+9lXH39NbGYU4aLJuBNlZztCuUi5JOSyDFDH1
 N9VI5biVseev2PEnCzJUubWxTqbUO7FBQTw0TyvZ4Eqn+UZMuFeo0dvdKZRAkvjV
 VHSUc0fm0+WSYanKUK7XK0fwG8aE6JVyYngzgKPSjifhszTdiiRsbU21iTinFhkS
 rgh3HEVELp+LqfoG4qzAYqFUjYqUjvCjd/hX/UkzCII8ZXKr38/4127e95443WOk
 +jes0gWGJe9TvSDrqo9TMx4qukcOniINFUvnzoD2RhOS+Jzr/i5rBh51Xy92g3NO
 Q7hy6byZdk/ZO/MXCDQ2giUOkBiqn5fQjlRGQp4iAZYw9pb3HU+/xrTq0BWVWd8o
 /vmzZYEKKU/sCGpxVDMZxsHV3mXIuVBvuZq6bjmSGcybgOBCiDx5F/Rum4nY2yHp
 lr3cuc0HP3m3f4b/HVvACO4tGd1nDDpVcon7CuhBB7HB7t6Zl9u18qc/qFw0tCTh
 3sgAhno6XFXtPFcSX2KAeeg0mhKDKKrsOnq5y3bDRr05Z0jLocJk95aXEKs6em4j
 gbyHwNaX3CHtiCnFn2/5169+n1K7zqHBtVSGmQlmFDv55rcdx7L3Spk7tCahQeSQ
 ur24r+sEggm8d5Wjl+MYq6wW3oP31s04JFaeV6oCkaSp1wS+alg=
 =jdhH
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2021-11-03' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "Summary below. i915 starts to add support for DG2 GPUs, enables DG1
  and ADL-S support by default, lots of work to enable DisplayPort 2.0
  across drivers. Lots of documentation updates and fixes across the
  board.

  core:
   - improve dma_fence, lease and resv documentation
   - shmem-helpers: allocate WC pages on x86, use vmf_insert_pin
   - sched fixes/improvements
   - allow empty drm leases
   - add dma resv iterator
   - add more DP 2.0 headers
   - DP MST helper improvements for DP2.0

  dma-buf:
   - avoid warnings, remove fence trace macros

  bridge:
   - new helper to get rid of panels
   - probe improvements for it66121
   - enable DSI EOTP for anx7625

  fbdev:
   - efifb: release runtime PM on destroy

  ttm:
   - kerneldoc switch
   - helper to clear all DMA mappings
   - pool shrinker optimizaton
   - remove ttm_tt_destroy_common
   - update ttm_move_memcpy for async use

  panel:
   - add new panel-edp driver

  amdgpu:
   - Initial DP 2.0 support
   - Initial USB4 DP tunnelling support
   - Aldebaran MCE support
   - Modifier support for DCC image stores for GFX 10.3
   - Display rework for better FP code handling
   - Yellow Carp/Cyan Skillfish updates
   - Cyan Skillfish display support
   - convert vega/navi to IP discovery asic enumeration
   - validate IP discovery table
   - RAS improvements
   - Lots of fixes

  i915:
   - DG1 PCI IDs + LMEM discovery/placement
   - DG1 GuC submission by default
   - ADL-S PCI IDs updated + enabled by default
   - ADL-P (XE_LPD) fixed and updates
   - DG2 display fixes
   - PXP protected object support for Gen12 integrated
   - expose multi-LRC submission interface for GuC
   - export logical engine instance to user
   - Disable engine bonding on Gen12+
   - PSR cleanup
   - PSR2 selective fetch by default
   - DP 2.0 prep work
   - VESA vendor block + MSO use of it
   - FBC refactor
   - try again to fix fast-narrow vs slow-wide eDP training
   - use THP when IOMMU enabled
   - LMEM backup/restore for suspend/resume
   - locking simplification
   - GuC major reworking
   - async flip VT-D workaround changes
   - DP link training improvements
   - misc display refactorings

  bochs:
   - new PCI ID

  rcar-du:
   - Non-contiguious buffer import support for rcar-du
   - r8a779a0 support prep

  omapdrm:
   - COMPILE_TEST fixes

  sti:
   - COMPILE_TEST fixes

  msm:
   - fence ordering improvements
   - eDP support in DP sub-driver
   - dpu irq handling cleanup
   - CRC support for making igt happy
   - NO_CONNECTOR bridge support
   - dsi: 14nm phy support for msm8953
   - mdp5: msm8x53, sdm450, sdm632 support

  stm:
   - layer alpha + zpo support

  v3d:
   - fix Vulkan CTS failure
   - support multiple sync objects

  gud:
   - add R8/RGB332/RGB888 pixel formats

  vc4:
   - convert to new bridge helpers

  vgem:
   - use shmem helpers

  virtio:
   - support mapping exported vram

  zte:
   - remove obsolete driver

  rockchip:
   - use bridge attach no connector for LVDS/RGB"

* tag 'drm-next-2021-11-03' of git://anongit.freedesktop.org/drm/drm: (1259 commits)
  drm/amdgpu/gmc6: fix DMA mask from 44 to 40 bits
  drm/amd/display: MST support for DPIA
  drm/amdgpu: Fix even more out of bound writes from debugfs
  drm/amdgpu/discovery: add SDMA IP instance info for soc15 parts
  drm/amdgpu/discovery: add UVD/VCN IP instance info for soc15 parts
  drm/amdgpu/UAPI: rearrange header to better align related items
  drm/amd/display: Enable dpia in dmub only for DCN31 B0
  drm/amd/display: Fix USB4 hot plug crash issue
  drm/amd/display: Fix deadlock when falling back to v2 from v3
  drm/amd/display: Fallback to clocks which meet requested voltage on DCN31
  drm/amd/display: move FPU associated DCN301 code to DML folder
  drm/amd/display: fix link training regression for 1 or 2 lane
  drm/amd/display: add two lane settings training options
  drm/amd/display: decouple hw_lane_settings from dpcd_lane_settings
  drm/amd/display: implement decide lane settings
  drm/amd/display: adopt DP2.0 LT SCR revision 8
  drm/amd/display: FEC configuration for dpia links in MST mode
  drm/amd/display: FEC configuration for dpia links
  drm/amd/display: Add workaround flag for EDID read on certain docks
  drm/amd/display: Set phy_mux_sel bit in dmub scratch register
  ...
2021-11-02 16:47:49 -07:00
Linus Torvalds
44261f8e28 hyperv-next for 5.16
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmGBMQUTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXmE5B/9MK3Ju+tc6C8eyR3Ic4XBYHJ3voEKO
 M+R90gggBriDOgkz4B8vF+k0aD8wevXAUtmCSXonDzCh5H7GoyfrVZmJEVkwlioH
 ZMSMlFHcjGhCPIXhLbNtfo/NsAYEtT/lRM2lLGCSbdGuKabylXKujVdhuSIcRPdj
 Rj5innUgcAywOoxG6WzFt3JBzM33UQErCGfUF2b7Rvp9E+Zii4vIMxkMzUpnkEHH
 F8WMEdL0DqH5ThOs0MslNgy03pUC9wk1d5DNd9ytYHqiSQtcQZhFHw/P6dxzUFlW
 OptWv31PXUIsiJf4Zi9hmfjgUl+KZHeacZ2hXtidAo86VPcIjVs25OQW
 =40fn
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Initial patch set for Hyper-V isolation VM support (Tianyu Lan)

 - Fix a warning on preemption (Vitaly Kuznetsov)

 - A bunch of misc cleanup patches

* tag 'hyperv-next-signed-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Protect set_hv_tscchange_cb() against getting preempted
  Drivers: hv : vmbus: Adding NULL pointer check
  x86/hyperv: Remove duplicate include
  x86/hyperv: Remove duplicated include in hv_init
  Drivers: hv: vmbus: Remove unused code to check for subchannels
  Drivers: hv: vmbus: Initialize VMbus ring buffer for Isolation VM
  Drivers: hv: vmbus: Add SNP support for VMbus channel initiate message
  x86/hyperv: Add ghcb hvcall support for SNP VM
  x86/hyperv: Add Write/Read MSR registers via ghcb page
  Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in Isolation VM
  x86/hyperv: Add new hvcall guest address host visibility support
  x86/hyperv: Initialize shared memory boundary in the Isolation VM.
  x86/hyperv: Initialize GHCB page in Isolation VM
2021-11-02 10:56:49 -07:00
Linus Torvalds
a5a9e00605 seccomp updates for v5.16-rc1
- set spec_store_bypass_disable & spectre_v2_user to prctl (Andrea Arcangeli)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmGAGAkWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqOWD/4mMFp84IMa/VdCmD6PS+BhisyI
 i7+Hyfisg8AWpjgW4+JihU/6hfsDgs/hNNKbiIopcwc/12KV4M0QIQyF7vmceSwB
 uMsAX7pkobNUUisnrQVbw6boK4hrBvrV3STlVdRHvNlLeQLQIu4UN3+9UMj/qsmh
 46ltdxR489oDDLXFgMkKq9auVP2t5t4fbyRmgBPLSKIXaOxIhWck3kUQwt/Rbr44
 M87/Xr4iQ0w4ddiBFJz9GOHQ5Iz08ms4dBfO+e5FSl6I69Nt6q836el35c/6j4y8
 r7C21WU088MSkjk75RCa3v2sq8db2CjLe+wBugq+yYC29qGgxtTiUZaoiNQCN5bL
 DIRfl1iU5Ge1wEKorpr3DR6DksmfJO4MNPdMo4CcVZT3Gkdi7udLHfrEI82xgdDl
 lh1UiJlRx4YNEcDbGBnxCzKGwauqHa2TgPNWulUPdH7OGhUL86FAV49L84uz9lCD
 C/+PKxDqc2XKjbgqMsbuyQ7hzB2KQK/ieEXzduoHxTxIr5vO/viENrbkUiSL8bsO
 6msCVbCIjtFDvW4Ac16IOwGoflJ7vLAIuXIdAYCeN+JXqOVV+FG/MN447Y674FeH
 R84G6JCT82ULEXrKlwuoSSVJEwA5lzP4IwoWm/ujeUbzi1s+7m+7WRpuJe2jZm6c
 zPsCVkNPUrvp82L/wA==
 =NAsc
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:
 "These are x86-specific, but I carried these since they're also
  seccomp-specific.

  This flips the defaults for spec_store_bypass_disable and
  spectre_v2_user from "seccomp" to "prctl", as enough time has passed
  to allow system owners to have updated the defensive stances of their
  various workloads, and it's long overdue to unpessimize seccomp
  threads.

  Extensive rationale and details are in Andrea's main patch.

  Summary:

   - set spec_store_bypass_disable & spectre_v2_user to prctl (Andrea Arcangeli)"

* tag 'seccomp-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  x86: deduplicate the spectre_v2_user documentation
  x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
2021-11-01 17:25:09 -07:00
Linus Torvalds
879dbe9ffe Add a SGX_IOC_VEPC_REMOVE ioctl to the /dev/sgx_vepc virt interface with
which EPC pages can be put back into their uninitialized state without
 having to reopen /dev/sgx_vepc, which could not be possible anymore
 after startup due to security policies.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/x7AACgkQEsHwGGHe
 VUqHXA//YWrukmJ5PQZwWqkXGo6h42JWhIdNfSC2c1SVdz1cioGUCCswALTX4g8l
 MYYf3eN12GJ296jPh7m9bz8JvlYjdavSm3Y1yzHIjuQ3q6qywHIuYTbsrMD7waUD
 PkcY1TTYgNJ2+f0AgsC4GZhlcpf9g5DqiftW6wvExx5tLUNsVu3Y3gZy/+fajP4f
 s/TMjcdr2QmPsjun00KfoIY4/z0u8LkyRMSwyoxSV6wYdL6rRtfYFWsbEUS+W6Nw
 /VJ0IKl+aBQ1ztsDc4M5h1uy9II2M/Row5k6JjyrdG4X8D6ACSG7cho6qcMjXgcP
 Gac7Im5IyjPEorxqXAgJiMoAl9lU9a2JMVZqPtihYsQW/ygMTdpzP9sBpcZPMevc
 gxQD4gyixwzUa3cyVDzTPBdk/DEuGc2nwn2k9nPvmNxKMonX1oLEiP7hu265mvet
 56DtwKJF9ddtpepO2zFCg1qX+eZnTuhuZNCPsm/pmdGgzI8cyLznho33OgUSZEQY
 c1UisT7HXNRVC/1Q8VBDTU/D9LtIk+2+Q5lQkcNeftI5PYKTXIVddkOkqJ4GhGWJ
 9EasA4UtnhvsLzJ76gxxuUf677ns+1TCo65e7Hu1+X0eTmBJK3boe3aMHvJeHEWH
 Asd+SMkYWfxAlW/arAYhR2JgT9wgEH3pSx4eXnpGwpeValxBPRs=
 =1UYy
 -----END PGP SIGNATURE-----

Merge tag 'x86_sgx_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SGX updates from Borislav Petkov:
 "Add a SGX_IOC_VEPC_REMOVE ioctl to the /dev/sgx_vepc virt interface
  with which EPC pages can be put back into their uninitialized state
  without having to reopen /dev/sgx_vepc, which could not be possible
  anymore after startup due to security policies"

* tag 'x86_sgx_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx/virt: implement SGX_IOC_VEPC_REMOVE ioctl
  x86/sgx/virt: extract sgx_vepc_remove_page
2021-11-01 15:54:07 -07:00
Linus Torvalds
e0f4c59dc4 - Start checking a CPUID bit on AMD Zen3 which states that the CPU
clears the segment base when a null selector is written. Do the explicit
 detection on older CPUs, zen2 and hygon specifically, which have the
 functionality but do not advertize the CPUID bit. Factor in the presence
 of a hypervisor underneath the kernel and avoid doing the explicit check
 there which the HV might've decided to not advertize for migration
 safety reasons, a.o.
 
 - Add support for a new X86 CPU vendor: VORTEX. Needed for whitelisting
 those CPUs in the hardware vulnerabilities detection
 
 - Force the compiler to use rIP-relative addressing in the fallback path of
 static_cpu_has(), in order to avoid unnecessary register pressure
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/wRgACgkQEsHwGGHe
 VUoGQBAAk9V9//FMoENuGFGul/IK8+VBibTfztYgaPvm7vjMDYaYuRBCQiZg5Y8U
 D14pwkg7CuRa6iwZmrk/X/y6FVjo5BJA//ROk/n/9JNvV5QUp3/o00uLiziv80K3
 H6Wm3PUyGgkpBuJg+/K8SLE9UQ6uSh4nsykS+70Dcd45DtkC/vH8pkDs5Q1fVQwb
 7AuOuWTCWKUYOMFYWFI3a9D8tZYhg99ABREbXBaJGiGdIlZKNVe/7W8qQw5s6cVA
 cD5Q2ILY2RCGP55ZQiWoFy3XNP3/ygvZ7Zm1ARYUvUMR2Y5X2XJWN/B6oMbc0oEu
 OZsDDA/ILYcah9eBV/zk4ON/1djksp1iWNXNxjct0cNBPAKxi6T/HhHuIHBtzvW+
 zDyBWUMLlv1m2i1oW4J4NuNJJi9Gaz+7PesmI7C0OQPgywR8UqqfMD+TzlEHWya1
 YqYqI0f3aiyC/sLjUp3GSA7a9sWSd3BZfyAlLBJZCxyXAxX92tXX5BRPh/KYbnJn
 c/NaYA6X4m4Rdvr0gKKtCklaC6w4GLzVak6wIvftzHlUYsWX21BhnTkQrciKbqc+
 AKWed41AO+4pDHROePxc409x3UZolti+1RandikrztIVAolVJ6W/OkHWxXfy28Fg
 iSrtl4M3omv8fCHDaJ26STrXqxH8pIK8noVolwQoXKyAFVyvXTk=
 =rlVy
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Start checking a CPUID bit on AMD Zen3 which states that the CPU
   clears the segment base when a null selector is written. Do the
   explicit detection on older CPUs, zen2 and hygon specifically, which
   have the functionality but do not advertize the CPUID bit. Factor in
   the presence of a hypervisor underneath the kernel and avoid doing
   the explicit check there which the HV might've decided to not
   advertize for migration safety reasons, or similar.

 - Add support for a new X86 CPU vendor: VORTEX. Needed for whitelisting
   those CPUs in the hardware vulnerabilities detection

 - Force the compiler to use rIP-relative addressing in the fallback
   path of static_cpu_has(), in order to avoid unnecessary register
   pressure

* tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
  x86/CPU: Add support for Vortex CPUs
  x86/umip: Downgrade warning messages to debug loglevel
  x86/asm: Avoid adding register pressure for the init case in static_cpu_has()
  x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix
2021-11-01 15:33:54 -07:00
Linus Torvalds
158405e888 - Get rid of a bunch of function pointers used in MCA land in favor
of normal functions. This is in preparation of making the MCA code
 noinstr-aware
 
 - When the kernel copies data from user addresses and it encounters a
 machine check, a SIGBUS is sent to that process. Change this action to
 either an -EFAULT which is returned to the user or a short write, making
 the recovery action a lot more user-friendly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/s8sACgkQEsHwGGHe
 VUqnaQ/8DIHkIOF6vy2w56snJwCj0XQYNLO+Clf6sHJ7ukWpWDoAi6HzvjqrBmaa
 bQEdOLeO92wGtVutCQ5ndzq2SJ6UFcZtOulpHyzCpwNinhY2QMsPG6pkSzeaAy/e
 aR4gpTY6pyCJyWl5DXXr7FMzBZVaWYdtZ2szPKmW1d1mLeDIdv5d3hInDbZ48XJF
 o+fZx0uuK0CIuDjDujRNvkPbHXLbBSqSLCTRf66o+sCY5ZXHlAipabxa3UmhHKvd
 dBxMrlObAaDBmDjqpc/YpS4IfWZb7+rHQfVmiq5O85ExXx6cyF6vlM7GI/5VBxSA
 2dVcZX/3TsSqGbFdVygbcF6e/Yl1xhP5AE+pBb5jpzbzEaf4oiM8MDhoMAai3lEL
 7CFsXL2oyAzho7QQsUSkv/hffHHrph2/aUZbGJlz6SdeRF9aoIjZANpcwm44TZrk
 c11Fh1MLTDxx8uhCGrYFXqR8QgeTi4B+8d/CEXNJnkLXZMfSUtoL1iIzhBpsGkv3
 r0JOIG2o5dGX2lLhQOiHZ+us33O1e8mvOli9P1jLoDttoKvNqSqLUuwpBCz4sc0E
 ugfarf7v/R07NN+7SIT+O83ZG8dXxIRPzHm/g7wjZYgyOfEBgFSMBKVWXRotPo/f
 aY88sDVyvF5sbYnUcA6zZANBCKAVfilqdMgCyaoGegoNGzDOCYE=
 =bIZq
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Get rid of a bunch of function pointers used in MCA land in favor of
   normal functions. This is in preparation of making the MCA code
   noinstr-aware

 - When the kernel copies data from user addresses and it encounters a
   machine check, a SIGBUS is sent to that process. Change this action
   to either an -EFAULT which is returned to the user or a short write,
   making the recovery action a lot more user-friendly

* tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Sort mca_config members to get rid of unnecessary padding
  x86/mce: Get rid of the ->quirk_no_way_out() indirect call
  x86/mce: Get rid of msr_ops
  x86/mce: Get rid of machine_check_vector
  x86/mce: Get rid of the mce_severity function pointer
  x86/mce: Drop copyin special case for #MC
  x86/mce: Change to not send SIGBUS error during copy from user
2021-11-01 15:12:04 -07:00
Linus Torvalds
8cb1ae19bf x86/fpu updates:
- Cleanup of extable fixup handling to be more robust, which in turn
    allows to make the FPU exception fixups more robust as well.
 
  - Change the return code for signal frame related failures from explicit
    error codes to a boolean fail/success as that's all what the calling
    code evaluates.
 
  - A large refactoring of the FPU code to prepare for adding AMX support:
 
    - Distangle the public header maze and remove especially the misnomed
      kitchen sink internal.h which is despite it's name included all over
      the place.
 
    - Add a proper abstraction for the register buffer storage (struct
      fpstate) which allows to dynamically size the buffer at runtime by
      flipping the pointer to the buffer container from the default
      container which is embedded in task_struct::tread::fpu to a
      dynamically allocated container with a larger register buffer.
 
    - Convert the code over to the new fpstate mechanism.
 
    - Consolidate the KVM FPU handling by moving the FPU related code into
      the FPU core which removes the number of exports and avoids adding
      even more export when AMX has to be supported in KVM. This also
      removes duplicated code which was of course unnecessary different and
      incomplete in the KVM copy.
 
    - Simplify the KVM FPU buffer handling by utilizing the new fpstate
      container and just switching the buffer pointer from the user space
      buffer to the KVM guest buffer when entering vcpu_run() and flipping
      it back when leaving the function. This cuts the memory requirements
      of a vCPU for FPU buffers in half and avoids pointless memory copy
      operations.
 
      This also solves the so far unresolved problem of adding AMX support
      because the current FPU buffer handling of KVM inflicted a circular
      dependency between adding AMX support to the core and to KVM.  With
      the new scheme of switching fpstate AMX support can be added to the
      core code without affecting KVM.
 
    - Replace various variables with proper data structures so the extra
      information required for adding dynamically enabled FPU features (AMX)
      can be added in one place
 
  - Add AMX (Advanved Matrix eXtensions) support (finally):
 
     AMX is a large XSTATE component which is going to be available with
     Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD)
     which allows to trap the (first) use of an AMX related instruction,
     which has two benefits:
 
     1) It allows the kernel to control access to the feature
 
     2) It allows the kernel to dynamically allocate the large register
        state buffer instead of burdening every task with the the extra 8K
        or larger state storage.
 
     It would have been great to gain this kind of control already with
     AVX512.
 
     The support comes with the following infrastructure components:
 
     1) arch_prctl() to
        - read the supported features (equivalent to XGETBV(0))
        - read the permitted features for a task
        - request permission for a dynamically enabled feature
 
        Permission is granted per process, inherited on fork() and cleared
        on exec(). The permission policy of the kernel is restricted to
        sigaltstack size validation, but the syscall obviously allows
        further restrictions via seccomp etc.
 
     2) A stronger sigaltstack size validation for sys_sigaltstack(2) which
        takes granted permissions and the potentially resulting larger
        signal frame into account. This mechanism can also be used to
        enforce factual sigaltstack validation independent of dynamic
        features to help with finding potential victims of the 2K
        sigaltstack size constant which is broken since AVX512 support was
        added.
 
     3) Exception handling for #NM traps to catch first use of a extended
        feature via a new cause MSR. If the exception was caused by the use
        of such a feature, the handler checks permission for that
        feature. If permission has not been granted, the handler sends a
        SIGILL like the #UD handler would do if the feature would have been
        disabled in XCR0. If permission has been granted, then a new fpstate
        which fits the larger buffer requirement is allocated.
 
        In the unlikely case that this allocation fails, the handler sends
        SIGSEGV to the task. That's not elegant, but unavoidable as the
        other discussed options of preallocation or full per task
        permissions come with their own set of horrors for kernel and/or
        userspace. So this is the lesser of the evils and SIGSEGV caused by
        unexpected memory allocation failures is not a fundamentally new
        concept either.
 
        When allocation succeeds, the fpstate properties are filled in to
        reflect the extended feature set and the resulting sizes, the
        fpu::fpstate pointer is updated accordingly and the trap is disarmed
        for this task permanently.
 
     4) Enumeration and size calculations
 
     5) Trap switching via MSR_XFD
 
        The XFD (eXtended Feature Disable) MSR is context switched with the
        same life time rules as the FPU register state itself. The mechanism
        is keyed off with a static key which is default disabled so !AMX
        equipped CPUs have zero overhead. On AMX enabled CPUs the overhead
        is limited by comparing the tasks XFD value with a per CPU shadow
        variable to avoid redundant MSR writes. In case of switching from a
        AMX using task to a non AMX using task or vice versa, the extra MSR
        write is obviously inevitable.
 
        All other places which need to be aware of the variable feature sets
        and resulting variable sizes are not affected at all because they
        retrieve the information (feature set, sizes) unconditonally from
        the fpstate properties.
 
     6) Enable the new AMX states
 
   Note, this is relatively new code despite the fact that AMX support is in
   the works for more than a year now.
 
   The big refactoring of the FPU code, which allowed to do a proper
   integration has been started exactly 3 weeks ago. Refactoring of the
   existing FPU code and of the original AMX patches took a week and has
   been subject to extensive review and testing. The only fallout which has
   not been caught in review and testing right away was restricted to AMX
   enabled systems, which is completely irrelevant for anyone outside Intel
   and their early access program. There might be dragons lurking as usual,
   but so far the fine grained refactoring has held up and eventual yet
   undetected fallout is bisectable and should be easily addressable before
   the 5.16 release. Famous last words...
 
   Many thanks to Chang Bae and Dave Hansen for working hard on this and
   also to the various test teams at Intel who reserved extra capacity to
   follow the rapid development of this closely which provides the
   confidence level required to offer this rather large update for inclusion
   into 5.16-rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/NkITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodDkEADH4+/nN/QoSUHIuuha5Zptj3g2b16a
 /3TxT9fhwPen/kzMGsUk70s3iWJMA+I5dCfkSZexJ2hfhcRe9cBzZIa1HCawKwf3
 YCISTsO/M+LpeORuZ+TpfFLJKnxNr1SEOl+EYffGhq0AkCjifb9Cnr0JZuoMUzGU
 jpfJZ2bj28ri5lG812DtzSMBM9E3SAwgJv+GNjmZbxZKb9mAfhbAMdBUXHirX7Ej
 jmx6koQjYOKwYIW8w1BrdC270lUKQUyJTbQgdRkN9Mh/HnKyFixQ18JqGlgaV2cT
 EtYePUfTEdaHdAhUINLIlEug1MfOslHU+HyGsdywnoChNB4GHPQuePC5Tz60VeFN
 RbQ9aKcBUu8r95rjlnKtAtBijNMA4bjGwllVxNwJ/ZoA9RPv1SbDZ07RX3qTaLVY
 YhVQl8+shD33/W24jUTJv1kMMexpHXIlv0gyfMryzpwI7uzzmGHRPAokJdbYKctC
 dyMPfdE90rxTiMUdL/1IQGhnh3awjbyfArzUhHyQ++HyUyzCFh0slsO0CD18vUy8
 FofhCugGBhjuKw3XwLNQ+KsWURz5qHctSzBc3qMOSyqFHbAJCVRANkhsFvWJo2qL
 75+Z7OTRebtsyOUZIdq26r4roSxHrps3dupWTtN70HWx2NhQG1nLEw986QYiQu1T
 hcKvDmehQLrUvg==
 =x3WL
 -----END PGP SIGNATURE-----

Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Thomas Gleixner:

 - Cleanup of extable fixup handling to be more robust, which in turn
   allows to make the FPU exception fixups more robust as well.

 - Change the return code for signal frame related failures from
   explicit error codes to a boolean fail/success as that's all what the
   calling code evaluates.

 - A large refactoring of the FPU code to prepare for adding AMX
   support:

      - Distangle the public header maze and remove especially the
        misnomed kitchen sink internal.h which is despite it's name
        included all over the place.

      - Add a proper abstraction for the register buffer storage (struct
        fpstate) which allows to dynamically size the buffer at runtime
        by flipping the pointer to the buffer container from the default
        container which is embedded in task_struct::tread::fpu to a
        dynamically allocated container with a larger register buffer.

      - Convert the code over to the new fpstate mechanism.

      - Consolidate the KVM FPU handling by moving the FPU related code
        into the FPU core which removes the number of exports and avoids
        adding even more export when AMX has to be supported in KVM.
        This also removes duplicated code which was of course
        unnecessary different and incomplete in the KVM copy.

      - Simplify the KVM FPU buffer handling by utilizing the new
        fpstate container and just switching the buffer pointer from the
        user space buffer to the KVM guest buffer when entering
        vcpu_run() and flipping it back when leaving the function. This
        cuts the memory requirements of a vCPU for FPU buffers in half
        and avoids pointless memory copy operations.

        This also solves the so far unresolved problem of adding AMX
        support because the current FPU buffer handling of KVM inflicted
        a circular dependency between adding AMX support to the core and
        to KVM. With the new scheme of switching fpstate AMX support can
        be added to the core code without affecting KVM.

      - Replace various variables with proper data structures so the
        extra information required for adding dynamically enabled FPU
        features (AMX) can be added in one place

 - Add AMX (Advanced Matrix eXtensions) support (finally):

   AMX is a large XSTATE component which is going to be available with
   Saphire Rapids XEON CPUs. The feature comes with an extra MSR
   (MSR_XFD) which allows to trap the (first) use of an AMX related
   instruction, which has two benefits:

    1) It allows the kernel to control access to the feature

    2) It allows the kernel to dynamically allocate the large register
       state buffer instead of burdening every task with the the extra
       8K or larger state storage.

   It would have been great to gain this kind of control already with
   AVX512.

   The support comes with the following infrastructure components:

    1) arch_prctl() to
        - read the supported features (equivalent to XGETBV(0))
        - read the permitted features for a task
        - request permission for a dynamically enabled feature

       Permission is granted per process, inherited on fork() and
       cleared on exec(). The permission policy of the kernel is
       restricted to sigaltstack size validation, but the syscall
       obviously allows further restrictions via seccomp etc.

    2) A stronger sigaltstack size validation for sys_sigaltstack(2)
       which takes granted permissions and the potentially resulting
       larger signal frame into account. This mechanism can also be used
       to enforce factual sigaltstack validation independent of dynamic
       features to help with finding potential victims of the 2K
       sigaltstack size constant which is broken since AVX512 support
       was added.

    3) Exception handling for #NM traps to catch first use of a extended
       feature via a new cause MSR. If the exception was caused by the
       use of such a feature, the handler checks permission for that
       feature. If permission has not been granted, the handler sends a
       SIGILL like the #UD handler would do if the feature would have
       been disabled in XCR0. If permission has been granted, then a new
       fpstate which fits the larger buffer requirement is allocated.

       In the unlikely case that this allocation fails, the handler
       sends SIGSEGV to the task. That's not elegant, but unavoidable as
       the other discussed options of preallocation or full per task
       permissions come with their own set of horrors for kernel and/or
       userspace. So this is the lesser of the evils and SIGSEGV caused
       by unexpected memory allocation failures is not a fundamentally
       new concept either.

       When allocation succeeds, the fpstate properties are filled in to
       reflect the extended feature set and the resulting sizes, the
       fpu::fpstate pointer is updated accordingly and the trap is
       disarmed for this task permanently.

    4) Enumeration and size calculations

    5) Trap switching via MSR_XFD

       The XFD (eXtended Feature Disable) MSR is context switched with
       the same life time rules as the FPU register state itself. The
       mechanism is keyed off with a static key which is default
       disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
       CPUs the overhead is limited by comparing the tasks XFD value
       with a per CPU shadow variable to avoid redundant MSR writes. In
       case of switching from a AMX using task to a non AMX using task
       or vice versa, the extra MSR write is obviously inevitable.

       All other places which need to be aware of the variable feature
       sets and resulting variable sizes are not affected at all because
       they retrieve the information (feature set, sizes) unconditonally
       from the fpstate properties.

    6) Enable the new AMX states

   Note, this is relatively new code despite the fact that AMX support
   is in the works for more than a year now.

   The big refactoring of the FPU code, which allowed to do a proper
   integration has been started exactly 3 weeks ago. Refactoring of the
   existing FPU code and of the original AMX patches took a week and has
   been subject to extensive review and testing. The only fallout which
   has not been caught in review and testing right away was restricted
   to AMX enabled systems, which is completely irrelevant for anyone
   outside Intel and their early access program. There might be dragons
   lurking as usual, but so far the fine grained refactoring has held up
   and eventual yet undetected fallout is bisectable and should be
   easily addressable before the 5.16 release. Famous last words...

   Many thanks to Chang Bae and Dave Hansen for working hard on this and
   also to the various test teams at Intel who reserved extra capacity
   to follow the rapid development of this closely which provides the
   confidence level required to offer this rather large update for
   inclusion into 5.16-rc1

* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  Documentation/x86: Add documentation for using dynamic XSTATE features
  x86/fpu: Include vmalloc.h for vzalloc()
  selftests/x86/amx: Add context switch test
  selftests/x86/amx: Add test cases for AMX state management
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  x86/fpu: Add XFD handling for dynamic states
  x86/fpu: Calculate the default sizes independently
  x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
  x86/fpu/xstate: Add fpstate_realloc()/free()
  x86/fpu/xstate: Add XFD #NM handler
  x86/fpu: Update XFD state where required
  x86/fpu: Add sanity checks for XFD
  x86/fpu: Add XFD state to fpstate
  x86/msr-index: Add MSRs for XFD
  x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
  x86/fpu: Reset permission and fpstate on exec()
  x86/fpu: Prepare fpu_clone() for dynamically enabled features
  x86/fpu/signal: Prepare for variable sigframe length
  x86/signal: Use fpu::__state_user_size for sigalt stack validation
  ...
2021-11-01 14:03:56 -07:00
Linus Torvalds
9a7e0a90a4 Scheduler updates:
- Revert the printk format based wchan() symbol resolution as it can leak
    the raw value in case that the symbol is not resolvable.
 
  - Make wchan() more robust and work with all kind of unwinders by
    enforcing that the task stays blocked while unwinding is in progress.
 
  - Prevent sched_fork() from accessing an invalid sched_task_group
 
  - Improve asymmetric packing logic
 
  - Extend scheduler statistics to RT and DL scheduling classes and add
    statistics for bandwith burst to the SCHED_FAIR class.
 
  - Properly account SCHED_IDLE entities
 
  - Prevent a potential deadlock when initial priority is assigned to a
    newly created kthread. A recent change to plug a race between cpuset and
    __sched_setscheduler() introduced a new lock dependency which is now
    triggered. Break the lock dependency chain by moving the priority
    assignment to the thread function.
 
  - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
 
  - Improve idle balancing in general and especially for NOHZ enabled
    systems.
 
  - Provide proper interfaces for live patching so it does not have to
    fiddle with scheduler internals.
 
  - Add cluster aware scheduling support.
 
  - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
    scheduler options and delaying mmdrop)
 
  - The usual small tweaks and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
 iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
 /1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
 aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
 3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
 ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
 vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
 mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
 V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
 s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
 i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
 d2qWG7UvoseT+g==
 =fgtS
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - Revert the printk format based wchan() symbol resolution as it can
   leak the raw value in case that the symbol is not resolvable.

 - Make wchan() more robust and work with all kind of unwinders by
   enforcing that the task stays blocked while unwinding is in progress.

 - Prevent sched_fork() from accessing an invalid sched_task_group

 - Improve asymmetric packing logic

 - Extend scheduler statistics to RT and DL scheduling classes and add
   statistics for bandwith burst to the SCHED_FAIR class.

 - Properly account SCHED_IDLE entities

 - Prevent a potential deadlock when initial priority is assigned to a
   newly created kthread. A recent change to plug a race between cpuset
   and __sched_setscheduler() introduced a new lock dependency which is
   now triggered. Break the lock dependency chain by moving the priority
   assignment to the thread function.

 - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.

 - Improve idle balancing in general and especially for NOHZ enabled
   systems.

 - Provide proper interfaces for live patching so it does not have to
   fiddle with scheduler internals.

 - Add cluster aware scheduling support.

 - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
   scheduler options and delaying mmdrop)

 - The usual small tweaks and improvements all over the place

* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  sched/fair: Cleanup newidle_balance
  sched/fair: Remove sysctl_sched_migration_cost condition
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Skip update_blocked_averages if we are defering load balance
  sched/fair: Account update_blocked_averages in newidle_balance cost
  x86: Fix __get_wchan() for !STACKTRACE
  sched,x86: Fix L2 cache mask
  sched/core: Remove rq_relock()
  sched: Improve wake_up_all_idle_cpus() take #2
  irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
  irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
  irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
  sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
  sched: Add cluster scheduler level for x86
  sched: Add cluster scheduler level in core and related Kconfig for ARM64
  topology: Represent clusters of CPUs within a die
  sched: Disable -Wunused-but-set-variable
  sched: Add wrapper for get_wchan() to keep task blocked
  x86: Fix get_wchan() to support the ORC unwinder
  proc: Use task_is_running() for wchan in /proc/$pid/stat
  ...
2021-11-01 13:48:52 -07:00
Linus Torvalds
43aa0a195f objtool updates:
- Improve retpoline code patching by separating it from alternatives which
    reduces memory footprint and allows to do better optimizations in the
    actual runtime patching.
 
  - Add proper retpoline support for x86/BPF
 
  - Address noinstr warnings in x86/kvm, lockdep and paravirtualization code
 
  - Add support to handle pv_opsindirect calls in the noinstr analysis
 
  - Classify symbols upfront and cache the result to avoid redundant
    str*cmp() invocations.
 
  - Add a CFI hash to reduce memory consumption which also reduces runtime
    on a allyesconfig by ~50%
 
  - Adjust XEN code to make objtool handling more robust and as a side
    effect to prevent text fragmentation due to placement of the hypercall
    page.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/GFgTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoc1JD/0Sz6seP2OUMxbMT3gCcFo9sMvYTdsM
 7WuGFbBbnCIo7g8JH7k0zRRBigptMp2eUtQXKkgaaIbWN4JbuVKf8KxN5/qXxLi4
 fJ12QnNTGH9N2jtzl5wKmpjaKJnnJMD9D10XwoR+T6gn6NHd+AgLEs7GxxuQUlgo
 eC9oEXhNHC8uNhiZc38EwfwmItI1bRgaLrnZWIL4rYGSMxfCK1/cEOpWrFfX9wmj
 /diB6oqMyPXZXMCtgpX7TniUr5XOTCcUkeO9mQv5bmyq/YM/8hrTbcVSJlsVYLvP
 EsBnUSHAcfLFiHXwa1RNiIGdbiPjbN+UYeXGAvqF58f3e5dTIHtN/UmWo7OH93If
 9rLMVNcMpsfPx7QRk2IxEPumLCkyfwjzfKrVDM6P6TKEIUzD1og4IK9gTlfykVsh
 56G5XiCOC/X2x8IMxKTLGuBiAVLFHXK/rSwoqhvNEWBFKDbP13QWs0LurBcW09Sa
 /kQI9pIBT1xFA/R+OY5Xy1cqNVVK1Gxmk8/bllCijA9pCFSCFM4hLZE5CevdrBCV
 h5SdqEK5hIlzFyypXfsCik/4p/+rfvlGfUKtFsPctxx29SPe+T0orx+l61jiWQok
 rZOflwMawK5lDuASHrvNHGJcWaTwoo3VcXMQDnQY0Wulc43J5IFBaPxkZzgyd+S1
 4lktHxatrCMUgw==
 =pfZi
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Thomas Gleixner:

 - Improve retpoline code patching by separating it from alternatives
   which reduces memory footprint and allows to do better optimizations
   in the actual runtime patching.

 - Add proper retpoline support for x86/BPF

 - Address noinstr warnings in x86/kvm, lockdep and paravirtualization
   code

 - Add support to handle pv_opsindirect calls in the noinstr analysis

 - Classify symbols upfront and cache the result to avoid redundant
   str*cmp() invocations.

 - Add a CFI hash to reduce memory consumption which also reduces
   runtime on a allyesconfig by ~50%

 - Adjust XEN code to make objtool handling more robust and as a side
   effect to prevent text fragmentation due to placement of the
   hypercall page.

* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  bpf,x86: Respect X86_FEATURE_RETPOLINE*
  bpf,x86: Simplify computing label offsets
  x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
  x86/alternative: Add debug prints to apply_retpolines()
  x86/alternative: Try inline spectre_v2=retpoline,amd
  x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
  x86/alternative: Implement .retpoline_sites support
  x86/retpoline: Create a retpoline thunk array
  x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
  x86/asm: Fixup odd GEN-for-each-reg.h usage
  x86/asm: Fix register order
  x86/retpoline: Remove unused replacement symbols
  objtool,x86: Replace alternatives with .retpoline_sites
  objtool: Shrink struct instruction
  objtool: Explicitly avoid self modifying code in .altinstr_replacement
  objtool: Classify symbols
  objtool: Support pv_opsindirect calls for noinstr
  x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
  x86/xen: Mark xen_force_evtchn_callback() noinstr
  x86/xen: Make irq_disable() noinstr
  ...
2021-11-01 13:24:43 -07:00
Peter Zijlstra
f8a66d608a x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
Currently Linux prevents usage of retpoline,amd on !AMD hardware, this
is unfriendly and gets in the way of testing. Remove this restriction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.487348118@infradead.org
2021-10-28 23:25:29 +02:00
Tianyu Lan
af788f355e x86/hyperv: Initialize shared memory boundary in the Isolation VM.
Hyper-V exposes shared memory boundary via cpuid
HYPERV_CPUID_ISOLATION_CONFIG and store it in the
shared_gpa_boundary of ms_hyperv struct. This prepares
to share memory with host for SNP guest.

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-3-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-10-28 10:47:52 +00:00
Tianyu Lan
0cc4f6d9f0 x86/hyperv: Initialize GHCB page in Isolation VM
Hyperv exposes GHCB page via SEV ES GHCB MSR for SNP guest
to communicate with hypervisor. Map GHCB page for all
cpus to read/write MSR register and submit hvcall request
via ghcb page.

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-2-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-10-28 10:47:52 +00:00
Dave Airlie
970eae1560 Linux 5.15-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmF298ceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGIJYH/1rsEFQQ6caeQdy1
 z9eFIe48DNM4l7bFk+qEj2UAbzPdahVJ299Mg5fW0n2CDemOc9/n0b9TxQ37YObi
 mOzu0xwJVupIxkyFMPQSSc2q8aLm67NSpJy08DsmaNses5hSvu8x15RPHLQTybjt
 SwtKns+jpCq79P1GWbrB5e5UkLb0VNoxNp4L1U4pMrYGcEkJUXbaxNY2V/JcXdM7
 Vtn+qN0T/J6V6QVftv0t8Ecj3bjEnmL3kZHaTaNg3dGeKRpCGyHc5lcBQ0cNFG6t
 vjZ9VbuhBzGI3TN2tHH5hpA1UXo7HPBBCwQqxF1jeGLGHULikYwZ3TAPWqL3QZqC
 9cxr9SY=
 =p75d
 -----END PGP SIGNATURE-----

BackMerge tag 'v5.15-rc7' into drm-next

The msm next tree is based on rc3, so let's just backmerge rc7 before pulling it in.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2021-10-28 14:59:38 +10:00
Chang S. Bae
2308ee57d9 x86/fpu/amx: Enable the AMX feature in 64-bit mode
Add the AMX state components in XFEATURE_MASK_USER_SUPPORTED and the
TILE_DATA component to the dynamic states and update the permission check
table accordingly.

This is only effective on 64 bit kernels as for 32bit kernels
XFEATURE_MASK_TILE is defined as 0.

TILE_DATA is caller-saved state and the only dynamic state. Add build time
sanity check to ensure the assumption that every dynamic feature is caller-
saved.

Make AMX state depend on XFD as it is dynamic feature.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211021225527.10184-24-chang.seok.bae@intel.com
2021-10-26 10:53:03 +02:00
Chang S. Bae
c351101678 x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
Intel's eXtended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.

This is going to be used to postpone the allocation of a larger XSTATE
buffer for a task to the point where it is actually using a related
instruction after the permission to use that facility has been granted.

XFD is not used by the kernel, but only applied to userspace. This is a
matter of policy as the kernel knows how a fpstate is reallocated and the
XFD state.

The compacted XSAVE format is adjustable for dynamic features. Make XFD
depend on XSAVES.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-13-chang.seok.bae@intel.com
2021-10-26 10:18:09 +02:00
Paolo Bonzini
ae095b16fc x86/sgx/virt: implement SGX_IOC_VEPC_REMOVE ioctl
For bare-metal SGX on real hardware, the hardware provides guarantees
SGX state at reboot.  For instance, all pages start out uninitialized.
The vepc driver provides a similar guarantee today for freshly-opened
vepc instances, but guests such as Windows expect all pages to be in
uninitialized state on startup, including after every guest reboot.

Some userspace implementations of virtual SGX would rather avoid having
to close and reopen the /dev/sgx_vepc file descriptor and re-mmap the
virtual EPC.  For example, they could sandbox themselves after the guest
starts and forbid further calls to open(), in order to mitigate exploits
from untrusted guests.

Therefore, add a ioctl that does this with EREMOVE.  Userspace can
invoke the ioctl to bring its vEPC pages back to uninitialized state.
There is a possibility that some pages fail to be removed if they are
SECS pages, and the child and SECS pages could be in separate vEPC
regions.  Therefore, the ioctl returns the number of EREMOVE failures,
telling userspace to try the ioctl again after it's done with all
vEPC regions.  A more verbose description of the correct usage and
the possible error conditions is documented in sgx.rst.

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211021201155.1523989-3-pbonzini@redhat.com
2021-10-22 08:32:12 -07:00
Paolo Bonzini
fd5128e622 x86/sgx/virt: extract sgx_vepc_remove_page
For bare-metal SGX on real hardware, the hardware provides guarantees
SGX state at reboot.  For instance, all pages start out uninitialized.
The vepc driver provides a similar guarantee today for freshly-opened
vepc instances, but guests such as Windows expect all pages to be in
uninitialized state on startup, including after every guest reboot.

One way to do this is to simply close and reopen the /dev/sgx_vepc file
descriptor and re-mmap the virtual EPC.  However, this is problematic
because it prevents sandboxing the userspace (for example forbidding
open() after the guest starts; this is doable with heavy use of SCM_RIGHTS
file descriptor passing).

In order to implement this, we will need a ioctl that performs
EREMOVE on all pages mapped by a /dev/sgx_vepc file descriptor:
other possibilities, such as closing and reopening the device,
are racy.

Start the implementation by creating a separate function with just
the __eremove wrapper.

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211021201155.1523989-2-pbonzini@redhat.com
2021-10-22 08:30:09 -07:00
Borislav Petkov
9d48960414 x86/microcode: Use the firmware_loader built-in API
The microcode loader has been looping through __start_builtin_fw down to
__end_builtin_fw to look for possibly built-in firmware for microcode
updates.

Now that the firmware loader code has exported an API for looping
through the kernel's built-in firmware section, use it and drop the x86
implementation in favor.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211021155843.1969401-4-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-22 14:13:50 +02:00
Jane Malalane
415de44076 x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which
makes it unsafe to migrate in a virtualised environment as the
properties across the migration pool might differ.

To be specific, the case which goes wrong is:

1. Zen1 (or earlier) and Zen2 (or later) in a migration pool
2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL
3. Linux is then migrated to Zen1

Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing
that the bug is fixed.

The only way to address the problem is to fully trust the "no longer
affected" CPUID bit when virtualised, because in the above case it would
be clear deliberately to indicate the fact "you might migrate to
somewhere which has this behaviour".

Zen3 adds the NullSelectorClearsBase CPUID bit to indicate that loading
a NULL segment selector zeroes the base and limit fields, as well as
just attributes. Zen2 also has this behaviour but doesn't have the NSCB
bit.

 [ bp: Minor touchups. ]

Signed-off-by: Jane Malalane <jane.malalane@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20211021104744.24126-1-jane.malalane@citrix.com
2021-10-21 20:49:16 +02:00
Marcos Del Sol Vives
639475d434 x86/CPU: Add support for Vortex CPUs
DM&P devices were not being properly identified, which resulted in
unneeded Spectre/Meltdown mitigations being applied.

The manufacturer states that these devices execute always in-order and
don't support either speculative execution or branch prediction, so
they are not vulnerable to this class of attack. [1]

This is something I've personally tested by a simple timing analysis
on my Vortex86MX CPU, and can confirm it is true.

Add identification for some devices that lack the CPUID product name
call, so they appear properly on /proc/cpuinfo.

¹https://www.ssv-embedded.de/doks/infos/DMP_Ann_180108_Meltdown.pdf

 [ bp: Massage commit message. ]

Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211017094408.1512158-1-marcos@orca.pet
2021-10-21 15:49:07 +02:00
Thomas Gleixner
b56d2795b2 x86/fpu: Replace the includes of fpu/internal.h
Now that the file is empty, fixup all references with the proper includes
and delete the former kitchen sink.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de
2021-10-20 15:27:29 +02:00
Ingo Molnar
082f20b21d Merge branch 'x86/urgent' into x86/fpu, to resolve a conflict
Resolve the conflict between these commits:

   x86/fpu:      1193f408cd ("x86/fpu/signal: Change return type of __fpu_restore_sig() to boolean")

   x86/urgent:   d298b03506 ("x86/fpu: Restore the masking out of reserved MXCSR bits")
                 b2381acd3f ("x86/fpu: Mask out the invalid MXCSR bits properly")

 Conflicts:
        arch/x86/kernel/fpu/signal.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-10-16 15:17:46 +02:00
Tim Chen
66558b730f sched: Add cluster scheduler level for x86
There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
shared among a cluster of cores instead of being exclusive to one
single core.

To prevent oversubscription of L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmark such as SPECrate mcf test, this change provides a boost
to performance especially on medium load system on Jacobsville.  on a
Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
cores each, the benchmark number is as follow:

 Improvement over baseline kernel for mcf_r
 copies		run time	base rate
 1		-0.1%		-0.2%
 6		25.1%		25.1%
 12		18.8%		19.0%
 24		0.3%		0.3%

So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain for the 6 and 12 copies case. With the extra
domain for cluster, the load does get evened out between the clusters.

Note this patch isn't an universal win as spreading isn't necessarily
a win, particually for those workload who can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
2021-10-15 11:25:16 +02:00
Mukul Joshi
f38ce910d8 x86/MCE/AMD: Export smca_get_bank_type symbol
Export smca_get_bank_type for use in the AMD GPU
driver to determine MCA bank while handling correctable
and uncorrectable errors in GPU UMC.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-06 15:53:24 -04:00
Vegard Nossum
3958b9c34c x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=n
Commit

  3c73b81a91 ("x86/entry, selftests: Further improve user entry sanity checks")

added a warning if AC is set when in the kernel.

Commit

  662a022189 ("x86/entry: Fix AC assertion")

changed the warning to only fire if the CPU supports SMAP.

However, the warning can still trigger on a machine that supports SMAP
but where it's disabled in the kernel config and when running the
syscall_nt selftest, for example:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 49 at irqentry_enter_from_user_mode
  CPU: 0 PID: 49 Comm: init Tainted: G                T 5.15.0-rc4+ #98 e6202628ee053b4f310759978284bd8bb0ce6905
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  RIP: 0010:irqentry_enter_from_user_mode
  ...
  Call Trace:
   ? irqentry_enter
   ? exc_general_protection
   ? asm_exc_general_protection
   ? asm_exc_general_protectio

IS_ENABLED(CONFIG_X86_SMAP) could be added to the warning condition, but
even this would not be enough in case SMAP is disabled at boot time with
the "nosmap" parameter.

To be consistent with "nosmap" behaviour, clear X86_FEATURE_SMAP when
!CONFIG_X86_SMAP.

Found using entry-fuzz + satrandconfig.

 [ bp: Massage commit message. ]

Fixes: 3c73b81a91 ("x86/entry, selftests: Further improve user entry sanity checks")
Fixes: 662a022189 ("x86/entry: Fix AC assertion")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211003223423.8666-1-vegard.nossum@oracle.com
2021-10-06 18:46:06 +02:00
James Morse
d4ebfca26d x86/resctrl: Fix kfree() of the wrong type in domain_add_cpu()
Commit in Fixes separated the architecture specific and filesystem parts
of the resctrl domain structures.

This left the error paths in domain_add_cpu() kfree()ing the memory with
the wrong type.

This will cause a problem if someone adds a new member to struct
rdt_hw_domain meaning d_resctrl is no longer the first member.

Fixes: 792e0f6f78 ("x86/resctrl: Split struct rdt_domain")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20210917165924.28254-1-james.morse@arm.com
2021-10-06 18:45:27 +02:00
James Morse
64e87d4bd3 x86/resctrl: Free the ctrlval arrays when domain_setup_mon_state() fails
domain_add_cpu() is called whenever a CPU is brought online. The
earlier call to domain_setup_ctrlval() allocates the control value
arrays.

If domain_setup_mon_state() fails, the control value arrays are not
freed.

Add the missing kfree() calls.

Fixes: 1bd2a63b4f ("x86/intel_rdt/mba_sc: Add initialization support")
Fixes: edf6fa1c4a ("x86/intel_rdt/cqm: Add RMID (Resource monitoring ID) management")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210917165958.28313-1-james.morse@arm.com
2021-10-06 18:45:21 +02:00
Andrea Arcangeli
2f46993d83 x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl
Switch the kernel default of SSBD and STIBP to the ones with
CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl
spectre_v2_user=prctl) even if CONFIG_SECCOMP=y.

Several motivations listed below:

- If SMT is enabled the seccomp jail can still attack the rest of the
  system even with spectre_v2_user=seccomp by using MDS-HT (except on
  XEON PHI where MDS can be tamed with SMT left enabled, but that's a
  special case). Setting STIBP become a very expensive window dressing
  after MDS-HT was discovered.

- The seccomp jail cannot attack the kernel with spectre-v2-HT
  regardless (even if STIBP is not set), but with MDS-HT the seccomp
  jail can attack the kernel too.

- With spec_store_bypass_disable=prctl the seccomp jail can attack the
  other userland (guest or host mode) using spectre-v2-HT, but the
  userland attack is already mitigated by both ASLR and pid namespaces
  for host userland and through virt isolation with libkrun or
  kata. (if something if somebody is worried about spectre-v2-HT it's
  best to mount proc with hidepid=2,gid=proc on workstations where not
  all apps may run under container runtimes, rather than slowing down
  all seccomp jails, but the best is to add pid namespaces to the
  seccomp jail). As opposed MDS-HT is not mitigated and the seccomp
  jail can still attack all other host and guest userland if SMT is
  enabled even with spec_store_bypass_disable=seccomp.

- If full security is required then MDS-HT must also be mitigated with
  nosmt and then spectre_v2_user=prctl and spectre_v2_user=seccomp
  would become identical.

- Setting spectre_v2_user=seccomp is overall lower priority than to
  setting javascript.options.wasm false in about:config to protect
  against remote wasm MDS-HT, instead of worrying about Spectre-v2-HT
  and STIBP which again is already statistically well mitigated by
  other means in userland and it's fully mitigated in kernel with
  retpolines (unlike the wasm assist call with MDS-HT).

- SSBD is needed to prevent reading the JIT memory and the primary
  user being the OpenJDK. However the primary user of SSBD wouldn't be
  covered by spec_store_bypass_disable=seccomp because it doesn't use
  seccomp and the primary user also explicitly declined to set
  PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS despite it easily
  could. In fact it would need to set it only when the sandboxing
  mechanism is enabled for javaws applets, but it still declined it by
  declaring security within the same user address space as an
  untenable objective for their JIT, even in the sandboxing case where
  performance would be a lesser concern (for the record: I kind of
  disagree in not setting PR_SPEC_STORE_BYPASS in the sandbox case and
  I prefer to run javaws through a wrapper that sets
  PR_SPEC_STORE_BYPASS if I need). In turn it can be inferred that
  even if the primary user of SSBD would use seccomp, they would
  invoke it with SECCOMP_FILTER_FLAG_SPEC_ALLOW by now.

- runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s
  and podman have a default json seccomp allowlist that cannot be
  slowed down, so for the #1 seccomp user this change is already a
  noop.

- systemd/sshd or other apps that use seccomp, if they really need
  STIBP or SSBD, they need to explicitly set the
  PR_SET_SPECULATION_CTRL by now. The stibp/ssbd seccomp blind
  catch-all approach was done probably initially with a wishful
  thinking objective to pretend to have a peace of mind that it could
  magically fix it all. That was wishful thinking before MDS-HT was
  discovered, but after MDS-HT has been discovered it become just
  window dressing.

- For qemu "-sandbox" seccomp jail it wouldn't make sense to set STIBP
  or SSBD. SSBD doesn't help with KVM because there's no JIT (if it's
  needed with TCG it should be an opt-in with
  PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS and it shouldn't
  slowdown KVM for nothing). For qemu+KVM STIBP would be even more
  window dressing than it is for all other apps, because in the
  qemu+KVM case there's not only the MDS attack to worry about with
  SMT enabled. Even after disabling SMT, there's still a theoretical
  spectre-v2 attack possible within the same thread context from guest
  mode to host ring3 that the host kernel retpoline mitigation has no
  theoretical chance to mitigate. On some kernels a
  ibrs-always/ibrs-retpoline opt-in model is provided that will
  enabled IBRS in the qemu host ring3 userland which fixes this
  theoretical concern. Only after enabling IBRS in the host userland
  it would then make sense to proceed and worry about STIBP and an
  attack on the other host userland, but then again SMT would need to
  be disabled for full security anyway, so that would render STIBP
  again a noop.

- last but not the least: the lack of "spec_store_bypass_disable=prctl
  spectre_v2_user=prctl" means the moment a guest boots and
  sshd/systemd runs, the guest kernel will write to SPEC_CTRL MSR
  which will make the guest vmexit forever slower, forcing KVM to
  issue a very slow rdmsr instruction at every vmexit. So the end
  result is that SPEC_CTRL MSR is only available in GCE. Most other
  public cloud providers don't expose SPEC_CTRL, which means that not
  only STIBP/SSBD isn't available, but IBPB isn't available either
  (which would cause no overhead to the guest or the hypervisor
  because it's write only and requires no reading during vmexit). So
  the current default already net loss in security (missing IBPB)
  which means most public cloud providers cannot achieve a fully
  secure guest with nosmt (and nosmt is enough to fully mitigate
  MDS-HT). It also means GCE and is unfairly penalized in performance
  because it provides the option to enable full security in the guest
  as an opt-in (i.e. nosmt and IBPB). So this change will allow all
  cloud providers to expose SPEC_CTRL without incurring into any
  hypervisor slowdown and at the same time it will remove the unfair
  penalization of GCE performance for doing the right thing and it'll
  allow to get full security with nosmt with IBPB being available (and
  STIBP becoming meaningless).

Example to put things in prospective: the STIBP enabled in seccomp has
never been about protecting apps using seccomp like sshd from an
attack from a malicious userland, but to the contrary it has always
been about protecting the system from an attack from sshd, after a
successful remote network exploit against sshd. In fact initially it
wasn't obvious STIBP would work both ways (STIBP was about preventing
the task that runs with STIBP to be attacked with spectre-v2-HT, but
accidentally in the STIBP case it also prevents the attack in the
other direction). In the hypothetical case that sshd has been remotely
exploited the last concern should be STIBP being set, because it'll be
still possible to obtain info even from the kernel by using MDS if
nosmt wasn't set (and if it was set, STIBP is a noop in the first
place). As opposed kernel cannot leak anything with spectre-v2 HT
because of retpolines and the userland is mitigated by ASLR already
and ideally PID namespaces too. If something it'd be worth checking if
sshd run the seccomp thread under pid namespaces too if available in
the running kernel. SSBD also would be a noop for sshd, since sshd
uses no JIT. If sshd prefers to keep doing the STIBP window dressing
exercise, it still can even after this change of defaults by opting-in
with PR_SPEC_INDIRECT_BRANCH.

Ultimately setting SSBD and STIBP by default for all seccomp jails is
a bad sweet spot and bad default with more cons than pros that end up
reducing security in the public cloud (by giving an huge incentive to
not expose SPEC_CTRL which would be needed to get full security with
IBPB after setting nosmt in the guest) and by excessively hurting
performance to more secure apps using seccomp that end up having to
opt out with SECCOMP_FILTER_FLAG_SPEC_ALLOW.

The following is the verified result of the new default with SMT
enabled:

(gdb) print spectre_v2_user_stibp
$1 = SPECTRE_V2_USER_PRCTL
(gdb) print spectre_v2_user_ibpb
$2 = SPECTRE_V2_USER_PRCTL
(gdb) print ssb_mode
$3 = SPEC_STORE_BYPASS_PRCTL

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201104235054.5678-1-aarcange@redhat.com
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/lkml/AAA2EF2C-293D-4D5B-BFA6-FF655105CD84@redhat.com
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/lkml/c0722838-06f7-da6b-138f-e0f26362f16a@redhat.com
2021-10-04 12:12:57 -07:00
Borislav Petkov
15802468a9 x86/mce: Sort mca_config members to get rid of unnecessary padding
$ pahole -C mca_config arch/x86/kernel/cpu/mce/core.o

before:

   /* size: 40, cachelines: 1, members: 16 */
   /* sum members: 21, holes: 1, sum holes: 3 */
   /* sum bitfield members: 64 bits, bit holes: 2, sum bit holes: 32 bits */
   /* padding: 4 */
   /* last cacheline: 40 bytes */

after:

   /* size: 32, cachelines: 1, members: 16 */
   /* padding: 3 */
   /* last cacheline: 32 bytes */

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-6-bp@alien8.de
2021-09-23 11:38:04 +02:00
Borislav Petkov
cc466666ab x86/mce: Get rid of the ->quirk_no_way_out() indirect call
Use a flag setting to call the only quirk function for that.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-5-bp@alien8.de
2021-09-23 11:34:49 +02:00
Borislav Petkov
8121b8f947 x86/mce: Get rid of msr_ops
Avoid having indirect calls and use a normal function which returns the
proper MSR address based on ->smca setting.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-4-bp@alien8.de
2021-09-23 11:20:20 +02:00
Borislav Petkov
cbe1de162d x86/mce: Get rid of machine_check_vector
Get rid of the indirect function pointer and use flags settings instead
to steer execution.

Now that it is not an indirect call any longer, drop the instrumentation
annotation for objtool too.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-3-bp@alien8.de
2021-09-23 11:15:49 +02:00
Borislav Petkov
631adc7b0b x86/mce: Get rid of the mce_severity function pointer
Turn it into a normal function which calls an AMD- or Intel-specific
variant depending on the CPU it runs on.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-2-bp@alien8.de
2021-09-23 11:03:51 +02:00
Tony Luck
a6e3cf70b7 x86/mce: Change to not send SIGBUS error during copy from user
Sending a SIGBUS for a copy from user is not the correct semantic.
System calls should return -EFAULT (or a short count for write(2)).

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com
2021-09-20 10:55:41 +02:00
Linus Torvalds
20621d2f27 A set of x86 fixes:
- Prevent a infinite loop in the MCE recovery on return to user space,
     which was caused by a second MCE queueing work for the same page and
     thereby creating a circular work list.
 
   - Make kern_addr_valid() handle existing PMD entries, which are marked not
     present in the higher level page table, correctly instead of blindly
     dereferencing them.
 
   - Pass a valid address to sanitize_phys(). This was caused by the mixture
     of inclusive and exclusive ranges. memtype_reserve() expect 'end' being
     exclusive, but sanitize_phys() wants it inclusive. This worked so far,
     but with end being the end of the physical address space the fail is
     exposed.
 
  - Increase the maximum supported GPIO numbers for 64bit. Newer SoCs exceed
    the previous maximum.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmFHhPIACgkQEsHwGGHe
 VUqqQA/+MHQ2HxVOPxnJ0i/D1nK8ccNqTEkSN08z23RGnjqKQun/VaNIIceJY25f
 Abeb2tI+0qRrdWVPVd5YqcTHuBLmnPs6Je3MfOrG47eQNW4/SmkXYuOexK80Bew3
 YDgEV73d40rHcolXZCaonVajx+FmjoNvkDt5LpLvLcCxIyv0GClFBcZrFAm70AxI
 Feax30koh3/MIFxHoXyADN8D+MJu1GxA6QWuoTK40s3G/gTTAwimkDgnNU1JXbcj
 VvVVZaNnnAxjxrCa81blr9nDpHJCDinG9bdvDT3UDLous52hGMZTsHoHogxwfogT
 EhIgPvL8hf+wm1WXA4NyvSNKZxsGfdkvIXaUq9XYHpLRD6Ao6x7jQDL039imucqb
 9YtaH52GhG0SgJlYjkm/zrKezIjKLDen0ZYr/2iNTDM1p2GqQEFo07wC/ME8TkQ6
 /BvtbkIvOuUz3nJeV4/AO+O4kaNvto9O2eHq9oodIN9nrwmlO5fMg8XO9nrhWB11
 ChXEz6kPqta1nyZXy0mwOrlXlqzcusiroG4G9F7IBBz+t/gNwlu3uZuIgkQCXyYw
 DgKz9cnQ3RdgCFknbmEwV5oCjewm7UdcgwaDAaelHIDuWMcshZFvMf1uSjnyg4Z/
 39WI8W7W2aZnIoKWpvu8s7Gr8f1krE7C3xrkvl2WmbKPkxNAin8=
 =7cq3
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Prevent a infinite loop in the MCE recovery on return to user space,
   which was caused by a second MCE queueing work for the same page and
   thereby creating a circular work list.

 - Make kern_addr_valid() handle existing PMD entries, which are marked
   not present in the higher level page table, correctly instead of
   blindly dereferencing them.

 - Pass a valid address to sanitize_phys(). This was caused by the
   mixture of inclusive and exclusive ranges. memtype_reserve() expect
   'end' being exclusive, but sanitize_phys() wants it inclusive. This
   worked so far, but with end being the end of the physical address
   space the fail is exposed.

 - Increase the maximum supported GPIO numbers for 64bit. Newer SoCs
   exceed the previous maximum.

* tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Avoid infinite loop for copy from user recovery
  x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
  x86/platform: Increase maximum GPIO number for X86_64
  x86/pat: Pass valid address to sanitize_phys()
2021-09-19 13:29:36 -07:00
Tony Luck
81065b35e2 x86/mce: Avoid infinite loop for copy from user recovery
There are two cases for machine check recovery:

1) The machine check was triggered by ring3 (application) code.
   This is the simpler case. The machine check handler simply queues
   work to be executed on return to user. That code unmaps the page
   from all users and arranges to send a SIGBUS to the task that
   triggered the poison.

2) The machine check was triggered in kernel code that is covered by
   an exception table entry. In this case the machine check handler
   still queues a work entry to unmap the page, etc. but this will
   not be called right away because the #MC handler returns to the
   fix up code address in the exception table entry.

Problems occur if the kernel triggers another machine check before the
return to user processes the first queued work item.

Specifically, the work is queued using the ->mce_kill_me callback
structure in the task struct for the current thread. Attempting to queue
a second work item using this same callback results in a loop in the
linked list of work functions to call. So when the kernel does return to
user, it enters an infinite loop processing the same entry for ever.

There are some legitimate scenarios where the kernel may take a second
machine check before returning to the user.

1) Some code (e.g. futex) first tries a get_user() with page faults
   disabled. If this fails, the code retries with page faults enabled
   expecting that this will resolve the page fault.

2) Copy from user code retries a copy in byte-at-time mode to check
   whether any additional bytes can be copied.

On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.

Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work call back is executed check that the address
is in the same page as the first machine check (since the callback will
offline exactly one page).

Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled ... repeat in copy tail bytes). Just in case
there is some code that loops forever enforce a limit of 10.

 [ bp: Massage commit message, drop noinstr, fix typo, extend panic
   messages. ]

Fixes: 5567d11c21 ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
2021-09-14 10:27:03 +02:00
Thomas Gleixner
0c2e62ba04 x86/extable: Remove EX_TYPE_FAULT from MCE safe fixups
Now that the MC safe copy and FPU have been converted to use the MCE safe
fixup types remove EX_TYPE_FAULT from the list of types which MCE considers
to be safe to be recovered in kernel.

This removes the SGX exception handling of ENCLS from the #MC safe
handling, but according to the SGX wizards the current SGX implementations
cannot survive #MC on ENCLS:

  https://lore.kernel.org/r/YS+upEmTfpZub3s9@google.com

The code relies on the trap number being stored if ENCLS raised an
exception. That's still working, but it does no longer trick the MCE code
into assuming that #MC is handled correctly for ENCLS.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.445255957@linutronix.de
2021-09-13 18:23:00 +02:00
Thomas Gleixner
2cadf5248b x86/extable: Provide EX_TYPE_DEFAULT_MCE_SAFE and EX_TYPE_FAULT_MCE_SAFE
Provide exception fixup types which can be used to identify fixups which
allow in kernel #MC recovery and make them invoke the existing handlers.

These will be used at places where #MC recovery is handled correctly by the
caller.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.269689153@linutronix.de
2021-09-13 17:56:56 +02:00
Thomas Gleixner
46d28947d9 x86/extable: Rework the exception table mechanics
The exception table entries contain the instruction address, the fixup
address and the handler address. All addresses are relative. Storing the
handler address has a few downsides:

 1) Most handlers need to be exported

 2) Handlers can be defined everywhere and there is no overview about the
    handler types

 3) MCE needs to check the handler type to decide whether an in kernel #MC
    can be recovered. The functionality of the handler itself is not in any
    way special, but for these checks there need to be separate functions
    which in the worst case have to be exported.

    Some of these 'recoverable' exception fixups are pretty obscure and
    just reuse some other handler to spare code. That obfuscates e.g. the
    #MC safe copy functions. Cleaning that up would require more handlers
    and exports

Rework the exception fixup mechanics by storing a fixup type number instead
of the handler address and invoke the proper handler for each fixup
type. Also teach the extable sort to leave the type field alone.

This makes most handlers static except for special cases like the MCE
MSR fixup and the BPF fixup. This allows to add more types for cleaning up
the obscure places without adding more handler code and exports.

There is a marginal code size reduction for a production config and it
removes _eight_ exported symbols.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lkml.kernel.org/r/20210908132525.211958725@linutronix.de
2021-09-13 17:51:47 +02:00
Thomas Gleixner
083b32d6f4 x86/mce: Get rid of stray semicolons
and the random number of tabs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.154428878@linutronix.de
2021-09-13 17:42:51 +02:00
Thomas Gleixner
e42404afc4 x86/mce: Deduplicate exception handling
Prepare code for further simplification. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.096452100@linutronix.de
2021-09-13 17:00:23 +02:00
Thomas Gleixner
c2f4954c2d Merge branch 'linus' into smp/urgent
Ensure that all usage sites of get/put_online_cpus() except for the
struggler in drivers/thermal are gone. So the last user and the deprecated
inlines can be removed.
2021-09-11 00:38:47 +02:00
Linus Torvalds
c07f191907 hyperv-next for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmEuJwwTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXp0ICACsx9NtQh1f9xGMClYrbobJfGmiwHVV
 uKut/44Vg39tSyZB4mQt3A8YcaQj1Nibo6HVmxJtbKbrKwlTGAiQh5fiOmBOd7Re
 /rII/S+CGtAyChI1adHTSL2xdk6WY0c7XQw+IPaERBikG5rO81Y6NLjFZNOv494k
 JnG9uGGjAcJWFYylPcLxt4sR/hEfE4KDzsWjWOb5azYgo/RwOan6zYDdkUgocp4A
 J+zmgCiME8LLmEV19gn7p4gpX7X9m5mcNgn53eICYPhrBqI0PTWocm6DepCEnrQ+
 pEobIagWIMx5Dr7euEJwLxFSN7bdzleVOa4FSfM0zUsEjdbiPH47VQFM
 =Vae6
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - make Hyper-V code arch-agnostic (Michael Kelley)

 - fix sched_clock behaviour on Hyper-V (Ani Sinha)

 - fix a fault when Linux runs as the root partition on MSHV (Praveen
   Kumar)

 - fix VSS driver (Vitaly Kuznetsov)

 - cleanup (Sonia Sharma)

* tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv_utils: Set the maximum packet size for VSS driver to the length of the receive buffer
  Drivers: hv: Enable Hyper-V code to be built on ARM64
  arm64: efi: Export screen_info
  arm64: hyperv: Initialize hypervisor on boot
  arm64: hyperv: Add panic handler
  arm64: hyperv: Add Hyper-V hypercall and register access utilities
  x86/hyperv: fix root partition faults when writing to VP assist page MSR
  hv: hyperv.h: Remove unused inline functions
  drivers: hv: Decouple Hyper-V clock/timer code from VMbus drivers
  x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0
  Drivers: hv: Move Hyper-V misc functionality to arch-neutral code
  Drivers: hv: Add arch independent default functions for some Hyper-V handlers
  Drivers: hv: Make portions of Hyper-V init code be arch neutral
  x86/hyperv: fix for unwanted manipulation of sched_clock when TSC marked unstable
  asm-generic/hyperv: Add missing #include of nmi.h
2021-09-01 18:25:20 -07:00
Thomas Gleixner
4b92d4add5 drivers: base: cacheinfo: Get rid of DEFINE_SMP_CALL_CACHE_FUNCTION()
DEFINE_SMP_CALL_CACHE_FUNCTION() was usefel before the CPU hotplug rework
to ensure that the cache related functions are called on the upcoming CPU
because the notifier itself could run on any online CPU.

The hotplug state machine guarantees that the callbacks are invoked on the
upcoming CPU. So there is no need to have this SMP function call
obfuscation. That indirection was missed when the hotplug notifiers were
converted.

This also solves the problem of ARM64 init_cache_level() invoking ACPI
functions which take a semaphore in that context. That's invalid as SMP
function calls run with interrupts disabled. Running it just from the
callback in context of the CPU hotplug thread solves this.

Fixes: 8571890e15 ("arm64: Add support for ACPI based firmware tables")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/871r69ersb.ffs@tglx
2021-09-01 10:29:10 +02:00
Linus Torvalds
0a096f240a A reworked version of the opt-in L1D flush mechanism:
A stop gap for potential future speculation related hardware
   vulnerabilities and a mechanism for truly security paranoid
   applications.
 
   It allows a task to request that the L1D cache is flushed when the kernel
   switches to a different mm. This can be requested via prctl().
 
   Changes vs. the previous versions:
 
     - Get rid of the software flush fallback
 
     - Make the handling consistent with other mitigations
 
     - Kill the task when it ends up on a SMT enabled core which defeats the
       purpose of L1D flushing obviously
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsn0oTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoa5fD/47vHGtjAtDr/DaXR1C6F9AvVbKEl8p
 oNHn8IukE6ts6G4dFH9wUvo/Ut0K3kxX54I+BATew0LTy6tsQeUYh/xjwXMupgNV
 oKOc9waoqdFvju3ayLFWJmuACLdXpyrGC1j35Aji61zSbR/GdtZ4oDxbuN2YJDAT
 BTcgKrBM5nQm94JNa083RQSCU5LJxbC7ETkIh6NR73RSPCjUC1Wpxy1sAQAa2MPD
 8EzcJ/DjVGaHCI7adX10sz3xdUcyOz7qYz16HpoMGx+oSiq7pGEBtUiK97EYMcrB
 s+ADFUjYmx/pbEWv2r4c9zxNh7ZV3aLBsWwi7bScHIsv8GjrsA/mYLWskuwOV6BB
 22qZjfd0c4raiJwd+nmSx+D2Szv6lZ20gP+krtP2VNC6hUv7ft0VPLySiaFMmUHj
 quooDZis/W5n+4C9Q8Rk9uUtKzzJOngqW+duftiixHiNQ/ECP/QCAHhZYck/NOkL
 tZkNj6lJj9+2iR7mhbYROZ+wrYQzRvqNb2pJJQoi/wA0q7wPSKBi3m+51lPsht5W
 tn94CpaDDZ4IB7Fe1NtcA0UpYJSWpDQGlau4qp92HMCCIcRFfQEm+m9x8axwcj7m
 ECblHJYBPHuNcCHvPA8kHvr1nd6UUXrGPIo8TK8YhUUbK6pO0OjdNzZX496ia/2g
 pLzaW2ENTPLbXg==
 =27wH
 -----END PGP SIGNATURE-----

Merge tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache flush updates from Thomas Gleixner:
 "A reworked version of the opt-in L1D flush mechanism.

  This is a stop gap for potential future speculation related hardware
  vulnerabilities and a mechanism for truly security paranoid
  applications.

  It allows a task to request that the L1D cache is flushed when the
  kernel switches to a different mm. This can be requested via prctl().

  Changes vs the previous versions:

   - Get rid of the software flush fallback

   - Make the handling consistent with other mitigations

   - Kill the task when it ends up on a SMT enabled core which defeats
     the purpose of L1D flushing obviously"

* tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: Add L1D flushing Documentation
  x86, prctl: Hook L1D flushing in via prctl
  x86/mm: Prepare for opt-in based L1D flush in switch_mm()
  x86/process: Make room for TIF_SPEC_L1D_FLUSH
  sched: Add task_work callback for paranoid L1D flush
  x86/mm: Refactor cond_ibpb() to support other use cases
  x86/smp: Add a per-cpu view of SMT state
2021-08-30 15:00:33 -07:00
Linus Torvalds
4a2b88eb02 Perf events changes for v5.15 are:
- Add support for Intel Sapphire Rapids server CPU uncore events
  - Allow the AMD uncore driver to be built as a module
  - Misc cleanups and fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrWgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iPbg/+LTi1ki2kswWF5Fmoo0IWZ1GiCkWD0rSm
 SqCJR3u/MHd7QxsfPXFuhABt7mqaC50epfim0MNC1J11wRoNy6GzM/9nICdez9CK
 c2c3jAwoEfWE7o5PTSjXolH0FFXVYwQ9WBRJTBCaZjuXUW7TOk7h9o8fXoF8SssU
 VkP9z0YKanhN480v4k7/KR20ktY6GPFFu5cxhC0wyygZXIoG6ku+nmDjvN1ipo54
 JQYQbCTz+dj/C0uZehbEoTZWx7cajAxlbq+Iyor3ND30YHLeRxWNEcq/Jzn9Szqa
 LXnIi3mg4/mHc8mtZjDHXazpMxGYI02hkBACftPb2gizR4DOwtWlw1A2oHjbfmvM
 oF29kcZmFU/Na2O5JTf0IpV4LLkt/OlrTsTcd5ROYWgmri3UV18MhaKz0R0vQNhI
 8Xx6TJVA9fFHCIONg+4yxzDkYlxHJ9Tg8yFb06M6NWOnud03xYv4FGOzi0laoums
 8XbYZUnMh8aVkXz05CYUadknu/ajMOSqAZZAstng3unazrutSCMkZ+Pzs4X+yqq5
 Zz2Tb26oi6KepLD0eQNTmo1pRwuWC/IBqVF5aKH4e4qgyta/VxWob3Qd+I7ONaCl
 6HpbaWz01Nw8U2wE4dQB0wKwIGlLE8bRQTS2QFuqaEHu0yZsQGt8zMwVWB+f93Gb
 viglPyeA/y8=
 =QrPT
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf event updates from Ingo Molnar:

 - Add support for Intel Sapphire Rapids server CPU uncore events

 - Allow the AMD uncore driver to be built as a module

 - Misc cleanups and fixes

* tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header
  perf/amd/uncore: Allow the driver to be built as a module
  x86/cpu: Add get_llc_id() helper function
  perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/
  perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL
  perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
  perf/x86/intel: Replace deprecated CPU-hotplug functions
  perf/x86: Remove unused assignment to pointer 'e'
  perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX
  perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server
  perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server
  perf/x86/intel/uncore: Factor out snr_uncore_mmio_map()
  perf/x86/intel/uncore: Add alias PMU name
  perf/x86/intel/uncore: Add Sapphire Rapids server MDF support
  perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support
  perf/x86/intel/uncore: Add Sapphire Rapids server UPI support
  perf/x86/intel/uncore: Add Sapphire Rapids server M2M support
  perf/x86/intel/uncore: Add Sapphire Rapids server IMC support
  perf/x86/intel/uncore: Add Sapphire Rapids server PCU support
  perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support
  ...
2021-08-30 13:50:20 -07:00
Linus Torvalds
230bda0873 - The usual round of minor cleanups and fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEsrTYACgkQEsHwGGHe
 VUqq4g/9F89bnzmeaFxt2KURuQ3iKCIZ17UcawTFNiW9hboxls1K97a3O4PEy3f9
 ulu4PinW7SE+UnlWVTByvAGf4s+qUW7Jrznnr4yMHLAT4LnMdcwBaB3BQDrdKf+R
 E4QYp9G/LM/f+v4uRhGpd0vvZMwgW3HPGCpS87+IJKrSer8dqgy8GV8IFVZQCYlj
 c/GL67M7o6L9lB5l9yfWneokhBcNLXuqW8hrrEcHem3RYv5JUQ+YKzycS1spxPjb
 rhl+yB6dM92t4aitWHjyyQnzgDeIJUxRfSApDlR5BgzGv2xgmaslvYBln9+Jefv6
 pA+NSp/mYfgK829JvTWWIhznDwC7svMN7iMZQwik0gNR2/iWNWCVqspvszzYfKud
 +ZCCTtFVeijHFBjhviLE635bsrQr3NvmuM/X6PQbNu6g5ts4yPnkF6PxvVnalv4p
 SvnKLOLmevozlyE2Iv/HyF/Ky09j94wxeBx+IsMQ1mvXaKpi4S8x3Qhn37Sx1jiI
 Q97SQv+JxKil5SLrQZSLPhTFfmGLt+iY+DwdkwmtpYJaOBIsB6mIPH6KWp9Jhdcp
 QPoawuFrbRW7TcHdFFq5rVgfYmZ3gsEHAsCYk2xu4vIr7D/QZKDIa+GaCkua0NOI
 jZoRmLejY1IXG0IbysYgJxVVq9+3uRzAbfgat/dDmg/+Y7NCbwo=
 =bPrc
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:
 "The usual round of minor cleanups and fixes"

* tag 'x86_cleanups_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kaslr: Have process_mem_region() return a boolean
  x86/power: Fix kernel-doc warnings in cpu.c
  x86/mce/inject: Replace deprecated CPU-hotplug functions.
  x86/microcode: Replace deprecated CPU-hotplug functions.
  x86/mtrr: Replace deprecated CPU-hotplug functions.
  x86/mmiotrace: Replace deprecated CPU-hotplug functions.
2021-08-30 13:35:36 -07:00
Linus Torvalds
42f6e869a0 - A first round of changes towards splitting the arch-specific bits from
the filesystem bits of resctrl, the ultimate goal being to support ARM's
 equivalent technology MPAM, with the same fs interface (James Morse)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEsrC8ACgkQEsHwGGHe
 VUowsA//TblmI1t1kz6JUAneOVmZILJKRKyekrUsMr3twPpUcH1zMS4yE+ohE06X
 IZZYh3jQokirOxQAmntbh2/uLz5AntldRDzNCEsBZerLz1kW502xaLYMCWg5wPbs
 LWqacvHmOmQtXpB5fxA/jNIHxQyfKc1z/wWjTMwpU6K0P0usflx1UmSaiW7Kol6y
 KY8B1V1DRYsCmtGQ+0Ww4Fye6TZ/w9jwwFolerSVqXy0I8TZNFITUCfmkFeSbAIp
 uQMBXq5SGHn+Q9AsLB2xBhLmqkIb5482eC096t/UJPdFWcBnYdMsOmuLSxz6IuVF
 z5gMctnsoIwdhtV7YSpTPIKopGkZKuyN4EIEwv3LVn2J9FatoeWDZu7kUyWr9WaQ
 rp0u+09gxUBe68h9bNQkvUfSqi5RVSsijKzQ1vs5PpsTgo4lsPjtb4AjHr9bys/e
 2GSMuhaLv+h/kNnte1PXnO5+8+8Yt0N16rVPw8aJE543o2jOm21LUPb/TuAHEmi8
 uKn1iq3Dt9gzep92WMFXHwapbwkpF8NXqrcP/ibN97cXx3pt8wPL3yIqOl7rRsGI
 BxkGDHP33aBEqbBuMFo9wW8P7whdo9ZhqjfLNmL+6HIM+JXXa0dC6sd2aD6RDrx7
 +C0zffDWNfXiUWqC7bFURqRBj3bJ7oufLye3eSPHIMsssYIZYgQ=
 =1WGp
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:
 "A first round of changes towards splitting the arch-specific bits from
  the filesystem bits of resctrl, the ultimate goal being to support
  ARM's equivalent technology MPAM, with the same fs interface (James
  Morse)"

* tag 'x86_cache_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  x86/resctrl: Make resctrl_arch_get_config() return its value
  x86/resctrl: Merge the CDP resources
  x86/resctrl: Expand resctrl_arch_update_domains()'s msr_param range
  x86/resctrl: Remove rdt_cdp_peer_get()
  x86/resctrl: Merge the ctrl_val arrays
  x86/resctrl: Calculate the index from the configuration type
  x86/resctrl: Apply offset correction when config is staged
  x86/resctrl: Make ctrlval arrays the same size
  x86/resctrl: Pass configuration type to resctrl_arch_get_config()
  x86/resctrl: Add a helper to read a closid's configuration
  x86/resctrl: Rename update_domains() to resctrl_arch_update_domains()
  x86/resctrl: Allow different CODE/DATA configurations to be staged
  x86/resctrl: Group staged configuration into a separate struct
  x86/resctrl: Move the schemata names into struct resctrl_schema
  x86/resctrl: Add a helper to read/set the CDP configuration
  x86/resctrl: Swizzle rdt_resource and resctrl_schema in pseudo_lock_region
  x86/resctrl: Pass the schema to resctrl filesystem functions
  x86/resctrl: Add resctrl_arch_get_num_closid()
  x86/resctrl: Store the effective num_closid in the schema
  x86/resctrl: Walk the resctrl schema list instead of an arch list
  ...
2021-08-30 13:31:36 -07:00
Linus Torvalds
8f645b4208 - Do not start processing MCEs logged early because the decoding chain
is not up yet - delay that processing until everything is ready
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEsp9MACgkQEsHwGGHe
 VUp1ARAAtlYWtIQXUV+YiGGnueO3WZlbME/O/I6uz3KSIgZg59qaXPpscL7vCjS1
 pFav1F3GKpplvkWAPBQf290yLYn934oLDanof1ENcAQxE46RD/CTr4V1J3xSzb31
 +XEcwHoDFMdSHp219pinNWEDFGlYvb9/Q8AxCMFlUomRrPE/1tjo2l8gmZ7Wi1r5
 uUwImmiYKnKoin3HpaY5n0NW+3madWhVfPhwv7Tsh0Hp0Qh5KRra+0OoLSGeKNDk
 EDpKE/XKAjCNjcBLNIAWtLCHvPC1bWh2jC+Qmu8a9UGn4e5KmWw+GCUoOuBRuA6X
 KOAoXtZXARsDDXadQKjNnQ4P3CvFcmNpCzjoFAarsflA3vKSHttDamk14bxE/mRx
 AkNdw7oQYfOLAP2PVpPGgBEiU5WIoPwXdudDi6R8Tu107Z0mhiJIgbjPajJOvFpj
 EnTO4k/A+Lk4uhPYYFhlhphDT/B8KdVlnEv0VJds0nPEE2uC3JiXxu7dHbBPYcXm
 KtvJ5A+jDEMvFoPzoqwDBr7EzeHUUjyyI1Gym5EsjCNoH4WG8SyXYngCRrFOaIFe
 e047UnJ4a7ibsFHAPrB9LtWLxvhPl13D4D+Ejb1GvnzDwSaAJAP9gVBGC0VpicId
 NGEYAjsRsLGP2D3I4+Bhh1MN6mL/wlcmzO4izXXhMyc9vuTrXws=
 =1ROV
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS update from Borislav Petkov:
 "A single RAS change for 5.15:

   - Do not start processing MCEs logged early because the decoding
     chain is not up yet - delay that processing until everything is
     ready"

* tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Defer processing of early errors
2021-08-30 13:23:17 -07:00
Kim Phillips
9164d9493a x86/cpu: Add get_llc_id() helper function
Factor out a helper function rather than export cpu_llc_id, which is
needed in order to be able to build the AMD uncore driver as a module.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
2021-08-26 09:14:36 +02:00
Borislav Petkov
3bff147b18 x86/mce: Defer processing of early errors
When a fatal machine check results in a system reset, Linux does not
clear the error(s) from machine check bank(s) - hardware preserves the
machine check banks across a warm reset.

During initialization of the kernel after the reboot, Linux reads, logs,
and clears all machine check banks.

But there is a problem. In:

  5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")

the call to mce_register_decode_chain() moved later in the boot
sequence. This means that /dev/mcelog doesn't see those early error
logs.

This was partially fixed by:

  cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")

which made sure that the logs were not lost completely by printing
to the console. But parsing console logs is error prone. Users of
/dev/mcelog should expect to find any early errors logged to standard
places.

Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early
machine check initialization to indicate that any errors found should
just be queued to genpool. When mcheck_late_init() is called it will
call mce_schedule_work() to actually log and flush any errors queued in
the genpool.

 [ Based on an original patch, commit message by and completely
   productized by Tony Luck. ]

Fixes: 5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
2021-08-24 10:40:58 +02:00
Babu Moger
527f721478 x86/resctrl: Fix a maybe-uninitialized build warning treated as error
The recent commit

  064855a690 ("x86/resctrl: Fix default monitoring groups reporting")

caused a RHEL build failure with an uninitialized variable warning
treated as an error because it removed the default case snippet.

The RHEL Makefile uses '-Werror=maybe-uninitialized' to force possibly
uninitialized variable warnings to be treated as errors. This is also
reported by smatch via the 0day robot.

The error from the RHEL build is:

  arch/x86/kernel/cpu/resctrl/monitor.c: In function ‘__mon_event_count’:
  arch/x86/kernel/cpu/resctrl/monitor.c:261:12: error: ‘m’ may be used
  uninitialized in this function [-Werror=maybe-uninitialized]
    m->chunks += chunks;
              ^~

The upstream Makefile does not build using '-Werror=maybe-uninitialized'.
So, the problem is not seen there. Fix the problem by putting back the
default case snippet.

 [ bp: note that there's nothing wrong with the code and other compilers
   do not trigger this warning - this is being done just so the RHEL compiler
   is happy. ]

Fixes: 064855a690 ("x86/resctrl: Fix default monitoring groups reporting")
Reported-by: Terry Bowman <Terry.Bowman@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/162949631908.23903.17090272726012848523.stgit@bmoger-ubuntu
2021-08-22 09:11:29 +02:00
Babu Moger
064855a690 x86/resctrl: Fix default monitoring groups reporting
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
on the entire filesystem.

Steps to reproduce:

  1. mount -t resctrl resctrl /sys/fs/resctrl/

  2. cd /sys/fs/resctrl/

  3. cat mon_data/mon_L3_00/mbm_total_bytes
     23189832

  4. Create sub monitor group:
  mkdir mon_groups/test1

  5. cat mon_data/mon_L3_00/mbm_total_bytes
     Unavailable

When a new monitoring group is created, a new RMID is assigned to the
new group. But the RMID is not active yet. When the events are read on
the new RMID, it is expected to report the status as "Unavailable".

When the user reads the events on the default monitoring group with
multiple subgroups, the events on all subgroups are consolidated
together. Currently, if any of the RMID reads report as "Unavailable",
then everything will be reported as "Unavailable".

Fix the issue by discarding the "Unavailable" reads and reporting all
the successful RMID reads. This is not a problem on Intel systems as
Intel reports 0 on Inactive RMIDs.

Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Reported-by: Paweł Szulik <pawel.szulik@intel.com>
Signed-off-by: Babu Moger <Babu.Moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311
Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
2021-08-12 20:12:20 +02:00
James Morse
111136e69c x86/resctrl: Make resctrl_arch_get_config() return its value
resctrl_arch_get_config() has no return, but does pass a single value
back via one of its arguments.

Return the value instead.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210811163831.14917-1-james.morse@arm.com
2021-08-11 18:42:53 +02:00
James Morse
5c3b63cdba x86/resctrl: Merge the CDP resources
resctrl uses struct rdt_resource to describe the available hardware
resources. The domains of the CDP aliases share a single ctrl_val[]
array. The only differences between the struct rdt_hw_resource aliases
is the name and conf_type.

The name from struct rdt_hw_resource is visible to user-space. To
support another architecture, as many user-visible details should be
handled in the filesystem parts of the code that is common to all
architectures. The name and conf_type go together.

Remove conf_type and the CDP aliases. When CDP is supported and enabled,
schemata_list_create() can create two schemata using the single
resource, generating the CODE/DATA suffix to the schema name itself.

This allows the alloc_ctrlval_array() and complications around free()ing
the ctrl_val arrays to be removed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-25-james.morse@arm.com
2021-08-11 18:39:42 +02:00
James Morse
327364d5b6 x86/resctrl: Expand resctrl_arch_update_domains()'s msr_param range
resctrl_arch_update_domains() specifies the one closid that has been
modified and needs copying to the hardware.

resctrl_arch_update_domains() takes a struct rdt_resource and a closid
as arguments, but copies all the staged configurations for that closid
into the ctrl_val[] array.

resctrl_arch_update_domains() is called once per schema, but once the
resources and domains are merged, the second call of a L2CODE/L2DATA
pair will find no staged configurations, as they were previously
applied. The msr_param of the first call only has one index, so would
only have update the hardware for the last staged configuration.

To avoid a second round of IPIs when changing L2CODE and L2DATA in one
go, expand the range of the msr_param if multiple staged configurations
are found.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-24-james.morse@arm.com
2021-08-11 18:37:02 +02:00
James Morse
fbc06c6980 x86/resctrl: Remove rdt_cdp_peer_get()
When CDP is enabled, rdt_cdp_peer_get() finds the alternative
CODE/DATA resource and returns the alternative domain. This is used
to determine if bitmaps overlap when there are aliased entries
in the two struct rdt_hw_resources.

Now that the ctrl_val[] used by the CODE/DATA resources is the same,
the search for an alternate resource/domain is not needed.

Replace rdt_cdp_peer_get() with resctrl_peer_type(), which returns
the alternative type. This can be passed to resctrl_arch_get_config()
with the same resource and domain.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-23-james.morse@arm.com
2021-08-11 18:33:48 +02:00
James Morse
43ac1dbf61 x86/resctrl: Merge the ctrl_val arrays
Each struct rdt_hw_resource has its own ctrl_val[] array. When CDP is
enabled, two resources are in use, each with its own ctrl_val[] array
that holds half of the configuration used by hardware. One uses the odd
slots, the other the even. rdt_cdp_peer_get() is the helper to find the
alternate resource, its domain, and corresponding entry in the other
ctrl_val[] array.

Once the CDP resources are merged there will be one struct
rdt_hw_resource and one ctrl_val[] array for each hardware resource.
This will include changes to rdt_cdp_peer_get(), making it hard to
bisect any issue.

Merge the ctrl_val[] arrays for three CODE/DATA/NONE resources first.
Doing this before merging the resources temporarily complicates
allocating and freeing the ctrl_val arrays. Add a helper to allocate
the ctrl_val array, that returns the value on the L2 or L3 resource if
it already exists. This gets removed once the resources are merged, and
there really is only one ctrl_val[] array.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-22-james.morse@arm.com
2021-08-11 18:31:04 +02:00
James Morse
2b8dd4ab65 x86/resctrl: Calculate the index from the configuration type
resctrl uses cbm_idx() to map a closid to an index in the configuration
array. This is based on a multiplier and offset that are held in the
resource.

To merge the resources, the resctrl arch code needs to calculate the
index from something else, as there will only be one resource.

Decide based on the staged configuration type. This makes the static
mult and offset parameters redundant.

 [ bp: Remove superfluous brackets in get_config_index() ]

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-21-james.morse@arm.com
2021-08-11 18:19:06 +02:00
James Morse
2e7df368fc x86/resctrl: Apply offset correction when config is staged
When resctrl comes to copy the CAT MSR values from the ctrl_val[] array
into hardware, it applies an offset adjustment based on the type of
the resource. CODE and DATA resources have their closid mapped into an
odd/even range. This mapping is based on a property of the resource.

This happens once the new control value has been written to the ctrl_val[]
array. Once the CDP resources are merged, there will only be a single
property that needs to cover both odd/even mappings to the single
ctrl_val[] array. The offset adjustment must be applied before the new
value is written to the array.

Move the logic from cat_wrmsr() to resctrl_arch_update_domains(). The
value provided to apply_config() is now an index in the array, not the
closid. The parameters provided via struct msr_param are now indexes
too. As resctrl's use of closid is a u32, struct msr_param's type is
changed to match.

With this, the CODE and DATA resources only use the odd or even
indexes in the array. This allows the temporary num_closid/2 fixes in
domain_setup_ctrlval() and reset_all_ctrls() to be removed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-20-james.morse@arm.com
2021-08-11 18:03:28 +02:00
James Morse
141739aa73 x86/resctrl: Make ctrlval arrays the same size
The CODE and DATA resources report a num_closid that is half the actual
size supported by the hardware. This behaviour is visible to user-space
when CDP is enabled.

The CODE and DATA resources have their own ctrlval arrays which are
half the size of the underlying hardware because num_closid was already
adjusted. One holds the odd configurations values, the other even.

Before the CDP resources can be merged, the 'half the closids' behaviour
needs to be implemented by schemata_list_create(), but this causes the
ctrl_val[] array to be full sized.

Remove the logic from the architecture specific rdt_get_cdp_config()
setup, and add it to schemata_list_create(). Functions that walk all the
configurations, such as domain_setup_ctrlval() and reset_all_ctrls(),
take num_closid directly from struct rdt_hw_resource also have
to halve num_closid as only the lower half of each array is in
use. domain_setup_ctrlval() and reset_all_ctrls() both copy struct
rdt_hw_resource's num_closid to a struct msr_param. Correct the value
here.

This is temporary as a subsequent patch will merge all three ctrl_val[]
arrays such that when CDP is in use, the CODA/DATA layout in the array
matches the hardware. reset_all_ctrls()'s loop over the whole of
ctrl_val[] is not touched as this is harmless, and will be required as
it is once the resources are merged.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-19-james.morse@arm.com
2021-08-11 17:58:33 +02:00
James Morse
fa8f711d2f x86/resctrl: Pass configuration type to resctrl_arch_get_config()
The ctrl_val[] array for a struct rdt_hw_resource only holds
configurations of one type. The type is implicit.

Once the CDP resources are merged, the ctrl_val[] array will hold all
the configurations for the hardware resource. When a particular type of
configuration is needed, it must be specified explicitly.

Pass the expected type from the schema into resctrl_arch_get_config().
Nothing uses this yet, but once a single ctrl_val[] array is used for
the three struct rdt_hw_resources that share hardware, the type will be
used to return the correct configuration value from the shared array.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-18-james.morse@arm.com
2021-08-11 17:53:53 +02:00
James Morse
f07e9d0250 x86/resctrl: Add a helper to read a closid's configuration
Functions like show_doms() reach into the architecture's private
structure to retrieve the configuration from the struct rdt_hw_resource.

The hardware configuration may look completely different to the
values resctrl gets from user-space. The staged configuration and
resctrl_arch_update_domains() allow the architecture to convert or
translate these values.

Resctrl shouldn't read or write the ctrl_val[] values directly. Add
a helper to read the current configuration. This will allow another
architecture to scale the bitmaps if necessary, and possibly use
controls that don't take the user-space control format at all.

Of the remaining functions that access ctrl_val[] directly,
apply_config() is part of the architecture-specific code, and is
called via resctrl_arch_update_domains(). reset_all_ctrls() will be an
architecture specific helper.

update_mba_bw() manipulates both ctrl_val[], mbps_val[] and the
hardware. The mbps_val[] that matches the mba_sc state of the resource
is changed, but the other is left unchanged. Abstracting this is the
subject of later patches that affect set_mba_sc() too.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-17-james.morse@arm.com
2021-08-11 17:46:34 +02:00
James Morse
2e6678195d x86/resctrl: Rename update_domains() to resctrl_arch_update_domains()
update_domains() merges the staged configuration changes into the arch
codes configuration array. Rename to make it clear it is part of the
arch code interface to resctrl.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-16-james.morse@arm.com
2021-08-11 16:36:58 +02:00
James Morse
75408e4350 x86/resctrl: Allow different CODE/DATA configurations to be staged
Before the CDP resources can be merged, struct rdt_domain will need an
array of struct resctrl_staged_config, one per type of configuration.

Use the type as an index to the array to ensure that a schema
configuration string can't specify the same domain twice. This will
allow two schemata to apply configuration changes to one resource.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-15-james.morse@arm.com
2021-08-11 16:33:42 +02:00
James Morse
e8f7282552 x86/resctrl: Group staged configuration into a separate struct
When configuration changes are made, the new value is written to struct
rdt_domain's new_ctrl field and the have_new_ctrl flag is set. Later
new_ctrl is copied to hardware by a call to update_domains().

Once the CDP resources are merged, there will be one new_ctrl field in
use by two struct resctrl_schema requiring a per-schema IPI to copy the
value to hardware.

Move new_ctrl and have_new_ctrl into a new struct resctrl_staged_config.
Before the CDP resources can be merged, struct rdt_domain will need an
array of these, one per type of configuration. Using the type as an
index to the array will ensure that a schema configuration string can't
specify the same domain twice.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-14-james.morse@arm.com
2021-08-11 16:32:32 +02:00
James Morse
e198fde3fe x86/resctrl: Move the schemata names into struct resctrl_schema
resctrl 'info' directories and schema parsing use the schema name.
This lives in the struct rdt_resource, and is specified by the
architecture code.

Once the CDP resources are merged, there will only be one resource (and
one name) in use by two schemata. To allow the CDP CODE/DATA property to
be the type of configuration the schema uses, the name should also be
per-schema.

Add a name field to struct resctrl_schema, and use this wherever
the schema name is exposed (or read from) user-space. Calculating
max_name_width for padding the schemata file also moves as this is
visible to user-space. As the names in struct rdt_resource already
include the CDP information, schemata_list_create() copies them.

schemata_list_create() includes the length of the CDP suffix when
calculating max_name_width in preparation for CDP resources being
merged.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-13-james.morse@arm.com
2021-08-11 16:21:35 +02:00
James Morse
c091e90721 x86/resctrl: Add a helper to read/set the CDP configuration
Whether CDP is enabled for a hardware resource like the L3 cache can be
found by inspecting the alloc_enabled flags of the L3CODE/L3DATA struct
rdt_hw_resources, even if they aren't in use.

Once these resources are merged, the flags can't be compared. Whether
CDP is enabled needs tracking explicitly. If another architecture is
emulating CDP the behaviour may not be per-resource. 'cdp_capable' needs
to be visible to resctrl, even if its not in use, as this affects the
padding of the schemata table visible to user-space.

Add cdp_enabled to struct rdt_hw_resource and cdp_capable to struct
rdt_resource. Add resctrl_arch_set_cdp_enabled() to let resctrl enable
or disable CDP on a resource. resctrl_arch_get_cdp_enabled() lets it
read the current state.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-12-james.morse@arm.com
2021-08-11 15:54:26 +02:00
James Morse
32150edd3f x86/resctrl: Swizzle rdt_resource and resctrl_schema in pseudo_lock_region
struct pseudo_lock_region points to the rdt_resource.

Once the resources are merged, this won't be unique. The resource name
is moving into the schema, so that the filesystem portions of resctrl can
generate it.

Swap pseudo_lock_region's rdt_resource pointer for a schema pointer.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-11-james.morse@arm.com
2021-08-11 15:51:45 +02:00
James Morse
1c290682c0 x86/resctrl: Pass the schema to resctrl filesystem functions
Once the CDP resources are merged, there will be two struct
resctrl_schema for one struct rdt_resource. CDP becomes a type of
configuration that belongs to the schema.

Helpers like rdtgroup_cbm_overlaps() need access to the schema to query
the configuration (or configurations) based on schema properties.

Change these functions to take a struct schema instead of the struct
rdt_resource. All the modified functions are part of the filesystem code
that will move to /fs/resctrl once it is possible to support a second
architecture.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-10-james.morse@arm.com
2021-08-11 15:43:54 +02:00
James Morse
eb6f318769 x86/resctrl: Add resctrl_arch_get_num_closid()
To initialise struct resctrl_schema's num_closid, schemata_list_create()
reaches into the architectures private structure to retrieve num_closid
from the struct rdt_hw_resource. The 'half the closids' behaviour should
be part of the filesystem parts of resctrl that are the same on any
architecture. struct resctrl_schema's num_closid should include any
correction for CDP.

Having two properties called num_closid is likely to be confusing when
they have different values.

Add a helper to read the resource's num_closid from the arch code.
This should return the number of closid that the resource supports,
regardless of whether CDP is in use. Once the CDP resources are merged,
schemata_list_create() can apply the correction itself.

Using a type with an obvious size for the arch helper means changing the
type of num_closid to u32, which matches the type already used by struct
rdtgroup.

reset_all_ctrls() does not use resctrl_arch_get_num_closid(), even
though it sets up a structure for modifying the hardware. This function
will be part of the architecture code, the maximum closid should be the
maximum value the hardware has, regardless of the way resctrl is using
it. All the uses of num_closid in core.c are naturally part of the
architecture specific code.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-9-james.morse@arm.com
2021-08-11 15:35:42 +02:00
James Morse
3183e87c1b x86/resctrl: Store the effective num_closid in the schema
Struct resctrl_schema holds properties that vary with the style of
configuration that resctrl applies to a resource. There are already
two values for the hardware's num_closid, depending on whether the
architecture presents the L3 or L3CODE/L3DATA resources.

As the way CDP changes the number of control groups that resctrl can
create is part of the user-space interface, it should be managed by the
filesystem parts of resctrl. This allows the architecture code to only
describe the value the hardware supports.

Add num_closid to resctrl_schema. This is the value seen by the
filesystem, which may be different to the maximum value described by the
arch code when CDP is enabled.

These functions operate on the num_closid value that is exposed to
user-space:

  * rdtgroup_parse_resource()
  * rdtgroup_schemata_show()
  * rdt_num_closids_show()
  * closid_init()

Change them to use the schema value instead. schemata_list_create() sets
this value, and reaches into the architecture-specific structure to get
the value. This will eventually be replaced with a helper.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-8-james.morse@arm.com
2021-08-11 15:24:27 +02:00
James Morse
331ebe4c43 x86/resctrl: Walk the resctrl schema list instead of an arch list
When parsing a schema configuration value from user-space, resctrl walks
the architectures rdt_resources_all[] array to find a matching struct
rdt_resource.

Once the CDP resources are merged there will be one resource in use
by two schemata. Anything walking rdt_resources_all[] on behalf of a
user-space request should walk the list of struct resctrl_schema
instead.

Change the users of for_each_alloc_enabled_rdt_resource() to walk the
schema instead. Schemata were only created for alloc_enabled resources
so these two lists are currently equivalent.

schemata_list_create() and rdt_kill_sb() are ignored. The first
creates the schema list, and will eventually loop over the resource
indexes using an arch helper to retrieve the resource. rdt_kill_sb()
will eventually make use of an arch 'reset everything' helper.

After the filesystem code is moved, rdtgroup_pseudo_locked_in_hierarchy()
remains part of the x86 specific hooks to support pseudo lock. This
code walks each domain, and still does this after the separate resources
are merged.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-7-james.morse@arm.com
2021-08-11 13:20:43 +02:00
James Morse
208ab16847 x86/resctrl: Label the resources with their configuration type
The names of resources are used for the schema name presented to
user-space. The name used is rooted in a structure provided by the
architecture code because the names are different when CDP is enabled.
x86 implements this by swapping between two sets of resource structures
based on their alloc_enabled flag. The type of configuration in-use is
encoded in the name (and cbm_idx_offset).

Once the CDP behaviour is moved into the parts of resctrl that will
move to /fs/, there will be two struct resctrl_schema for one struct
rdt_resource. The schema describes the type of configuration being
applied to the resource. The name of the schema should be generated
by resctrl, base on the type of configuration. To do this struct
resctrl_schema needs to store the type of configuration in use for a
schema.

Create an enum resctrl_conf_type describing the options, and add it to
struct resctrl_schema. The underlying resources are still separate, as
cbm_idx_offset is still in use.

Temporarily label all the entries in rdt_resources_all[] and copy that
value to struct resctrl_schema. Copying the value ensures there is no
mismatch while the filesystem parts of resctrl are modified to use the
schema. Once the resources are merged, the filesystem code can assign
this value based on the schema being created.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-6-james.morse@arm.com
2021-08-11 13:13:18 +02:00
James Morse
f259449230 x86/resctrl: Pass the schema in info dir's private pointer
Many of resctrl's per-schema files return a value from struct
rdt_resource, which they take as their 'priv' pointer.

Moving properties that resctrl exposes to user-space into the core 'fs'
code, (e.g. the name of the schema), means some of the functions that
back the filesystem need the schema struct (to where the properties are
moved), but currently take struct rdt_resource. For example, once the
CDP resources are merged, struct rdt_resource no longer reflects all the
properties of the schema.

For the info dirs that represent a control, the information needed
will be accessed via struct resctrl_schema, as this is how the resource
is being used. For the monitors, its still struct rdt_resource as the
monitors aren't described as schema.

This difference means the type of the private pointers varies between
control and monitor info dirs.

Change the 'priv' pointer to point to struct resctrl_schema for
the per-schema files that represent a control. The type can be
determined from the fflags field. If the flags are RF_MON_INFO, its
a struct rdt_resource. If the flags are RF_CTRL_INFO, its a struct
resctrl_schema. No entry in res_common_files[] has both flags.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-5-james.morse@arm.com
2021-08-11 12:41:19 +02:00
James Morse
cdb9ebc917 x86/resctrl: Add a separate schema list for resctrl
Resctrl exposes schemata to user-space, which allow the control values
to be specified for a group of tasks.

User-visible properties of the interface, (such as the schemata names
and how the values are parsed) are rooted in a struct provided by the
architecture code. (struct rdt_hw_resource). Once a second architecture
uses resctrl, this would allow user-visible properties to diverge
between architectures.

These properties should come from the resctrl code that will be common
to all architectures. Resctrl has no per-schema structure, only struct
rdt_{hw_,}resource. Create a struct resctrl_schema to hold the
rdt_resource. Before a second architecture can be supported, this
structure will also need to hold the schema name visible to user-space
and the type of configuration values for resctrl.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-4-james.morse@arm.com
2021-08-11 12:28:01 +02:00
James Morse
792e0f6f78 x86/resctrl: Split struct rdt_domain
resctrl is the defacto Linux ABI for SoC resource partitioning features.

To support it on another architecture, it needs to be abstracted from
the features provided by Intel RDT and AMD PQoS, and moved to /fs/.
struct rdt_domain contains a mix of architecture private details and
properties of the filesystem interface user-space uses.

Continue by splitting struct rdt_domain, into an architecture private
'hw' struct, which contains the common resctrl structure that would be
used by any architecture. The hardware values in ctrl_val and mbps_val
need to be accessed via helpers to allow another architecture to convert
these into a different format if necessary. After this split, filesystem
code paths touching a 'hw' struct indicates where an abstraction is
needed.

Splitting this structure only moves types around, and should not lead
to any change in behaviour.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-3-james.morse@arm.com
2021-08-11 12:00:43 +02:00
James Morse
63c8b12319 x86/resctrl: Split struct rdt_resource
resctrl is the defacto Linux ABI for SoC resource partitioning features.

To support it on another architecture, it needs to be abstracted from
the features provided by Intel RDT and AMD PQoS, and moved to /fs/.
struct rdt_resource contains a mix of architecture private details
and properties of the filesystem interface user-space uses.

Start by splitting struct rdt_resource, into an architecture private
'hw' struct, which contains the common resctrl structure that would be
used by any architecture. The foreach helpers are most commonly used by
the filesystem code, and should return the common resctrl structure.
for_each_rdt_resource() is changed to walk the common structure in its
parent arch private structure.

Move as much of the structure as possible into the common structure
in the core code's header file. The x86 hardware accessors remain
part of the architecture private code, as do num_closid, mon_scale
and mbm_width.

mon_scale and mbm_width are used to detect overflow of the hardware
counters, and convert them from their native size to bytes. Any
cross-architecture abstraction should be in terms of bytes, making
these properties private.

The hardware's num_closid is kept in the private structure to force the
filesystem code to use a helper to access it. MPAM would return a single
value for the system, regardless of the resource. Using the helper
prevents this field from being confused with the version of num_closid
that is being exposed to user-space (added in a later patch).

After this split, filesystem code touching a 'hw' struct indicates
where an abstraction is needed.

Splitting this structure only moves types around, and should not lead
to any change in behaviour.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-2-james.morse@arm.com
2021-08-11 11:51:34 +02:00
Sebastian Andrzej Siewior
8ae9e3f638 x86/mce/inject: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-10-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Sebastian Andrzej Siewior
2089f34f8c x86/microcode: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-9-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Sebastian Andrzej Siewior
1a351eefd4 x86/mtrr: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-8-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Balbir Singh
e893bb1bb4 x86, prctl: Hook L1D flushing in via prctl
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D flush
capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.

Enabling L1D flush does not check if the task is running on an SMT enabled
core, rather a check is done at runtime (at the time of flush), if the task
runs on a SMT sibling then the task is sent a SIGBUS which is executed
before the task returns to user space or to a guest.

This is better than the other alternatives of:

  a. Ensuring strict affinity of the task (hard to enforce without further
     changes in the scheduler)

  b. Silently skipping flush for tasks that move to SMT enabled cores.

Hook up the core prctl and implement the x86 specific parts which in turn
makes it functional.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-5-sblbir@amazon.com
2021-07-28 11:42:25 +02:00
Balbir Singh
b5f06f64e2 x86/mm: Prepare for opt-in based L1D flush in switch_mm()
The goal of this is to allow tasks that want to protect sensitive
information, against e.g. the recently found snoop assisted data sampling
vulnerabilites, to flush their L1D on being switched out.  This protects
their data from being snooped or leaked via side channels after the task
has context switched out.

This could also be used to wipe L1D when an untrusted task is switched in,
but that's not a really well defined scenario while the opt-in variant is
clearly defined.

The mechanism is default disabled and can be enabled on the kernel command
line.

Prepare for the actual prctl based opt-in:

  1) Provide the necessary setup functionality similar to the other
     mitigations and enable the static branch when the command line option
     is set and the CPU provides support for hardware assisted L1D
     flushing. Software based L1D flush is not supported because it's CPU
     model specific and not really well defined.

     This does not come with a sysfs file like the other mitigations
     because it is not bound to any specific vulnerability.

     Support has to be queried via the prctl(2) interface.

  2) Add TIF_SPEC_L1D_FLUSH next to L1D_SPEC_IB so the two bits can be
     mangled into the mm pointer in one go which allows to reuse the
     existing mechanism in switch_mm() for the conditional IBPB speculation
     barrier efficiently.

  3) Add the L1D flush specific functionality which flushes L1D when the
     outgoing task opted in.

     Also check whether the incoming task has requested L1D flush and if so
     validate that it is not accidentaly running on an SMT sibling as this
     makes the whole excercise moot because SMT siblings share L1D which
     opens tons of other attack vectors. If that happens schedule task work
     which signals the incoming task on return to user/guest with SIGBUS as
     this is part of the paranoid L1D flush contract.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
2021-07-28 11:42:24 +02:00
Wei Liu
f5a11c69b6 Revert "x86/hyperv: fix logical processor creation"
This reverts commit 450605c28d.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-07-21 15:55:43 +00:00
Ani Sinha
5f92b45c3b x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0
Commit dce7cd6275 ("x86/hyperv: Allow guests to enable InvariantTSC")
added the support for HV_X64_MSR_TSC_INVARIANT_CONTROL. Setting bit 0
of this synthetic MSR will allow hyper-v guests to report invariant TSC
CPU feature through CPUID. This comment adds this explanation to the code
and mentions where the Intel's generic platform init code reads this
feature bit from CPUID. The comment will help developers understand how
the two parts of the initialization (hyperV specific and non-hyperV
specific generic hw init) are related.

Signed-off-by: Ani Sinha <ani@anisinha.ca>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210716133245.3272672-1-ani@anisinha.ca
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-07-16 14:51:54 +00:00
Michael Kelley
6dc77fa5ac Drivers: hv: Move Hyper-V misc functionality to arch-neutral code
The check for whether hibernation is possible, and the enabling of
Hyper-V panic notification during kexec, are both architecture neutral.
Move the code from under arch/x86 and into drivers/hv/hv_common.c where
it can also be used for ARM64.

No functional change.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-4-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-07-15 12:59:45 +00:00
Michael Kelley
9d7cf2c967 Drivers: hv: Add arch independent default functions for some Hyper-V handlers
Architecture independent Hyper-V code calls various arch-specific handlers
when needed.  To aid in supporting multiple architectures, provide weak
defaults that can be overridden by arch-specific implementations where
appropriate.  But when arch-specific overrides aren't needed or haven't
been implemented yet for a particular architecture, these stubs reduce
the amount of clutter under arch/.

No functional change.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-3-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-07-15 12:59:45 +00:00
Michael Kelley
afca4d95dd Drivers: hv: Make portions of Hyper-V init code be arch neutral
The code to allocate and initialize the hv_vp_index array is
architecture neutral. Similarly, the code to allocate and
populate the hypercall input and output arg pages is architecture
neutral.  Move both sets of code out from arch/x86 and into
utility functions in drivers/hv/hv_common.c that can be shared
by Hyper-V initialization on ARM64.

No functional changes. However, the allocation of the hypercall
input and output arg pages is done differently so that the
size is always the Hyper-V page size, even if not the same as
the guest page size (such as with ARM64's 64K page size).

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-2-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-07-15 12:59:45 +00:00