Commit Graph

11591 Commits

Author SHA1 Message Date
Lai Jiangshan
84e5ffd045 KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest
When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
allocated with role.level = 5 and the guest pagetable's root gfn.

And root_sp->spt[0] is also allocated with the same gfn and the same
role except role.level = 4.  Luckily that they are different shadow
pages, but only root_sp->spt[0] is the real translation of the guest
pagetable.

Here comes a problem:

If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
and uses the same gfn as the root page for nested NPT before and after
switching gCR4_LA57.  The host (hCR4_LA57=1) might use the same root_sp
for the guest even the guest switches gCR4_LA57.  The guest will see
unexpected page mapped and L2 may exploit the bug and hurt L1.  It is
lucky that the problem can't hurt L0.

And three special cases need to be handled:

The root_sp should be like role.direct=1 sometimes: its contents are
not backed by gptes, root_sp->gfns is meaningless.  (For a normal high
level sp in shadow paging, sp->gfns is often unused and kept zero, but
it could be relevant and meaningful if sp->gfns is used because they
are backed by concrete gptes.)

For such root_sp in the case, root_sp is just a portal to contribute
root_sp->spt[0], and root_sp->gfns should not be used and
root_sp->spt[0] should not be dropped if gpte[0] of the guest root
pagetable is changed.

Such root_sp should not be accounted too.

So add role.passthrough to distinguish the shadow pages in the hash
when gCR4_LA57 is toggled and fix above special cases by using it in
kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:50:00 -04:00
Paolo Bonzini
347a0d0ded KVM: x86/mmu: replace direct_map with root_role.direct
direct_map is always equal to the direct field of the root page's role:

- for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
copied from cpu_role.base.direct

- for TDP, it is always true and root_role.direct is also always true

- for shadow TDP, it is always false and root_role.direct is also always
false

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:59 -04:00
Paolo Bonzini
4d25502aa1 KVM: x86/mmu: replace root_level with cpu_role.base.level
Remove another duplicate field of struct kvm_mmu.  This time it's
the root level for page table walking; the separate field is
always initialized as cpu_role.base.level, so its users can look
up the CPU mode directly instead.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:58 -04:00
Paolo Bonzini
a972e29c1d KVM: x86/mmu: replace shadow_root_level with root_role.level
root_role.level is always the same value as shadow_level:

- it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu

- it's the level argument when going through kvm_init_shadow_ept_mmu

- it's assigned directly from new_role.base.level when going
  through shadow_mmu_init_context

Remove the duplication and get the level directly from the role.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:58 -04:00
Paolo Bonzini
faf729621c KVM: x86/mmu: remove redundant bits from extended role
Before the separation of the CPU and the MMU role, CR0.PG was not
available in the base MMU role, because two-dimensional paging always
used direct=1 in the MMU role.  However, now that the raw role is
snapshotted in mmu->cpu_role, the value of CR0.PG always matches both
!cpu_role.base.direct and cpu_role.base.level > 0.  There is no need to
store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg
accessor by hand that takes care of the conversion.  Use cpu_role.base.level
since the future of the direct field is unclear.

Likewise, CR4.PAE is now always present in the CPU role as
!cpu_role.base.has_4_byte_gpte.  The inversion makes certain tests on
the MMU role easier, and is easily hidden by the is_cr4_pae accessor
when operating on the CPU role.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:57 -04:00
Paolo Bonzini
7a7ae82923 KVM: x86/mmu: rename kvm_mmu_role union
It is quite confusing that the "full" union is called kvm_mmu_role
but is used for the "cpu_role" field of struct kvm_mmu.  Rename it
to kvm_cpu_role.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:56 -04:00
Paolo Bonzini
7a458f0e1b KVM: x86/mmu: remove extended bits from mmu_role, rename field
mmu_role represents the role of the root of the page tables.
It does not need any extended bits, as those govern only KVM's
page table walking; the is_* functions used for page table
walking always use the CPU role.

ext.valid is not present anymore in the MMU role, but an
all-zero MMU role is impossible because the level field is
never zero in the MMU role.  So just zap the whole mmu_role
in order to force invalidation after CPUID is updated.

While making this change, which requires touching almost every
occurrence of "mmu_role", rename it to "root_role".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:56 -04:00
Paolo Bonzini
ec283cb1dc KVM: x86/mmu: remove ept_ad field
The ept_ad field is used during page walk to determine if the guest PTEs
have accessed and dirty bits.  In the MMU role, the ad_disabled
bit represents whether the *shadow* PTEs have the bits, so it
would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
!mmu->mmu_role.base.ad_disabled.

However, the similar field in the CPU mode, ad_disabled, is initialized
correctly: to the opposite value of ept_ad for shadow EPT, and zero
for non-EPT guest paging modes (which always have A/D bits).  It is
therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
like other page-format fields; it just has to be inverted to account
for the different polarity.

In fact, now that the CPU mode is distinct from the MMU roles, it would
even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
use !mmu->cpu_role.base.ad_disabled instead.  I am not doing this because
the macro has a small effect in terms of dead code elimination:

   text	   data	    bss	    dec	    hex
 103544	  16665	    112	 120321	  1d601    # as of this patch
 103746	  16665	    112	 120523	  1d6cb    # without PT_HAVE_ACCESSED_DIRTY

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:54 -04:00
Paolo Bonzini
e5ed0fb010 KVM: x86/mmu: split cpu_role from mmu_role
Snapshot the state of the processor registers that govern page walk into
a new field of struct kvm_mmu.  This is a more natural representation
than having it *mostly* in mmu_role but not exclusively; the delta
right now is represented in other fields, such as root_level.

The nested MMU now has only the CPU role; and in fact the new function
kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
except that it has role.base.direct equal to !CR0.PG.  For a walk-only
MMU, "direct" has no meaning, but we set it to !CR0.PG so that
role.ext.cr0_pg can go away in a future patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:53 -04:00
Sean Christopherson
6819af7597 KVM: x86: Clean up and document nested #PF workaround
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
with an explicit, common workaround in kvm_inject_emulated_page_fault().
Aside from being a hack, the current approach is brittle and incomplete,
e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
and nVMX fails to apply the workaround when VMX is intercepting #PF due
to allow_smaller_maxphyaddr=1.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:49 -04:00
Babu Moger
296d5a17e7 KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use
TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID
instructions. The TSC_AUX value is set using the WRMSR instruction to the
TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID
instructions. If the read/write of the TSC_AUX MSR is intercepted, then
RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization
is present. However, the RDPID instruction can't be intercepted. This means
that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR
read/write must not be intercepted for SEV-ES (or SEV-SNP) guests.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:15 -04:00
Babu Moger
f30903394e x86/cpufeatures: Add virtual TSC_AUX feature bit
The TSC_AUX Virtualization feature allows AMD SEV-ES guests to securely use
TSC_AUX (auxiliary time stamp counter data) MSR in RDTSCP and RDPID
instructions.

The TSC_AUX MSR is typically initialized to APIC ID or another unique
identifier so that software can quickly associate returned TSC value
with the logical processor.

Add the feature bit and also include it in the kvm for detection.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Message-Id: <165040157111.1399644.6123821125319995316.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:15 -04:00
Sean Christopherson
643d95aac5 Revert "x86/mm: Introduce lookup_address_in_mm()"
Drop lookup_address_in_mm() now that KVM is providing it's own variant
of lookup_address_in_pgd() that is safe for use with user addresses, e.g.
guards against page tables being torn down.  A variant that provides a
non-init mm is inherently dangerous and flawed, as the only reason to use
an mm other than init_mm is to walk a userspace mapping, and
lookup_address_in_pgd() does not play nice with userspace mappings, e.g.
doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
to ensure an upper level entry isn't converted to a huge page between
checking the PAGE_SIZE bit and grabbing the address of the next level
down.

This reverts commit 13c72c060f.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <YmwIi3bXr/1yhYV/@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:40:41 -04:00
Christoph Hellwig
e10cd4b009 x86/mm: enable ARCH_HAS_VM_GET_PAGE_PROT
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT.  This also unsubscribes from config
ARCH_HAS_FILTER_PGPROT, after dropping off arch_filter_pgprot() and
arch_vm_get_page_prot().

Link: https://lkml.kernel.org/r/20220414062125.609297-6-anshuman.khandual@arm.com
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28 23:16:13 -07:00
Suma Hegde
830fe3c30d amd_hsmp: Add HSMP protocol version 5 messages
HSMP protocol version 5 is supported on AMD family 19h model 10h
EPYC processors. This version brings new features such as
-- DIMM statistics
-- Bandwidth for IO and xGMI links
-- Monitor socket and core frequency limits
-- Configure power efficiency modes, DF pstate range etc

Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Reviewed-by: Carlos Bilbao <carlos.bilbao@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Link: https://lore.kernel.org/r/20220427152248.25643-1-nchatrad@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2022-04-27 21:45:44 +02:00
Thomas Gleixner
bb6e89df90 x86/aperfmperf: Make parts of the frequency invariance code unconditional
The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.

arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getters function do the same via dedicated
IPIs.

While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.

As a first step split out the non frequency invariance specific
initialization code and the read MSR portion of arch_scale_freq_tick(). The
rest of the code is still conditional and guarded with a static key.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de
2022-04-27 20:22:19 +02:00
Thomas Gleixner
0dfaf3f6ec x86/aperfmperf: Untangle Intel and AMD frequency invariance init
AMD boot CPU initialization happens late via ACPI/CPPC which prevents the
Intel parts from being marked __init.

Split out the common code and provide a dedicated interface for the AMD
initialization and mark the Intel specific code and data __init.

The remaining text size is almost cut in half:

  text:		2614	->	1350
  init.text:	   0	->	 786

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.592465719@linutronix.de
2022-04-27 20:22:19 +02:00
Thomas Gleixner
138a7f9c6b x86/aperfmperf: Separate AP/BP frequency invariance init
This code is convoluted and because it can be invoked post init via the
ACPI/CPPC code, all of the initialization functionality is built in instead
of being part of init text and init data.

As a first step create separate calls for the boot and the application
processors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.536733494@linutronix.de
2022-04-27 15:51:08 +02:00
Tony Luck
ef79970d7c x86/split-lock: Remove unused TIF_SLD bit
Changes to the "warn" mode of split lock handling mean that TIF_SLD is
never set.

Remove the bit, and the functions that use it.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220310204854.31752-3-tony.luck@intel.com
2022-04-27 15:43:39 +02:00
Matthieu Baerts
b0b592cf08 x86/pm: Fix false positive kmemleak report in msr_build_context()
Since

  e2a1256b17 ("x86/speculation: Restore speculation related MSRs during S3 resume")

kmemleak reports this issue:

  unreferenced object 0xffff888009cedc00 (size 256):
    comm "swapper/0", pid 1, jiffies 4294693823 (age 73.764s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 48 00 00 00 00 00 00 00  ........H.......
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      msr_build_context (include/linux/slab.h:621)
      pm_check_save_msr (arch/x86/power/cpu.c:520)
      do_one_initcall (init/main.c:1298)
      kernel_init_freeable (init/main.c:1370)
      kernel_init (init/main.c:1504)
      ret_from_fork (arch/x86/entry/entry_64.S:304)

Reproducer:

  - boot the VM with a debug kernel config (see
    https://github.com/multipath-tcp/mptcp_net-next/issues/268)
  - wait ~1 minute
  - start a kmemleak scan

The root cause here is alignment within the packed struct saved_context
(from suspend_64.h). Kmemleak only searches for pointers that are
aligned (see how pointers are scanned in kmemleak.c), but pahole shows
that the saved_msrs struct member and all members after it in the
structure are unaligned:

  struct saved_context {
    struct pt_regs             regs;                 /*     0   168 */
    /* --- cacheline 2 boundary (128 bytes) was 40 bytes ago --- */
    u16                        ds;                   /*   168     2 */

    ...

    u64                        misc_enable;          /*   232     8 */
    bool                       misc_enable_saved;    /*   240     1 */

   /* Note below odd offset values for the remainder of this struct */

    struct saved_msrs          saved_msrs;           /*   241    16 */
    /* --- cacheline 4 boundary (256 bytes) was 1 bytes ago --- */
    long unsigned int          efer;                 /*   257     8 */
    u16                        gdt_pad;              /*   265     2 */
    struct desc_ptr            gdt_desc;             /*   267    10 */
    u16                        idt_pad;              /*   277     2 */
    struct desc_ptr            idt;                  /*   279    10 */
    u16                        ldt;                  /*   289     2 */
    u16                        tss;                  /*   291     2 */
    long unsigned int          tr;                   /*   293     8 */
    long unsigned int          safety;               /*   301     8 */
    long unsigned int          return_address;       /*   309     8 */

    /* size: 317, cachelines: 5, members: 25 */
    /* last cacheline: 61 bytes */
  } __attribute__((__packed__));

Move misc_enable_saved to the end of the struct declaration so that
saved_msrs fits in before the cacheline 4 boundary.

The comment above the saved_context declaration says to fix wakeup_64.S
file and __save/__restore_processor_state() if the struct is modified:
it looks like all the accesses in wakeup_64.S are done through offsets
which are computed at build-time. Update that comment accordingly.

At the end, the false positive kmemleak report is due to a limitation
from kmemleak but it is always good to avoid unaligned members for
optimisation purposes.

Please note that it looks like this issue is not new, e.g.

  https://lore.kernel.org/all/9f1bb619-c4ee-21c4-a251-870bd4db04fa@lwfinger.net/
  https://lore.kernel.org/all/94e48fcd-1dbd-ebd2-4c91-f39941735909@molgen.mpg.de/

  [ bp: Massage + cleanup commit message. ]

Fixes: 7a9c2dd08e ("x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume")
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220426202138.498310-1-matthieu.baerts@tessares.net
2022-04-27 13:55:19 +02:00
Brijesh Singh
c2106a231c x86/sev: Get the AP jump table address from secrets page
The GHCB specification section 2.7 states that when SEV-SNP is enabled,
a guest should not rely on the hypervisor to provide the address of the
AP jump table. Instead, if a guest BIOS wants to provide an AP jump
table, it should record the address in the SNP secrets page so the guest
operating system can obtain it directly from there.

Fix this on the guest kernel side by having SNP guests use the AP jump
table address published in the secrets page rather than issuing a GHCB
request to get it.

  [ mroth:
    - Improve error handling when ioremap()/memremap() return NULL
    - Don't mix function calls with declarations
    - Add missing __init
    - Tweak commit message ]

Fixes: 0afb6b660a ("x86/sev: Use SEV-SNP AP creation to start secondary CPUs")
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220422135624.114172-3-michael.roth@amd.com
2022-04-27 13:31:38 +02:00
Guo Ren
84a0c977ab
asm-generic: compat: Cleanup duplicate definitions
There are 7 64bit architectures that support Linux COMPAT mode to
run 32bit applications. A lot of definitions are duplicate:
 - COMPAT_USER_HZ
 - COMPAT_RLIM_INFINITY
 - COMPAT_OFF_T_MAX
 - __compat_uid_t, __compat_uid_t
 - compat_dev_t
 - compat_ipc_pid_t
 - struct compat_flock
 - struct compat_flock64
 - struct compat_statfs
 - struct compat_ipc64_perm, compat_semid64_ds,
	  compat_msqid64_ds, compat_shmid64_ds

Cleanup duplicate definitions and merge them into asm-generic.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Acked-by: Helge Deller <deller@gmx.de>  # parisc
Link: https://lore.kernel.org/r/20220405071314.3225832-7-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-26 13:35:54 -07:00
Guo Ren
f18ed30db2
fs: stat: compat: Add __ARCH_WANT_COMPAT_STAT
RISC-V doesn't neeed compat_stat, so using __ARCH_WANT_COMPAT_STAT
to exclude unnecessary SYSCALL functions.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Acked-by: Helge Deller <deller@gmx.de>  # parisc
Link: https://lore.kernel.org/r/20220405071314.3225832-6-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-26 13:35:45 -07:00
Christoph Hellwig
3ce0f2373f
compat: consolidate the compat_flock{,64} definition
Provide a single common definition for the compat_flock and
compat_flock64 structures using the same tricks as for the native
variants.  Another extra define is added for the packing required on
x86.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Acked-by: Helge Deller <deller@gmx.de>  # parisc
Link: https://lore.kernel.org/r/20220405071314.3225832-4-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-26 13:35:28 -07:00
Christoph Hellwig
306f7cc1e9
uapi: always define F_GETLK64/F_SETLK64/F_SETLKW64 in fcntl.h
The F_GETLK64/F_SETLK64/F_SETLKW64 fcntl opcodes are only implemented
for the 32-bit syscall APIs, but are also needed for compat handling
on 64-bit kernels.

Consolidate them in unistd.h instead of definining the internal compat
definitions in compat.h, which is rather error prone (e.g. parisc
gets the values wrong currently).

Note that before this change they were never visible to userspace due
to the fact that CONFIG_64BIT is only set for kernel builds.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Link: https://lore.kernel.org/r/20220405071314.3225832-3-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-26 13:35:20 -07:00
Thomas Gleixner
8ad7e8f696 x86/fpu/xsave: Support XSAVEC in the kernel
XSAVEC is the user space counterpart of XSAVES which cannot save supervisor
state. In virtualization scenarios the hypervisor does not expose XSAVES
but XSAVEC to the guest, though the kernel does not make use of it.

That's unfortunate because XSAVEC uses the compacted format of saving the
XSTATE. This is more efficient in terms of storage space vs. XSAVE[OPT] as
it does not create holes for XSTATE components which are not supported or
enabled by the kernel but are available in hardware. There is room for
further optimizations when XSAVEC/S and XGETBV1 are supported.

In order to support XSAVEC:

 - Define the XSAVEC ASM macro as it's not yet supported by the required
   minimal toolchain.

 - Create a software defined X86_FEATURE_XCOMPACTED to select the compacted
   XSTATE buffer format for both XSAVEC and XSAVES.

 - Make XSAVEC an option in the 'XSAVE' ASM alternatives

Requested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220404104820.598704095@linutronix.de
2022-04-25 15:05:37 +02:00
Josh Poimboeuf
4ab7674f59 objtool: Make jump label hack optional
Objtool secretly does a jump label hack to overcome the limitations of
the toolchain.  Make the hack explicit (and optional for other arches)
by turning it into a cmdline option and kernel config option.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/3bdcbfdd27ecb01ddec13c04bdf756a583b13d24.1650300597.git.jpoimboe@redhat.com
2022-04-22 12:32:04 +02:00
Josh Poimboeuf
03f16cd020 objtool: Add CONFIG_OBJTOOL
Now that stack validation is an optional feature of objtool, add
CONFIG_OBJTOOL and replace most usages of CONFIG_STACK_VALIDATION with
it.

CONFIG_STACK_VALIDATION can now be considered to be frame-pointer
specific.  CONFIG_UNWINDER_ORC is already inherently valid for live
patching, so no need to "validate" it.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/939bf3d85604b2a126412bf11af6e3bd3b872bcb.1650300597.git.jpoimboe@redhat.com
2022-04-22 12:32:03 +02:00
Mingwei Zhang
683412ccf6 KVM: SEV: add cache flush to solve SEV cache incoherency issues
Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots).  Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.

Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by VM and reallocated
to another user, the cachelines may corrupt the new user at a later time.

KVM takes a shortcut by assuming all confidential memory remain pinned
until the end of VM lifetime. Therefore, KVM does not flush cache at
mmu_notifier invalidation events. Because of this incorrect assumption and
the lack of cache flushing, malicous userspace can crash the host kernel:
creating a malicious VM and continuously allocates/releases unpinned
confidential memory pages when the VM is running.

Add cache flush operations to mmu_notifier operations to ensure that any
physical memory leaving the guest VM get flushed. In particular, hook
mmu_notifier_invalidate_range_start and mmu_notifier_release events and
flush cache accordingly. The hook after releasing the mmu lock to avoid
contention with other vCPUs.

Cc: stable@vger.kernel.org
Suggested-by: Sean Christpherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21 15:41:00 -04:00
Tom Lendacky
2bf93ffbb9 virt: sevguest: Change driver name to reflect generic SEV support
During patch review, it was decided the SNP guest driver name should not
be SEV-SNP specific, but should be generic for use with anything SEV.
However, this feedback was missed and the driver name, and many of the
driver functions and structures, are SEV-SNP name specific. Rename the
driver to "sev-guest" (to match the misc device that is created) and
update some of the function and structure names, too.

While in the file, adjust the one pr_err() message to be a dev_err()
message so that the message, if issued, uses the driver name.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/307710bb5515c9088a19fd0b930268c7300479b2.1650464054.git.thomas.lendacky@amd.com
2022-04-21 11:48:24 +02:00
Josh Poimboeuf
613871cd66 x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline
The static call trampoline is never indirect-branched to, but is
referenced by the static call key.  Add ANNOTATE_NOENDBR.

Fixes: ed53a0d971 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1b5b54aad7d81241dabe5e0c9b40dea64b540b00.1650300597.git.jpoimboe@redhat.com
2022-04-19 21:58:48 +02:00
Borislav Petkov
f9e14dbbd4 x86/cpu: Load microcode during restore_processor_state()
When resuming from system sleep state, restore_processor_state()
restores the boot CPU MSRs. These MSRs could be emulated by microcode.
If microcode is not loaded yet, writing to emulated MSRs leads to
unchecked MSR access error:

  ...
  PM: Calling lapic_suspend+0x0/0x210
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr)
  Call Trace:
    <TASK>
    ? restore_processor_state
    x86_acpi_suspend_lowlevel
    acpi_suspend_enter
    suspend_devices_and_enter
    pm_suspend.cold
    state_store
    kobj_attr_store
    sysfs_kf_write
    kernfs_fop_write_iter
    new_sync_write
    vfs_write
    ksys_write
    __x64_sys_write
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
   RIP: 0033:0x7fda13c260a7

To ensure microcode emulated MSRs are available for restoration, load
the microcode on the boot CPU before restoring these MSRs.

  [ Pawan: write commit message and productize it. ]

Fixes: e2a1256b17 ("x86/speculation: Restore speculation related MSRs during S3 resume")
Reported-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841
Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.com
2022-04-19 19:37:05 +02:00
Rafael J. Wysocki
9765fa2566 Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat changes for 5.19 from Len Brown:

"Chen Yu (1):
      tools/power turbostat: Support thermal throttle count print

Dan Merillat (1):
      tools/power turbostat: fix dump for AMD cpus

Len Brown (5):
      tools/power turbostat: tweak --show and --hide capability
      tools/power turbostat: fix ICX DRAM power numbers
      tools/power turbostat: be more useful as non-root
      tools/power turbostat: No build warnings with -Wextra
      tools/power turbostat: version 2022.04.16

Sumeet Pawnikar (2):
      tools/power turbostat: Add Power Limit4 support
      tools/power turbostat: print power values upto three decimal

Zephaniah E. Loss-Cutler-Hull (2):
      tools/power turbostat: Allow -e for all names.
      tools/power turbostat: Allow printing header every N iterations"

* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: version 2022.04.16
  tools/power turbostat: No build warnings with -Wextra
  tools/power turbostat: be more useful as non-root
  tools/power turbostat: fix ICX DRAM power numbers
  tools/power turbostat: Support thermal throttle count print
  tools/power turbostat: Allow printing header every N iterations
  tools/power turbostat: Allow -e for all names.
  tools/power turbostat: print power values upto three decimal
  tools/power turbostat: Add Power Limit4 support
  tools/power turbostat: fix dump for AMD cpus
  tools/power turbostat: tweak --show and --hide capability
2022-04-19 17:43:25 +02:00
Tony Luck
3ccce93403 x86/cpu: Add new Alderlake and Raptorlake CPU model numbers
Intel is subdividing the mobile segment with additional models
with the same codename. Using the Intel "N" and "P" suffices
for these will be less confusing than trying to map to some
different naming convention.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/YlS7n7Xtso9BXZA2@agluck-desk3.sc.intel.com
2022-04-19 12:04:51 +02:00
Christoph Hellwig
3cb4503a33 x86: remove cruft from <asm/dma-mapping.h>
<asm/dma-mapping.h> gets pulled in by all drivers using the DMA API.
Remove x86 internal variables and unnecessary includes from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2022-04-18 07:21:14 +02:00
Christoph Hellwig
3f70356edf swiotlb: merge swiotlb-xen initialization into swiotlb
Reuse the generic swiotlb initialization for xen-swiotlb.  For ARM/ARM64
this works trivially, while for x86 xen_swiotlb_fixup needs to be passed
as the remap argument to swiotlb_init_remap/swiotlb_init_late.

Note that the lower bound of the swiotlb size is changed to the smaller
IO_TLB_MIN_SLABS based value with this patch, but that is fine as the
2MB value used in Xen before was just an optimization and is not the
hard lower bound.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2022-04-18 07:21:13 +02:00
Christoph Hellwig
78013eaadf x86: remove the IOMMU table infrastructure
The IOMMU table tries to separate the different IOMMUs into different
backends, but actually requires various cross calls.

Rewrite the code to do the generic swiotlb/swiotlb-xen setup directly
in pci-dma.c and then just call into the IOMMU drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2022-04-18 07:21:10 +02:00
Linus Torvalds
3a69a44278 Two x86 fixes related to TSX:
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX to
     cover all CPUs which allow to disable it.
 
   - Disable TSX development mode at boot so that a microcode update which
     provides TSX development mode does not suddenly make the system
     vulnerable to TSX Asynchronous Abort.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb5LYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVVbD/9cxZWkFctCiymedUZqLabkfpYSki65
 MngdpCPzCNaaIdlp44lwCido5+gJsY9unXdm3OAUzLjv6SsxxpDr5njz1/C6TM1l
 XmWjlkLEbG2QDPd1Ybd/lpYQORBmiukyo8v8x0yFT7ZzwvSddoDZAbeUtkQBrIin
 sDTeExsewKzL2X5qXhttrHLHu1PYgurn4ThIrrG+eg2e4FNk6UUFUS3TOyMvzJDg
 NWJ7N5pGy9YkR7CISq1q+qdnH55pGaUrgonDi2qBTt3EaH0fQtZP2ZtIOYr3O4nI
 YCx6isrIiGUB6kSygofxmk4B+22CaUJXd2OcUxMZ/Th/a2aCK+35BtGVPXQGi6nU
 d7m+ZWB7dShOiejFygS59ty+5L5kliKXYZfUASsq1CLoXH8K1xUwBMkbY5FQ2WH1
 Ue4KUvjguNqsgSRAfeHdOi6B36oot0Xf9JO013Wm3V/r9hsGPtSOjWwFuVvT/euw
 a9iFtruATxDssBxH/l0djCKnwwm5yuOt1OpyizcIMFnlCgRD06h/6zgAvsJK7c8d
 dh6lC4D2mXP1e2wtEyZelve1tmRJ/FeReyG2V5FNU7m1mWYGm1rJZ4AEvnbrzcbC
 ePwFva0lPu8GVKG6HRgHfR8PjuQ7TFmKPKytT7fboIqQpTIY+1Q75wYD4eXkSu8Q
 /ltzXQz/8lz7bA==
 =UQaW
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "Two x86 fixes related to TSX:

   - Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX
     to cover all CPUs which allow to disable it.

   - Disable TSX development mode at boot so that a microcode update
     which provides TSX development mode does not suddenly make the
     system vulnerable to TSX Asynchronous Abort"

* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tsx: Disable TSX development mode at boot
  x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
2022-04-17 09:55:59 -07:00
Sumeet Pawnikar
f52ba93190 tools/power turbostat: Add Power Limit4 support
Add Power Limit4 support.

Signed-off-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2022-04-16 21:58:14 -04:00
Omar Sandoval
c12cd77cb0 mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
Commit 3ee48b6af4 ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
purge the vmap areas instead of doing it lazily.

Commit 690467c81b ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever).  For example, consider the following scenario:

 1. Thread reads from /proc/vmcore. This eventually calls
    __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
    vmap_lazy_nr to lazy_max_pages() + 1.

 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
    pages (one page plus the guard page) to the purge list and
    vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
    drain_vmap_work is scheduled.

 3. Thread returns from the kernel and is scheduled out.

 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
    frees the 2 pages on the purge list. vmap_lazy_nr is now
    lazy_max_pages() + 1.

 5. This is still over the threshold, so it tries to purge areas again,
    but doesn't find anything.

 6. Repeat 5.

If the system is running with only one CPU (which is typicial for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs.  If there is more
than one CPU or preemption is enabled, then the worker thread will spin
forever in the background.  (Note that if there were already pages to be
purged at the time that set_iounmap_nonlazy() was called, this bug is
avoided.)

This can be reproduced with anything that reads from /proc/vmcore
multiple times.  E.g., vmcore-dmesg /proc/vmcore.

It turns out that improvements to vmap() over the years have obsoleted
the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:

    |5.17  |5.18+fix|5.18+removal
  4k|40.86s|  40.09s|      26.73s
  1M|24.47s|  23.98s|      21.84s

The removal was the fastest (by a wide margin with 4k reads).  This
patch removes set_iounmap_nonlazy().

Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b  ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:56 -07:00
Brian Gerst
203d8919a9 x86/asm: Merge load_gs_index()
Merge the 32- and 64-bit implementations of load_gs_index().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220325153953.162643-5-brgerst@gmail.com
2022-04-14 14:15:54 +02:00
Brian Gerst
3a24a60854 x86/32: Remove lazy GS macros
GS is always a user segment now.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220325153953.162643-4-brgerst@gmail.com
2022-04-14 14:09:43 +02:00
Matthew Wilcox (Oracle)
4e140f59d2 mm/usercopy: Check kmap addresses properly
If you are copying to an address in the kmap region, you may not copy
across a page boundary, no matter what the size of the underlying
allocation.  You can't kmap() a slab page because slab pages always
come from low memory.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220110231530.665970-2-willy@infradead.org
2022-04-13 12:15:50 -07:00
Peter Zijlstra
989b5db215 x86/uaccess: Implement macros for CMPXCHG on user addresses
Add support for CMPXCHG loops on userspace addresses.  Provide both an
"unsafe" version for tight loops that do their own uaccess begin/end, as
well as a "safe" version for use cases where the CMPXCHG is not buried in
a loop, e.g. KVM will resume the guest instead of looping when emulation
of a guest atomic accesses fails the CMPXCHG.

Provide 8-byte versions for 32-bit kernels so that KVM can do CMPXCHG on
guest PAE PTEs, which are accessed via userspace addresses.

Guard the asm_volatile_goto() variation with CC_HAS_ASM_GOTO_TIED_OUTPUT,
the "+m" constraint fails on some compilers that otherwise support
CC_HAS_ASM_GOTO_OUTPUT.

Cc: stable@vger.kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220202004945.2540433-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:47 -04:00
Like Xu
1921f3aa92 KVM: x86: Use static calls to reduce kvm_pmu_ops overhead
Use static calls to improve kvm_pmu_ops performance, following the same
pattern and naming scheme used by kvm-x86-ops.h.

Here are the worst fenced_rdtsc() cycles numbers for the kvm_pmu_ops
functions that is most often called (up to 7 digits of calls) when running
a single perf test case in a guest on an ICX 2.70GHz host (mitigations=on):

		|	legacy	|	static call
------------------------------------------------------------
.pmc_idx_to_pmc	|	1304840	|	994872 (+23%)
.pmc_is_enabled	|	978670	|	1011750 (-3%)
.msr_idx_to_pmc	|	47828	|	41690 (+12%)
.is_valid_msr	|	28786	|	30108 (-4%)

Signed-off-by: Like Xu <likexu@tencent.com>
[sean: Handle static call updates in pmu.c, tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329235054.3534728-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:45 -04:00
Like Xu
34886e796c KVM: x86: Move .pmu_ops to kvm_x86_init_ops and tag as __initdata
The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata.
That'll save those precious few bytes, and more importantly make
the original ops unreachable, i.e. make it harder to sneak in post-init
modification bugs.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329235054.3534728-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:45 -04:00
Like Xu
fdc298da86 KVM: x86: Move kvm_ops_static_call_update() to x86.c
The kvm_ops_static_call_update() is defined in kvm_host.h. That's
completely unnecessary, it should have exactly one caller,
kvm_arch_hardware_setup().  Move the helper to x86.c and have it do the
actual memcpy() of the ops in addition to the static call updates.  This
will also allow for cleanly giving kvm_pmu_ops static_call treatment.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: Move memcpy() into the helper and rename accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329235054.3534728-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:44 -04:00
Sean Christopherson
ca2a7c22a1 KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits
Derive the mask of RWX bits reported on EPT violations from the mask of
RWX bits that are shoved into EPT entries; the layout is the same, the
EPT violation bits are simply shifted by three.  Use the new shift and a
slight copy-paste of the mask derivation instead of completely open
coding the same to convert between the EPT entry bits and the exit
qualification when synthesizing a nested EPT Violation.

No functional change intended.

Cc: SU Hang <darcy.sh@antgroup.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329030108.97341-3-darcy.sh@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:37 -04:00
SU Hang
aecce510fe KVM: VMX: replace 0x180 with EPT_VIOLATION_* definition
Using self-expressing macro definition EPT_VIOLATION_GVA_VALIDATION
and EPT_VIOLATION_GVA_TRANSLATED instead of 0x180
in FNAME(walk_addr_generic)().

Signed-off-by: SU Hang <darcy.sh@antgroup.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220329030108.97341-2-darcy.sh@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:19 -04:00
Peng Hao
8176472563 kvm: x86: Adjust the location of pkru_mask of kvm_mmu to reduce memory
Adjust the field pkru_mask to the back of direct_map to make up 8-byte
alignment.This reduces the size of kvm_mmu by 8 bytes.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <20220228030749.88353-1-flyingpeng@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:18 -04:00
Paolo Bonzini
a4cfff3f0f Merge branch 'kvm-older-features' into HEAD
Merge branch for features that did not make it into 5.18:

* New ioctls to get/set TSC frequency for a whole VM

* Allow userspace to opt out of hypercall patching

Nested virtualization improvements for AMD:

* Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
  nested vGIF)

* Allow AVIC to co-exist with a nested guest running

* Fixes for LBR virtualizations when a nested guest is running,
  and nested LBR virtualization support

* PAUSE filtering for nested hypervisors

Guest support:

* Decoupling of vcpu_is_preempted from PV spinlocks

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:17 -04:00
Thomas Gleixner
daf3af4705 x86/apic: Clarify i82489DX bit overlap in APIC_LVT0
Daniel stumbled over the bit overlap of the i82498DX external APIC and the
TSC deadline timer configuration bit in modern APICs, which is neither
documented in the code nor in the current SDM. Maciej provided links to
the original i82489DX/486 documentation. See Link.

Remove the i82489DX macro maze, use a i82489DX specific define in the apic
code and document the overlap in a comment.

Reported-by: Daniel Vacek <neelx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Link: https://lore.kernel.org/r/87ee22f3ci.ffs@tglx
2022-04-13 18:39:48 +02:00
Linus Torvalds
453096eb04 x86:
* Miscellaneous bugfixes
 
 * A small cleanup for the new workqueue code
 
 * Documentation syntax fix
 
 RISC-V:
 
 * Remove hgatp zeroing in kvm_arch_vcpu_put()
 
 * Fix alignment of the guest_hang() in KVM selftest
 
 * Fix PTE A and D bits in KVM selftest
 
 * Missing #include in vcpu_fp.c
 
 ARM:
 
 * Some PSCI fixes after introducing PSCIv1.1 and SYSTEM_RESET2
 
 * Fix the MMU write-lock not being taken on THP split
 
 * Fix mixed-width VM handling
 
 * Fix potential UAF when debugfs registration fails
 
 * Various selftest updates for all of the above
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJVtdMUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroO33QgAiPh80xUkYfnl8FVN440S5F7UOPQ2
 Cs/PbroNoP+Oz2GoG07aaqnUkFFApeBE5S+VMu1zhRNAernqpreN64/Y2iNaz0Y6
 +MbvEX0FhQRW0UZJIF2m49ilgO8Gkt6aEpVRulq5G9w4NWiH1PtR25FVXfDMi8OG
 xdw4x1jwXNI9lOQJ5EpUKVde3rAbxCfoC6hCTh5pCNd9oLuVeLfnC+Uv91fzXltl
 EIeBlV0/mAi3RLp2E/AX38WP6ucMZqOOAy91/RTqX6oIx/7QL28ZNHXVrwQ67Hkd
 pAr3MAk84tZL58lnosw53i5aXAf9CBp0KBnpk2KGutfRNJ4Vzs1e+DZAJA==
 =vqAv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86:

   - Miscellaneous bugfixes

   - A small cleanup for the new workqueue code

   - Documentation syntax fix

  RISC-V:

   - Remove hgatp zeroing in kvm_arch_vcpu_put()

   - Fix alignment of the guest_hang() in KVM selftest

   - Fix PTE A and D bits in KVM selftest

   - Missing #include in vcpu_fp.c

  ARM:

   - Some PSCI fixes after introducing PSCIv1.1 and SYSTEM_RESET2

   - Fix the MMU write-lock not being taken on THP split

   - Fix mixed-width VM handling

   - Fix potential UAF when debugfs registration fails

   - Various selftest updates for all of the above"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (24 commits)
  KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU
  KVM: SVM: Do not activate AVIC for SEV-enabled guest
  Documentation: KVM: Add SPDX-License-Identifier tag
  selftests: kvm: add tsc_scaling_sync to .gitignore
  RISC-V: KVM: include missing hwcap.h into vcpu_fp
  KVM: selftests: riscv: Fix alignment of the guest_hang() function
  KVM: selftests: riscv: Set PTE A and D bits in VS-stage page table
  RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put()
  selftests: KVM: Free the GIC FD when cleaning up in arch_timer
  selftests: KVM: Don't leak GIC FD across dirty log test iterations
  KVM: Don't create VM debugfs files outside of the VM directory
  KVM: selftests: get-reg-list: Add KVM_REG_ARM_FW_REG(3)
  KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
  KVM: arm64: selftests: Introduce vcpu_width_config
  KVM: arm64: mixed-width check should be skipped for uninitialized vCPUs
  KVM: arm64: vgic: Remove unnecessary type castings
  KVM: arm64: Don't split hugepages outside of MMU write lock
  KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler
  KVM: arm64: Actually prevent SMC64 SYSTEM_RESET2 from AArch32
  KVM: arm64: Generally disallow SMC64 for AArch32 guests
  ...
2022-04-12 14:16:33 -10:00
Mikulas Patocka
932aba1e16 stat: fix inconsistency between struct stat and struct compat_stat
struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit
st_dev and st_rdev; struct compat_stat (defined in
arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by
a 16-bit padding.

This patch fixes struct compat_stat to match struct stat.

[ Historical note: the old x86 'struct stat' did have that 16-bit field
  that the compat layer had kept around, but it was changes back in 2003
  by "struct stat - support larger dev_t":

    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d

  and back in those days, the x86_64 port was still new, and separate
  from the i386 code, and had already picked up the old version with a
  16-bit st_dev field ]

Note that we can't change compat_dev_t because it is used by
compat_loop_info.

Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
old_valid_dev to test if the value fits into them.  This fixes
-EOVERFLOW on filesystems that are on NVMe because NVMe uses the major
number 259.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-12 13:35:08 -10:00
Brian Gerst
f5d9283ecb x86/32: Simplify ELF_CORE_COPY_REGS
GS is now always a user segment, so there is no difference between user
and kernel registers.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220325153953.162643-2-brgerst@gmail.com
2022-04-12 15:42:59 +02:00
Vitaly Kuznetsov
42dcbe7d8b KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU
The following WARN is triggered from kvm_vm_ioctl_set_clock():
 WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
 ...
 CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G        W  O      5.16.0.stable #20
 Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
 RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
 ...
 Call Trace:
  <TASK>
  ? kvm_write_guest+0x114/0x120 [kvm]
  kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
  kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
  ? schedule+0x4e/0xc0
  ? __cond_resched+0x1a/0x50
  ? futex_wait+0x166/0x250
  ? __send_signal+0x1f1/0x3d0
  kvm_vm_ioctl+0x747/0xda0 [kvm]
  ...

The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if
mark_page_dirty() is called without an active vCPU") but the change seems
to be correct (unlike Hyper-V TSC page update mechanism). In fact, there's
no real need to actually write to guest memory to invalidate TSC page, this
can be done by the first vCPU which goes through kvm_guest_time_update().

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>
2022-04-11 13:29:51 -04:00
Suravee Suthikulpanit
c538dc792f KVM: SVM: Do not activate AVIC for SEV-enabled guest
Since current AVIC implementation cannot support encrypted memory,
inhibit AVIC for SEV-enabled guest.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220408133710.54275-1-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-11 13:28:56 -04:00
Pawan Gupta
400331f8ff x86/tsx: Disable TSX development mode at boot
A microcode update on some Intel processors causes all TSX transactions
to always abort by default[*]. Microcode also added functionality to
re-enable TSX for development purposes. With this microcode loaded, if
tsx=on was passed on the cmdline, and TSX development mode was already
enabled before the kernel boot, it may make the system vulnerable to TSX
Asynchronous Abort (TAA).

To be on safer side, unconditionally disable TSX development mode during
boot. If a viable use case appears, this can be revisited later.

  [*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557

  [ bp: Drop unstable web link, massage heavily. ]

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com
2022-04-11 09:58:40 +02:00
Linus Torvalds
9c6913b749 - Fix the MSI message data struct definition
- Use local labels in the exception table macros to avoid symbol
 conflicts with clang LTO builds
 
 - A couple of fixes to objtool checking of the relatively newly added
 SLS and IBT code
 
 - Rename a local var in the WARN* macro machinery to prevent shadowing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJSwSkACgkQEsHwGGHe
 VUp6QQ//TGhL2xxLoN+7pYjIBDEDHJ3Oi0m6fOweqyQAZTYcm/rAPqd7hvoWVSoO
 YsLdWi9jeMwkzG0ItSm/qPVm/UvrViXwuQMdz4nDWqg2IPFIbhgNA3CKCIyPTio2
 WHp2NXvYyDnwPMr6xTTRndMDoxiwxMBnXf91pNwoU3toxw0GuUuXan0Y+GKnvx1A
 sqhbpWO27bAmhKb26wPw5soJVxBbSqx+1TbFVG0Sz/uwYQowMa+nfNg1DXF0sXyJ
 E/ssqBB6wjl7ANVbQsxBQHRzr/EksLVPwHHrlT8ga/5loin+VJ6mTBCPLgG7SMBE
 +R1fm79Bp/9KU194fcqhJ3pvnyJPi8hfizzCqNKnK871V8LRzC+jW0l3EdvASEXC
 sDj0XWsSFoWft9eAtMV11d641uVC4rLB90GyyzmWWrEw9BbxmasBgED6QBx9d+V6
 o1L4y58Tsz88HKzwd0PtBkeGDkvkA7xOx8ViG24IeLA0tcbixnfnATQdelQeWKqO
 4m3o1JU8ogJp9JCEBY7ZeXyStFjZMedM4U/V0akF6AKnpDuVfR3T5C68cYhoLKBu
 XU6Swf5sFHImNWp0+54HPnXhHj/uhuwj9YWCkxx/eXViwvVlxSdTdIQWa380EddN
 0KhOFLwLOdhha2+81FJc6vmkDHwiu6hlR38yqdGvdxZf/KPKjM0=
 =kMtP
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix the MSI message data struct definition

 - Use local labels in the exception table macros to avoid symbol
   conflicts with clang LTO builds

 - A couple of fixes to objtool checking of the relatively newly added
   SLS and IBT code

 - Rename a local var in the WARN* macro machinery to prevent shadowing

* tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msi: Fix msi message data shadow struct
  x86/extable: Prefer local labels in .set directives
  x86,bpf: Avoid IBT objtool warning
  objtool: Fix SLS validation for kcov tail-call replacement
  objtool: Fix IBT tail-call detection
  x86/bug: Prevent shadowing in __WARN_FLAGS
  x86/mm/tlb: Revert retpoline avoidance approach
2022-04-10 07:12:27 -10:00
Linus Torvalds
b51f86e990 - A couple of fixes to cgroup-related handling of perf events
- A couple of fixes to event encoding on Sapphire Rapids
 
 - Pass event caps of inherited events so that perf doesn't fail wrongly at fork()
 
 - Add support for a new Raptor Lake CPU
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJSvt0ACgkQEsHwGGHe
 VUpKBRAAtegwh4ilwoRM0LePH2TX752pREy+M1qfEUp/XyH3tF8VAixCAmIg7qlI
 IyjRX0AKDC1F08sM7/JmTf0M+hnl/oH2YPG8Q6p3igtfARvn+5bPZdSBpTAC9P5L
 QX3S2WVzv5X78IomIfENqbg5HyZP3IXeg7R7sqZhHbtoG54n5NEv/+aJl5HmHFTt
 gLTrXetL46OSMnLzKfd3hlJqCWSnTz1aGKgGX2cZy9ipI63+XrYMuNmiwJ+CrA3G
 pI98RmKnCPqV2rXij1GpVQNyG2aPR+VVZM3aaq6XBAmiNTaCfnvWbEBGhCkjaSgA
 UU7Y6D1Qxc0OZ1plcjhKc4l/W1oj8jqmG9nS6J2Xy4szdpZIdxBhlWq89xCrb9AC
 yIgKif2iVl7eMVKVG1Jq1u2wTwurBAamH73sCCNn8ndctBjicoM8pbtHMHxzceyZ
 w4Cff0yUNzHgPiqSHQRARw/CaUceL9kDoGzPeEQOR0A+27MpNulchts4HCtIvwzI
 yLIK1JFPHDrCACLTMuAhvov3EMTeoTIfc91eOZRjubRTPx7TxujaZHdP7N+R3nkk
 Giehc/l6IhFPhT8QACk0bziTVJ9in+Jx8pCnocGKuj80Uqs7Sq7swjlasy1Zoy7r
 x9Qzy1gZhPHnvPd6LWU4WyPa767D07DlG/zFdg+P3EeWa/3efdw=
 =ba3V
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - A couple of fixes to cgroup-related handling of perf events

 - A couple of fixes to event encoding on Sapphire Rapids

 - Pass event caps of inherited events so that perf doesn't fail wrongly
   at fork()

 - Add support for a new Raptor Lake CPU

* tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Always set cpuctx cgrp when enable cgroup event
  perf/core: Fix perf_cgroup_switch()
  perf/core: Use perf_cgroup_info->active to check if cgroup is active
  perf/core: Don't pass task around when ctx sched in
  perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
  perf/x86/intel: Don't extend the pseudo-encoding to GP counters
  perf/core: Inherit event_caps
  perf/x86/uncore: Add Raptor Lake uncore support
  perf/x86/msr: Add Raptor Lake CPU support
  perf/x86/cstate: Add Raptor Lake support
  perf/x86: Add Intel Raptor Lake support
2022-04-10 07:08:22 -10:00
Maciej W. Rozycki
b584db0c84 x86/PCI: Add $IRT PIRQ routing table support
Handle the $IRT PCI IRQ Routing Table format used by AMI for its BCP 
(BIOS Configuration Program) external tool meant for tweaking BIOS 
structures without the need to rebuild it from sources[1].

The $IRT format has been invented by AMI before Microsoft has come up 
with its $PIR format and a $IRT table is therefore there in some systems 
that lack a $PIR table, such as the DataExpert EXP8449 mainboard based 
on the ALi FinALi 486 chipset (M1489/M1487), which predates DMI 2.0 and 
cannot therefore be easily identified at run time.

Unlike with the $PIR format there is no alignment guarantee as to the 
placement of the $IRT table, so scan the whole BIOS area bytewise.

Credit to Michal Necasek for helping me chase documentation for the 
format.

References:

[1] "What is BCP? - AMI", <https://www.ami.com/what-is-bcp/>

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> # crosvm
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2203302228410.9038@angie.orcam.me.uk
2022-04-10 12:48:14 +02:00
Bjorn Helgaas
4c5e242d3e x86/PCI: Clip only host bridge windows for E820 regions
ACPI firmware advertises PCI host bridge resources via PNP0A03 _CRS
methods.  Some BIOSes include non-window address space in _CRS, and if we
allocate that non-window space for PCI devices, they don't work.

4dc2287c18 ("x86: avoid E820 regions when allocating address space")
works around this issue by clipping out any regions mentioned in the E820
table in the allocate_resource() path, but the implementation has a couple
issues:

  - The clipping is done for *all* allocations, not just those for PCI
    address space, and

  - The clipping is done at each allocation instead of being done once when
    setting up the host bridge windows.

Rework the implementation so we only clip PCI host bridge windows, and we
do it once when setting them up.

Example output changes:

    BIOS-e820: [mem 0x00000000b0000000-0x00000000c00fffff] reserved
  + acpi PNP0A08:00: clipped [mem 0xc0000000-0xfebfffff window] to [mem 0xc0100000-0xfebfffff window] for e820 entry [mem 0xb0000000-0xc00fffff]
  - pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
  + pci_bus 0000:00: root bus resource [mem 0xc0100000-0xfebfffff window]

Link: https://lore.kernel.org/r/20220304035110.988712-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-04-08 11:35:01 -05:00
Kirill A. Shutemov
e2efb6359e ACPICA: Avoid cache flush inside virtual machines
While running inside virtual machine, the kernel can bypass cache
flushing. Changing sleep state in a virtual machine doesn't affect the
host system sleep state and cannot lead to data loss.

Before entering sleep states, the ACPI code flushes caches to prevent
data loss using the WBINVD instruction.  This mechanism is required on
bare metal.

But, any use WBINVD inside of a guest is worthless.  Changing sleep
state in a virtual machine doesn't affect the host system sleep state
and cannot lead to data loss, so most hypervisors simply ignore it.
Despite this, the ACPI code calls WBINVD unconditionally anyway.
It's useless, but also normally harmless.

In TDX guests, though, WBINVD stops being harmless; it triggers a
virtualization exception (#VE).  If the ACPI cache-flushing WBINVD
were left in place, TDX guests would need handling to recover from
the exception.

Avoid using WBINVD whenever running under a hypervisor.  This both
removes the useless WBINVDs and saves TDX from implementing WBINVD
handling.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-30-kirill.shutemov@linux.intel.com
2022-04-07 08:27:54 -07:00
Kirill A. Shutemov
968b493173 x86/mm: Make DMA memory shared for TD guest
Intel TDX doesn't allow VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies for any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.

The previous patch ("Add support for TDX shared memory") gave TDX guests
the _ability_ to make some pages shared, but did not make any pages
shared. This actually marks SWIOTLB buffers *as* shared.

Start returning true for cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) in
TDX guests.  This has several implications:

 - Allows the existing mem_encrypt_init() to be used for TDX which
   sets SWIOTLB buffers shared (aka. "decrypted").
 - Ensures that all DMA is routed via the SWIOTLB mechanism (see
   pci_swiotlb_detect())

Stop selecting DYNAMIC_PHYSICAL_MASK directly. It will get set
indirectly by selecting X86_MEM_ENCRYPT.

mem_encrypt_init() is currently under an AMD-specific #ifdef. Move it to
a generic area of the header.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-28-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Kuppuswamy Sathyanarayanan
f39642d0db x86/acpi/x86/boot: Add multiprocessor wake-up support
Secondary CPU startup is currently performed with something called
the "INIT/SIPI protocol".  This protocol requires assistance from
VMMs to boot guests.  As should be a familiar story by now, that
support can not be provded to TDX guests because TDX VMMs are
not trusted by guests.

To remedy this situation a new[1] "Multiprocessor Wakeup Structure"
has been added to to an existing ACPI table (MADT).  This structure
provides the physical address of a "mailbox".  A write to the mailbox
then steers the secondary CPU to the boot code.

Add ACPI MADT wake structure parsing support and wake support.  Use
this support to wake CPUs whenever it is present instead of INIT/SIPI.

While this structure can theoretically be used on 32-bit kernels,
there are no 32-bit TDX guest kernels.  It has not been tested and
can not practically *be* tested on 32-bit.  Make it 64-bit only.

1. Details about the new structure can be found in ACPI v6.4, in the
   "Multiprocessor Wakeup Structure" section.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-22-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Sean Christopherson
ff2e64684f x86/boot: Add a trampoline for booting APs via firmware handoff
Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the start up IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by the VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX in which VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".

Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(because it boots the AP in 64-bit mode). To handle it, extend the
trampoline code to support 64-bit mode firmware handoff. Also, extend
IDT and GDT pointers to support 64-bit mode hand off.

There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.

The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-21-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kuppuswamy Sathyanarayanan
cfb8ec7a31 x86/tdx: Wire up KVM hypercalls
KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests.

Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX
guests to run with KVM acting as the hypervisor.

Among other things, KVM hypercall is used to send IPIs.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall() to make the symbols visible to kvm.ko.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-20-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Andi Kleen
32e72854fa x86/tdx: Port I/O: Add early boot support
TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting them
into TDCALLs to call the host.

But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support the #VE-based emulation during
this boot window, add a minimal early #VE handler support in early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exceptions() (like
normal exception failures). At runtime I/O-related #VE exceptions (along
with other types) handled by virt_exception_kernel().

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kirill A. Shutemov
4c5b9aac6c x86/boot: Port I/O: Add decompression-time support for TDX
Port I/O instructions trigger #VE in the TDX environment. In response to
the exception, kernel emulates these instructions using hypercalls.

But during early boot, on the decompression stage, it is cumbersome to
deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE
handling.

Hook up TDX-specific port I/O helpers if booting in TDX environment.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-17-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kirill A. Shutemov
1e8f93e183 x86: Consolidate port I/O helpers
There are two implementations of port I/O helpers: one in the kernel and
one in the boot stub.

Move the helpers required for both to <asm/shared/io.h> and use the one
implementation everywhere.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-15-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kirill A. Shutemov
15104de122 x86: Adjust types used in port I/O helpers
Change port I/O helpers to use u8/u16/u32 instead of unsigned
char/short/int for values. Use u16 instead of int for port number.

It aligns the helpers with implementation in boot stub in preparation
for consolidation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-14-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kuppuswamy Sathyanarayanan
4b05f81504 x86/tdx: Detect TDX at early kernel decompression time
The early decompression code does port I/O for its console output. But,
handling the decompression-time port I/O demands a different approach
from normal runtime because the IDT required to support #VE based port
I/O emulation is not yet set up. Paravirtualizing I/O calls during
the decompression step is acceptable because the decompression code
doesn't have a lot of call sites to IO instruction.

To support port I/O in decompression code, TDX must be detected before
the decompression code might do port I/O. Detect whether the kernel runs
in a TDX guest.

Add an early_is_tdx_guest() interface to query the cached TDX guest
status in the decompression code.

TDX is detected with CPUID. Make cpuid_count() accessible outside
boot/cpuflags.c.

TDX detection in the main kernel is very similar. Move common bits
into <asm/shared/tdx.h>.

The actual port I/O paravirtualization will come later in the series.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-13-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov
bfe6ed0c67 x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.

To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.

Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov
9a22bf6deb x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to specific guest physical addresses

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag.  This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.

Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.

For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kuppuswamy Sathyanarayanan
eb94f1b6a7 x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.

In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.

A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.

Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).

__tdx_module_call()  - Used to communicate with the TDX module (via
		       TDCALL instruction).
__tdx_hypercall()    - Used by the guest to request services from
		       the VMM (via TDVMCALL leaf of TDCALL).

Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.

The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file.  The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.

Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.

For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".

Based on previous patch by Sean Christopherson.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
2022-04-07 08:27:50 -07:00
Kirill A. Shutemov
527a534c73 x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappers
Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are both isolated from the legacy
VMX operation where the host kernel runs.

A CPU-attested software module (called 'TDX module') runs in SEAM VMX
root to manage and protect VMs running in SEAM VMX non-root.  SEAM VMX
root is also used to host another CPU-attested software module (called
'P-SEAMLDR') to load and update the TDX module.

Host kernel transits to either P-SEAMLDR or TDX module via the new
SEAMCALL instruction, which is essentially a VMExit from VMX root mode
to SEAM VMX root mode.  SEAMCALLs are leaf functions defined by
P-SEAMLDR and TDX module around the new SEAMCALL instruction.

A guest kernel can also communicate with TDX module via TDCALL
instruction.

TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI.
RAX is used to carry both the SEAMCALL leaf function number (input) and
the completion status (output).  Additional GPRs (RCX, RDX, R8-R11) may
be further used as both input and output operands in individual leaf.

TDCALL and SEAMCALL share the same ABI and require the largely same
code to pass down arguments and retrieve results.

Define an assembly macro that can be used to implement C wrapper for
both TDCALL and SEAMCALL.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-3-kirill.shutemov@linux.intel.com
2022-04-07 08:27:50 -07:00
Kuppuswamy Sathyanarayanan
59bd54a84d x86/tdx: Detect running as a TDX guest in early boot
In preparation of extending cc_platform_has() API to support TDX guest,
use CPUID instruction to detect support for TDX guests in the early
boot code (via tdx_early_init()). Since copy_bootdata() is the first
user of cc_platform_has() API, detect the TDX guest status before it.

Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit in a valid TDX guest platform.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-2-kirill.shutemov@linux.intel.com
2022-04-07 08:27:50 -07:00
Brijesh Singh
3a45b37538 x86/sev: Register SEV-SNP guest request platform device
Version 2 of the GHCB specification provides a Non Automatic Exit (NAE)
event type that can be used by the SEV-SNP guest to communicate with the
PSP without risk from a malicious hypervisor who wishes to read, alter,
drop or replay the messages sent.

SNP_LAUNCH_UPDATE can insert two special pages into the guest’s memory:
the secrets page and the CPUID page. The PSP firmware populates the
contents of the secrets page. The secrets page contains encryption keys
used by the guest to interact with the firmware. Because the secrets
page is encrypted with the guest’s memory encryption key, the hypervisor
cannot read the keys. See SEV-SNP firmware spec for further details on
the secrets page format.

Create a platform device that the SEV-SNP guest driver can bind to get
the platform resources such as encryption key and message id to use to
communicate with the PSP. The SEV-SNP guest driver provides a userspace
interface to get the attestation report, key derivation, extended
attestation report etc.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-43-brijesh.singh@amd.com
2022-04-07 16:47:12 +02:00
Brijesh Singh
d5af44dde5 x86/sev: Provide support for SNP guest request NAEs
Version 2 of GHCB specification provides SNP_GUEST_REQUEST and
SNP_EXT_GUEST_REQUEST NAE that can be used by the SNP guest to
communicate with the PSP.

While at it, add a snp_issue_guest_request() helper that will be used by
driver or other subsystem to issue the request to PSP.

See SEV-SNP firmware and GHCB spec for more details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-42-brijesh.singh@amd.com
2022-04-07 16:47:12 +02:00
Michael Roth
b190a043c4 x86/sev: Add SEV-SNP feature detection/setup
Initial/preliminary detection of SEV-SNP is done via the Confidential
Computing blob. Check for it prior to the normal SEV/SME feature
initialization, and add some sanity checks to confirm it agrees with
SEV-SNP CPUID/MSR bits.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-39-brijesh.singh@amd.com
2022-04-07 16:47:11 +02:00
Michael Roth
c01fce9cef x86/compressed: Add SEV-SNP feature detection/setup
Initial/preliminary detection of SEV-SNP is done via the Confidential
Computing blob. Check for it prior to the normal SEV/SME feature
initialization, and add some sanity checks to confirm it agrees with
SEV-SNP CPUID/MSR bits.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-35-brijesh.singh@amd.com
2022-04-07 16:47:11 +02:00
Michael Roth
8c9c509baf x86/boot: Add a pointer to Confidential Computing blob in bootparams
The previously defined Confidential Computing blob is provided to the
kernel via a setup_data structure or EFI config table entry. Currently,
these are both checked for by boot/compressed kernel to access the CPUID
table address within it for use with SEV-SNP CPUID enforcement.

To also enable that enforcement for the run-time kernel, similar
access to the CPUID table is needed early on while it's still using
the identity-mapped page table set up by boot/compressed, where global
pointers need to be accessed via fixup_pointer().

This isn't much of an issue for accessing setup_data, and the EFI config
table helper code currently used in boot/compressed *could* be used in
this case as well since they both rely on identity-mapping. However, it
has some reliance on EFI helpers/string constants that would need to be
accessed via fixup_pointer(), and fixing it up while making it shareable
between boot/compressed and run-time kernel is fragile and introduces a
good bit of ugliness.

Instead, add a boot_params->cc_blob_address pointer that the
boot/compressed kernel can initialize so that the run-time kernel can
access the CC blob from there instead of re-scanning the EFI config
table.

Also document these in Documentation/x86/zero-page.rst. While there,
add missing documentation for the acpi_rsdp_addr field, which serves a
similar purpose in providing the run-time kernel a pointer to the ACPI
RSDP table so that it does not need to [re-]scan the EFI configuration
table.

  [ bp: Fix typos, massage commit message. ]

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-34-brijesh.singh@amd.com
2022-04-07 16:47:11 +02:00
Michael Roth
ee0bfa08a3 x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlers
CPUID instructions generate a #VC exception for SEV-ES/SEV-SNP guests,
for which early handlers are currently set up to handle. In the case
of SEV-SNP, guests can use a configurable location in guest memory
that has been pre-populated with a firmware-validated CPUID table to
look up the relevant CPUID values rather than requesting them from
hypervisor via a VMGEXIT. Add the various hooks in the #VC handlers to
allow CPUID instructions to be handled via the table. The code to
actually configure/enable the table will be added in a subsequent
commit.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-33-brijesh.singh@amd.com
2022-04-07 16:47:11 +02:00
Michael Roth
b66370db9a KVM: x86: Move lookup of indexed CPUID leafs to helper
Determining which CPUID leafs have significant ECX/index values is
also needed by guest kernel code when doing SEV-SNP-validated CPUID
lookups. Move this to common code to keep future updates in sync.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-31-brijesh.singh@amd.com
2022-04-07 16:47:11 +02:00
Brijesh Singh
5ea98e01ab x86/boot: Add Confidential Computing type to setup_data
While launching encrypted guests, the hypervisor may need to provide
some additional information during the guest boot. When booting under an
EFI-based BIOS, the EFI configuration table contains an entry for the
confidential computing blob that contains the required information.

To support booting encrypted guests on non-EFI VMs, the hypervisor
needs to pass this additional information to the guest kernel using a
different method.

For this purpose, introduce SETUP_CC_BLOB type in setup_data to hold
the physical address of the confidential computing blob location. The
boot loader or hypervisor may choose to use this method instead of an
EFI configuration table. The CC blob location scanning should give
preference to a setup_data blob over an EFI configuration table.

In AMD SEV-SNP, the CC blob contains the address of the secrets and
CPUID pages. The secrets page includes information such as a VM to PSP
communication key and the CPUID page contains PSP-filtered CPUID values.
Define the AMD SEV confidential computing blob structure.

While at it, define the EFI GUID for the confidential computing blob.

  [ bp: Massage commit message, mark struct __packed. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220307213356.2797205-30-brijesh.singh@amd.com
2022-04-07 16:46:33 +02:00
Reto Buerki
59b18a1e65 x86/msi: Fix msi message data shadow struct
The x86 MSI message data is 32 bits in total and is either in
compatibility or remappable format, see Intel Virtualization Technology
for Directed I/O, section 5.1.2.

Fixes: 6285aa5073 ("x86/msi: Provide msi message shadow structs")
Co-developed-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Reto Buerki <reet@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220407110647.67372-1-reet@codelabs.ch
2022-04-07 15:19:32 +02:00
Nick Desaulniers
334865b291 x86/extable: Prefer local labels in .set directives
Bernardo reported an error that Nathan bisected down to
(x86_64) defconfig+LTO_CLANG_FULL+X86_PMEM_LEGACY.

    LTO     vmlinux.o
  ld.lld: error: <instantiation>:1:13: redefinition of 'found'
  .set found, 0
              ^

  <inline asm>:29:1: while in macro instantiation
  extable_type_reg reg=%eax, type=(17 | ((0) << 16))
  ^

This appears to be another LTO specific issue similar to what was folded
into commit 4b5305decc ("x86/extable: Extend extable functionality"),
where the `.set found, 0` in DEFINE_EXTABLE_TYPE_REG in
arch/x86/include/asm/asm.h conflicts with the symbol for the static
function `found` in arch/x86/kernel/pmem.c.

Assembler .set directive declare symbols with global visibility, so the
assembler may not rename such symbols in the event of a conflict. LTO
could rename static functions if there was a conflict in C sources, but
it cannot see into symbols defined in inline asm.

The symbols are also retained in the symbol table, regardless of LTO.

Give the symbols .L prefixes making them locally visible, so that they
may be renamed for LTO to avoid conflicts, and to drop them from the
symbol table regardless of LTO.

Fixes: 4b5305decc ("x86/extable: Extend extable functionality")
Reported-by: Bernardo Meurer Costa <beme@google.com>
Debugged-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20220329202148.2379697-1-ndesaulniers@google.com
2022-04-07 11:27:02 +02:00
Michael Roth
469693d8f6 x86/head/64: Re-enable stack protection
Due to

  103a4908ad ("x86/head/64: Disable stack protection for head$(BITS).o")

kernel/head{32,64}.c are compiled with -fno-stack-protector to allow
a call to set_bringup_idt_handler(), which would otherwise have stack
protection enabled with CONFIG_STACKPROTECTOR_STRONG.

While sufficient for that case, there may still be issues with calls to
any external functions that were compiled with stack protection enabled
that in-turn make stack-protected calls, or if the exception handlers
set up by set_bringup_idt_handler() make calls to stack-protected
functions.

Subsequent patches for SEV-SNP CPUID validation support will introduce
both such cases. Attempting to disable stack protection for everything
in scope to address that is prohibitive since much of the code, like the
SEV-ES #VC handler, is shared code that remains in use after boot and
could benefit from having stack protection enabled. Attempting to inline
calls is brittle and can quickly balloon out to library/helper code
where that's not really an option.

Instead, re-enable stack protection for head32.c/head64.c, and make the
appropriate changes to ensure the segment used for the stack canary is
initialized in advance of any stack-protected C calls.

For head64.c:

- The BSP will enter from startup_64() and call into C code
  (startup_64_setup_env()) shortly after setting up the stack, which
  may result in calls to stack-protected code. Set up %gs early to allow
  for this safely.
- APs will enter from secondary_startup_64*(), and %gs will be set up
  soon after. There is one call to C code prior to %gs being setup
  (__startup_secondary_64()), but it is only to fetch 'sme_me_mask'
  global, so just load 'sme_me_mask' directly instead, and remove the
  now-unused __startup_secondary_64() function.

For head32.c:

- BSPs/APs will set %fs to __BOOT_DS prior to any C calls. In recent
  kernels, the compiler is configured to access the stack canary at
  %fs:__stack_chk_guard [1], which overlaps with the initial per-cpu
  '__stack_chk_guard' variable in the initial/"master" .data..percpu
  area. This is sufficient to allow access to the canary for use
  during initial startup, so no changes are needed there.

[1] 3fb0fdb3bb ("x86/stackprotector/32: Make the canary into a regular percpu variable")

  [ bp: Massage commit message. ]

Suggested-by: Joerg Roedel <jroedel@suse.de> #for 64-bit %gs set up
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-24-brijesh.singh@amd.com
2022-04-06 17:06:55 +02:00
Tom Lendacky
0afb6b660a x86/sev: Use SEV-SNP AP creation to start secondary CPUs
To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP
AP Creation NAE event. This allows for guest control over the AP register
state rather than trusting the hypervisor with the SEV-ES Jump Table
address.

During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if
SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This
will allow the SEV-SNP AP Creation NAE event method to be used to boot
the APs. As a result of installing the override when SEV-SNP is active,
this method of starting the APs becomes the required method. The override
function will fail to start the AP if the hypervisor does not have
support for AP creation.

  [ bp: Work in forgotten review comments. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-23-brijesh.singh@amd.com
2022-04-06 17:06:49 +02:00
Brijesh Singh
dc3f3d2474 x86/mm: Validate memory when changing the C-bit
Add the needed functionality to change pages state from shared
to private and vice-versa using the Page State Change VMGEXIT as
documented in the GHCB spec.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-22-brijesh.singh@amd.com
2022-04-06 13:24:53 +02:00
Brijesh Singh
5e5ccff60a x86/sev: Add helper for validating pages in early enc attribute changes
early_set_memory_{encrypted,decrypted}() are used for changing the page
state from decrypted (shared) to encrypted (private) and vice versa.

When SEV-SNP is active, the page state transition needs to go through
additional steps.

If the page is transitioned from shared to private, then perform the
following after the encryption attribute is set in the page table:

1. Issue the page state change VMGEXIT to add the page as a private
   in the RMP table.
2. Validate the page after its successfully added in the RMP table.

To maintain the security guarantees, if the page is transitioned from
private to shared, then perform the following before clearing the
encryption attribute from the page table.

1. Invalidate the page.
2. Issue the page state change VMGEXIT to make the page shared in the
   RMP table.

early_set_memory_{encrypted,decrypted}() can be called before the GHCB
is setup so use the SNP page state MSR protocol VMGEXIT defined in the
GHCB specification to request the page state change in the RMP table.

While at it, add a helper snp_prep_memory() which will be used in
probe_roms(), in a later patch.

  [ bp: Massage commit message. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-19-brijesh.singh@amd.com
2022-04-06 13:22:54 +02:00
Brijesh Singh
95d33bfaa3 x86/sev: Register GHCB memory when SEV-SNP is active
The SEV-SNP guest is required by the GHCB spec to register the GHCB's
Guest Physical Address (GPA). This is because the hypervisor may prefer
that a guest uses a consistent and/or specific GPA for the GHCB associated
with a vCPU. For more information, see the GHCB specification section
"GHCB GPA Registration".

  [ bp: Cleanup comments. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-18-brijesh.singh@amd.com
2022-04-06 13:16:58 +02:00
Brijesh Singh
87294bdb7b x86/compressed: Register GHCB memory when SEV-SNP is active
The SEV-SNP guest is required by the GHCB spec to register the GHCB's
Guest Physical Address (GPA). This is because the hypervisor may prefer
that a guest use a consistent and/or specific GPA for the GHCB associated
with a vCPU. For more information, see the GHCB specification section
"GHCB GPA Registration".

If hypervisor can not work with the guest provided GPA then terminate the
guest boot.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-17-brijesh.singh@amd.com
2022-04-06 13:14:24 +02:00
Brijesh Singh
4f9c403e44 x86/compressed: Add helper for validating pages in the decompression stage
Many of the integrity guarantees of SEV-SNP are enforced through the
Reverse Map Table (RMP). Each RMP entry contains the GPA at which a
particular page of DRAM should be mapped. The VMs can request the
hypervisor to add pages in the RMP table via the Page State Change
VMGEXIT defined in the GHCB specification.

Inside each RMP entry is a Validated flag; this flag is automatically
cleared to 0 by the CPU hardware when a new RMP entry is created for a
guest. Each VM page can be either validated or invalidated, as indicated
by the Validated flag in the RMP entry. Memory access to a private page
that is not validated generates a #VC. A VM must use the PVALIDATE
instruction to validate a private page before using it.

To maintain the security guarantee of SEV-SNP guests, when transitioning
pages from private to shared, the guest must invalidate the pages before
asking the hypervisor to change the page state to shared in the RMP table.

After the pages are mapped private in the page table, the guest must
issue a page state change VMGEXIT to mark the pages private in the RMP
table and validate them.

Upon boot, BIOS should have validated the entire system memory.
During the kernel decompression stage, early_setup_ghcb() uses
set_page_decrypted() to make the GHCB page shared (i.e. clear encryption
attribute). And while exiting from the decompression, it calls
set_page_encrypted() to make the page private.

Add snp_set_page_{private,shared}() helpers that are used by
set_page_{decrypted,encrypted}() to change the page state in the RMP
table.

  [ bp: Massage commit message and comments. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-16-brijesh.singh@amd.com
2022-04-06 13:10:40 +02:00
Brijesh Singh
81cc3df9a9 x86/sev: Check the VMPL level
The Virtual Machine Privilege Level (VMPL) feature in the SEV-SNP
architecture allows a guest VM to divide its address space into four
levels. The level can be used to provide hardware isolated abstraction
layers within a VM. VMPL0 is the highest privilege level, and VMPL3 is
the least privilege level. Certain operations must be done by the VMPL0
software, such as:

* Validate or invalidate memory range (PVALIDATE instruction)
* Allocate VMSA page (RMPADJUST instruction when VMSA=1)

The initial SNP support requires that the guest kernel is running at
VMPL0. Add such a check to verify the guest is running at level 0 before
continuing the boot. There is no easy method to query the current VMPL
level, so use the RMPADJUST instruction to determine whether the guest
is running at the VMPL0.

  [ bp: Massage commit message. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-15-brijesh.singh@amd.com
2022-04-06 13:10:34 +02:00
Brijesh Singh
0bd6f1e526 x86/sev: Add a helper for the PVALIDATE instruction
An SNP-active guest uses the PVALIDATE instruction to validate or
rescind the validation of a guest page’s RMP entry. Upon completion, a
return code is stored in EAX and rFLAGS bits are set based on the return
code. If the instruction completed successfully, the carry flag (CF)
indicates if the content of the RMP were changed or not.

See AMD APM Volume 3 for additional details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-14-brijesh.singh@amd.com
2022-04-06 13:10:30 +02:00
Brijesh Singh
cbd3d4f7c4 x86/sev: Check SEV-SNP features support
Version 2 of the GHCB specification added the advertisement of features
that are supported by the hypervisor. If the hypervisor supports SEV-SNP
then it must set the SEV-SNP features bit to indicate that the base
functionality is supported.

Check that feature bit while establishing the GHCB; if failed, terminate
the guest.

Version 2 of the GHCB specification adds several new Non-Automatic Exits
(NAEs), most of them are optional except the hypervisor feature. Now
that the hypervisor feature NAE is implemented, bump the GHCB maximum
supported protocol version.

While at it, move the GHCB protocol negotiation check from the #VC
exception handler to sev_enable() so that all feature detection happens
before the first #VC exception.

While at it, document why the GHCB page cannot be setup from
load_stage2_idt().

  [ bp: Massage commit message. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-13-brijesh.singh@amd.com
2022-04-06 13:10:23 +02:00
Brijesh Singh
2ea29c5abb x86/sev: Save the negotiated GHCB version
The SEV-ES guest calls sev_es_negotiate_protocol() to negotiate the GHCB
protocol version before establishing the GHCB. Cache the negotiated GHCB
version so that it can be used later.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-12-brijesh.singh@amd.com
2022-04-06 13:10:18 +02:00
Brijesh Singh
6c0f74d678 x86/sev: Define the Linux-specific guest termination reasons
The GHCB specification defines the reason code for reason set 0. The
reason codes defined in the set 0 do not cover all possible causes for a
guest to request termination.

The reason sets 1 to 255 are reserved for the vendor-specific codes.
Reserve the reason set 1 for the Linux guest. Define the error codes for
reason set 1 so that one can have meaningful termination reasons and thus
better guest failure diagnosis.

While at it, change sev_es_terminate() to accept a reason set parameter.

  [ bp: Massage commit message. ]

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-11-brijesh.singh@amd.com
2022-04-06 13:02:41 +02:00
Brijesh Singh
f742b90e61 x86/mm: Extend cc_attr to include AMD SEV-SNP
The CC_ATTR_GUEST_SEV_SNP can be used by the guest to query whether the
SNP (Secure Nested Paging) feature is active.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-10-brijesh.singh@amd.com
2022-04-06 13:02:34 +02:00
Michael Roth
176db62257 x86/boot: Introduce helpers for MSR reads/writes
The current set of helpers used throughout the run-time kernel have
dependencies on code/facilities outside of the boot kernel, so there
are a number of call-sites throughout the boot kernel where inline
assembly is used instead. More will be added with subsequent patches
that add support for SEV-SNP, so take the opportunity to provide a basic
set of helpers that can be used by the boot kernel to reduce reliance on
inline assembly.

Use boot_* prefix so that it's clear these are helpers specific to the
boot kernel to avoid any confusion with the various other MSR read/write
helpers.

  [ bp: Disambiguate parameter names and trim comment. ]

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-6-brijesh.singh@amd.com
2022-04-06 12:59:17 +02:00
Tom Lendacky
6d3b3d34e3 KVM: SVM: Update the SEV-ES save area mapping
This is the final step in defining the multiple save areas to keep them
separate and ensuring proper operation amongst the different types of
guests. Update the SEV-ES/SEV-SNP save area to match the APM. This save
area will be used for the upcoming SEV-SNP AP Creation NAE event support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-5-brijesh.singh@amd.com
2022-04-06 12:19:51 +02:00
Tom Lendacky
a4690359ea KVM: SVM: Create a separate mapping for the GHCB save area
The initial implementation of the GHCB spec was based on trying to keep
the register state offsets the same relative to the VM save area. However,
the save area for SEV-ES has changed within the hardware causing the
relation between the SEV-ES save area to change relative to the GHCB save
area.

This is the second step in defining the multiple save areas to keep them
separate and ensuring proper operation amongst the different types of
guests. Create a GHCB save area that matches the GHCB specification.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-4-brijesh.singh@amd.com
2022-04-06 12:13:34 +02:00
Tom Lendacky
3dd2775b74 KVM: SVM: Create a separate mapping for the SEV-ES save area
The save area for SEV-ES/SEV-SNP guests, as used by the hardware, is
different from the save area of a non SEV-ES/SEV-SNP guest.

This is the first step in defining the multiple save areas to keep them
separate and ensuring proper operation amongst the different types of
guests. Create an SEV-ES/SEV-SNP save area and adjust usage to the new
save area definition where needed.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220405182743.308853-1-brijesh.singh@amd.com
2022-04-06 12:08:40 +02:00
Muralidhara M K
e1907d3751 x86/amd_nb: Unexport amd_cache_northbridges()
amd_cache_northbridges() is exported by amd_nb.c and is called by
amd64-agp.c and amd64_edac.c modules at module_init() time so that NB
descriptors are properly cached before those drivers can use them.

However, the init_amd_nbs() initcall already does call
amd_cache_northbridges() unconditionally and thus makes sure the NB
descriptors are enumerated.

That initcall is a fs_initcall type which is on the 5th group (starting
from 0) of initcalls that gets run in increasing numerical order by the
init code.

The module_init() call is turned into an __initcall() in the MODULE=n
case and those are device-level initcalls, i.e., group 6.

Therefore, the northbridges caching is already finished by the time
module initialization starts and thus the correct initialization order
is retained.

Unexport amd_cache_northbridges(), update dependent modules to
call amd_nb_num() instead. While at it, simplify the checks in
amd_cache_northbridges().

  [ bp: Heavily massage and *actually* explain why the change is ok. ]

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220324122729.221765-1-nchatrad@amd.com
2022-04-05 19:22:27 +02:00
Brijesh Singh
046f773be1 KVM: SVM: Define sev_features and VMPL field in the VMSA
The hypervisor uses the sev_features field (offset 3B0h) in the Save State
Area to control the SEV-SNP guest features such as SNPActive, vTOM,
ReflectVC etc. An SEV-SNP guest can read the sev_features field through
the SEV_STATUS MSR.

While at it, update dump_vmcb() to log the VMPL level.

See APM2 Table 15-34 and B-4 for more details.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Link: https://lore.kernel.org/r/20220307213356.2797205-2-brijesh.singh@amd.com
2022-04-05 19:09:27 +02:00
Sean Christopherson
1d0e848060 KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded
Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.

Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.

In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded.  Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.

Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.

  =========================================================================
  UBSAN: invalid-load in kernel/params.c:320:33
  load of value 255 is not a valid value for type '_Bool'
  CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   ubsan_epilogue+0x5/0x40
   __ubsan_handle_load_invalid_value.cold+0x43/0x48
   param_get_bool.cold+0xf/0x14
   param_attr_show+0x55/0x80
   module_attr_show+0x1c/0x30
   sysfs_kf_seq_show+0x93/0xc0
   seq_read_iter+0x11c/0x450
   new_sync_read+0x11b/0x1a0
   vfs_read+0xf0/0x190
   ksys_read+0x5f/0xe0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  =========================================================================

Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-05 08:09:46 -04:00
Vincent Mailhol
9ce02f0fc6 x86/bug: Prevent shadowing in __WARN_FLAGS
The macro __WARN_FLAGS() uses a local variable named "f". This being a
common name, there is a risk of shadowing other variables.

For example, GCC would yield:

| In file included from ./include/linux/bug.h:5,
|                  from ./include/linux/cpumask.h:14,
|                  from ./arch/x86/include/asm/cpumask.h:5,
|                  from ./arch/x86/include/asm/msr.h:11,
|                  from ./arch/x86/include/asm/processor.h:22,
|                  from ./arch/x86/include/asm/timex.h:5,
|                  from ./include/linux/timex.h:65,
|                  from ./include/linux/time32.h:13,
|                  from ./include/linux/time.h:60,
|                  from ./include/linux/stat.h:19,
|                  from ./include/linux/module.h:13,
|                  from virt/lib/irqbypass.mod.c:1:
| ./include/linux/rcupdate.h: In function 'rcu_head_after_call_rcu':
| ./arch/x86/include/asm/bug.h:80:21: warning: declaration of 'f' shadows a parameter [-Wshadow]
|    80 |         __auto_type f = BUGFLAG_WARNING|(flags);                \
|       |                     ^
| ./include/asm-generic/bug.h:106:17: note: in expansion of macro '__WARN_FLAGS'
|   106 |                 __WARN_FLAGS(BUGFLAG_ONCE |                     \
|       |                 ^~~~~~~~~~~~
| ./include/linux/rcupdate.h:1007:9: note: in expansion of macro 'WARN_ON_ONCE'
|  1007 |         WARN_ON_ONCE(func != (rcu_callback_t)~0L);
|       |         ^~~~~~~~~~~~
| In file included from ./include/linux/rbtree.h:24,
|                  from ./include/linux/mm_types.h:11,
|                  from ./include/linux/buildid.h:5,
|                  from ./include/linux/module.h:14,
|                  from virt/lib/irqbypass.mod.c:1:
| ./include/linux/rcupdate.h:1001:62: note: shadowed declaration is here
|  1001 | rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f)
|       |                                               ~~~~~~~~~~~~~~~^

For reference, sparse also warns about it, c.f. [1].

This patch renames the variable from f to __flags (with two underscore
prefixes as suggested in the Linux kernel coding style [2]) in order
to prevent collisions.

[1] https://lore.kernel.org/all/CAFGhKbyifH1a+nAMCvWM88TK6fpNPdzFtUXPmRGnnQeePV+1sw@mail.gmail.com/

[2] Linux kernel coding style, section 12) Macros, Enums and RTL,
paragraph 5) namespace collisions when defining local variables in
macros resembling functions
https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl

Fixes: bfb1a7c91f ("x86/bug: Merge annotate_reachable() into_BUG_FLAGS() asm")
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20220324023742.106546-1-mailhol.vincent@wanadoo.fr
2022-04-05 10:24:40 +02:00
Stephane Eranian
d5616bac7a perf/x86/amd: Add idle hooks for branch sampling
On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before
entering low power and re-enabled (if was active) when returning from low
power. Otherwise, the NMI interrupt may be held up for too long and cause
problems. Stopping BRS will cause the NMI to be delivered if it was held up.

Define a perf_amd_brs_lopwr_cb() callback to stop/restart BRS.  The callback
is protected by a jump label which is enabled only when AMD BRS is detected.
In all other cases, the callback is never called.

Signed-off-by: Stephane Eranian <eranian@google.com>
[peterz: static_call() and build fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220322221517.2510440-10-eranian@google.com
2022-04-05 10:24:38 +02:00
Stephane Eranian
ada543459c perf/x86/amd: Add AMD Fam19h Branch Sampling support
Add support for the AMD Fam19h 16-deep branch sampling feature as
described in the AMD PPR Fam19h Model 01h Revision B1.  This is a model
specific extension. It is not an architected AMD feature.

The Branch Sampling (BRS) operates with a 16-deep saturating buffer in MSR
registers. There is no branch type filtering. All control flow changes are
captured. BRS relies on specific programming of the core PMU of Fam19h.  In
particular, the following requirements must be met:
 - the sampling period be greater than 16 (BRS depth)
 - the sampling period must use a fixed and not frequency mode

BRS interacts with the NMI interrupt as well. Because enabling BRS is
expensive, it is only activated after P event occurrences, where P is the
desired sampling period.  At P occurrences of the event, the counter
overflows, the CPU catches the interrupt, activates BRS for 16 branches until
it saturates, and then delivers the NMI to the kernel.  Between the overflow
and the time BRS activates more branches may be executed skewing the period.
All along, the sampling event keeps counting. The skid may be attenuated by
reducing the sampling period by 16 (subsequent patch).

BRS is integrated into perf_events seamlessly via the same
PERF_RECORD_BRANCH_STACK sample format. BRS generates perf_branch_entry
records in the sampling buffer. No prediction information is supported. The
branches are stored in reverse order of execution.  The most recent branch is
the first entry in each record.

No modification to the perf tool is necessary.

BRS can be used with any sampling event. However, it is recommended to use
the RETIRED_BRANCH_INSTRUCTIONS event because it matches what the BRS
captures.

$ perf record -b -c 1000037 -e cpu/event=0xc2,name=ret_br_instructions/ test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
.....  0: 0000000000401d24 -> 0000000000401d5a 0 cycles      0
.....  1: 0000000000401d5c -> 0000000000401d24 0 cycles      0
.....  2: 0000000000401d22 -> 0000000000401d5c 0 cycles      0
.....  3: 0000000000401d5e -> 0000000000401d22 0 cycles      0
.....  4: 0000000000401d20 -> 0000000000401d5e 0 cycles      0
.....  5: 0000000000401d3e -> 0000000000401d20 0 cycles      0
.....  6: 0000000000401d42 -> 0000000000401d3e 0 cycles      0
.....  7: 0000000000401d3c -> 0000000000401d42 0 cycles      0
.....  8: 0000000000401d44 -> 0000000000401d3c 0 cycles      0
.....  9: 0000000000401d3a -> 0000000000401d44 0 cycles      0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles      0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles      0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles      0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles      0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles      0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles      0
 ... thread: test:18230
 ...... dso: test

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220322221517.2510440-4-eranian@google.com
2022-04-05 10:24:37 +02:00
Stephane Eranian
a77d41ac3a x86/cpufeatures: Add AMD Fam19h Branch Sampling feature
Add a cpu feature for AMD Fam19h Branch Sampling feature as bit
31 of EBX on CPUID leaf function 0x80000008.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220322221517.2510440-3-eranian@google.com
2022-04-05 10:24:36 +02:00
Kan Liang
4a263bf331 perf/x86/intel: Don't extend the pseudo-encoding to GP counters
The INST_RETIRED.PREC_DIST event (0x0100) doesn't count on SPR.
perf stat -e cpu/event=0xc0,umask=0x0/,cpu/event=0x0,umask=0x1/ -C0

 Performance counter stats for 'CPU(s) 0':

           607,246      cpu/event=0xc0,umask=0x0/
                 0      cpu/event=0x0,umask=0x1/

The encoding for INST_RETIRED.PREC_DIST is pseudo-encoding, which
doesn't work on the generic counters. However, current perf extends its
mask to the generic counters.

The pseudo event-code for a fixed counter must be 0x00. Check and avoid
extending the mask for the fixed counter event which using the
pseudo-encoding, e.g., ref-cycles and PREC_DIST event.

With the patch,
perf stat -e cpu/event=0xc0,umask=0x0/,cpu/event=0x0,umask=0x1/ -C0

 Performance counter stats for 'CPU(s) 0':

           583,184      cpu/event=0xc0,umask=0x0/
           583,048      cpu/event=0x0,umask=0x1/

Fixes: 2de71ee153 ("perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1648482543-14923-1-git-send-email-kan.liang@linux.intel.com
2022-04-05 09:59:44 +02:00
Sebastian Andrzej Siewior
1c1e7e3c23 x86/percpu: Remove volatile from arch_raw_cpu_ptr().
The volatile attribute in the inline assembly of arch_raw_cpu_ptr()
forces the compiler to always generate the code, even if the compiler
can decide upfront that its result is not needed.

For instance invoking __intel_pmu_disable_all(false) (like
intel_pmu_snapshot_arch_branch_stack() does) leads to loading the
address of &cpu_hw_events into the register while compiler knows that it
has no need for it. This ends up with code like:

|	movq	$cpu_hw_events, %rax			#, tcp_ptr__
|	add	%gs:this_cpu_off(%rip), %rax		# this_cpu_off, tcp_ptr__
|	xorl	%eax, %eax				# tmp93

It also creates additional code within local_lock() with !RT &&
!LOCKDEP which is not desired.

By removing the volatile attribute the compiler can place the
function freely and avoid it if it is not needed in the end.
By using the function twice the compiler properly caches only the
variable offset and always loads the CPU-offset.

this_cpu_ptr() also remains properly placed within a preempt_disable()
sections because
- arch_raw_cpu_ptr() assembly has a memory input ("m" (this_cpu_off))
- prempt_{dis,en}able() fundamentally has a 'barrier()' in it

Therefore this_cpu_ptr() is already properly serialized and does not
rely on the 'volatile' attribute.

Remove volatile from arch_raw_cpu_ptr().

[ bigeasy: Added Linus' explanation why this_cpu_ptr() is not moved out
  of a preempt_disable() section without the 'volatile' attribute. ]

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220328145810.86783-2-bigeasy@linutronix.de
2022-04-05 09:59:38 +02:00
Christophe Leroy
5517d50082 static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
When a static call is updated with __static_call_return0() as target,
arch_static_call_transform() set it to use an optimised set of
instructions which are meant to lay in the same cacheline.

But when initialising a static call with DEFINE_STATIC_CALL_RET0(),
we get a branch to the real __static_call_return0() function instead
of getting the optimised setup:

	c00d8120 <__SCT__perf_snapshot_branch_stack>:
	c00d8120:	4b ff ff f4 	b       c00d8114 <__static_call_return0>
	c00d8124:	3d 80 c0 0e 	lis     r12,-16370
	c00d8128:	81 8c 81 3c 	lwz     r12,-32452(r12)
	c00d812c:	7d 89 03 a6 	mtctr   r12
	c00d8130:	4e 80 04 20 	bctr
	c00d8134:	38 60 00 00 	li      r3,0
	c00d8138:	4e 80 00 20 	blr
	c00d813c:	00 00 00 00 	.long 0x0

Add ARCH_DEFINE_STATIC_CALL_RET0_TRAMP() defined by each architecture
to setup the optimised configuration, and rework
DEFINE_STATIC_CALL_RET0() to call it:

	c00d8120 <__SCT__perf_snapshot_branch_stack>:
	c00d8120:	48 00 00 14 	b       c00d8134 <__SCT__perf_snapshot_branch_stack+0x14>
	c00d8124:	3d 80 c0 0e 	lis     r12,-16370
	c00d8128:	81 8c 81 3c 	lwz     r12,-32452(r12)
	c00d812c:	7d 89 03 a6 	mtctr   r12
	c00d8130:	4e 80 04 20 	bctr
	c00d8134:	38 60 00 00 	li      r3,0
	c00d8138:	4e 80 00 20 	blr
	c00d813c:	00 00 00 00 	.long 0x0

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/1e0a61a88f52a460f62a58ffc2a5f847d1f7d9d8.1647253456.git.christophe.leroy@csgroup.eu
2022-04-05 09:59:38 +02:00
Ira Weiny
5a0893088a x86/pkeys: Remove __arch_set_user_pkey_access() declaration
In the x86 code __arch_set_user_pkey_access() is not used and is not
defined.

Remove the dead declaration.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220331180655.2946086-1-ira.weiny@intel.com
2022-04-04 15:58:24 -07:00
Ira Weiny
70431c63d7 x86/pkeys: Clean up arch_set_user_pkey_access() declaration
arch_set_user_pkey_access() was declared two times in the header.

Remove the 2nd declaration.

Suggested-by: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220331180554.2945884-1-ira.weiny@intel.com
2022-04-04 15:58:24 -07:00
Borislav Petkov
76ea0025a2 x86/cpu: Remove "noexec"
It doesn't make any sense to disable non-executable mappings -
security-wise or else.

So rip out that switch and move the remaining code into setup.c and
delete setup_nx.c

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-6-bp@alien8.de
2022-04-04 10:17:03 +02:00
Borislav Petkov
dbae0a934f x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"
Those were added as part of the SMAP enablement but SMAP is currently
an integral part of kernel proper and there's no need to disable it
anymore.

Rip out that functionality. Leave --uaccess default on for objtool as
this is what objtool should do by default anyway.

If still needed - clearcpuid=smap.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-4-bp@alien8.de
2022-04-04 10:16:57 +02:00
Borislav Petkov
1625c833db x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=
Having to give the X86_FEATURE array indices in order to disable a
feature bit for testing is not really user-friendly. So accept the
feature bit names too.

Some feature bits don't have names so there the array indices are still
accepted, of course.

Clearing CPUID flags is not something which should be done in production
so taint the kernel too.

An exemplary cmdline would then be something like:

  clearcpuid=de,440,smca,succory,bmi1,3dnow

("succory" is wrong on purpose). And it says:

  [   ... ] Clearing CPUID bits: de 13:24 smca (unknown: succory) bmi1 3dnow

  [ Fix CONFIG_X86_FEATURE_NAMES=n build error as reported by the 0day
    robot: https://lore.kernel.org/r/202203292206.ICsY2RKX-lkp@intel.com ]

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127115626.14179-2-bp@alien8.de
2022-04-04 10:16:52 +02:00
Borislav Petkov
ace1a98519 x86/mm: Force-inline __phys_addr_nodebug()
Fix:

  vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0x8b: call to __phys_addr_nodebug() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220324183607.31717-4-bp@alien8.de
2022-04-04 10:13:25 +02:00
Borislav Petkov
6b91ec4ad2 x86/kvm/svm: Force-inline GHCB accessors
In order to fix:

  vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0x4c: call to ghcb_set_sw_exit_code() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220324183607.31717-3-bp@alien8.de
2022-04-04 10:13:20 +02:00
Borislav Petkov
e87f4152e5 task_stack, x86/cea: Force-inline stack helpers
Force-inline two stack helpers to fix the following objtool warnings:

  vmlinux.o: warning: objtool: in_task_stack()+0xc: call to task_stack_page() leaves .noinstr.text section
  vmlinux.o: warning: objtool: in_entry_stack()+0x10: call to cpu_entry_stack() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220324183607.31717-2-bp@alien8.de
2022-04-04 10:13:07 +02:00
Linus Torvalds
38904911e8 * Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
* Documentation improvements
 
 * Prevent module exit until all VMs are freed
 
 * PMU Virtualization fixes
 
 * Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
 
 * Other miscellaneous bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJIGV8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroO5FQgAhls4+Nu+NqId/yvvyNxr3vXq0dHI
 hLlHtvzgGzZisZ7y2bNeyIpJVBDT5LCbrptPD/5eTvchVswDh0+kCVC0Uni5ugGT
 tLT/Pv9Oq9e0X7aGdHRyuHIivIFDC20zIZO2DV48Lrj/+r6DafB2Fghq2XQLlBxN
 p8KislvuqAAos543BPC1+Lk3dhOLuZ8qcFD8wGRlcCwjNwYaitrQ16rO04cLfUur
 OwIks1I6TdI2JpLBhm6oWYVG/YnRsoo4bQE8cjdQ6yNSbwWtRpV33q7X6onw8x8K
 BEeESoTnMqfaxIF/6mPl6bnDblVHFp6Xhld/vJcgeWQTdajFtuFE/K4sCA==
 =xnQ6
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

 - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr

 - Documentation improvements

 - Prevent module exit until all VMs are freed

 - PMU Virtualization fixes

 - Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences

 - Other miscellaneous bugfixes

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
  KVM: x86: fix sending PV IPI
  KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
  KVM: x86: Remove redundant vm_entry_controls_clearbit() call
  KVM: x86: cleanup enter_rmode()
  KVM: x86: SVM: fix tsc scaling when the host doesn't support it
  kvm: x86: SVM: remove unused defines
  KVM: x86: SVM: move tsc ratio definitions to svm.h
  KVM: x86: SVM: fix avic spec based definitions again
  KVM: MIPS: remove reference to trap&emulate virtualization
  KVM: x86: document limitations of MSR filtering
  KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
  KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
  KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
  KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
  KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
  KVM: x86: Trace all APICv inhibit changes and capture overall status
  KVM: x86: Add wrappers for setting/clearing APICv inhibits
  KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
  KVM: X86: Handle implicit supervisor access with SMAP
  KVM: X86: Rename variable smap to not_smap in permission_fault()
  ...
2022-04-02 12:09:02 -07:00
Maxim Levitsky
d5fa597ed8 KVM: x86: allow per cpu apicv inhibit reasons
Add optional callback .vcpu_get_apicv_inhibit_reasons returning
extra inhibit reasons that prevent APICv from working on this vCPU.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:24 -04:00
David Woodhouse
ffbb61d09f KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl.
This sets the default TSC frequency for subsequently created vCPUs.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:19 -04:00
Boris Ostrovsky
1a65105a5a KVM: x86/xen: handle PV spinlocks slowpath
Add support for SCHEDOP_poll hypercall.

This implementation is optimized for polling for a single channel, which
is what Linux does. Polling for multiple channels is not especially
efficient (and has not been tested).

PV spinlocks slow path uses this hypercall, and explicitly crash if it's
not supported.

[ dwmw2: Rework to use kvm_vcpu_halt(), not supported for 32-bit guests ]

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-17-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:17 -04:00
David Woodhouse
fde0451be8 KVM: x86/xen: Support per-vCPU event channel upcall via local APIC
Windows uses a per-vCPU vector, and it's delivered via the local APIC
basically like an MSI (with associated EOI) unlike the traditional
guest-wide vector which is just magically asserted by Xen (and in the
KVM case by kvm_xen_has_interrupt() / kvm_cpu_get_extint()).

Now that the kernel is able to raise event channel events for itself,
being able to do so for Windows guests is also going to be useful.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-15-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:17 -04:00
David Woodhouse
28d1629f75 KVM: x86/xen: Kernel acceleration for XENVER_version
Turns out this is a fast path for PV guests because they use it to
trigger the event channel upcall. So letting it bounce all the way up
to userspace is not great.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-14-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:16 -04:00
Joao Martins
5363952605 KVM: x86/xen: handle PV timers oneshot mode
If the guest has offloaded the timer virq, handle the following
hypercalls for programming the timer:

    VCPUOP_set_singleshot_timer
    VCPUOP_stop_singleshot_timer
    set_timer_op(timestamp_ns)

The event channel corresponding to the timer virq is then used to inject
events once timer deadlines are met. For now we back the PV timer with
hrtimer.

[ dwmw2: Add save/restore, 32-bit compat mode, immediate delivery,
         don't check timer in kvm_vcpu_has_event() ]

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-13-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:16 -04:00
David Woodhouse
942c2490c2 KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID
In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
need to be aware of the Xen CPU numbering.

This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:15 -04:00
Joao Martins
2fd6df2f2b KVM: x86/xen: intercept EVTCHNOP_send from guests
Userspace registers a sending @port to either deliver to an @eventfd
or directly back to a local event channel port.

After binding events the guest or host may wish to bind those
events to a particular vcpu. This is usually done for unbound
and and interdomain events. Update requests are handled via the
KVM_XEN_EVTCHN_UPDATE flag.

Unregistered ports are handled by the emulator.

Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Co-developed-By: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-10-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:15 -04:00
David Woodhouse
69d413cfcf KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info
This switches the final pvclock to kvm_setup_pvclock_pfncache() and now
the old kvm_setup_pvclock_page() can be removed.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-7-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:13 -04:00
David Woodhouse
7caf957156 KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info
Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the
index bits in the target vCPU's evtchn_pending_sel, because it only has
a userspace virtual address with which to do so. It just sets them in
the kernel, and kvm_xen_has_interrupt() then completes the delivery to
the actual vcpu_info structure when the vCPU runs.

Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full
delivery in the common case.

Clean up the fallback case too, by moving the deferred delivery out into
a separate kvm_xen_inject_pending_events() function which isn't ever
called in atomic contexts as __kvm_xen_has_interrupt() is.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:13 -04:00
David Woodhouse
916d3608df KVM: x86: Use gfn_to_pfn_cache for pv_time
Add a new kvm_setup_guest_pvclock() which parallels the existing
kvm_setup_pvclock_page(). The latter will be removed once we convert
all users to the gfn_to_pfn_cache version.

Using the new cache, we can potentially let kvm_set_guest_paused() set
the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate
to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:12 -04:00
David Woodhouse
a795cd43c5 KVM: x86/xen: Use gfn_to_pfn_cache for runstate area
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:12 -04:00
Oliver Upton
f1a9761fbb KVM: x86: Allow userspace to opt out of hypercall patching
KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
both of these instructions really should #UD when executed on the wrong
vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
guest's instruction with the appropriate instruction for the vendor.
Nonetheless, older guest kernels without commit c1118b3602 ("x86: kvm:
use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
do not patch in the appropriate instruction using alternatives, likely
motivating KVM's intervention.

Add a quirk allowing userspace to opt out of hypercall patching. If the
quirk is disabled, KVM synthesizes a #UD in the guest.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220316005538.2282772-2-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:10 -04:00
Maxim Levitsky
bb2aa78e9a KVM: x86: SVM: move tsc ratio definitions to svm.h
Another piece of SVM spec which should be in the header file

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:25 -04:00
Maxim Levitsky
0dacc3df89 KVM: x86: SVM: fix avic spec based definitions again
Due to wrong rebase, commit
4a204f7895 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255")

moved avic spec #defines back to avic.c.

Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12
bits as well (it will be used in nested avic)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:37:24 -04:00
Sean Christopherson
320af55a93 KVM: x86: Add wrappers for setting/clearing APICv inhibits
Add set/clear wrappers for toggling APICv inhibits to make the call sites
more readable, and opportunistically rename the inner helpers to align
with the new wrappers and to make them more readable as well.  Invert the
flag from "activate" to "set"; activate is painfully ambiguous as it's
not obvious if the inhibit is being activated, or if APICv is being
activated, in which case the inhibit is being deactivated.

For the functions that take @set, swap the order of the inhibit reason
and @set so that the call sites are visually similar to those that bounce
through the wrapper.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:44 -04:00
Sean Christopherson
7491b7b2e1 KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
Use an enum for the APICv inhibit reasons, there is no meaning behind
their values and they most definitely are not "unsigned longs".  Rename
the various params to "reason" for consistency and clarity (inhibit may
be confused as a command, i.e. inhibit APICv, instead of the reason that
is getting toggled/checked).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:44 -04:00
Lai Jiangshan
4f4aa80e3b KVM: X86: Handle implicit supervisor access with SMAP
There are two kinds of implicit supervisor access
	implicit supervisor access when CPL = 3
	implicit supervisor access when CPL < 3

Current permission_fault() handles only the first kind for SMAP.

But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.

So the second kind should be also supported.

The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.

But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.

The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.

This patch removes the call to ->get_cpl() for access mode is determined
by @access.  Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP.  The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it.  The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:43 -04:00
Lai Jiangshan
5b22bbe717 KVM: X86: Change the type of access u32 to u64
Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().

The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.

And SMAP relies a factor for supervisor access: explicit or implicit.

So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.

Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
	o Extra bits will be in the higher 32 bits, so that we can
	  easily obtain the traditional access mode (UWX) by converting
	  it to u32.
	o Reuse the value for the access kind defined by SVM's nested
	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
	  @error_code in kvm_handle_page_fault().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:42 -04:00
Jim Mattson
95b065bf5c KVM: x86/pmu: Use different raw event masks for AMD and Intel
The third nybble of AMD's event select overlaps with Intel's IN_TX and
IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel
platforms that support TSX.

Declare a raw_event_mask in the kvm_pmu structure, initialize it in
the vendor-specific pmu_refresh() functions, and use that mask for
PERF_TYPE_RAW configurations in reprogram_gp_counter().

Fixes: 710c476514 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220308012452.3468611-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:40 -04:00
Paolo Bonzini
a1a39128fa KVM: MMU: propagate alloc_workqueue failure
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
kvm_arch_init_vm also has to undo all the initialization, so
group all the MMU initialization code at the beginning and
handle cleaning up of kvm_page_track_init.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:38 -04:00
Linus Torvalds
88e6c02076 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted bits and pieces"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: drop needless assignment in aio_read()
  clean overflow checks in count_mounts() a bit
  seq_file: fix NULL pointer arithmetic warning
  uml/x86: use x86 load_unaligned_zeropad()
  asm/user.h: killed unused macros
  constify struct path argument of finish_automount()/do_add_mount()
  fs: Remove FIXME comment in generic_write_checks()
2022-04-01 19:57:03 -07:00
Masami Hiramatsu
f3a112c0c4 x86,rethook,kprobes: Replace kretprobe with rethook on x86
Replaces the kretprobe code with rethook on x86. With this patch,
kretprobe on x86 uses the rethook instead of kretprobe specific
trampoline code.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/164826163692.2455864.13745421016848209527.stgit@devnote2
2022-03-28 19:38:51 -07:00
Linus Torvalds
7001052160 Add support for Intel CET-IBT, available since Tigerlake (11th gen), which is a
coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism
 where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP.
 
 Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is
 limited to 2 instructions (and typically fewer) on branch targets not starting
 with ENDBR. CET-IBT also limits speculation of the next sequential instruction
 after the indirect CALL/JMP [1].
 
 CET-IBT is fundamentally incompatible with retpolines, but provides, as
 described above, speculation limits itself.
 
 [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmI/LI8VHHBldGVyekBp
 bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6ZnkP/2QCgQLTu6oRxv9O020CHwlaSEeD
 1Hoy3loum5q5hAi1Ik3dR9p0H5u64c9qbrBVxaFoNKaLt5GKrtHaDSHNk2L/CFHX
 urpH65uvTLxbyZzcahkAahoJ71XU+m7PcrHLWMunw9sy10rExYVsUOlFyoyG6XCF
 BDCNZpdkC09ZM3vwlWGMZd5Pp+6HcZNPyoV9tpvWAS2l+WYFWAID7mflbpQ+tA8b
 y/hM6b3Ud0rT2ubuG1iUpopgNdwqQZ+HisMPGprh+wKZkYwS2l8pUTrz0MaBkFde
 go7fW16kFy2HQzGm6aIEBmfcg0palP/mFVaWP0zS62LwhJSWTn5G6xWBr3yxSsht
 9gWCiI0oDZuTg698MedWmomdG2SK6yAuZuqmdKtLLoWfWgviPEi7TDFG/cKtZdAW
 ag8GM8T4iyYZzpCEcWO9GWbjo6TTGq30JBQefCBG47GjD0csv2ubXXx0Iey+jOwT
 x3E8wnv9dl8V9FSd/tMpTFmje8ges23yGrWtNpb5BRBuWTeuGiBPZED2BNyyIf+T
 dmewi2ufNMONgyNp27bDKopY81CPAQq9cVxqNm9Cg3eWPFnpOq2KGYEvisZ/rpEL
 EjMQeUBsy/C3AUFAleu1vwNnkwP/7JfKYpN00gnSyeQNZpqwxXBCKnHNgOMTXyJz
 beB/7u2KIUbKEkSN
 =jZfK
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra:
 "Add support for Intel CET-IBT, available since Tigerlake (11th gen),
  which is a coarse grained, hardware based, forward edge
  Control-Flow-Integrity mechanism where any indirect CALL/JMP must
  target an ENDBR instruction or suffer #CP.

  Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation
  is limited to 2 instructions (and typically fewer) on branch targets
  not starting with ENDBR. CET-IBT also limits speculation of the next
  sequential instruction after the indirect CALL/JMP [1].

  CET-IBT is fundamentally incompatible with retpolines, but provides,
  as described above, speculation limits itself"

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html

* tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  kvm/emulate: Fix SETcc emulation for ENDBR
  x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0
  x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0
  kbuild: Fixup the IBT kbuild changes
  x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
  x86: Remove toolchain check for X32 ABI capability
  x86/alternative: Use .ibt_endbr_seal to seal indirect calls
  objtool: Find unused ENDBR instructions
  objtool: Validate IBT assumptions
  objtool: Add IBT/ENDBR decoding
  objtool: Read the NOENDBR annotation
  x86: Annotate idtentry_df()
  x86,objtool: Move the ASM_REACHABLE annotation to objtool.h
  x86: Annotate call_on_stack()
  objtool: Rework ASM_REACHABLE
  x86: Mark __invalid_creds() __noreturn
  exit: Mark do_group_exit() __noreturn
  x86: Mark stop_this_cpu() __noreturn
  objtool: Ignore extra-symbol code
  objtool: Rename --duplicate to --lto
  ...
2022-03-27 10:17:23 -07:00
Linus Torvalds
148a650476 pci-v5.18-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmI7iOwUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxkuhAAtJkVwfeyUjZ8sms+qWdZaucJmFF1
 PDeKy8O8upLzRRykdWoAOjKKVcCB9ohxBjPMco2oYNTmSozxeau8jjMA9OTQvTOS
 ZhDDoi49/vHRHuq3WIeAMCuk7tH3H1L3f0UHJxJ3H/oObQ+eMsitPcGFK+QrISDX
 pYokOnXZvf7BT7NpVtogSe2mhniOD1zQSicAMiH6WKNHHZcxewrzV9LP3MFOoBAr
 VMhlhzJbOp9spvCt7M1DycJEQ2RNe+wGLBFDalhPuprwnkNchRV+0AwWfD90zc9u
 h/0J8jkXfqS6QfSd/lOlTvI6kGsV8UKZEt4h4X/hlHFebFM5ktD9X7GmcoYUDFd9
 aHV3I/Jf62uGJ31IrT0V/cSYNlMO+IVFwXLGir4B1cFPOkzyIG/i60iV/C6bnnCa
 TCMH6vxalFycYaHBFqw/K/Dlq+mrAX74nQDfbk8y6rprczM1BN220Z8BkpG13TBu
 MxgCEul2/BJmNcPS1IWb/mCfBy+rdrVn2DZuID3J9KTwKNOUTIuAF0FuxLP4Bk4o
 sti3vKIXOcHnAcJB9tEnpEfstPv2JT13eWDIMmp/qCwqcujOvsg/DSYrx+8ogmBF
 DJ/sbPy3BdIOAeTgepWHAxYcv9SlZTGJGl+oaR1zV0qLBogyQUWZ9Ijx5aAEAw3j
 AJicpdk3BkH3LC8=
 =5Q9H
 -----END PGP SIGNATURE-----

Merge tag 'pci-v5.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:
   - Move the VGA arbiter from drivers/gpu to drivers/pci because it's
     PCI-specific, not GPU-specific (Bjorn Helgaas)
   - Select the default VGA device consistently whether it's enumerated
     before or after VGA arbiter init, which fixes arches that enumerate
     PCI devices late (Huacai Chen)

  Resource management:
   - Support BAR sizes up to 8TB (Dongdong Liu)

  PCIe native device hotplug:
   - Fix "Command Completed" tracking to avoid spurious timouts when
     powering off empty slots (Liguang Zhang)
   - Quirk Qualcomm devices that don't implement Command Completed
     correctly, again to avoid spurious timeouts (Manivannan Sadhasivam)

  Peer-to-peer DMA:
   - Add Intel 3rd Gen Intel Xeon Scalable Processors to whitelist
     (Michael J. Ruhl)

  APM X-Gene PCIe controller driver:
   - Revert generic DT parsing changes that broke some machines in the
     field (Marc Zyngier)

  Freescale i.MX6 PCIe controller driver:
   - Allow controller probe to succeed even when no devices currently
     present to allow hot-add later (Fabio Estevam)
   - Enable power management on i.MX6QP (Richard Zhu)
   - Assert CLKREQ# on i.MX8MM so enumeration doesn't hang when no
     device is connected (Richard Zhu)

  Marvell Aardvark PCIe controller driver:
   - Fix MSI and MSI-X support (Marek Behún, Pali Rohár)
   - Add support for ERR and PME interrupts (Pali Rohár)

  Marvell MVEBU PCIe controller driver:
   - Add DT binding and support for "num-lanes" (Pali Rohár)
   - Add support for INTx interrupts (Pali Rohár)

  Microsoft Hyper-V host bridge driver:
   - Avoid unnecessary hypercalls when unmasking IRQs on ARM64 (Boqun
     Feng)

  Qualcomm PCIe controller driver:
   - Add SM8450 DT binding and driver support (Dmitry Baryshkov)

  Renesas R-Car PCIe controller driver:
   - Help the controller get to the L1 state since the hardware can't do
     it on its own (Marek Vasut)
   - Return PCI_ERROR_RESPONSE (~0) for reads that fail on PCIe (Marek
     Vasut)

  SiFive FU740 PCIe controller driver:
   - Drop redundant '-gpios' from DT GPIO lookup (Ben Dooks)
   - Force 2.5GT/s for initial device probe (Ben Dooks)

  Socionext UniPhier Pro5 controller driver:
   - Add NX1 DT binding and driver support (Kunihiko Hayashi)

  Synopsys DesignWare PCIe controller driver:
   - Restore MSI configuration so MSI works after resume (Jisheng
     Zhang)"

* tag 'pci-v5.18-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (94 commits)
  x86/PCI: Add #includes to asm/pci_x86.h
  PCI: ibmphp: Remove unused assignments
  PCI: cpqphp: Remove unused assignments
  PCI: fu740: Remove unused assignments
  PCI: kirin: Remove unused assignments
  PCI: Remove unused assignments
  PCI: Declare pci_filp_private only when HAVE_PCI_MMAP
  PCI: Avoid broken MSI on SB600 USB devices
  PCI: fu740: Force 2.5GT/s for initial device probe
  PCI: xgene: Revert "PCI: xgene: Fix IB window setup"
  PCI: xgene: Revert "PCI: xgene: Use inbound resources for setup"
  PCI: imx6: Assert i.MX8MM CLKREQ# even if no device present
  PCI: imx6: Invoke the PHY exit function after PHY power off
  PCI: rcar: Use PCI_SET_ERROR_RESPONSE after read which triggered an exception
  PCI: rcar: Finish transition to L1 state in rcar_pcie_config_access()
  PCI: dwc: Restore MSI Receiver mask during resume
  PCI: fu740: Drop redundant '-gpios' from DT GPIO lookup
  PCI/VGA: Replace full MIT license text with SPDX identifier
  PCI/VGA: Use unsigned format string to print lock counts
  PCI/VGA: Log bridge control messages when adding devices
  ...
2022-03-25 13:02:05 -07:00
Linus Torvalds
1464677662 platform-drivers-x86 for v5.18-1
Highlights:
 - new drivers:
   - AMD Host System Management Port (HSMP)
   - Intel Software Defined Silicon
 - removed drivers (functionality folded into other drivers):
   - intel_cht_int33fe_microb
   - surface3_button
 - amd-pmc:
   - s2idle bug-fixes
   - Support for AMD Spill to DRAM STB feature
 - hp-wmi:
   - Fix SW_TABLET_MODE detection method (and other fixes)
   - Support omen thermal profile policy v1
 - serial-multi-instantiate:
   - Add SPI device support
   - Add support for CS35L41 amplifiers used in new laptops
 - think-lmi:
   - syfs-class-firmware-attributes Certificate authentication support
 - thinkpad_acpi:
   - Fixes + quirks
   - Add platform_profile support on AMD based ThinkPads
 - x86-android-tablets
   - Improve Asus ME176C / TF103C support
   - Support Nextbook Ares 8, Lenovo Tab 2 830 and 1050 tablets
 - Lots of various other small fixes and hardware-id additions
 
 The following is an automated git shortlog grouped by driver:
 
 ACPI / scan:
  -  Create platform device for CS35L41
 
 ACPI / x86:
  -  Add support for LPS0 callback handler
 
 ALSA:
  -  hda/realtek: Add support for HP Laptops
 
 Add AMD system management interface:
  - Add AMD system management interface
 
 Add Intel Software Defined Silicon driver:
  - Add Intel Software Defined Silicon driver
 
 Documentation:
  -  syfs-class-firmware-attributes: Lenovo Certificate support
  -  Add x86/amd_hsmp driver
 
 ISST:
  -  Fix possible circular locking dependency detected
 
 Input:
  -  soc_button_array - add support for Microsoft Surface 3 (MSHW0028) buttons
 
 Merge remote-tracking branch 'pdx86/platform-drivers-x86-pinctrl-pmu_clk' into review-hans-gcc12:
  - Merge remote-tracking branch 'pdx86/platform-drivers-x86-pinctrl-pmu_clk' into review-hans-gcc12
 
 Merge tag 'platform-drivers-x86-serial-multi-instantiate-1' into review-hans:
  - Merge tag 'platform-drivers-x86-serial-multi-instantiate-1' into review-hans
 
 Replace acpi_bus_get_device():
  - Replace acpi_bus_get_device()
 
 amd-pmc:
  -  Only report STB errors when STB enabled
  -  Drop CPU QoS workaround
  -  Output error codes in messages
  -  Move to later in the suspend process
  -  Validate entry into the deepest state on resume
  -  uninitialized variable in amd_pmc_s2d_init()
  -  Set QOS during suspend on CZN w/ timer wakeup
  -  Add support for AMD Spill to DRAM STB feature
  -  Correct usage of SMU version
  -  Make amd_pmc_stb_debugfs_fops static
 
 asus-tf103c-dock:
  -  Make 2 global structs static
 
 asus-wmi:
  -  Fix regression when probing for fan curve control
 
 hp-wmi:
  -  support omen thermal profile policy v1
  -  Changing bios_args.data to be dynamically allocated
  -  Fix 0x05 error code reported by several WMI calls
  -  Fix SW_TABLET_MODE detection method
  -  Fix hp_wmi_read_int() reporting error (0x05)
 
 huawei-wmi:
  -  check the return value of device_create_file()
 
 i2c-multi-instantiate:
  -  Rename it for a generic serial driver name
 
 int3472:
  -  Add terminator to gpiod_lookup_table
 
 intel-uncore-freq:
  -  fix uncore_freq_common_init() error codes
 
 intel_cht_int33fe:
  -  Move to intel directory
  -  Drop Lenovo Yogabook YB1-X9x code
  -  Switch to DMI modalias based loading
 
 intel_crystal_cove_charger:
  -  Fix IRQ masking / unmasking
 
 lg-laptop:
  -  Move setting of battery charge limit to common location
 
 pinctrl:
  -  baytrail: Add pinconf group + function for the pmu_clk
 
 platform/dcdbas:
  -  move EXPORT_SYMBOL after function
 
 platform/surface:
  -  Remove Surface 3 Button driver
  -  surface3-wmi: Simplify resource management
  -  Replace acpi_bus_get_device()
  -  Reinstate platform dependency
 
 platform/x86/intel-uncore-freq:
  -  Split common and enumeration part
 
 platform/x86/intel/uncore-freq:
  -  Display uncore current frequency
  -  Use sysfs API to create attributes
  -  Move to uncore-frequency folder
 
 selftests:
  -  sdsi: test sysfs setup
 
 serial-multi-instantiate:
  -  Add SPI support
  -  Reorganize I2C functions
 
 spi:
  -  Add API to count spi acpi resources
  -  Support selection of the index of the ACPI Spi Resource before alloc
  -  Create helper API to lookup ACPI info for spi device
  -  Make spi_alloc_device and spi_add_device public again
 
 surface:
  -  surface3_power: Fix battery readings on batteries without a serial number
 
 think-lmi:
  -  Certificate authentication support
 
 thinkpad_acpi:
  -  consistently check fan_get_status return.
  -  Don't use test_bit on an integer
  -  Fix compiler warning about uninitialized err variable
  -  clean up dytc profile convert
  -  Add PSC mode support
  -  Add dual fan probe
  -  Add dual-fan quirk for T15g (2nd gen)
  -  Fix incorrect use of platform profile on AMD platforms
  -  Add quirk for ThinkPads without a fan
 
 tools arch x86:
  -  Add Intel SDSi provisiong tool
 
 touchscreen_dmi:
  -  Add info for the RWC NANOTE P8 AY07J 2-in-1
 
 x86-android-tablets:
  -  Depend on EFI and SPI
  -  Lenovo Yoga Tablet 2 830/1050 sound support
  -  Workaround Lenovo Yoga Tablet 2 830/1050 poweroff hang
  -  Add Lenovo Yoga Tablet 2 830 / 1050 data
  -  Fix EBUSY error when requesting IOAPIC IRQs
  -  Minor charger / fuel-gauge improvements
  -  Add Nextbook Ares 8 data
  -  Add IRQ to Asus ME176C accelerometer info
  -  Add lid-switch gpio-keys pdev to Asus ME176C + TF103C
  -  Add x86_android_tablet_get_gpiod() helper
  -  Add Asus ME176C/TF103C charger and fuelgauge props
  -  Add battery swnode support
  -  Trivial typo fix for MODULE_AUTHOR
  -  Fix the buttons on CZC P10T tablet
  -  Constify the gpiod_lookup_tables arrays
  -  Add an init() callback to struct x86_dev_info
  -  Add support for disabling ACPI _AEI handlers
  -  Correct crystal_cove_charger module name
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmI8SjEUHGhkZWdvZWRl
 QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9wYUwf/cdUMPFy5cwpHq1LuqGy+PxVCRHCe
 71PFd2Ycj+HGOtrt66RxSiCC1Seb4tylr7FvudToDaqWjlBf5n6LhpDudg4ds7Qw
 lCuRlaXTIrF7p3nOLIsWvJPRqacMG79KkRM62MLTS2evtRYjbnKvFzNPJPzr8827
 1AhCakE92S8gkR5lUZYYHtsaz9rZ4z4TrEtjO6GdlbL2bDw0l18dNNwdMomfVpNS
 bBIHIDLeufDuMJ4PxIHlE5MB3AuZAuc0HTJWihozyJX/h5FMGI6qVm0/s9RAfHgX
 XdMCpADtS/JjHCmkFgLZYIzvXTxwQVZRo5VO0Wrv5Mis6gSpxJXCd0aKlA==
 =1x9/
 -----END PGP SIGNATURE-----

Merge tag 'platform-drivers-x86-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver updates from Hans de Goede:
  "New drivers:
    - AMD Host System Management Port (HSMP)
    - Intel Software Defined Silicon

  Removed drivers (functionality folded into other drivers):
    - intel_cht_int33fe_microb
    - surface3_button

  amd-pmc:
    - s2idle bug-fixes
    - Support for AMD Spill to DRAM STB feature

  hp-wmi:
    - Fix SW_TABLET_MODE detection method (and other fixes)
    - Support omen thermal profile policy v1

  serial-multi-instantiate:
    - Add SPI device support
    - Add support for CS35L41 amplifiers used in new laptops

  think-lmi:
    - syfs-class-firmware-attributes Certificate authentication support

  thinkpad_acpi:
    - Fixes + quirks
    - Add platform_profile support on AMD based ThinkPads

  x86-android-tablets:
    - Improve Asus ME176C / TF103C support
    - Support Nextbook Ares 8, Lenovo Tab 2 830 and 1050 tablets

  Lots of various other small fixes and hardware-id additions"

* tag 'platform-drivers-x86-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (60 commits)
  platform/x86: think-lmi: Certificate authentication support
  Documentation: syfs-class-firmware-attributes: Lenovo Certificate support
  platform/x86: amd-pmc: Only report STB errors when STB enabled
  platform/x86: amd-pmc: Drop CPU QoS workaround
  platform/x86: amd-pmc: Output error codes in messages
  platform/x86: amd-pmc: Move to later in the suspend process
  ACPI / x86: Add support for LPS0 callback handler
  platform/x86: thinkpad_acpi: consistently check fan_get_status return.
  platform/x86: hp-wmi: support omen thermal profile policy v1
  platform/x86: hp-wmi: Changing bios_args.data to be dynamically allocated
  platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls
  platform/x86: hp-wmi: Fix SW_TABLET_MODE detection method
  platform/x86: hp-wmi: Fix hp_wmi_read_int() reporting error (0x05)
  platform/x86: amd-pmc: Validate entry into the deepest state on resume
  platform/x86: thinkpad_acpi: Don't use test_bit on an integer
  platform/x86: thinkpad_acpi: Fix compiler warning about uninitialized err variable
  platform/x86: thinkpad_acpi: clean up dytc profile convert
  platform/x86: x86-android-tablets: Depend on EFI and SPI
  platform/x86: amd-pmc: uninitialized variable in amd_pmc_s2d_init()
  platform/x86: intel-uncore-freq: fix uncore_freq_common_init() error codes
  ...
2022-03-25 12:14:39 -07:00
Linus Torvalds
169e77764a Networking changes for 5.18.
Core
 ----
 
  - Introduce XDP multi-buffer support, allowing the use of XDP with
    jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
 
  - Speed up netns dismantling (5x) and lower the memory cost a little.
    Remove unnecessary per-netns sockets. Scope some lists to a netns.
    Cut down RCU syncing. Use batch methods. Allow netdev registration
    to complete out of order.
 
  - Support distinguishing timestamp types (ingress vs egress) and
    maintaining them across packet scrubbing points (e.g. redirect).
 
  - Continue the work of annotating packet drop reasons throughout
    the stack.
 
  - Switch netdev error counters from an atomic to dynamically
    allocated per-CPU counters.
 
  - Rework a few preempt_disable(), local_irq_save() and busy waiting
    sections problematic on PREEMPT_RT.
 
  - Extend the ref_tracker to allow catching use-after-free bugs.
 
 BPF
 ---
 
  - Introduce "packing allocator" for BPF JIT images. JITed code is
    marked read only, and used to be allocated at page granularity.
    Custom allocator allows for more efficient memory use, lower
    iTLB pressure and prevents identity mapping huge pages from
    getting split.
 
  - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
    the correct probe read access method, add appropriate helpers.
 
  - Convert the BPF preload to use light skeleton and drop
    the user-mode-driver dependency.
 
  - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
    its use as a packet generator.
 
  - Allow local storage memory to be allocated with GFP_KERNEL if called
    from a hook allowed to sleep.
 
  - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
    bits to come later).
 
  - Add unstable conntrack lookup helpers for BPF by using the BPF
    kfunc infra.
 
  - Allow cgroup BPF progs to return custom errors to user space.
 
  - Add support for AF_UNIX iterator batching.
 
  - Allow iterator programs to use sleepable helpers.
 
  - Support JIT of add, and, or, xor and xchg atomic ops on arm64.
 
  - Add BTFGen support to bpftool which allows to use CO-RE in kernels
    without BTF info.
 
  - Large number of libbpf API improvements, cleanups and deprecations.
 
 Protocols
 ---------
 
  - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
 
  - Adjust TSO packet sizes based on min_rtt, allowing very low latency
    links (data centers) to always send full-sized TSO super-frames.
 
  - Make IPv6 flow label changes (AKA hash rethink) more configurable,
    via sysctl and setsockopt. Distinguish between server and client
    behavior.
 
  - VxLAN support to "collect metadata" devices to terminate only
    configured VNIs. This is similar to VLAN filtering in the bridge.
 
  - Support inserting IPv6 IOAM information to a fraction of frames.
 
  - Add protocol attribute to IP addresses to allow identifying where
    given address comes from (kernel-generated, DHCP etc.)
 
  - Support setting socket and IPv6 options via cmsg on ping6 sockets.
 
  - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
    Define dscp_t and stop taking ECN bits into account in fib-rules.
 
  - Add support for locked bridge ports (for 802.1X).
 
  - tun: support NAPI for packets received from batched XDP buffs,
    doubling the performance in some scenarios.
 
  - IPv6 extension header handling in Open vSwitch.
 
  - Support IPv6 control message load balancing in bonding, prevent
    neighbor solicitation and advertisement from using the wrong port.
    Support NS/NA monitor selection similar to existing ARP monitor.
 
  - SMC
    - improve performance with TCP_CORK and sendfile()
    - support auto-corking
    - support TCP_NODELAY
 
  - MCTP (Management Component Transport Protocol)
    - add user space tag control interface
    - I2C binding driver (as specified by DMTF DSP0237)
 
  - Multi-BSSID beacon handling in AP mode for WiFi.
 
  - Bluetooth:
    - handle MSFT Monitor Device Event
    - add MGMT Adv Monitor Device Found/Lost events
 
  - Multi-Path TCP:
    - add support for the SO_SNDTIMEO socket option
    - lots of selftest cleanups and improvements
 
  - Increase the max PDU size in CAN ISOTP to 64 kB.
 
 Driver API
 ----------
 
  - Add HW counters for SW netdevs, a mechanism for devices which
    offload packet forwarding to report packet statistics back to
    software interfaces such as tunnels.
 
  - Select the default NIC queue count as a fraction of number of
    physical CPU cores, instead of hard-coding to 8.
 
  - Expose devlink instance locks to drivers. Allow device layer of
    drivers to use that lock directly instead of creating their own
    which always runs into ordering issues in devlink callbacks.
 
  - Add header/data split indication to guide user space enabling
    of TCP zero-copy Rx.
 
  - Allow configuring completion queue event size.
 
  - Refactor page_pool to enable fragmenting after allocation.
 
  - Add allocation and page reuse statistics to page_pool.
 
  - Improve Multiple Spanning Trees support in the bridge to allow
    reuse of topologies across VLANs, saving HW resources in switches.
 
  - DSA (Distributed Switch Architecture):
    - replay and offload of host VLAN entries
    - offload of static and local FDB entries on LAG interfaces
    - FDB isolation and unicast filtering
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - LAN937x T1 PHYs
    - Davicom DM9051 SPI NIC driver
    - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
    - Microchip ksz8563 switches
    - Netronome NFP3800 SmartNICs
    - Fungible SmartNICs
    - MediaTek MT8195 switches
 
  - WiFi:
    - mt76: MediaTek mt7916
    - mt76: MediaTek mt7921u USB adapters
    - brcmfmac: Broadcom BCM43454/6
 
  - Mobile:
    - iosm: Intel M.2 7360 WWAN card
 
 Drivers
 -------
 
  - Convert many drivers to the new phylink API built for split PCS
    designs but also simplifying other cases.
 
  - Intel Ethernet NICs:
    - add TTY for GNSS module for E810T device
    - improve AF_XDP performance
    - GTP-C and GTP-U filter offload
    - QinQ VLAN support
 
  - Mellanox Ethernet NICs (mlx5):
    - support xdp->data_meta
    - multi-buffer XDP
    - offload tc push_eth and pop_eth actions
 
  - Netronome Ethernet NICs (nfp):
    - flow-independent tc action hardware offload (police / meter)
    - AF_XDP
 
  - Other Ethernet NICs:
    - at803x: fiber and SFP support
    - xgmac: mdio: preamble suppression and custom MDC frequencies
    - r8169: enable ASPM L1.2 if system vendor flags it as safe
    - macb/gem: ZynqMP SGMII
    - hns3: add TX push mode
    - dpaa2-eth: software TSO
    - lan743x: multi-queue, mdio, SGMII, PTP
    - axienet: NAPI and GRO support
 
  - Mellanox Ethernet switches (mlxsw):
    - source and dest IP address rewrites
    - RJ45 ports
 
  - Marvell Ethernet switches (prestera):
    - basic routing offload
    - multi-chain TC ACL offload
 
  - NXP embedded Ethernet switches (ocelot & felix):
    - PTP over UDP with the ocelot-8021q DSA tagging protocol
    - basic QoS classification on Felix DSA switch using dcbnl
    - port mirroring for ocelot switches
 
  - Microchip high-speed industrial Ethernet (sparx5):
    - offloading of bridge port flooding flags
    - PTP Hardware Clock
 
  - Other embedded switches:
    - lan966x: PTP Hardward Clock
    - qca8k: mdio read/write operations via crafted Ethernet packets
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
    - enable RX PPDU stats in monitor co-exist mode
 
  - Intel WiFi (iwlwifi):
    - UHB TAS enablement via BIOS
    - band disablement via BIOS
    - channel switch offload
    - 32 Rx AMPDU sessions in newer devices
 
  - MediaTek WiFi (mt76):
    - background radar detection
    - thermal management improvements on mt7915
    - SAR support for more mt76 platforms
    - MBSSID and 6 GHz band on mt7915
 
  - RealTek WiFi:
    - rtw89: AP mode
    - rtw89: 160 MHz channels and 6 GHz band
    - rtw89: hardware scan
 
  - Bluetooth:
    - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
 
  - Microchip CAN (mcp251xfd):
    - multiple RX-FIFOs and runtime configurable RX/TX rings
    - internal PLL, runtime PM handling simplification
    - improve chip detection and error handling after wakeup
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmI7YBcACgkQMUZtbf5S
 IrveSBAAmSNJlUK6vPsnNzs7IhsZnfI/AUjm2TCLZnlhKttbpI4A/4Pohk33V7RS
 FGX7f8kjEfhUwrIiLDgeCnztNHRECrCmk6aZc/jLEvecmTauJ+f6kjShkDY/wix+
 AkPHmrZnQeLPAEVuljDdV+sL6ik08+zQL7PazIYHsaSKKC0MGQptRwcri8PLRAKE
 KPBAhVhleq2rAZ/ntprSN52F4Af6rpFTrPIWuN8Bqdbc9dy5094LT0mpOOWYvgr3
 /DLvvAPuLemwyIQkjWknVKBRUAQcmNPC+BY3J8K3LRaiNhekGqOFan46BfqP+k2J
 6DWu0Qrp2yWt4BMOeEToZR5rA6v5suUAMIBu8PRZIDkINXQMlIxHfGjZyNm0rVfw
 7edNri966yus9OdzwPa32MIG3oC6PnVAwYCJAjjBMNS8sSIkp7wgHLkgWN4UFe2H
 K/e6z8TLF4UQ+zFM0aGI5WZ+9QqWkTWEDF3R3OhdFpGrznna0gxmkOeV2YvtsgxY
 cbS0vV9Zj73o+bYzgBKJsw/dAjyLdXoHUGvus26VLQ78S/VGunVKtItwoxBAYmZo
 krW964qcC89YofzSi8RSKLHuEWtNWZbVm8YXr75u6jpr5GhMBu0CYefLs+BuZcxy
 dw8c69cGneVbGZmY2J3rBhDkchbuICl8vdUPatGrOJAoaFdYKuw=
 =ELpe
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "The sprinkling of SPI drivers is because we added a new one and Mark
  sent us a SPI driver interface conversion pull request.

  Core
  ----

   - Introduce XDP multi-buffer support, allowing the use of XDP with
     jumbo frame MTUs and combination with Rx coalescing offloads (LRO).

   - Speed up netns dismantling (5x) and lower the memory cost a little.
     Remove unnecessary per-netns sockets. Scope some lists to a netns.
     Cut down RCU syncing. Use batch methods. Allow netdev registration
     to complete out of order.

   - Support distinguishing timestamp types (ingress vs egress) and
     maintaining them across packet scrubbing points (e.g. redirect).

   - Continue the work of annotating packet drop reasons throughout the
     stack.

   - Switch netdev error counters from an atomic to dynamically
     allocated per-CPU counters.

   - Rework a few preempt_disable(), local_irq_save() and busy waiting
     sections problematic on PREEMPT_RT.

   - Extend the ref_tracker to allow catching use-after-free bugs.

  BPF
  ---

   - Introduce "packing allocator" for BPF JIT images. JITed code is
     marked read only, and used to be allocated at page granularity.
     Custom allocator allows for more efficient memory use, lower iTLB
     pressure and prevents identity mapping huge pages from getting
     split.

   - Make use of BTF type annotations (e.g. __user, __percpu) to enforce
     the correct probe read access method, add appropriate helpers.

   - Convert the BPF preload to use light skeleton and drop the
     user-mode-driver dependency.

   - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
     its use as a packet generator.

   - Allow local storage memory to be allocated with GFP_KERNEL if
     called from a hook allowed to sleep.

   - Introduce fprobe (multi kprobe) to speed up mass attachment (arch
     bits to come later).

   - Add unstable conntrack lookup helpers for BPF by using the BPF
     kfunc infra.

   - Allow cgroup BPF progs to return custom errors to user space.

   - Add support for AF_UNIX iterator batching.

   - Allow iterator programs to use sleepable helpers.

   - Support JIT of add, and, or, xor and xchg atomic ops on arm64.

   - Add BTFGen support to bpftool which allows to use CO-RE in kernels
     without BTF info.

   - Large number of libbpf API improvements, cleanups and deprecations.

  Protocols
  ---------

   - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.

   - Adjust TSO packet sizes based on min_rtt, allowing very low latency
     links (data centers) to always send full-sized TSO super-frames.

   - Make IPv6 flow label changes (AKA hash rethink) more configurable,
     via sysctl and setsockopt. Distinguish between server and client
     behavior.

   - VxLAN support to "collect metadata" devices to terminate only
     configured VNIs. This is similar to VLAN filtering in the bridge.

   - Support inserting IPv6 IOAM information to a fraction of frames.

   - Add protocol attribute to IP addresses to allow identifying where
     given address comes from (kernel-generated, DHCP etc.)

   - Support setting socket and IPv6 options via cmsg on ping6 sockets.

   - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
     Define dscp_t and stop taking ECN bits into account in fib-rules.

   - Add support for locked bridge ports (for 802.1X).

   - tun: support NAPI for packets received from batched XDP buffs,
     doubling the performance in some scenarios.

   - IPv6 extension header handling in Open vSwitch.

   - Support IPv6 control message load balancing in bonding, prevent
     neighbor solicitation and advertisement from using the wrong port.
     Support NS/NA monitor selection similar to existing ARP monitor.

   - SMC
      - improve performance with TCP_CORK and sendfile()
      - support auto-corking
      - support TCP_NODELAY

   - MCTP (Management Component Transport Protocol)
      - add user space tag control interface
      - I2C binding driver (as specified by DMTF DSP0237)

   - Multi-BSSID beacon handling in AP mode for WiFi.

   - Bluetooth:
      - handle MSFT Monitor Device Event
      - add MGMT Adv Monitor Device Found/Lost events

   - Multi-Path TCP:
      - add support for the SO_SNDTIMEO socket option
      - lots of selftest cleanups and improvements

   - Increase the max PDU size in CAN ISOTP to 64 kB.

  Driver API
  ----------

   - Add HW counters for SW netdevs, a mechanism for devices which
     offload packet forwarding to report packet statistics back to
     software interfaces such as tunnels.

   - Select the default NIC queue count as a fraction of number of
     physical CPU cores, instead of hard-coding to 8.

   - Expose devlink instance locks to drivers. Allow device layer of
     drivers to use that lock directly instead of creating their own
     which always runs into ordering issues in devlink callbacks.

   - Add header/data split indication to guide user space enabling of
     TCP zero-copy Rx.

   - Allow configuring completion queue event size.

   - Refactor page_pool to enable fragmenting after allocation.

   - Add allocation and page reuse statistics to page_pool.

   - Improve Multiple Spanning Trees support in the bridge to allow
     reuse of topologies across VLANs, saving HW resources in switches.

   - DSA (Distributed Switch Architecture):
      - replay and offload of host VLAN entries
      - offload of static and local FDB entries on LAG interfaces
      - FDB isolation and unicast filtering

  New hardware / drivers
  ----------------------

   - Ethernet:
      - LAN937x T1 PHYs
      - Davicom DM9051 SPI NIC driver
      - Realtek RTL8367S, RTL8367RB-VB switch and MDIO
      - Microchip ksz8563 switches
      - Netronome NFP3800 SmartNICs
      - Fungible SmartNICs
      - MediaTek MT8195 switches

   - WiFi:
      - mt76: MediaTek mt7916
      - mt76: MediaTek mt7921u USB adapters
      - brcmfmac: Broadcom BCM43454/6

   - Mobile:
      - iosm: Intel M.2 7360 WWAN card

  Drivers
  -------

   - Convert many drivers to the new phylink API built for split PCS
     designs but also simplifying other cases.

   - Intel Ethernet NICs:
      - add TTY for GNSS module for E810T device
      - improve AF_XDP performance
      - GTP-C and GTP-U filter offload
      - QinQ VLAN support

   - Mellanox Ethernet NICs (mlx5):
      - support xdp->data_meta
      - multi-buffer XDP
      - offload tc push_eth and pop_eth actions

   - Netronome Ethernet NICs (nfp):
      - flow-independent tc action hardware offload (police / meter)
      - AF_XDP

   - Other Ethernet NICs:
      - at803x: fiber and SFP support
      - xgmac: mdio: preamble suppression and custom MDC frequencies
      - r8169: enable ASPM L1.2 if system vendor flags it as safe
      - macb/gem: ZynqMP SGMII
      - hns3: add TX push mode
      - dpaa2-eth: software TSO
      - lan743x: multi-queue, mdio, SGMII, PTP
      - axienet: NAPI and GRO support

   - Mellanox Ethernet switches (mlxsw):
      - source and dest IP address rewrites
      - RJ45 ports

   - Marvell Ethernet switches (prestera):
      - basic routing offload
      - multi-chain TC ACL offload

   - NXP embedded Ethernet switches (ocelot & felix):
      - PTP over UDP with the ocelot-8021q DSA tagging protocol
      - basic QoS classification on Felix DSA switch using dcbnl
      - port mirroring for ocelot switches

   - Microchip high-speed industrial Ethernet (sparx5):
      - offloading of bridge port flooding flags
      - PTP Hardware Clock

   - Other embedded switches:
      - lan966x: PTP Hardward Clock
      - qca8k: mdio read/write operations via crafted Ethernet packets

   - Qualcomm 802.11ax WiFi (ath11k):
      - add LDPC FEC type and 802.11ax High Efficiency data in radiotap
      - enable RX PPDU stats in monitor co-exist mode

   - Intel WiFi (iwlwifi):
      - UHB TAS enablement via BIOS
      - band disablement via BIOS
      - channel switch offload
      - 32 Rx AMPDU sessions in newer devices

   - MediaTek WiFi (mt76):
      - background radar detection
      - thermal management improvements on mt7915
      - SAR support for more mt76 platforms
      - MBSSID and 6 GHz band on mt7915

   - RealTek WiFi:
      - rtw89: AP mode
      - rtw89: 160 MHz channels and 6 GHz band
      - rtw89: hardware scan

   - Bluetooth:
      - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)

   - Microchip CAN (mcp251xfd):
      - multiple RX-FIFOs and runtime configurable RX/TX rings
      - internal PLL, runtime PM handling simplification
      - improve chip detection and error handling after wakeup"

* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
  llc: fix netdevice reference leaks in llc_ui_bind()
  drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
  ice: don't allow to run ice_send_event_to_aux() in atomic ctx
  ice: fix 'scheduling while atomic' on aux critical err interrupt
  net/sched: fix incorrect vlan_push_eth dest field
  net: bridge: mst: Restrict info size queries to bridge ports
  net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
  drivers: net: xgene: Fix regression in CRC stripping
  net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
  net: dsa: fix missing host-filtered multicast addresses
  net/mlx5e: Fix build warning, detected write beyond size of field
  iwlwifi: mvm: Don't fail if PPAG isn't supported
  selftests/bpf: Fix kprobe_multi test.
  Revert "rethook: x86: Add rethook x86 implementation"
  Revert "arm64: rethook: Add arm64 rethook implementation"
  Revert "powerpc: Add rethook support"
  Revert "ARM: rethook: Add rethook arm implementation"
  netdevice: add missing dm_private kdoc
  net: bridge: mst: prevent NULL deref in br_mst_info_size()
  selftests: forwarding: Use same VRF for port and VLAN upper
  ...
2022-03-24 13:13:26 -07:00