linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-09-07 14:19:35 +00:00

Author	SHA1	Message	Date
Li zeming	1de9992f9d	KVM: x86/mmu: Remove unnecessary ‘NULL’ values from sptep Don't initialize "spte" and "sptep" in fast_page_fault() as they are both guaranteed (for all intents and purposes) to be written at the start of every loop iteration. Add a sanity check that "sptep" is non-NULL after walking the shadow page tables, as encountering a NULL root would result in "spte" not being written, i.e. would lead to uninitialized data or the previous value being consumed. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20230905182006.2964-1-zeming@nfschina.com [sean: rewrite changelog with --verbose] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-18 14:34:28 -07:00
Yan Zhao	1affe455d6	KVM: x86/mmu: Add helpers to return if KVM honors guest MTRRs Add helpers to check if KVM honors guest MTRRs instead of open coding the logic in kvm_tdp_page_fault(). Future fixes and cleanups will also need to determine if KVM should honor guest MTRRs, e.g. for CR0.CD toggling and and non-coherent DMA transitions. Provide an inner helper, __kvm_mmu_honors_guest_mtrrs(), so that KVM can check if guest MTRRs were honored when stopping non-coherent DMA. Note, there is no need to explicitly check that TDP is enabled, KVM clears shadow_memtype_mask when TDP is disabled, i.e. it's non-zero if and only if EPT is enabled. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230714065006.20201-1-yan.y.zhao@intel.com Link: https://lore.kernel.org/r/20230714065043.20258-1-yan.y.zhao@intel.com [sean: squash into a one patch, drop explicit TDP check massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-09 14:34:57 -07:00
Qi Zheng	e5985c4098	kvm: mmu: dynamically allocate the x86-mmu shrinker Use new APIs to dynamically allocate the x86-mmu shrinker. Link: https://lkml.kernel.org/r/20230911094444.68966-3-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:32:23 -07:00
Sean Christopherson	0df9dab891	KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously Stop zapping invalidate TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended during if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, beyond the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang<zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-09-23 05:35:48 -04:00
Paolo Bonzini	441a5dfcd9	KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() All callers except the MMU notifier want to process all address spaces. Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe() and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe(). Extracted out of a patch by Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-09-23 05:35:12 -04:00
Sean Christopherson	50107e8b2a	KVM: x86/mmu: Open code leaf invalidation from mmu_notifier The mmu_notifier path is a bit of a special snowflake, e.g. it zaps only a single address space (because it's per-slot), and can't always yield. Because of this, it calls kvm_tdp_mmu_zap_leafs() in ways that no one else does. Iterate manually over the leafs in response to an mmu_notifier invalidation, instead of invoking kvm_tdp_mmu_zap_leafs(). Drop the @can_yield param from kvm_tdp_mmu_zap_leafs() as its sole remaining caller unconditionally passes "true". Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-09-21 05:47:57 -04:00
Sean Christopherson	0e3223d8d0	KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots When attempting to allocate a shadow root for a !visible guest root gfn, e.g. that resides in MMIO space, load a dummy root that is backed by the zero page instead of immediately synthesizing a triple fault shutdown (using the zero page ensures any attempt to translate memory will generate a !PRESENT fault and thus VM-Exit). Unless the vCPU is racing with memslot activity, KVM will inject a page fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e. the end result is mostly same, but critically KVM will inject a fault only after KVM runs the vCPU with the bogus root. Waiting to inject a fault until after running the vCPU fixes a bug where KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled and a !visible root. Even though a bad root will probably lead to shutdown, (a) it's not guaranteed and (b) the CPU won't read the underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with a VMX preemption timer value of '0', then architecturally the preemption timer VM-Exit is guaranteed to occur before the CPU executes any instruction, i.e. before the CPU needs to translate a GPA to a HPA (so long as there are no injected events with higher priority than the preemption timer). If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because userspace created a memslot between installing the dummy root and handling the page fault, simply unload the MMU to allocate a new root and retry the instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and conceptually the dummy root has indeeed become obsolete. The only difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is that the root has become obsolete due to memslot creation, not memslot deletion or movement. Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:24 -04:00
Sean Christopherson	c30e000e69	KVM: x86/mmu: Harden new PGD against roots without shadow pages Harden kvm_mmu_new_pgd() against NULL pointer dereference bugs by sanity checking that the target root has an associated shadow page prior to dereferencing said shadow page. The code in question is guaranteed to only see roots with shadow pages as fast_pgd_switch() explicitly frees the current root if it doesn't have a shadow page, i.e. is a PAE root, and that in turn prevents valid roots from being cached, but that's all very subtle. Link: https://lore.kernel.org/r/20230729005200.1057358-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:21 -04:00
Sean Christopherson	c5f2d5645f	KVM: x86/mmu: Add helper to convert root hpa to shadow page Add a dedicated helper for converting a root hpa to a shadow page in anticipation of using a "dummy" root to handle the scenario where KVM needs to load a valid shadow root (from hardware's perspective), but the guest doesn't have a visible root to shadow. Similar to PAE roots, the dummy root won't have an associated kvm_mmu_page and will need special handling when finding a shadow page given a root. Opportunistically retrieve the root shadow page in kvm_mmu_sync_roots() after verifying the root is unsync (the dummy root can never be unsync). Link: https://lore.kernel.org/r/20230729005200.1057358-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:20 -04:00
Sean Christopherson	96316a0670	KVM: x86/mmu: Drop @slot param from exported/external page-track APIs Refactor KVM's exported/external page-track, a.k.a. write-track, APIs to take only the gfn and do the required memslot lookup in KVM proper. Forcing users of the APIs to get the memslot unnecessarily bleeds KVM internals into KVMGT and complicates usage of the APIs. No functional change intended. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:18 -04:00
Sean Christopherson	7b574863e7	KVM: x86/mmu: Rename page-track APIs to reflect the new reality Rename the page-track APIs to capture that they're all about tracking writes, now that the facade of supporting multiple modes is gone. Opportunstically replace "slot" with "gfn" in anticipation of removing the @slot param from the external APIs. No functional change intended. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:15 -04:00
Sean Christopherson	338068b5be	KVM: x86/mmu: Drop infrastructure for multiple page-track modes Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Ben Gardon <bgardon@google.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:14 -04:00
Sean Christopherson	58ea7cf700	KVM: x86/mmu: Move KVM-only page-track declarations to internal header Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:13 -04:00
Yan Zhao	d104d5bbbc	KVM: x86: Remove the unused page-track hook track_flush_slot() Remove ->track_remove_slot(), there are no longer any users and it's unlikely a "flush" hook will ever be the correct API to provide to an external page-track user. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:26 -04:00
Sean Christopherson	932844462a	KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEs Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:00 -04:00
Sean Christopherson	eeb87272a3	KVM: x86/mmu: Don't rely on page-track mechanism to flush on memslot change Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of bouncing through the page-track mechanism. KVM (unfortunately) needs to zap and flush all page tables on memslot DELETE/MOVE irrespective of whether KVM is shadowing guest page tables. This will allow changing KVM to register a page-track notifier on the first shadow root allocation, and will also allow deleting the misguided kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a different method for reacting to memslot changes. No functional change intended. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:00 -04:00
Sean Christopherson	db0d70e610	KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.c Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:59 -04:00
Sean Christopherson	52e322eda3	KVM: x86/mmu: BUG() in rmap helpers iff CONFIG_BUG_ON_DATA_CORRUPTION=y Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() on corruption of host kernel data structures. Environments that don't have infrastructure to automatically capture crash dumps, i.e. aren't likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better served overall by WARN-and-continue behavior (for the kernel, the VM is dead regardless), as a BUG() while holding mmu_lock all but guarantees the _best_ case scenario is a panic(). Make the BUG()s conditional instead of removing/replacing them entirely as there's a non-zero chance (though by no means a guarantee) that the damage isn't contained to the target VM, e.g. if no rmap is found for a SPTE then KVM may be double-zapping the SPTE, i.e. has already freed the memory the SPTE pointed at and thus KVM is reading/writing memory that KVM no longer owns. Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com Suggested-by: Mingwei Zhang <mizhang@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:50 -04:00
Mingwei Zhang	069f30c619	KVM: x86/mmu: Plumb "struct kvm" all the way to pte_list_remove() Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending VM instead of doing BUG() if the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data structures (rmaps) appear to be corrupted. Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: tweak changelog] Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:49 -04:00
Sean Christopherson	870d4d4ed8	KVM: x86/mmu: Replace MMU_DEBUG with proper KVM_PROVE_MMU Kconfig Replace MMU_DEBUG, which requires manually modifying KVM to enable the macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and won't flood the kernel logs, enabling the option for debug kernels is both desirable and feasible. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:47 -04:00
Sean Christopherson	20ba462dfd	KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE() Convert all "runtime" assertions, i.e. assertions that can be triggered while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the MMU that is tied to running vCPUs, i.e. not contained to loading and initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if KVM ends up with a bug that causes a root to be invalidated before the page fault handler is invoked, pretty much _every_ page fault VM-Exit triggers the WARN. If a WARN is triggered frequently, the resulting spam usually causes a lot of damage of its own, e.g. consumes resources to log the WARN and pollutes the kernel log, often to the point where other useful information can be lost. In many case, the damage caused by the spam is actually worse than the bug itself, e.g. KVM can almost always recover from an unexpectedly invalid root. On the flip side, warning every time is rarely helpful for debug and triage, i.e. a single splat is usually sufficient to point a debugger in the right direction, and automated testing, e.g. syzkaller, typically runs with warn_on_panic=1, i.e. will never get past the first WARN anyways. Lastly, when an assertions fails multiple times, the stack traces in KVM are almost always identical, i.e. the full splat only needs to be captured once. And _if_ there is value in captruing information about the failed assert, a ratelimited printk() is sufficient and less likely to rack up a large amount of collateral damage. Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:44 -04:00
Sean Christopherson	0fe6370eb3	KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON() Rename MMU_WARN_ON() to make it super obvious that the assertions are all about KVM's MMU, not the primary MMU. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:43 -04:00
Sean Christopherson	58da926caa	KVM: x86/mmu: Cleanup sanity check of SPTEs at SP free Massage the error message for the sanity check on SPTEs when freeing a shadow page to be more verbose, and to print out all shadow-present SPTEs, not just the first SPTE encountered. Printing all SPTEs can be quite valuable for debug, e.g. highlights whether the leak is a one-off or widepsread, or possibly the result of memory corruption (something else in the kernel stomping on KVM's SPTEs). Opportunistically move the MMU_WARN_ON() into the helper itself, which will allow a future cleanup to use BUILD_BUG_ON_INVALID() as the stub for MMU_WARN_ON(). BUILD_BUG_ON_INVALID() works as intended and results in the compiler complaining about is_empty_shadow_page() not being declared. Link: https://lore.kernel.org/r/20230729004722.1056172-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:43 -04:00
Sean Christopherson	242a6dd8da	KVM: x86/mmu: Avoid pointer arithmetic when iterating over SPTEs Replace the pointer arithmetic used to iterate over SPTEs in is_empty_shadow_page() with more standard interger-based iteration. No functional change intended. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:42 -04:00
Sean Christopherson	c4f92cfe02	KVM: x86/mmu: Delete the "dbg" module param Delete KVM's "dbg" module param now that its usage in KVM is gone (it used to guard pgprintk() and rmap_printk()). Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:41 -04:00
Sean Christopherson	350c49fdea	KVM: x86/mmu: Delete rmap_printk() and all its usage Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed into something that can be regularly enabled for debug kernels. The information provided by rmap_printk() isn't all that useful now that the rmap and unsync code is mature, as the prints are simultaneously too verbose (_lots_ of message) and yet not verbose enough to be helpful for debug (most instances print just the SPTE pointer/value, which is rarely sufficient to root cause anything but trivial bugs). Alternatively, rmap_printk() could be reworked to into tracepoints, but it's not clear there is a real need as rmap bugs rarely escape initial development, and when bugs do escape to production, they are often edge cases and/or reside in code that isn't directly related to the rmaps. In other words, the problems with rmap_printk() being unhelpful also apply to tracepoints. And deleting rmap_printk() doesn't preclude adding tracepoints in the future. Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:40 -04:00
Sean Christopherson	a98b889492	KVM: x86/mmu: Delete pgprintk() and all its usage Delete KVM's pgprintk() and all its usage, as the code is very prone to bitrot due to being buried behind MMU_DEBUG, and the functionality has been rendered almost entirely obsolete by the tracepoints KVM has gained over the years. And for the situations where the information provided by KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps, and is almost always far too noisy, i.e. developers end up implementing custom prints anyways. Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:39 -04:00
Sean Christopherson	d09f711233	KVM: x86/mmu: Guard against collision with KVM-defined PFERR_IMPLICIT_ACCESS Add an assertion in kvm_mmu_page_fault() to ensure the error code provided by hardware doesn't conflict with KVM's software-defined IMPLICIT_ACCESS flag. In the unlikely scenario that future hardware starts using bit 48 for a hardware-defined flag, preserving the bit could result in KVM incorrectly interpreting the unknown flag as KVM's IMPLICIT_ACCESS flag. WARN so that any such conflict can be surfaced to KVM developers and resolved, but otherwise ignore the bit as KVM can't possibly rely on a flag it knows nothing about. Fixes: `4f4aa80e3b` ("KVM: X86: Handle implicit supervisor access with SMAP") Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230721223711.2334426-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:39 -04:00
Paolo Bonzini	6d5e3c318a	KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7 wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058 5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8 2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J gLdtzs8M9K9t =yGM1 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID	2023-08-31 13:36:33 -04:00
Paolo Bonzini	0d15bf966d	Common KVM changes for 6.6: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTudpYSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5xJUQAKnMVEV+7gRtfV5KCJFRTNAMxo4zSIt/ K6QX+x/SwUriXj4nTAlvAtju1xz4nwTYBABKj3bXEaLpVjIUIbnEzEGuTKKK6XY9 UyJKVgafwLuKLWPYN/5Zv5SCO7DmVC9W3lVMtchgt7gFcRxtZhmEn53boHhrhan0 /2L5XD6N9rd81Zmd/rQkJNRND7XY3HkvDSnfmsRI/rfFUglCUHBDp4c2Wkmz+Dnb ux7N37si5OTbRVp18VzbLg1jalstDEm36ZQ7tLkvIbNbZV6pV93/ZmcTmsOruTeN gHVr6/RXmKKwgO3wtZ9DKL6oMcoh20yoT+vqhbaihVssLPGPusk7S2cCQ7529u8/ Oda+w67MMdbE46N9CmB56fkpwNvn9nLCoQFhMhXBWhPJVNmorpiR6drHKqLy5zCq lrsWGqXU/DXA2PwdsztfIIMVeALawzExHu9ayppcKwb4S8TLJhLma7dT+EvwUxuV hoswaIT7Tq2ptZ34Fo5/vEz+90u2wi7LynHrNdTs7NLsW+WI/jab7KxKc+mf5WYh KuMzqmmPXmWRFupFeDa61YY5PCvMddDeac/jCYL/2cr73RA8bUItivwt5mEg5nOW 9NEU+cLbl1s8g2KfxwhvodVkbhiNGf8MkVpE5skHHh9OX8HYzZUa/s6uUZO1O0eh XOk+fa9KWabt =n819 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-generic-6.6' of https://github.com/kvm-x86/linux into HEAD Common KVM changes for 6.6: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations	2023-08-31 13:19:55 -04:00
Sean Christopherson	ccf31d6e6c	KVM: x86/mmu: Use KVM-governed feature framework to track "GBPAGES enabled" Use the governed feature framework to track whether or not the guest can use 1GiB pages, and drop the one-off helper that wraps the surprisingly non-trivial logic surrounding 1GiB page usage in the guest. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-17 11:38:27 -07:00
Sean Christopherson	3e1efe2b67	KVM: Wrap kvm_{gfn,hva}_range.pte in a per-action union Wrap kvm_{gfn,hva}_range.pte in a union so that future notifier events can pass event specific information up and down the stack without needing to constantly expand and churn the APIs. Lockless aging of SPTEs will pass around a bitmap, and support for memory attributes will pass around the new attributes for the range. Add a "KVM_NO_ARG" placeholder to simplify handling events without an argument (creating a dummy union variable is midly annoying). Opportunstically drop explicit zero-initialization of the "pte" field, as omitting the field (now a union) has the same effect. Cc: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/all/CAOUHufagkd2Jk3_HrVoFFptRXM=hX2CV8f+M-dka-hJU4bP8kw@mail.gmail.com Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20230729004144.1054885-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-17 11:26:53 -07:00
David Matlack	619b507244	KVM: Move kvm_arch_flush_remote_tlbs_memslot() to common code Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop "arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a range-based TLB invalidation where the range is defined by the memslot. Now that kvm_flush_remote_tlbs_range() can be called from common code we can just use that and drop a bunch of duplicate code from the arch directories. Note this adds a lockdep assertion for slots_lock being held when calling kvm_flush_remote_tlbs_memslot(), which was previously only asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(), but they all hold the slots_lock, so the lockdep assertion continues to hold true. Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating kvm_flush_remote_tlbs_memslot(), since it is no longer necessary. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com	2023-08-17 09:40:35 +01:00
David Matlack	d478899605	KVM: Allow range-based TLB invalidation from common code Make kvm_flush_remote_tlbs_range() visible in common code and create a default implementation that just invalidates the whole TLB. This paves the way for several future features/cleanups: - Introduction of range-based TLBI on ARM. - Eliminating kvm_arch_flush_remote_tlbs_memslot() - Moving the KVM/x86 TDP MMU to common code. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com	2023-08-17 09:40:35 +01:00
Like Xu	1d6664fadd	KVM: x86: Use sysfs_emit() instead of sprintf() Use sysfs_emit() instead of the sprintf() for sysfs entries. sysfs_emit() knows the maximum of the temporary buffer used for outputting sysfs content and avoids overrunning the buffer length. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230625073438.57427-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-07-31 13:55:26 -07:00
Paolo Bonzini	255006adb3	KVM VMX changes for 6.5: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaLDYSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5ovYP/ib86UG9QXwoEKx0mIyLQ5q1jD+StvxH 18SIH62+MXAtmz2E+EmXIySW76diOKCngApJ11WTERPwpZYEpcITh2D2Jp/vwgk5 xUPK+WKYQs1SGpJu3wXhLE1u6mB7X9p7EaXRSKG67P7YK09gTaOik1/3h6oNrGO+ KI06reCQN1PstKTfrZXxYpRlfDc761YaAmSZ79Bg+bK9PisFqme7TJ2mAqNZPFPd E7ho/UOEyWRSyd5VMsuOUB760pMQ9edKrs+38xNDp5N+0Fh0ItTjuAcd2KVWMZyW Fk+CJq4kCqTlEik5OwcEHsTGJGBFscGPSO+T0YtVfSZDdtN/rHN7l8RGquOebVTG Ldm5bg4agu4lXsqqzMxn8J9SkbNg3xno79mMSc2185jS2HLt5Hu6PzQnQ2tEtHJQ IuovmssHOVKDoYODOg0tq8UMydgT3hAvC7YJCouubCjxUUw+22nhN3EDuAhbJhtT DgQNGT7GmsrKIWLEjbm6EpLLOdJdB7/U1MrEshLS015a/DUz4b3ZGYApneifJL8h nGE2Wu+36xGUVNLgDMdvd+R17WdyQa+f+9KjUGy71KelFV4vI4A3JwvH0aIsTyHZ LGlQBZqelc66GYwMiqVC0GYGRtrdgygQopfstvZJ3rYiHZV/mdhB5A0T4J2Xvh2Q bnDNzsSFdsH5 =PjYj -----END PGP SIGNATURE----- Merge tag 'kvm-x86-vmx-6.5' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.5: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups	2023-07-01 07:20:04 -04:00
Paolo Bonzini	88de4b9480	KVM x86/mmu changes for 6.5: - Add back a comment about the subtle side effect of try_cmpxchg64() in tdp_mmu_set_spte_atomic() - Add an assertion in __kvm_mmu_invalidate_addr() to verify that the target KVM MMU is the current MMU - Add a "never" option to effectively avoid creating NX hugepage recovery threads -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaGrgSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5KG0P/iFP2w0PAUQgIgbWeOWiIh1ZXq9JjjX+ GAjii1deMFzuEjOWTzWUY2FDE7Mzea5hsVoOmLY7kb9jwYwPJDhaKlaQbNLYgFAr uJqGg8ZMRIbXGBhX98Z4qrUzqKwjgKDzswu/Fg6xZOhVLKNoIkV/YwVo3b1dZ8+e ecctLJFtmV/xa7cFyOTnB9rDgUuBXc2jB7+7eza6+oFlhO/S1VB2XPBq+IT3KoPO F40YQW05ortC8IaFHHkJSRTfVM8v+2WDzrwpJUtyalDAie4hhy08svCEX2cXEeYX qWgjzPzQLM6AcFhb491M1BjFiEYuh5qhvETK+1PiIXTTq+xaIDb1HSM9BexkSVBR scHt8RdamPq3noqZQgMEIzVHp5L3k72oy1iP0k2uIzirMW9v+M3MWLsQuDV+CaLU +EYQozWNEcDO7b/gpYsGWG9Me11GibqIJeyLJFU7HwAmACBiRyy6RD+RS7NMGzHB 9HT6TkSbPc1+cLJ5npCFwZBkj+vwjPs3lEjVQkiGZtavt1nWHfE8ASdv+hNwnJg/ Xz+PVdKh6g0A3mUqxz/ZuDTp3Hfz3jL1roYFGIAUXAjaebih0MUc/CYf84VvoqIq IymI37EoK9CnMszQGJBc2IeB+Bc8KptYCs+M0WYNQ7MPcLIJHKpIzFPosRb2qyOj /GEYFjfFwaPR =FM8H -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.5' of https://github.com/kvm-x86/linux into HEAD KVM x86/mmu changes for 6.5: - Add back a comment about the subtle side effect of try_cmpxchg64() in tdp_mmu_set_spte_atomic() - Add an assertion in __kvm_mmu_invalidate_addr() to verify that the target KVM MMU is the current MMU - Add a "never" option to effectively avoid creating NX hugepage recovery threads	2023-07-01 07:18:30 -04:00
Sean Christopherson	0b210faf33	KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages Add a "never" option to the nx_huge_pages module param to allow userspace to do a one-way hard disabling of the mitigation, and don't create the per-VM recovery threads when the mitigation is hard disabled. Letting userspace pinky swear that userspace doesn't want to enable NX mitigation (without reloading KVM) allows certain use cases to avoid the latency problems associated with spawning a kthread for each VM. E.g. in FaaS use cases, the guest kernel is trusted and the host may create 100+ VMs per logical CPU, which can result in 100ms+ latencies when a burst of VMs is created. Reported-by: Li RongQing <lirongqing@baidu.com> Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com Cc: Yong He <zhuangel570@gmail.com> Cc: Robert Hoo <robert.hoo.linux@gmail.com> Cc: Kai Huang <kai.huang@intel.com> Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Luiz Capitulino <luizcap@amazon.com> Reviewed-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20230602005859.784190-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-13 09:16:03 -07:00
Sean Christopherson	0a3869e14d	KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares Request an APIC-access page reload when the backing page is migrated (or unmapped) if and only if vendor code actually plugs the backing pfn into structures that reside outside of KVM's MMU. This avoids kicking all vCPUs in the (hopefully infrequent) scenario where the backing page is migrated/invalidated. Unlike VMX's APICv, SVM's AVIC doesn't plug the backing pfn directly into the VMCB and so doesn't need a hook to invalidate an out-of-MMU "mapping". Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:05 -07:00
Sean Christopherson	0a8a5f2c8c	KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:05 -07:00
Sean Christopherson	817fa99836	KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker Factor in the address space (non-SMM vs. SMM) of the target shadow page when recovering potential NX huge pages, otherwise KVM will retrieve the wrong memslot when zapping shadow pages that were created for SMM. The bug most visibly manifests as a WARN on the memslot being non-NULL, but the worst case scenario is that KVM could unaccount the shadow page without ensuring KVM won't install a huge page, i.e. if the non-SMM slot is being dirty logged, but the SMM slot is not. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015 kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450 R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98 R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0 Call Trace: <TASK> __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm] kvm_vm_worker_thread+0x106/0x1c0 [kvm] kthread+0xd9/0x100 ret_from_fork+0x2c/0x50 </TASK> ---[ end trace 0000000000000000 ]--- This bug was exposed by commit `edbdb43fc9` ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated"), which allowed KVM to retain SMM TDP MMU roots effectively indefinitely. Before commit `edbdb43fc9`, KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages once all vCPUs exited SMM, which made the window where this bug (recovering an SMM NX huge page) could be encountered quite tiny. To hit the bug, the NX recovery thread would have to run while at least one vCPU was in SMM. Most VMs typically only use SMM during boot, and so the problematic shadow pages were gone by the time the NX recovery thread ran. Now that KVM preserves TDP MMU roots until they are explicitly invalidated (e.g. by a memslot deletion), the window to trigger the bug is effectively never closed because most VMMs don't delete memslots after boot (except for a handful of special scenarios). Fixes: `eb29860570` ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages") Reported-by: Fabio Coatti <fabio.coatti@gmail.com> Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 16:34:10 -07:00
Like Xu	762b33eb90	KVM: x86/mmu: Assert on @mmu in the __kvm_mmu_invalidate_addr() Add assertion to track that "mmu == vcpu->arch.mmu" is always true in the context of __kvm_mmu_invalidate_addr(). for_each_shadow_entry_using_root() and kvm_sync_spte() operate on vcpu->arch.mmu, but the only reason that doesn't cause explosions is because handle_invept() frees roots instead of doing a manual invalidation. As of now, there are no major roadblocks to switching INVEPT emulation over to use kvm_mmu_invalidate_addr(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230523032947.60041-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 11:24:52 -07:00
Paolo Bonzini	48b1893ae3	KVM x86 PMU changes for 6.4: - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - Misc cleanups and fixes -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGtd4SHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5Z9kP/i3WZ40hevvQvB/5cEpxxmxYDwCYnnjM hiQgK5jT4SrMTmVjLgkNdI2PogQoS4CX+GC7lcA9bvse84hjuPvgOflb2B+p2UQi Ytbr9g/tfKNIpnKIk9mcPcSObN9vm2Kgt7n28rtPrHWj89eQzgc66eijqdpKBLxA c3crVR8krwYAQK0tmzHq1+H6hB369YbHAHyTTRRI/bNWnqKblnvUbt0NL2aBusa9 rNMaOdRtinLpy2dmuX/b3japRB8QTnlf7zpPIF4cBEhbYXy5woClZpf1D2fCA6Er XFbEoYawMVd9UeJYbW4z5yErLT83eYoGp4U0eFXWp6fvh8nZlgCGvBKE9g4mmqwj aSLaTR5eVN2qlw6jXVeg3unCo8Eyl36AwYwve2L6sFmBvZvNV5iz2eQ7rrOe4oE3 dnTUaLQ8I2SVg04MbYmCq5W+frTL/I7kqNpbccL1Z3R5WO4y5gz63mug6NfLIvhR t45TAIaifxBfcXQsBZM3v2KUK/xQrD3AbJmFKh54L2CKqiGaNWsMLX+6NZ7LZWgf 8rEqsVkkQDgF7z8eXai4TR26nYfSX6g9gDqtOH73L87aJ7PJk5cRoDWQ1sWs1e/l 4HA/L0Bo/3pnKAa0ZWxJOixmzqY49gNQf3dj8gt3jk3y2ijbAivshiSpPBmIxn0u QLeOf/LGvipl =m18F -----END PGP SIGNATURE----- Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEAD KVM x86 PMU changes for 6.4: - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - Misc cleanups and fixes	2023-04-26 15:53:36 -04:00
Paolo Bonzini	807b758496	KVM x86 MMU changes for 6.4: - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to share the .sync_page() implementation, i.e. utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Misc cleanups -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGsvASHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5XnoP/0D8rQmrA0xPHK81zYS1E71tsR/itO/T CQMSB4PhEqvcRUaWOuhLBRUW+noWzaOkjkMYK2uoPTdtme7v9+Ar7EtfrWYHrBWD IxHCAymo3a5dQPUc3Nb77u6HjRAOokPSqSz5jE4qAjlniW09feruro2Phi+BTme4 JjxTc/7Oh0Fu26+mK7mJHiw3fV1x3YznnnRPrKGrVQes5L6ozNICkUZ6nvuJUVMk lTNHNQbG8PqJZnfWG7VIKRn1vdfXwEfnvyucGVEqFfPLkOXqJHyqMVmIOtvsH7C5 l8j36+lBZwtFh2jk2EsXOTb6sS7l1MSvyHLlbaJaqqffP+77Hf1n0fROur0k9Yse jJJejJWxZ/SvjMt/bOA+4ybGafZH0lt20DsDWnat5GSQ1EVT1CInN2p8OY8pdecR QOJBqnNUOykC7/Pyad+IxTxwrOSNCYh+5aYG8AdGquZvNUEwjffVJqrmxDvklY8Z DTYwGKgNY7NsP/dV0WYYElsAuHiKwiDZL15KftiQebO1fPcZDpTzDo83/8UMfGxh yegngcNX9Qi7lWtLkUMy8A99UvejM0QrS/Zt8v1zjlQ8PjreZLLBWsNpe0ufIMRk 31ZAC2OS4Koi3wZ54tA7Z1Kh11meGhAk5Ti7sNke0rDqB9UMmj6UKw121cSRvW7q W6O4U3YeGpKx =zb4u -----END PGP SIGNATURE----- Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.4: - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to share the .sync_page() implementation, i.e. utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Misc cleanups	2023-04-26 15:50:01 -04:00
Sean Christopherson	cf9f4c0eb1	KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faults Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for permission faults when emulating a guest memory access and CR0.WP may be guest owned. If the guest toggles only CR0.WP and triggers emulation of a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a stale CR0.WP, i.e. use stale protection bits metadata. Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't support per-bit interception controls for CR0. Don't bother checking for EPT vs. NPT as the "old == new" check will always be true under NPT, i.e. the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0 from the VMCB on VM-Exit). Reported-by: Mathias Krause <minipli@grsecurity.net> Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net Fixes: `fb509f76ac` ("KVM: VMX: Make CR0.WP a guest owned bit") Tested-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-10 15:25:36 -07:00
Sean Christopherson	9ed3bf4112	KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V code Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages pair instead of a struct, and bury said struct in Hyper-V specific code. Passing along two params generates much better code for the common case where KVM is _not_ running on Hyper-V, as forwarding the flush on to Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range() becomes a tail call. Cc: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-10 15:17:29 -07:00
Sean Christopherson	8a1300ff95	KVM: x86: Rename Hyper-V remote TLB hooks to match established scheme Rename the Hyper-V hooks for TLB flushing to match the naming scheme used by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code, arch hooks from common code, etc. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-10 15:17:29 -07:00
Sean Christopherson	fb3146b4dc	KVM: x86: Add a helper to query whether or not a vCPU has ever run Add a helper to query if a vCPU has run so that KVM doesn't have to open code the check on last_vmentry_cpu being set to a magic value. No functional change intended. Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Like Xu <like.xu.linux@gmail.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230311004618.920745-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-04-06 14:57:22 -07:00
Paolo Bonzini	2fdcc1b324	KVM: x86/mmu: Avoid indirect call for get_cr3 Most of the time, calls to get_guest_pgd result in calling kvm_read_cr3 (the exception is only nested TDP). Hardcode the default instead of using the get_cr3 function, avoiding a retpoline if they are enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230322013731.102955-2-minipli@grsecurity.net Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-22 07:46:42 -07:00
Sean Christopherson	f3d90f901d	KVM: x86/mmu: Clean up mmu.c functions that put return type on separate line Adjust a variety of functions in mmu.c to put the function return type on the same line as the function declaration. As stated in the Linus specification: But the "on their own line" is complete garbage to begin with. That will NEVER be a kernel rule. We should never have a rule that assumes things are so long that they need to be on multiple lines. We don't put function return types on their own lines either, even if some other projects have that rule (just to get function names at the beginning of lines or some other odd reason). Leave the functions generated by BUILD_MMU_ROLE_REGS_ACCESSOR() as-is, that code is basically illegible no matter how it's formatted. No functional change intended. Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com Signed-off-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230202182809.1929122-4-bgardon@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-17 16:02:47 -07:00
Sean Christopherson	eddd9e8302	KVM: x86/mmu: Replace comment with an actual lockdep assertion on mmu_lock Assert that mmu_lock is held for write in __walk_slot_rmaps() instead of hoping the function comment will magically prevent introducing bugs. Signed-off-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230202182809.1929122-3-bgardon@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-17 16:02:07 -07:00
Sean Christopherson	727ae37701	KVM: x86/mmu: Rename slot rmap walkers to add clarity and clean up code Replace "slot_handle_level" with "walk_slot_rmaps" to better capture what the helpers are doing, and to slightly shorten the function names so that each function's return type and attributes can be placed on the same line as the function declaration. No functional change intended. Link: https://lore.kernel.org/mm-commits/CAHk-=wjS-Jg7sGMwUPpDsjv392nDOOs0CtUtVkp=S6Q7JzFJRw@mail.gmail.com Signed-off-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230202182809.1929122-2-bgardon@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-17 16:02:06 -07:00
David Matlack	9d4655da1a	KVM: x86/mmu: Use gfn_t in kvm_flush_remote_tlbs_range() Use gfn_t instead of u64 for kvm_flush_remote_tlbs_range()'s parameters, since gfn_t is the standard type for GFNs throughout KVM. Opportunistically rename pages to nr_pages to make its role even more obvious. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-6-dmatlack@google.com [sean: convert pages to gfn_t too, and rename] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-17 15:36:20 -07:00
David Matlack	8c63e8c217	KVM: x86/mmu: Rename kvm_flush_remote_tlbs_with_address() Rename kvm_flush_remote_tlbs_with_address() to kvm_flush_remote_tlbs_range(). This name is shorter, which reduces the number of callsites that need to be broken up across multiple lines, and more readable since it conveys a range of memory is being flushed rather than a single address. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-17 15:16:12 -07:00
David Matlack	28e4b4597d	KVM: x86/mmu: Collapse kvm_flush_remote_tlbs_with_{range,address}() together Collapse kvm_flush_remote_tlbs_with_range() and kvm_flush_remote_tlbs_with_address() into a single function. This eliminates some lines of code and a useless NULL check on the range struct. Opportunistically switch from ENOTSUPP to EOPNOTSUPP to make checkpatch happy. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-4-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-17 15:08:41 -07:00
Lai Jiangshan	141705b783	KVM: x86/mmu: Track tail count in pte_list_desc to optimize guest fork() Rework "struct pte_list_desc" and pte_list_{add\|remove} to track the tail count, i.e. number of PTEs in non-head descriptors, and to always keep all tail descriptors full so that adding a new entry and counting the number of entries is done in constant time instead of linear time. No visible performace is changed in tests. But pte_list_add() is no longer shown in the perf result for the COWed pages even the guest forks millions of tasks. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230113122910.672417-1-jiangshanlai@gmail.com [sean: reword shortlog, tweak changelog, add lots of comments, add BUG_ON()] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 19:07:37 -07:00
Lai Jiangshan	19ace7d6ca	KVM: x86/mmu: Skip calling mmu->sync_spte() when the spte is 0 Sync the spte only when the spte is set and avoid the indirect branch. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-5-jiangshanlai@gmail.com [sean: add wrapper instead of open coding each check] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:55 -07:00
Lai Jiangshan	9fd4a4e3a3	KVM: x86/mmu: Remove FNAME(invlpg) and use FNAME(sync_spte) to update vTLB instead. In hardware TLB, invalidating TLB entries means the translations are removed from the TLB. In KVM shadowed vTLB, the translations (combinations of shadow paging and hardware TLB) are generally maintained as long as they remain "clean" when the TLB of an address space (i.e. a PCID or all) is flushed with the help of write-protections, sp->unsync, and kvm_sync_page(), where "clean" in this context means that no updates to KVM's SPTEs are needed. However, FNAME(invlpg) always zaps/removes the vTLB if the shadow page is unsync, and thus triggers a remote flush even if the original vTLB entry is clean, i.e. is usable as-is. Besides this, FNAME(invlpg) is largely is a duplicate implementation of FNAME(sync_spte) to invalidate a vTLB entry. To address both issues, reuse FNAME(sync_spte) to share the code and slightly modify the semantics, i.e. keep the vTLB entry if it's "clean" and avoid remote TLB flush. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-3-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:54 -07:00
Lai Jiangshan	ed335278bd	KVM: x86/mmu: Allow the roots to be invalid in FNAME(invlpg) Don't assume the current root to be valid, just check it and remove the WARN(). Also move the code to check if the root is valid into FNAME(invlpg) to simplify the code. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-2-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:53 -07:00
Lai Jiangshan	2c86c444e2	KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in nested_ept_invalidate_addr() Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg(). No functional change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216235321.735214-1-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:53 -07:00
Lai Jiangshan	9ebc3f51da	KVM: x86/mmu: Use kvm_mmu_invalidate_addr() in kvm_mmu_invpcid_gva() Use kvm_mmu_invalidate_addr() instead open calls to mmu->invlpg(). No functional change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-10-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:52 -07:00
Lai Jiangshan	cd42853e95	kvm: x86/mmu: Use KVM_MMU_ROOT_XXX for kvm_mmu_invalidate_addr() The @root_hpa for kvm_mmu_invalidate_addr() is called with @mmu->root.hpa or INVALID_PAGE where @mmu->root.hpa is to invalidate gva for the current root (the same meaning as KVM_MMU_ROOT_CURRENT) and INVALID_PAGE is to invalidate gva for all roots (the same meaning as KVM_MMU_ROOTS_ALL). Change the argument type of kvm_mmu_invalidate_addr() and use KVM_MMU_ROOT_XXX instead so that we can reuse the function for kvm_mmu_invpcid_gva() and nested_ept_invalidate_addr() for invalidating gva for different set of roots. No fuctionalities changed. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-9-jiangshanlai@gmail.com [sean: massage comment slightly] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:51 -07:00
Sean Christopherson	f94db0c8b9	KVM: x86/mmu: Sanity check input to kvm_mmu_free_roots() Tweak KVM_MMU_ROOTS_ALL to precisely cover all current+previous root flags, and add a sanity in kvm_mmu_free_roots() to verify that the set of roots to free doesn't stray outside KVM_MMU_ROOTS_ALL. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-8-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:51 -07:00
Lai Jiangshan	c3c6c9fc5d	KVM: x86/mmu: Move the code out of FNAME(sync_page)'s loop body into mmu.c Rename mmu->sync_page to mmu->sync_spte and move the code out of FNAME(sync_page)'s loop body into mmu.c. No functionalities change intended. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-6-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 17:19:44 -07:00
Lai Jiangshan	8ef228c20c	KVM: x86/mmu: Set mmu->sync_page as NULL for direct paging mmu->sync_page for direct paging is never called. And both mmu->sync_page and mm->invlpg only make sense in shadow paging. Setting mmu->sync_page as NULL for direct paging makes it consistent with mm->invlpg which is set NULL for the case. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-5-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 12:49:53 -07:00
Lai Jiangshan	51dddf6c49	KVM: x86/mmu: Check mmu->sync_page pointer in kvm_sync_page_check() Assert that mmu->sync_page is non-NULL as part of the sanity checks performed before attempting to sync a shadow page. Explicitly checking mmu->sync_page is all but guaranteed to be redundant with the existing sanity check that the MMU is indirect, but the cost is negligible, and the explicit check also serves as documentation. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-4-jiangshanlai@gmail.com [sean: increase verbosity of changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 12:44:19 -07:00
Lai Jiangshan	90e444702a	KVM: x86/mmu: Move the check in FNAME(sync_page) as kvm_sync_page_check() Prepare to check mmu->sync_page pointer before calling it. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-3-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 12:42:15 -07:00
Lai Jiangshan	753b43c9d1	KVM: x86/mmu: Use 64-bit address to invalidate to fix a subtle bug FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned long, as the type of the address to invalidate. On 32-bit kernels, the upper 32 bits of the GPA will get dropped when an L2 GPA address is invalidated in the shadowed nested TDP MMU. Convert it to u64 to fix the problem. Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com [sean: tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-03-16 12:41:05 -07:00
Sean Christopherson	258d985f6e	KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pages Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults on self-changing writes to shadowed page tables instead of propagating that information to the emulator via a semi-persistent vCPU flag. Using a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented, as it's not at all obvious that clearing the flag only when emulation actually occurs is correct. E.g. if KVM sets the flag and then retries the fault without ever getting to the emulator, the flag will be left set for future calls into the emulator. But because the flag is consumed if and only if both EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated MMIO, or while L2 is active, KVM avoids false positives on a stale flag since FNAME(page_fault) is guaranteed to be run and refresh the flag before it's ultimately consumed by the tail end of reexecute_instruction(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230202182817.407394-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-03-14 10:28:56 -04:00
David Matlack	7f604e92fb	KVM: x86/mmu: Make tdp_mmu_allowed static Make tdp_mmu_allowed static since it is only ever used within arch/x86/kvm/mmu/mmu.c. Link: https://lore.kernel.org/kvm/202302072055.odjDVd5V-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20230213212844.3062733-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-02-16 12:29:50 -05:00
Christophe JAILLET	11b36fe7d4	KVM: x86/mmu: Use kstrtobool() instead of strtobool() strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/670882aa04dbdd171b46d3b20ffab87158454616.1673689135.git.christophe.jaillet@wanadoo.fr Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:49 -08:00
Hou Wenlong	4ad980aea7	KVM: x86/mmu: Cleanup range-based flushing for given page Use the new kvm_flush_remote_tlbs_gfn() helper to cleanup the call sites of range-based flushing for given page, which makes the code clear. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/593ee1a876ece0e819191c0b23f56b940d6686db.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:48 -08:00
Hou Wenlong	3cdf93746f	KVM: x86/mmu: Fix wrong gfn range of tlb flushing in validate_direct_spte() The spte pointing to the children SP is dropped, so the whole gfn range covered by the children SP should be flushed. Although, Hyper-V may treat a 1-page flush the same if the address points to a huge page, it still would be better to use the correct size of huge page. Fixes: `c3134ce240` ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/5f297c566f7d7ff2ea6da3c66d050f69ce1b8ede.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:48 -08:00
Hou Wenlong	1b2dc73604	KVM: x86/mmu: Fix wrong start gfn of tlb flushing with range When a spte is dropped, the start gfn of tlb flushing should be the gfn of spte not the base gfn of SP which contains the spte. Also introduce a helper function to do range-based flushing when a spte is dropped, which would help prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such case. Fixes: `c3134ce240` ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/72ac2169a261976f00c1703e88cda676dfb960f5.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:47 -08:00
Hou Wenlong	9ffe926537	KVM: x86/mmu: Fix wrong gfn range of tlb flushing in kvm_set_pte_rmapp() When the spte of hupe page is dropped in kvm_set_pte_rmapp(), the whole gfn range covered by the spte should be flushed. However, rmap_walk_init_level() doesn't align down the gfn for new level like tdp iterator does, then the gfn used in kvm_set_pte_rmapp() is not the base gfn of huge page. And the size of gfn range is wrong too for huge page. Use the base gfn of huge page and the size of huge page for flushing tlbs for huge page. Also introduce a helper function to flush the given page (huge or not) of guest memory, which would help prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such case. Fixes: `c3134ce240` ("KVM: Replace old tlb flush function with new one to flush a specified range.") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/0ce24d7078fa5f1f8d64b0c59826c50f32f8065e.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:46 -08:00
Hou Wenlong	c667a3baed	KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.h Rounding down the GFN to a huge page size is a common pattern throughout KVM, so move round_gfn_for_level() helper in tdp_iter.c to mmu_internal.h for common usage. Also rename it as gfn_round_for_level() to use gfn_* prefix and clean up the other call sites. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:45 -08:00
Wei Liu	a7e48ef77f	KVM: x86/mmu: fix an incorrect comment in kvm_mmu_new_pgd() There is no function named kvm_mmu_ensure_valid_pgd(). Fix the comment and remove the pair of braces to conform to Linux kernel coding style. Signed-off-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221128214709.224710-1-wei.liu@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-01-24 10:05:45 -08:00
Sean Christopherson	8d20bd6381	KVM: x86: Unify pr_fmt to use module name for all KVM modules Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks use consistent formatting across common x86, Intel, and AMD code. In addition to providing consistent print formatting, using KBUILD_MODNAME, e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and SGX and ...) as technologies without generating weird messages, and without causing naming conflicts with other kernel code, e.g. "SEV: ", "tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems. Opportunistically move away from printk() for prints that need to be modified anyways, e.g. to drop a manual "kvm: " prefix. Opportunistically convert a few SGX WARNs that are similarly modified to WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good that they would fire repeatedly and spam the kernel log without providing unique information in each print. Note, defining pr_fmt yields undesirable results for code that uses KVM's printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's wrappers is relatively limited in KVM x86 code. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-Id: <20221130230934.1014142-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:47:35 -05:00
Paolo Bonzini	fc471e8310	Merge branch 'kvm-late-6.1' into HEAD x86: * Change tdp_mmu to a read-only parameter * Separate TDP and shadow MMU page fault paths * Enable Hyper-V invariant TSC control selftests: * Use TAP interface for kvm_binary_stats_test and tsc_msrs_test Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:36:47 -05:00
Sean Christopherson	dfe0ecc6f5	KVM: x86/mmu: Pivot on "TDP MMU enabled" when handling direct page faults When handling direct page faults, pivot on the TDP MMU being globally enabled instead of checking if the target MMU is a TDP MMU. Now that the TDP MMU is all-or-nothing, if the TDP MMU is enabled, KVM will reach direct_page_fault() if and only if the MMU is a TDP MMU. When TDP is enabled (obviously required for the TDP MMU), only non-nested TDP page faults reach direct_page_fault(), i.e. nonpaging MMUs are impossible, as NPT requires paging to be enabled and EPT faults use ept_page_fault(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221012181702.3663607-8-seanjc@google.com> [Use tdp_mmu_enabled variable. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:26 -05:00
Sean Christopherson	78fdd2f09f	KVM: x86/mmu: Pivot on "TDP MMU enabled" to check if active MMU is TDP MMU Simplify and optimize the logic for detecting if the current/active MMU is a TDP MMU. If the TDP MMU is globally enabled, then the active MMU is a TDP MMU if it is direct. When TDP is enabled, so called nonpaging MMUs are never used as the only form of shadow paging KVM uses is for nested TDP, and the active MMU can't be direct in that case. Rename the helper and take the vCPU instead of an arbitrary MMU, as nonpaging MMUs can show up in the walk_mmu if L1 is using nested TDP and L2 has paging disabled. Taking the vCPU has the added bonus of cleaning up the callers, all of which check the current MMU but wrap code that consumes the vCPU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221012181702.3663607-9-seanjc@google.com> [Use tdp_mmu_enabled variable. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:25 -05:00
Sean Christopherson	de0322f575	KVM: x86/mmu: Replace open coded usage of tdp_mmu_page with is_tdp_mmu_page() Use is_tdp_mmu_page() instead of querying sp->tdp_mmu_page directly so that all users benefit if KVM ever finds a way to optimize the logic. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221012181702.3663607-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:25 -05:00
David Matlack	6c882ef4fc	KVM: x86/mmu: Rename __direct_map() to direct_map() Rename __direct_map() to direct_map() since the leading underscores are unnecessary. This also makes the page fault handler names more consistent: kvm_tdp_mmu_page_fault() calls kvm_tdp_mmu_map() and direct_page_fault() calls direct_map(). Opportunistically make some trivial cleanups to comments that had to be modified anyway since they mentioned __direct_map(). Specifically, use "()" when referring to functions, and include kvm_tdp_mmu_map() among the various callers of disallowed_hugepage_adjust(). No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:24 -05:00
David Matlack	9f33697ac7	KVM: x86/mmu: Stop needlessly making MMU pages available for TDP MMU faults Stop calling make_mmu_pages_available() when handling TDP MMU faults. The TDP MMU does not participate in the "available MMU pages" tracking and limiting so calling this function is unnecessary work when handling TDP MMU faults. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:23 -05:00
David Matlack	9aa8ab43b3	KVM: x86/mmu: Split out TDP MMU page fault handling Split out the page fault handling for the TDP MMU to a separate function. This creates some duplicate code, but makes the TDP MMU fault handler simpler to read by eliminating branches and will enable future cleanups by allowing the TDP MMU and non-TDP MMU fault paths to diverge. Only compile in the TDP MMU fault handler for 64-bit builds since kvm_tdp_mmu_map() does not exist in 32-bit builds. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:22 -05:00
David Matlack	e5e6f8d254	KVM: x86/mmu: Initialize fault.{gfn,slot} earlier for direct MMUs Move the initialization of fault.{gfn,slot} earlier in the page fault handling code for fully direct MMUs. This will enable a future commit to split out TDP MMU page fault handling without needing to duplicate the initialization of these 2 fields. Opportunistically take advantage of the fact that fault.gfn is initialized in kvm_tdp_page_fault() rather than recomputing it from fault->addr. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:21 -05:00
David Matlack	354c908c06	KVM: x86/mmu: Handle no-slot faults in kvm_faultin_pfn() Handle faults on GFNs that do not have a backing memslot in kvm_faultin_pfn() and drop handle_abnormal_pfn(). This eliminates duplicate code in the various page fault handlers. Opportunistically tweak the comment about handling gfn > host.MAXPHYADDR to reflect that the effect of returning RET_PF_EMULATE at that point is to avoid creating an MMIO SPTE for such GFNs. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:21 -05:00
David Matlack	cd08d178ff	KVM: x86/mmu: Avoid memslot lookup during KVM_PFN_ERR_HWPOISON handling Pass the kvm_page_fault struct down to kvm_handle_error_pfn() to avoid a memslot lookup when handling KVM_PFN_ERR_HWPOISON. Opportunistically move the gfn_to_hva_memslot() call and @current down into kvm_send_hwpoison_signal() to cut down on line lengths. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:20 -05:00
David Matlack	56c3a4e4a2	KVM: x86/mmu: Handle error PFNs in kvm_faultin_pfn() Handle error PFNs in kvm_faultin_pfn() rather than relying on the caller to invoke handle_abnormal_pfn() after kvm_faultin_pfn(). Opportunistically rename kvm_handle_bad_page() to kvm_handle_error_pfn() to make it more consistent with is_error_pfn(). This commit moves KVM closer to being able to drop handle_abnormal_pfn(), which will reduce the amount of duplicate code in the various page fault handlers. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:19 -05:00
David Matlack	ba6e3fe255	KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn() Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct kvm_page_fault. The eliminates duplicate code and reduces the amount of parameters needed for is_page_fault_stale(). Preemptively split out __kvm_faultin_pfn() to a separate function for use in subsequent commits. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:18 -05:00
David Matlack	09732d2b4d	KVM: x86/mmu: Move TDP MMU VM init/uninit behind tdp_mmu_enabled Move kvm_mmu_{init,uninit}_tdp_mmu() behind tdp_mmu_enabled. This makes these functions consistent with the rest of the calls into the TDP MMU from mmu.c, and which is now possible since tdp_mmu_enabled is only modified when the x86 vendor module is loaded. i.e. It will never change during the lifetime of a VM. This change also enabled removing the stub definitions for 32-bit KVM, as the compiler will just optimize the calls out like it does for all the other TDP MMU functions. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:17 -05:00
David Matlack	1f98f2bd8e	KVM: x86/mmu: Change tdp_mmu to a read-only parameter Change tdp_mmu to a read-only parameter and drop the per-vm tdp_mmu_enabled. For 32-bit KVM, make tdp_mmu_enabled a macro that is always false so that the compiler can continue omitting cals to the TDP MMU. The TDP MMU was introduced in 5.10 and has been enabled by default since 5.15. At this point there are no known functionality gaps between the TDP MMU and the shadow MMU, and the TDP MMU uses less memory and scales better with the number of vCPUs. In other words, there is no good reason to disable the TDP MMU on a live system. Purposely do not drop tdp_mmu=N support (i.e. do not force 64-bit KVM to always use the TDP MMU) since tdp_mmu=N is still used to get test coverage of KVM's shadow MMU TDP support, which is used in 32-bit KVM. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:16 -05:00
Lai Jiangshan	c4a488685b	kvm: x86/mmu: Warn on linking when sp->unsync_children Since the commit `65855ed8b0` ("KVM: X86: Synchronize the shadow pagetable before link it"), no sp would be linked with sp->unsync_children = 1. So make it WARN if it is the case. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221212090106.378206-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-12-29 15:33:13 -05:00
Linus Torvalds	8fa590bf34	ARM64: * Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. * Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. * Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit `382b5b87a9`: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). * Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. * Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. * Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. * Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: * Second batch of the lazy destroy patches * First batch of KVM changes for kernel virtual != physical address support * Removal of a unused function x86: * Allow compiling out SMM support * Cleanup and documentation of SMM state save area format * Preserve interrupt shadow in SMM state save area * Respond to generic signals during slow page faults * Fixes and optimizations for the non-executable huge page errata fix. * Reprogram all performance counters on PMU filter change * Cleanups to Hyper-V emulation and tests * Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) * Advertise several new Intel features * x86 Xen-for-KVM: Allow the Xen runstate information to cross a page boundary Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured ** Add support for 32-bit guests in SCHEDOP_poll * Notable x86 fixes and cleanups: One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. Advertise (on AMD) that the SMM_CTL MSR is not supported ** Remove unnecessary exports Generic: * Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: * Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. * Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. * Introduce actual atomics for clear/set_bit() in selftests * Add support for pinning vCPUs in dirty_log_perf_test. * Rename the so called "perf_util" framework to "memstress". * Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. * Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. * Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). * A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. * x86-specific selftest changes: Clean up x86's page table management. Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. Clean up the nEPT support checks. Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. ** Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: * Remove deleted ioctls from documentation * Clean up the docs for the x86 MSR filter. * Various fixes -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA== =QAYX -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit `382b5b87a9`: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the no-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of a unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest running on top of a L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight psuedo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests. - Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that tiggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...	2022-12-15 11:12:21 -08:00
Kazuki Takiguchi	47b0c2e4c2	KVM: x86/mmu: Fix race condition in direct_page_fault make_mmu_pages_available() must be called with mmu_lock held for write. However, if the TDP MMU is used, it will be called with mmu_lock held for read. This function does nothing unless shadow pages are used, so there is no race unless nested TDP is used. Since nested TDP uses shadow pages, old shadow pages may be zapped by this function even when the TDP MMU is enabled. Since shadow pages are never allocated by kvm_tdp_mmu_map(), a race condition can be avoided by not calling make_mmu_pages_available() if the TDP MMU is currently in use. I encountered this when repeatedly starting and stopping nested VM. It can be artificially caused by allocating a large number of nested TDP SPTEs. For example, the following BUG and general protection fault are caused in the host kernel. pte_list_remove: 00000000cd54fc10 many->many ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/mmu.c:963! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:pte_list_remove.cold+0x16/0x48 [kvm] Call Trace: <TASK> drop_spte+0xe0/0x180 [kvm] mmu_page_zap_pte+0x4f/0x140 [kvm] __kvm_mmu_prepare_zap_page+0x62/0x3e0 [kvm] kvm_mmu_zap_oldest_mmu_pages+0x7d/0xf0 [kvm] direct_page_fault+0x3cb/0x9b0 [kvm] kvm_tdp_page_fault+0x2c/0xa0 [kvm] kvm_mmu_page_fault+0x207/0x930 [kvm] npf_interception+0x47/0xb0 [kvm_amd] svm_invoke_exit_handler+0x13c/0x1a0 [kvm_amd] svm_handle_exit+0xfc/0x2c0 [kvm_amd] kvm_arch_vcpu_ioctl_run+0xa79/0x1780 [kvm] kvm_vcpu_ioctl+0x29b/0x6f0 [kvm] __x64_sys_ioctl+0x95/0xd0 do_syscall_64+0x5c/0x90 general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:kvm_mmu_commit_zap_page.part.0+0x4b/0xe0 [kvm] Call Trace: <TASK> kvm_mmu_zap_oldest_mmu_pages+0xae/0xf0 [kvm] direct_page_fault+0x3cb/0x9b0 [kvm] kvm_tdp_page_fault+0x2c/0xa0 [kvm] kvm_mmu_page_fault+0x207/0x930 [kvm] npf_interception+0x47/0xb0 [kvm_amd] CVE: CVE-2022-45869 Fixes: `a2855afc7e` ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU") Signed-off-by: Kazuki Takiguchi <takiguchi.kazuki171@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-23 18:50:08 -05:00
Paolo Bonzini	6c7b2202e4	KVM: x86: avoid memslot check in NX hugepage recovery if it cannot succeed Since gfn_to_memslot() is relatively expensive, it helps to skip it if it the memslot cannot possibly have dirty logging enabled. In order to do this, add to struct kvm a counter of the number of log-page memslots. While the correct value can only be read with slots_lock taken, the NX recovery thread is content with using an approximate value. Therefore, the counter is an atomic_t. Based on https://lore.kernel.org/kvm/20221027200316.2221027-2-dmatlack@google.com/ by David Matlack. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-18 11:30:12 -05:00
David Matlack	eb29860570	KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked, as it will just be faulted back in at the same 4KiB granularity when accessed by a vCPU. This may need to be changed if KVM ever supports 2MiB (or larger) dirty tracking granularity, or faulting huge pages during dirty tracking for reads/executes. However for now, these zaps are entirely wasteful. In order to check if this commit increases the CPU usage of the NX recovery worker thread I used a modified version of execute_perf_test [1] that supports splitting guest memory into multiple slots and reports /proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just before tearing down the VM. The goal was to force a large number of NX Huge Page recoveries and see if the recovery worker used any more CPU. Test Setup: echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio Test Command: ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o \| kvm-nx-lpage-re:se.sum_exec_runtime \| \| ---------------------------------------- \| Run \| Before \| After \| ------- \| ------------------ \| ------------------- \| 1 \| 730.084105 \| 724.375314 \| 2 \| 728.751339 \| 740.581988 \| 3 \| 736.264720 \| 757.078163 \| Comparing the median results, this commit results in about a 1% increase CPU usage of the NX recovery worker when testing a VM with 16 slots. However, the effect is negligible with the default halving time of NX pages, which is 1 hour rather than 10 seconds given by period_ms = 1000, ratio = 10. [1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/ Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20221103204421.1146958-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-17 11:26:35 -05:00
Sean Christopherson	6d3085e4d8	KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range() When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated range to effectively block all page faults while the zap is in-progress. The invalidation helpers take a host virtual address, whereas zapping a GFN obviously provides a guest physical address and with the wrong unit of measurement (frame vs. byte). Alternatively, KVM could walk all memslots to get the associated HVAs, but thanks to SMM, that would require multiple lookups. And practically speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path, e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0 during boot, and APICv inhibits are similarly infrequent operations. Fixes: `edb298c663` ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range") Reported-by: Chao Peng <chao.p.peng@linux.intel.com> Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221111001841.2412598-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-11 07:19:46 -05:00
Sean Christopherson	3a05675722	KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed NX huge page regardless of the MMU type. Recovery runs while holding mmu_lock for write and so it should be impossible to get false positives on the WARN. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:34 -05:00
Mingwei Zhang	76901e56fb	KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Explicitly check if a NX huge page is disallowed when determining if a page fault needs to be forced to use a smaller sized page. KVM currently assumes that the NX huge page mitigation is the only scenario where KVM will force a shadow page instead of a huge page, and so unnecessarily keeps an existing shadow page instead of replacing it with a huge page. Any scenario that causes KVM to zap leaf SPTEs may result in having a SP that can be made huge without violating the NX huge page mitigation. E.g. prior to commit `5ba7c4c6d1` ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging"), KVM would keep shadow pages after disabling dirty logging due to a live migration being canceled, resulting in degraded performance due to running with 4kb pages instead of huge pages. Although the dirty logging case is "fixed", that fix is coincidental, i.e. is an implementation detail, and there are other scenarios where KVM will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page migration (mmu_notifier invalidation) to create a huge page would yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume a huge page is disallowed. Fixes: `b8e8c8303f` ("kvm: mmu: ITLB_MULTIHIT mitigation") Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: use spte_to_child_sp(), massage changelog, fold into if-statement] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:34 -05:00
Sean Christopherson	5e3edd7e8b	KVM: x86/mmu: Add helper to convert SPTE value to its shadow page Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:33 -05:00
Sean Christopherson	61f9447854	KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP visible to other readers, i.e. before setting its SPTE. This will allow KVM to query the flag when determining if a shadow page can be replaced by a NX huge page without violating the rules of the mitigation. Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible for another CPU to see a shadow page without an up-to-date nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated dance. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:32 -05:00
Sean Christopherson	b5b0977f4a	KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs Account and track NX huge pages for nonpaging MMUs so that a future enhancement to precisely check if a shadow page can't be replaced by a NX huge page doesn't get false positives. Without correct tracking, KVM can get stuck in a loop if an instruction is fetching and writing data on the same huge page, e.g. KVM installs a small executable page on the fetch fault, replaces it with an NX huge page on the write fault, and faults again on the fetch. Alternatively, and perhaps ideally, KVM would simply not enforce the workaround for nonpaging MMUs. The guest has no page tables to abuse and KVM is guaranteed to switch to a different MMU on CR0.PG being toggled so there's no security or performance concerns. However, getting make_spte() to play nice now and in the future is unnecessarily complex. In the current code base, make_spte() can enforce the mitigation if TDP is enabled or the MMU is indirect, but make_spte() may not always have a vCPU/MMU to work with, e.g. if KVM were to support in-line huge page promotion when disabling dirty logging. Without a vCPU/MMU, KVM could either pass in the correct information and/or derive it from the shadow page, but the former is ugly and the latter subtly non-trivial due to the possibility of direct shadow pages in indirect MMUs. Given that using shadow paging with an unpaged guest is far from top priority _and_ has been subjected to the workaround since its inception, keep it simple and just fix the accounting glitch. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:32 -05:00
Sean Christopherson	55c510e26a	KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:31 -05:00
Sean Christopherson	428e921611	KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: `5bcaf3e171` ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:31 -05:00
Peter Xu	766576874b	kvm: x86: Allow to respond to generic signals during slow PF Enable x86 slow page faults to be able to respond to non-fatal signals, returning -EINTR properly when it happens. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195947.557281-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:28 -05:00
Peter Xu	c8b88b332b	kvm: Add interruptible flag to __gfn_to_pfn_memslot() Add a new "interruptible" flag showing that the caller is willing to be interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it up with a FOLL_INTERRUPTIBLE flag that we've just introduced. This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the SIGIPI) even during e.g. handling an userfaultfd page fault. No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:27 -05:00
Paolo Bonzini	b0b42197b5	KVM: x86: start moving SMM-related functions to new files Create a new header and source with code related to system management mode emulation. Entry and exit will move there too; for now, opportunistically rename put_smstate to PUT_SMSTATE while moving it to smm.h, and adjust the SMM state saving code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220929172016.319443-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:16 -05:00
Miaohe Lin	3adbdf8103	KVM: x86/mmu: use helper macro SPTE_ENT_PER_PAGE Use helper macro SPTE_ENT_PER_PAGE to get the number of spte entries per page. Minor readability improvement. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220913085452.25561-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:14 -05:00
Miaohe Lin	fa3e42037e	KVM: x86/mmu: fix some comment typos Fix some typos in comments. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220913091725.35953-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-11-09 12:31:14 -05:00
Linus Torvalds	ef688f8b8c	The first batch of KVM patches, mostly covering x86, which I am sending out early due to me travelling next week. There is a lone mm patch for which Andrew gave an informal ack at https://lore.kernel.org/linux-mm/20220817102500.440c6d0a3fce296fdf91bea6@linux-foundation.org. I will send the bulk of ARM work, as well as other architectures, at the end of next week. ARM: * Account stage2 page table allocations in memory stats. x86: * Account EPT/NPT arm64 page table allocations in memory stats. * Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR accesses. * Drop eVMCS controls filtering for KVM on Hyper-V, all known versions of Hyper-V now support eVMCS fields associated with features that are enumerated to the guest. * Use KVM's sanitized VMCS config as the basis for the values of nested VMX capabilities MSRs. * A myriad event/exception fixes and cleanups. Most notably, pending exceptions morph into VM-Exits earlier, as soon as the exception is queued, instead of waiting until the next vmentry. This fixed a longstanding issue where the exceptions would incorrecly become double-faults instead of triggering a vmexit; the common case of page-fault vmexits had a special workaround, but now it's fixed for good. * A handful of fixes for memory leaks in error paths. * Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow. * Never write to memory from non-sleepable kvm_vcpu_check_block() * Selftests refinements and cleanups. * Misc typo cleanups. Generic: * remove KVM_REQ_UNHALT -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmM2zwcUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNpbwf+MlVeOlzE5SBdrJ0TEnLmKUel1lSz QnZzP5+D65oD0zhCilUZHcg6G4mzZ5SdVVOvrGJvA0eXh25ruLNMF6jbaABkMLk/ FfI1ybN7A82hwJn/aXMI/sUurWv4Jteaad20JC2DytBCnsW8jUqc49gtXHS2QWy4 3uMsFdpdTAg4zdJKgEUfXBmQviweVpjjl3ziRyZZ7yaeo1oP7XZ8LaE1nR2l5m0J mfjzneNm5QAnueypOh5KhSwIvqf6WHIVm/rIHDJ1HIFbgfOU0dT27nhb1tmPwAcE +cJnnMUHjZqtCXteHkAxMClyRq0zsEoKk0OGvSOOMoq3Q0DavSXUNANOig== =/hqX -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "The first batch of KVM patches, mostly covering x86. ARM: - Account stage2 page table allocations in memory stats x86: - Account EPT/NPT arm64 page table allocations in memory stats - Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR accesses - Drop eVMCS controls filtering for KVM on Hyper-V, all known versions of Hyper-V now support eVMCS fields associated with features that are enumerated to the guest - Use KVM's sanitized VMCS config as the basis for the values of nested VMX capabilities MSRs - A myriad event/exception fixes and cleanups. Most notably, pending exceptions morph into VM-Exits earlier, as soon as the exception is queued, instead of waiting until the next vmentry. This fixed a longstanding issue where the exceptions would incorrecly become double-faults instead of triggering a vmexit; the common case of page-fault vmexits had a special workaround, but now it's fixed for good - A handful of fixes for memory leaks in error paths - Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow - Never write to memory from non-sleepable kvm_vcpu_check_block() - Selftests refinements and cleanups - Misc typo cleanups Generic: - remove KVM_REQ_UNHALT" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) KVM: remove KVM_REQ_UNHALT KVM: mips, x86: do not rely on KVM_REQ_UNHALT KVM: x86: never write to memory from kvm_vcpu_check_block() KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set KVM: x86: lapic does not have to process INIT if it is blocked KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed KVM: nVMX: Make an event request when pending an MTF nested VM-Exit KVM: x86: make vendor code check for all nested events mailmap: Update Oliver's email address KVM: x86: Allow force_emulation_prefix to be written without a reload KVM: selftests: Add an x86-only test to verify nested exception queueing KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events() KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions KVM: x86: Morph pending exceptions to pending VM-Exits at queue time ...	2022-10-09 09:39:55 -07:00
Wonhyuk Yang	faa03b3972	KVM: Add extra information in kvm_page_fault trace point Currently, kvm_page_fault trace point provide fault_address and error code. However it is not enough to find which cpu and instruction cause kvm_page_faults. So add vcpu id and instruction pointer in kvm_page_fault trace point. Cc: Baik Song An <bsahn@etri.re.kr> Cc: Hong Yeon Kim <kimhy@etri.re.kr> Cc: Taeung Song <taeung@reallinux.co.kr> Cc: linuxgeek@linuxgeek.io Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Link: https://lore.kernel.org/r/20220510071001.87169-1-vvghjk1234@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-09-26 12:02:30 -04:00
Miaohe Lin	604f533262	KVM: x86/mmu: add missing update to max_mmu_rmap_size The update to statistic max_mmu_rmap_size is unintentionally removed by commit `4293ddb788` ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte"). Add missing update to it or max_mmu_rmap_size will always be nonsensical 0. Fixes: `4293ddb788` ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Message-Id: <20220907080657.42898-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-09-22 17:03:20 -04:00
Yosry Ahmed	43a063cab3	KVM: x86/mmu: count KVM mmu usage in secondary pagetable stats. Count the pages used by KVM mmu on x86 in memory stats under secondary pagetable stats (e.g. "SecPageTables" in /proc/meminfo) to give better visibility into the memory consumption of KVM mmu in a similar way to how normal user page tables are accounted. Add the inner helper in common KVM, ARM will also use it to count stats in a future commit. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> # generic KVM changes Link: https://lore.kernel.org/r/20220823004639.2387269-3-yosryahmed@google.com Link: https://lore.kernel.org/r/20220823004639.2387269-4-yosryahmed@google.com [sean: squash x86 usage to workaround modpost issues] Signed-off-by: Sean Christopherson <seanjc@google.com>	2022-08-30 07:41:12 -07:00
Miaohe Lin	d7c9bfb9ca	KVM: x86/mmu: fix memoryleak in kvm_mmu_vendor_module_init() When register_shrinker() fails, KVM doesn't release the percpu counter kvm_total_used_mmu_pages leading to memoryleak. Fix this issue by calling percpu_counter_destroy() when register_shrinker() fails. Fixes: `ab271bd4df` ("x86: kvm: propagate register_shrinker return code") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Link: https://lore.kernel.org/r/20220823063237.47299-1-linmiaohe@huawei.com [sean: tweak shortlog and changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2022-08-24 13:47:49 -07:00
Junaid Shahid	b64d740ea7	kvm: x86: mmu: Always flush TLBs when enabling dirty logging When A/D bits are not available, KVM uses a software access tracking mechanism, which involves making the SPTEs inaccessible. However, the clear_young() MMU notifier does not flush TLBs. So it is possible that there may still be stale, potentially writable, TLB entries. This is usually fine, but can be problematic when enabling dirty logging, because it currently only does a TLB flush if any SPTEs were modified. But if all SPTEs are in access-tracked state, then there won't be a TLB flush, which means that the guest could still possibly write to memory and not have it reflected in the dirty bitmap. So just unconditionally flush the TLBs when enabling dirty logging. As an alternative, KVM could explicitly check the MMU-Writable bit when write-protecting SPTEs to decide if a flush is needed (instead of checking the Writable bit), but given that a flush almost always happens anyway, so just making it unconditional seems simpler. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20220810224939.2611160-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-19 07:38:03 -04:00
Junaid Shahid	1441ca1494	kvm: x86: mmu: Drop the need_remote_flush() function This is only used by kvm_mmu_pte_write(), which no longer actually creates the new SPTE and instead just clears the old SPTE. So we just need to check if the old SPTE was shadow-present instead of calling need_remote_flush(). Hence we can drop this function. It was incomplete anyway as it didn't take access-tracking into account. This patch should not result in any functional change. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220723024316.2725328-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-19 07:38:02 -04:00
Chao Peng	20ec3ebd70	KVM: Rename mmu_notifier_* to mmu_invalidate_* The motivation of this renaming is to make these variables and related helper functions less mmu_notifier bound and can also be used for non mmu_notifier based page invalidation. mmu_invalidate_* was chosen to better describe the purpose of 'invalidating' a page that those variables are used for. - mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end. - mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}. - mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count. - While here, also update kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count. No functional change intended. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-19 04:05:41 -04:00
Mingwei Zhang	1685c0f325	KVM: x86/mmu: rename trace function name for asynchronous page fault Rename the tracepoint function from trace_kvm_async_pf_doublefault() to trace_kvm_async_pf_repeated_fault() to make it clear, since double fault has nothing to do with this trace function. Asynchronous Page Fault (APF) is an artifact generated by KVM when it cannot find a physical page to satisfy an EPT violation. KVM uses APF to tell the guest OS to do something else such as scheduling other guest processes to make forward progress. However, when another guest process also touches a previously APFed page, KVM halts the vCPU instead of generating a repeated APF to avoid wasting cycles. Double fault (#DF) clearly has a different meaning and a different consequence when triggered. #DF requires two nested contributory exceptions instead of two page faults faulting at the same address. A prevous bug on APF indicates that it may trigger a double fault in the guest [1] and clearly this trace function has nothing to do with it. So rename this function should be a valid choice. No functional change intended. [1] https://www.spinics.net/lists/kvm/msg214957.html Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220807052141.69186-1-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-10 15:08:26 -04:00
Sean Christopherson	c3e0c8c2e8	KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change Fully re-evaluate whether or not MMIO caching can be enabled when SPTE masks change; simply clearing enable_mmio_caching when a configuration isn't compatible with caching fails to handle the scenario where the masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit location, and toggle compatibility from false=>true. Snapshot the original module param so that re-evaluating MMIO caching preserves userspace's desire to allow caching. Use a snapshot approach so that enable_mmio_caching still reflects KVM's actual behavior. Fixes: `8b9e74bfbf` ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled") Reported-by: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Tested-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220803224957.1285926-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-10 15:08:24 -04:00
Sean Christopherson	982bae43f1	KVM: x86: Tag kvm_mmu_x86_module_init() with __init Mark kvm_mmu_x86_module_init() with __init, the entire reason it exists is to initialize variables when kvm.ko is loaded, i.e. it must never be called after module initialization. Fixes: `1d0e848060` ("KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded") Cc: stable@vger.kernel.org Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220803224957.1285926-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-10 15:08:24 -04:00
Linus Torvalds	6614a3c316	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/ SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE= =w/UH -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...	2022-08-05 16:32:45 -07:00
Paolo Bonzini	31f6e3832a	KVM: x86/mmu: remove unused variable The last use of 'pfn' went away with the same-named argument to host_pfn_mapping_level; now that the hugepage level is obtained exclusively from the host page tables, kvm_mmu_zap_collapsible_spte does not need to know host pfns at all. Fixes: `a8ac499bb6` ("KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-08-01 07:30:00 -04:00
Sean Christopherson	6c6ab524cf	KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT Treat the NX bit as valid when using NPT, as KVM will set the NX bit when the NX huge page mitigation is enabled (mindblowing) and trigger the WARN that fires on reserved SPTE bits being set. KVM has required NX support for SVM since commit `b26a71a1a5` ("KVM: SVM: Refuse to load kvm_amd if NX support is not available") for exactly this reason, but apparently it never occurred to anyone to actually test NPT with the mitigation enabled. ------------[ cut here ]------------ spte = 0x800000018a600ee7, level = 2, rsvd bits = 0x800f0000001fe000 WARNING: CPU: 152 PID: 15966 at arch/x86/kvm/mmu/spte.c:215 make_spte+0x327/0x340 [kvm] Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022 RIP: 0010:make_spte+0x327/0x340 [kvm] Call Trace: <TASK> tdp_mmu_map_handle_target_level+0xc3/0x230 [kvm] kvm_tdp_mmu_map+0x343/0x3b0 [kvm] direct_page_fault+0x1ae/0x2a0 [kvm] kvm_tdp_page_fault+0x7d/0x90 [kvm] kvm_mmu_page_fault+0xfb/0x2e0 [kvm] npf_interception+0x55/0x90 [kvm_amd] svm_invoke_exit_handler+0x31/0xf0 [kvm_amd] svm_handle_exit+0xf6/0x1d0 [kvm_amd] vcpu_enter_guest+0xb6d/0xee0 [kvm] ? kvm_pmu_trigger_event+0x6d/0x230 [kvm] vcpu_run+0x65/0x2c0 [kvm] kvm_arch_vcpu_ioctl_run+0x355/0x610 [kvm] kvm_vcpu_ioctl+0x551/0x610 [kvm] __se_sys_ioctl+0x77/0xc0 __x64_sys_ioctl+0x1d/0x20 do_syscall_64+0x44/0xa0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220723013029.1753623-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:51:43 -04:00
Sean Christopherson	65e3b446bc	KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level() Add a comment to document how host_pfn_mapping_level() can be used safely, as the line between safe and dangerous is quite thin. E.g. if KVM were to ever support in-place promotion to create huge pages, consuming the level is safe if the caller holds mmu_lock and checks that there's an existing _leaf_ SPTE, but unsafe if the caller only checks that there's a non-leaf SPTE. Opportunistically tweak the existing comments to explicitly document why KVM needs to use READ_ONCE(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715232107.3775620-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:23 -04:00
Sean Christopherson	a8ac499bb6	KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs Drop the requirement that a pfn be backed by a refcounted, compound or or ZONE_DEVICE, struct page, and instead rely solely on the host page tables to identify huge pages. The PageCompound() check is a remnant of an old implementation that identified (well, attempt to identify) huge pages without walking the host page tables. The ZONE_DEVICE check was added as an exception to the PageCompound() requirement. In other words, neither check is actually a hard requirement, if the primary has a pfn backed with a huge page, then KVM can back the pfn with a huge page regardless of the backing store. Dropping the @pfn parameter will also allow KVM to query the max host mapping level without having to first get the pfn, which is advantageous for use outside of the page fault path where KVM wants to take action if and only if a page can be mapped huge, i.e. avoids the pfn lookup for gfns that can't be backed with a huge page. Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220715232107.3775620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:22 -04:00
Sean Christopherson	d5e90a6998	KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're used Restrict the mapping level for SPTEs based on the guest MTRRs if and only if KVM may actually use the guest MTRRs to compute the "real" memtype. For all forms of paging, guest MTRRs are purely virtual in the sense that they are completely ignored by hardware, i.e. they affect the memtype only if software manually consumes them. The only scenario where KVM consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the guest has non-coherent DMA, in all other cases KVM simply leaves the PAT field in SPTEs as '0' to encode WB memtype. Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing pfn is host MMIO, but false positives are ok as they only cause a slight performance blip (unless the guest is doing weird things with its MTRRs, which is extremely unlikely). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:22 -04:00
Sean Christopherson	3c2e10373e	KVM: x86/mmu: Remove underscores from __pte_list_remove() Remove the underscores from __pte_list_remove(), the function formerly known as pte_list_remove() is now named kvm_zap_one_rmap_spte() to show that it zaps rmaps/PTEs, i.e. doesn't just remove an entry from a list. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:18 -04:00
Sean Christopherson	9202aee816	KVM: x86/mmu: Rename pte_list_{destroy,remove}() to show they zap SPTEs Rename pte_list_remove() and pte_list_destroy() to kvm_zap_one_rmap_spte() and kvm_zap_all_rmap_sptes() respectively to document that (a) they zap SPTEs and (b) to better document how they differ (remove vs. destroy does not exactly scream "one vs. all"). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:18 -04:00
Sean Christopherson	f8480721a7	KVM: x86/mmu: Rename rmap zap helpers to eliminate "unmap" wrapper Rename kvm_unmap_rmap() and kvm_zap_rmap() to kvm_zap_rmap() and __kvm_zap_rmap() respectively to show that what was the "unmap" helper is just a wrapper for the "zap" helper, i.e. that they do the exact same thing, one just exists to deal with its caller passing in more params. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:17 -04:00
Sean Christopherson	2833eda0e2	KVM: x86/mmu: Rename __kvm_zap_rmaps() to align with other nomenclature Rename __kvm_zap_rmaps() to kvm_rmap_zap_gfn_range() to avoid future confusion with a soon-to-be-introduced __kvm_zap_rmap(). Using a plural "rmaps" is somewhat ambiguous without additional context, as it's not obvious whether it's referring to multiple rmap lists, versus multiple rmap entries within a single list. Use kvm_rmap_zap_gfn_range() to align with the pattern established by kvm_rmap_zap_collapsible_sptes(), without losing the information that it zaps only rmap-based MMUs, i.e. don't rename it to __kvm_zap_gfn_range(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:16 -04:00
Sean Christopherson	aed02fe3ca	KVM: x86/mmu: Drop the "p is for pointer" from rmap helpers Drop the trailing "p" from rmap helpers, i.e. rename functions to simply be kvm_<action>_rmap(). Declaring that a function takes a pointer is completely unnecessary and goes against kernel style. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:16 -04:00
Sean Christopherson	a42989e7fb	KVM: x86/mmu: Directly "destroy" PTE list when recycling rmaps Use pte_list_destroy() directly when recycling rmaps instead of bouncing through kvm_unmap_rmapp() and kvm_zap_rmapp(). Calling kvm_unmap_rmapp() is unnecessary and odd as it requires passing dummy parameters; passing NULL for @slot when __rmap_add() already has a valid slot is especially weird and confusing. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:15 -04:00
Sean Christopherson	35d539c3e4	KVM: x86/mmu: Return a u64 (the old SPTE) from mmu_spte_clear_track_bits() Return a u64, not an int, from mmu_spte_clear_track_bits(). The return value is the old SPTE value, which is very much a 64-bit value. The sole caller that consumes the return value, drop_spte(), already uses a u64. The only reason that truncating the SPTE value is not problematic is because drop_spte() only queries the shadow-present bit, which is in the lower 32 bits. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-28 13:22:15 -04:00
Sean Christopherson	dfd4eb444e	KVM: x86/mmu: Fix typo and tweak comment for split_desc_cache capacity Remove a spurious closing paranthesis and tweak the comment about the cache capacity for PTE descriptors (rmaps) eager page splitting to tone down the assertion slightly, and to call out that topup requires dropping mmu_lock, which is the real motivation for avoiding topup (as opposed to memory usage). Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-14 11:31:25 -04:00
Sean Christopherson	39944ab99c	KVM: x86/mmu: Expand quadrant comment for PG_LEVEL_4K shadow pages Tweak the comment above the computation of the quadrant for PG_LEVEL_4K shadow pages to explicitly call out how and why KVM uses role.quadrant to consume gPTE bits. Opportunistically wrap an unnecessarily long line. No functional change intended. Link: https://lore.kernel.org/all/YqvWvBv27fYzOFdE@google.com Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-14 11:31:24 -04:00
Sean Christopherson	79e48cec6c	KVM: x86/mmu: Add optimized helper to retrieve an SPTE's index Add spte_index() to dedup all the code that calculates a SPTE's index into its parent's page table and/or spt array. Opportunistically tweak the calculation to avoid pointer arithmetic, which is subtle (subtract in 8-byte chunks) and less performant (requires the compiler to generate the subtraction). Suggested-by: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-07-14 11:31:23 -04:00
Roman Gushchin	e33c267ab7	mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
Sean Christopherson	b9b71f4368	KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity Buffer split_desc_cache, the cache used to allcoate rmap list entries, only by the default cache capacity (currently 40), not by doubling the minimum (513). Aliasing L2 GPAs to L1 GPAs is uncommon, thus eager page splitting is unlikely to need 500+ entries. And because each object is a non-trivial 128 bytes (see struct pte_list_desc), those extra ~500 entries means KVM is in all likelihood wasting ~64kb of memory per VM. Link: https://lore.kernel.org/all/YrTDcrsn0%2F+alpzf@google.com Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220624171808.2845941-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-25 04:54:51 -04:00
Sean Christopherson	72ae5822b8	KVM: x86/mmu: Use "unsigned int", not "u32", for SPTEs' @access info Use an "unsigned int" for @access parameters instead of a "u32", mostly to be consistent throughout KVM, but also because "u32" is misleading. @access can actually squeeze into a u8, i.e. doesn't need 32 bits, but is as an "unsigned int" because sp->role.access is an unsigned int. No functional change intended. Link: https://lore.kernel.org/all/YqyZxEfxXLsHGoZ%2F@google.com Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220624171808.2845941-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-25 04:54:40 -04:00
Paolo Bonzini	0378739401	KVM: x86/mmu: Avoid unnecessary flush on eager page split The TLB flush before installing the newly-populated lower level page table is unnecessary if the lower-level page table maps the huge page identically. KVM knows it is if it did not reuse an existing shadow page table, tell drop_large_spte() to skip the flush in that case. Extracted from a patch by David Matlack. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:52:01 -04:00
David Matlack	ada51a9de7	KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs Add support for Eager Page Splitting pages that are mapped by nested MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB pages, and then splitting all 2MiB pages to 4KiB pages. Note, Eager Page Splitting is limited to nested MMUs as a policy rather than due to any technical reason (the sp->role.guest_mode check could just be deleted and Eager Page Splitting would work correctly for all shadow MMU pages). There is really no reason to support Eager Page Splitting for tdp_mmu=N, since such support will eventually be phased out, and there is no current use case supporting Eager Page Splitting on hosts where TDP is either disabled or unavailable in hardware. Furthermore, future improvements to nested MMU scalability may diverge the code from the legacy shadow paging implementation. These improvements will be simpler to make if Eager Page Splitting does not have to worry about legacy shadow paging. Splitting huge pages mapped by nested MMUs requires dealing with some extra complexity beyond that of the TDP MMU: (1) The shadow MMU has a limit on the number of shadow pages that are allowed to be allocated. So, as a policy, Eager Page Splitting refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer pages available. (2) Splitting a huge page may end up re-using an existing lower level shadow page tables. This is unlike the TDP MMU which always allocates new shadow page tables when splitting. (3) When installing the lower level SPTEs, they must be added to the rmap which may require allocating additional pte_list_desc structs. Case (2) is especially interesting since it may require a TLB flush, unlike the TDP MMU which can fully split huge pages without any TLB flushes. Specifically, an existing lower level page table may point to even lower level page tables that are not fully populated, effectively unmapping a portion of the huge page, which requires a flush. As of this commit, a flush is always done always after dropping the huge page and before installing the lower level page table. This TLB flush could instead be delayed until the MMU lock is about to be dropped, which would batch flushes for multiple splits. However these flushes should be rare in practice (a huge page must be aliased in multiple SPTEs and have been split for NX Huge Pages in only some of them). Flushing immediately is simpler to plumb and also reduces the chances of tripping over a CPU bug (e.g. see iTLB multihit). [ This commit is based off of the original implementation of Eager Page Splitting from Peter in Google's kernel from 2016. ] Suggested-by: Peter Feiner <pfeiner@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-23-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:52:00 -04:00
Paolo Bonzini	0cd8dc7398	KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() Before allocating a child shadow page table, all callers check whether the parent already points to a huge page and, if so, they drop that SPTE. This is done by drop_large_spte(). However, dropping the large SPTE is really only necessary before the sp is installed. While the sp is returned by kvm_mmu_get_child_sp(), installing it happens later in __link_shadow_page(). Move the call there instead of having it in each and every caller. To ensure that the shadow page is not linked twice if it was present, do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent: instead, return an error value if the shadow page already existed. This is a bit more verbose, but clearer than NULL. Finally, now that the drop_large_spte() name is not taken anymore, remove the two underscores in front of __drop_large_spte(). Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:59 -04:00
David Matlack	20d49186c0	KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This is fine for now since KVM never creates intermediate huge pages during dirty logging. In other words, KVM always replaces 1GiB pages directly with 4KiB pages, so there is no reason to look for collapsible 2MiB pages. However, this will stop being true once the shadow MMU participates in eager page splitting. During eager page splitting, each 1GiB is first split into 2MiB pages and then those are split into 4KiB pages. The intermediate 2MiB pages may be left behind if an error condition causes eager page splitting to bail early. No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-20-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:59 -04:00
David Matlack	6a97575d5c	KVM: x86/mmu: Cache the access bits of shadowed translations Splitting huge pages requires allocating/finding shadow pages to replace the huge page. Shadow pages are keyed, in part, off the guest access permissions they are shadowing. For fully direct MMUs, there is no shadowing so the access bits in the shadow page role are always ACC_ALL. But during shadow paging, the guest can enforce whatever access permissions it wants. In particular, eager page splitting needs to know the permissions to use for the subpages, but KVM cannot retrieve them from the guest page tables because eager page splitting does not have a vCPU. Fortunately, the guest access permissions are easy to cache whenever page faults or FNAME(sync_page) update the shadow page tables; this is an extension of the existing cache of the shadowed GFNs in the gfns array of the shadow page. The access bits only take up 3 bits, which leaves 61 bits left over for gfns, which is more than enough. Now that the gfns array caches more information than just GFNs, rename it to shadowed_translation. While here, preemptively fix up the WARN_ON() that detects gfn mismatches in direct SPs. The WARN_ON() was paired with a pr_err_ratelimited(), which means that users could sometimes see the WARN without the accompanying error message. Fix this by outputting the error message as part of the WARN splat, and opportunistically make them WARN_ONCE() because if these ever fire, they are all but guaranteed to fire a lot and will bring down the kernel. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-18-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:58 -04:00
David Matlack	81cb4657e9	KVM: x86/mmu: Update page stats in __rmap_add() Update the page stats in __rmap_add() rather than at the call site. This will avoid having to manually update page stats when splitting huge pages in a subsequent commit. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:58 -04:00
David Matlack	2ff9039a75	KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu Allow adding new entries to the rmap and linking shadow pages without a struct kvm_vcpu pointer by moving the implementation of rmap_add() and link_shadow_page() into inner helper functions. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:57 -04:00
David Matlack	6ec6509eea	KVM: x86/mmu: Pass const memslot to rmap_add() Constify rmap_add()'s @slot parameter; it is simply passed on to gfn_to_rmap(), which takes a const memslot. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-15-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:57 -04:00
David Matlack	cbd858b17e	KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page() Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync indirect shadow pages, so it's safe to pass in NULL when looking up direct shadow pages. This will be used for doing eager page splitting, which allocates direct shadow pages from the context of a VM ioctl without access to a vCPU pointer. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-14-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:57 -04:00
David Matlack	3cc736b357	KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() Get the kvm pointer from the caller, rather than deriving it from vcpu->kvm, and plumb the kvm pointer all the way from kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer is only needed to sync indirect shadow pages. In other words, __kvm_mmu_get_shadow_page() can now be used to get direct shadow pages without a vcpu pointer. This enables eager page splitting, which needs to allocate direct shadow pages during VM ioctls. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-13-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:56 -04:00
David Matlack	336081fb3f	KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-12-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:56 -04:00
David Matlack	2f8b1b539b	KVM: x86/mmu: Pass memory caches to allocate SPs separately Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it will allocate the various pieces of memory for shadow pages as a parameter, rather than deriving them from the vcpu pointer. This will be useful in a future commit where shadow pages are allocated during VM ioctls for eager page splitting, and thus will use a different set of caches. Preemptively pull the caches out all the way to kvm_mmu_get_shadow_page() since eager page splitting will not be calling kvm_mmu_alloc_shadow_page() directly. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:56 -04:00
David Matlack	be91177133	KVM: x86/mmu: Move guest PT write-protection to account_shadowed() Move the code that write-protects newly-shadowed guest page tables into account_shadowed(). This avoids a extra gfn-to-memslot lookup and is a more logical place for this code to live. But most importantly, this reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct kvm_vcpu pointer, which will be necessary when creating new shadow pages during VM ioctls for eager page splitting. Note, it is safe to drop the role.level == PG_LEVEL_4K check since account_shadowed() returns early if role.level > PG_LEVEL_4K. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:55 -04:00
David Matlack	876546436d	KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages Rename 2 functions: kvm_mmu_get_page() -> kvm_mmu_get_shadow_page() kvm_mmu_free_page() -> kvm_mmu_free_shadow_page() This change makes it clear that these functions deal with shadow pages rather than struct pages. It also aligns these functions with the naming scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page(). Prefer "shadow_page" over the shorter "sp" since these are core functions and the line lengths aren't terrible. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:55 -04:00
David Matlack	c306aec81a	KVM: x86/mmu: Consolidate shadow page allocation and initialization Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under the latter so that all shadow page allocation and initialization happens in one place. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:54 -04:00
David Matlack	94c8136448	KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions Decompose kvm_mmu_get_page() into separate helper functions to increase readability and prepare for allocating shadow pages without a vcpu pointer. Specifically, pull the guts of kvm_mmu_get_page() into 2 helper functions: kvm_mmu_find_shadow_page() - Walks the page hash checking for any existing mmu pages that match the given gfn and role. kvm_mmu_alloc_shadow_page() Allocates and initializes an entirely new kvm_mmu_page. This currently requries a vcpu pointer for allocation and looking up the memslot but that will be removed in a future commit. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:54 -04:00
David Matlack	7f49777550	KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes The quadrant is only used when gptes are 4 bytes, but mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE page directories regardless. Make this less confusing by only passing in a non-zero quadrant when it is actually necessary. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:54 -04:00
David Matlack	2e65e842c5	KVM: x86/mmu: Derive shadow MMU page role from parent Instead of computing the shadow page role from scratch for every new page, derive most of the information from the parent shadow page. This eliminates the dependency on the vCPU root role to allocate shadow page tables, and reduces the number of parameters to kvm_mmu_get_page(). Preemptively split out the role calculation to a separate function for use in a following commit. Note that when calculating the MMU root role, we can take @role.passthrough, @role.direct, and @role.access directly from @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still must be overridden for PAE page directories, when shadowing 32-bit guest page tables with PAE page tables. No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:53 -04:00
David Matlack	86938ab692	KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root() The "direct" argument is vcpu->arch.mmu->root_role.direct, because unlike non-root page tables, it's impossible to have a direct root in an indirect MMU. So just use that. Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:53 -04:00
David Matlack	27a59d57f0	KVM: x86/mmu: Use a bool for direct The parameter "direct" can either be true or false, and all of the callers pass in a bool variable or true/false literal, so just use the type bool. No functional change intended. Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:53 -04:00
David Matlack	bb924ca69f	KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs Commit `fb58a9c345` ("KVM: x86/mmu: Optimize MMU page cache lookup for fully direct MMUs") skipped the unsync checks and write flood clearing for full direct MMUs. We can extend this further to skip the checks for all direct shadow pages. Direct shadow pages in indirect MMUs (i.e. shadow paging) are used when shadowing a guest huge page with smaller pages. Such direct shadow pages, like their counterparts in fully direct MMUs, are never marked unsynced or have a non-zero write-flooding count. Checking sp->role.direct also generates better code than checking direct_map because, due to register pressure, direct_map has to get shoved onto the stack and then pulled back off. No functional change intended. Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:51:52 -04:00
Sean Christopherson	5d49f08c2e	KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level() Move the check that restricts mapping huge pages into the guest to pfns that are backed by refcounted 'struct page' memory into the helper that actually "requires" a 'struct page', host_pfn_mapping_level(). In addition to deduplicating code, moving the check to the helper eliminates the subtle requirement that the caller check that the incoming pfn is backed by a refcounted struct page, and as an added bonus avoids an extra pfn_to_page() lookup. Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs to stay where it is, as it guards against dereferencing a NULL memslot in the kvm_slot_dirty_track_enabled() that follows. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:36 -04:00
Sean Christopherson	b14b2690c5	KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() to better reflect what KVM is actually checking, and to eliminate extra pfn_to_page() lookups. The kvm_release_pfn_*() an kvm_try_get_pfn() helpers in particular benefit from "refouncted" nomenclature, as it's not all that obvious why KVM needs to get/put refcounts for some PG_reserved pages (ZERO_PAGE and ZONE_DEVICE). Add a comment to call out that the list of exceptions to PG_reserved is all but guaranteed to be incomplete. The list has mostly been compiled by people throwing noodles at KVM and finding out they stick a little too well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages didn't get freed. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:35 -04:00
Sean Christopherson	284dc49307	KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page() Operate on a 'struct page' instead of a pfn when checking if a page is a ZONE_DEVICE page, and rename the helper accordingly. Generally speaking, KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do anything special for ZONE_DEVICE memory. Rather, KVM wants to treat ZONE_DEVICE memory like regular memory, and the need to identify ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In other words, KVM should only ever check for ZONE_DEVICE memory after KVM has already verified that there is a struct page associated with the pfn. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:34 -04:00
Sean Christopherson	70e41c31bc	KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA mask Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and EPT paging. Both PAGE_MASK and the new-common logic are supsersets of what is actually needed for 32-bit paging. PAGE_MASK sets bits 63:12 and the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of which value is used, the result will always be bits 31:12. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:30 -04:00
Sean Christopherson	2ca3129e80	KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:28 -04:00
Sean Christopherson	42c88ff893	KVM: x86/mmu: Dedup macros for computing various page table masks Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:27 -04:00
Sean Christopherson	b3fcdb04a9	KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h Move a handful of one-off macros and helpers for 32-bit PSE paging into paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance should anything but 32-bit shadow paging care about PSE paging. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-20 06:21:26 -04:00
Uros Bizjak	2db2f46fdf	KVM: x86/mmu: Use try_cmpxchg64 in fast_pf_fix_direct_spte Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in fast_pf_fix_direct_spte. cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Message-Id: <20220520144635.63134-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-15 08:12:17 -04:00
Lai Jiangshan	024c3c3304	KVM: X86/MMU: Remove useless mmu_topup_memory_caches() in kvm_mmu_pte_write() Since the commit c5e2184d1544("KVM: x86/mmu: Remove the defunct update_pte() paging hook"), kvm_mmu_pte_write() no longer uses the rmap cache. So remove mmu_topup_memory_caches() in it. Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220605063417.308311-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-15 08:07:53 -04:00
Lai Jiangshan	fc10020ac9	KVM: X86/MMU: Remove unused PT32_DIR_BASE_ADDR_MASK from mmu.c It is unused. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220605063417.308311-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-15 08:07:52 -04:00
Yuan Yao	d2263de137	KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs Assign shadow_me_value, not shadow_me_mask, to PAE root entries, a.k.a. shadow PDPTRs, when host memory encryption is supported. The "mask" is the set of all possible memory encryption bits, e.g. MKTME KeyIDs, whereas "value" holds the actual value that needs to be stuffed into host page tables. Using shadow_me_mask results in a failed VM-Entry due to setting reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to physical addresses with non-zero MKTME bits sending to_shadow_page() into the weeds: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state. BUG: unable to handle page fault for address: ffd43f00063049e8 PGD 86dfd8067 P4D 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm] kvm_mmu_free_roots+0xd1/0x200 [kvm] __kvm_mmu_unload+0x29/0x70 [kvm] kvm_mmu_unload+0x13/0x20 [kvm] kvm_arch_destroy_vm+0x8a/0x190 [kvm] kvm_put_kvm+0x197/0x2d0 [kvm] kvm_vm_release+0x21/0x30 [kvm] __fput+0x8e/0x260 ____fput+0xe/0x10 task_work_run+0x6f/0xb0 do_exit+0x327/0xa90 do_group_exit+0x35/0xa0 get_signal+0x911/0x930 arch_do_signal_or_restart+0x37/0x720 exit_to_user_mode_prepare+0xb2/0x140 syscall_exit_to_user_mode+0x16/0x30 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: `e54f1ff244` ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask") Signed-off-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220608012015.19566-1-yuan.yao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-09 10:52:16 -04:00
Paolo Bonzini	66da65005a	KVM/riscv fixes for 5.19, take #1 - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmKhgGkACgkQrUjsVaLH LAfCCw/+LEz6af/lm4PXr5CGkJ91xq2xMpU9o39jMBm0RltzsG7zt/90SaUdK/Oz MpX7CLFgb1Cm2ZZ/+l5cBlLc7NUaMMxHH9dpScyrYC8xAb75QYimpe/jfjuMyXjO IaYJB2WCs2gfTYXA58c4sB2WR5rLahLnQGJrwW2CfMSvpv/nAyEZyWYtgXw8tSxH oM06Z/cLWU53uWuX0hbKAVQMdAIrQK5H+z46bhbpFC6gk/XSvaBBEngoOiiE6lC6 uM8i8ZIeUgqSeWWreczd6H25eYwyLuVxXHWSIgbdvEcvBUn0VDO+Ox4UA2ab3g3d uHubqdRY5GnrkbRK0ue6tXfON8NxGlKwlcc6kp9Vqxb3Jxjr2qwToTubHYAUVXUi XzrvSxoZRRikwstb1+PNXECCNYUHkNdj4FBA4WoF0Y3Br1IfSwZLUX+EKkY/DHv+ L4MhFFNqsQPzVly2wNiyxuWwRQyxupHekeMQlp13P9vZnGcptxxEyuQlM1Hf40ST iiOC8L+TCQzc5dN156/KjQIUFPud4huJO+0xHQtang628yVzQazzcxD+ialPkcqt JnpMmNbvvNzFYLoB3dQ/36flmYRA6SbK4Tt4bdhls+UcnLnfHDZow7OLmX5yj8+A QiKx6IOS6KI10LXhVZguAmZuKjXajyLVaCWpBl0tV6XpV9Y5t98= =w6dT -----END PGP SIGNATURE----- Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv fixes for 5.19, take #1 - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry	2022-06-09 09:45:00 -04:00
Shaoqin Huang	cf4a8693d9	KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() When freeing obsolete previous roots, check prev_roots as intended, not the current root. Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com> Fixes: `527d5cd7ee` ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped") Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-07 11:28:48 -04:00
Linus Torvalds	bf9095424d	S390: * ultravisor communication device driver * fix TEID on terminating storage key ops RISC-V: * Added Sv57x4 support for G-stage page table * Added range based local HFENCE functions * Added remote HFENCE functions based on VCPU requests * Added ISA extension registers in ONE_REG interface * Updated KVM RISC-V maintainers entry to cover selftests support ARM: * Add support for the ARMv8.6 WFxT extension * Guard pages for the EL2 stacks * Trap and emulate AArch32 ID registers to hide unsupported features * Ability to select and save/restore the set of hypercalls exposed to the guest * Support for PSCI-initiated suspend in collaboration with userspace * GICv3 register-based LPI invalidation support * Move host PMU event merging into the vcpu data structure * GICv3 ITS save/restore fixes * The usual set of small-scale cleanups and fixes x86: * New ioctls to get/set TSC frequency for a whole VM * Allow userspace to opt out of hypercall patching * Only do MSR filtering for MSRs accessed by rdmsr/wrmsr AMD SEV improvements: * Add KVM_EXIT_SHUTDOWN metadata for SEV-ES * V_TSC_AUX support Nested virtualization improvements for AMD: * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) * Allow AVIC to co-exist with a nested guest running * Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support * PAUSE filtering for nested hypervisors Guest support: * Decoupling of vcpu_is_preempted from PV spinlocks -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKN9M4UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNLeAf+KizAlQwxEehHHeNyTkZuKyMawrD6 zsqAENR6i1TxiXe7fDfPFbO2NR0ZulQopHbD9mwnHJ+nNw0J4UT7g3ii1IAVcXPu rQNRGMVWiu54jt+lep8/gDg0JvPGKVVKLhxUaU1kdWT9PhIOC6lwpP3vmeWkUfRi PFL/TMT0M8Nfryi0zHB0tXeqg41BiXfqO8wMySfBAHUbpv8D53D2eXQL6YlMM0pL 2quB1HxHnpueE5vj3WEPQ3PCdy1M2MTfCDBJAbZGG78Ljx45FxSGoQcmiBpPnhJr C6UGP4ZDWpml5YULUoA70k5ylCbP+vI61U4vUtzEiOjHugpPV5wFKtx5nw== =ozWx -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "S390: - ultravisor communication device driver - fix TEID on terminating storage key ops RISC-V: - Added Sv57x4 support for G-stage page table - Added range based local HFENCE functions - Added remote HFENCE functions based on VCPU requests - Added ISA extension registers in ONE_REG interface - Updated KVM RISC-V maintainers entry to cover selftests support ARM: - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes x86: - New ioctls to get/set TSC frequency for a whole VM - Allow userspace to opt out of hypercall patching - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr AMD SEV improvements: - Add KVM_EXIT_SHUTDOWN metadata for SEV-ES - V_TSC_AUX support Nested virtualization improvements for AMD: - Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) - Allow AVIC to co-exist with a nested guest running - Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support - PAUSE filtering for nested hypervisors Guest support: - Decoupling of vcpu_is_preempted from PV spinlocks" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits) KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest KVM: selftests: x86: Sync the new name of the test case to .gitignore Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND x86, kvm: use correct GFP flags for preemption disabled KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer x86/kvm: Alloc dummy async #PF token outside of raw spinlock KVM: x86: avoid calling x86 emulator without a decoded instruction KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave) s390/uv_uapi: depend on CONFIG_S390 KVM: selftests: x86: Fix test failure on arch lbr capable platforms KVM: LAPIC: Trace LAPIC timer expiration on every vmentry KVM: s390: selftest: Test suppression indication on key prot exception KVM: s390: Don't indicate suppression on dirtying, failing memop selftests: drivers/s390x: Add uvdevice tests drivers/s390/char: Add Ultravisor io device MAINTAINERS: Update KVM RISC-V entry to cover selftests support RISC-V: KVM: Introduce ISA extension register RISC-V: KVM: Cleanup stale TLB entries when host CPU changes RISC-V: KVM: Add remote HFENCE functions based on VCPU requests ...	2022-05-26 14:20:14 -07:00
Paolo Bonzini	9f46c187e2	KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID With shadow paging enabled, the INVPCID instruction results in a call to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the invlpg callback is not set and the result is a NULL pointer dereference. Fix it trivially by checking for mmu->invlpg before every call. There are other possibilities: - check for CR0.PG, because KVM (like all Intel processors after P5) flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a nop with paging disabled - check for EFER.LMA, because KVM syncs and flushes when switching MMU contexts outside of 64-bit mode All of these are tricky, go for the simple solution. This is CVE-2022-1789. Reported-by: Yongkang Jia <kangel@zju.edu.cn> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-20 13:49:52 -04:00
Sean Christopherson	b28cb0cd2c	KVM: x86/mmu: Update number of zapped pages even if page list is stable When zapping obsolete pages, update the running count of zapped pages regardless of whether or not the list has become unstable due to zapping a shadow page with its own child shadow pages. If the VM is backed by mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping the batch count and thus without yielding. In the worst case scenario, this can cause a soft lokcup. watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020] RIP: 0010:workingset_activation+0x19/0x130 mark_page_accessed+0x266/0x2e0 kvm_set_pfn_accessed+0x31/0x40 mmu_spte_clear_track_bits+0x136/0x1c0 drop_spte+0x1a/0xc0 mmu_page_zap_pte+0xef/0x120 __kvm_mmu_prepare_zap_page+0x205/0x5e0 kvm_mmu_zap_all_fast+0xd7/0x190 kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10 kvm_page_track_flush_slot+0x5c/0x80 kvm_arch_flush_shadow_memslot+0xe/0x10 kvm_set_memslot+0x1a8/0x5d0 __kvm_set_memory_region+0x337/0x590 kvm_vm_ioctl+0xb08/0x1040 Fixes: `fbb158cb88` ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""") Reported-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220511145122.3133334-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 10:09:51 -04:00
Vipin Sharma	6ba1e04fa6	KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps Avoid calling handlers on empty rmap entries and skip to the next non empty rmap entry. Empty rmap entries are noop in handlers. Signed-off-by: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220502220347.174664-1-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:45 -04:00
Kai Huang	e54f1ff244	KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of high bits of physical address bits as 'KeyID' bits. Intel Trust Domain Extentions (TDX) further steals part of MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set any legacy MKTME KeyID bits to any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTE which maps guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTE which maps guest memory. MKTME KeyID bits should be set to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask to SPTE, and shadow_me_shadow is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME keyID bits doesn't work for VMX (as oppositely, they must be set to shadow_zero_check). Introduce a new 'shadow_me_value' to replace existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic of them will be: - shadow_me_value: the memory encryption bit(s) that will be set to the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (which is a super set of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's life time. This perhaps doesn't fit MKTME but for now host kernel doesn't support it (and perhaps will never do). - Bits in shadow_me_mask are set to shadow_zero_check, except the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether remote TLB flush is needed. This should still use shadow_me_mask as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:44 -04:00
Kai Huang	c919e881ba	KVM: x86/mmu: Rename reset_rsvds_bits_mask() Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make it clearer that it resets the reserved bits check for guest's page table entries. Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:44 -04:00
Sean Christopherson	1075d41efd	KVM: x86/mmu: Expand and clean up page fault stats Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:43 -04:00
Sean Christopherson	8a009d5bca	KVM: x86/mmu: Make all page fault handlers internal to the MMU Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:42 -04:00
Sean Christopherson	5276c616ab	KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and kvm_faultin_pfn() to signal that the page fault handler should continue doing its thing. Aside from being gross and inefficient, using a boolean return to signal continue vs. stop makes it extremely difficult to add more helpers and/or move existing code to a helper. E.g. hypothetically, if nested MMUs were to gain a separate page fault handler in the future, everything up to the "is self-modifying PTE" check can be shared by all shadow MMUs, but communicating up the stack whether to continue on or stop becomes a nightmare. More concretely, proposed support for private guest memory ran into a similar issue, where it'll be forced to forego a helper in order to yield sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com. No functional change intended. Cc: David Matlack <dmatlack@google.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:42 -04:00
Sean Christopherson	5c64aba517	KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" Tweak the "page fault can be fast" logic to explicitly check for !PRESENT faults in the access tracking case, and drop the exec/NX check that becomes redundant as a result. No sane hardware will generate an access that is both an instruct fetch and a write, i.e. it's a waste of cycles. If hardware goes off the rails, or KVM runs under a misguided hypervisor, spuriously running throught fast path is benign (KVM has been uknowingly being doing exactly that for years). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:41 -04:00
Sean Christopherson	54275f74cf	KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use Check for A/D bits being disabled instead of the access tracking mask being non-zero when deciding whether or not to attempt to fix a page fault vian the fast path. Originally, the access tracking mask was non-zero if and only if A/D bits were disabled by _KVM_ (including not being supported by hardware), but that hasn't been true since nVMX was fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause KVM to not use A/D bits while running L2 despite KVM using them while running L1. In other words, don't attempt the fast path just because EPT is enabled. Note, attempting the fast path for all !PRESENT faults can "fix" a very, _VERY_ tiny percentage of faults out of mmu_lock by detecting that the fault is spurious, i.e. has been fixed by a different vCPU, but again the odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM gets less than 10 successes out of 30k+ faults, and that's likely one of the more favorable scenarios. Disabling dirty logging can likely lead to a rash of collisions between vCPUs for some workloads that operate on a common set of pages, but penalizing _all_ !PRESENT faults for that one case is unlikely to be a net positive, not to mention that that problem is best solved by not zapping in the first place. The number of spurious faults does scale with the number of vCPUs, e.g. a 255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast path (again out of 30k), but that's all of 0.2% of faults. Using legacy shadow paging does get more spurious faults, and a few more detected out of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring faults that are reflected into the guest), i.e. the extra detections are purely due to the sheer number of faults observed. On the other hand, getting a "negative" in the fast path takes in the neighborhood of 150-250 cycles. So while it is tempting to keep/extend the current behavior, such a change needs to come with hard numbers showing that it's actually a win in the grand scheme, or any scheme for that matter. Fixes: `995f00a619` ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-12 09:51:41 -04:00
Paolo Bonzini	6ea6581f12	Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEAD We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock is held for write, as volatile SPTEs can be written by other tasks/vCPUs outside of mmu_lock. Attempting to prove that bug exposed another notable goof, which has been lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs as volatile, even though KVM never clears WRITABLE outside of MMU lock. As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to update writable SPTEs. The fix does not seem to have an easily-measurable affect on performance; page faults are so slow that wasting even a few hundred cycles is dwarfed by the base cost.	2022-05-03 07:29:30 -04:00
Sean Christopherson	54eb3ef56f	KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() Move the is_shadow_present_pte() check out of spte_has_volatile_bits() and into its callers. Well, caller, since only one of its two callers doesn't already do the shadow-present check. Opportunistically move the helper to spte.c/h so that it can be used by the TDP MMU, which is also the primary motivation for the shadow-present change. Unlike the legacy MMU, the TDP MMU uses a single path for clear leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP MMU will need to check is_last_spte() prior to calling spte_has_volatile_bits(), and calling is_last_spte() without first calling is_shadow_present_spte() is at best odd, and at worst a violation of KVM's loosely defines SPTE rules. Note, mmu_spte_clear_track_bits() could likely skip the write entirely for SPTEs that are not shadow-present. Leave that cleanup for a future patch to avoid introducing a functional change, and because the shadow-present check can likely be moved further up the stack, e.g. drop_large_spte() appears to be the only path that doesn't already explicitly check for a shadow-present SPTE. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-03 07:22:32 -04:00
Sean Christopherson	706c9c55e5	KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) Don't treat SPTEs that are truly writable, i.e. writable in hardware, as being volatile (unless they're volatile for other reasons, e.g. A/D bits). KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get cleared just because the SPTE is MMU-writable. Rename the wrapper of MMU-writable to be more literal, the previous name of spte_can_locklessly_be_made_writable() is wrong and misleading. Fixes: `c7ba5b48cc` ("KVM: MMU: fast path of handling guest page fault") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-05-03 07:22:31 -04:00
Lai Jiangshan	84e5ffd045	KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is allocated with role.level = 5 and the guest pagetable's root gfn. And root_sp->spt[0] is also allocated with the same gfn and the same role except role.level = 4. Luckily that they are different shadow pages, but only root_sp->spt[0] is the real translation of the guest pagetable. Here comes a problem: If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse) and uses the same gfn as the root page for nested NPT before and after switching gCR4_LA57. The host (hCR4_LA57=1) might use the same root_sp for the guest even the guest switches gCR4_LA57. The guest will see unexpected page mapped and L2 may exploit the bug and hurt L1. It is lucky that the problem can't hurt L0. And three special cases need to be handled: The root_sp should be like role.direct=1 sometimes: its contents are not backed by gptes, root_sp->gfns is meaningless. (For a normal high level sp in shadow paging, sp->gfns is often unused and kept zero, but it could be relevant and meaningful if sp->gfns is used because they are backed by concrete gptes.) For such root_sp in the case, root_sp is just a portal to contribute root_sp->spt[0], and root_sp->gfns should not be used and root_sp->spt[0] should not be dropped if gpte[0] of the guest root pagetable is changed. Such root_sp should not be accounted too. So add role.passthrough to distinguish the shadow pages in the hash when gCR4_LA57 is toggled and fix above special cases by using it in kvm_mmu_page_{get\|set}_gfn() and sp_has_gptes(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:50:00 -04:00
Lai Jiangshan	767d8d8d50	KVM: X86/MMU: Add sp_has_gptes() Add sp_has_gptes() which equals to !sp->role.direct currently. Shadow page having gptes needs to be write-protected, accounted and responded to kvm_mmu_pte_write(). Use it in these places to replace !sp->role.direct and rename for_each_gfn_indirect_valid_sp. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:50:00 -04:00
Paolo Bonzini	347a0d0ded	KVM: x86/mmu: replace direct_map with root_role.direct direct_map is always equal to the direct field of the root page's role: - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is copied from cpu_role.base.direct - for TDP, it is always true and root_role.direct is also always true - for shadow TDP, it is always false and root_role.direct is also always false Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:59 -04:00
Paolo Bonzini	4d25502aa1	KVM: x86/mmu: replace root_level with cpu_role.base.level Remove another duplicate field of struct kvm_mmu. This time it's the root level for page table walking; the separate field is always initialized as cpu_role.base.level, so its users can look up the CPU mode directly instead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:58 -04:00
Paolo Bonzini	a972e29c1d	KVM: x86/mmu: replace shadow_root_level with root_role.level root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:58 -04:00
Paolo Bonzini	a7f1de9b60	KVM: x86/mmu: pull CPU mode computation to kvm_init_mmu Do not lead init_kvm_*mmu into the temptation of poking into struct kvm_mmu_role_regs, by passing to it directly the CPU mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:57 -04:00
Paolo Bonzini	56b321f9e3	KVM: x86/mmu: simplify and/or inline computation of shadow MMU roles Shadow MMUs compute their role from cpu_role.base, simply by adjusting the root level. It's one line of code, so do not place it in a separate function. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:57 -04:00
Paolo Bonzini	faf729621c	KVM: x86/mmu: remove redundant bits from extended role Before the separation of the CPU and the MMU role, CR0.PG was not available in the base MMU role, because two-dimensional paging always used direct=1 in the MMU role. However, now that the raw role is snapshotted in mmu->cpu_role, the value of CR0.PG always matches both !cpu_role.base.direct and cpu_role.base.level > 0. There is no need to store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg accessor by hand that takes care of the conversion. Use cpu_role.base.level since the future of the direct field is unclear. Likewise, CR4.PAE is now always present in the CPU role as !cpu_role.base.has_4_byte_gpte. The inversion makes certain tests on the MMU role easier, and is easily hidden by the is_cr4_pae accessor when operating on the CPU role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:57 -04:00
Paolo Bonzini	7a7ae82923	KVM: x86/mmu: rename kvm_mmu_role union It is quite confusing that the "full" union is called kvm_mmu_role but is used for the "cpu_role" field of struct kvm_mmu. Rename it to kvm_cpu_role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:56 -04:00
Paolo Bonzini	7a458f0e1b	KVM: x86/mmu: remove extended bits from mmu_role, rename field mmu_role represents the role of the root of the page tables. It does not need any extended bits, as those govern only KVM's page table walking; the is_* functions used for page table walking always use the CPU role. ext.valid is not present anymore in the MMU role, but an all-zero MMU role is impossible because the level field is never zero in the MMU role. So just zap the whole mmu_role in order to force invalidation after CPUID is updated. While making this change, which requires touching almost every occurrence of "mmu_role", rename it to "root_role". Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:56 -04:00
Paolo Bonzini	362505deb8	KVM: x86/mmu: store shadow EFER.NX in the MMU role Now that the MMU role is separate from the CPU role, it can be a truthful description of the format of the shadow pages. This includes whether the shadow pages use the NX bit; so force the efer_nx field of the MMU role when TDP is disabled, and remove the hardcoding it in the callers of reset_shadow_zero_bits_mask. In fact, the initialization of reserved SPTE bits can now be made common to shadow paging and shadow NPT; move it to shadow_mmu_init_context. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:55 -04:00
Paolo Bonzini	f417e1459a	KVM: x86/mmu: cleanup computation of MMU roles for shadow paging Pass the already-computed CPU role, instead of redoing it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:55 -04:00
Paolo Bonzini	2ba676774d	KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging Inline kvm_calc_mmu_role_common into its sole caller, and simplify it by removing the computation of unnecessary bits. Extended bits are unnecessary because page walking uses the CPU role, and EFER.NX/CR0.WP can be set to one unconditionally---matching the format of shadow pages rather than the format of guest pages. The MMU role for two dimensional paging does still depend on the CPU role, even if only barely so, due to SMM and guest mode; for consistency, pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying the vcpu with is_smm or is_guest_mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:55 -04:00
Paolo Bonzini	19b5dcc3be	KVM: x86/mmu: remove kvm_calc_shadow_root_page_role_common kvm_calc_shadow_root_page_role_common is the same as kvm_calc_cpu_role except for the level, which is overwritten afterwards in kvm_calc_shadow_mmu_root_page_role and kvm_calc_shadow_npt_root_page_role. role.base.direct is already set correctly for the CPU role, and CR0.PG=1 is required for VMRUN so it will also be correct for nested NPT. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:54 -04:00
Paolo Bonzini	ec283cb1dc	KVM: x86/mmu: remove ept_ad field The ept_ad field is used during page walk to determine if the guest PTEs have accessed and dirty bits. In the MMU role, the ad_disabled bit represents whether the shadow PTEs have the bits, so it would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just !mmu->mmu_role.base.ad_disabled. However, the similar field in the CPU mode, ad_disabled, is initialized correctly: to the opposite value of ept_ad for shadow EPT, and zero for non-EPT guest paging modes (which always have A/D bits). It is therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode, like other page-format fields; it just has to be inverted to account for the different polarity. In fact, now that the CPU mode is distinct from the MMU roles, it would even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because the macro has a small effect in terms of dead code elimination: text data bss dec hex 103544 16665 112 120321 1d601 # as of this patch 103746 16665 112 120523 1d6cb # without PT_HAVE_ACCESSED_DIRTY Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:54 -04:00
Paolo Bonzini	60f3cb60a5	KVM: x86/mmu: do not recompute root level from kvm_mmu_role_regs The root_level can be found in the cpu_role (in fact the field is superfluous and could be removed, but one thing at a time). Since there is only one usage left of role_regs_to_root_level, inline it into kvm_calc_cpu_role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:53 -04:00
Paolo Bonzini	e5ed0fb010	KVM: x86/mmu: split cpu_role from mmu_role Snapshot the state of the processor registers that govern page walk into a new field of struct kvm_mmu. This is a more natural representation than having it mostly in mmu_role but not exclusively; the delta right now is represented in other fields, such as root_level. The nested MMU now has only the CPU role; and in fact the new function kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role, except that it has role.base.direct equal to !CR0.PG. For a walk-only MMU, "direct" has no meaning, but we set it to !CR0.PG so that role.ext.cr0_pg can go away in a future patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:53 -04:00
Paolo Bonzini	b89805082a	KVM: x86/mmu: remove "bool base_only" arguments The argument is always false now that kvm_mmu_calc_root_page_role has been removed. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:53 -04:00
Paolo Bonzini	39e7e2bf32	KVM: x86/mmu: pull computation of kvm_mmu_role_regs to kvm_init_mmu The init_kvm_*mmu functions, with the exception of shadow NPT, do not need to know the full values of CR0/CR4/EFER; they only need to know the bits that make up the "role". This cleanup however will take quite a few incremental steps. As a start, pull the common computation of the struct kvm_mmu_role_regs into their caller: all of them extract the struct from the vcpu as the very first step. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:17 -04:00
Paolo Bonzini	82ffa13f79	KVM: x86/mmu: constify uses of struct kvm_mmu_role_regs struct kvm_mmu_role_regs is computed just once and then accessed. Use const to make this clearer, even though the const fields of struct kvm_mmu_role_regs already prevent (or make it harder...) to modify the contents of the struct. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:17 -04:00
Paolo Bonzini	daed87b876	KVM: x86/mmu: nested EPT cannot be used in SMM The role.base.smm flag is always zero when setting up shadow EPT, do not bother copying it over from vcpu->arch.root_mmu. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:17 -04:00
Sean Christopherson	8b9e74bfbf	KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled Clear enable_mmio_caching if hardware can't support MMIO caching and use the dedicated flag to detect if MMIO caching is enabled instead of assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will use a zero value even when caching is enabled, and is_mmio_spte() isn't so hot that it needs to avoid an extra memory access, i.e. there's no reason to be super clever. And the clever approach may not even be more performant, e.g. gcc-11 lands the extra check on a non-zero value inline, but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra uops for non-MMIO SPTEs. Cc: Isaku Yamahata <isaku.yamahata@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420002747.3287931-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:49:16 -04:00
Paolo Bonzini	71d7c575a6	Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees. The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between 5.18 and commit `c24a950ec7` ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata for SEV-ES", 2022-04-13).	2022-04-29 12:47:59 -04:00
Mingwei Zhang	44187235cb	KVM: x86/mmu: fix potential races when walking host page table KVM uses lookup_address_in_mm() to detect the hugepage size that the host uses to map a pfn. The function suffers from several issues: - no usage of READ_ONCE(*). This allows multiple dereference of the same page table entry. The TOCTOU problem because of that may cause KVM to incorrectly treat a newly generated leaf entry as a nonleaf one, and dereference the content by using its pfn value. - the information returned does not match what KVM needs; for non-present entries it returns the level at which the walk was terminated, as long as the entry is not 'none'. KVM needs level information of only 'present' entries, otherwise it may regard a non-present PXE entry as a present large page mapping. - the function is not safe for mappings that can be torn down, because it does not disable IRQs and because it returns a PTE pointer which is never safe to dereference after the function returns. So implement the logic for walking host page tables directly in KVM, and stop using lookup_address_in_mm(). Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220429031757.2042406-1-mizhang@google.com> [Inline in host_pfn_mapping_level, ensure no semantic change for its callers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:38:22 -04:00
Sean Christopherson	86931ff720	KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: `faaf05b00a` ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: `524a1e4e38` ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-29 12:38:21 -04:00
Paolo Bonzini	a4cfff3f0f	Merge branch 'kvm-older-features' into HEAD Merge branch for features that did not make it into 5.18: * New ioctls to get/set TSC frequency for a whole VM * Allow userspace to opt out of hypercall patching Nested virtualization improvements for AMD: * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) * Allow AVIC to co-exist with a nested guest running * Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support * PAUSE filtering for nested hypervisors Guest support: * Decoupling of vcpu_is_preempted from PV spinlocks Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-13 13:37:17 -04:00
Sean Christopherson	1d0e848060	KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as -1 is technically undefined behavior when its value is read out by param_get_bool(), as boolean values are supposed to be '0' or '1'. Alternatively, KVM could define a custom getter for the param, but the auto value doesn't depend on the vendor module in any way, and printing "auto" would be unnecessarily unfriendly to the user. In addition to fixing the undefined behavior, resolving the auto value also fixes the scenario where the auto value resolves to N and no vendor module is loaded. Previously, -1 would result in Y being printed even though KVM would ultimately disable the mitigation. Rename the existing MMU module init/exit helpers to clarify that they're invoked with respect to the vendor module, and add comments to document why KVM has two separate "module init" flows. ========================================================================= UBSAN: invalid-load in kernel/params.c:320:33 load of value 255 is not a valid value for type '_Bool' CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value.cold+0x43/0x48 param_get_bool.cold+0xf/0x14 param_attr_show+0x55/0x80 module_attr_show+0x1c/0x30 sysfs_kf_seq_show+0x93/0xc0 seq_read_iter+0x11c/0x450 new_sync_read+0x11b/0x1a0 vfs_read+0xf0/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ========================================================================= Fixes: `b8e8c8303f` ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220331221359.3912754-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-05 08:09:46 -04:00
Hou Wenlong	8d5678a766	KVM: x86/mmu: Don't rebuild page when the page is synced and no tlb flushing is required Before Commit `c3e5e415bc` ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed"), the return value of kvm_sync_page() indicates whether the page is synced, and kvm_mmu_get_page() would rebuild page when the sync fails. But now, kvm_sync_page() returns false when the page is synced and no tlb flushing is required, which leads to rebuild page in kvm_mmu_get_page(). So return the return value of mmu->sync_page() directly and check it in kvm_mmu_get_page(). If the sync fails, the page will be zapped and the invalid_list is not empty, so set flush as true is accepted in mmu_sync_children(). Cc: stable@vger.kernel.org Fixes: `c3e5e415bc` ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Acked-by: Lai Jiangshan <jiangshanlai@gmail.com> Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com> [mmu_sync_children should not flush if the page is zapped. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:44:23 -04:00
Maxim Levitsky	5959ff4ae9	KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set It makes more sense to print new SPTE value than the old value. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:45 -04:00
Lai Jiangshan	4f4aa80e3b	KVM: X86: Handle implicit supervisor access with SMAP There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:43 -04:00
Lai Jiangshan	94b4a2f174	KVM: X86: Fix comments in update_permission_bitmask The commit `09f037aa48` ("KVM: MMU: speedup update_permission_bitmask") refactored the code of update_permission_bitmask() and change the comments. It added a condition into a list to match the new code, so the number/order for conditions in the comments should be updated too. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:42 -04:00
Lai Jiangshan	5b22bbe717	KVM: X86: Change the type of access u32 to u64 Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:42 -04:00
Sean Christopherson	f47e5bbbc9	KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB flush when processing multiple roots (including nested TDP shadow roots). Dropping the TLB flush resulted in random crashes when running Hyper-V Server 2019 in a guest with KSM enabled in the host (or any source of mmu_notifier invalidations, KSM is just the easiest to force). This effectively revert commits `873dd12217` and `fcb93eb6d0`, and thus restores commit `cf3e26427c`, plus this delta on top: bool kvm_tdp_mmu_zap_leafs(struct kvm kvm, int as_id, gfn_t start, gfn_t end, struct kvm_mmu_page root; for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false); + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); return flush; } Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325230348.2587437-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:39 -04:00
Paolo Bonzini	a1a39128fa	KVM: MMU: propagate alloc_workqueue failure If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-04-02 05:34:38 -04:00
Paolo Bonzini	873dd12217	Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()" This reverts commit `cf3e26427c`. Multi-vCPU Hyper-V guests started crashing randomly on boot with the latest kvm/queue and the problem can be bisected the problem to this particular patch. Basically, I'm not able to boot e.g. 16-vCPU guest successfully anymore. Both Intel and AMD seem to be affected. Reverting the commit saves the day. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-21 05:11:51 -04:00
Paolo Bonzini	22b94c4b63	KVM: x86/mmu: Zap invalidated roots via asynchronous worker Use the system worker threads to zap the roots invalidated by the TDP MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots(). At this point, apart from allowing some parallelism in the zapping of roots, the workqueue is a glorified linked list: work items are added and flushed entirely within a single kvm->slots_lock critical section. However, the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots() assumes that it owns a reference to all invalid roots; therefore, no one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the invalidated roots on a linked list... erm, on a workqueue ensures that tdp_mmu_zap_root_work() only puts back those extra references that kvm_mmu_zap_all_invalidated_roots() had gifted to it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 10:55:27 -05:00
Sean Christopherson	bb95dfb9e2	KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead of immediately flushing. Because the shadow pages are freed in an RCU callback, so long as at least one CPU holds RCU, all CPUs are protected. For vCPUs running in the guest, i.e. consuming TLB entries, KVM only needs to ensure the caller services the pending TLB flush before dropping its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs running in the guest. Deferring the flushes allows batching flushes, e.g. when installing a 1gb hugepage and zapping a pile of SPs. And when zapping an entire root, deferring flushes allows skipping the flush entirely (because flushes are not needed in that case). Avoiding flushes when zapping an entire root is especially important as synchronizing with other CPUs via IPI after zapping every shadow page can cause significant performance issues for large VMs. The issue is exacerbated by KVM zapping entire top-level entries without dropping RCU protection, which can lead to RCU stalls even when zapping roots backing relatively "small" amounts of guest memory, e.g. 2tb. Removing the IPI bottleneck largely mitigates the RCU issues, though it's likely still a problem for 5-level paging. A future patch will further address the problem by zapping roots in multiple passes to avoid holding RCU for an extended duration. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:57 -05:00
Sean Christopherson	cf3e26427c	KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various functions accordingly. When removing mappings for functional correctness (except for the stupid VFIO GPU passthrough memslots bug), zapping the leaf SPTEs is sufficient as the paging structures themselves do not point at guest memory and do not directly impact the final translation (in the TDP MMU). Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and kvm_unmap_gfn_range(). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:56 -05:00
Sean Christopherson	7ae5840e6f	KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush Remove the misleading flush "handling" when zapping invalidated TDP MMU roots, and document that flushing is unnecessary for all flavors of MMUs when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU is dead code, as zap_gfn_range() is called with shared=true, in which case it will never return true due to the flushing being handled by tdp_mmu_zap_spte_atomic(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:36 -05:00
Sean Christopherson	db01416b22	KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Explicitly ignore the result of zap_gfn_range() when putting the last reference to a TDP MMU root, and add a pile of comments to formalize the TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this only affects the !shared case, as zap_gfn_range() subtly never returns true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic(). Putting the root without a flush is ok because even if there are stale references to the root in the TLB, they are unreachable because KVM will not run the guest with the same ASID without first flushing (where ASID in this context refers to both SVM's explicit ASID and Intel's implicit ASID that is constructed from VPID+PCID+EPT4A+etc...). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-5-seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:23 -05:00
Sean Christopherson	f28e9c7fce	KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap Fix misleading and arguably wrong comments in the TDP MMU's fast zap flow. The comments, and the fact that actually zapping invalid roots was added separately, strongly suggests that zapping invalid roots is an optimization and not required for correctness. That is a lie. KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(), because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(), KVM is relying on it to fully remove all references to the memslot. Once the memslot is gone, KVM's mmu_notifier hooks will be unable to find the stale references as the hva=>gfn translation is done via the memslots. If KVM doesn't immediately zap SPTEs and userspace unmaps a range after deleting a memslot, KVM will fail to zap in response to the mmu_notifier due to not finding a memslot corresponding to the notifier's range, which leads to a variation of use-after-free. The other misleading comment (and code) explicitly states that roots without a reference should be skipped. While that's technically true, it's also extremely misleading as it should be impossible for KVM to encounter a defunct root on the list while holding mmu_lock for write. Opportunistically add a WARN to enforce that invariant. Fixes: `b7cccd397f` ("KVM: x86/mmu: Fast invalidation for TDP MMU") Fixes: `4c6654bd16` ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-08 09:31:18 -05:00
Paolo Bonzini	0564eeb71b	Merge branch 'kvm-bugfixes' into HEAD Merge bugfixes from 5.17 before merging more tricky work.	2022-03-04 18:39:29 -05:00
Like Xu	c6c937d673	KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots() Just like on the optional mmu_alloc_direct_roots() path, once shadow path reaches "r = -EIO" somewhere, the caller needs to know the actual state in order to enter error handling and avoid something worse. Fixes: `4a38162ee9` ("KVM: MMU: load PDPTRs outside mmu_lock") Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220301124941.48412-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-02 10:55:58 -05:00
Sean Christopherson	5d6a322156	KVM: WARN if is_unsync_root() is called on a root without a shadow page WARN and bail if is_unsync_root() is passed a root for which there is no shadow page, i.e. is passed the physical address of one of the special roots, which do not have an associated shadow page. The current usage squeaks by without bug reports because neither kvm_mmu_sync_roots() nor kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root, and 5-level AMD CPUs are not generally available, i.e. no one can coerce KVM into calling is_unsync_root() on pml5_root. Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully) prevents KVM from crashing. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225182248.3812651-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-01 08:58:26 -05:00
Sean Christopherson	527d5cd7ee	KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-01 08:58:25 -05:00
Sean Christopherson	2f6f66ccd2	KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users Remove the generic kvm_reload_remote_mmus() and open code its functionality into the two x86 callers. x86 is (obviously) the only architecture that uses the hook, and is also the only architecture that uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That will change in a future patch, as x86's usage when zapping a single shadow page x86 doesn't actually _need_ to reload all vCPUs' MMUs, only MMUs whose root is being zapped actually need to be reloaded. s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose. Drop the generic code in anticipation of implementing s390 and x86 arch specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely. Opportunistically reword the x86 TDP MMU comment to avoid making references to functions (and requests!) when possible, and to remove the rather ambiguous "this". No functional change intended. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-03-01 08:58:25 -05:00
Paolo Bonzini	6d58f275e6	KVM: x86/mmu: clear MMIO cache when unloading the MMU For cleanliness, do not leave a stale GVA in the cache after all the roots are cleared. In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if paging is on, and will not use vcpu_match_mmio_gva at all if paging is off. However, leaving data in the cache might cause bugs in the future. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:19 -05:00
Paolo Bonzini	d2e5f33341	KVM: x86/mmu: Always use current mmu's role when loading new PGD Since the guest PGD is now loaded after the MMU has been set up completely, the desired role for a cache hit is simply the current mmu_role. There is no need to compute it again, so __kvm_mmu_new_pgd can be folded in kvm_mmu_new_pgd. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:18 -05:00
Paolo Bonzini	3cffc89d9d	KVM: x86/mmu: load new PGD after the shadow MMU is initialized Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and shadow_root_level anymore, pull the PGD load after the initialization of the shadow MMUs. Besides being more intuitive, this enables future simplifications and optimizations because it's not necessary anymore to compute the role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not attempting to use a cached PGD to avoid having to figure out the new role. With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing, and avoid unloading all the cached roots. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:18 -05:00
Paolo Bonzini	5499ea73e7	KVM: x86/mmu: look for a cached PGD when going from 32-bit to 64-bit Right now, PGD caching avoids placing a PAE root in the cache by using the old value of mmu->root_level and mmu->shadow_root_level; it does not look for a cached PGD if the old root is a PAE one, and then frees it using kvm_mmu_free_roots. Change the logic instead to free the uncacheable root early. This way, __kvm_new_mmu_pgd is able to look up the cache when going from 32-bit to 64-bit (if there is a hit, the invalid root becomes the least recently used). An example of this is nested virtualization with shadow paging, when a 64-bit L1 runs a 32-bit L2. As a side effect (which is actually the reason why this patch was written), PGD caching does not use the old value of mmu->root_level and mmu->shadow_root_level anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:18 -05:00
Paolo Bonzini	0c1c92f15f	KVM: x86/mmu: do not pass vcpu to root freeing functions These functions only operate on a given MMU, of which there is more than one in a vCPU (we care about two, because the third does not have any roots and is only used to walk guest page tables). They do need a struct kvm in order to lock the mmu_lock, but they do not needed anything else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:17 -05:00
Paolo Bonzini	594bef7931	KVM: x86/mmu: do not consult levels when freeing roots Right now, PGD caching requires a complicated dance of first computing the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling kvm_init_mmu(). Part of this is due to kvm_mmu_free_roots using mmu->root_level and mmu->shadow_root_level to distinguish whether the page table uses a single root or 4 PAE roots. Because kvm_init_mmu() can overwrite mmu->root_level, kvm_mmu_free_roots() must be called before kvm_init_mmu(). However, even after kvm_init_mmu() there is a way to detect whether the page table may hold PAE roots, as root.hpa isn't backed by a shadow when it points at PAE roots. Using this method results in simpler code, and is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the MMU has been initialized. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:17 -05:00
Paolo Bonzini	b9e5603c2a	KVM: x86: use struct kvm_mmu_root_info for mmu->root The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:16 -05:00
Paolo Bonzini	9191b8f074	KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs WARN and bail if KVM attempts to free a root that isn't backed by a shadow page. KVM allocates a bare page for "special" roots, e.g. when using PAE paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa will be valid but won't be backed by a shadow page. It's all too easy to blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of crashing KVM and possibly the kernel. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:16 -05:00
Liang Zhang	6f3c1fc53d	KVM: x86/mmu: make apf token non-zero to fix bug In current async pagefault logic, when a page is ready, KVM relies on kvm_arch_can_dequeue_async_page_present() to determine whether to deliver a READY event to the Guest. This function test token value of struct kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a READY event is finished by Guest. If value is zero meaning that a READY event is done, so the KVM can deliver another. But the kvm_arch_setup_async_pf() may produce a valid token with zero value, which is confused with previous mention and may lead the loss of this READY event. This bug may cause task blocked forever in Guest: INFO: task stress:7532 blocked for more than 1254 seconds. Not tainted 5.10.0 #16 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack: 0 pid: 7532 ppid: 1409 flags:0x00000080 Call Trace: __schedule+0x1e7/0x650 schedule+0x46/0xb0 kvm_async_pf_task_wait_schedule+0xad/0xe0 ? exit_to_user_mode_prepare+0x60/0x70 __kvm_handle_async_pf+0x4f/0xb0 ? asm_exc_page_fault+0x8/0x30 exc_page_fault+0x6f/0x110 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x402d00 RSP: 002b:00007ffd31912500 EFLAGS: 00010206 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000 Signed-off-by: Liang Zhang <zhangliang5@huawei.com> Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-24 13:04:46 -05:00
Sean Christopherson	1bbc60d0c7	KVM: x86/mmu: Remove MMU auditing Remove mmu_audit.c and all its collateral, the auditing code has suffered severe bitrot, ironically partly due to shadow paging being more stable and thus not benefiting as much from auditing, but mostly due to TDP supplanting shadow paging for non-nested guests and shadowing of nested TDP not heavily stressing the logic that is being audited. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-18 13:46:23 -05:00
David Matlack	cb00a70bd4	KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:43 -05:00
David Matlack	a3fe5dbda0	KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:42 -05:00
David Matlack	315d86da89	KVM: x86/mmu: Move restore_acc_track_spte() to spte.h restore_acc_track_spte() is pure SPTE bit manipulation, making it a good fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed, there isn't any good reason to not inline it. This move also prepares for a follow-up commit that will need to call restore_acc_track_spte() from spte.c No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:40 -05:00
David Matlack	77c23c77f9	KVM: x86/mmu: Drop new_spte local variable from restore_acc_track_spte() The new_spte local variable is unnecessary. Deleting it can save a line of code and simplify the remaining lines a bit. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:39 -05:00
David Matlack	59940e76d1	KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() The warnings in restore_acc_track_spte() can be removed because the only caller checks is_access_track_spte(), and is_access_track_spte() checks !spte_ad_enabled(). In other words, the warning can never be triggered. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:39 -05:00
David Matlack	1346bbb6b4	KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() The function formerly known as rmap_write_protect() has been renamed to kvm_vcpu_write_protect_gfn(), so we can get rid of the double underscores in front of __rmap_write_protect(). No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:37 -05:00

... 3 4 5 6 7 ...

887 Commits