Commit Graph

2674 Commits

Author SHA1 Message Date
Paul Mackerras
4f169d2118 KVM: PPC: Book3S PR: Don't let PAPR guest set MSR hypervisor bit
PAPR guests run in supervisor mode and should not be able to set the
MSR HV (hypervisor mode) bit or clear the ME (machine check enable)
bit by mtmsrd or any other means.  To enforce this, we force MSR_HV
off and MSR_ME on in kvmppc_set_msr_pr.  Without this, the guest
can appear to be in hypervisor mode to itself and to userspace.
This has been observed to cause a crash in QEMU when it tries to
deliver a system reset interrupt to the guest.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-13 09:45:28 +10:00
Paul Mackerras
a50623fb57 KVM: PPC: Book3S PR: Fix failure status setting in treclaim. emulation
The treclaim. emulation needs to record failure status in the TEXASR
register if the transaction had not previously failed.  However, the
current code first does kvmppc_save_tm_pr() (which does a treclaim.
itself) and then checks the failure summary bit in TEXASR after that.
Since treclaim. itself causes transaction failure, the FS bit is
always set, so we were never updating TEXASR with the failure cause
supplied by the guest as the RA parameter to the treclaim. instruction.
This caused the tm-unavailable test in tools/testing/selftests/powerpc/tm
to fail.

To fix this, we need to read TEXASR before calling kvmppc_save_tm_pr(),
and base the final value of TEXASR on that value.

Fixes: 03c81682a9 ("KVM: PPC: Book3S PR: Add emulation for treclaim.")
Reviewed-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-13 09:45:28 +10:00
Paul Mackerras
916ccadccd KVM: PPC: Book3S PR: Fix MSR setting when delivering interrupts
This makes sure that MSR "partial-function" bits are not transferred
to SRR1 when delivering an interrupt.  This was causing failures in
guests running kernels that include commit f3d96e698e ("powerpc/mm:
Overhaul handling of bad page faults", 2017-07-19), which added code
to check bits of SRR1 on instruction storage interrupts (ISIs) that
indicate a bad page fault.  The symptom was that a guest user program
that handled a signal and attempted to return from the signal handler
would get a SIGBUS signal and die.

The code that generated ISIs and some other interrupts would
previously set bits in the guest MSR to indicate the interrupt status
and then call kvmppc_book3s_queue_irqprio().  This technique no
longer works now that kvmppc_inject_interrupt() is masking off those
bits.  Instead we make kvmppc_core_queue_data_storage() and
kvmppc_core_queue_inst_storage() call kvmppc_inject_interrupt()
directly, and make sure that all the places that generate ISIs or
DSIs call kvmppc_core_queue_{data,inst}_storage instead of
kvmppc_book3s_queue_irqprio().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-13 09:45:28 +10:00
Cameron Kaiser
b71dc519a9 KVM: PPC: Book3S PR: Handle additional interrupt types
This adds trivial handling for additional interrupt types that KVM-PR must
support for proper virtualization on a POWER9 host in HPT mode, as a further
prerequisite to enabling KVM-PR on that configuration.

Signed-off-by: Cameron Kaiser <spectre@floodgap.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-13 09:45:28 +10:00
Kees Cook
fad953ce0b treewide: Use array_size() in vzalloc()
The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vzalloc(a * b)

with:
        vzalloc(array_size(a, b))

as well as handling cases of:

        vzalloc(a * b * c)

with:

        vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vzalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vzalloc(C1 * C2 * C3, ...)
|
  vzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vzalloc(C1 * C2, ...)
|
  vzalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
42bc47b353 treewide: Use array_size() in vmalloc()
The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vmalloc(a * b)

with:
        vmalloc(array_size(a, b))

as well as handling cases of:

        vmalloc(a * b * c)

with:

        vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vmalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vmalloc(C1 * C2 * C3, ...)
|
  vmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vmalloc(C1 * C2, ...)
|
  vmalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
b357bf6023 Small update for KVM.
* ARM: lazy context-switching of FPSIMD registers on arm64, "split"
 regions for vGIC redistributor
 
 * s390: cleanups for nested, clock handling, crypto, storage keys and
 control register bits
 
 * x86: many bugfixes, implement more Hyper-V super powers,
 implement lapic_timer_advance_ns even when the LAPIC timer
 is emulated using the processor's VMX preemption timer.  Two
 security-related bugfixes at the top of the branch.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJbH8Z/AAoJEL/70l94x66DF+UIAJeOuTp6LGasT/9uAb2OovaN
 +5kGmOPGFwkTcmg8BQHI2fXT4vhxMXWPFcQnyig9eXJVxhuwluXDOH4P9IMay0yw
 VDCBsWRdMvZDQad2hn6Z5zR4Jx01XrSaG/KqvXbbDKDCy96mWG7SYAY2m3ZwmeQi
 3Pa3O3BTijr7hBYnMhdXGkSn4ZyU8uPaAgIJ8795YKeOJ2JmioGYk6fj6y2WCxA3
 ztJymBjTmIoZ/F8bjuVouIyP64xH4q9roAyw4rpu7vnbWGqx1fjPYJoB8yddluWF
 JqCPsPzhKDO7mjZJy+lfaxIlzz2BN7tKBNCm88s5GefGXgZwk3ByAq/0GQ2M3rk=
 =H5zI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small update for KVM:

  ARM:
   - lazy context-switching of FPSIMD registers on arm64
   - "split" regions for vGIC redistributor

  s390:
   - cleanups for nested
   - clock handling
   - crypto
   - storage keys
   - control register bits

  x86:
   - many bugfixes
   - implement more Hyper-V super powers
   - implement lapic_timer_advance_ns even when the LAPIC timer is
     emulated using the processor's VMX preemption timer.
   - two security-related bugfixes at the top of the branch"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits)
  kvm: fix typo in flag name
  kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
  KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
  KVM: x86: introduce linear_{read,write}_system
  kvm: nVMX: Enforce cpl=0 for VMX instructions
  kvm: nVMX: Add support for "VMWRITE to any supported field"
  kvm: nVMX: Restrict VMX capability MSR changes
  KVM: VMX: Optimize tscdeadline timer latency
  KVM: docs: nVMX: Remove known limitations as they do not exist now
  KVM: docs: mmu: KVM support exposing SLAT to guests
  kvm: no need to check return value of debugfs_create functions
  kvm: Make VM ioctl do valloc for some archs
  kvm: Change return type to vm_fault_t
  KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
  kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
  KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
  KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
  KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
  KVM: introduce kvm_make_vcpus_request_mask() API
  KVM: x86: hyperv: do rep check for each hypercall separately
  ...
2018-06-12 11:34:04 -07:00
Linus Torvalds
c90fca951e powerpc updates for 4.18
Notable changes:
 
  - Support for split PMD page table lock on 64-bit Book3S (Power8/9).
 
  - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
    patching again.
 
  - Add support for patching barrier_nospec in copy_from_user() and syscall entry.
 
  - A couple of fixes for our data breakpoints on Book3S.
 
  - A series from Nick optimising TLB/mm handling with the Radix MMU.
 
  - Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
 
  - Several series optimising various parts of the 32-bit code from Christophe Leroy.
 
  - Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
    which is why the diffstat has so many deletions.
 
 And many other small improvements & fixes.
 
 There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
 a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
 fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
 ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
 
 Thanks to:
   Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
   Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
   Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
   Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
   Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
   Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
   Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
   Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
   Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
   Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
   Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
   Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
   Yisheng Xie, YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
 TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
 zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
 ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
 aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
 Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
 EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
 jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
 9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
 46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
 q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
 Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
 =ZlBf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Support for split PMD page table lock on 64-bit Book3S (Power8/9).

   - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
     live patching again.

   - Add support for patching barrier_nospec in copy_from_user() and
     syscall entry.

   - A couple of fixes for our data breakpoints on Book3S.

   - A series from Nick optimising TLB/mm handling with the Radix MMU.

   - Numerous small cleanups to squash sparse/gcc warnings from Mathieu
     Malaterre.

   - Several series optimising various parts of the 32-bit code from
     Christophe Leroy.

   - Removal of support for two old machines, "SBC834xE" and "C2K"
     ("GEFanuc,C2K"), which is why the diffstat has so many deletions.

  And many other small improvements & fixes.

  There's a few out-of-area changes. Some minor ftrace changes OK'ed by
  Steve, and a fix to our powernv cpuidle driver. Then there's a series
  touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
  around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
  been in next for several weeks.

  Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
  Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
  Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
  Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
  Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
  Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
  Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
  Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
  Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
  Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
  Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
  Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
  Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
  Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"

* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
  powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
  cpuidle: powernv: Fix promotion from snooze if next state disabled
  powerpc: fix build failure by disabling attribute-alias warning in pci_32
  ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
  powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
  powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
  powerpc/pkeys: Detach execute_only key on !PROT_EXEC
  powerpc/powernv: copy/paste - Mask SO bit in CR
  powerpc: Remove core support for Marvell mv64x60 hostbridges
  powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
  powerpc/boot: Remove support for Marvell mv64x60 i2c controller
  powerpc/boot: Remove support for Marvell MPSC serial controller
  powerpc/embedded6xx: Remove C2K board support
  powerpc/lib: optimise PPC32 memcmp
  powerpc/lib: optimise 32 bits __clear_user()
  powerpc/time: inline arch_vtime_task_switch()
  powerpc/Makefile: set -mcpu=860 flag for the 8xx
  powerpc: Implement csum_ipv6_magic in assembly
  powerpc/32: Optimise __csum_partial()
  powerpc/lib: Adjust .balign inside string functions for PPC32
  ...
2018-06-07 10:23:33 -07:00
Greg Kroah-Hartman
929f45e324 kvm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleans up the error handling a lot, as this code will never get
hit.

Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmář" <rkrcmar@redhat.com>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-01 19:18:27 +02:00
Souptick Joarder
1499fa809e kvm: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.

commit 1c8f422059 ("mm: change return type to vm_fault_t")

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-01 19:18:25 +02:00
Simon Guo
deeb879de9 KVM: PPC: Book3S PR: Enable kvmppc_get/set_one_reg_pr() for HTM registers
We need to migrate PR KVM during transaction and userspace will use
kvmppc_get_one_reg_pr()/kvmppc_set_one_reg_pr() APIs to get/set
transaction checkpoint state. This patch adds support for that.

So far, QEMU on PR KVM doesn't fully function for migration but the
savevm/loadvm can be done against a RHEL72 guest. During savevm/
loadvm procedure, the kvm ioctls will be invoked as well.

Test has been performed to savevm/loadvm for a guest running
a HTM test program:
https://github.com/justdoitqd/publicFiles/blob/master/test-tm-mig.c

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:31:08 +10:00
Simon Guo
8f04412699 KVM: PPC: Book3S: Remove load/put vcpu for KVM_GET_REGS/KVM_SET_REGS
In both HV and PR KVM, the KVM_SET_REGS/KVM_GET_REGS ioctl should
be able to perform without the vcpu loaded.

Since the vcpu mutex locking/unlock has been moved out of vcpu_load()
/vcpu_put(), KVM_SET_REGS/KVM_GET_REGS don't need to do ioctl with
the vcpu loaded anymore. This patch removes vcpu_load()/vcpu_put()
from KVM_SET_REGS/KVM_GET_REGS ioctl.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:31:03 +10:00
Simon Guo
c8235c2891 KVM: PPC: Remove load/put vcpu for KVM_GET/SET_ONE_REG ioctl
Since the vcpu mutex locking/unlock has been moved out of vcpu_load()
/vcpu_put(), KVM_GET_ONE_REG and KVM_SET_ONE_REG doesn't need to do
ioctl with loading vcpu anymore. This patch removes vcpu_load()/vcpu_put()
from KVM_GET_ONE_REG and KVM_SET_ONE_REG ioctl.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:59 +10:00
Simon Guo
b3cebfe8c1 KVM: PPC: Move vcpu_load/vcpu_put down to each ioctl case in kvm_arch_vcpu_ioctl
Although we already have kvm_arch_vcpu_async_ioctl() which doesn't require
ioctl to load vcpu, the sync ioctl code need to be cleaned up when
CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL is not configured.

This patch moves vcpu_load/vcpu_put down to each ioctl switch case so that
each ioctl can decide to do vcpu_load/vcpu_put or not independently.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:53 +10:00
Simon Guo
d234d68eb7 KVM: PPC: Book3S PR: Enable HTM for PR KVM for KVM_CHECK_EXTENSION ioctl
With current patch set, PR KVM now supports HTM. So this patch turns it
on for PR KVM.

Tested with:
https://github.com/justdoitqd/publicFiles/blob/master/test_kvm_htm_cap.c

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:48 +10:00
Simon Guo
7284ca8a5e KVM: PPC: Book3S PR: Support TAR handling for PR KVM HTM
Currently guest kernel doesn't handle TAR facility unavailable and it
always runs with TAR bit on. PR KVM will lazily enable TAR. TAR is not
a frequent-use register and it is not included in SVCPU struct.

Due to the above, the checkpointed TAR val might be a bogus TAR val.
To solve this issue, we will make vcpu->arch.fscr tar bit consistent
with shadow_fscr when TM is enabled.

At the end of emulating treclaim., the correct TAR val need to be loaded
into the register if FSCR_TAR bit is on.

At the beginning of emulating trechkpt., TAR needs to be flushed so that
the right tar val can be copied into tar_tm.

Tested with:
tools/testing/selftests/powerpc/tm/tm-tar
tools/testing/selftests/powerpc/ptrace/ptrace-tm-tar (remove DSCR/PPR
related testing).

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:43 +10:00
Simon Guo
68ab07b985 KVM: PPC: Book3S PR: Add guard code to prevent returning to guest with PR=0 and Transactional state
Currently PR KVM doesn't support transaction memory in guest privileged
state.

This patch adds a check at setting guest msr, so that we can never return
to guest with PR=0 and TS=0b10. A tabort will be emulated to indicate
this and fail transaction immediately.

[paulus@ozlabs.org - don't change the TM_CAUSE_MISC definition, instead
 use TM_CAUSE_KVM_FAC_UNAV.]

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:39 +10:00
Simon Guo
26798f88d5 KVM: PPC: Book3S PR: Add emulation for tabort. in privileged state
Currently privileged-state guest will be run with TM disabled.

Although the privileged-state guest cannot initiate a new transaction,
it can use tabort to terminate its problem state's transaction.
So it is still necessary to emulate tabort. for privileged-state guest.

Tested with:
https://github.com/justdoitqd/publicFiles/blob/master/test_tabort.c

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:31 +10:00
Simon Guo
e32c53d1cf KVM: PPC: Book3S PR: Add emulation for trechkpt.
This patch adds host emulation when guest PR KVM executes "trechkpt.",
which is a privileged instruction and will trap into host.

We firstly copy vcpu ongoing content into vcpu tm checkpoint
content, then perform kvmppc_restore_tm_pr() to do trechkpt.
with updated vcpu tm checkpoint values.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:23 +10:00
Simon Guo
03c81682a9 KVM: PPC: Book3S PR: Add emulation for treclaim.
This patch adds support for "treclaim." emulation when PR KVM guest
executes treclaim. and traps to host.

We will firstly do treclaim. and save TM checkpoint. Then it is
necessary to update vcpu current reg content with checkpointed vals.
When rfid into guest again, those vcpu current reg content (now the
checkpoint vals) will be loaded into regs.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:19 +10:00
Simon Guo
19c585eb45 KVM: PPC: Book3S PR: Restore NV regs after emulating mfspr from TM SPRs
Currently kvmppc_handle_fac() will not update NV GPRs and thus it can
return with GUEST_RESUME.

However PR KVM guest always disables MSR_TM bit in privileged state.
If PR privileged-state guest is trying to read TM SPRs, it will
trigger TM facility unavailable exception and fall into
kvmppc_handle_fac().  Then the emulation will be done by
kvmppc_core_emulate_mfspr_pr().  The mfspr instruction can include a
RT with NV reg. So it is necessary to restore NV GPRs at this case, to
reflect the update to NV RT.

This patch make kvmppc_handle_fac() return GUEST_RESUME_NV for TM
facility unavailable exceptions in guest privileged state.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:14 +10:00
Simon Guo
5706340a33 KVM: PPC: Book3S PR: Always fail transactions in guest privileged state
Currently the kernel doesn't use transaction memory.
And there is an issue for privileged state in the guest that:
tbegin/tsuspend/tresume/tabort TM instructions can impact MSR TM bits
without trapping into the PR host. So following code will lead to a
false mfmsr result:
	tbegin	<- MSR bits update to Transaction active.
	beq 	<- failover handler branch
	mfmsr	<- still read MSR bits from magic page with
		transaction inactive.

It is not an issue for non-privileged guest state since its mfmsr is
not patched with magic page and will always trap into the PR host.

This patch will always fail tbegin attempt for privileged state in the
guest, so that the above issue is prevented. It is benign since
currently (guest) kernel doesn't initiate a transaction.

Test case:
https://github.com/justdoitqd/publicFiles/blob/master/test_tbegin_pr.c

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:10 +10:00
Simon Guo
533082ae86 KVM: PPC: Book3S PR: Emulate mtspr/mfspr using active TM SPRs
The mfspr/mtspr on TM SPRs(TEXASR/TFIAR/TFHAR) are non-privileged
instructions and can be executed by PR KVM guest in problem state
without trapping into the host. We only emulate mtspr/mfspr
texasr/tfiar/tfhar in guest PR=0 state.

When we are emulating mtspr tm sprs in guest PR=0 state, the emulation
result needs to be visible to guest PR=1 state. That is, the actual TM
SPR val should be loaded into actual registers.

We already flush TM SPRs into vcpu when switching out of CPU, and load
TM SPRs when switching back.

This patch corrects mfspr()/mtspr() emulation for TM SPRs to make the
actual source/dest be the actual TM SPRs.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:05 +10:00
Simon Guo
13989b65eb KVM: PPC: Book3S PR: Add math support for PR KVM HTM
The math registers will be saved into vcpu->arch.fp/vr and corresponding
vcpu->arch.fp_tm/vr_tm area.

We flush or giveup the math regs into vcpu->arch.fp/vr before saving
transaction. After transaction is restored, the math regs will be loaded
back into regs.

If there is a FP/VEC/VSX unavailable exception during transaction active
state, the math checkpoint content might be incorrect and we need to do
treclaim./load the correct checkpoint val/trechkpt. sequence to retry the
transaction. That will make our solution complicated. To solve this issue,
we always make the hardware guest MSR math bits (shadow_msr) consistent
with the MSR val which guest sees (kvmppc_get_msr()) when guest msr is
with tm enabled. Then all FP/VEC/VSX unavailable exception can be delivered
to guest and guest handles the exception by itself.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:30:00 +10:00
Simon Guo
8d2e2fc5e0 KVM: PPC: Book3S PR: Add transaction memory save/restore skeleton
The transaction memory checkpoint area save/restore behavior is
triggered when VCPU qemu process is switching out/into CPU, i.e.
at kvmppc_core_vcpu_put_pr() and kvmppc_core_vcpu_load_pr().

MSR TM active state is determined by TS bits:
    active: 10(transactional) or 01 (suspended)
    inactive: 00 (non-transactional)
We don't "fake" TM functionality for guest. We "sync" guest virtual
MSR TM active state(10 or 01) with shadow MSR. That is to say,
we don't emulate a transactional guest with a TM inactive MSR.

TM SPR support(TFIAR/TFAR/TEXASR) has already been supported by
commit 9916d57e64 ("KVM: PPC: Book3S PR: Expose TM registers").
Math register support (FPR/VMX/VSX) will be done at subsequent
patch.

Whether TM context need to be saved/restored can be determined
by kvmppc_get_msr() TM active state:
	* TM active - save/restore TM context
	* TM inactive - no need to do so and only save/restore
TM SPRs.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:55 +10:00
Simon Guo
66c33e796c KVM: PPC: Book3S PR: Add kvmppc_save/restore_tm_sprs() APIs
This patch adds 2 new APIs, kvmppc_save_tm_sprs() and
kvmppc_restore_tm_sprs(), for the purpose of TEXASR/TFIAR/TFHAR
save/restore.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:51 +10:00
Simon Guo
de7ad93219 KVM: PPC: Book3S PR: Add new kvmppc_copyto/from_vcpu_tm APIs
This patch adds 2 new APIs: kvmppc_copyto_vcpu_tm() and
kvmppc_copyfrom_vcpu_tm().  These 2 APIs will be used to copy from/to TM
data between VCPU_TM/VCPU area.

PR KVM will use these APIs for treclaim. or trechkpt. emulation.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:47 +10:00
Simon Guo
36383a0862 KVM: PPC: Book3S PR: Avoid changing TS bits when exiting guest
PR KVM host usually runs with TM enabled in its host MSR value, and
with non-transactional TS value.

When a guest with TM active traps into PR KVM host, the rfid at the
tail of kvmppc_interrupt_pr() will try to switch TS bits from
S0 (Suspended & TM disabled) to N1 (Non-transactional & TM enabled).

That will leads to TM Bad Thing interrupt.

This patch manually sets target TS bits unchanged to avoid this
exception.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:42 +10:00
Simon Guo
401a89e937 KVM: PPC: Book3S PR: Implement RFID TM behavior to suppress change from S0 to N0
According to ISA specification for RFID, in MSR TM disabled and TS
suspended state (S0), if the target MSR is TM disabled and TS state is
inactive (N0), rfid should suppress this update.

This patch makes the RFID emulation of PR KVM consistent with this.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:38 +10:00
Simon Guo
95757bfc72 KVM: PPC: Book3S PR: Sync TM bits to shadow msr for problem state guest
MSR TS bits can be modified with non-privileged instruction such as
tbegin./tend.  That means guest can change MSR value "silently" without
notifying host.

It is necessary to sync the TM bits to host so that host can calculate
shadow msr correctly.

Note, privileged mode in the guest will always fail transactions so we
only take care of problem state mode in the guest.

The logic is put into kvmppc_copy_from_svcpu() so that
kvmppc_handle_exit_pr() can use correct MSR TM bits even when preemption
occurs.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:33 +10:00
Simon Guo
901938add3 KVM: PPC: Book3S PR: Pass through MSR TM and TS bits to shadow_msr
PowerPC TM functionality needs MSR TM/TS bits support in hardware level.
Guest TM functionality can not be emulated with "fake" MSR (msr in magic
page) TS bits.

This patch syncs TM/TS bits in shadow_msr with the MSR value in magic
page, so that the MSR TS value which guest sees is consistent with actual
MSR bits running in guest.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:28 +10:00
Simon Guo
25ddda07b6 KVM: PPC: Book3S PR: Transition to Suspended state when injecting interrupt
This patch simulates interrupt behavior per Power ISA while injecting
interrupt in PR KVM:
- When interrupt happens, transactional state should be suspended.

kvmppc_mmu_book3s_64_reset_msr() will be invoked when injecting an
interrupt. This patch performs this ISA logic in
kvmppc_mmu_book3s_64_reset_msr().

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:22 +10:00
Simon Guo
caa3be92be KVM: PPC: Book3S PR: Add C function wrapper for _kvmppc_save/restore_tm()
Currently __kvmppc_save/restore_tm() APIs can only be invoked from
assembly function. This patch adds C function wrappers for them so
that they can be safely called from C function.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:17 +10:00
Simon Guo
7f386af7bd KVM: PPC: Book3S PR: Turn on FP/VSX/VMX MSR bits in kvmppc_save_tm()
kvmppc_save_tm() invokes store_fp_state/store_vr_state(). So it is
mandatory to turn on FP/VSX/VMX MSR bits for its execution, just
like what kvmppc_restore_tm() did.

Previously HV KVM has turned the bits on outside of function
kvmppc_save_tm().  Now we include this bit change in kvmppc_save_tm()
so that the logic is cleaner. And PR KVM can reuse it later.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:29:12 +10:00
Simon Guo
6f597c6b63 KVM: PPC: Book3S PR: Add guest MSR parameter for kvmppc_save_tm()/kvmppc_restore_tm()
HV KVM and PR KVM need different MSR source to indicate whether
treclaim. or trecheckpoint. is necessary.

This patch add new parameter (guest MSR) for these kvmppc_save_tm/
kvmppc_restore_tm() APIs:
- For HV KVM, it is VCPU_MSR
- For PR KVM, it is current host MSR or VCPU_SHADOW_SRR1

This enhancement enables these 2 APIs to be reused by PR KVM later.
And the patch keeps HV KVM logic unchanged.

This patch also reworks kvmppc_save_tm()/kvmppc_restore_tm() to
have a clean ABI: r3 for vcpu and r4 for guest_msr.

During kvmppc_save_tm/kvmppc_restore_tm(), the R1 need to be saved
or restored. Currently the R1 is saved into HSTATE_HOST_R1. In PR
KVM, we are going to add a C function wrapper for
kvmppc_save_tm/kvmppc_restore_tm() where the R1 will be incremented
with added stackframe and save into HSTATE_HOST_R1. There are several
places in HV KVM to load HSTATE_HOST_R1 as R1, and we don't want to
bring risk or confusion by TM code.

This patch will use HSTATE_SCRATCH2 to save/restore R1 in
kvmppc_save_tm/kvmppc_restore_tm() to avoid future confusion, since
the r1 is actually a temporary/scratch value to be saved/stored.

[paulus@ozlabs.org - rebased on top of 7b0e827c69 ("KVM: PPC: Book3S HV:
 Factor fake-suspend handling out of kvmppc_save/restore_tm", 2018-05-30)]

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-06-01 10:27:59 +10:00
Simon Guo
009c872a8b KVM: PPC: Book3S PR: Move kvmppc_save_tm/kvmppc_restore_tm to separate file
It is a simple patch just for moving kvmppc_save_tm/kvmppc_restore_tm()
functionalities to tm.S. There is no logic change. The reconstruct of
those APIs will be done in later patches to improve readability.

It is for preparation of reusing those APIs on both HV/PR PPC KVM.

Some slight change during move the functions includes:
- surrounds some HV KVM specific code with CONFIG_KVM_BOOK3S_HV_POSSIBLE
for compilation.
- use _GLOBAL() to define kvmppc_save_tm/kvmppc_restore_tm()

[paulus@ozlabs.org - rebased on top of 7b0e827c69 ("KVM: PPC: Book3S HV:
 Factor fake-suspend handling out of kvmppc_save/restore_tm", 2018-05-30)]

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-31 11:35:12 +10:00
Paul Mackerras
7b0e827c69 KVM: PPC: Book3S HV: Factor fake-suspend handling out of kvmppc_save/restore_tm
This splits out the handling of "fake suspend" mode, part of the
hypervisor TM assist code for POWER9, and puts almost all of it in
new kvmppc_save_tm_hv and kvmppc_restore_tm_hv functions.  The new
functions branch to kvmppc_save/restore_tm if the CPU does not
require hypervisor TM assistance.

With this, it will be more straightforward to move kvmppc_save_tm and
kvmppc_restore_tm to another file and use them for transactional
memory support in PR KVM.  Additionally, it also makes the code a
bit clearer and reduces the number of feature sections.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-31 09:23:28 +10:00
Paul Mackerras
9617a0b335 KVM: PPC: Book3S PR: Allow KVM_PPC_CONFIGURE_V3_MMU to succeed
Currently, PR KVM does not implement the configure_mmu operation, and
so the KVM_PPC_CONFIGURE_V3_MMU ioctl always fails with an EINVAL
error.  This causes recent kernels to fail to boot as a PR KVM guest
on POWER9, since recent kernels booted in HPT mode do the
H_REGISTER_PROC_TBL hypercall, which causes userspace (QEMU) to do
KVM_PPC_CONFIGURE_V3_MMU, which fails.

This implements a minimal configure_mmu operation for PR KVM.  It
succeeds only if the MMU is being configured for HPT mode and no
process table is being registered.  This is enough to get recent
kernels to boot as a PR KVM guest.

Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-31 09:21:50 +10:00
Simon Guo
acc9eb9305 KVM: PPC: Reimplement LOAD_VMX/STORE_VMX instruction mmio emulation with analyse_instr() input
This patch reimplements LOAD_VMX/STORE_VMX MMIO emulation with
analyse_instr() input. When emulating the store, the VMX reg will need to
be flushed so that the right reg val can be retrieved before writing to
IO MEM.

This patch also adds support for lvebx/lvehx/lvewx/stvebx/stvehx/stvewx
MMIO emulation. To meet the requirement of handling different element
sizes, kvmppc_handle_load128_by2x64()/kvmppc_handle_store128_by2x64()
were replaced with kvmppc_handle_vmx_load()/kvmppc_handle_vmx_store().

The framework used is similar to VSX instruction MMIO emulation.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:53:00 +10:00
Simon Guo
da2a32b876 KVM: PPC: Expand mmio_vsx_copy_type to cover VMX load/store element types
VSX MMIO emulation uses mmio_vsx_copy_type to represent VSX emulated
element size/type, such as KVMPPC_VSX_COPY_DWORD_LOAD, etc. This
patch expands mmio_vsx_copy_type to cover VMX copy type, such as
KVMPPC_VMX_COPY_BYTE(stvebx/lvebx), etc. As a result,
mmio_vsx_copy_type is also renamed to mmio_copy_type.

It is a preparation for reimplementing VMX MMIO emulation.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:52:55 +10:00
Simon Guo
b01c78c297 KVM: PPC: Reimplement LOAD_VSX/STORE_VSX instruction mmio emulation with analyse_instr() input
This patch reimplements LOAD_VSX/STORE_VSX instruction MMIO emulation with
analyse_instr() input. It utilizes VSX_FPCONV/VSX_SPLAT/SIGNEXT exported
by analyse_instr() and handle accordingly.

When emulating VSX store, the VSX reg will need to be flushed so that
the right reg val can be retrieved before writing to IO MEM.

[paulus@ozlabs.org - mask the register number to 5 bits.]

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:52:28 +10:00
Simon Guo
2b33cb585f KVM: PPC: Reimplement LOAD_FP/STORE_FP instruction mmio emulation with analyse_instr() input
This patch reimplements LOAD_FP/STORE_FP instruction MMIO emulation with
analyse_instr() input. It utilizes the FPCONV/UPDATE properties exported by
analyse_instr() and invokes kvmppc_handle_load(s)/kvmppc_handle_store()
accordingly.

For FP store MMIO emulation, the FP regs need to be flushed firstly so
that the right FP reg vals can be read from vcpu->arch.fpr, which will
be stored into MMIO data.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:51:18 +10:00
Simon Guo
2e6baa46b4 KVM: PPC: Add giveup_ext() hook to PPC KVM ops
Currently HV will save math regs(FP/VEC/VSX) when trap into host. But
PR KVM will only save math regs when qemu task switch out of CPU, or
when returning from qemu code.

To emulate FP/VEC/VSX mmio load, PR KVM need to make sure that math
regs were flushed firstly and then be able to update saved VCPU
FPR/VEC/VSX area reasonably.

This patch adds giveup_ext() field to KVM ops. Only PR KVM has non-NULL
giveup_ext() ops. kvmppc_complete_mmio_load() can invoke that hook
(when not NULL) to flush math regs accordingly, before updating saved
register vals.

Math regs flush is also necessary for STORE, which will be covered
in later patch within this patch series.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:51:13 +10:00
Simon Guo
7092360399 KVM: PPC: Reimplement non-SIMD LOAD/STORE instruction mmio emulation with analyse_instr() input
This patch reimplements non-SIMD LOAD/STORE instruction MMIO emulation
with analyse_instr() input. It utilizes the BYTEREV/UPDATE/SIGNEXT
properties exported by analyse_instr() and invokes
kvmppc_handle_load(s)/kvmppc_handle_store() accordingly.

It also moves CACHEOP type handling into the skeleton.

instruction_type within kvm_ppc.h is renamed to avoid conflict with
sstep.h.

Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:51:08 +10:00
Simon Guo
94dd7fa1c0 KVM: PPC: Add KVMPPC_VSX_COPY_WORD_LOAD_DUMP type support for mmio emulation
Some VSX instructions like lxvwsx will splat word into VSR. This patch
adds a new VSX copy type KVMPPC_VSX_COPY_WORD_LOAD_DUMP to support this.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-22 19:51:03 +10:00
Paul Mackerras
ec531d027a KVM: PPC: Book3S PR: Enable use on POWER9 inside HPT-mode guests
This relaxes the restriction on using PR KVM on POWER9.  The existing
code does work inside a guest partition running in HPT mode, because
hypercalls such as H_ENTER use the old HPTE format, not the new
format used by POWER9, and so no change to PR KVM's HPT manipulation
code is required.  PR KVM will still refuse to run if the kernel is
using radix translation or if it is running bare-metal.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 21:49:28 +10:00
Nicholas Piggin
7c1bd80cc2 KVM: PPC: Book3S HV: Send kvmppc_bad_interrupt NMIs to Linux handlers
It's possible to take a SRESET or MCE in these paths due to a bug
in the host code or a NMI IPI, etc. A recent bug attempting to load
a virtual address from real mode gave th complete but cryptic error,
abridged:

      Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted
      NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
      REGS: c000000fff76dd80 TRAP: 0200   Not tainted
      MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
      CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
      NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
      LR [c0000000000c2430] do_tlbies+0x230/0x2f0

Sending the NMIs through the Linux handlers gives a nicer output:

      Severe Machine check interrupt [Not recovered]
        NIP [c0000000000155ac]: perf_trace_tlbie+0x2c/0x1a0
        Initiator: CPU
        Error type: Real address [Load (bad)]
          Effective address: d00017fffcc01a28
      opal: Machine check interrupt unrecoverable: MSR(RI=0)
      opal: Hardware platform error: Unrecoverable Machine Check exception
      CPU: 0 PID: 6700 Comm: qemu-system-ppc Tainted: G   M
      NIP:  c0000000000155ac LR: c0000000000c23c0 CTR: c000000000015580
      REGS: c000000fff9e9d80 TRAP: 0200   Tainted: G   M
      MSR:  9000000000201001 <SF,HV,ME,LE>  CR: 48082222  XER: 00000000
      CFAR: 000000010cbc1a30 DAR: d00017fffcc01a28 DSISR: 00000040 SOFTE: 3
      NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
      LR [c0000000000c23c0] do_tlbies+0x1c0/0x280

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
eadce3b48b KVM: PPC: Book3S HV: Fix kvmppc_bad_host_intr for real mode interrupts
When CONFIG_RELOCATABLE=n, the Linux real mode interrupt handlers call
into KVM using real address. This needs to be translated to the kernel
linear effective address before the MMU is switched on.

kvmppc_bad_host_intr misses adding these bits, so when it is used to
handle a system reset interrupt (that always gets delivered in real
mode), it results in an instruction access fault immediately after
the MMU is turned on.

Fix this by ensuring the top 2 address bits are set when the MMU is
turned on.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
878cf2bb2d KVM: PPC: Book3S HV: radix: Do not clear partition PTE when RC or write bits do not match
Adding the write bit and RC bits to pte permissions does not require a
pte clear and flush. There should not be other bits changed here,
because restricting access or changing the PFN must have already
invalidated any existing ptes (otherwise the race is already lost).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
bc64dd0e1c KVM: PPC: Book3S HV: radix: Refine IO region partition scope attributes
When the radix fault handler has no page from the process address
space (e.g., for IO memory), it looks up the process pte and sets
partition table pte using that to get attributes like CI and guarded.
If the process table entry is to be writable, set _PAGE_DIRTY as well
to avoid an RC update. If not, then ensure _PAGE_DIRTY does not come
across. Set _PAGE_ACCESSED as well to avoid RC update.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
9a4506e11b KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on
The radix guest code can has fewer restrictions about what context it
can run in, so move this flushing out of assembly and have it use the
Linux TLB flush implementations introduced previously.

This allows powerpc:tlbie trace events to be used.

This changes the tlbiel sequence to only execute RIC=2 flush once on
the first set flushed, then RIC=0 for the rest of the sets. The end
result of the flush should be unchanged. This matches the local PID
flush pattern that was introduced in a5998fcb92 ("powerpc/mm/radix:
Optimise tlbiel flush all case").

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
d91cb39ffa KVM: PPC: Book3S HV: Make radix use the Linux translation flush functions for partition scope
This has the advantage of consolidating TLB flush code in fewer
places, and it also implements powerpc:tlbie trace events.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
a5704e83aa KVM: PPC: Book3S HV: Recursively unmap all page table entries when unmapping
When partition scope mappings are unmapped with kvm_unmap_radix, the
pte is cleared, but the page table structure is left in place. If the
next page fault requests a different page table geometry (e.g., due to
THP promotion or split), kvmppc_create_pte is responsible for changing
the page tables.

When a page table entry is to be converted to a large pte, the page
table entry is cleared, the PWC flushed, then the page table it points
to freed. This will cause pte page tables to leak when a 1GB page is
to replace a pud entry points to a pmd table with pte tables under it:
The pmd table will be freed, but its pte tables will be missed.

Fix this by replacing the simple clear and free code with one that
walks down the page tables and frees children. Care must be taken to
clear the root entry being unmapped then flushing the PWC before
freeing any page tables, as explained in comments.

This requires PWC flush to logically become a flush-all-PWC (which it
already is in hardware, but the KVM API needs to be changed to avoid
confusion).

This code also checks that no unexpected pte entries exist in any page
table being freed, and unmaps those and emits a WARN. This is an
expensive operation for the pte page level, but partition scope
changes are rare, so it's unconditional for now to iron out bugs. It
can be put under a CONFIG option or removed after some time.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
a5fad1e959 KVM: PPC: Book3S HV: Use a helper to unmap ptes in the radix fault path
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Nicholas Piggin
b755745147 KVM: PPC: Book3S HV: Lockless tlbie for HPT hcalls
tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
avoid tlbie locking.

Since commit c17b98cf60 ("KVM: PPC: Book3S HV: Remove code for
PPC970 processors"), KVM no longer supports processors that do not
have this feature, so the tlbie locking can be removed completely.
A sanity check for the feature is put in kvmppc_mmu_hv_init.

Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
in HPT mode. 32 instances of the powerpc fork benchmark from selftests
were run with --fork, and the results measured.

Without this patch, total throughput was about 13.5K/sec, and this is
the top of the host profile:

   74.52%  [k] do_tlbies
    2.95%  [k] kvmppc_book3s_hv_page_fault
    1.80%  [k] calc_checksum
    1.80%  [k] kvmppc_vcpu_run_hv
    1.49%  [k] kvmppc_run_core

After this patch, throughput was about 51K/sec, with this profile:

   21.28%  [k] do_tlbies
    5.26%  [k] kvmppc_run_core
    4.88%  [k] kvmppc_book3s_hv_page_fault
    3.30%  [k] _raw_spin_lock_irqsave
    3.25%  [k] gup_pgd_range

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Simon Guo
f19d1f367a KVM: PPC: Fix a mmio_host_swabbed uninitialized usage issue
When KVM emulates VMX store, it will invoke kvmppc_get_vmx_data() to
retrieve VMX reg val. kvmppc_get_vmx_data() will check mmio_host_swabbed
to decide which double word of vr[] to be used. But the
mmio_host_swabbed can be uninitialized during VMX store procedure:

kvmppc_emulate_loadstore
	\- kvmppc_handle_store128_by2x64
		\- kvmppc_get_vmx_data

So vcpu->arch.mmio_host_swabbed is not meant to be used at all for
emulation of store instructions, and this patch makes that true for
VMX stores. This patch also initializes mmio_host_swabbed to avoid
possible future problems.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Simon Guo
173c520a04 KVM: PPC: Move nip/ctr/lr/xer registers to pt_regs in kvm_vcpu_arch
This patch moves nip/ctr/lr/xer registers from scattered places in
kvm_vcpu_arch to pt_regs structure.

cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
It will need more consideration and may move in later patches.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Simon Guo
1143a70665 KVM: PPC: Add pt_regs into kvm_vcpu_arch and move vcpu->arch.gpr[] into it
Current regs are scattered at kvm_vcpu_arch structure and it will
be more neat to organize them into pt_regs structure.

Also it will enable reimplementation of MMIO emulation code with
analyse_instr() later.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:38:23 +10:00
Paul Mackerras
9c9e9cf40a Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch of the powerpc repository
to get some changes on which future patches will depend, in particular
the definitions of various new TLB flushing functions.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-18 15:30:10 +10:00
Souptick Joarder
16d5c39d54 KVM: PPC: Book3S: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handler
in struct vm_operations_struct. For now, this is
just documenting that the function returns a
VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become
a distinct type.

commit 1c8f422059 ("mm: change return type to
vm_fault_t")

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:41:51 +10:00
Alexey Kardashevskiy
e45719af1c KVM: PPC: Book3S: Check KVM_CREATE_SPAPR_TCE_64 parameters
Although it does not seem possible to break the host by passing bad
parameters when creating a TCE table in KVM, it is still better to get
an early clear indication of that than debugging weird effect this might
bring.

This adds some sanity checks that the page size is 4KB..16GB as this is
what the actual LoPAPR supports and that the window actually fits 64bit
space.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:41:51 +10:00
Alexey Kardashevskiy
ca1fc489cf KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages
At the moment we only support in the host the IOMMU page sizes which
the guest is aware of, which is 4KB/64KB/16MB. However P9 does not support
16MB IOMMU pages, 2MB and 1GB pages are supported instead. We can still
emulate bigger guest pages (for example 16MB) with smaller host pages
(4KB/64KB/2MB).

This allows the physical IOMMU pages to use a page size smaller or equal
than the guest visible IOMMU page size.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:41:51 +10:00
Alexey Kardashevskiy
c6b61661d2 KVM: PPC: Book3S: Use correct page shift in H_STUFF_TCE
The other TCE handlers use page shift from the guest visible TCE table
(described by kvmppc_spapr_tce_iommu_table) so let's make H_STUFF_TCE
handlers do the same thing.

This should cause no behavioral change now but soon we will allow
the iommu_table::it_page_shift being different from from the emulated
table page size so this will play a role.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:41:51 +10:00
Paul Mackerras
48e70b1ce6 KVM: PPC: Book3S HV: Fix inaccurate comment
We now have interrupts hard-disabled when coming back from
kvmppc_hv_entry_trampoline, so this changes the comment to reflect
that.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:36:56 +10:00
Paul Mackerras
7aa15842c1 KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly
Although Linux doesn't use PURR and SPURR ((Scaled) Processor
Utilization of Resources Register), other OSes depend on them.
On POWER8 they count at a rate depending on whether the VCPU is
idle or running, the activity of the VCPU, and the value in the
RWMR (Region-Weighting Mode Register).  Hardware expects the
hypervisor to update the RWMR when a core is dispatched to reflect
the number of online VCPUs in the vcore.

This adds code to maintain a count in the vcore struct indicating
how many VCPUs are online.  In kvmppc_run_core we use that count
to set the RWMR register on POWER8.  If the core is split because
of a static or dynamic micro-threading mode, we use the value for
8 threads.  The RWMR value is not relevant when the host is
executing because Linux does not use the PURR or SPURR register,
so we don't bother saving and restoring the host value.

For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE
register, we set online to 1 if it was 0 at the time of a KVM_RUN
ioctl.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:36:48 +10:00
Paul Mackerras
a1f158262a KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interface
This adds a new KVM_REG_PPC_ONLINE register which userspace can set
to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it
considers the VCPU to be offline (0), that is, not currently running,
or online (1).  This will be used in a later patch to configure the
register which controls PURR and SPURR accumulation.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 16:36:36 +10:00
Paul Mackerras
df158189db KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path
A radix guest can execute tlbie instructions to invalidate TLB entries.
After a tlbie or a group of tlbies, it must then do the architected
sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation
has been processed by all CPUs in the system before it can rely on
no CPU using any translation that it just invalidated.

In fact it is the ptesync which does the actual synchronization in
this sequence, and hardware has a requirement that the ptesync must
be executed on the same CPU thread as the tlbies which it is expected
to order.  Thus, if a vCPU gets moved from one physical CPU to
another after it has done some tlbies but before it can get to do the
ptesync, the ptesync will not have the desired effect when it is
executed on the second physical CPU.

To fix this, we do a ptesync in the exit path for radix guests.  If
there are any pending tlbies, this will wait for them to complete.
If there aren't, then ptesync will just do the same as sync.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 15:17:13 +10:00
Benjamin Herrenschmidt
9dc81d6b0f KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change
When a vcpu priority (CPPR) is set to a lower value (masking more
interrupts), we stop processing interrupts already in the queue
for the priorities that have now been masked.

If those interrupts were previously re-routed to a different
CPU, they might still be stuck until the older one that has
them in its queue processes them. In the case of guest CPU
unplug, that can be never.

To address that without creating additional overhead for
the normal interrupt processing path, this changes H_CPPR
handling so that when such a priority change occurs, we
scan the interrupt queue for that vCPU, and for any
interrupt in there that has been re-routed, we replace it
with a dummy and force a re-trigger.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 15:17:06 +10:00
Nicholas Piggin
7e3d9a1d0f KVM: PPC: Book3S HV: Make radix clear pte when unmapping
The current partition table unmap code clears the _PAGE_PRESENT bit
out of the pte, which leaves pud_huge/pmd_huge true and does not
clear pud_present/pmd_present.  This can confuse subsequent page
faults and possibly lead to the guest looping doing continual
hypervisor page faults.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 15:16:59 +10:00
Nicholas Piggin
e2560b108f KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page
The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it
is ordered with respect to subsequent operations.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 15:16:53 +10:00
Paul Mackerras
57b8daa70a KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
Currently, the HV KVM guest entry/exit code adds the timebase offset
from the vcore struct to the timebase on guest entry, and subtracts
it on guest exit.  Which is fine, except that it is possible for
userspace to change the offset using the SET_ONE_REG interface while
the vcore is running, as there is only one timebase offset per vcore
but potentially multiple VCPUs in the vcore.  If that were to happen,
KVM would subtract a different offset on guest exit from that which
it had added on guest entry, leading to the timebase being out of sync
between cores in the host, which then leads to bad things happening
such as hangs and spurious watchdog timeouts.

To fix this, we add a new field 'tb_offset_applied' to the vcore struct
which stores the offset that is currently applied to the timebase.
This value is set from the vcore tb_offset field on guest entry, and
is what is subtracted from the timebase on guest exit.  Since it is
zero when the timebase offset is not applied, we can simplify the
logic in kvmhv_start_timing and kvmhv_accumulate_time.

In addition, we had secondary threads reading the timebase while
running concurrently with code on the primary thread which would
eventually add or subtract the timebase offset from the timebase.
This occurred while saving or restoring the DEC register value on
the secondary threads.  Although no specific incorrect behaviour has
been observed, this is a race which should be fixed.  To fix it, we
move the DEC saving code to just before we call kvmhv_commence_exit,
and the DEC restoring code to after the point where we have waited
for the primary thread to switch the MMU context and add the timebase
offset.  That way we are sure that the timebase contains the guest
timebase value in both cases.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-05-17 15:16:45 +10:00
Michael Ellerman
a42ae34f71 Merge branch 'topic/ppc-kvm' into next
This brings in one commit that we may want to share with the kvm-ppc
tree, to avoid merge conflicts and get wider testing.
2018-05-15 22:28:19 +10:00
Aneesh Kumar K.V
21828c99ee powerpc/kvm: Switch kvm pmd allocator to custom allocator
In the next set of patches, we will switch pmd allocator to use page fragments
and the locking will be updated to split pmd ptlock. We want to avoid using
fragments for partition-scoped table. Use slab cache similar to level 4 table

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-15 22:27:06 +10:00
Naveen N. Rao
a4bc64d305 powerpc64/ftrace: Disable ftrace during kvm entry/exit
During guest entry/exit, we switch over to/from the guest MMU context
and we cannot take exceptions in the hypervisor code.

Since ftrace may be enabled and since it can result in us taking a trap,
disable ftrace by setting paca->ftrace_enabled to zero. There are two
paths through which we enter/exit a guest:
1. If we are the vcore runner, then we enter the guest via
__kvmppc_vcore_entry() and we disable ftrace around this. This is always
the case for Power9, and for the primary thread on Power8.
2. If we are a secondary thread in Power8, then we would be in nap due
to SMT being disabled. We are woken up by an IPI to enter the guest. In
this scenario, we enter the guest through kvm_start_guest(). We disable
ftrace at this point. In this scenario, ftrace would only get re-enabled
on the secondary thread when SMT is re-enabled (via start_secondary()).

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-03 22:32:27 +10:00
Laurentiu Tudor
b2d7ecbe35 powerpc/kvm/booke: Fix altivec related build break
Add missing "altivec unavailable" interrupt injection helper
thus fixing the linker error below:

  arch/powerpc/kvm/emulate_loadstore.o: In function `kvmppc_check_altivec_disabled':
  arch/powerpc/kvm/emulate_loadstore.c: undefined reference to `.kvmppc_core_queue_vec_unavail'

Fixes: 09f984961c ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions")
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-27 16:36:03 +10:00
Linus Torvalds
b1cb4f93b5 powerpc fixes for 4.17 #2
- Fix crashes when loading modules built with a different CONFIG_RELOCATABLE
    value by adding CONFIG_RELOCATABLE to vermagic.
 
  - Fix busy loops in the OPAL NVRAM driver if we get certain error conditions
    from firmware.
 
  - Remove tlbie trace points from KVM code that's called in real mode, because
    it causes crashes.
 
  - Fix checkstops caused by invalid tlbiel on Power9 Radix.
 
  - Ensure the set of CPU features we "know" are always enabled is actually the
    minimal set when we build with support for firmware supplied CPU features.
 
 Thanks to:
   Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJa0oWBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAV
 EA//UQB7n33HlroMBL3841VswUrIgY36gSi9+QZPjxHiTYGVStI+u0FHq5hm8OMm
 1FkronD0sfbN7t8LJ9FS6vCoAlW15vMBj95pWJXAL7NuQneO7cM5JvRbowVKbNbq
 GESn4EkUj9bAXl7aTzX2yA7jruzMJIwE/2bgl1J6YkpByHx5x/fgzKTuoCRggQqY
 Xd656ZZyWR/uIuWhAjCPFdrPnekVvo7/Gy5Yo5kcR6LbX4MO0JFNOHpiT3wPGGv/
 LJedJtdpH1biMk7SrqLfWD7mpMsVJeMelpkftEbe8zUXf17wBd1rl4JkybLTDwZD
 HznZ0nbnh0j5Q/KRHwOh0sLDSisJF6pQHH/Bnc4Xqrsn4OERwEk40/Ym/O0kDsjH
 n2F3X5a5QLWoBWRsdV3BMyEzc0bgnb++tXeBWOV4jJBEOhoqxwimCU+3fmLiy95A
 urJYf8Sljwve9I5kJyXKpruopgeoJ1aumRwopE8YCG9lfBgzvU/90u6+uKqAdkOv
 kpZxS5FDT2lNIwbCP9WjEelAOGrtA6JQNWU0iPGwsVAvAxX/p1RwwbQeWVlWs3Ek
 UdVF8eGezClpLPa/FQFdDyXfGE6WIf1vK9YnG7k3rUwwkSqoYxKgVQ3o1Vetp0Lg
 fzPsDdDi2V2I3KDYNav8s6kohAIIS/a9XYk4BGvT/htRnUg=
 =0e7c
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix crashes when loading modules built with a different
   CONFIG_RELOCATABLE value by adding CONFIG_RELOCATABLE to vermagic.

 - Fix busy loops in the OPAL NVRAM driver if we get certain error
   conditions from firmware.

 - Remove tlbie trace points from KVM code that's called in real mode,
   because it causes crashes.

 - Fix checkstops caused by invalid tlbiel on Power9 Radix.

 - Ensure the set of CPU features we "know" are always enabled is
   actually the minimal set when we build with support for firmware
   supplied CPU features.

Thanks to: Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.

* tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features
  powerpc/mm/radix: Fix checkstops caused by invalid tlbiel
  KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
  powerpc/8xx: Fix build with hugetlbfs enabled
  powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
  powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
  powerpc/fscr: Enable interrupts earlier before calling get_user()
  powerpc/64s: Fix section mismatch warnings from setup_rfi_flush()
  powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic
2018-04-15 11:57:12 -07:00
Nicholas Piggin
19ce7909ed KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
This crashes with a "Bad real address for load" attempting to load
from the vmalloc region in realmode (faulting address is in DAR).

  Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 4.16.0-01530-g43d1859f0994
  NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
  REGS: c000000fff76dd80 TRAP: 0200   Not tainted  (4.16.0-01530-g43d1859f0994)
  MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
  CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
  NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
  LR [c0000000000c2430] do_tlbies+0x230/0x2f0

I suspect the reason is the per-cpu data is not in the linear chunk.
This could be restored if that was able to be fixed, but for now,
just remove the tracepoints.

Fixes: 0428491cba ("powerpc/mm: Trace tlbie(l) instructions")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-11 12:35:33 +10:00
Linus Torvalds
d8312a3f61 ARM:
- VHE optimizations
 - EL2 address space randomization
 - speculative execution mitigations ("variant 3a", aka execution past invalid
 privilege register access)
 - bugfixes and cleanups
 
 PPC:
 - improvements for the radix page fault handler for HV KVM on POWER9
 
 s390:
 - more kvm stat counters
 - virtio gpu plumbing
 - documentation
 - facilities improvements
 
 x86:
 - support for VMware magic I/O port and pseudo-PMCs
 - AMD pause loop exiting
 - support for AMD core performance extensions
 - support for synchronous register access
 - expose nVMX capabilities to userspace
 - support for Hyper-V signaling via eventfd
 - use Enlightened VMCS when running on Hyper-V
 - allow userspace to disable MWAIT/HLT/PAUSE vmexits
 - usual roundup of optimizations and nested virtualization bugfixes
 
 Generic:
 - API selftest infrastructure (though the only tests are for x86 as of now)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJay19UAAoJEL/70l94x66DGKYIAIu9PTHAEwaX0et15fPW5y2x
 rrtS355lSAmMrPJ1nePRQ+rProD/1B0Kizj3/9O+B9OTKKRsorRYNa4CSu9neO2k
 N3rdE46M1wHAPwuJPcYvh3iBVXtgbMayk1EK5aVoSXaMXEHh+PWZextkl+F+G853
 kC27yDy30jj9pStwnEFSBszO9ua/URdKNKBATNx8WUP6d9U/dlfm5xv3Dc3WtKt2
 UMGmog2wh0i7ecXo7hRkMK4R7OYP3ZxAexq5aa9BOPuFp+ZdzC/MVpN+jsjq2J/M
 Zq6RNyA2HFyQeP0E9QgFsYS2BNOPeLZnT5Jg1z4jyiD32lAZ/iC51zwm4oNKcDM=
 =bPlD
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:
   - VHE optimizations

   - EL2 address space randomization

   - speculative execution mitigations ("variant 3a", aka execution past
     invalid privilege register access)

   - bugfixes and cleanups

  PPC:
   - improvements for the radix page fault handler for HV KVM on POWER9

  s390:
   - more kvm stat counters

   - virtio gpu plumbing

   - documentation

   - facilities improvements

  x86:
   - support for VMware magic I/O port and pseudo-PMCs

   - AMD pause loop exiting

   - support for AMD core performance extensions

   - support for synchronous register access

   - expose nVMX capabilities to userspace

   - support for Hyper-V signaling via eventfd

   - use Enlightened VMCS when running on Hyper-V

   - allow userspace to disable MWAIT/HLT/PAUSE vmexits

   - usual roundup of optimizations and nested virtualization bugfixes

  Generic:
   - API selftest infrastructure (though the only tests are for x86 as
     of now)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
  kvm: x86: fix a prototype warning
  kvm: selftests: add sync_regs_test
  kvm: selftests: add API testing infrastructure
  kvm: x86: fix a compile warning
  KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
  KVM: X86: Introduce handle_ud()
  KVM: vmx: unify adjacent #ifdefs
  x86: kvm: hide the unused 'cpu' variable
  KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
  Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
  kvm: Add emulation for movups/movupd
  KVM: VMX: raise internal error for exception during invalid protected mode state
  KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
  KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  KVM: x86: Fix misleading comments on handling pending exceptions
  KVM: x86: Rename interrupt.pending to interrupt.injected
  KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
  x86/kvm: use Enlightened VMCS when running on Hyper-V
  x86/hyper-v: detect nested features
  x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
  ...
2018-04-09 11:42:31 -07:00
Linus Torvalds
49a695ba72 powerpc updates for 4.17
Notable changes:
 
  - Support for 4PB user address space on 64-bit, opt-in via mmap().
 
  - Removal of POWER4 support, which was accidentally broken in 2016 and no one
    noticed, and blocked use of some modern instructions.
 
  - Workarounds so that the hypervisor can enable Transactional Memory on Power9.
 
  - A series to disable the DAWR (Data Address Watchpoint Register) on Power9.
 
  - More information displayed in the meltdown/spectre_v1/v2 sysfs files.
 
  - A vpermxor (Power8 Altivec) implementation for the raid6 Q Syndrome.
 
  - A big series to make the allocation of our pacas (per cpu area), kernel page
    tables, and per-cpu stacks NUMA aware when using the Radix MMU on Power9.
 
 And as usual many fixes, reworks and cleanups.
 
 Thanks to:
   Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy, Alistair Popple, Andy
   Shevchenko, Aneesh Kumar K.V, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Christophe Leroy, Christophe Lombard, Cyril Bur, Daniel Axtens,
   Dave Young, Finn Thain, Frederic Barrat, Gustavo Romero, Horia Geantă,
   Jonathan Neuschäfer, Kees Cook, Larry Finger, Laurent Dufour, Laurent Vivier,
   Logan Gunthorpe, Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus
   Elfring, Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de
   Oliveira, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
   Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher Boessenkool,
   Simon Guo, Simon Horman, Stewart Smith, Sukadev Bhattiprolu, Suraj Jitindar
   Singh, Thiago Jung Bauermann, Vaibhav Jain, Vaidyanathan Srinivasan, Vasant
   Hegde, Wei Yongjun.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJayKxDExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYAr
 JQ/6A9Xs4zHDn9OeT9esEIxciETqUlrP0Wp64c4JVC7EkG1E7xRDZ4Xb4m8R2nNt
 9sPhtNO1yCtEk6kFQtPNB0N8v6pud4I6+aMcYnn+tP8mJRYQ4x9bYaF3Hw98IKmE
 Kd6TglmsUQvh2GpwPiF93KpzzWu1HB2kZzzqJcAMTMh7C79Qz00BjrTJltzXB2jx
 tJ+B4lVy8BeU8G5nDAzJEEwb5Ypkn8O40rS/lpAwVTYOBJ8Rbyq8Fj82FeREK9YO
 4EGaEKPkC/FdzX7OJV3v2/nldCd8pzV471fAoGuBUhJiJBMBoBybcTHIdDex7LlL
 zMLV1mUtGo8iolRPhL8iCH+GGifZz2WzstYCozz7hgIraWtc/frq9rZp6q0LdH/K
 trk7UbPGlVb92ecWZVpZyEcsMzKrCgZqnAe9wRNh1uEKScEdzd/bmRaMhENUObRh
 Hili6AVvmSKExpy7k2sZP/oUMaeC15/xz8Lk7l8a/iCkYhNmPYh5iSXM5+UKpcRT
 FYOcO0o3DwXsN46Whow3nJ7TqAsDy9/ecPUG71JQi3ZrHnRrm8jxkn8MCG5pZ1Fi
 KvKDxlg6RiJo3DF9/fSOpJUokvMwqBS5dJo4eh5eiDy94aBTqmBKFecvPxQm7a0L
 l3uXCF/6JuXEvMukFjGBO4RiYhw8i+B2uKsh81XUh7HKrgE=
 =HAB1
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Support for 4PB user address space on 64-bit, opt-in via mmap().

   - Removal of POWER4 support, which was accidentally broken in 2016
     and no one noticed, and blocked use of some modern instructions.

   - Workarounds so that the hypervisor can enable Transactional Memory
     on Power9.

   - A series to disable the DAWR (Data Address Watchpoint Register) on
     Power9.

   - More information displayed in the meltdown/spectre_v1/v2 sysfs
     files.

   - A vpermxor (Power8 Altivec) implementation for the raid6 Q
     Syndrome.

   - A big series to make the allocation of our pacas (per cpu area),
     kernel page tables, and per-cpu stacks NUMA aware when using the
     Radix MMU on Power9.

  And as usual many fixes, reworks and cleanups.

  Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy,
  Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
  Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic
  Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook,
  Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe,
  Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring,
  Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira,
  Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
  Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher
  Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev
  Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav
  Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun"

* tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits)
  powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep
  powerpc/64s: Fix POWER9 DD2.2 and above in cputable features
  powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
  powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits
  Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead"
  powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo}
  powerpc: io.h: move iomap.h include so that it can use readq/writeq defs
  cxl: Fix possible deadlock when processing page faults from cxllib
  powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
  powerpc/mm/radix: Update command line parsing for disable_radix
  powerpc/mm/radix: Parse disable_radix commandline correctly.
  powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
  powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
  powerpc/mm/keys: Update documentation and remove unnecessary check
  powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead
  powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop()
  powerpc/powernv: Always stop secondaries before reboot/shutdown
  powerpc: hard disable irqs in smp_send_stop loop
  powerpc: use NMI IPI for smp_send_stop
  powerpc/powernv: Fix SMT4 forcing idle code
  ...
2018-04-07 12:08:19 -07:00
Nicholas Piggin
e303c08787 KVM: PPC: Book3S HV: Fix ppc_breakpoint_available compile error
arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_h_set_mode’:
arch/powerpc/kvm/book3s_hv.c:745:8: error: implicit declaration of function ‘ppc_breakpoint_available’
   if (!ppc_breakpoint_available())
        ^~~~~~~~~~~~~~~~~~~~~~~~

Fixes: 398e712c00 ("KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-03 21:50:07 +10:00
Nicholas Piggin
471d7ff8b5 powerpc/64s: Remove POWER4 support
POWER4 has been broken since at least the change 49d09bf2a6
("powerpc/64s: Optimise MSR handling in exception handling"), which
requires mtmsrd L=1 support. This was introduced in ISA v2.01, and
POWER4 supports ISA v2.00.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-01 00:47:50 +11:00
Aneesh Kumar K.V
ca9a16c3bc powerpc/kvm: Fix guest boot failure on Power9 since DAWR changes
SLOF checks for 'sc 1' (hypercall) support by issuing a hcall with
H_SET_DABR. Since the recent commit e8ebedbf31 ("KVM: PPC: Book3S
HV: Return error from h_set_dabr() on POWER9") changed H_SET_DABR to
return H_UNSUPPORTED on Power9, we see guest boot failures, the
symptom is the boot seems to just stop in SLOF, eg:

  SLOF ***************************************************************
  QEMU Starting
   Build Date = Sep 24 2017 12:23:07
   FW Version = buildd@ release 20170724
  <no further output>

SLOF can cope if H_SET_DABR returns H_HARDWARE. So wwitch the return
value to H_HARDWARE instead of H_UNSUPPORTED so that we don't break
the guest boot.

That does mean we return a different error to PowerVM in this case,
but that's probably not a big concern.

Fixes: e8ebedbf31 ("KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-01 00:47:13 +11:00
Michael Ellerman
f437c51748 Merge branch 'topic/paca' into next
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.

This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becomming an array of pointers.
2018-03-31 09:09:36 +11:00
Linus Torvalds
72573481eb KVM fixes for v4.16-rc8
PPC:
  - Fix a bug causing occasional machine check exceptions on POWER8 hosts
    (introduced in 4.16-rc1)
 
 x86:
  - Fix a guest crashing regression with nested VMX and restricted guest
    (introduced in 4.16-rc1)
 
  - Fix dependency check for pv tlb flush (The wrong dependency that
    effectively disabled the feature was added in 4.16-rc4, the original
    feature in 4.16-rc1, so it got decent testing.)
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJavUt5AAoJEED/6hsPKofo8uQH/RuijrsAIUnymkYY+6BYFXlh
 Ri8qhG8VB+C3SpWEtsqcqNVkjJTepCD2Ej5BJTL4Gc9BSTWy7Ht6kqskEgwcnzu2
 xRfkg0q0vTj1+GDd+UiTZfxiinoHtB9x3fiXali5UNTCd1fweLxdidETfO+GqMMq
 KDhTR+S8dXE5VG7r+iJ80LZPtHQJ94f0fh9XpQk3X2ExTG5RBxag1U2nCfiKRAZk
 xRv1CNAxNaBxS38CgYfHzg31NJx38fnq/qREsIdOx0Ju9WQkglBFkhLAGUb4vL0I
 nn8YX/oV9cW2G8tyPWjC245AouABOLbzu0xyj5KgCY/z1leA9tdLFX/ET6Zye+E=
 =++uZ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "PPC:
   - Fix a bug causing occasional machine check exceptions on POWER8
     hosts (introduced in 4.16-rc1)

  x86:
   - Fix a guest crashing regression with nested VMX and restricted
     guest (introduced in 4.16-rc1)

   - Fix dependency check for pv tlb flush (the wrong dependency that
     effectively disabled the feature was added in 4.16-rc4, the
     original feature in 4.16-rc1, so it got decent testing)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Fix pv tlb flush dependencies
  KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0
  KVM: PPC: Book3S HV: Fix duplication of host SLB entries
2018-03-30 07:24:14 -10:00
Nicholas Piggin
499dcd4137 powerpc/64s: Allocate LPPACAs individually
We no longer allocate lppacas in an array, so this patch removes the
1kB static alignment for the structure, and enforces the PAPR
alignment requirements at allocation time. We can not reduce the 1kB
allocation size however, due to existing KVM hypervisors.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-30 23:34:24 +11:00
Nicholas Piggin
d2e60075a3 powerpc/64: Use array of paca pointers and allocate pacas individually
Change the paca array into an array of pointers to pacas. Allocate
pacas individually.

This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocate all possible ones up-front then freeing unused.

This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-30 23:34:23 +11:00
Nicholas Piggin
8e0b634b13 powerpc/64s: Do not allocate lppaca if we are not virtualized
The "lppaca" is a structure registered with the hypervisor. This is
unnecessary when running on non-virtualised platforms. One field from
the lppaca (pmcregs_in_use) is also used by the host, so move the host
part out into the paca (lppaca field is still updated in
guest mode).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix non-pseries build with some #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-30 23:34:22 +11:00
Linus Torvalds
a2601d78b7 powerpc fixes for 4.16 #6
These are actually all fixes for pre-4.16 code, or new hardware workarounds.
 
 Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new firmware interface
 for describing CPU features.
 
 Fix lost pending interrupts due to a race in our interrupt soft-masking code.
 
 A workaround for a nest MMU bug with TLB invalidations on Power9.
 
 A workaround for broadcast TLB invalidations on Power9.
 
 Fix a bug in our instruction SLB miss handler, when handling bad addresses
 (eg. >= TASK_SIZE), which could corrupt non-volatile user GPRs.
 
 Thanks to:
   Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJau3wfExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYCz
 dA/+JnB5iKCXCCebnqoaX4AFTqMfxT3nr/+JkfchovZLV0PBVzKME5JtL61udmDe
 j1JZU8UASLqN/8/j652s87XuuRi6xPjSPjMNXmU1LFQ7DjS9yA6FOAsbE4c1Xg4D
 jSded2BSnMRtA/yw8AupvdYr4w72zKMQYzo8/Or3eUQAAge+oX3d1SQiRkD3DOUg
 EdpHnOScSwz6GL9amfaQBhXwvik+4crTQ/wZ/SsTpQrfJkVzHXLn/DnHEP1qO+ky
 v/Y0ix5TxpH132XsVM7UaUvy1ZcZSyEmT2qGOisGm0fj4jesVn9dQMzP+97W4QeW
 ghfHj2fvzx6IsPM3PhNKITknQi/GTrukjSuzYNuj7MyvKY15HUP1MPXNeJUl5thw
 kI5uYWuTvyI3daQKFXRQa7V6H0auuYeEV6/RvIlJ2YtUfqmvyECviNM/+mDC0+Jk
 bgqz47qqeEz2cwIUu/vQm2phVpq+15cLPwmdA37IdyT6GvYgGmsW4HWVIsyxLR2z
 fo9ghX+1oMhmMNhgVYtL2P9BfCzQenK2R+uAmUOHdNyc0LBlGKN+RPAQqQkBhKGp
 BB1L2F13kpeNBNTOsPU4yH3DpPaJFtfnaeL7jd5SanwsxNnoKApFglf0nE73bvbw
 AwRF/vWokbd3WzuPmOtldtluWUHQhaLECU24odVGB/r3XCI=
 =qP8V
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some more powerpc fixes for 4.16. Apologies if this is a bit big at
  rc7, but they're all reasonably important fixes. None are actually for
  new code, so they aren't indicative of 4.16 being in bad shape from
  our point of view.

   - Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new
     firmware interface for describing CPU features.

   - Fix lost pending interrupts due to a race in our interrupt
     soft-masking code.

   - A workaround for a nest MMU bug with TLB invalidations on Power9.

   - A workaround for broadcast TLB invalidations on Power9.

   - Fix a bug in our instruction SLB miss handler, when handling bad
     addresses (eg. >= TASK_SIZE), which could corrupt non-volatile user
     GPRs.

  Thanks to: Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt,
  Nicholas Piggin"

* tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs
  powerpc/mm: Fixup tlbie vs store ordering issue on POWER9
  powerpc/mm/radix: Move the functions that does the actual tlbie closer
  powerpc/mm/radix: Remove unused code
  powerpc/mm: Workaround Nest MMU bug with TLB invalidations
  powerpc/mm: Add tracking of the number of coprocessors using a context
  powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened
  powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features
2018-03-28 13:54:03 -10:00
Michael Ellerman
95dff480bb Merge branch 'fixes' into next
Merge our fixes branch from the 4.16 cycle.

There were a number of important fixes merged, in particular some Power9
workarounds that we want in next for testing purposes. There's also been
some conflicting changes in the CPU features code which are best merged
and tested before going upstream.
2018-03-28 22:59:50 +11:00
Paul Mackerras
31c8b0d069 KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot() in page fault handler
This changes the hypervisor page fault handler for radix guests to use
the generic KVM __gfn_to_pfn_memslot() function instead of using
get_user_pages_fast() and then handling the case of VM_PFNMAP vmas
specially.  The old code missed the case of VM_IO vmas; with this
change, VM_IO vmas will now be handled correctly by code within
__gfn_to_pfn_memslot.

Currently, __gfn_to_pfn_memslot calls hva_to_pfn, which only uses
__get_user_pages_fast for the initial lookup in the cases where
either atomic or async is set.  Since we are not setting either
atomic or async, we do our own __get_user_pages_fast first, for now.

This also adds code to check for the KVM_MEM_READONLY flag on the
memslot.  If it is set and this is a write access, we synthesize a
data storage interrupt for the guest.

In the case where the page is not normal RAM (i.e. page == NULL in
kvmppc_book3s_radix_page_fault(), we read the PTE from the Linux page
tables because we need the mapping attribute bits as well as the PFN.
(The mapping attribute bits indicate whether accesses have to be
non-cacheable and/or guarded.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-28 08:27:58 +11:00
Michael Ellerman
c0b346729b Merge branch 'topic/ppc-kvm' into next
Merge the DAWR series, which touches arch code and KVM code and may need
to be merged into the kvm-ppc tree.
2018-03-27 23:55:49 +11:00
Michael Neuling
b53221e704 KVM: PPC: Book3S HV: Handle migration with POWER9 disabled DAWR
POWER9 with the DAWR disabled causes problems for partition
migration. Either we have to fail the migration (since we lose the
DAWR) or we silently drop the DAWR and allow the migration to pass.

This patch does the latter and allows the migration to pass (at the
cost of silently losing the DAWR). This is not ideal but hopefully the
best overall solution. This approach has been acked by Paulus.

With this patch kvmppc_set_one_reg() will store the DAWR in the vcpu
but won't actually set it on POWER9 hardware.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-27 23:55:33 +11:00
Michael Neuling
e8ebedbf31 KVM: PPC: Book3S HV: Return error from h_set_dabr() on POWER9
POWER7 compat mode guests can use h_set_dabr on POWER9. POWER9 should
use the DAWR but since it's disabled there we can't.

This returns H_UNSUPPORTED on a h_set_dabr() on POWER9 where the DAWR
is disabled.

Current Linux guests ignore this error, so they will silently not get
the DAWR (sigh). The same error code is being used by POWERVM in this
case.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-27 23:55:32 +11:00
Michael Neuling
398e712c00 KVM: PPC: Book3S HV: Return error from h_set_mode(SET_DAWR) on POWER9
Return H_P2 on a h_set_mode(SET_DAWR) on POWER9 where the DAWR is
disabled.

Current Linux guests ignore this error, so they will silently not get
the DAWR (sigh). The same error code is being used by POWERVM in this
case.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-27 23:55:32 +11:00
Michael Ellerman
a26cf1c9fe Merge branch 'topic/ppc-kvm' into next
This brings in two series from Paul, one of which touches KVM code and
may need to be merged into the kvm-ppc tree to resolve conflicts.
2018-03-24 08:43:18 +11:00
Paul Mackerras
681c617b7c KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state
This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors,
where the contents of the TEXASR can get corrupted while a thread is
in fake suspend state.  The workaround is for the instruction emulation
code to use the value saved at the most recent guest exit in real
suspend mode.  We achieve this by simply not saving the TEXASR into
the vcpu struct on an exit in fake suspend state.  We also have to
take care to set the orig_texasr field only on guest exit in real
suspend state.

This also means that on guest entry in fake suspend state, TEXASR
will be restored to the value it had on the last exit in real suspend
state, effectively counteracting any hardware-caused corruption.  This
works because TEXASR may not be written in suspend state.

With this, the guest might see the wrong values in TEXASR if it reads
it while in suspend state, but will see the correct value in
non-transactional state (e.g. after a treclaim), and treclaim will
work correctly.

With this workaround, the code will actually run slightly faster, and
will operate correctly on systems without the TEXASR bug (since TEXASR
may not be written in suspend state, and is only changed by failure
recording, which will have already been done before we get into fake
suspend state).  Therefore these changes are not made subject to a CPU
feature bit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-24 00:39:17 +11:00
Suraj Jitindar Singh
87a11bb6a7 KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode
This works around a hardware bug in "Nimbus" POWER9 DD2.2 processors,
where a treclaim performed in fake suspend mode can cause subsequent
reads from the XER register to return inconsistent values for the SO
(summary overflow) bit.  The inconsistent SO bit state can potentially
be observed on any thread in the core.  We have to do the treclaim
because that is the only way to get the thread out of suspend state
(fake or real) and into non-transactional state.

The workaround for the bug is to force the core into SMT4 mode before
doing the treclaim.  This patch adds the code to do that, conditional
on the CPU_FTR_P9_TM_XER_SO_BUG feature bit.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-24 00:39:16 +11:00
Paul Mackerras
4bb3c7a020 KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode).  Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads.  The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems.  This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.

The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional.  The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated.  The trechkpt
instruction also causes a soft patch interrupt.

On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present.  The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state.  Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
reads back as 0.

On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.

Emulation of the instructions that cause a softpatch interrupt is
handled in two paths.  If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state.  This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active.  If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.

The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0.  The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.

With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-24 00:39:13 +11:00
Aneesh Kumar K.V
a5d4b5891c powerpc/mm: Fixup tlbie vs store ordering issue on POWER9
On POWER9, under some circumstances, a broadcast TLB invalidation
might complete before all previous stores have drained, potentially
allowing stale stores from becoming visible after the invalidation.
This works around it by doubling up those TLB invalidations which was
verified by HW to be sufficient to close the risk window.

This will be documented in a yet-to-be-published errata.

Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Enable the feature in the DT CPU features code for all Power9,
      rename the feature to CPU_FTR_P9_TLBIE_BUG per benh.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-03-23 20:48:03 +11:00
Paul Mackerras
cda4a14733 KVM: PPC: Book3S HV: Fix duplication of host SLB entries
Since commit 6964e6a4e4 ("KVM: PPC: Book3S HV: Do SLB load/unload
with guest LPCR value loaded", 2018-01-11), we have been seeing
occasional machine check interrupts on POWER8 systems when running
KVM guests, due to SLB multihit errors.

This turns out to be due to the guest exit code reloading the host
SLB entries from the SLB shadow buffer when the SLB was not previously
cleared in the guest entry path.  This can happen because the path
which skips from the guest entry code to the guest exit code without
entering the guest now does the skip before the SLB is cleared and
loaded with guest values, but the host values are loaded after the
point in the guest exit path that we skip to.

To fix this, we move the code that reloads the host SLB values up
so that it occurs just before the point in the guest exit code (the
label guest_bypass:) where we skip to from the guest entry path.

Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Fixes: 6964e6a4e4 ("KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded")
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-23 13:42:51 +11:00
Paul Mackerras
58c5c276b4 KVM: PPC: Book3S HV: Handle 1GB pages in radix page fault handler
This adds code to the radix hypervisor page fault handler to handle the
case where the guest memory is backed by 1GB hugepages, and put them
into the partition-scoped radix tree at the PUD level.  The code is
essentially analogous to the code for 2MB pages.  This also rearranges
kvmppc_create_pte() to make it easier to follow.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-19 10:08:53 +11:00
Paul Mackerras
f7caf712d8 KVM: PPC: Book3S HV: Streamline setting of reference and change bits
When using the radix MMU, we can get hypervisor page fault interrupts
with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an
attempt to set the R (reference) or C (change) bit in a PTE atomically
failed.  Previously we would find the corresponding Linux PTE and
check the permission and dirty bits there, but this is not really
necessary since we only need to do what the hardware was trying to
do, namely set R or C atomically.  This removes the code that reads
the Linux PTE and just update the partition-scoped PTE, having first
checked that it is still present, and if the access is a write, that
the PTE still has write permission.

Furthermore, we now check whether any other relevant bits are set
in DSISR, and if there are, then we proceed with the rest of the
function in order to handle whatever condition they represent,
instead of returning to the guest as we did previously.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-19 10:08:44 +11:00
Paul Mackerras
c4c8a7643e KVM: PPC: Book3S HV: Radix page fault handler optimizations
This improves the handling of transparent huge pages in the radix
hypervisor page fault handler.  Previously, if a small page is faulted
in to a 2MB region of guest physical space, that means that there is
a page table pointer at the PMD level, which could never be replaced
by a leaf (2MB) PMD entry.  This adds the code to clear the PMD,
invlidate the page walk cache and free the page table page in this
situation, so that the leaf PMD entry can be created.

This also adds code to check whether a PMD or PTE being inserted is
the same as is already there (because of a race with another CPU that
faulted on the same page) and if so, we don't replace the existing
entry, meaning that we don't invalidate the PTE or PMD and do a TLB
invalidation.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-19 10:08:38 +11:00
Paul Mackerras
39c983ea0f KVM: PPC: Remove unused kvm_unmap_hva callback
Since commit fb1522e099 ("KVM: update to new mmu_notifier semantic
v2", 2017-08-31), the MMU notifier code in KVM no longer calls the
kvm_unmap_hva callback.  This removes the PPC implementations of
kvm_unmap_hva().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-19 10:08:29 +11:00
Paul Mackerras
a8b48a4dcc KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry
This fixes a bug where the trap number that is returned by
__kvmppc_vcore_entry gets corrupted.  The effect of the corruption
is that IPIs get ignored on POWER9 systems when the IPI is sent via
a doorbell interrupt to a CPU which is executing in a KVM guest.
The effect of the IPI being ignored is often that another CPU locks
up inside smp_call_function_many() (and if that CPU is holding a
spinlock, other CPUs then lock up inside raw_spin_lock()).

The trap number is currently held in register r12 for most of the
assembly-language part of the guest exit path.  In that path, we
call kvmppc_subcore_exit_guest(), which is a C function, without
restoring r12 afterwards.  Depending on the kernel config and the
compiler, it may modify r12 or it may not, so some config/compiler
combinations see the bug and others don't.

To fix this, we arrange for the trap number to be stored on the
stack from the 'guest_bypass:' label until the end of the function,
then the trap number is loaded and returned in r12 as before.

Cc: stable@vger.kernel.org # v4.8+
Fixes: fd7bacbca4 ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-14 15:10:50 +11:00
Michael Ellerman
2852aab819 Merge two commits from 'kvm-ppc-fixes' into next
This merges two commits from the `kvm-ppc-fixes` branch into next, as
they fix build breaks we are seeing while testing next.
2018-03-13 15:08:41 +11:00
Radim Krčmář
db5679379a Fixes for PPC KVM:
- Fix guest time accounting in the host
 - Fix large-page backing for radix guests on POWER9
 - Fix HPT guests on POWER9 backed by 2M or 1G pages
 - Compile fixes for some configs and gcc versions
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJammR1AAoJEJ2a6ncsY3Gf2pQIAKf1sBKimpDj/yeBlWVbS41q
 6mIJh8R+4DV7TcOSVoOdzUz1dU1cseVznqqr3kexu+unoUpcqm240ZUDsDNWy9j0
 Xv0JyrGOcPor9sQmlb1s2gOsybxhic4u8Ih1eQV47bEUw1Rb84/da0JI1u5nMRTq
 nm3OHPGSnK2C8UkBBVjGelLJGUx+uaLFLjJSSTd0F9+hlxjGT3yXjP3wLG/ZNajT
 6Reuzpr95hGpmIaml8gh73clLk4WAjF3+5SyiLo5nlsXzvMnC0DyzaUrHocIo6i7
 nZxrx9UguzEdiUbuc5NEs4klTc+GPwMCfd+7z6vmtyw87A0sVOUgGWNGcZL60ew=
 =Wy2j
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

Fixes for PPC KVM:

- Fix guest time accounting in the host
- Fix large-page backing for radix guests on POWER9
- Fix HPT guests on POWER9 backed by 2M or 1G pages
- Compile fixes for some configs and gcc versions
2018-03-06 17:24:09 +01:00
Laurent Vivier
61bd0f66ff KVM: PPC: Book3S HV: Fix guest time accounting with VIRT_CPU_ACCOUNTING_GEN
Since commit 8b24e69fc4 ("KVM: PPC: Book3S HV: Close race with testing
for signals on guest entry"), if CONFIG_VIRT_CPU_ACCOUNTING_GEN is set, the
guest time is not accounted to guest time and user time, but instead to
system time.

This is because guest_enter()/guest_exit() are called while interrupts
are disabled and the tick counter cannot be updated between them.

To fix that, move guest_exit() after local_irq_enable(), and as
guest_enter() is called with IRQ disabled, call guest_enter_irqoff()
instead.

Fixes: 8b24e69fc4 ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-03 19:28:34 +11:00
Paul Mackerras
debd574f41 KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
The current code for initializing the VRMA (virtual real memory area)
for HPT guests requires the page size of the backing memory to be one
of 4kB, 64kB or 16MB.  With a radix host we have the possibility that
the backing memory page size can be 2MB or 1GB.  In these cases, if the
guest switches to HPT mode, KVM will not initialize the VRMA and the
guest will fail to run.

In fact it is not necessary that the VRMA page size is the same as the
backing memory page size; any VRMA page size less than or equal to the
backing memory page size is acceptable.  Therefore we now choose the
largest page size out of the set {4k, 64k, 16M} which is not larger
than the backing memory page size.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-02 15:38:24 +11:00
Paul Mackerras
c3856aeb29 KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault handler
This fixes several bugs in the radix page fault handler relating to
the way large pages in the memory backing the guest were handled.
First, the check for large pages only checked for explicit huge pages
and missed transparent huge pages.  Then the check that the addresses
(host virtual vs. guest physical) had appropriate alignment was
wrong, meaning that the code never put a large page in the partition
scoped radix tree; it was always demoted to a small page.

Fixing this exposed bugs in kvmppc_create_pte().  We were never
invalidating a 2MB PTE, which meant that if a page was initially
faulted in without write permission and the guest then attempted
to store to it, we would never update the PTE to have write permission.
If we find a valid 2MB PTE in the PMD, we need to clear it and
do a TLB invalidation before installing either the new 2MB PTE or
a pointer to a page table page.

This also corrects an assumption that get_user_pages_fast would set
the _PAGE_DIRTY bit if we are writing, which is not true.  Instead we
mark the page dirty explicitly with set_page_dirty_lock().  This
also means we don't need the dirty bit set on the host PTE when
providing write access on a read fault.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-03-02 14:05:32 +11:00
Ingo Molnar
ed7158bae4 treewide/trivial: Remove ';;$' typo noise
On lkml suggestions were made to split up such trivial typo fixes into per subsystem
patches:

  --- a/arch/x86/boot/compressed/eboot.c
  +++ b/arch/x86/boot/compressed/eboot.c
  @@ -439,7 +439,7 @@ setup_uga32(void **uga_handle, unsigned long size, u32 *width, u32 *height)
          struct efi_uga_draw_protocol *uga = NULL, *first_uga;
          efi_guid_t uga_proto = EFI_UGA_PROTOCOL_GUID;
          unsigned long nr_ugas;
  -       u32 *handles = (u32 *)uga_handle;;
  +       u32 *handles = (u32 *)uga_handle;
          efi_status_t status = EFI_INVALID_PARAMETER;
          int i;

This patch is the result of the following script:

  $ sed -i 's/;;$/;/g' $(git grep -E ';;$'  | grep "\.[ch]:"  | grep -vwE 'for|ia64' | cut -d: -f1 | sort | uniq)

... followed by manual review to make sure it's all good.

Splitting this up is just crazy talk, let's get over with this and just do it.

Reported-by: Pavel Machek <pavel@ucw.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-22 10:59:33 +01:00
Paul Mackerras
6df3877fc9 KVM: PPC: Book3S: Fix compile error that occurs with some gcc versions
Some versions of gcc generate a warning that the variable "emulated"
may be used uninitialized in function kvmppc_handle_load128_by2x64().
It would be used uninitialized if kvmppc_handle_load128_by2x64 was
ever called with vcpu->arch.mmio_vmx_copy_nums == 0, but neither of
the callers ever do that, so there is no actual bug.  When gcc
generates a warning, it causes the build to fail because arch/powerpc
is compiled with -Werror.

This silences the warning by initializing "emulated" to EMULATE_DONE.

Fixes: 09f984961c ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-13 15:45:21 +11:00
Paul Mackerras
c662f77331 KVM: PPC: Fix compile error that occurs when CONFIG_ALTIVEC=n
Commit accb757d79 ("KVM: Move vcpu_load to arch-specific
kvm_arch_vcpu_ioctl_run", 2017-12-04) added a "goto out"
statement and an "out:" label to kvm_arch_vcpu_ioctl_run().
Since the only "goto out" is inside a CONFIG_VSX block,
compiling with CONFIG_VSX=n gives a warning that label "out"
is defined but not used, and because arch/powerpc is compiled
with -Werror, that becomes a compile error that makes the kernel
build fail.

Merge commit 1ab03c072f ("Merge tag 'kvm-ppc-next-4.16-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc",
2018-02-09) added a similar block of code inside a #ifdef
CONFIG_ALTIVEC, with a "goto out" statement.

In order to make the build succeed, this adds a #ifdef around the
"out:" label.  This is a minimal, ugly fix, to be replaced later
by a refactoring of the code.  Since CONFIG_VSX depends on
CONFIG_ALTIVEC, it is sufficient to use #ifdef CONFIG_ALTIVEC here.

Fixes: accb757d79 ("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_run")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-13 15:23:39 +11:00
Linus Torvalds
15303ba5d1 KVM changes for 4.16
ARM:
 - Include icache invalidation optimizations, improving VM startup time
 
 - Support for forwarded level-triggered interrupts, improving
   performance for timers and passthrough platform devices
 
 - A small fix for power-management notifiers, and some cosmetic changes
 
 PPC:
 - Add MMIO emulation for vector loads and stores
 
 - Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
   requiring the complex thread synchronization of older CPU versions
 
 - Improve the handling of escalation interrupts with the XIVE interrupt
   controller
 
 - Support decrement register migration
 
 - Various cleanups and bugfixes.
 
 s390:
 - Cornelia Huck passed maintainership to Janosch Frank
 
 - Exitless interrupts for emulated devices
 
 - Cleanup of cpuflag handling
 
 - kvm_stat counter improvements
 
 - VSIE improvements
 
 - mm cleanup
 
 x86:
 - Hypervisor part of SEV
 
 - UMIP, RDPID, and MSR_SMI_COUNT emulation
 
 - Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
 
 - Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
   features
 
 - Show vcpu id in its anonymous inode name
 
 - Many fixes and cleanups
 
 - Per-VCPU MSR bitmaps (already merged through x86/pti branch)
 
 - Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJafvMtAAoJEED/6hsPKofo6YcH/Rzf2RmshrWaC3q82yfIV0Qz
 Z8N8yJHSaSdc3Jo6cmiVj0zelwAxdQcyjwlT7vxt5SL2yML+/Q0st9Hc3EgGGXPm
 Il99eJEl+2MYpZgYZqV8ff3mHS5s5Jms+7BITAeh6Rgt+DyNbykEAvzt+MCHK9cP
 xtsIZQlvRF7HIrpOlaRzOPp3sK2/MDZJ1RBE7wYItK3CUAmsHim/LVYKzZkRTij3
 /9b4LP1yMMbziG+Yxt1o682EwJB5YIat6fmDG9uFeEVI5rWWN7WFubqs8gCjYy/p
 FX+BjpOdgTRnX+1m9GIj0Jlc/HKMXryDfSZS07Zy4FbGEwSiI5SfKECub4mDhuE=
 =C/uD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:

   - icache invalidation optimizations, improving VM startup time

   - support for forwarded level-triggered interrupts, improving
     performance for timers and passthrough platform devices

   - a small fix for power-management notifiers, and some cosmetic
     changes

  PPC:

   - add MMIO emulation for vector loads and stores

   - allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
     requiring the complex thread synchronization of older CPU versions

   - improve the handling of escalation interrupts with the XIVE
     interrupt controller

   - support decrement register migration

   - various cleanups and bugfixes.

  s390:

   - Cornelia Huck passed maintainership to Janosch Frank

   - exitless interrupts for emulated devices

   - cleanup of cpuflag handling

   - kvm_stat counter improvements

   - VSIE improvements

   - mm cleanup

  x86:

   - hypervisor part of SEV

   - UMIP, RDPID, and MSR_SMI_COUNT emulation

   - paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit

   - allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
     AVX512 features

   - show vcpu id in its anonymous inode name

   - many fixes and cleanups

   - per-VCPU MSR bitmaps (already merged through x86/pti branch)

   - stable KVM clock when nesting on Hyper-V (merged through
     x86/hyperv)"

* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
  KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
  KVM: PPC: Book3S HV: Branch inside feature section
  KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
  KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
  KVM: PPC: Book3S PR: Fix broken select due to misspelling
  KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
  KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
  KVM: PPC: Book3S HV: Drop locks before reading guest memory
  kvm: x86: remove efer_reload entry in kvm_vcpu_stat
  KVM: x86: AMD Processor Topology Information
  x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
  kvm: embed vcpu id to dentry of vcpu anon inode
  kvm: Map PFN-type memory regions as writable (if possible)
  x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
  KVM: arm/arm64: Fixup userspace irqchip static key optimization
  KVM: arm/arm64: Fix userspace_irqchip_in_use counting
  KVM: arm/arm64: Fix incorrect timer_is_pending logic
  MAINTAINERS: update KVM/s390 maintainers
  MAINTAINERS: add Halil as additional vfio-ccw maintainer
  MAINTAINERS: add David as a reviewer for KVM/s390
  ...
2018-02-10 13:16:35 -08:00
Radim Krčmář
1ab03c072f Second PPC KVM update for 4.16
Seven fixes that are either trivial or that address bugs that people
 are actually hitting.  The main ones are:
 
 - Drop spinlocks before reading guest memory
 
 - Fix a bug causing corruption of VCPU state in PR KVM with preemption
   enabled
 
 - Make HPT resizing work on POWER9
 
 - Add MMIO emulation for vector loads and stores, because guests now
   use these instructions in memcpy and similar routines.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJafWn0AAoJEJ2a6ncsY3GfaMsIANF0hQD8SS78WNKnoy0vnZ/X
 PUXdjwHEsfkg5KdQ7o0oaa2BJHHqO3vozddmMiG14r2L1mNCHJpnVZCVV0GaEJcZ
 eU8++OPK6yrsPNNpAjnrtQ0Vk4LwzoT0bftEjS3TtLt1s2uSo+R1+HLmxbxGhQUX
 bZngo9wQ3cjUfAXLrPtAVhE5wTmgVOiufVRyfRsBRdFzRsAWqjY4hBtJAfwdff4r
 AA5H0RCrXO6e1feKr5ElU8KzX6b7IjH9Xu868oJ1r16zZfE05PBl1X5n4XG7XDm7
 xWvs8uLAB7iRv2o/ecFznYJ+Dz1NCBVzD0RmAUTqPCcVKDrxixaTkqMPFW97IAA=
 =HOJR
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

Second PPC KVM update for 4.16

Seven fixes that are either trivial or that address bugs that people
are actually hitting.  The main ones are:

- Drop spinlocks before reading guest memory

- Fix a bug causing corruption of VCPU state in PR KVM with preemption
  enabled

- Make HPT resizing work on POWER9

- Add MMIO emulation for vector loads and stores, because guests now
  use these instructions in memcpy and similar routines.
2018-02-09 22:03:06 +01:00
Jose Ricardo Ziviani
09f984961c KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
This patch provides the MMIO load/store vector indexed
X-Form emulation.

Instructions implemented:
lvx: the quadword in storage addressed by the result of EA &
0xffff_ffff_ffff_fff0 is loaded into VRT.

stvx: the contents of VRS are stored into the quadword in storage
addressed by the result of EA & 0xffff_ffff_ffff_fff0.

Reported-by: Gopesh Kumar Chaudhary <gopchaud@in.ibm.com>
Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com>
Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-09 16:51:51 +11:00
Alexander Graf
d20fe50a7b KVM: PPC: Book3S HV: Branch inside feature section
We ended up with code that did a conditional branch inside a feature
section to code outside of the feature section. Depending on how the
object file gets organized, that might mean we exceed the 14bit
relocation limit for conditional branches:

  arch/powerpc/kvm/built-in.o:arch/powerpc/kvm/book3s_hv_rmhandlers.S:416:(__ftr_alt_97+0x8): relocation truncated to fit: R_PPC64_REL14 against `.text'+1ca4

So instead of doing a conditional branch outside of the feature section,
let's just jump at the end of the same, making the branch very short.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-09 15:39:41 +11:00
David Gibson
790a9df5fb KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
This adds code to enable the HPT resizing code to work on POWER9,
which uses a slightly modified HPT entry format compared to POWER8.
On POWER9, we convert HPTEs read from the HPT from the new format to
the old format so that the rest of the HPT resizing code can work as
before.  HPTEs written to the new HPT are converted to the new format
as the last step before writing them into the new HPT.

This takes out the checks added by commit bcd3bb63db ("KVM: PPC:
Book3S HV: Disable HPT resizing on POWER9 for now", 2017-02-18),
now that HPT resizing works on POWER9.

On POWER9, when we pivot to the new HPT, we now call
kvmppc_setup_partition_table() to update the partition table in order
to make the hardware use the new HPT.

[paulus@ozlabs.org - added kvmppc_setup_partition_table() call,
 wrote commit message.]

Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-09 15:36:36 +11:00
Paul Mackerras
05f2bb0313 KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
This fixes the computation of the HPTE index to use when the HPT
resizing code encounters a bolted HPTE which is stored in its
secondary HPTE group.  The code inverts the HPTE group number, which
is correct, but doesn't then mask it with new_hash_mask.  As a result,
new_pteg will be effectively negative, resulting in new_hptep
pointing before the new HPT, which will corrupt memory.

In addition, this removes two BUG_ON statements.  The condition that
the BUG_ONs were testing -- that we have computed the hash value
incorrectly -- has never been observed in testing, and if it did
occur, would only affect the guest, not the host.  Given that
BUG_ON should only be used in conditions where the kernel (i.e.
the host kernel, in this case) can't possibly continue execution,
it is not appropriate here.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-09 15:35:24 +11:00
Ulf Magnusson
57ea5f161a KVM: PPC: Book3S PR: Fix broken select due to misspelling
Commit 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code
on non-pseries platforms") added a reference to the globally undefined
symbol PPC_SERIES. Looking at the rest of the commit, PPC_PSERIES was
probably intended.

Change PPC_SERIES to PPC_PSERIES.

Discovered with the
https://github.com/ulfalizer/Kconfiglib/blob/master/examples/list_undefined.py
script.

Fixes: 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Ulf Magnusson <ulfalizer@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-08 16:42:16 +11:00
Linus Torvalds
03f51d4efa powerpc updates for 4.16
Highlights:
 
  - Enable support for memory protection keys aka "pkeys" on Power7/8/9 when
    using the hash table MMU.
 
  - Extend our interrupt soft masking to support masking PMU interrupts as well
    as "normal" interrupts, and then use that to implement local_t for a ~4x
    speedup vs the current atomics-based implementation.
 
  - A new driver "ocxl" for "Open Coherent Accelerator Processor Interface
    (OpenCAPI)" devices.
 
  - Support for new device tree properties on PowerVM to describe hotpluggable
    memory and devices.
 
  - Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit VDSO.
 
  - Freescale updates from Scott:
      "Contains fixes for CPM GPIO and an FSL PCI erratum workaround, plus a
       minor cleanup patch."
 
 As well as quite a lot of other changes all over the place, and small fixes and
 cleanups as always.
 
 Thanks to:
   Alan Modra, Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andreas
   Schwab, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman
   Khandual, Anton Blanchard, Arnd Bergmann, Balbir Singh, Benjamin
   Herrenschmidt, Bhaktipriya Shridhar, Bryant G. Ly, Cédric Le Goater,
   Christophe Leroy, Christophe Lombard, Cyril Bur, David Gibson, Desnes A. Nunes
   do Rosario, Dmitry Torokhov, Frederic Barrat, Geert Uytterhoeven, Guilherme G.
   Piccoli, Gustavo A. R. Silva, Gustavo Romero, Ivan Mikhaylov, Joakim
   Tjernlund, Joe Perches, Josh Poimboeuf, Juan J. Alvarez, Julia Cartwright,
   Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre,
   Michael Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot,
   Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud, Ram Pai,
   Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee, Simon Guo, Stewart
   Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann, Vaibhav Jain, Vasyl
   Gomonovych.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJadF6wExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYA2
 nBAAnguCEyAIYpc+ffE3WU9xJEWxa6bKuVufHcUFVntGiGD+igmMS+SHp4ay3Aos
 HcA4WFrpzNb2KZ++kmFWtAKWnMfCiW9xuYJNicjr7X5ZiVBEhLWN/mQCwBKs3p6L
 5+HhvytcdkKVbEcyVjEGvRL40AyxXNOI02o6Co9X8vanHsmWB4q0eWe4PHstZqlg
 6K6kazMp+NTvEFYwKNXDOvuHouKSL57l14SLROH7CpJkNTOQ9s+W59/LmnuCjRlu
 o70b7iWOAEbF9tvMma1ksDZVNj7mSyaymLYCyOXu4CkuuleJacZYJ9oQGNddoIbC
 wk7l93vPT/yze7DYg8x3uXpKcaDEvEepPuQ/ubz+UXFQWuJtl5ej6Cv+0eOmyZIs
 +bjWhGHKdNttnsiPlTRCX/gWD13RE1dB6xXJlfOJ7Oz9OnXXK8ZKc1NTREbQXRWM
 8tClAwf9upWpm86GHPVnyrgYbgZo5b1os4SoS8e3kESzakrQVQP7J376u2DtccRq
 2AGqjJ+tl5tYPnhm8zG1cNrpqHHpgkNGqLS7DvWRg3EPmEKVQcltN1b/0aKaAjHA
 aTRofjrVo+jJ4MX1uyEo59yNCEQPfjkmHRQdLwm+xjWTzEPfIMzpWyXm14tawDQf
 OjcAe90W/qQ18brw4z+2BI14J76XziOSX/QcunOn1u/sqaM=
 =3rYn
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:

   - Enable support for memory protection keys aka "pkeys" on Power7/8/9
     when using the hash table MMU.

   - Extend our interrupt soft masking to support masking PMU interrupts
     as well as "normal" interrupts, and then use that to implement
     local_t for a ~4x speedup vs the current atomics-based
     implementation.

   - A new driver "ocxl" for "Open Coherent Accelerator Processor
     Interface (OpenCAPI)" devices.

   - Support for new device tree properties on PowerVM to describe
     hotpluggable memory and devices.

   - Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit
     VDSO.

   - Freescale updates from Scott: fixes for CPM GPIO and an FSL PCI
     erratum workaround, plus a minor cleanup patch.

  As well as quite a lot of other changes all over the place, and small
  fixes and cleanups as always.

  Thanks to: Alan Modra, Alastair D'Silva, Alexey Kardashevskiy,
  Alistair Popple, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V,
  Anju T Sudhakar, Anshuman Khandual, Anton Blanchard, Arnd Bergmann,
  Balbir Singh, Benjamin Herrenschmidt, Bhaktipriya Shridhar, Bryant G.
  Ly, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Cyril Bur,
  David Gibson, Desnes A. Nunes do Rosario, Dmitry Torokhov, Frederic
  Barrat, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo A. R. Silva,
  Gustavo Romero, Ivan Mikhaylov, Joakim Tjernlund, Joe Perches, Josh
  Poimboeuf, Juan J. Alvarez, Julia Cartwright, Kamalesh Babulal,
  Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael
  Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot,
  Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud,
  Ram Pai, Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee,
  Simon Guo, Stewart Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann,
  Vaibhav Jain, Vasyl Gomonovych"

* tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (199 commits)
  powerpc/mm/radix: Fix build error when RADIX_MMU=n
  macintosh/ams-input: Use true and false for boolean values
  macintosh: change some data types from int to bool
  powerpc/watchdog: Print the NIP in soft_nmi_interrupt()
  powerpc/watchdog: regs can't be null in soft_nmi_interrupt()
  powerpc/watchdog: Tweak watchdog printks
  powerpc/cell: Remove axonram driver
  rtc-opal: Fix handling of firmware error codes, prevent busy loops
  powerpc/mpc52xx_gpt: make use of raw_spinlock variants
  macintosh/adb: Properly mark continued kernel messages
  powerpc/pseries: Fix cpu hotplug crash with memoryless nodes
  powerpc/numa: Ensure nodes initialized for hotplug
  powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
  powerpc/kernel: Block interrupts when updating TIDR
  powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
  powerpc/mm/nohash: do not flush the entire mm when range is a single page
  powerpc/pseries: Add Initialization of VF Bars
  powerpc/pseries/pci: Associate PEs to VFs in configure SR-IOV
  powerpc/eeh: Add EEH notify resume sysfs
  powerpc/eeh: Add EEH operations to notify resume
  ...
2018-02-02 10:01:04 -08:00
Radim Krčmář
d2b9b2079e PPC KVM update for 4.16
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
   without requiring the complex thread synchronization that earlier
   CPU versions required.
 
 - A series from Ben Herrenschmidt to improve the handling of
   escalation interrupts with the XIVE interrupt controller.
 
 - Provide for the decrementer register to be copied across on
   migration.
 
 - Various minor cleanups and bugfixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaYXViAAoJEJ2a6ncsY3GfDhgIAIDVBZH/Ftq7eJiUSxDpqyCQ
 DF/x7fNKzK/J33pu+3ntOI2gZsldExAy7vH2M27I4qLIkbI5y3vu4v8l3CDlS1LK
 9dKi72zg7baozoVF5mGUNm0B1sSvZiIQlC/kaami2aPTF1GcrJ561GthzfZwxENX
 TSLqOA4LkeUZh2tUsvbcUrPi6v+E4Em2lgacQcx2ioMblWz56sZu79VsUbSSw/a3
 P8+pIv7EbHw+TrOZMehjCbZkOdBeZ3IRLJsdlIAfe7y4vWME/5b9uVnQS/+XQj/B
 6f3rQrduGvF2P6GMjsm8gDkgE5oZ1zbKlgO4i5WApnu80MMLFlfEUN+GWuGJ95Q=
 =OjGs
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

PPC KVM update for 4.16

- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs
  without requiring the complex thread synchronization that earlier
  CPU versions required.

- A series from Ben Herrenschmidt to improve the handling of
  escalation interrupts with the XIVE interrupt controller.

- Provide for the decrementer register to be copied across on
  migration.

- Various minor cleanups and bugfixes.
2018-02-01 16:13:07 +01:00
Radim Krčmář
7bf14c28ee Merge branch 'x86/hyperv' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Topic branch for stable KVM clockource under Hyper-V.

Thanks to Christoffer Dall for resolving the ARM conflict.
2018-02-01 15:04:17 +01:00
Linus Torvalds
40b9672a2f Merge branch 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull asm/uaccess.h whack-a-mole from Al Viro:
 "It's linux/uaccess.h, damnit... Oh, well - eventually they'll stop
  cropping up..."

* 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  asm-prototypes.h: use linux/uaccess.h, not asm/uaccess.h
  riscv: use linux/uaccess.h, not asm/uaccess.h...
  ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
2018-01-31 19:18:12 -08:00
Alexander Graf
07ae5389e9 KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
When copying between the vcpu and svcpu, we may get scheduled away onto
a different host CPU which in turn means our svcpu pointer may change.

That means we need to atomically copy to and from the svcpu with preemption
disabled, so that all code around it always sees a coherent state.

Reported-by: Simon Guo <wei.guo.simon@gmail.com>
Fixes: 3d3319b45e ("KVM: PPC: Book3S: PR: Enable interrupts earlier")
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-01 13:35:33 +11:00
Paul Mackerras
36ee41d161 KVM: PPC: Book3S HV: Drop locks before reading guest memory
Running with CONFIG_DEBUG_ATOMIC_SLEEP reveals that HV KVM tries to
read guest memory, in order to emulate guest instructions, while
preempt is disabled and a vcore lock is held.  This occurs in
kvmppc_handle_exit_hv(), called from post_guest_process(), when
emulating guest doorbell instructions on POWER9 systems, and also
when checking whether we have hit a hypervisor breakpoint.
Reading guest memory can cause a page fault and thus cause the
task to sleep, so we need to avoid reading guest memory while
holding a spinlock or when preempt is disabled.

To fix this, we move the preempt_enable() in kvmppc_run_core() to
before the loop that calls post_guest_process() for each vcore that
has just run, and we drop and re-take the vcore lock around the calls
to kvmppc_emulate_debug_inst() and kvmppc_emulate_doorbell_instr().

Dropping the lock is safe with respect to the iteration over the
runnable vcpus in post_guest_process(); for_each_runnable_thread
is actually safe to use locklessly.  It is possible for a vcpu
to become runnable and add itself to the runnable_threads array
(code near the beginning of kvmppc_run_vcpu()) and then get included
in the iteration in post_guest_process despite the fact that it
has not just run.  This is benign because vcpu->arch.trap and
vcpu->arch.ceded will be zero.

Cc: stable@vger.kernel.org # v4.13+
Fixes: 579006944e ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-02-01 13:35:33 +11:00
Russell Currey
57ad583f20 powerpc: Use octal numbers for file permissions
Symbolic macros are unintuitive and hard to read, whereas octal constants
are much easier to interpret.  Replace macros for the basic permission
flags (user/group/other read/write/execute) with numeric constants
instead, across the whole powerpc tree.

Introducing a significant number of changes across the tree for no runtime
benefit isn't exactly desirable, but so long as these macros are still
used in the tree people will keep sending patches that add them.  Not only
are they hard to parse at a glance, there are multiple ways of coming to
the same value (as you can see with 0444 and 0644 in this patch) which
hurts readability.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-22 05:48:33 +11:00
Michael Ellerman
ebf0b6a8b1 Merge branch 'fixes' into next
Merge our fixes branch from the 4.15 cycle.

Unusually the fixes branch saw some significant features merged,
notably the RFI flush patches, so we want the code in next to be
tested against that, to avoid any surprises when the two are merged.

There's also some other work on the panic handling that was reverted
in fixes and we now want to do properly in next, which would conflict.

And we also fix a few other minor merge conflicts.
2018-01-21 23:21:14 +11:00
Michael Ellerman
5400fc229e Merge branch 'topic/ppc-kvm' into next
Merge the topic branch we share with kvm-ppc, this brings in two xive
commits, one from Paul to rework HMI handling, and a minor cleanup to
drop an unused flag.
2018-01-21 22:43:43 +11:00
Aneesh Kumar K.V
76b03dc07e powerpc/mm: Remove unused flag arg in global_invalidates
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-21 20:30:44 +11:00
Madhavan Srinivasan
4e26bc4a4e powerpc/64: Rename soft_enabled to irq_soft_mask
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:01 +11:00
Paul Mackerras
3214d01f13 KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
information about the underlying machine's level of vulnerability
to the recently announced vulnerabilities CVE-2017-5715,
CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
instructions to assist software to work around the vulnerabilities.

The ioctl returns two u64 words describing characteristics of the
CPU and required software behaviour respectively, plus two mask
words which indicate which bits have been filled in by the kernel,
for extensibility.  The bit definitions are the same as for the
new H_GET_CPU_CHARACTERISTICS hypercall.

There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
indicates whether the new ioctl is available.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 15:17:01 +11:00
Benjamin Herrenschmidt
9b9b13a6d1 KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded
This works on top of the single escalation support. When in single
escalation, with this change, we will keep the escalation interrupt
disabled unless the VCPU is in H_CEDE (idle). In any other case, we
know the VCPU will be rescheduled and thus there is no need to take
escalation interrupts in the host whenever a guest interrupt fires.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
35c2405efc KVM: PPC: Book3S HV: Make xive_pushed a byte, not a word
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
2662efd050 KVM: PPC: Book3S HV: Check DR not IR to chose real vs virt mode MMIOs
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
2267ea7661 KVM: PPC: Book3S HV: Don't use existing "prodded" flag for XIVE escalations
The prodded flag is only cleared at the beginning of H_CEDE,
so every time we have an escalation, we will cause the *next*
H_CEDE to return immediately.

Instead use a dedicated "irq_pending" flag to indicate that
a guest interrupt is pending for the VCPU. We don't reuse the
existing exception bitmap so as to avoid expensive atomic ops.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
bf4159da47 KVM: PPC: Book3S HV: Enable use of the new XIVE "single escalation" feature
That feature, provided by Power9 DD2.0 and later, when supported
by newer OPAL versions, allows us to sacrifice a queue (priority 7)
in favor of merging all the escalation interrupts of the queues
of a single VP into a single interrupt.

This reduces the number of host interrupts used up by KVM guests
especially when those guests use multiple priorities.

It will also enable a future change to control the masking of the
escalation interrupts more precisely to avoid spurious ones.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt
c424c10823 KVM: PPC: Book3S HV: Add more info about XIVE queues in debugfs
Add details about enabled queues and escalation interrupts.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Paul Mackerras
d27998185d Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch of the powerpc tree to get
two patches which are prerequisites for the following patch series,
plus another patch which touches both powerpc and KVM code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:09:57 +11:00
Paul Mackerras
d075745d89 KVM: PPC: Book3S HV: Improve handling of debug-trigger HMIs on POWER9
Hypervisor maintenance interrupts (HMIs) are generated by various
causes, signalled by bits in the hypervisor maintenance exception
register (HMER).  In most cases calling OPAL to handle the interrupt
is the correct thing to do, but the "debug trigger" HMIs signalled by
PPC bit 17 (bit 46) of HMER are used to invoke software workarounds
for hardware bugs, and OPAL does not have any code to handle this
cause.  The debug trigger HMI is used in POWER9 DD2.0 and DD2.1 chips
to work around a hardware bug in executing vector load instructions to
cache inhibited memory.  In POWER9 DD2.2 chips, it is generated when
conditions are detected relating to threads being in TM (transactional
memory) suspended mode when the core SMT configuration needs to be
reconfigured.

The kernel currently has code to detect the vector CI load condition,
but only when the HMI occurs in the host, not when it occurs in a
guest.  If a HMI occurs in the guest, it is always passed to OPAL, and
then we always re-sync the timebase, because the HMI cause might have
been a timebase error, for which OPAL would re-sync the timebase, thus
removing the timebase offset which KVM applied for the guest.  Since
we don't know what OPAL did, we don't know whether to subtract the
timebase offset from the timebase, so instead we re-sync the timebase.

This adds code to determine explicitly what the cause of a debug
trigger HMI will be.  This is based on a new device-tree property
under the CPU nodes called ibm,hmi-special-triggers, if it is
present, or otherwise based on the PVR (processor version register).
The handling of debug trigger HMIs is pulled out into a separate
function which can be called from the KVM guest exit code.  If this
function handles and clears the HMI, and no other HMI causes remain,
then we skip calling OPAL and we proceed to subtract the guest
timebase offset from the timebase.

The overall handling for HMIs that occur in the host (i.e. not in a
KVM guest) is largely unchanged, except that we now don't set the flag
for the vector CI load workaround on DD2.2 processors.

This also removes a BUG_ON in the KVM code.  BUG_ON is generally not
useful in KVM guest entry/exit code since it is difficult to handle
the resulting trap gracefully.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:31:25 +11:00
Paul Mackerras
00608e1f00 KVM: PPC: Book3S HV: Allow HPT and radix on the same core for POWER9 v2.2
POWER9 chip versions starting with "Nimbus" v2.2 can support running
with some threads of a core in HPT mode and others in radix mode.
This means that we don't have to prohibit independent-threads mode
when running a HPT guest on a radix host, and we don't have to do any
of the synchronization between threads that was introduced in commit
c01015091a ("KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix
hosts", 2017-10-19).

Rather than using up another CPU feature bit, we just do an
explicit test on the PVR (processor version register) at module
startup time to determine whether we have to take steps to avoid
having some threads in HPT mode and some in radix mode (so-called
"mixed mode").  We test for "Nimbus" (indicated by 0 or 1 in the top
nibble of the lower 16 bits) v2.2 or later, or "Cumulus" (indicated by
2 or 3 in that nibble) v1.1 or later.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-18 12:05:19 +11:00
Nicholas Piggin
d4748276ae powerpc/64s: Improve local TLB flush for boot and MCE on POWER9
There are several cases outside the normal address space management
where a CPU's entire local TLB is to be flushed:

  1. Booting the kernel, in case something has left stale entries in
     the TLB (e.g., kexec).

  2. Machine check, to clean corrupted TLB entries.

One other place where the TLB is flushed, is waking from deep idle
states. The flush is a side-effect of calling ->cpu_restore with the
intention of re-setting various SPRs. The flush itself is unnecessary
because in the first case, the TLB should not acquire new corrupted
TLB entries as part of sleep/wake (though they may be lost).

This type of TLB flush is coded inflexibly, several times for each CPU
type, and they have a number of problems with ISA v3.0B:

- The current radix mode of the MMU is not taken into account, it is
  always done as a hash flushn For IS=2 (LPID-matching flush from host)
  and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
  the R field does not match the current radix mode.

- ISA v3.0B hash must flush the partition and process table caches as
  well.

- ISA v3.0B radix must flush partition and process scoped translations,
  partition and process table caches, and also the page walk cache.

So consolidate the flushing code and implement it in C and inline asm
under the mm/ directory with the rest of the flush code. Add ISA v3.0B
cases for radix and hash, and use the radix flush in radix environment.

Provide a way for IS=2 (LPID flush) to specify the radix mode of the
partition. Have KVM pass in the radix mode of the guest.

Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
and move it later in the boot process after, the MMU registers are set
up and before relocation is first turned on.

The TLB flush is no longer called when restoring from deep idle states.
This was not be done as a separate step because booting secondaries
uses the same cpu_restore as idle restore, which needs the TLB flush.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:40:31 +11:00
Paul Mackerras
6964e6a4e4 KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded
This moves the code that loads and unloads the guest SLB values so that
it is done while the guest LPCR value is loaded in the LPCR register.
The reason for doing this is that on POWER9, the behaviour of the
slbmte instruction depends on the LPCR[UPRT] bit.  If UPRT is 1, as
it is for a radix host (or guest), the SLB index is truncated to
2 bits.  This means that for a HPT guest on a radix host, the SLB
was not being loaded correctly, causing the guest to crash.

The SLB is now loaded much later in the guest entry path, after the
LPCR is loaded, which for a secondary thread is after it sees that
the primary thread has switched the MMU to the guest.  The loop that
waits for the primary thread has a branch out to the exit code that
is taken if it sees that other threads have commenced exiting the
guest.  Since we have now not loaded the SLB at this point, we make
this path branch to a new label 'guest_bypass' and we move the SLB
unload code to before this label.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-17 11:19:02 +11:00
Paul Mackerras
43ff3f6523 KVM: PPC: Book3S HV: Make sure we don't re-enter guest without XIVE loaded
This fixes a bug where it is possible to enter a guest on a POWER9
system without having the XIVE (interrupt controller) context loaded.
This can happen because we unload the XIVE context from the CPU
before doing the real-mode handling for machine checks.  After the
real-mode handler runs, it is possible that we re-enter the guest
via a fast path which does not load the XIVE context.

To fix this, we move the unloading of the XIVE context to come after
the real-mode machine check handler is called.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-17 11:18:48 +11:00
Paul Mackerras
5855564c8a KVM: PPC: Book3S HV: Enable migration of decrementer register
This adds a register identifier for use with the one_reg interface
to allow the decrementer expiry time to be read and written by
userspace.  The decrementer expiry time is in guest timebase units
and is equal to the sum of the decrementer and the guest timebase.
(The expiry time is used rather than the decrementer value itself
because the expiry time is not constantly changing, though the
decrementer value is, while the guest vcpu is not running.)

Without this, a guest vcpu migrated to a new host will see its
decrementer set to some random value.  On POWER8 and earlier, the
decrementer is 32 bits wide and counts down at 512MHz, so the
guest vcpu will potentially see no decrementer interrupts for up
to about 4 seconds, which will lead to a stall.  With POWER9, the
decrementer is now 56 bits side, so the stall can be much longer
(up to 2.23 years) and more noticeable.

To help work around the problem in cases where userspace has not been
updated to migrate the decrementer expiry time, we now set the
default decrementer expiry at vcpu creation time to the current time
rather than the maximum possible value.  This should mean an
immediate decrementer interrupt when a migrated vcpu starts
running.  In cases where the decrementer is 32 bits wide and more
than 4 seconds elapse between the creation of the vcpu and when it
first runs, the decrementer would have wrapped around to positive
values and there may still be a stall - but this is no worse than
the current situation.  In the large-decrementer case, we are sure
to get an immediate decrementer interrupt (assuming the time from
vcpu creation to first run is less than 2.23 years) and we thus
avoid a very long stall.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-16 11:54:45 +11:00
Linus Torvalds
6bb821193b powerpc fixes for 4.15 #7
One fix for an oops at boot if we take a hotplug interrupt before we are ready
 to handle it.
 
 The bulk is patches to implement mitigation for Meltdown, see the change logs
 for more details.
 
 Thanks to:
   Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon Masters, Jose Ricardo
   Ziviani, David Gibson.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaWXZ+ExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBe
 ew/+KMp80VAIGSyFeYHLg8oClnQjKcEDMzlNoHo8WU6NwwNzk2XXBXNaGT7Osg7B
 Ny69R3/VRrGmi/2uule9OzF2eZt8BUmRn3cHDcUB5OwFjMysv4M2jc+OK0STQxLU
 F769rX+ZlkyAvQn4UxtNCUob+JDfVbE5Lurrx475ZGL1YHk6QqBGhlN/niXKtkBz
 upeBEpu4JgwsLLD9H7c4QQha9PkhxY51jtvxL24cgwDuKJKhTdxLahPlA3QmLafu
 FJxR9R0YzTj9Ivuae5yM7xRKr5wROTPtTRbG9sN7XOgFNrQ+LSgemXGrlV/fzdhV
 6iPn2hHs4bu45SOIM+Hhf7OwHTtnMetXmKo6NldlKHVCw1HwBLigRohnOTPjiHfZ
 P6brR4T/cglakBiE29+8T0UFqOPizEJiQRHnsoWzwIa/aTqCGEE3m3rgTxU6ovt5
 3JggC51ZRDFsjGvj+JFnxGxCjYC4tDbfBI28M+GuWZZPBKLkFnEk8PlSzYyHPW2p
 5uVA/UN4KHVGYUfJVUC+LMP+ZIOSEyEISKk7RuK0C0mwHSXcvxbeEefXbc18Xjai
 J/pl/c9/4CRnEbdPm305xBxKKmU6gWxWRK3KbRFsoTI/wp+Ryx5TjPpgAa5GnaLv
 nrtrfKs8skWfGMoDWH7+KILt0fgRR2x+91GTSxc5aaHtYBI=
 =UZHv
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for an oops at boot if we take a hotplug interrupt before we
  are ready to handle it.

  The bulk is patches to implement mitigation for Meltdown, see the
  change logs for more details.

  Thanks to: Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon
  Masters, Jose Ricardo Ziviani, David Gibson"

* tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: Check device-tree for RFI flush settings
  powerpc/pseries: Query hypervisor for RFI flush settings
  powerpc/64s: Support disabling RFI flush with no_rfi_flush and nopti
  powerpc/64s: Add support for RFI flush of L1-D cache
  powerpc/64s: Convert slb_miss_common to use RFI_TO_USER/KERNEL
  powerpc/64: Convert fast_exception_return to use RFI_TO_USER/KERNEL
  powerpc/64: Convert the syscall exit path to use RFI_TO_USER/KERNEL
  powerpc/64s: Simple RFI macro conversions
  powerpc/64: Add macros for annotating the destination of rfid/hrfid
  powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
  powerpc/pseries: Make RAS IRQ explicitly dependent on DLPAR WQ
2018-01-14 15:03:17 -08:00
Paolo Bonzini
0217690f88 PPC KVM fixes for 4.15
Four commits here, including two that were tagged but never merged.
 Three of them are for the HPT resizing code; two of those fix a
 user-triggerable use-after-free in the host, and one that fixes
 stale TLB entries in the guest.  The remaining commit fixes a bug
 causing PR KVM guests under PowerVM to fail to start.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaVfPgAAoJEJ2a6ncsY3GfA10IANZMkwtIpqxGlsAeXKr5bWdl
 iXYD9ymb2/FOHBbg6v8Eh6Gb1ycjzXpXqn74/Y9TE4Ort7mdiH+W6kXYEsMqL8yg
 7Uwnj8DuWFuFxX0x0V4SJQzgdCnOefVcfoo/RnLUzmLsW0Vqtr3A1djM5iHlxFvv
 ntkNtGYPOoaHl6rjtfHTDfLWN/DzEJbaIU/0O1LIkBxPG4STzSXErAucLL46Pa/X
 NuPO2HfpxQiacHVG62iy89eJeAcraEAXnH5e6eVPRQQqh3DSIERMU6n6jXyZeMU5
 NWX8Qme3VGBpiJOiCGMvMrnJmQmMTSWTtkGljyaFy+vZWMqGZ6xJ3wIP+5t9d+Q=
 =dw6K
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master

PPC KVM fixes for 4.15

Four commits here, including two that were tagged but never merged.
Three of them are for the HPT resizing code; two of those fix a
user-triggerable use-after-free in the host, and one that fixes
stale TLB entries in the guest.  The remaining commit fixes a bug
causing PR KVM guests under PowerVM to fail to start.
2018-01-11 14:07:27 +01:00
Markus Elfring
1627301020 KVM: PPC: Use seq_puts() in kvmppc_exit_timing_show()
A headline should be quickly put into a sequence. Thus use the
function "seq_puts" instead of "seq_printf" for this purpose.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-11 20:36:06 +11:00
Alexander Graf
81ceca05a4 KVM: PPC: Book3S HV: Remove vcpu->arch.dec usage
On Book3S in HV mode, we don't use the vcpu->arch.dec field at all.
Instead, all logic is built around vcpu->arch.dec_expires.

So let's remove the one remaining piece of code that was setting it.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-11 20:36:06 +11:00
David Gibson
ecba8297aa KVM: PPC: Book3S HV: Always flush TLB in kvmppc_alloc_reset_hpt()
The KVM_PPC_ALLOCATE_HTAB ioctl(), implemented by kvmppc_alloc_reset_hpt()
is supposed to completely clear and reset a guest's Hashed Page Table (HPT)
allocating or re-allocating it if necessary.

In the case where an HPT of the right size already exists and it just
zeroes it, it forces a TLB flush on all guest CPUs, to remove any stale TLB
entries loaded from the old HPT.

However, that situation can arise when the HPT is resizing as well - or
even when switching from an RPT to HPT - so those cases need a TLB flush as
well.

So, move the TLB flush to trigger in all cases except for errors.

Cc: stable@vger.kernel.org # v4.10+
Fixes: f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-10 20:45:41 +11:00
Alexey Kardashevskiy
6c7d47c33e KVM: PPC: Book3S PR: Fix WIMG handling under pHyp
Commit 96df226 ("KVM: PPC: Book3S PR: Preserve storage control bits")
added code to preserve WIMG bits but it missed 2 special cases:
- a magic page in kvmppc_mmu_book3s_64_xlate() and
- guest real mode in kvmppc_handle_pagefault().

For these ptes, WIMG was 0 and pHyp failed on these causing a guest to
stop in the very beginning at NIP=0x100 (due to bd9166ffe "KVM: PPC:
Book3S PR: Exit KVM on failed mapping").

According to LoPAPR v1.1 14.5.4.1.2 H_ENTER:

 The hypervisor checks that the WIMG bits within the PTE are appropriate
 for the physical page number else H_Parameter return. (For System Memory
 pages WIMG=0010, or, 1110 if the SAO option is enabled, and for IO pages
 WIMG=01**.)

This hence initializes WIMG to non-zero value HPTE_R_M (0x10), as expected
by pHyp.

[paulus@ozlabs.org - fix compile for 32-bit]

Cc: stable@vger.kernel.org # v4.11+
Fixes: 96df226 "KVM: PPC: Book3S PR: Preserve storage control bits"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Ruediger Oertel <ro@suse.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-10 20:45:00 +11:00
Nicholas Piggin
222f20f140 powerpc/64s: Simple RFI macro conversions
This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:30 +11:00
Laurent Vivier
7333b5aca4 KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()
When we migrate a VM from a POWER8 host (XICS) to a POWER9 host
(XICS-on-XIVE), we have an error:

qemu-kvm: Unable to restore KVM interrupt controller state \
          (0xff000000) for CPU 0: Invalid argument

This is because kvmppc_xics_set_icp() checks the new state
is internaly consistent, and especially:

...
   1129         if (xisr == 0) {
   1130                 if (pending_pri != 0xff)
   1131                         return -EINVAL;
...

On the other side, kvmppc_xive_get_icp() doesn't set
neither the pending_pri value, nor the xisr value (set to 0)
(and kvmppc_xive_set_icp() ignores the pending_pri value)

As xisr is 0, pending_pri must be set to 0xff.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 15:36:24 +11:00
Cédric Le Goater
dc1c4165d1 KVM: PPC: Book3S: fix XIVE migration of pending interrupts
When restoring a pending interrupt, we are setting the Q bit to force
a retrigger in xive_finish_unmask(). But we also need to force an EOI
in this case to reach the same initial state : P=1, Q=0.

This can be done by not setting 'old_p' for pending interrupts which
will inform xive_finish_unmask() that an EOI needs to be sent.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 15:34:02 +11:00
Paolo Bonzini
5cb0944c0c KVM: introduce kvm_arch_vcpu_async_ioctl
After the vcpu_load/vcpu_put pushdown, the handling of asynchronous VCPU
ioctl is already much clearer in that it is obvious that they bypass
vcpu_load and vcpu_put.

However, it is still not perfect in that the different state of the VCPU
mutex is still hidden in the caller.  Separate those ioctls into a new
function kvm_arch_vcpu_async_ioctl that returns -ENOIOCTLCMD for more
"traditional" synchronous ioctls.

Cc: James Hogan <jhogan@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:59 +01:00
Christoffer Dall
9b062471e5 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl
Move the calls to vcpu_load() and vcpu_put() in to the architecture
specific implementations of kvm_arch_vcpu_ioctl() which dispatches
further architecture-specific ioctls on to other functions.

Some architectures support asynchronous vcpu ioctls which cannot call
vcpu_load() or take the vcpu->mutex, because that would prevent
concurrent execution with a running VCPU, which is the intended purpose
of these ioctls, for example because they inject interrupts.

We repeat the separate checks for these specifics in the architecture
code for MIPS, S390 and PPC, and avoid taking the vcpu->mutex and
calling vcpu_load for these ioctls.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:58 +01:00
Christoffer Dall
66b5656222 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_guest_debug
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_guest_debug().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:56 +01:00
Christoffer Dall
1da5b61dac KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_translate
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_translate().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:55 +01:00
Christoffer Dall
b4ef9d4e8c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:53 +01:00
Christoffer Dall
bcdec41cef KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_sregs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:52 +01:00
Christoffer Dall
875656fe0c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_regs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:52 +01:00
Christoffer Dall
1fc9b76b3d KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_regs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:51 +01:00
Christoffer Dall
accb757d79 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_run
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_run().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 parts
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[Rebased. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:50 +01:00
Serhii Popovych
4ed11aeefd KVM: PPC: Book3S HV: Fix use after free in case of multiple resize requests
When serving multiple resize requests following could happen:

    CPU0                                    CPU1
    ----                                    ----
    kvm_vm_ioctl_resize_hpt_prepare(1);
      -> schedule_work()
                                            /* system_rq might be busy: delay */
    kvm_vm_ioctl_resize_hpt_prepare(2);
      mutex_lock();
      if (resize) {
         ...
         release_hpt_resize();
      }
      ...                                   resize_hpt_prepare_work()
      -> schedule_work()                    {
      mutex_unlock()                           /* resize->kvm could be wrong */
                                               struct kvm *kvm = resize->kvm;

                                               mutex_lock(&kvm->lock);   <<<< UAF
                                               ...
                                            }

i.e. a second resize request with different order could be started by
kvm_vm_ioctl_resize_hpt_prepare(), causing the previous request to be
free()d when there's still an active worker thread which will try to
access it.  This leads to a use after free in point marked with UAF on
the diagram above.

To prevent this from happening, instead of unconditionally releasing a
pre-existing resize structure from the prepare ioctl(), we check if
the existing structure has an in-progress worker.  We do that by
checking if the resize->error == -EBUSY, which is safe because the
resize->error field is protected by the kvm->lock.  If there is an
active worker, instead of releasing, we mark the structure as stale by
unlinking it from kvm_struct.

In the worker thread we check for a stale structure (with kvm->lock
held), and in that case abort, releasing the stale structure ourself.
We make the check both before and the actual allocation.  Strictly,
only the check afterwards is needed, the check before is an
optimization: if the structure happens to become stale before the
worker thread is dispatched, rather than during the allocation, it
means we can avoid allocating then immediately freeing a potentially
substantial amount of memory.

This fixes following or similar host kernel crash message:

[  635.277361] Unable to handle kernel paging request for data at address 0x00000000
[  635.277438] Faulting instruction address: 0xc00000000052f568
[  635.277446] Oops: Kernel access of bad area, sig: 11 [#1]
[  635.277451] SMP NR_CPUS=2048 NUMA PowerNV
[  635.277470] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nfsv3 nfs_acl nfs
lockd grace fscache kvm_hv kvm rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ext4 ib_srp scsi_transport_srp
ib_ipoib mbcache jbd2 rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ocrdma(T)
ib_core ses enclosure scsi_transport_sas sg shpchp leds_powernv ibmpowernv i2c_opal
i2c_core powernv_rng ipmi_powernv ipmi_devintf ipmi_msghandler ip_tables xfs
libcrc32c sr_mod sd_mod cdrom lpfc nvme_fc(T) nvme_fabrics nvme_core ipr nvmet_fc(T)
tg3 nvmet libata be2net crc_t10dif crct10dif_generic scsi_transport_fc ptp scsi_tgt
pps_core crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[  635.278687] CPU: 40 PID: 749 Comm: kworker/40:1 Tainted: G
------------ T 3.10.0.bz1510771+ #1
[  635.278782] Workqueue: events resize_hpt_prepare_work [kvm_hv]
[  635.278851] task: c0000007e6840000 ti: c0000007e9180000 task.ti: c0000007e9180000
[  635.278919] NIP: c00000000052f568 LR: c0000000009ea310 CTR: c0000000009ea4f0
[  635.278988] REGS: c0000007e91837f0 TRAP: 0300   Tainted: G
------------ T  (3.10.0.bz1510771+)
[  635.279077] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002022  XER:
00000000
[  635.279248] CFAR: c000000000009368 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
GPR00: c0000000009ea310 c0000007e9183a70 c000000001250b00 c0000007e9183b10
GPR04: 0000000000000000 0000000000000000 c0000007e9183650 0000000000000000
GPR08: c0000007ffff7b80 00000000ffffffff 0000000080000028 d00000000d2529a0
GPR12: 0000000000002200 c000000007b56800 c000000000120028 c0000007f135bb40
GPR16: 0000000000000000 c000000005c1e018 c000000005c1e018 0000000000000000
GPR20: 0000000000000001 c0000000011bf778 0000000000000001 fffffffffffffef7
GPR24: 0000000000000000 c000000f1e262e50 0000000000000002 c0000007e9180000
GPR28: c000000f1e262e4c c000000f1e262e50 0000000000000000 c0000007e9183b10
[  635.280149] NIP [c00000000052f568] __list_add+0x38/0x110
[  635.280197] LR [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[  635.280253] Call Trace:
[  635.280277] [c0000007e9183af0] [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[  635.280356] [c0000007e9183b70] [c0000000009ea554] mutex_lock+0x64/0x70
[  635.280426] [c0000007e9183ba0] [d00000000d24da04]
resize_hpt_prepare_work+0xe4/0x1c0 [kvm_hv]
[  635.280507] [c0000007e9183c40] [c000000000113c0c] process_one_work+0x1dc/0x680
[  635.280587] [c0000007e9183ce0] [c000000000114250] worker_thread+0x1a0/0x520
[  635.280655] [c0000007e9183d80] [c00000000012010c] kthread+0xec/0x100
[  635.280724] [c0000007e9183e30] [c00000000000a4b8] ret_from_kernel_thread+0x5c/0xa4
[  635.280814] Instruction dump:
[  635.280880] 7c0802a6 fba1ffe8 fbc1fff0 7cbd2b78 fbe1fff8 7c9e2378 7c7f1b78
f8010010
[  635.281099] f821ff81 e8a50008 7fa52040 40de00b8 <e8be0000> 7fbd2840 40de008c
7fbff040
[  635.281324] ---[ end trace b628b73449719b9d ]---

Cc: stable@vger.kernel.org # v4.10+
Fixes: b5baa68773 ("KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation")
Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Replaced BUG_ON()s with WARN_ONs() and reworded commit message
 for clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-12-06 13:36:22 +11:00
Serhii Popovych
3073774e63 KVM: PPC: Book3S HV: Drop prepare_done from struct kvm_resize_hpt
Currently the kvm_resize_hpt structure has two fields relevant to the
state of an ongoing resize: 'prepare_done', which indicates whether
the worker thread has completed or not, and 'error' which indicates
whether it was successful or not.

Since the success/failure isn't known until completion, this is
confusingly redundant.  This patch consolidates the information into
just the 'error' value: -EBUSY indicates the worked is still in
progress, other negative values indicate (completed) failure, 0
indicates successful completion.

As a bonus this reduces size of struct kvm_resize_hpt by
__alignof__(struct kvm_hpt_info) and saves few bytes of code.

While there correct comment in struct kvm_resize_hpt which references
a non-existent semaphore (leftover from an early draft).

Assert with WARN_ON() in case of HPT allocation thread work runs more
than once for resize request or resize_hpt_allocate() returns -EBUSY
that is treated specially.

Change comparison against zero to make checkpatch.pl happy.

Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Changed BUG_ON()s to WARN_ON()s and altered commit message for
 clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-12-06 13:35:21 +11:00
Al Viro
5bb866de17 ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-04 14:43:20 -05:00
Linus Torvalds
9e0600f5cf * x86 bugfixes: APIC, nested virtualization, IOAPIC
* PPC bugfix: HPT guests on a POWER9 radix host
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJaICi1AAoJEL/70l94x66DjvEIAIML/e9YX1YrJZi0rsB9cbm0
 Le3o5b3wKxPrlZdnpOZQ2mVWubUQdiHMPGX6BkpgyiJWUchnbj5ql1gUf5S0i3jk
 TOk6nae6DU94xBuboeqZJlmx2VfPY/fqzLWsX3HFHpnzRl4XvXL5o7cWguIxVcVO
 yU6bPgbAXyXSBennLWZxC3aQ2Ojikr3uxZQpUZTAPOW5hFINpCKCpqJBMxsb67wq
 rwI0cJhRl92mHpbe8qeNJhavqY5eviy9iPUaZrOW9P4yw1uqjTAjgsUc1ydiaZSV
 rOHeKBOgVfY/KBaNJKyKySfuL1MJ+DLcQqm9RlGpKNpFIeB0vvSf0gtmmqIAXIk=
 =kh2y
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - x86 bugfixes: APIC, nested virtualization, IOAPIC

 - PPC bugfix: HPT guests on a POWER9 radix host

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: Let KVM_SET_SIGNAL_MASK work as advertised
  KVM: VMX: Fix vmx->nested freeing when no SMI handler
  KVM: VMX: Fix rflags cache during vCPU reset
  KVM: X86: Fix softlockup when get the current kvmclock
  KVM: lapic: Fixup LDR on load in x2apic
  KVM: lapic: Split out x2apic ldr calculation
  KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts
  KVM: vmx: use X86_CR4_UMIP and X86_FEATURE_UMIP
  KVM: x86: Fix CPUID function for word 6 (80000001_ECX)
  KVM: nVMX: Fix vmx_check_nested_events() return value in case an event was reinjected to L2
  KVM: x86: ioapic: Preserve read-only values in the redirection table
  KVM: x86: ioapic: Clear Remote IRR when entry is switched to edge-triggered
  KVM: x86: ioapic: Remove redundant check for Remote IRR in ioapic_set_irq
  KVM: x86: ioapic: Don't fire level irq when Remote IRR set
  KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race
  KVM: x86: inject exceptions produced by x86_decode_insn
  KVM: x86: Allow suppressing prints on RDMSR/WRMSR of unhandled MSRs
  KVM: x86: fix em_fxstor() sleeping while in atomic
  KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
  ...
2017-11-30 08:15:19 -08:00
Paolo Bonzini
a63dd7480d PPC KVM fixes for 4.15
One commit here, that fixes a couple of bugs relating to the patch
 series that enables HPT guests to run on a radix host on POWER9
 systems.  This patch series went upstream in the 4.15 merge window,
 so no stable backport is required.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaGKSnAAoJEJ2a6ncsY3GfF7MIANLLhznEMrWq8jw4g95WsJU1
 MkDGwp8kIdhIOM9HD6JRskoJZB5Mws2BWlQ5PSaVFxO6v6eUgNLaRb/UBxC1r7gU
 1f9/8corY4BNkezSdJqTL7Xgp13KjTU726OwYAqCPEyCSPEc9ciMyeIgyZuv2dPa
 Pju+u4tnA+9JJyskgNL+/ybOOZwVat91VmNUVRq29zP6+zo1tmIDxrQchy6Bqui/
 7Wg298G+yjAkJ8ktQu69ACk+0oEBGUOcLUlraqGSr9auR+b0nJ1PAGCDRaONdwgE
 +X+OE+t+UC6rU+coUXMwO+Id0X7HMdsLQd3066ODEtD55g8MIVZ126Wt8xDmj5o=
 =GSTh
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master

PPC KVM fixes for 4.15

One commit here, that fixes a couple of bugs relating to the patch
series that enables HPT guests to run on a radix host on POWER9
systems.  This patch series went upstream in the 4.15 merge window,
so no stable backport is required.
2017-11-27 17:54:13 +01:00
Jan H. Schönherr
20b7035c66 KVM: Let KVM_SET_SIGNAL_MASK work as advertised
KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".

This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.

Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-27 17:53:47 +01:00
Paul Mackerras
9aa6825bbb KVM: PPC: Book3S: Eliminate some unnecessary checks
In an excess of caution, commit 6f63e81bda ("KVM: PPC: Book3S: Add
MMIO emulation for FP and VSX instructions", 2017-02-21) included
checks for the case that vcpu->arch.mmio_vsx_copy_nums is less than
zero, even though its type is u8.  This causes a Coverity warning,
so we remove the check for < 0.  We also adjust the associated
comment to be more accurate ("4 or less" rather than "less than 4").

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:34 +11:00
Paul Mackerras
c0093f1a38 KVM: PPC: Book3S HV: Fix conditions for starting vcpu
This corrects the test that determines whether a vcpu that has just
become able to run in the guest (e.g. it has just finished handling
a hypercall or hypervisor page fault) and whose virtual core is
already running somewhere as a "piggybacked" vcore can start
immediately or not.  (A piggybacked vcore is one which is executing
along with another vcore as a result of dynamic micro-threading.)

Previously the test tried to lock the piggybacked vcore using
spin_trylock, which would always fail because the vcore was already
locked, and so the vcpu would have to wait until its vcore exited
the guest before it could enter.

In fact the vcpu can enter if its vcore is in VCORE_PIGGYBACK state
and not already exiting (or exited) the guest, so the test in
VCORE_PIGGYBACK state is basically the same as for VCORE_RUNNING
state.

Coverity detected this as a double unlock issue, which it isn't
because the spin_trylock would always fail.  This will fix the
apparent double unlock as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:22 +11:00
Paul Mackerras
4fcf361dbd KVM: PPC: Book3S HV: Remove useless statement
This removes a statement that has no effect.  It should have been
removed in commit 898b25b202 ("KVM: PPC: Book3S HV: Simplify dynamic
micro-threading code", 2017-06-22) along with the loop over the
piggy-backed virtual cores.

This issue was reported by Coverity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:14 +11:00
Paul Mackerras
117647ff93 KVM: PPC: Book3S HV: Fix typo in kvmppc_hv_get_dirty_log_radix()
This fixes a typo where the intent was to assign to 'j' in order to
skip some number of bits in the dirty bitmap for a guest.  The effect
of the typo is benign since it means we just iterate through all the
bits rather than skipping bits which we know will be zero.  This issue
was found by Coverity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:08 +11:00
Paul Mackerras
ded13fc11b KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts
This fixes two errors that prevent a guest using the HPT MMU from
successfully migrating to a POWER9 host in radix MMU mode, or resizing
its HPT when running on a radix host.

The first bug was that commit 8dc6cca556 ("KVM: PPC: Book3S HV:
Don't rely on host's page size information", 2017-09-11) missed two
uses of hpte_base_page_size(), one in the HPT rehashing code and
one in kvm_htab_write() (which is used on the destination side in
migrating a HPT guest).  Instead we use kvmppc_hpte_base_page_shift().
Having the shift count means that we can use left and right shifts
instead of multiplication and division in a few places.

Along the way, this adds a check in kvm_htab_write() to ensure that the
page size encoding in the incoming HPTEs is recognized, and if not
return an EINVAL error to userspace.

The second bug was that kvm_htab_write was performing some but not all
of the functions of kvmhv_setup_mmu(), resulting in the destination VM
being left in radix mode as far as the hardware is concerned.  The
simplest fix for now is make kvm_htab_write() call
kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does.
In future it would be better to refactor the code more extensively
to remove the duplication.

Fixes: 8dc6cca556 ("KVM: PPC: Book3S HV: Don't rely on host's page size information")
Fixes: 7a84084c60 ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9")
Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:17:54 +11:00
Kees Cook
86cb30ec07 treewide: setup_timer() -> timer_setup() (2 field)
This converts all remaining setup_timer() calls that use a nested field
to reach a struct timer_list. Coccinelle does not have an easy way to
match multiple fields, so a new script is needed to change the matches of
"&_E->_timer" into "&_E->_field1._timer" in all the rules.

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup-2fields.cocci

@fix_address_of depends@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _field1;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_field1._timer, NULL, _E);
+timer_setup(&_E->_field1._timer, NULL, 0);
|
-setup_timer(&_E->_field1._timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, NULL, 0);
|
-setup_timer(&_E._field1._timer, NULL, &_E);
+timer_setup(&_E._field1._timer, NULL, 0);
|
-setup_timer(&_E._field1._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _field1;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_field1._timer, _callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, &_callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
 _E->_field1._timer@_stl.function = _callback;
|
 _E->_field1._timer@_stl.function = &_callback;
|
 _E->_field1._timer@_stl.function = (_cast_func)_callback;
|
 _E->_field1._timer@_stl.function = (_cast_func)&_callback;
|
 _E._field1._timer@_stl.function = _callback;
|
 _E._field1._timer@_stl.function = &_callback;
|
 _E._field1._timer@_stl.function = (_cast_func)_callback;
|
 _E._field1._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _field1._timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _field1._timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _field1._timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_field1._timer, _callback, 0);
+setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._field1._timer, _callback, 0);
+setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_field1._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_field1._timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_field1._timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_field1._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_field1._timer
|
-(_cast_data)&_E
+&_E._field1._timer
|
-_E
+&_E->_field1._timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _field1;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_field1._timer, _callback, 0);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, _callback, 0L);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, _callback, 0UL);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, 0);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, 0L);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, 0UL);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_field1._timer, _callback, 0);
+timer_setup(&_field1._timer, _callback, 0);
|
-setup_timer(&_field1._timer, _callback, 0L);
+timer_setup(&_field1._timer, _callback, 0);
|
-setup_timer(&_field1._timer, _callback, 0UL);
+timer_setup(&_field1._timer, _callback, 0);
|
-setup_timer(_field1._timer, _callback, 0);
+timer_setup(_field1._timer, _callback, 0);
|
-setup_timer(_field1._timer, _callback, 0L);
+timer_setup(_field1._timer, _callback, 0);
|
-setup_timer(_field1._timer, _callback, 0UL);
+timer_setup(_field1._timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:09 -08:00
Linus Torvalds
974aa5630b First batch of KVM changes for 4.15
Common:
  - Python 3 support in kvm_stat
 
  - Accounting of slabs to kmemcg
 
 ARM:
  - Optimized arch timer handling for KVM/ARM
 
  - Improvements to the VGIC ITS code and introduction of an ITS reset
    ioctl
 
  - Unification of the 32-bit fault injection logic
 
  - More exact external abort matching logic
 
 PPC:
  - Support for running hashed page table (HPT) MMU mode on a host that
    is using the radix MMU mode;  single threaded mode on POWER 9 is
    added as a pre-requisite
 
  - Resolution of merge conflicts with the last second 4.14 HPT fixes
 
  - Fixes and cleanups
 
 s390:
  - Some initial preparation patches for exitless interrupts and crypto
 
  - New capability for AIS migration
 
  - Fixes
 
 x86:
  - Improved emulation of LAPIC timer mode changes, MCi_STATUS MSRs, and
    after-reset state
 
  - Refined dependencies for VMX features
 
  - Fixes for nested SMI injection
 
  - A lot of cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJaDayXAAoJEED/6hsPKofo/3UH/3HvlcHt+ADTkCU1/iiKAs+i
 0zngIOXIxgHDnV0ww6bV+Znww0BzTYgKCAXX76z603jdpDwG/pzQQcbLDF5ZoJnD
 sQtF10gZinWaRsHlfbLqjrHGL2pGDHO1UKBKLJ0bAIyORPZBxs7i+VmrY/blnr9c
 0wsybJ8RbvwAxjsDL5jeX/z4NehPupmKUc4Lf0eZdSHwVOf9sjn+MP6jJ0r2JcIb
 D+zddPBiLStzN97t4gZpQsrlj3LKrDS+6hY+1TjSvlh+yHKFVFh58VhLm4DuDeb5
 bYOAlWJ/gAWEzfvr5Ld+Nd7SqWWn/14logPkQ4gcU4BI/neAOzk4c6hJfCHl1nk=
 =593n
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "First batch of KVM changes for 4.15

  Common:
   - Python 3 support in kvm_stat
   - Accounting of slabs to kmemcg

  ARM:
   - Optimized arch timer handling for KVM/ARM
   - Improvements to the VGIC ITS code and introduction of an ITS reset
     ioctl
   - Unification of the 32-bit fault injection logic
   - More exact external abort matching logic

  PPC:
   - Support for running hashed page table (HPT) MMU mode on a host that
     is using the radix MMU mode; single threaded mode on POWER 9 is
     added as a pre-requisite
   - Resolution of merge conflicts with the last second 4.14 HPT fixes
   - Fixes and cleanups

  s390:
   - Some initial preparation patches for exitless interrupts and crypto
   - New capability for AIS migration
   - Fixes

  x86:
   - Improved emulation of LAPIC timer mode changes, MCi_STATUS MSRs,
     and after-reset state
   - Refined dependencies for VMX features
   - Fixes for nested SMI injection
   - A lot of cleanups"

* tag 'kvm-4.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (89 commits)
  KVM: s390: provide a capability for AIS state migration
  KVM: s390: clear_io_irq() requests are not expected for adapter interrupts
  KVM: s390: abstract conversion between isc and enum irq_types
  KVM: s390: vsie: use common code functions for pinning
  KVM: s390: SIE considerations for AP Queue virtualization
  KVM: s390: document memory ordering for kvm_s390_vcpu_wakeup
  KVM: PPC: Book3S HV: Cosmetic post-merge cleanups
  KVM: arm/arm64: fix the incompatible matching for external abort
  KVM: arm/arm64: Unify 32bit fault injection
  KVM: arm/arm64: vgic-its: Implement KVM_DEV_ARM_ITS_CTRL_RESET
  KVM: arm/arm64: Document KVM_DEV_ARM_ITS_CTRL_RESET
  KVM: arm/arm64: vgic-its: Free caches when GITS_BASER Valid bit is cleared
  KVM: arm/arm64: vgic-its: New helper functions to free the caches
  KVM: arm/arm64: vgic-its: Remove kvm_its_unmap_device
  arm/arm64: KVM: Load the timer state when enabling the timer
  KVM: arm/arm64: Rework kvm_timer_should_fire
  KVM: arm/arm64: Get rid of kvm_timer_flush_hwstate
  KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit
  KVM: arm/arm64: Move phys_timer_emulate function
  KVM: arm/arm64: Use kvm_arm_timer_set/get_reg for guest register traps
  ...
2017-11-16 13:00:24 -08:00
Linus Torvalds
5b0e2cb020 powerpc updates for 4.15
Non-highlights:
 
  - Five fixes for the >128T address space handling, both to fix bugs in our
    implementation and to bring the semantics exactly into line with x86.
 
 Highlights:
 
  - Support for a new OPAL call on bare metal machines which gives us a true NMI
    (ie. is not masked by MSR[EE]=0) for debugging etc.
 
  - Support for Power9 DD2 in the CXL driver.
 
  - Improvements to machine check handling so that uncorrectable errors can be
    reported into the generic memory_failure() machinery.
 
  - Some fixes and improvements for VPHN, which is used under PowerVM to notify
    the Linux partition of topology changes.
 
  - Plumbing to enable TM (transactional memory) without suspend on some Power9
    processors (PPC_FEATURE2_HTM_NO_SUSPEND).
 
  - Support for emulating vector loads form cache-inhibited memory, on some
    Power9 revisions.
 
  - Disable the fast-endian switch "syscall" by default (behind a CONFIG), we
    believe it has never had any users.
 
  - A major rework of the API drivers use when initiating and waiting for long
    running operations performed by OPAL firmware, and changes to the
    powernv_flash driver to use the new API.
 
  - Several fixes for the handling of FP/VMX/VSX while processes are using
    transactional memory.
 
  - Optimisations of TLB range flushes when using the radix MMU on Power9.
 
  - Improvements to the VAS facility used to access coprocessors on Power9, and
    related improvements to the way the NX crypto driver handles requests.
 
  - Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew Donnellan, Aneesh
   Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Breno Leitao,
   Christophe Leroy, Christophe Lombard, Cyril Bur, Frederic Barrat, Gautham R.
   Shenoy, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo Romero, Haren
   Myneni, Joel Stanley, Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami
   Hiramatsu, Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
   Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia Franco de
   Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee, Shriya, Stephen
   Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel Datwyler, Vaibhav Jain,
   Vaidyanathan Srinivasan, William A. Kennington III.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJaDXGuAAoJEFHr6jzI4aWAEqwP/0TA35KFAK6wqfkCf67z4q+O
 I+5piI4eDV4jdCakfoIN1JfjhQRULNePSoCHTccan30mu/bm30p69xtOLL2/h5xH
 Mhz/eDBAOo0lrT20nyZfYMW3FnM66wnNf++qJ0O+8L052r4WOB02J0k1uM1ST01D
 5Lb5mUoxRLRzCgKRYAYWJifn+IFPUB9NMsvMTym94krAFlIjIzMEQXhDoln+jJMr
 QmY5f1BTA/fLfXobn0zwoc/C1oa2PUtxd+rxbwGrLoZ6G843mMqUi90SMr5ybhXp
 RzepnBTj4by3vOsnk/X1mANyaZfLsunp75FwnjHdPzKrAS/TuPp8D/iSxxE/PzEq
 cLwJFBnFXSgQMefDErXxhHSDz2dAg5r14rsTpDcq2Ko8TPV4rPsuSfmbd9Txekb0
 yWHsjoJUBBMl2QcWqIHl+AlV8j1RklF6solcTBcGnH1CZJMfa05VKXV7xGEvOHa0
 RJ+/xPyR9KjoB/SUp++9Vmx/M6SwQYFOJlr3Zpg9LNtR8WpoPYu1E6eO+u1Hhzny
 eJqaNstH+i+VdY9eqszkAsEBh8o9M/+b+7Wx7TetvU+v368CbXtgFYs9qy2oZjPF
 t9sY/BHaHZ8eZ7I00an77a0fVV5B1PVASUtIz5CqkwGpMvX6Z6W2K/XUUFI61kuu
 E06HS6Ht8UPJAzrAPUMl
 =Rq81
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "A bit of a small release, I suspect in part due to me travelling for
  KS. But my backlog of patches to review is smaller than usual, so I
  think in part folks just didn't send as much this cycle.

  Non-highlights:

   - Five fixes for the >128T address space handling, both to fix bugs
     in our implementation and to bring the semantics exactly into line
     with x86.

  Highlights:

   - Support for a new OPAL call on bare metal machines which gives us a
     true NMI (ie. is not masked by MSR[EE]=0) for debugging etc.

   - Support for Power9 DD2 in the CXL driver.

   - Improvements to machine check handling so that uncorrectable errors
     can be reported into the generic memory_failure() machinery.

   - Some fixes and improvements for VPHN, which is used under PowerVM
     to notify the Linux partition of topology changes.

   - Plumbing to enable TM (transactional memory) without suspend on
     some Power9 processors (PPC_FEATURE2_HTM_NO_SUSPEND).

   - Support for emulating vector loads form cache-inhibited memory, on
     some Power9 revisions.

   - Disable the fast-endian switch "syscall" by default (behind a
     CONFIG), we believe it has never had any users.

   - A major rework of the API drivers use when initiating and waiting
     for long running operations performed by OPAL firmware, and changes
     to the powernv_flash driver to use the new API.

   - Several fixes for the handling of FP/VMX/VSX while processes are
     using transactional memory.

   - Optimisations of TLB range flushes when using the radix MMU on
     Power9.

   - Improvements to the VAS facility used to access coprocessors on
     Power9, and related improvements to the way the NX crypto driver
     handles requests.

   - Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.

  Thanks to: Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew
  Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin
  Herrenschmidt, Breno Leitao, Christophe Leroy, Christophe Lombard,
  Cyril Bur, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven,
  Guilherme G. Piccoli, Gustavo Romero, Haren Myneni, Joel Stanley,
  Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami Hiramatsu,
  Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
  Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia
  Franco de Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee,
  Shriya, Stephen Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel
  Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, and William A.
  Kennington III"

* tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (151 commits)
  powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature
  powerpc/64s: Fix masking of SRR1 bits on instruction fault
  powerpc/64s: mm_context.addr_limit is only used on hash
  powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
  powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  powerpc/64s/hash: Fix fork() with 512TB process address space
  powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  powerpc/64s/hash: Fix 512T hint detection to use >= 128T
  powerpc: Fix DABR match on hash based systems
  powerpc/signal: Properly handle return value from uprobe_deny_signal()
  powerpc/fadump: use kstrtoint to handle sysfs store
  powerpc/lib: Implement UACCESS_FLUSHCACHE API
  powerpc/lib: Implement PMEM API
  powerpc/powernv/npu: Don't explicitly flush nmmu tlb
  powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm()
  powerpc/powernv/idle: Round up latency and residency values
  powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
  powerpc/kprobes: Blacklist emulate_update_regs() from kprobes
  powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace
  powerpc/kprobes: Disable preemption before invoking probe handler for optprobes
  ...
2017-11-16 12:47:46 -08:00
Paul Mackerras
432953b445 KVM: PPC: Book3S HV: Cosmetic post-merge cleanups
This rearranges the code in kvmppc_run_vcpu() and kvmppc_run_vcpu_hv()
to be neater and clearer.  Deeply indented code in kvmppc_run_vcpu()
is moved out to a helper function, kvmhv_setup_mmu().  In
kvmppc_vcpu_run_hv(), make use of the existing variable 'kvm' in
place of 'vcpu->kvm'.

No functional change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-09 15:37:10 +11:00
Paul Mackerras
072df8130c Merge branch 'kvm-ppc-fixes' into kvm-ppc-next
This merges in a couple of fixes from the kvm-ppc-fixes branch that
modify the same areas of code as some commits from the kvm-ppc-next
branch, in order to resolve the conflicts.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-09 14:30:24 +11:00
Radim Krčmář
d850a255d5 PPC KVM fixes for 4.14
Just one fix here for a host crash that can occur with HV KVM
 as a result of resizing the guest hashed page table (HPT).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaApLVAAoJEJ2a6ncsY3GfNfcIAJk93C9FK6k2urAORP3lDmKy
 P6a4LnkMrQTuUCBGrkP4F1hGq2vpH6o/KeoEdhAgLMHHsarzMyBc5N7rHMHgZUzI
 bUna0LaXtjdb5IP0kcDb8HmulmBaFiMf+sa2i3dIW3sCxtvqzzmxOluR0C29fG1I
 gTdJV0XDzhQHJLixcQ3i4pi/K6b+wzXrY7fFPMpI2Wji6cKYr0ZL0fG8bQ0pV4OZ
 0YgV9sR8mVN17JKU9R4GYz9fkp3+cXDG4xBVtczDlK6TJzF2XVUGgY/iJLMAyDRw
 9gcEiIc+khkqyfuQt8iYBiHqRJ7HiT4yX1LMI9dM2vTZi23zsG3yTmsIc16QZLg=
 =MzO/
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

PPC KVM fixes for 4.14

Just one fix here for a host crash that can occur with HV KVM
as a result of resizing the guest hashed page table (HPT).
2017-11-08 14:08:59 +01:00
Paul Mackerras
38c53af853 KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates
Commit 5e9859699a ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing
implementation", 2016-12-20) added code that tries to exclude any use
or update of the hashed page table (HPT) while the HPT resizing code
is iterating through all the entries in the HPT.  It does this by
taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done
flag and then sending an IPI to all CPUs in the host.  The idea is
that any VCPU task that tries to enter the guest will see that the
hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma,
which also takes the kvm->lock mutex and will therefore block until
we release kvm->lock.

However, any VCPU that is already in the guest, or is handling a
hypervisor page fault or hypercall, can re-enter the guest without
rechecking the hpte_setup_done flag.  The IPI will cause a guest exit
of any VCPUs that are currently in the guest, but does not prevent
those VCPU tasks from immediately re-entering the guest.

The result is that after resize_hpt_rehash_hpte() has made a HPTE
absent, a hypervisor page fault can occur and make that HPTE present
again.  This includes updating the rmap array for the guest real page,
meaning that we now have a pointer in the rmap array which connects
with pointers in the old rev array but not the new rev array.  In
fact, if the HPT is being reduced in size, the pointer in the rmap
array could point outside the bounds of the new rev array.  If that
happens, we can get a host crash later on such as this one:

[91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c
[91652.628668] Faulting instruction address: 0xc0000000000e2640
[91652.628736] Oops: Kernel access of bad area, sig: 11 [#1]
[91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV
[91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259]
[91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G           O    4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1
[91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000
[91652.629750] NIP:  c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0
[91652.629829] REGS: c0000000028db5d0 TRAP: 0300   Tainted: G           O     (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le)
[91652.629932] MSR:  900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44022422  XER: 00000000
[91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1
[91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000
[91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000
[91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70
[91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000
[91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10
[91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000
[91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000
[91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100
[91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120
[91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[91652.630884] Call Trace:
[91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable)
[91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv]
[91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm]
[91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
[91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0
[91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130
[91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c
[91652.631635] Instruction dump:
[91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110
[91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008
[91652.631959] ---[ end trace ac85ba6db72e5b2e ]---

To fix this, we tighten up the way that the hpte_setup_done flag is
checked to ensure that it does provide the guarantee that the resizing
code needs.  In kvmppc_run_core(), we check the hpte_setup_done flag
after disabling interrupts and refuse to enter the guest if it is
clear (for a HPT guest).  The code that checks hpte_setup_done and
calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv()
to a point inside the main loop in kvmppc_run_vcpu(), ensuring that
we don't just spin endlessly calling kvmppc_run_core() while
hpte_setup_done is clear, but instead have a chance to block on the
kvm->lock mutex.

Finally we also check hpte_setup_done inside the region in
kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about
to update the HPTE, and bail out if it is clear.  If another CPU is
inside kvm_vm_ioctl_resize_hpt_commit) and has cleared hpte_setup_done,
then we know that either we are looking at a HPTE
that resize_hpt_rehash_hpte() has not yet processed, which is OK,
or else we will see hpte_setup_done clear and refuse to update it,
because of the full barrier formed by the unlock of the HPTE in
resize_hpt_rehash_hpte() combined with the locking of the HPTE
in kvmppc_book3s_hv_page_fault().

Fixes: 5e9859699a ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Satheesh Rajendran <satheera@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-08 15:14:02 +11:00
Nicholas Piggin
6de6638b35 KVM: PPC: Book3S HV: Handle host system reset in guest mode
If the host takes a system reset interrupt while a guest is running,
the CPU must exit the guest before processing the host exception
handler.

After this patch, taking a sysrq+x with a CPU running in a guest
gives a trace like this:

   cpu 0x27: Vector: 100 (System Reset) at [c000000fdf5776f0]
       pc: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
       lr: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
       sp: c000000fdf577850
      msr: 9000000002803033
     current = 0xc000000fdf4b1e00
     paca    = 0xc00000000fd4d680	 softe: 3	 irq_happened: 0x01
       pid   = 6608, comm = qemu-system-ppc
   Linux version 4.14.0-rc7-01489-g47e1893a404a-dirty #26 SMP
   [c000000fdf577a00] c008000010159dd4 kvmppc_vcpu_run_hv+0x3dc/0x12d0 [kvm_hv]
   [c000000fdf577b30] c0080000100a537c kvmppc_vcpu_run+0x44/0x60 [kvm]
   [c000000fdf577b60] c0080000100a1ae0 kvm_arch_vcpu_ioctl_run+0x118/0x310 [kvm]
   [c000000fdf577c00] c008000010093e98 kvm_vcpu_ioctl+0x530/0x7c0 [kvm]
   [c000000fdf577d50] c000000000357bf8 do_vfs_ioctl+0xd8/0x8c0
   [c000000fdf577df0] c000000000358448 SyS_ioctl+0x68/0x100
   [c000000fdf577e30] c00000000000b220 system_call+0x58/0x6c
   --- Exception: c01 (System Call) at 00007fff76868df0
   SP (7fff7069baf0) is in userspace

Fixes: e36d0a2ed5 ("powerpc/powernv: Implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-06 16:48:06 +11:00
Linus Torvalds
ead751507d License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
 makes it harder for compliance tools to determine the correct license.
 
 By default all files without license information are under the default
 license of the kernel, which is GPL version 2.
 
 Update the files which contain no license information with the 'GPL-2.0'
 SPDX license identifier.  The SPDX identifier is a legally binding
 shorthand, which can be used instead of the full boiler plate text.
 
 This patch is based on work done by Thomas Gleixner and Kate Stewart and
 Philippe Ombredanne.
 
 How this work was done:
 
 Patches were generated and checked against linux-4.14-rc6 for a subset of
 the use cases:
  - file had no licensing information it it.
  - file was a */uapi/* one with no licensing information in it,
  - file was a */uapi/* one with existing licensing information,
 
 Further patches will be generated in subsequent months to fix up cases
 where non-standard license headers were used, and references to license
 had to be inferred by heuristics based on keywords.
 
 The analysis to determine which SPDX License Identifier to be applied to
 a file was done in a spreadsheet of side by side results from of the
 output of two independent scanners (ScanCode & Windriver) producing SPDX
 tag:value files created by Philippe Ombredanne.  Philippe prepared the
 base worksheet, and did an initial spot review of a few 1000 files.
 
 The 4.13 kernel was the starting point of the analysis with 60,537 files
 assessed.  Kate Stewart did a file by file comparison of the scanner
 results in the spreadsheet to determine which SPDX license identifier(s)
 to be applied to the file. She confirmed any determination that was not
 immediately clear with lawyers working with the Linux Foundation.
 
 Criteria used to select files for SPDX license identifier tagging was:
  - Files considered eligible had to be source code files.
  - Make and config files were included as candidates if they contained >5
    lines of source
  - File already had some variant of a license header in it (even if <5
    lines).
 
 All documentation files were explicitly excluded.
 
 The following heuristics were used to determine which SPDX license
 identifiers to apply.
 
  - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.
 
    For non */uapi/* files that summary was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0                                              11139
 
    and resulted in the first patch in this series.
 
    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                        930
 
    and resulted in the second patch in this series.
 
  - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point).  Results summary:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                       270
    GPL-2.0+ WITH Linux-syscall-note                      169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
    LGPL-2.1+ WITH Linux-syscall-note                      15
    GPL-1.0+ WITH Linux-syscall-note                       14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
    LGPL-2.0+ WITH Linux-syscall-note                       4
    LGPL-2.1 WITH Linux-syscall-note                        3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
 
    and that resulted in the third patch in this series.
 
  - when the two scanners agreed on the detected license(s), that became
    the concluded license(s).
 
  - when there was disagreement between the two scanners (one detected a
    license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.
 
  - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply (and
    which scanner probably needed to revisit its heuristics).
 
  - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.
 
  - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.
 
 In total, over 70 hours of logged manual review was done on the
 spreadsheet to determine the SPDX license identifiers to apply to the
 source files by Kate, Philippe, Thomas and, in some cases, confirmation
 by lawyers working with the Linux Foundation.
 
 Kate also obtained a third independent scan of the 4.13 code base from
 FOSSology, and compared selected files where the other two scanners
 disagreed against that SPDX file, to see if there was new insights.  The
 Windriver scanner is based on an older version of FOSSology in part, so
 they are related.
 
 Thomas did random spot checks in about 500 files from the spreadsheets
 for the uapi headers and agreed with SPDX license identifier in the
 files he inspected. For the non-uapi files Thomas did random spot checks
 in about 15000 files.
 
 In initial set of patches against 4.14-rc6, 3 files were found to have
 copy/paste license identifier errors, and have been fixed to reflect the
 correct identifier.
 
 Additionally Philippe spent 10 hours this week doing a detailed manual
 inspection and review of the 12,461 patched files from the initial patch
 version early this week with:
  - a full scancode scan run, collecting the matched texts, detected
    license ids and scores
  - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct
  - reviewing anything where there was no detection but the patch license
    was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
    SPDX license was correct
 
 This produced a worksheet with 20 files needing minor correction.  This
 worksheet was then exported into 3 different .csv files for the
 different types of files to be modified.
 
 These .csv files were then reviewed by Greg.  Thomas wrote a script to
 parse the csv files and add the proper SPDX tag to the file, in the
 format that the file expected.  This script was further refined by Greg
 based on the output to detect more types of files automatically and to
 distinguish between header and source .c files (which need different
 comment types.)  Finally Greg ran the script using the .csv files to
 generate the patches.
 
 Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
 Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
 Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
 6dVh26uchcEQLN/XqUDt
 =x306
 -----END PGP SIGNATURE-----

Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information it it.

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information,

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from of
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few 1000 files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging was:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
     was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there was new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types.) Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Paul Mackerras
c01015091a KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts
This patch removes the restriction that a radix host can only run
radix guests, allowing us to run HPT (hashed page table) guests as
well.  This is useful because it provides a way to run old guest
kernels that know about POWER8 but not POWER9.

Unfortunately, POWER9 currently has a restriction that all threads
in a given code must either all be in HPT mode, or all in radix mode.
This means that when entering a HPT guest, we have to obtain control
of all 4 threads in the core and get them to switch their LPIDR and
LPCR registers, even if they are not going to run a guest.  On guest
exit we also have to get all threads to switch LPIDR and LPCR back
to host values.

To make this feasible, we require that KVM not be in the "independent
threads" mode, and that the CPU cores be in single-threaded mode from
the host kernel's perspective (only thread 0 online; threads 1, 2 and
3 offline).  That allows us to use the same code as on POWER8 for
obtaining control of the secondary threads.

To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
struct to contain the information needed by the secondary threads.
All threads perform a barrier synchronization (where all threads wait
for every other thread to reach the synchronization point) on guest
entry, both before and after loading LPCR and LPIDR.  On guest exit,
they all once again perform a barrier synchronization both before
and after loading host values into LPCR and LPIDR.

Finally, it is also currently necessary to flush the entire TLB every
time we enter a HPT guest on a radix host.  We do this on thread 0
with a loop of tlbiel instructions.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:36:41 +11:00
Paul Mackerras
516f7898ae KVM: PPC: Book3S HV: Allow for running POWER9 host in single-threaded mode
This patch allows for a mode on POWER9 hosts where we control all the
threads of a core, much as we do on POWER8.  The mode is controlled by
a module parameter on the kvm_hv module, called "indep_threads_mode".
The normal mode on POWER9 is the "independent threads" mode, with
indep_threads_mode=Y, where the host is in SMT4 mode (or in fact any
desired SMT mode) and each thread independently enters and exits from
KVM guests without reference to what other threads in the core are
doing.

If indep_threads_mode is set to N at the point when a VM is started,
KVM will expect every core that the guest runs on to be in single
threaded mode (that is, threads 1, 2 and 3 offline), and will set the
flag that prevents secondary threads from coming online.  We can still
use all four threads; the code that implements dynamic micro-threading
on POWER8 will become active in over-commit situations and will allow
up to three other VCPUs to be run on the secondary threads of the core
whenever a VCPU is run.

The reason for wanting this mode is that this will allow us to run HPT
guests on a radix host on a POWER9 machine that does not support
"mixed mode", that is, having some threads in a core be in HPT mode
while other threads are in radix mode.  It will also make it possible
to implement a "strict threads" mode in future, if desired.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:36:35 +11:00
Paul Mackerras
18c3640cef KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host
This sets up the machinery for switching a guest between HPT (hashed
page table) and radix MMU modes, so that in future we can run a HPT
guest on a radix host on POWER9 machines.

* The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
  radix mode, on a radix host.

* The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
  with HV KVM on a radix host.

* The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
  radix host.

* The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
  guest to HPT mode and allocate a HPT.

* For simplicity, we now allocate the rmap array for each memslot,
  even on a radix host, since it will be needed if the guest switches
  to HPT mode.

* Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
  ioctl will return an EINVAL error in that case.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:36:28 +11:00
Paul Mackerras
e641a31783 KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot.  In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.

This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does.  This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.

There is one minor change to behaviour as a result.  With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime).  With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty.  This is consistent with what happens
on radix.

This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists).  In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap.  Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:36:21 +11:00
Paul Mackerras
1b151ce466 KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_ready
This renames the kvm->arch.hpte_setup_done field to mmu_ready because
we will want to use it for radix guests too -- both for setting things
up before vcpu execution, and for excluding vcpus from executing while
MMU-related things get changed, such as in future switching the MMU
from radix to HPT mode or vice-versa.

This also moves the call to kvmppc_setup_partition_table() that was
done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting
of mmu_ready, into the caller in kvmppc_vcpu_run_hv().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:36:12 +11:00
Paul Mackerras
8dc6cca556 KVM: PPC: Book3S HV: Don't rely on host's page size information
This removes the dependence of KVM on the mmu_psize_defs array (which
stores information about hardware support for various page sizes) and
the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(),
hpte_actual_page_size() and get_sllp_encoding().  We also no longer
rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature
bit.

The reason for doing this is so we can support a HPT guest on a radix
host.  In a radix host, the mmu_psize_defs array contains information
about page sizes supported by the MMU in radix mode rather than the
page sizes supported by the MMU in HPT mode.  Similarly, mmu_slb_size
and the MMU_FTR_1T_SEGMENTS bit are not set.

Instead we hard-code knowledge of the behaviour of the HPT MMU in the
POWER7, POWER8 and POWER9 processors (which are the only processors
supported by HV KVM) - specifically the encoding of the LP fields in
the HPT and SLB entries, and the fact that they have 32 SLB entries
and support 1TB segments.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:36:06 +11:00
Paul Mackerras
3e8f150a3b Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch of the powerpc tree to get the
commit that reverts the patch "KVM: PPC: Book3S HV: POWER9 does not
require secondary thread management".  This is needed for subsequent
patches which will be applied on this branch.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:33:39 +11:00
Nicholas Piggin
93897a1f4b KVM: PPC: Book3S: Fix gas warning due to using r0 as immediate 0
This fixes the message:

arch/powerpc/kvm/book3s_segment.S: Assembler messages:
arch/powerpc/kvm/book3s_segment.S:330: Warning: invalid register expression

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:17:25 +11:00
Greg Kurz
f4093ee9d0 KVM: PPC: Book3S PR: Only install valid SLBs during KVM_SET_SREGS
Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS,
some of which are valid (ie, SLB_ESID_V is set) and the rest are
likely all-zeroes (with QEMU at least).

Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which
assumes to find the SLB index in the 3 lower bits of its rb argument.
When passed zeroed arguments, it happily overwrites the 0th SLB entry
with zeroes. This is exactly what happens while doing live migration
with QEMU when the destination pushes the incoming SLB descriptors to
KVM PR. When reloading the SLBs at the next synchronization, QEMU first
clears its SLB array and only restore valid ones, but the 0th one is
now gone and we cannot access the corresponding memory anymore:

(qemu) x/x $pc
c0000000000b742c: Cannot access memory

To avoid this, let's filter out non-valid SLB entries. While here, we
also force a full SLB flush before installing new entries. Since SLB
is for 64-bit only, we now build this path conditionally to avoid a
build break on 32-bit, which doesn't define SLB_ESID_V.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:17:25 +11:00
Paul Mackerras
00bb6ae500 KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabled
When running a guest on a POWER9 system with the in-kernel XICS
emulation disabled (for example by running QEMU with the parameter
"-machine pseries,kernel_irqchip=off"), the kernel does not pass
the XICS-related hypercalls such as H_CPPR up to userspace for
emulation there as it should.

The reason for this is that the real-mode handlers for these
hypercalls don't check whether a XICS device has been instantiated
before calling the xics-on-xive code.  That code doesn't check
either, leading to potential NULL pointer dereferences because
vcpu->arch.xive_vcpu is NULL.  Those dereferences won't cause an
exception in real mode but will lead to kernel memory corruption.

This fixes it by adding kvmppc_xics_enabled() checks before calling
the XICS functions.

Cc: stable@vger.kernel.org # v4.11+
Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-01 15:09:32 +11:00
Michael Ellerman
2a3d6553cb KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM feature
Currently we use CPU_FTR_TM to decide if the CPU/kernel can support
TM (Transactional Memory), and if it's true we advertise that to
Qemu (or similar) via KVM_CAP_PPC_HTM.

PPC_FEATURE2_HTM is the user-visible feature bit, which indicates that
the CPU and kernel can support TM. Currently CPU_FTR_TM and
PPC_FEATURE2_HTM always have the same value, either true or false, so
using the former for KVM_CAP_PPC_HTM is correct.

However some Power9 CPUs can operate in a mode where TM is enabled but
TM suspended state is disabled. In this mode CPU_FTR_TM is true, but
PPC_FEATURE2_HTM is false. Instead a different PPC_FEATURE2 bit is
set, to indicate that this different mode of TM is available.

It is not safe to let guests use TM as-is, when the CPU is in this
mode. So to prevent that from happening, use PPC_FEATURE2_HTM to
determine the value of KVM_CAP_PPC_HTM.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-20 11:09:26 +11:00
Paul Mackerras
31a4d4480c Revert "KVM: PPC: Book3S HV: POWER9 does not require secondary thread management"
This reverts commit 94a04bc25a.

In order to run HPT guests on a radix POWER9 host, we will have to run
the host in single-threaded mode, because POWER9 processors do not
currently support running some threads of a core in HPT mode while
others are in radix mode ("mixed mode").

That means that we will need the same mechanisms that are used on
POWER8 to make the secondary threads available to KVM, which were
disabled on POWER9 by commit 94a04bc25a.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-10-19 15:28:04 +11:00
Benjamin Herrenschmidt
ad98dd1a75 KVM: PPC: Book3S HV: Add more barriers in XIVE load/unload code
On POWER9 systems, we push the VCPU context onto the XIVE (eXternal
Interrupt Virtualization Engine) hardware when entering a guest,
and pull the context off the XIVE when exiting the guest.  The push
is done with cache-inhibited stores, and the pull with cache-inhibited
loads.

Testing has revealed that it is possible (though very rare) for
the stores to get reordered with the loads so that we end up with the
guest VCPU context still loaded on the XIVE after we have exited the
guest.  When that happens, it is possible for the same VCPU context
to then get loaded on another CPU, which causes the machine to
checkstop.

To fix this, we add I/O barrier instructions (eieio) before and
after the push and pull operations.  As partial compensation for the
potential slowdown caused by the extra barriers, we remove the eieio
instructions between the two stores in the push operation, and between
the two loads in the pull operation.  (The architecture requires
loads to cache-inhibited, guarded storage to be kept in order, and
requires stores to cache-inhibited, guarded storage likewise to be
kept in order, but allows such loads and stores to be reordered with
respect to each other.)

Reported-by: Carol L Soto <clsoto@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-16 08:46:46 +11:00
Paul Mackerras
891f1ebf65 KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guests
This adds code to make sure that we don't try to access the
non-existent HPT for a radix guest using the htab file for the VM
in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD
ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls.

At present nothing bad happens if userspace does access these
interfaces on a radix guest, mostly because kvmppc_hpt_npte()
gives 0 for a radix guest, which in turn is because 1 << -4
comes out as 0 on POWER processors.  However, that relies on
undefined behaviour, so it is better to be explicit about not
accessing the HPT for a radix guest.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-16 08:09:53 +11:00
Alexey Kardashevskiy
3f2bb76433 KVM: PPC: Book3S PR: Enable in-kernel TCE handlers for PR KVM
The handlers support PR KVM from the day one; however the PR KVM's
enable/disable hcalls handler missed these ones.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 16:38:19 +11:00
Markus Elfring
9c7e53dc00 KVM: PPC: Book3S HV: Delete an error message for a failed memory allocation in kvmppc_allocate_hpt()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 16:38:15 +11:00
Thomas Meyer
4bdcb7016f KVM: PPC: BookE: Use vma_pages function
Use vma_pages function on vma object instead of explicit computation.
Found by coccinelle spatch "api/vma_pages.cocci"

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 13:39:49 +11:00
Thomas Meyer
4bb817ed83 KVM: PPC: Book3S HV: Use ARRAY_SIZE macro
Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
yourself.
Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)
/ARRAY_SIZE(\1)/g' and manual check/verification.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 13:39:49 +11:00
Paul Mackerras
857b99e140 KVM: PPC: Book3S HV: Handle unexpected interrupts better
At present, if an interrupt (i.e. an exception or trap) occurs in the
code where KVM is switching the MMU to or from guest context, we jump
to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
In this situation, it is hard to debug what happened because we get no
indication as to which interrupt occurred or where.  Typically we get
a cascade of stall and soft lockup warnings from other CPUs.

In order to get more information for debugging, this adds code to
create a stack frame on the emergency stack and save register values
to it.  We start half-way down the emergency stack in order to give
ourselves some chance of being able to do a stack trace on secondary
threads that are already on the emergency stack.

On POWER7 or POWER8, we then just spin, as before, because we don't
know what state the MMU context is in or what other threads are doing,
and we can't switch back to host context without coordinating with
other threads.  On POWER9 we can do better; there we load up the host
MMU context and jump to C code, which prints an oops message to the
console and panics.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 13:35:51 +11:00
Alexey Kardashevskiy
8f6a9f0d06 KVM: PPC: Book3S: Protect kvmppc_gpa_to_ua() with SRCU
kvmppc_gpa_to_ua() accesses KVM memory slot array via
srcu_dereference_check() and this produces warnings from RCU like below.

This extends the existing srcu_read_lock/unlock to cover that
kvmppc_gpa_to_ua() as well.

We did not hit this before as this lock is not needed for the realmode
handlers and hash guests would use the realmode path all the time;
however the radix guests are always redirected to the virtual mode
handlers and hence the warning.

[   68.253798] ./include/linux/kvm_host.h:575 suspicious rcu_dereference_check() usage!
[   68.253799]
               other info that might help us debug this:

[   68.253802]
               rcu_scheduler_active = 2, debug_locks = 1
[   68.253804] 1 lock held by qemu-system-ppc/6413:
[   68.253806]  #0:  (&vcpu->mutex){+.+.}, at: [<c00800000e3c22f4>] vcpu_load+0x3c/0xc0 [kvm]
[   68.253826]
               stack backtrace:
[   68.253830] CPU: 92 PID: 6413 Comm: qemu-system-ppc Tainted: G        W       4.14.0-rc3-00553-g432dcba58e9c-dirty #72
[   68.253833] Call Trace:
[   68.253839] [c000000fd3d9f790] [c000000000b7fcc8] dump_stack+0xe8/0x160 (unreliable)
[   68.253845] [c000000fd3d9f7d0] [c0000000001924c0] lockdep_rcu_suspicious+0x110/0x180
[   68.253851] [c000000fd3d9f850] [c0000000000e825c] kvmppc_gpa_to_ua+0x26c/0x2b0
[   68.253858] [c000000fd3d9f8b0] [c00800000e3e1984] kvmppc_h_put_tce+0x12c/0x2a0 [kvm]

Fixes: 121f80ba68 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 11:35:41 +11:00
Nicholas Piggin
2cde371632 KVM: PPC: Book3S HV: POWER9 more doorbell fixes
- Add another case where msgsync is required.
- Required barrier sequence for global doorbells is msgsync ; lwsync

When msgsnd is used for IPIs to other cores, msgsync must be executed by
the target to order stores performed on the source before its msgsnd
(provided the source executes the appropriate sync).

Fixes: 1704a81cce ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 11:32:53 +11:00
Greg Kurz
ac64115a66 KVM: PPC: Fix oops when checking KVM_CAP_PPC_HTM
The following program causes a kernel oops:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

main()
{
    int fd = open("/dev/kvm", O_RDWR);
    ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
}

This happens because when using the global KVM fd with
KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
called with a NULL kvm argument, which gets dereferenced
in is_kvmppc_hv_enabled(). Spotted while reading the code.

Let's use the hv_enabled fallback variable, like everywhere
else in this function.

Fixes: 23528bb21e ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-14 11:32:53 +11:00
Sam Bobroff
2fb1e94645 KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive()
In KVM's XICS-on-XIVE emulation, kvmppc_xive_get_xive() returns the
value of state->guest_server as "server". However, this value is not
set by it's counterpart kvmppc_xive_set_xive(). When the guest uses
this interface to migrate interrupts away from a CPU that is going
offline, it sees all interrupts as belonging to CPU 0, so they are
left assigned to (now) offline CPUs.

This patch removes the guest_server field from the state, and returns
act_server in it's place (that is, the CPU actually handling the
interrupt, which may differ from the one requested).

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-03 17:58:16 +02:00
Michael Neuling
e001fa78d4 KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception
On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
Interrupt (HDSI) the HDSISR is not be updated at all.

To work around this we put a canary value into the HDSISR before
returning to a guest and then check for this canary when we take a
HDSI. If we find the canary on a HDSI, we know the hardware didn't
update the HDSISR. In this case we return to the guest to retake the
HDSI which should correctly update the HDSISR the second time HDSI
entry.

After talking to Paulus we've applied this workaround to all POWER9
CPUs. The workaround of returning to the guest shouldn't ever be
triggered on well behaving CPU. The extra instructions should have
negligible performance impact.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-09-22 10:45:46 +02:00
Davidlohr Bueso
267ad7bc2d kvm,powerpc: Serialize wq active checks in ops->vcpu_kick
Particularly because kvmppc_fast_vcpu_kick_hv() is a callback,
ensure that we properly serialize wq active checks in order to
avoid potentially missing a wakeup due to racing with the waiter
side.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-09-15 16:57:13 +02:00
Radim Krčmář
648d453dd4 Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
Bug fixes for stable.
2017-09-14 17:21:10 +02:00
Paul Mackerras
67f8a8c115 KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly
Aneesh Kumar reported seeing host crashes when running recent kernels
on POWER8.  The symptom was an oops like this:

Unable to handle kernel paging request for data at address 0xf00000000786c620
Faulting instruction address: 0xc00000000030e1e4
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: powernv_op_panel
CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
task: c000000fdeadfe80 task.stack: c000000fdeb68000
NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
Call Trace:
[c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
[c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
[c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
[c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
[c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
[c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
[c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
[c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
[c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
[c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
Instruction dump:
7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
---[ end trace fad4a342d0414aa2 ]---

It turns out that what has happened is that the SLB entry for the
vmmemap region hasn't been reloaded on exit from a guest, and it has
the wrong page size.  Then, when the host next accesses the vmemmap
region, it gets a page fault.

Commit a25bd72bad ("powerpc/mm/radix: Workaround prefetch issue with
KVM", 2017-07-24) modified the guest exit code so that it now only clears
out the SLB for hash guest.  The code tests the radix flag and puts the
result in a non-volatile CR field, CR2, and later branches based on CR2.

Unfortunately, the kvmppc_save_tm function, which gets called between
those two points, modifies all the user-visible registers in the case
where the guest was in transactional or suspended state, except for a
few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix indication in CR2 gets corrupted.

This fixes the problem by re-doing the comparison just before the
result is needed.  For good measure, this also adds comments next to
the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
that non-volatile register state will be lost.

Cc: stable@vger.kernel.org # v4.13
Fixes: a25bd72bad ("powerpc/mm/radix: Workaround prefetch issue with KVM")
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-09-12 16:02:46 +10:00
Paul Mackerras
cf5f6f3125 KVM: PPC: Book3S HV: Hold kvm->lock around call to kvmppc_update_lpcr
Commit 468808bd35 ("KVM: PPC: Book3S HV: Set process table for HPT
guests on POWER9", 2017-01-30) added a call to kvmppc_update_lpcr()
which doesn't hold the kvm->lock mutex around the call, as required.
This adds the lock/unlock pair, and for good measure, includes
the kvmppc_setup_partition_table() call in the locked region, since
it is altering global state of the VM.

This error appears not to have any fatal consequences for the host;
the consequences would be that the VCPUs could end up running with
different LPCR values, or an update to the LPCR value by userspace
using the one_reg interface could get overwritten, or the update
done by kvmhv_configure_mmu() could get overwritten.

Cc: stable@vger.kernel.org # v4.10+
Fixes: 468808bd35 ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-09-12 16:02:27 +10:00
Benjamin Herrenschmidt
d222af0723 KVM: PPC: Book3S HV: Don't access XIVE PIPR register using byte accesses
The XIVE interrupt controller on POWER9 machines doesn't support byte
accesses to any register in the thread management area other than the
CPPR (current processor priority register).  In particular, when
reading the PIPR (pending interrupt priority register), we need to
do a 32-bit or 64-bit load.

Cc: stable@vger.kernel.org # v4.13
Fixes: 2c4fb78f78 ("KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-09-12 16:02:07 +10:00
Linus Torvalds
0756b7fbb6 First batch of KVM changes for 4.14
Common:
  - improve heuristic for boosting preempted spinlocks by ignoring VCPUs
    in user mode
 
 ARM:
  - fix for decoding external abort types from guests
 
  - added support for migrating the active priority of interrupts when
    running a GICv2 guest on a GICv3 host
 
  - minor cleanup
 
 PPC:
  - expose storage keys to userspace
 
  - merge powerpc/topic/ppc-kvm branch that contains
    find_linux_pte_or_hugepte and POWER9 thread management cleanup
 
  - merge kvm-ppc-fixes with a fix that missed 4.13 because of vacations
 
  - fixes
 
 s390:
  - merge of topic branch tlb-flushing from the s390 tree to get the
    no-dat base features
 
  - merge of kvm/master to avoid conflicts with additional sthyi fixes
 
  - wire up the no-dat enhancements in KVM
 
  - multiple epoch facility (z14 feature)
 
  - Configuration z/Architecture Mode
 
  - more sthyi fixes
 
  - gdb server range checking fix
 
  - small code cleanups
 
 x86:
  - emulate Hyper-V TSC frequency MSRs
 
  - add nested INVPCID
 
  - emulate EPTP switching VMFUNC
 
  - support Virtual GIF
 
  - support 5 level page tables
 
  - speedup nested VM exits by packing byte operations
 
  - speedup MMIO by using hardware provided physical address
 
  - a lot of fixes and cleanups, especially nested
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJZspE1AAoJEED/6hsPKofoDcMIALT11n+LKV50QGwQdg2W1GOt
 aChbgnj/Kegit3hQlDhVNb8kmdZEOZzSL81Lh0VPEr7zXU8QiWn2snbizDPv8sde
 MpHhcZYZZ0YrpoiZKjl8yiwcu88OWGn2qtJ7OpuTS5hvEGAfxMncp0AMZho6fnz/
 ySTwJ9GK2MTgBw39OAzCeDOeoYn4NKYMwjJGqBXRhNX8PG/1wmfqv0vPrd6wfg31
 KJ58BumavwJjr8YbQ1xELm9rpQrAmaayIsG0R1dEUqCbt5a1+t2gt4h2uY7tWcIv
 ACt2bIze7eF3xA+OpRs+eT+yemiH3t9btIVmhCfzUpnQ+V5Z55VMSwASLtTuJRQ=
 =R8Ry
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "First batch of KVM changes for 4.14

  Common:
   - improve heuristic for boosting preempted spinlocks by ignoring
     VCPUs in user mode

  ARM:
   - fix for decoding external abort types from guests

   - added support for migrating the active priority of interrupts when
     running a GICv2 guest on a GICv3 host

   - minor cleanup

  PPC:
   - expose storage keys to userspace

   - merge kvm-ppc-fixes with a fix that missed 4.13 because of
     vacations

   - fixes

  s390:
   - merge of kvm/master to avoid conflicts with additional sthyi fixes

   - wire up the no-dat enhancements in KVM

   - multiple epoch facility (z14 feature)

   - Configuration z/Architecture Mode

   - more sthyi fixes

   - gdb server range checking fix

   - small code cleanups

  x86:
   - emulate Hyper-V TSC frequency MSRs

   - add nested INVPCID

   - emulate EPTP switching VMFUNC

   - support Virtual GIF

   - support 5 level page tables

   - speedup nested VM exits by packing byte operations

   - speedup MMIO by using hardware provided physical address

   - a lot of fixes and cleanups, especially nested"

* tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  KVM: arm/arm64: Support uaccess of GICC_APRn
  KVM: arm/arm64: Extract GICv3 max APRn index calculation
  KVM: arm/arm64: vITS: Drop its_ite->lpi field
  KVM: arm/arm64: vgic: constify seq_operations and file_operations
  KVM: arm/arm64: Fix guest external abort matching
  KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
  KVM: s390: vsie: cleanup mcck reinjection
  KVM: s390: use WARN_ON_ONCE only for checking
  KVM: s390: guestdbg: fix range check
  KVM: PPC: Book3S HV: Report storage key support to userspace
  KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
  KVM: PPC: Book3S HV: Fix invalid use of register expression
  KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
  KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
  KVM: PPC: e500mc: Fix a NULL dereference
  KVM: PPC: e500: Fix some NULL dereferences on error
  KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
  KVM: s390: we are always in czam mode
  KVM: s390: expose no-DAT to guest and migration support
  KVM: s390: sthyi: remove invalid guest write access
  ...
2017-09-08 15:18:36 -07:00
Radim Krčmář
5f54c8b2d4 Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
This fix was intended for 4.13, but didn't get in because both
maintainers were on vacation.

Paul Mackerras:
 "It adds mutual exclusion between list_add_rcu and list_del_rcu calls
  on the kvm->arch.spapr_tce_tables list.  Without this, userspace could
  potentially trigger corruption of the list and cause a host crash or
  worse."
2017-09-08 14:40:43 +02:00
Linus Torvalds
bac65d9d87 powerpc updates for 4.14
Nothing really major this release, despite quite a lot of activity. Just lots of
 things all over the place.
 
 Some things of note include:
 
  - Access via perf to a new type of PMU (IMC) on Power9, which can count both
    core events as well as nest unit events (Memory controller etc).
 
  - Optimisations to the radix MMU TLB flushing, mostly to avoid unnecessary Page
    Walk Cache (PWC) flushes when the structure of the tree is not changing.
 
  - Reworks/cleanups of do_page_fault() to modernise it and bring it closer to
    other architectures where possible.
 
  - Rework of our page table walking so that THP updates only need to send IPIs
    to CPUs where the affected mm has run, rather than all CPUs.
 
  - The size of our vmalloc area is increased to 56T on 64-bit hash MMU systems.
    This avoids problems with the percpu allocator on systems with very sparse
    NUMA layouts.
 
  - STRICT_KERNEL_RWX support on PPC32.
 
  - A new sched domain topology for Power9, to capture the fact that pairs of
    cores may share an L2 cache.
 
  - Power9 support for VAS, which is a new mechanism for accessing coprocessors,
    and initial support for using it with the NX compression accelerator.
 
  - Major work on the instruction emulation support, adding support for many new
    instructions, and reworking it so it can be used to implement the emulation
    needed to fixup alignment faults.
 
  - Support for guests under PowerVM to use the Power9 XIVE interrupt controller.
 
 And probably that many things again that are almost as interesting, but I had to
 keep the list short. Plus the usual fixes and cleanups as always.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Aneesh Kumar K.V, Anju
   T Sudhakar, Arvind Yadav, Balbir Singh, Benjamin Herrenschmidt, Bhumika Goyal,
   Breno Leitao, Bryant G. Ly, Christophe Leroy, Cédric Le Goater, Dan Carpenter,
   Dou Liyang, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand,
   Hannes Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall, LABBE
   Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring, Masahiro
   Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo, Nathan Fontenot,
   Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Rashmica
   Gupta, Rob Herring, Rui Teng, Sam Bobroff, Santosh Sivaraj, Scott Wood,
   Shilpasri G Bhat, Sukadev Bhattiprolu, Suraj Jitindar Singh, Tobin C. Harding,
   Victor Aoqui.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZr83SAAoJEFHr6jzI4aWA6pUP/3CEaj2bSxNzWIwidqyYjuoS
 O1moEsP0oYH7eBEWVHalYxvo0QPIIAhbFPaFyrOrgtfDH01Szwu9LcCALGb8orC5
 Hg3IY8mpNG3Q1T8wEtTa56Ik4b5ZFty35S5+X9qLNSFoDUqSvGlSsLzhPNN7f2tl
 XFm2hWqd8wXCwDsuVSFBCF61M3SAm+g6NMVNJ+VL2KIDCwBrOZLhKDPRoxLTAuMa
 jjSdjVIozWyXjUrBFi8HVcoOWLxcT1HsNF0tRs51LwY/+Mlj2jAtFtsx+a06HZa6
 f2p/Kcp/MEispSTk064Ap9cC1seXWI18zwZKpCUFqu0Ec2yTAiGdjOWDyYQldIp+
 ttVPSHQ01YrVKwDFTtM9CiA0EET6fVPhWgAPkPfvH5TvtKwGkNdy0b+nQLuWrYip
 BUmOXmjdIG3nujCzA9sv6/uNNhjhj2y+HWwuV7Qo002VFkhgZFL67u2SSUQLpYPj
 PxdkY8pPVq+O+in94oDV3c36dYFF6+g6A6505Vn6eKUm/TLpszRFGkS3bKKA5vtn
 74FR+guV/5RwYJcdZbfm04DgAocl7AfUDxpwRxibt6KtAK2VZKQuw4ugUTgYEd7W
 mL2+AMmPKuajWXAMTHjCZPbUp9gFNyYyBQTFfGVX/XLiM8erKBnGfoa1/KzUJkhr
 fVZLYIO/gzl34PiTIfgD
 =UJtt
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Nothing really major this release, despite quite a lot of activity.
  Just lots of things all over the place.

  Some things of note include:

   - Access via perf to a new type of PMU (IMC) on Power9, which can
     count both core events as well as nest unit events (Memory
     controller etc).

   - Optimisations to the radix MMU TLB flushing, mostly to avoid
     unnecessary Page Walk Cache (PWC) flushes when the structure of the
     tree is not changing.

   - Reworks/cleanups of do_page_fault() to modernise it and bring it
     closer to other architectures where possible.

   - Rework of our page table walking so that THP updates only need to
     send IPIs to CPUs where the affected mm has run, rather than all
     CPUs.

   - The size of our vmalloc area is increased to 56T on 64-bit hash MMU
     systems. This avoids problems with the percpu allocator on systems
     with very sparse NUMA layouts.

   - STRICT_KERNEL_RWX support on PPC32.

   - A new sched domain topology for Power9, to capture the fact that
     pairs of cores may share an L2 cache.

   - Power9 support for VAS, which is a new mechanism for accessing
     coprocessors, and initial support for using it with the NX
     compression accelerator.

   - Major work on the instruction emulation support, adding support for
     many new instructions, and reworking it so it can be used to
     implement the emulation needed to fixup alignment faults.

   - Support for guests under PowerVM to use the Power9 XIVE interrupt
     controller.

  And probably that many things again that are almost as interesting,
  but I had to keep the list short. Plus the usual fixes and cleanups as
  always.

  Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
  Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
  Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
  Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
  Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
  Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
  LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
  Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
  Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
  Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
  Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
  Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"

* tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
  powerpc/xive: Fix section __init warning
  powerpc: Fix kernel crash in emulation of vector loads and stores
  powerpc/xive: improve debugging macros
  powerpc/xive: add XIVE Exploitation Mode to CAS
  powerpc/xive: introduce H_INT_ESB hcall
  powerpc/xive: add the HW IRQ number under xive_irq_data
  powerpc/xive: introduce xive_esb_write()
  powerpc/xive: rename xive_poke_esb() in xive_esb_read()
  powerpc/xive: guest exploitation of the XIVE interrupt controller
  powerpc/xive: introduce a common routine xive_queue_page_alloc()
  powerpc/sstep: Avoid used uninitialized error
  axonram: Return directly after a failed kzalloc() in axon_ram_probe()
  axonram: Improve a size determination in axon_ram_probe()
  axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
  powerpc/powernv/npu: Move tlb flush before launching ATSD
  powerpc/macintosh: constify wf_sensor_ops structures
  powerpc/iommu: Use permission-specific DEVICE_ATTR variants
  powerpc/eeh: Delete an error out of memory message at init time
  powerpc/mm: Use seq_putc() in two functions
  macintosh: Convert to using %pOF instead of full_name
  ...
2017-09-07 10:15:40 -07:00
nixiaoming
43f6b0cfb2 KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
We do ctx = kzalloc(sizeof(*ctx), GFP_KERNEL) and then later on call
anon_inode_getfd(), but if that fails we don't free ctx, so that
memory gets leaked.  To fix it, this adds kfree(ctx) in the failure
path.

Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-09-01 10:17:58 +10:00
Paul Mackerras
4dafecde44 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the 'ppc-kvm' topic branch from the powerpc tree in
order to bring in some fixes which touch both powerpc and KVM code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:37:03 +10:00
Paul Mackerras
e3bfed1df3 KVM: PPC: Book3S HV: Report storage key support to userspace
This adds information about storage keys to the struct returned by
the KVM_PPC_GET_SMMU_INFO ioctl.  The new fields replace a pad field,
which was zeroed by previous kernel versions.  Thus userspace that
knows about the new fields will see zeroes when running on an older
kernel, indicating that storage keys are not supported.  The size of
the structure has not changed.

The number of keys is hard-coded for the CPUs supported by HV KVM,
which is just POWER7, POWER8 and POWER9.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Paul Mackerras
a4faf2e77a KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
Commit 2f2724630f ("KVM: PPC: Book3S HV: Cope with host using large
decrementer mode", 2017-05-22) added code to treat the hypervisor
decrementer (HDEC) as a 64-bit value on POWER9 rather than 32-bit.
Unfortunately, that commit missed one place where HDEC is treated
as a 32-bit value.  This fixes it.

This bug should not have any user-visible consequences that I can
think of, beyond an occasional unnecessary exit to the host kernel.
If the hypervisor decrementer has gone negative, then the bottom
32 bits will be negative for about 4 seconds after that, so as
long as we get out of the guest within those 4 seconds we won't
conclude that the HDEC interrupt is spurious.

Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Fixes: 2f2724630f ("KVM: PPC: Book3S HV: Cope with host using large decrementer mode")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Andreas Schwab
0bfa33c7f7 KVM: PPC: Book3S HV: Fix invalid use of register expression
binutils >= 2.26 now warns about misuse of register expressions in
assembler operands that are actually literals. In this instance r0 is
being used where a literal 0 should be used.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
[mpe: Split into separate KVM patch, tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Nicholas Piggin
eaac112eac KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
KVM currently validates the size of the VPA registered by the client
against sizeof(struct lppaca), however we align (and therefore size)
that struct to 1kB to avoid crossing a 4kB boundary in the client.

PAPR calls for sizes >= 640 bytes to be accepted. Hard code this with
a comment.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Ram Pai
d182b8fd60 KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
In handling a H_ENTER hypercall, the code in kvmppc_do_h_enter
clobbers the high-order two bits of the storage key, which is stored
in a split field in the second doubleword of the HPTE.  Any storage
key number above 7 hence fails to operate correctly.

This makes sure we preserve all the bits of the storage key.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Dan Carpenter
50a1a25987 KVM: PPC: e500mc: Fix a NULL dereference
We should set "err = -ENOMEM;", otherwise it means we're returning
ERR_PTR(0) which is NULL.  It results in a NULL pointer dereference in
the caller.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Dan Carpenter
73e77c0982 KVM: PPC: e500: Fix some NULL dereferences on error
There are some error paths in kvmppc_core_vcpu_create_e500() where we
forget to set the error code.  It means that we return ERR_PTR(0) which
is NULL and it results in a NULL pointer dereference in the caller.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Paul Mackerras
edd03602d9 KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
Al Viro pointed out that while one thread of a process is executing
in kvm_vm_ioctl_create_spapr_tce(), another thread could guess the
file descriptor returned by anon_inode_getfd() and close() it before
the first thread has added it to the kvm->arch.spapr_tce_tables list.
That highlights a more general problem: there is no mutual exclusion
between writers to the spapr_tce_tables list, leading to the
possibility of the list becoming corrupted, which could cause a
host kernel crash.

To fix the mutual exclusion problem, we add a mutex_lock/unlock
pair around the list_del_rce in kvm_spapr_tce_release().  Also,
this moves the call to anon_inode_getfd() inside the region
protected by the kvm->lock mutex, after we have done the check for
a duplicate LIOBN.  This means that if another thread does guess the
file descriptor and closes it, its call to kvm_spapr_tce_release()
will not do any harm because it will have to wait until the first
thread has released kvm->lock.  With this, there are no failure
points in kvm_vm_ioctl_create_spapr_tce() after the call to
anon_inode_getfd().

The other things that the second thread could do with the guessed
file descriptor are to mmap it or to pass it as a parameter to a
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl on a KVM device fd.  An mmap
call won't cause any harm because kvm_spapr_tce_mmap() and
kvm_spapr_tce_fault() don't access the spapr_tce_tables list or
the kvmppc_spapr_tce_table.list field, and the fields that they do use
have been properly initialized by the time of the anon_inode_getfd()
call.

The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl calls
kvm_spapr_tce_attach_iommu_group(), which scans the spapr_tce_tables
list looking for the kvmppc_spapr_tce_table struct corresponding to
the fd given as the parameter.  Either it will find the new entry
or it won't; if it doesn't, it just returns an error, and if it
does, it will function normally.  So, in each case there is no
harmful effect.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-30 14:59:31 +10:00
Michael Ellerman
82b7fcc005 Merge branch 'topic/ppc-kvm' into next
Merge Nicks commit to rework the KVM thread management, shared with the
KVM tree via the ppc-kvm topic branch.
2017-08-29 21:26:30 +10:00
Nicholas Piggin
94a04bc25a KVM: PPC: Book3S HV: POWER9 does not require secondary thread management
POWER9 CPUs have independent MMU contexts per thread, so KVM does not
need to quiesce secondary threads, so the hwthread_req/hwthread_state
protocol does not have to be used. So patch it away on POWER9, and patch
away the branch from the Linux idle wakeup to kvm_start_guest that is
never used.

Add a warning and error out of kvmppc_grab_hwthread in case it is ever
called on POWER9.

This avoids a hwsync in the idle wakeup path on POWER9.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Use WARN(...) instead of WARN_ON()/pr_err(...)]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-29 14:48:59 +10:00
Paul Mackerras
47c5310a8d KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce()
Nixiaoming pointed out that there is a memory leak in
kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd()
fails; the memory allocated for the kvmppc_spapr_tce_table struct
is not freed, and nor are the pages allocated for the iommu
tables.  In addition, we have already incremented the process's
count of locked memory pages, and this doesn't get restored on
error.

David Hildenbrand pointed out that there is a race in that the
function checks early on that there is not already an entry in the
stt->iommu_tables list with the same LIOBN, but an entry with the
same LIOBN could get added between then and when the new entry is
added to the list.

This fixes all three problems.  To simplify things, we now call
anon_inode_getfd() before placing the new entry in the list.  The
check for an existing entry is done while holding the kvm->lock
mutex, immediately before adding the new entry to the list.
Finally, on failure we now call kvmppc_account_memlimit to
decrement the process's count of locked memory pages.

Reported-by: Nixiaoming <nixiaoming@huawei.com>
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25 11:08:57 +02:00
Benjamin Herrenschmidt
bb9b52bd51 KVM: PPC: Book3S HV: Add missing barriers to XIVE code and document them
This adds missing memory barriers to order updates/tests of
the virtual CPPR and MFRR, thus fixing a lost IPI problem.

While at it also document all barriers in this file.

This fixes a bug causing guest IPIs to occasionally get lost.  The
symptom then is hangs or stalls in the guest.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-24 20:02:01 +10:00
Benjamin Herrenschmidt
2c4fb78f78 KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss
This adds a workaround for a bug in POWER9 DD1 chips where changing
the CPPR (Current Processor Priority Register) can cause bits in the
IPB (Interrupt Pending Buffer) to get lost.  Thankfully it only
happens when manually manipulating CPPR which is quite rare.  When it
does happen it can cause interrupts to be delayed or lost.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-24 20:01:39 +10:00
Nicholas Piggin
bd0fdb191c KVM: PPC: Book3S HV: Use msgsync with hypervisor doorbells on POWER9
When msgsnd is used for IPIs to other cores, msgsync must be executed by
the target to order stores performed on the source before its msgsnd
(provided the source executes the appropriate sync).

Fixes: 1704a81cce ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-24 20:01:39 +10:00
Michael Ellerman
15c659ff9d Merge branch 'fixes' into next
There's a non-trivial dependency between some commits we want to put in
next and the KVM prefetch work around that went into fixes. So merge
fixes into next.
2017-08-23 22:20:10 +10:00
Michael Ellerman
8434f0892e Merge branch 'topic/ppc-kvm' into next
Bring in the commit to rename find_linux_pte_or_hugepte() which touches
arch and KVM code, and might need to be merged with the kvmppc tree to
avoid conflicts.
2017-08-17 23:14:17 +10:00
Aneesh Kumar K.V
94171b19c3 powerpc/mm: Rename find_linux_pte_or_hugepte()
Add newer helpers to make the function usage simpler. It is always
recommended to use find_current_mm_pte() for walking the page table.
If we cannot use find_current_mm_pte(), it should be documented why
the said usage of __find_linux_pte() is safe against a parallel THP
split.

For now we have KVM code using __find_linux_pte(). This is because kvm
code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
with PACA soft_enabled = 1. We may want to fix that later and make
sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
we can switch kvm to use find_linux_pte().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17 23:13:46 +10:00
Longpeng(Mike)
199b5763d3 KVM: add spinlock optimization framework
If a vcpu exits due to request a user mode spinlock, then
the spinlock-holder may be preempted in user mode or kernel mode.
(Note that not all architectures trap spin loops in user mode,
only AMD x86 and ARM/ARM64 currently do).

But if a vcpu exits in kernel mode, then the holder must be
preempted in kernel mode, so we should choose a vcpu in kernel mode
as a more likely candidate for the lock holder.

This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
new argument says the same of the spinning VCPU.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Benjamin Herrenschmidt
870cfe77a9 powerpc/mm: Update definitions of DSISR bits
This updates the definitions for the various DSISR bits to
match both some historical stuff and to match new bits on
POWER9.

In addition, we define some masks corresponding to the "bad"
faults on Book3S, and some masks corresponding to the bits
that match between DSISR and SRR1 for a DSI and an ISI.

This comes with a small code update to change the definition
of DSISR_PGDIRFAULT which becomes DSISR_PRTABLE_FAULT to
match architecture 3.0B

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:43 +10:00
Michael Ellerman
bb272221e9 Linux v4.13-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZapWhAAoJEHm+PkMAQRiGKb0IAJM6b7SbWaw69Og7+qiFB+zZ
 xp29iXqbE9fPISC6a5BRQV1ONjeDM6opGixGHqGC8Hla6k2IYz25VDNoF8wd0MXN
 cz/Ih20vd3C5afxXGe5cTT8lsPAlV0mWXxForlu6j8jPeL62FPfq6RhEkw7AcrYL
 yfYy3k3qSdOrrvBdII0WAAUi46UfIs+we9BQgbsMbkHOiqV2K0MOrzKE84Xbgepq
 RAy2xg6P4b4+hTx8xTrYc1MXwpnqjRc0oJ08gdmiwW3AOOU7LxYFn7zDkLPWi9Rr
 g4x6r4YhBTGxT4wNvovLIiqd9QFs//dMCuPWYwEtTICG48umIqqq24beQ0mvCdg=
 =08Ic
 -----END PGP SIGNATURE-----

Merge tag 'v4.13-rc1' into fixes

The fixes branch is based off a random pre-rc1 commit, because we had
some fixes that needed to go in before rc1 was released.

However we now need to fix some code that went in after that point, but
before rc1, so merge rc1 to get that code into fixes so we can fix it!
2017-07-31 20:20:29 +10:00
Linus Torvalds
8155469341 s390: SRCU fix.
PPC: host crash fixes.
 x86: bugfixes, including making nested posted interrupts really work.
 Generic: tweaks to kvm_stat and to uevents
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJZe2EYAAoJEL/70l94x66D4JYH/AnvioKWTsplhUKt4Y4JlpJX
 EXYjQd/CIZ+MHNNUH+U+XEj6tKQymKrz4TeZSs1o0nyxCeyparR3gK27OYVpPspN
 GkPSit3hyRgW9r5uXp6pZCJuFCAMpMZ6z4sKbT1FxDhnWnpWayV9w8KA+yQT/UUX
 dNQ9JJPUxApcM4NCaj2OCQ8K1koNIDCc52+jATf0iK/Heiaf6UGqCcHXUIy5I5wM
 OWk05Qm32VBAYb6P6FfoyGdLMNAAkJtr1fyOJDkxX730CYgwpjIP0zifnJ1bt8V2
 YRnjvPO5QciDHbZ8VynwAkKi0ZAd8psjwXh0KbyahPL/2/sA2xCztMH25qweriI=
 =fsfr
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "s390:
   - SRCU fix

  PPC:
   - host crash fixes

  x86:
   - bugfixes, including making nested posted interrupts really work

  Generic:
   - tweaks to kvm_stat and to uevents"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: LAPIC: Fix reentrancy issues with preempt notifiers
  tools/kvm_stat: add '-f help' to get the available event list
  tools/kvm_stat: use variables instead of hard paths in help output
  KVM: nVMX: Fix loss of L2's NMI blocking state
  KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode
  x86: irq: Define a global vector for nested posted interrupts
  KVM: x86: do mask out upper bits of PAE CR3
  KVM: make pid available for uevents without debugfs
  KVM: s390: take srcu lock when getting/setting storage keys
  KVM: VMX: remove unused field
  KVM: PPC: Book3S HV: Fix host crash on changing HPT size
  KVM: PPC: Book3S HV: Enable TM before accessing TM registers
2017-07-28 13:36:56 -07:00
Linus Torvalds
080012bad6 powerpc fixes for 4.13 #4
The highlight is Ben's patch to work around a host killing bug when running KVM
 guests with the Radix MMU on Power9. See the long change log of that commit for
 more detail.
 
 And then three fairly minor fixes:
  - Fix of_node_put() underflow during reconfig remove, using old DLPAR tools.
  - Fix recently introduced ld version check with 64-bit LE-only toolchain.
  - Free the subpage_prot_table correctly, avoiding a memory leak.
 
 Thanks to:
   Aneesh Kumar K.V, Benjamin Herrenschmidt, Laurent Vivier.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZezIGAAoJEFHr6jzI4aWAyukP/3mxlQ3WdQlYPByZ18cj6YL5
 L0kbRxDgAosD9HzcOPqku1um1l7D6Gk5KZFZfsol7SasSmCZEwV4MbdTrAiRxS6K
 tF0V14hP/BDQeIKfxlnUepzfL8PY7CkO6sDAa6BjHXvBk4+POI+37uw9+2GEV8DY
 tA45fHjA/Zq3eUXsK0WTHIcd09lJXXarf9Tlx+YNZ+3yJ1OMfOji3CXgTkjtwYM9
 XTtsKzsagY1zLwr5gXJu1P05+OGna2VmY6+Tn2lnf7scTFW3qYGF3eWRx71diiKS
 PpZCjqfzWF4+TDIGPoYIrkTE+ZKR0lyo6F38GYwae0cYZMs9pGPEpeNahd8Nun+v
 MLU6TnhNfOI40GEYgmOMNKHPJJLSx59Qr/GnrAi/h2nUEocuN76jzNbaeFBtj3jD
 /vrRTmVUtt1wGqORX7BK4YZFHqcHmZBCM7bQnxibJtLv7fMue0sk58fs2jAaZ1iD
 NacpzsXG7CWYgj6ApclVCYuF99dXTpjrw/WPxilXDg84Pxb7Dv1SpIvLb2T5Guq+
 iqqavViRHP1ng+5/giIsOvF9CnsCzbRYLb0zZTP91nckMmYI6wX2zc56lofjcI5j
 Qc5o/aJvBk4vSM9sibBGEdrZJ1Vt16gGorQ5NZUurZund/cVqvQFhm/4Tvnc0cVN
 yvLNZI8am35pI9CCJ2im
 =Z6uF
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "The highlight is Ben's patch to work around a host killing bug when
  running KVM guests with the Radix MMU on Power9. See the long change
  log of that commit for more detail.

  And then three fairly minor fixes:

   - fix of_node_put() underflow during reconfig remove, using old DLPAR
     tools.

   - fix recently introduced ld version check with 64-bit LE-only
     toolchain.

   - free the subpage_prot_table correctly, avoiding a memory leak.

  Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Laurent Vivier"

* tag 'powerpc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/hash: Free the subpage_prot_table correctly
  powerpc/Makefile: Fix ld version check with 64-bit LE-only toolchain
  powerpc/pseries: Fix of_node_put() underflow during reconfig remove
  powerpc/mm/radix: Workaround prefetch issue with KVM
2017-07-28 13:25:15 -07:00
Benjamin Herrenschmidt
a25bd72bad powerpc/mm/radix: Workaround prefetch issue with KVM
There's a somewhat architectural issue with Radix MMU and KVM.

When coming out of a guest with AIL (Alternate Interrupt Location, ie,
MMU enabled), we start executing hypervisor code with the PID register
still containing whatever the guest has been using.

The problem is that the CPU can (and will) then start prefetching or
speculatively load from whatever host context has that same PID (if
any), thus bringing translations for that context into the TLB, which
Linux doesn't know about.

This can cause stale translations and subsequent crashes.

Fixing this in a way that is neither racy nor a huge performance
impact is difficult. We could just make the host invalidations always
use broadcast forms but that would hurt single threaded programs for
example.

We chose to fix it instead by partitioning the PID space between guest
and host. This is possible because today Linux only use 19 out of the
20 bits of PID space, so existing guests will work if we make the host
use the top half of the 20 bits space.

We additionally add support for a property to indicate to Linux the
size of the PID register which will be useful if we eventually have
processors with a larger PID space available.

There is still an issue with malicious guests purposefully setting the
PID register to a value in the hosts PID range. Hopefully future HW
can prevent that, but in the meantime, we handle it with a pair of
kludges:

 - On the way out of a guest, before we clear the current VCPU in the
   PACA, we check the PID and if it's outside of the permitted range
   we flush the TLB for that PID.

 - When context switching, if the mm is "new" on that CPU (the
   corresponding bit was set for the first time in the mm cpumask), we
   check if any sibling thread is in KVM (has a non-NULL VCPU pointer
   in the PACA). If that is the case, we also flush the PID for that
   CPU (core).

This second part is needed to handle the case where a process is
migrated (or starts a new pthread) on a sibling thread of the CPU
coming out of KVM, as there's a window where stale translations can
exist before we detect it and flush them out.

A future optimization could be added by keeping track of whether the
PID has ever been used and avoid doing that for completely fresh PIDs.
We could similarily mark PIDs that have been the subject of a global
invalidation as "fresh". But for now this will do.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
      unneeded include of kvm_book3s_asm.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26 16:41:52 +10:00
Paul Mackerras
ef42719814 KVM: PPC: Book3S HV: Fix host crash on changing HPT size
Commit f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB
ioctl() to change HPT size", 2016-12-20) changed the behaviour of
the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT
and new revmap array if there was a previously-allocated HPT of a
different size from the size being requested.  In this case, we need
to reset the rmap arrays of the memslots, because the rmap arrays
will contain references to HPTEs which are no longer valid.  Worse,
these references are also references to slots in the new revmap
array (which parallels the HPT), and the new revmap array contains
random contents, since it doesn't get zeroed on allocation.

The effect of having these stale references to slots in the revmap
array that contain random contents is that subsequent calls to
functions such as kvmppc_add_revmap_chain will crash because they
will interpret the non-zero contents of the revmap array as HPTE
indexes and thus index outside of the revmap array.  This leads to
host crashes such as the following.

[ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8
[ 7072.862218] Faulting instruction address: 0xc0000000000e1c78
[ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1]
[ 7072.862286] SMP NR_CPUS=1024
[ 7072.862286] NUMA
[ 7072.862325] PowerNV
[ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry
[ 7072.863085]  nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv]
[ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59
[ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000
[ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0
[ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300   Not tainted  (4.12.0-kvm+)
[ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]>
[ 7072.863667]   CR: 44082882  XER: 20000000
[ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1
GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0
GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000
GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf
GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78
GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000
GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190
GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000
GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048
[ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130
[ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
[ 7072.864566] Call Trace:
[ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable)
[ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
[ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv]
[ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv]
[ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm]
[ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm]
[ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0
[ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130
[ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c
[ 7072.865644] Instruction dump:
[ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14
[ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000
[ 7072.866001] ---[ end trace 627b6e4bf8080edc ]---

Note that to trigger this, it is necessary to use a recent upstream
QEMU (or other userspace that resizes the HPT at CAS time), specify
a maximum memory size substantially larger than the current memory
size, and boot a guest kernel that does not support HPT resizing.

This fixes the problem by resetting the rmap arrays when the old HPT
is freed.

Fixes: f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
Cc: stable@vger.kernel.org # v4.11+
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-24 16:10:44 +10:00
Paul Mackerras
e470571514 KVM: PPC: Book3S HV: Enable TM before accessing TM registers
Commit 46a704f840 ("KVM: PPC: Book3S HV: Preserve userspace HTM state
properly", 2017-06-15) added code to read transactional memory (TM)
registers but forgot to enable TM before doing so.  The result is
that if userspace does have live values in the TM registers, a KVM_RUN
ioctl will cause a host kernel crash like this:

[  181.328511] Unrecoverable TM Unavailable Exception f60 at d00000001e7d9980
[  181.328605] Oops: Unrecoverable TM Unavailable Exception, sig: 6 [#1]
[  181.328613] SMP NR_CPUS=2048
[  181.328613] NUMA
[  181.328618] PowerNV
[  181.328646] Modules linked in: vhost_net vhost tap nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 dns_resolver nfs
+fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
+nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables
+ip6table_filter ip6_tables iptable_filter bridge stp llc kvm_hv kvm nfsd ses enclosure scsi_transport_sas ghash_generic
+auth_rpcgss gf128mul xts sg ctr nfs_acl lockd vmx_crypto shpchp ipmi_powernv i2c_opal grace ipmi_devintf i2c_core
+powernv_rng sunrpc ipmi_msghandler ibmpowernv uio_pdrv_genirq uio leds_powernv powernv_op_panel ip_tables xfs sd_mod
+lpfc ipr bnx2x libata mdio ptp pps_core scsi_transport_fc libcrc32c dm_mirror dm_region_hash dm_log dm_mod
[  181.329278] CPU: 40 PID: 9926 Comm: CPU 0/KVM Not tainted 4.12.0+ #1
[  181.329337] task: c000003fc6980000 task.stack: c000003fe4d80000
[  181.329396] NIP: d00000001e7d9980 LR: d00000001e77381c CTR: d00000001e7d98f0
[  181.329465] REGS: c000003fe4d837e0 TRAP: 0f60   Not tainted  (4.12.0+)
[  181.329523] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[  181.329527]   CR: 24022448  XER: 00000000
[  181.329608] CFAR: d00000001e773818 SOFTE: 1
[  181.329608] GPR00: d00000001e77381c c000003fe4d83a60 d00000001e7ef410 c000003fdcfe0000
[  181.329608] GPR04: c000003fe4f00000 0000000000000000 0000000000000000 c000003fd7954800
[  181.329608] GPR08: 0000000000000001 c000003fc6980000 0000000000000000 d00000001e7e2880
[  181.329608] GPR12: d00000001e7d98f0 c000000007b19000 00000001295220e0 00007fffc0ce2090
[  181.329608] GPR16: 0000010011886608 00007fff8c89f260 0000000000000001 00007fff8c080028
[  181.329608] GPR20: 0000000000000000 00000100118500a6 0000010011850000 0000010011850000
[  181.329608] GPR24: 00007fffc0ce1b48 0000010011850000 00000000d673b901 0000000000000000
[  181.329608] GPR28: 0000000000000000 c000003fdcfe0000 c000003fdcfe0000 c000003fe4f00000
[  181.330199] NIP [d00000001e7d9980] kvmppc_vcpu_run_hv+0x90/0x6b0 [kvm_hv]
[  181.330264] LR [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm]
[  181.330322] Call Trace:
[  181.330351] [c000003fe4d83a60] [d00000001e773478] kvmppc_set_one_reg+0x48/0x340 [kvm] (unreliable)
[  181.330437] [c000003fe4d83b30] [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm]
[  181.330513] [c000003fe4d83b50] [d00000001e7700b4] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
[  181.330586] [c000003fe4d83bd0] [d00000001e7642f8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[  181.330658] [c000003fe4d83d40] [c0000000003451b8] do_vfs_ioctl+0xc8/0x8b0
[  181.330717] [c000003fe4d83de0] [c000000000345a64] SyS_ioctl+0xc4/0x120
[  181.330776] [c000003fe4d83e30] [c00000000000b004] system_call+0x58/0x6c
[  181.330833] Instruction dump:
[  181.330869] e92d0260 e9290b50 e9290108 792807e3 41820058 e92d0260 e9290b50 e9290108
[  181.330941] 792ae8a4 794a1f87 408204f4 e92d0260 <7d4022a6> f9490ff0 e92d0260 7d4122a6
[  181.331013] ---[ end trace 6f6ddeb4bfe92a92 ]---

The fix is just to turn on the TM bit in the MSR before accessing the
registers.

Cc: stable@vger.kernel.org # v3.14+
Fixes: 46a704f840 ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly")
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-24 16:10:44 +10:00
Michal Hocko
dcda9b0471 mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success.  This will work independent of the order and overrides the
default allocator behavior.  Page allocator users have several levels of
guarantee vs.  cost options (take GFP_KERNEL as an example)

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Linus Torvalds
d691b7e7d1 powerpc updates for 4.13
Highlights include:
 
  - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
 
  - Platform support for FSP2 (476fpe) board
 
  - Enable ZONE_DEVICE on 64-bit server CPUs.
 
  - Generic & powerpc spin loop primitives to optimise busy waiting
 
  - Convert VDSO update function to use new update_vsyscall() interface
 
  - Optimisations to hypercall/syscall/context-switch paths
 
  - Improvements to the CPU idle code on Power8 and Power9.
 
 As well as many other fixes and improvements.
 
 Thanks to:
   Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton
   Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
   Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian
   Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan,
   Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo
   Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul
   Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell,
   Thiago Jung Bauermann, Yang Li.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZXyPCAAoJEFHr6jzI4aWAI9QQAISf2x5y//cqCi4ISyQB5pTq
 KLS/yQajNkQOw7c0fzBZOaH5Xd/SJ6AcKWDg8yDlpDR3+sRRsr98iIRECgKS5I7/
 DxD9ywcbSoMXFQQo1ZMCp5CeuMUIJRtugBnUQM+JXCSUCPbznCY5DchDTLyTBTpO
 MeMVhI//JxthhoOMA9MudiEGaYCU9ho442Z4OJUSiLUv8WRbvQX9pTqoc4vx1fxA
 BWf2mflztBVcIfKIyxIIIlDLukkMzix6gSYPMCbC7lzkbnU7JSqKiheJXjo1gJS2
 ePHKDxeNR2/QP0g/j3aT/MR1uTt9MaNBSX3gANE1xQ9OoJ8m1sOtCO4gNbSdLWae
 eXhDnoiEp930DRZOeEioOItuWWoxFaMyYk3BMmRKV4QNdYL3y3TRVeL2/XmRGqYL
 Lxz4IY/x5TteFEJNGcRX90uizNSi8AaEXPF16pUk8Ctt6eH3ZSwPMv2fHeYVCMr0
 KFlKHyaPEKEoztyzLcUR6u9QB56yxDN58bvLpd32AeHvKhqyxFoySy59x9bZbatn
 B2y8mmDItg860e0tIG6jrtplpOVvL8i5jla5RWFVoQDuxxrLAds3vG9JZQs+eRzx
 Fiic93bqeUAS6RzhXbJ6QQJYIyhE7yqpcgv7ME1W87SPef3HPBk9xlp3yIDwdA2z
 bBDvrRnvupusz8qCWrxe
 =w8rj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.

   - Platform support for FSP2 (476fpe) board

   - Enable ZONE_DEVICE on 64-bit server CPUs.

   - Generic & powerpc spin loop primitives to optimise busy waiting

   - Convert VDSO update function to use new update_vsyscall() interface

   - Optimisations to hypercall/syscall/context-switch paths

   - Improvements to the CPU idle code on Power8 and Power9.

  As well as many other fixes and improvements.

  Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
  Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
  Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
  Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
  Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
  Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
  Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
  Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
  Bauermann, Yang Li"

* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
  powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
  powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
  powerpc/mm/hash: Implement mark_rodata_ro() for hash
  powerpc/vmlinux.lds: Align __init_begin to 16M
  powerpc/lib/code-patching: Use alternate map for patch_instruction()
  powerpc/xmon: Add patch_instruction() support for xmon
  powerpc/kprobes/optprobes: Use patch_instruction()
  powerpc/kprobes: Move kprobes over to patch_instruction()
  powerpc/mm/radix: Fix execute permissions for interrupt_vectors
  powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
  powerpc/64s: Blacklist rtas entry/exit from kprobes
  powerpc/64s: Blacklist functions invoked on a trap
  powerpc/64s: Un-blacklist system_call() from kprobes
  powerpc/64s: Move system_call() symbol to just after setting MSR_EE
  powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
  powerpc/64s: Convert .L__replay_interrupt_return to a local label
  powerpc64/elfv1: Only dereference function descriptor for non-text symbols
  cxl: Export library to support IBM XSL
  powerpc/dts: Use #include "..." to include local DT
  powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
  ...
2017-07-07 13:55:45 -07:00
Linus Torvalds
c136b84393 PPC:
- Better machine check handling for HV KVM
 - Ability to support guests with threads=2, 4 or 8 on POWER9
 - Fix for a race that could cause delayed recognition of signals
 - Fix for a bug where POWER9 guests could sleep with interrupts pending.
 
 ARM:
 - VCPU request overhaul
 - allow timer and PMU to have their interrupt number selected from userspace
 - workaround for Cavium erratum 30115
 - handling of memory poisonning
 - the usual crop of fixes and cleanups
 
 s390:
 - initial machine check forwarding
 - migration support for the CMMA page hinting information
 - cleanups and fixes
 
 x86:
 - nested VMX bugfixes and improvements
 - more reliable NMI window detection on AMD
 - APIC timer optimizations
 
 Generic:
 - VCPU request overhaul + documentation of common code patterns
 - kvm_stat improvements
 
 There is a small conflict in arch/s390 due to an arch-wide field rename.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJZW4XTAAoJEL/70l94x66DkhMH/izpk54KI17PtyQ9VYI2sYeZ
 BWK6Kl886g3ij4pFi3pECqjDJzWaa3ai+vFfzzpJJ8OkCJT5Rv4LxC5ERltVVmR8
 A3T1I/MRktSC0VJLv34daPC2z4Lco/6SPipUpPnL4bE2HATKed4vzoOjQ3tOeGTy
 dwi7TFjKwoVDiM7kPPDRnTHqCe5G5n13sZ49dBe9WeJ7ttJauWqoxhlYosCGNPEj
 g8ZX8+cvcAhVnz5uFL8roqZ8ygNEQq2mgkU18W8ZZKuiuwR0gdsG0gSBFNTdwIMK
 NoreRKMrw0+oLXTIB8SZsoieU6Qi7w3xMAMabe8AJsvYtoersugbOmdxGCr1lsA=
 =OD7H
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "PPC:
   - Better machine check handling for HV KVM
   - Ability to support guests with threads=2, 4 or 8 on POWER9
   - Fix for a race that could cause delayed recognition of signals
   - Fix for a bug where POWER9 guests could sleep with interrupts pending.

  ARM:
   - VCPU request overhaul
   - allow timer and PMU to have their interrupt number selected from userspace
   - workaround for Cavium erratum 30115
   - handling of memory poisonning
   - the usual crop of fixes and cleanups

  s390:
   - initial machine check forwarding
   - migration support for the CMMA page hinting information
   - cleanups and fixes

  x86:
   - nested VMX bugfixes and improvements
   - more reliable NMI window detection on AMD
   - APIC timer optimizations

  Generic:
   - VCPU request overhaul + documentation of common code patterns
   - kvm_stat improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  Update my email address
  kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
  x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
  kvm: x86: mmu: allow A/D bits to be disabled in an mmu
  x86: kvm: mmu: make spte mmio mask more explicit
  x86: kvm: mmu: dead code thanks to access tracking
  KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
  KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
  KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
  KVM: x86: remove ignored type attribute
  KVM: LAPIC: Fix lapic timer injection delay
  KVM: lapic: reorganize restart_apic_timer
  KVM: lapic: reorganize start_hv_timer
  kvm: nVMX: Check memory operand to INVVPID
  KVM: s390: Inject machine check into the nested guest
  KVM: s390: Inject machine check into the guest
  tools/kvm_stat: add new interactive command 'b'
  tools/kvm_stat: add new command line switch '-i'
  tools/kvm_stat: fix error on interactive command 'g'
  KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
  ...
2017-07-06 18:38:31 -07:00
Linus Torvalds
9a9594efe5 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug updates from Thomas Gleixner:
 "This update is primarily a cleanup of the CPU hotplug locking code.

  The hotplug locking mechanism is an open coded RWSEM, which allows
  recursive locking. The main problem with that is the recursive nature
  as it evades the full lockdep coverage and hides potential deadlocks.

  The rework replaces the open coded RWSEM with a percpu RWSEM and
  establishes full lockdep coverage that way.

  The bulk of the changes fix up recursive locking issues and address
  the now fully reported potential deadlocks all over the place. Some of
  these deadlocks have been observed in the RT tree, but on mainline the
  probability was low enough to hide them away."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  cpu/hotplug: Constify attribute_group structures
  powerpc: Only obtain cpu_hotplug_lock if called by rtasd
  ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init
  cpu/hotplug: Remove unused check_for_tasks() function
  perf/core: Don't release cred_guard_mutex if not taken
  cpuhotplug: Link lock stacks for hotplug callbacks
  acpi/processor: Prevent cpu hotplug deadlock
  sched: Provide is_percpu_thread() helper
  cpu/hotplug: Convert hotplug locking to percpu rwsem
  s390: Prevent hotplug rwsem recursion
  arm: Prevent hotplug rwsem recursion
  arm64: Prevent cpu hotplug rwsem recursion
  kprobes: Cure hotplug lock ordering issues
  jump_label: Reorder hotplug lock and jump_label_lock
  perf/tracing/cpuhotplug: Fix locking order
  ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus()
  PCI: Replace the racy recursion prevention
  PCI: Use cpu_hotplug_disable() instead of get_online_cpus()
  perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode()
  x86/perf: Drop EXPORT of perf_check_microcode
  ...
2017-07-03 18:08:06 -07:00
Michael Ellerman
218ea31039 Merge branch 'fixes' into next
Merge our fixes branch, a few of them are tripping people up while
working on top of next, and we also have a dependency between the CXL
fixes and new CXL code we want to merge into next.
2017-07-03 23:05:43 +10:00
Paolo Bonzini
8a53e7e572 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts
  pending.
2017-07-03 10:41:59 +02:00
Paul Mackerras
00c14757f6 KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
This fixes a typo where the wrong loop index was used to index
the kvmppc_xive_vcpu.queues[] array in xive_pre_save_scan().
The variable i contains the vcpu number; we need to index queues[]
using j, which iterates from 0 to KVMPPC_XIVE_Q_COUNT-1.

The effect of this bug is that things that save the interrupt
controller state, such as "virsh dump", on a VM with more than
8 vCPUs, result in xive_pre_save_queue() getting called on a
bogus queue structure, usually resulting in a crash like this:

[  501.821107] Unable to handle kernel paging request for data at address 0x00000084
[  501.821212] Faulting instruction address: 0xc008000004c7c6f8
[  501.821234] Oops: Kernel access of bad area, sig: 11 [#1]
[  501.821305] SMP NR_CPUS=1024
[  501.821307] NUMA
[  501.821376] PowerNV
[  501.821470] Modules linked in: vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel kvm_hv nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm tg3 ptp pps_core
[  501.822477] CPU: 3 PID: 3934 Comm: live_migration Not tainted 4.11.0-4.git8caa70f.el7.centos.ppc64le #1
[  501.822633] task: c0000003f9e3ae80 task.stack: c0000003f9ed4000
[  501.822745] NIP: c008000004c7c6f8 LR: c008000004c7c628 CTR: 0000000030058018
[  501.822877] REGS: c0000003f9ed7980 TRAP: 0300   Not tainted  (4.11.0-4.git8caa70f.el7.centos.ppc64le)
[  501.823030] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[  501.823047]   CR: 28022244  XER: 00000000
[  501.823203] CFAR: c008000004c7c77c DAR: 0000000000000084 DSISR: 40000000 SOFTE: 1
[  501.823203] GPR00: c008000004c7c628 c0000003f9ed7c00 c008000004c91450 00000000000000ff
[  501.823203] GPR04: c0000003f5580000 c0000003f559bf98 9000000000009033 0000000000000000
[  501.823203] GPR08: 0000000000000084 0000000000000000 00000000000001e0 9000000000001003
[  501.823203] GPR12: c00000000008a7d0 c00000000fdc1b00 000000000a9a0000 0000000000000000
[  501.823203] GPR16: 00000000402954e8 000000000a9a0000 0000000000000004 0000000000000000
[  501.823203] GPR20: 0000000000000008 c000000002e8f180 c000000002e8f1e0 0000000000000001
[  501.823203] GPR24: 0000000000000008 c0000003f5580008 c0000003f4564018 c000000002e8f1e8
[  501.823203] GPR28: 00003ff6e58bdc28 c0000003f4564000 0000000000000000 0000000000000000
[  501.825441] NIP [c008000004c7c6f8] xive_get_attr+0x3b8/0x5b0 [kvm]
[  501.825671] LR [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm]
[  501.825887] Call Trace:
[  501.825991] [c0000003f9ed7c00] [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm] (unreliable)
[  501.826312] [c0000003f9ed7cd0] [c008000004c62ec4] kvm_device_ioctl_attr+0x64/0xa0 [kvm]
[  501.826581] [c0000003f9ed7d20] [c008000004c62fcc] kvm_device_ioctl+0xcc/0xf0 [kvm]
[  501.826843] [c0000003f9ed7d40] [c000000000350c70] do_vfs_ioctl+0xd0/0x8c0
[  501.827060] [c0000003f9ed7de0] [c000000000351534] SyS_ioctl+0xd4/0xf0
[  501.827282] [c0000003f9ed7e30] [c00000000000b8e0] system_call+0x38/0xfc
[  501.827496] Instruction dump:
[  501.827632] 419e0078 3b760008 e9160008 83fb000c 83db0010 80fb0008 2f280000 60000000
[  501.827901] 60000000 60420000 419a0050 7be91764 <7d284c2c> 552a0ffe 7f8af040 419e003c
[  501.828176] ---[ end trace 2d0529a5bbbbafed ]---

Cc: stable@vger.kernel.org
Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-03 10:40:22 +02:00
Paul Mackerras
8b24e69fc4 KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
At present, interrupts are hard-disabled fairly late in the guest
entry path, in the assembly code.  Since we check for pending signals
for the vCPU(s) task(s) earlier in the guest entry path, it is
possible for a signal to be delivered before we enter the guest but
not be noticed until after we exit the guest for some other reason.

Similarly, it is possible for the scheduler to request a reschedule
while we are in the guest entry path, and we won't notice until after
we have run the guest, potentially for a whole timeslice.

Furthermore, with a radix guest on POWER9, we can take the interrupt
with the MMU on.  In this case we end up leaving interrupts
hard-disabled after the guest exit, and they are likely to stay
hard-disabled until we exit to userspace or context-switch to
another process.  This was masking the fact that we were also not
setting the RI (recoverable interrupt) bit in the MSR, meaning
that if we had taken an interrupt, it would have crashed the host
kernel with an unrecoverable interrupt message.

To close these races, we need to check for signals and reschedule
requests after hard-disabling interrupts, and then keep interrupts
hard-disabled until we enter the guest.  If there is a signal or a
reschedule request from another CPU, it will send an IPI, which will
cause a guest exit.

This puts the interrupt disabling before we call kvmppc_start_thread()
for all the secondary threads of this core that are going to run vCPUs.
The reason for that is that once we have started the secondary threads
there is no easy way to back out without going through at least part
of the guest entry path.  However, kvmppc_start_thread() includes some
code for radix guests which needs to call smp_call_function(), which
must be called with interrupts enabled.  To solve this problem, this
patch moves that code into a separate function that is called earlier.

When the guest exit is caused by an external interrupt, a hypervisor
doorbell or a hypervisor maintenance interrupt, we now handle these
using the replay facility.  __kvmppc_vcore_entry() now returns the
trap number that caused the exit on this thread, and instead of the
assembly code jumping to the handler entry, we return to C code with
interrupts still hard-disabled and set the irq_happened flag in the
PACA, so that when we do local_irq_enable() the appropriate handler
gets called.

With all this, we now have the interrupt soft-enable flag clear while
we are in the guest.  This is useful because code in the real-mode
hypercall handlers that checks whether interrupts are enabled will
now see that they are disabled, which is correct, since interrupts
are hard-disabled in the real-mode code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-01 18:59:38 +10:00
Paul Mackerras
898b25b202 KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
Since commit b009031f74 ("KVM: PPC: Book3S HV: Take out virtual
core piggybacking code", 2016-09-15), we only have at most one
vcore per subcore.  Previously, the fact that there might be more
than one vcore per subcore meant that we had the notion of a
"master vcore", which was the vcore that controlled thread 0 of
the subcore.  We also needed a list per subcore in the core_info
struct to record which vcores belonged to each subcore.  Now that
there can only be one vcore in the subcore, we can replace the
list with a simple pointer and get rid of the notion of the
master vcore (and in fact treat every vcore as a master vcore).

We can also get rid of the subcore_vm[] field in the core_info
struct since it is never read.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-01 18:59:01 +10:00
Paolo Bonzini
04a7ea04d5 KVM/ARM updates for 4.13
- vcpu request overhaul
 - allow timer and PMU to have their interrupt number
   selected from userspace
 - workaround for Cavium erratum 30115
 - handling of memory poisonning
 - the usual crop of fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAllWCM0VHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDjJ0QAI16x6+trKhH31lTSYekYfqm4hZ2
 Fp7IbALW9KNCaY35tZov2Zuh99qGRduxTh7ewqhKpON8kkU+UKj0F7zH22+vfN4m
 yas/+uNr8R9VLyvea4ysPsgx8Q8v1Ix9setohHYNZIL9/klVqtaHpYvArHVF/mzq
 p2j/NxRS2dlp9r2TtoMRMhA05u6r0wolhUuh+z9v2ipib0gfOBIG24jsqCTEcD9n
 5A/cVd+ztYshkrV95h3y9peahwt3zOA4QBGzrQ2K25jp0s54nqhmC7JTNSa8dtar
 YGW2MuAMoIFTwCFAlpwCzrwpOJFzF3Q6A8bOxei2fjclzjPMgT1xQxuhOoe4ntFa
 lTPxSHalm5W6dFTW90YSo2DBcPe+N7sQkhjR0cCeY3GYsOFhXMLTlOl5Pt1YK1or
 +3FAI74tFRKvVmb9mhZeGTvuzhDgRvtf3Qq5rjwlGzKc2BBOEgtMyj/Wgwo4N6Dz
 IjOnoRaUGELoBCWoTorMxLpsPBdPVSUxNyJTdAhqZ/ZtT1xqjhFNLZcrVWmOTzDM
 1cav+jZkla4sLmJSNDD54aCSvvtPHis0nZn9PRlh12xgOyYiAVx4K++MNuWP0P37
 hbh1gbPT+FcoVxPurUsX/pjNlTucPZcBwFytZDQlpwtPBpEFzJiImLYe/PldRb0f
 9WQOH1Y1+q14MF+N
 =6hNK
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.13

- vcpu request overhaul
- allow timer and PMU to have their interrupt number
  selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups

Conflicts:
	arch/s390/include/asm/kvm_host.h
2017-06-30 12:38:26 +02:00
Balbir Singh
0428491cba powerpc/mm: Trace tlbie(l) instructions
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
Entry (Local)) instructions.

The tlbie instruction has changed over the years, so not all versions
accept the same operands. Use the ISA v3 field operands because they are
the most verbose, we may change them in future.

Example output:

  qemu-system-ppc-5371  [016]  1412.369519: tlbie:
  	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Add some missing trace_tlbie()s, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-23 21:14:49 +10:00
Paul Mackerras
2ed4f9dd19 KVM: PPC: Book3S HV: Add capability to report possible virtual SMT modes
Now that userspace can set the virtual SMT mode by enabling the
KVM_CAP_PPC_SMT capability, it is useful for userspace to be able
to query the set of possible virtual SMT modes.  This provides a
new capability, KVM_CAP_PPC_SMT_POSSIBLE, to provide this
information.  The return value is a bitmap of possible modes, with
bit N set if virtual SMT mode 2^N is available.  That is, 1 indicates
SMT1 is available, 2 indicates that SMT2 is available, 3 indicates
that both SMT1 and SMT2 are available, and so on.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-22 11:25:31 +10:00
Aravinda Prasad
e20bbd3d8d KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled
Enhance KVM to cause a guest exit with KVM_EXIT_NMI
exit reason upon a machine check exception (MCE) in
the guest address space if the KVM_CAP_PPC_FWNMI
capability is enabled (instead of delivering a 0x200
interrupt to guest). This enables QEMU to build error
log and deliver machine check exception to guest via
guest registered machine check handler.

This approach simplifies the delivery of machine
check exception to guest OS compared to the earlier
approach of KVM directly invoking 0x200 guest interrupt
vector.

This design/approach is based on the feedback for the
QEMU patches to handle machine check exception. Details
of earlier approach of handling machine check exception
in QEMU and related discussions can be found at:

https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html

Note:

This patch now directly invokes machine_check_print_event_info()
from kvmppc_handle_exit_hv() to print the event to host console
at the time of guest exit before the exception is passed on to the
guest. Hence, the host-side handling which was performed earlier
via machine_check_fwnmi is removed.

The reasons for this approach is (i) it is not possible
to distinguish whether the exception occurred in the
guest or the host from the pt_regs passed on the
machine_check_exception(). Hence machine_check_exception()
calls panic, instead of passing on the exception to
the guest, if the machine check exception is not
recoverable. (ii) the approach introduced in this
patch gives opportunity to the host kernel to perform
actions in virtual mode before passing on the exception
to the guest. This approach does not require complex
tweaks to machine_check_fwnmi and friends.

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-22 11:24:57 +10:00
Aravinda Prasad
134764ed6e KVM: PPC: Book3S HV: Add new capability to control MCE behaviour
This introduces a new KVM capability to control how KVM behaves
on machine check exception (MCE) in HV KVM guests.

If this capability has not been enabled, KVM redirects machine check
exceptions to guest's 0x200 vector, if the address in error belongs to
the guest. With this capability enabled, KVM will cause a guest exit
with the exit reason indicating an NMI.

The new capability is required to avoid problems if a new kernel/KVM
is used with an old QEMU, running a guest that doesn't issue
"ibm,nmi-register".  As old QEMU does not understand the NMI exit
type, it treats it as a fatal error.  However, the guest could have
handled the machine check error if the exception was delivered to
guest's 0x200 interrupt vector instead of NMI exit in case of old
QEMU.

[paulus@ozlabs.org - Reworded the commit message to be clearer,
 enable only on HV KVM.]

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-21 13:37:08 +10:00
Radim Krčmář
c72544d85f Merge branch 'kvm-ppc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* fix problems that could cause hangs or crashes in the host on POWER9
* fix problems that could allow guests to potentially affect or disrupt
  the execution of the controlling userspace
2017-06-20 14:32:57 +02:00
Paul Mackerras
ee3308a254 KVM: PPC: Book3S HV: Don't sleep if XIVE interrupt pending on POWER9
On a POWER9 system, it is possible for an interrupt to become pending
for a VCPU when that VCPU is about to cede (execute a H_CEDE hypercall)
and has already disabled interrupts, or in the H_CEDE processing up
to the point where the XIVE context is pulled from the hardware.  In
such a case, the H_CEDE should not sleep, but should return immediately
to the guest.  However, the conditions tested in kvmppc_vcpu_woken()
don't include the condition that a XIVE interrupt is pending, so the
VCPU could sleep until the next decrementer interrupt.

To fix this, we add a new xive_interrupt_pending() helper which looks
in the XIVE context that was pulled from the hardware to see if the
priority of any pending interrupt is higher (numerically lower than)
the CPU priority.  If so then kvmppc_vcpu_woken() will return true.
If the XIVE context has never been used, then both the pipr and the
cppr fields will be zero and the test will indicate that no interrupt
is pending.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-20 15:46:12 +10:00
Nicholas Piggin
9d29250136 powerpc/64s/idle: Avoid SRR usage in idle sleep/wake paths
Idle code now always runs at the 0xc... effective address whether
in real or virtual mode. This means rfid can be ditched, along
with a lot of SRR manipulations.

In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change
MSR states as required.

This also balances the return prediction for the idle call, by
doing blr rather than rfid to return to the idle caller.

On POWER9, 2-process context switch on different cores, with snooze
disabled, increases performance by 2%.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Incorporate v2 fixes from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-19 19:46:29 +10:00
Paul Mackerras
579006944e KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent.  However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.

A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers.  However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.

In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register.  The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.

To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1.  If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware.  The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag.  This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.

Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running.  In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field.  We then use that value in
constructing the emulated DPDES register value.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:34:37 +10:00
Paul Mackerras
3c31352460 KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode
This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core.  Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering.  This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core.  With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.

On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore.  On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:34:20 +10:00
Paul Mackerras
769377f77c KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9
This adds code to allow us to use a different value for the HFSCR
(Hypervisor Facilities Status and Control Register) when running the
guest from that which applies in the host.  The reason for doing this
is to allow us to trap the msgsndp instruction and related operations
in future so that they can be virtualized.  We also save the value of
HFSCR when a hypervisor facility unavailable interrupt occurs, because
the high byte of HFSCR indicates which facility the guest attempted to
access.

We save and restore the host value on guest entry/exit because some
bits of it affect host userspace execution.

We only do all this on POWER9, not on POWER8, because we are not
intending to virtualize any of the facilities controlled by HFSCR on
POWER8.  In particular, the HFSCR bit that controls execution of
msgsndp and related operations does not exist on POWER8.  The HFSCR
doesn't exist at all on POWER7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:08:02 +10:00
Paul Mackerras
1da4e2f4fb KVM: PPC: Book3S HV: Don't let VCPU sleep if it has a doorbell pending
It is possible, through a narrow race condition, for a VCPU to exit
the guest with a H_CEDE hypercall while it has a doorbell interrupt
pending.  In this case, the H_CEDE should return immediately, but in
fact it puts the VCPU to sleep until some other interrupt becomes
pending or a prod is received (via another VCPU doing H_PROD).

This fixes it by checking the DPDES (Directed Privileged Doorbell
Exception Status) bit for the thread along with the other interrupt
pending bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:05:22 +10:00
Paul Mackerras
1bc3fe818c KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9
This allows userspace (e.g. QEMU) to enable large decrementer mode for
the guest when running on a POWER9 host, by setting the LPCR_LD bit in
the guest LPCR value.  With this, the guest exit code saves 64 bits of
the guest DEC value on exit.  Other places that use the guest DEC
value check the LPCR_LD bit in the guest LPCR value, and if it is set,
omit the 32-bit sign extension that would otherwise be done.

This doesn't change the DEC emulation used by PR KVM because PR KVM
is not supported on POWER9 yet.

This is partly based on an earlier patch by Oliver O'Halloran.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:02:04 +10:00
Linus Torvalds
5ac447d268 powerpc fixes for 4.12 #6
Three small fixes for recently merged code:
 
  - remove a spurious WARN_ON when a PCI device has no of_node, it's allowed in
    some circumstances for there to be no of_node.
 
  - fix the offset for store EOI MMIOs in the XIVE interrupt controller.
 
  - fix non-const WARN_ONs which were becoming BUGs due to them losing
    BUGFLAG_WARNING in a recent cleanup patch.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Benjamin Herrenschmidt.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZQ7XeAAoJEFHr6jzI4aWA2bgP/R++c9YehNdDKbZyLqumY+6q
 7ns8NoxgEW/Gc8JuSTE4MW51q2HimBJu6ntyHfMUwgpGQpGzOGDn6g8OxVfjOySa
 kFc7cytgOOhEpTHENDZ3xxZtcSd9iafyX9ga/0dz6UycfEHcZayiXDRuXffRzJwa
 RNqbwDxOtkgn6w4bW02SRlfDSTra+zQZQd6NsPXSJJgF+tb3MflMj1A9WoJp/mj/
 tXc9fpKQsZkIG/AvAHziizHqeAKJxUrmoVb8qy1SYTKVUDZoxTYgiO1G1nebZX/s
 Zzsdd/fcHcd0DIEJkjf2V3cegmIGTLzw7mUOodU7IF3mZ1LPgCMVF5lTTZzjcXDQ
 d1gugVojHnGr7KB3lNNijyHxsmHG7LdTQmRHcyZ2L8KYpa3/+Ca3ZuFnTwjvgRNx
 dJEFX5JdAhCrkg1B73rvcjKCFg0ysVIrkdf27SaameaQdQQuZU4+5+s1LB2EqJQr
 II3+pnZr/RF3OWu4yJE5KAHX5ZBQQ+unzVPpW4pqvwYMoVKhO7dhCPPISeRCtzJE
 +po5Ys4ncheSRhwf5dQhf+H04kXmL6ekpl1GJOBB3BskJcsIr8hiLp3/mF238et1
 80o6yTAJLADKUIl75ISiePz+KFZNamgke1/XWZolfHYZ9dNRF0c//E0qvpopz8jE
 F90hxEAtJ9ws/VUlo40Q
 =Mnxp
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.12-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Three small fixes for recently merged code:

   - remove a spurious WARN_ON when a PCI device has no of_node, it's
     allowed in some circumstances for there to be no of_node.

   - fix the offset for store EOI MMIOs in the XIVE interrupt
     controller.

   - fix non-const WARN_ONs which were becoming BUGs due to them losing
     BUGFLAG_WARNING in a recent cleanup patch.

  Thanks to: Alexey Kardashevskiy, Alistair Popple, Benjamin
  Herrenschmidt"

* tag 'powerpc-4.12-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/debug: Add missing warn flag to WARN_ON's non-builtin path
  powerpc/xive: Fix offset for store EOI MMIOs
  powerpc/npu-dma: Remove spurious WARN_ON when a PCI device has no of_node
2017-06-17 05:57:54 +09:00
Paul Mackerras
3d3efb68c1 KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1
POWER9 DD1 has an erratum where writing to the TBU40 register, which
is used to apply an offset to the timebase, can cause the timebase to
lose counts.  This results in the timebase on some CPUs getting out of
sync with other CPUs, which then results in misbehaviour of the
timekeeping code.

To work around the problem, we make KVM ignore the timebase offset for
all guests on POWER9 DD1 machines.  This means that live migration
cannot be supported on POWER9 DD1 machines.

Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-16 16:04:57 +10:00
Paul Mackerras
7ceaa6dcd8 KVM: PPC: Book3S HV: Save/restore host values of debug registers
At present, HV KVM on POWER8 and POWER9 machines loses any instruction
or data breakpoint set in the host whenever a guest is run.
Instruction breakpoints are currently only used by xmon, but ptrace
and the perf_event subsystem can set data breakpoints as well as xmon.

To fix this, we save the host values of the debug registers (CIABR,
DAWR and DAWRX) before entering the guest and restore them on exit.
To provide space to save them in the stack frame, we expand the stack
frame allocated by kvmppc_hv_entry() from 112 to 144 bytes.

Fixes: b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-16 11:53:19 +10:00
Benjamin Herrenschmidt
25642705b2 powerpc/xive: Fix offset for store EOI MMIOs
Architecturally we should apply a 0x400 offset for these. Not doing
it will break future HW implementations.

The offset of 0 is supposed to remain for "triggers" though not all
sources support both trigger and store EOI, and in P9 specifically,
some sources will treat 0 as a store EOI. But future chips will not.
So this makes us use the properly architected offset which should work
always.

Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-15 23:29:39 +10:00
Paul Mackerras
46a704f840 KVM: PPC: Book3S HV: Preserve userspace HTM state properly
If userspace attempts to call the KVM_RUN ioctl when it has hardware
transactional memory (HTM) enabled, the values that it has put in the
HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by
guest values.  To fix this, we detect this condition and save those
SPR values in the thread struct, and disable HTM for the task.  If
userspace goes to access those SPRs or the HTM facility in future,
a TM-unavailable interrupt will occur and the handler will reload
those SPRs and re-enable HTM.

If userspace has started a transaction and suspended it, we would
currently lose the transactional state in the guest entry path and
would almost certainly get a "TM Bad Thing" interrupt, which would
cause the host to crash.  To avoid this, we detect this case and
return from the KVM_RUN ioctl with an EINVAL error, with the KVM
exit reason set to KVM_EXIT_FAIL_ENTRY.

Fixes: b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-15 16:18:17 +10:00
Paul Mackerras
4c3bb4ccd0 KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit
This restores several special-purpose registers (SPRs) to sane values
on guest exit that were missed before.

TAR and VRSAVE are readable and writable by userspace, and we need to
save and restore them to prevent the guest from potentially affecting
userspace execution (not that TAR or VRSAVE are used by any known
program that run uses the KVM_RUN ioctl).  We save/restore these
in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.

FSCR affects userspace execution in that it can prohibit access to
certain facilities by userspace.  We restore it to the normal value
for the task on exit from the KVM_RUN ioctl.

IAMR is normally 0, and is restored to 0 on guest exit.  However,
with a radix host on POWER9, it is set to a value that prevents the
kernel from executing user-accessible memory.  On POWER9, we save
IAMR on guest entry and restore it on guest exit to the saved value
rather than 0.  On POWER8 we continue to set it to 0 on guest exit.

PSPB is normally 0.  We restore it to 0 on guest exit to prevent
userspace taking advantage of the guest having set it non-zero
(which would allow userspace to set its SMT priority to high).

UAMOR is normally 0.  We restore it to 0 on guest exit to prevent
the AMR from being used as a covert channel between userspace
processes, since the AMR is not context-switched at present.

Fixes: b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-15 16:17:09 +10:00
Paul Mackerras
ca8efa1df1 KVM: PPC: Book3S HV: Context-switch EBB registers properly
This adds code to save the values of three SPRs (special-purpose
registers) used by userspace to control event-based branches (EBBs),
which are essentially interrupts that get delivered directly to
userspace.  These registers are loaded up with guest values when
entering the guest, and their values are saved when exiting the
guest, but we were not saving the host values and restoring them
before going back to userspace.

On POWER8 this would only affect userspace programs which explicitly
request the use of EBBs and also use the KVM_RUN ioctl, since the
only source of EBBs on POWER8 is the PMU, and there is an explicit
enable bit in the PMU registers (and those PMU registers do get
properly context-switched between host and guest).  On POWER9 there
is provision for externally-generated EBBs, and these are not subject
to the control in the PMU registers.

Since these registers only affect userspace, we can save them when
we first come in from userspace and restore them before returning to
userspace, rather than saving/restoring the host values on every
guest entry/exit.  Similarly, we don't need to worry about their
values on offline secondary threads since they execute in the context
of the idle task, which never executes in userspace.

Fixes: b005255e12 ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-13 14:12:02 +10:00
Radim Krčmář
2fa6e1e12a KVM: add kvm_request_pending
A first step in vcpu->requests encapsulation.  Additionally, we now
use READ_ONCE() when accessing vcpu->requests, which ensures we
always load vcpu->requests when it's accessed.  This is important as
other threads can change it any time.  Also, READ_ONCE() documents
that vcpu->requests is used with other threads, likely requiring
memory barriers, which it does.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[ Documented the new use of READ_ONCE() and converted another check
  in arch/mips/kvm/vz.c ]
Signed-off-by: Andrew Jones <drjones@redhat.com>
Acked-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-06-04 16:53:00 +02:00
Paul Mackerras
2f2724630f KVM: PPC: Book3S HV: Cope with host using large decrementer mode
POWER9 introduces a new mode for the decrementer register, called
large decrementer mode, in which the decrementer counter is 56 bits
wide rather than 32, and reads are sign-extended rather than
zero-extended.  For the decrementer, this new mode is optional and
controlled by a bit in the LPCR.  The hypervisor decrementer (HDEC)
is 56 bits wide on POWER9 and has no mode control.

Since KVM code reads and writes the decrementer and hypervisor
decrementer registers in a few places, it needs to be aware of the
need to treat the decrementer value as a 64-bit quantity, and only do
a 32-bit sign extension when large decrementer mode is not in effect.
Similarly, the HDEC should always be treated as a 64-bit quantity on
POWER9.  We define a new EXTEND_HDEC macro to encapsulate the feature
test for POWER9 and the sign extension.

To enable the sign extension to be removed in large decrementer mode,
we test the LPCR_LD bit in the host LPCR image stored in the struct
kvm for the guest.  If is set then large decrementer mode is enabled
and the sign extension should be skipped.

This is partly based on an earlier patch by Oliver O'Halloran.

Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-29 16:01:26 +10:00
Sebastian Andrzej Siewior
419af25fa4 KVM/PPC/Book3S HV: Use cpuhp_setup_state_nocalls_cpuslocked()
kvmppc_alloc_host_rm_ops() holds get_online_cpus() while invoking
cpuhp_setup_state_nocalls().

cpuhp_setup_state_nocalls() invokes get_online_cpus() as well. This is
correct, but prevents the conversion of the hotplug locking to a percpu
rwsem.

Use cpuhp_setup_state_nocalls_cpuslocked() to avoid the nested
call. Convert *_online_cpus() to the new interfaces while at it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: kvm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: kvm-ppc@vger.kernel.org
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Alexander Graf <agraf@suse.com>
Link: http://lkml.kernel.org/r/20170524081547.809616236@linutronix.de
2017-05-26 10:10:39 +02:00
Paul Mackerras
76d837a4c0 KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms
Commit e91aa8e6ec ("KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64
permanently", 2017-03-22) enabled the SPAPR TCE code for all 64-bit
Book 3S kernel configurations in order to simplify the code and
reduce #ifdefs.  However, 64-bit Book 3S PPC platforms other than
pseries and powernv don't implement the necessary IOMMU callbacks,
leading to build failures like the following (for a pasemi config):

scripts/kconfig/conf  --silentoldconfig Kconfig
warning: (KVM_BOOK3S_64) selects SPAPR_TCE_IOMMU which has unmet direct dependencies (IOMMU_SUPPORT && (PPC_POWERNV || PPC_PSERIES))

...

  CC [M]  arch/powerpc/kvm/book3s_64_vio.o
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_64_vio.c: In function ‘kvmppc_clear_tce’:
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_64_vio.c:363:2: error: implicit declaration of function ‘iommu_tce_xchg’ [-Werror=implicit-function-declaration]
  iommu_tce_xchg(tbl, entry, &hpa, &dir);
  ^

To fix this, we make the inclusion of the SPAPR TCE support, and the
code that uses it in book3s_vio.c and book3s_vio_hv.c, depend on
the inclusion of support for the pseries and/or powernv platforms.
This means that when running a 'pseries' guest on those platforms,
the guest won't have in-kernel acceleration of the PAPR TCE hypercalls,
but at least now they compile.

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-12 15:32:30 +10:00
Paul Mackerras
67325e988f KVM: PPC: Book3S PR: Check copy_to/from_user return values
The PR KVM implementation of the PAPR HPT hypercalls (H_ENTER etc.)
access an image of the HPT in userspace memory using copy_from_user
and copy_to_user.  Recently, the declarations of those functions were
annotated to indicate that the return value must be checked.  Since
this code doesn't currently check the return value, this causes
compile warnings like the ones shown below, and since on PPC the
default is to compile arch/powerpc with -Werror, this causes the
build to fail.

To fix this, we check the return values, and if non-zero, fail the
hypercall being processed with a H_FUNCTION error return value.
There is really no good error return value to use since PAPR didn't
envisage the possibility that the hypervisor may not be able to access
the guest's HPT, and H_FUNCTION (function not supported) seems as
good as any.

The typical compile warnings look like this:

  CC      arch/powerpc/kvm/book3s_pr_papr.o
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr_papr.c: In function ‘kvmppc_h_pr_enter’:
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr_papr.c:53:2: error: ignoring return value of ‘copy_from_user’, declared with attribute warn_unused_result [-Werror=unused-result]
  copy_from_user(pteg, (void __user *)pteg_addr, sizeof(pteg));
  ^
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr_papr.c:74:2: error: ignoring return value of ‘copy_to_user’, declared with attribute warn_unused_result [-Werror=unused-result]
  copy_to_user((void __user *)pteg_addr, hpte, HPTE_SIZE);
  ^

... etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-12 15:09:03 +10:00
Paul Mackerras
acde25726b KVM: PPC: Book3S HV: Add radix checks in real-mode hypercall handlers
POWER9 running a radix guest will take some hypervisor interrupts
without going to real mode (turning off the MMU).  This means that
early hypercall handlers may now be called in virtual mode.  Most of
the handlers work just fine in both modes, but there are some that
can crash the host if called in virtual mode, notably the TCE (IOMMU)
hypercalls H_PUT_TCE, H_STUFF_TCE and H_PUT_TCE_INDIRECT.  These
already have both a real-mode and a virtual-mode version, so we
arrange for the real-mode version to return H_TOO_HARD for radix
guests, which will result in the virtual-mode version being called.

The other hypercall which is sensitive to the MMU mode is H_RANDOM.
It doesn't have a virtual-mode version, so this adds code to enable
it to be called in either mode.

An alternative solution was considered which would refuse to call any
of the early hypercall handlers when doing a virtual-mode exit from a
radix guest.  However, the XICS-on-XIVE code depends on the XICS
hypercalls being handled early even for virtual-mode exits, because
the handlers need to be called before the XIVE vCPU state has been
pulled off the hardware.  Therefore that solution would have become
quite invasive and complicated, and was rejected in favour of the
simpler, though less elegant, solution presented here.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-05-12 15:08:09 +10:00
Paolo Bonzini
4415b33528 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
The main thing here is a new implementation of the in-kernel
XICS interrupt controller emulation for POWER9 machines, from Ben
Herrenschmidt.

POWER9 has a new interrupt controller called XIVE (eXternal Interrupt
Virtualization Engine) which is able to deliver interrupts directly
to guest virtual CPUs in hardware without hypervisor intervention.
With this new code, the guest still sees the old XICS interface but
performance is better because the XICS emulation in the host uses the
XIVE directly rather than going through a XICS emulation in firmware.

Conflicts:
	arch/powerpc/kernel/cpu_setup_power.S [cherry-picked fix]
	arch/powerpc/kvm/book3s_xive.c [include asm/debugfs.h]
2017-05-09 11:50:01 +02:00
Linus Torvalds
2d3e4866de * ARM: HYP mode stub supports kexec/kdump on 32-bit; improved PMU
support; virtual interrupt controller performance improvements; support
 for userspace virtual interrupt controller (slower, but necessary for
 KVM on the weird Broadcom SoCs used by the Raspberry Pi 3)
 
 * MIPS: basic support for hardware virtualization (ImgTec
 P5600/P6600/I6400 and Cavium Octeon III)
 
 * PPC: in-kernel acceleration for VFIO
 
 * s390: support for guests without storage keys; adapter interruption
 suppression
 
 * x86: usual range of nVMX improvements, notably nested EPT support for
 accessed and dirty bits; emulation of CPL3 CPUID faulting
 
 * generic: first part of VCPU thread request API; kvm_stat improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJZEHUkAAoJEL/70l94x66DBeYH/09wrpJ2FjU4Rqv7FxmqgWfH
 9WGi4wvn/Z+XzQSyfMJiu2SfZVzU69/Y67OMHudy7vBT6knB+ziM7Ntoiu/hUfbG
 0g5KsDX79FW15HuvuuGh9kSjUsj7qsQdyPZwP4FW/6ZoDArV9mibSvdjSmiUSMV/
 2wxaoLzjoShdOuCe9EABaPhKK0XCrOYkygT6Paz1pItDxaSn8iW3ulaCuWMprUfG
 Niq+dFemK464E4yn6HVD88xg5j2eUM6bfuXB3qR3eTR76mHLgtwejBzZdDjLG9fk
 32PNYKhJNomBxHVqtksJ9/7cSR6iNPs7neQ1XHemKWTuYqwYQMlPj1NDy0aslQU=
 =IsiZ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - HYP mode stub supports kexec/kdump on 32-bit
   - improved PMU support
   - virtual interrupt controller performance improvements
   - support for userspace virtual interrupt controller (slower, but
     necessary for KVM on the weird Broadcom SoCs used by the Raspberry
     Pi 3)

  MIPS:
   - basic support for hardware virtualization (ImgTec P5600/P6600/I6400
     and Cavium Octeon III)

  PPC:
   - in-kernel acceleration for VFIO

  s390:
   - support for guests without storage keys
   - adapter interruption suppression

  x86:
   - usual range of nVMX improvements, notably nested EPT support for
     accessed and dirty bits
   - emulation of CPL3 CPUID faulting

  generic:
   - first part of VCPU thread request API
   - kvm_stat improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
  kvm: nVMX: Don't validate disabled secondary controls
  KVM: put back #ifndef CONFIG_S390 around kvm_vcpu_kick
  Revert "KVM: Support vCPU-based gfn->hva cache"
  tools/kvm: fix top level makefile
  KVM: x86: don't hold kvm->lock in KVM_SET_GSI_ROUTING
  KVM: Documentation: remove VM mmap documentation
  kvm: nVMX: Remove superfluous VMX instruction fault checks
  KVM: x86: fix emulation of RSM and IRET instructions
  KVM: mark requests that need synchronization
  KVM: return if kvm_vcpu_wake_up() did wake up the VCPU
  KVM: add explicit barrier to kvm_vcpu_kick
  KVM: perform a wake_up in kvm_make_all_cpus_request
  KVM: mark requests that do not need a wakeup
  KVM: remove #ifndef CONFIG_S390 around kvm_vcpu_wake_up
  KVM: x86: always use kvm_make_request instead of set_bit
  KVM: add kvm_{test,clear}_request to replace {test,clear}_bit
  s390: kvm: Cpu model support for msa6, msa7 and msa8
  KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK
  kvm: better MWAIT emulation for guests
  KVM: x86: virtualize cpuid faulting
  ...
2017-05-08 12:37:56 -07:00
Linus Torvalds
c6a677c6f3 Staging/IIO patches for 4.12-rc1
Here is the big staging tree update for 4.12-rc1.  And it's a big one,
 adding about 350k new lines of crap^Wcode, mostly all in a big dump of
 media drivers from Intel.  But there's other new drivers in here as
 well, yet-another-wifi driver, new IIO drivers, and a new crypto
 accelerator.  We also deleted a bunch of stuff, mostly in patch
 cleanups, but also the Android ION code has shrunk a lot, and the
 Android low memory killer driver was finally deleted, much to the
 celebration of the -mm developers.
 
 All of these have been in linux-next with a few build issues that will
 show up when you merge to your tree, I'll follow up with fixes for those
 after this gets merged.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWQzzlQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylNMgCcD+GoaF/Ml7YnULRl2GG/526II78AnitZ8qjd
 rPqeowMIewYu9fgckLUc
 =7rzO
 -----END PGP SIGNATURE-----

Merge tag 'staging-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO updates from Greg KH:
 "Here is the big staging tree update for 4.12-rc1.

  It's a big one, adding about 350k new lines of crap^Wcode, mostly all
  in a big dump of media drivers from Intel. But there's other new
  drivers in here as well, yet-another-wifi driver, new IIO drivers, and
  a new crypto accelerator.

  We also deleted a bunch of stuff, mostly in patch cleanups, but also
  the Android ION code has shrunk a lot, and the Android low memory
  killer driver was finally deleted, much to the celebration of the -mm
  developers.

  All of these have been in linux-next with a few build issues that will
  show up when you merge to your tree"

Merge conflicts in the new rtl8723bs driver (due to the wifi changes
this merge window) handled as per linux-next, courtesy of Stephen
Rothwell.

* tag 'staging-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1182 commits)
  staging: fsl-mc/dpio: add cpu <--> LE conversion for dpaa2_fd
  staging: ks7010: remove line continuations in quoted strings
  staging: vt6656: use tabs instead of spaces
  staging: android: ion: Fix unnecessary initialization of static variable
  staging: media: atomisp: fix range checking on clk_num
  staging: media: atomisp: fix misspelled word in comment
  staging: media: atomisp: kmap() can't fail
  staging: atomisp: remove #ifdef for runtime PM functions
  staging: atomisp: satm include directory is gone
  atomisp: remove some more unused files
  atomisp: remove hmm_load/store/clear indirections
  atomisp: kill off mmgr_free
  atomisp: clean up the hmm init/cleanup indirections
  atomisp: handle allocation calls before init in the hmm layer
  staging: fsl-dpaa2/eth: Add maintainer for Ethernet driver
  staging: fsl-dpaa2/eth: Add TODO file
  staging: fsl-dpaa2/eth: Add trace points
  staging: fsl-dpaa2/eth: Add driver specific stats
  staging: fsl-dpaa2/eth: Add ethtool support
  staging: fsl-dpaa2/eth: Add Freescale DPAA2 Ethernet driver
  ...
2017-05-05 18:16:23 -07:00
Linus Torvalds
7246f60068 powerpc updates for 4.12 part 1.
Highlights include:
 
  - Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
    virtual address space, but a process can request access to the full 512TB by
    passing a hint to mmap().
 
  - Support for the new Power9 "XIVE" interrupt controller.
 
  - TLB flushing optimisations for the radix MMU on Power9.
 
  - Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
    Architecture 2.0".
 
  - The ability to configure the mmap randomisation limits at build and runtime.
 
  - Several small fixes and cleanups to the kprobes code, as well as support for
    KPROBES_ON_FTRACE.
 
  - Major improvements to handling of system reset interrupts, correctly treating
    them as NMIs, giving them a dedicated stack and using a new hypervisor call
    to trigger them, all of which should aid debugging and robustness.
 
 Many fixes and other minor enhancements.
 
 Thanks to:
   Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
   Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
   Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
   Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
   Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
   Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
   Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
   R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
   Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
   Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
   Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
 Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
 V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
 KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
 EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
 I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
 Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
 EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
 gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
 m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
 Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
 ediqGEj0/N1HMB31V5tS
 =vSF3
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Larger virtual address space on 64-bit server CPUs. By default we
     use a 128TB virtual address space, but a process can request access
     to the full 512TB by passing a hint to mmap().

   - Support for the new Power9 "XIVE" interrupt controller.

   - TLB flushing optimisations for the radix MMU on Power9.

   - Support for CAPI cards on Power9, using the "Coherent Accelerator
     Interface Architecture 2.0".

   - The ability to configure the mmap randomisation limits at build and
     runtime.

   - Several small fixes and cleanups to the kprobes code, as well as
     support for KPROBES_ON_FTRACE.

   - Major improvements to handling of system reset interrupts,
     correctly treating them as NMIs, giving them a dedicated stack and
     using a new hypervisor call to trigger them, all of which should
     aid debugging and robustness.

   - Many fixes and other minor enhancements.

  Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
  Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
  Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
  Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
  Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
  Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
  Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
  Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
  R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
  Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
  Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
  Yang Shi"

* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
  powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
  powerpc/powernv: Fix TCE kill on NVLink2
  powerpc/mm/radix: Drop support for CPUs without lockless tlbie
  powerpc/book3s/mce: Move add_taint() later in virtual mode
  powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
  powerpc/smp: Document irq enable/disable after migrating IRQs
  powerpc/mpc52xx: Don't select user-visible RTAS_PROC
  powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
  powerpc/eeh: Clean up and document event handling functions
  powerpc/eeh: Avoid use after free in eeh_handle_special_event()
  cxl: Mask slice error interrupts after first occurrence
  cxl: Route eeh events to all drivers in cxl_pci_error_detected()
  cxl: Force context lock during EEH flow
  powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
  powerpc/xmon: Teach xmon oops about radix vectors
  powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
  powerpc/pseries: Enable VFIO
  powerpc/powernv: Fix iommu table size calculation hook for small tables
  powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
  powerpc: Add arch/powerpc/tools directory
  ...
2017-05-05 11:36:44 -07:00
Paul Mackerras
fb7dcf723d Merge remote-tracking branch 'remotes/powerpc/topic/xive' into kvm-ppc-next
This merges in the powerpc topic/xive branch to bring in the code for
the in-kernel XICS interrupt controller emulation to use the new XIVE
(eXternal Interrupt Virtualization Engine) hardware in the POWER9 chip
directly, rather than via a XICS emulation in firmware.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-28 08:23:16 +10:00
Denis Kirjanov
db4b0dfab7 KVM: PPC: Book3S HV: Avoid preemptibility warning in module initialization
With CONFIG_DEBUG_PREEMPT, get_paca() produces the following warning
in kvmppc_book3s_init_hv() since it calls debug_smp_processor_id().

There is no real issue with the xics_phys field.
If paca->kvm_hstate.xics_phys is non-zero on one cpu, it will be
non-zero on them all.  Therefore this is not fixing any actual
problem, just the warning.

[  138.521188] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/5596
[  138.521308] caller is .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv]
[  138.521404] CPU: 5 PID: 5596 Comm: modprobe Not tainted 4.11.0-rc3-00022-gc7e790c #1
[  138.521509] Call Trace:
[  138.521563] [c0000007d018b810] [c0000000023eef10] .dump_stack+0xe4/0x150 (unreliable)
[  138.521694] [c0000007d018b8a0] [c000000001f6ec04] .check_preemption_disabled+0x134/0x150
[  138.521829] [c0000007d018b940] [d00000000a010274] .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv]
[  138.521963] [c0000007d018ba00] [c00000000191d5cc] .do_one_initcall+0x5c/0x1c0
[  138.522082] [c0000007d018bad0] [c0000000023e9494] .do_init_module+0x84/0x240
[  138.522201] [c0000007d018bb70] [c000000001aade18] .load_module+0x1f68/0x2a10
[  138.522319] [c0000007d018bd20] [c000000001aaeb30] .SyS_finit_module+0xc0/0xf0
[  138.522439] [c0000007d018be30] [c00000000191baec] system_call+0x38/0xfc

Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-28 08:21:51 +10:00
Radim Krčmář
72875d8a4d KVM: add kvm_{test,clear}_request to replace {test,clear}_bit
Users were expected to use kvm_check_request() for testing and clearing,
but request have expanded their use since then and some users want to
only test or do a faster clear.

Make sure that requests are not directly accessed with bit operations.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:12:22 +02:00
Benjamin Herrenschmidt
5af5099385 KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller
This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.

This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
 failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
 adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
 integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27 21:37:29 +10:00
Thomas Huth
feafd13c96 KVM: PPC: Book3S PR: Do not fail emulation with mtspr/mfspr for unknown SPRs
According to the PowerISA 2.07, mtspr and mfspr should not always
generate an illegal instruction exception when being used with an
undefined SPR, but rather treat the instruction as a NOP or inject a
privilege exception in some cases, too - depending on the SPR number.
Also turn the printk here into a ratelimited print statement, so that
the guest can not flood the dmesg log of the host by issueing lots of
illegal mtspr/mfspr instruction here.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:39:32 +10:00
Alexey Kardashevskiy
121f80ba68 KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

If we fail to update a hardware IOMMU table unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.

This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:39:26 +10:00
Alexey Kardashevskiy
b1af23d836 KVM: PPC: iommu: Unify TCE checking
This reworks helpers for checking TCE update parameters in way they
can be used in KVM.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:39:21 +10:00
Alexey Kardashevskiy
da6f59e192 KVM: PPC: Use preregistered memory API to access TCE list
VFIO on sPAPR already implements guest memory pre-registration
when the entire guest RAM gets pinned. This can be used to translate
the physical address of a guest page containing the TCE list
from H_PUT_TCE_INDIRECT.

This makes use of the pre-registrered memory API to access TCE list
pages in order to avoid unnecessary locking on the KVM memory
reverse map as we know that all of guest memory is pinned and
we have a flat array mapping GPA to HPA which makes it simpler and
quicker to index into that array (even with looking up the
kernel page tables in vmalloc_to_phys) than it is to find the memslot,
lock the rmap entry, look up the user page tables, and unlock the rmap
entry. Note that the rmap pointer is initialized to NULL
where declared (not in this patch).

If a requested chunk of memory has not been preregistered, this will
fall back to non-preregistered case and lock rmap.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:39:16 +10:00
Alexey Kardashevskiy
503bfcbe18 KVM: PPC: Pass kvm* to kvmppc_find_table()
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
there. This will be used in the following patches where we will be
attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
to VCPU).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:39:12 +10:00
Alexey Kardashevskiy
e91aa8e6ec KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM in book3s-64 and
not to have IOMMU bits for PCI pass through support as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdef's we could have only user space emulated devices accelerated
(but not VFIO) which do not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:39:07 +10:00
Alexey Kardashevskiy
3762d45aa7 KVM: PPC: Align the table size to system page size
At the moment the userspace can request a table smaller than a page size
and this value will be stored as kvmppc_spapr_tce_table::size.
However the actual allocated size will still be aligned to the system
page size as alloc_page() is used there.

This aligns the table size up to the system page size. It should not
change the existing behaviour but when in-kernel TCE acceleration patchset
reaches the upstream kernel, this will allow small TCE tables be
accelerated as well: PCI IODA iommu_table allocator already aligns
the size and, without this patch, an IOMMU group won't attach to LIOBN
due to the mismatching table size.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:38:20 +10:00
Alexey Kardashevskiy
96df226769 KVM: PPC: Book3S PR: Preserve storage control bits
PR KVM page fault handler performs eaddr to pte translation for a guest,
however kvmppc_mmu_book3s_64_xlate() does not preserve WIMG bits
(storage control) in the kvmppc_pte struct. If PR KVM is running as
a second level guest under HV KVM, and PR KVM tries inserting HPT entry,
this fails in HV KVM if it already has this mapping.

This preserves WIMG bits between kvmppc_mmu_book3s_64_xlate() and
kvmppc_mmu_map_page().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:38:14 +10:00
Alexey Kardashevskiy
bd9166ffe6 KVM: PPC: Book3S PR: Exit KVM on failed mapping
At the moment kvmppc_mmu_map_page() returns -1 if
mmu_hash_ops.hpte_insert() fails for any reason so the page fault handler
resumes the guest and it faults on the same address again.

This adds distinction to kvmppc_mmu_map_page() to return -EIO if
mmu_hash_ops.hpte_insert() failed for a reason other than full pteg.
At the moment only pSeries_lpar_hpte_insert() returns -2 if
plpar_pte_enter() failed with a code other than H_PTEG_FULL.
Other mmu_hash_ops.hpte_insert() instances can only fail with
-1 "full pteg".

With this change, if PR KVM fails to update HPT, it can signal
the userspace about this instead of returning to guest and having
the very same page fault over and over again.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:38:04 +10:00
Alexey Kardashevskiy
9eecec126e KVM: PPC: Book3S PR: Get rid of unused local variable
@is_mmio has never been used since introduction in
commit 2f4cf5e42d ("Add book3s.c") from 2009.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:38:00 +10:00
Markus Elfring
37655490db KVM: PPC: e500: Use kcalloc() in e500_mmu_host_init()
* A multiplication for the size determination of a memory allocation
  indicated that an array data structure should be processed.
  Thus use the corresponding function "kcalloc".

  This issue was detected by using the Coccinelle software.

* Replace the specification of a data type by a pointer dereference
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:37:55 +10:00
Markus Elfring
a1c52e1c7c KVM: PPC: Book3S HV: Use common error handling code in kvmppc_clr_passthru_irq()
Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:37:50 +10:00
Paul Mackerras
9b5ab00513 KVM: PPC: Add MMIO emulation for remaining floating-point instructions
For completeness, this adds emulation of the lfiwax and lfiwzx
instructions.  With this, all floating-point load and store instructions
as of Power ISA V2.07 are emulated.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:37:44 +10:00
Paul Mackerras
ceba57df43 KVM: PPC: Emulation for more integer loads and stores
This adds emulation for the following integer loads and stores,
thus enabling them to be used in a guest for accessing emulated
MMIO locations.

- lhaux
- lwaux
- lwzux
- ldu
- lwa
- stdux
- stwux
- stdu
- ldbrx
- stdbrx

Previously, most of these would cause an emulation failure exit to
userspace, though ldu and lwa got treated incorrectly as ld, and
stdu got treated incorrectly as std.

This also tidies up some of the formatting and updates the comment
listing instructions that still need to be implemented.

With this, all integer loads and stores that are defined in the Power
ISA v2.07 are emulated, except for those that are permitted to trap
when used on cache-inhibited or write-through mappings (and which do
in fact trap on POWER8), that is, lmw/stmw, lswi/stswi, lswx/stswx,
lq/stq, and l[bhwdq]arx/st[bhwdq]cx.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:37:38 +10:00
Alexey Kardashevskiy
91242fd1a3 KVM: PPC: Add MMIO emulation for stdx (store doubleword indexed)
This adds missing stdx emulation for emulated MMIO accesses by KVM
guests.  This allows the Mellanox mlx5_core driver from recent kernels
to work when MMIO emulation is enforced by userspace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:37:33 +10:00
Bin Lu
6f63e81bda KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.

The instructions that this adds emulation for are:

- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x

[paulus@ozlabs.org - some cleanups, fixes and rework, make it
 compile for Book E, fix build when PR KVM is built in]

Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 11:36:41 +10:00
Paul Mackerras
307d927967 KVM: PPC: Provide functions for queueing up FP/VEC/VSX unavailable interrupts
This provides functions that can be used for generating interrupts
indicating that a given functional unit (floating point, vector, or
VSX) is unavailable.  These functions will be used in instruction
emulation code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-20 10:39:50 +10:00
Laura Abbott
f318dd083c cma: Store a name in the cma structure
Frameworks that may want to enumerate CMA heaps (e.g. Ion) will find it
useful to have an explicit name attached to each region. Store the name
in each CMA structure.

Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-18 20:41:12 +02:00
Michael Ellerman
3c19d5ada1 Merge branch 'topic/xive' (early part) into next
This merges the arch part of the XIVE support, leaving the final commit
with the KVM specific pieces dangling on the branch for Paul to merge
via the kvm-ppc tree.
2017-04-12 22:31:37 +10:00
Michael Ellerman
3ae05fb3cc powerpc: Remove unnecessary includes of asm/debug.h
These files don't seem to have any need for asm/debug.h, now that all it
includes are the debugger hooks and breakpoint definitions.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:04 +10:00
Michael Ellerman
7644d5819c powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there
powerpc_debugfs_root is the dentry representing the root of the
"powerpc" directory tree in debugfs.

Currently it sits in asm/debug.h, a long with some other things that
have "debug" in the name, but are otherwise unrelated.

Pull it out into a separate header, which also includes linux/debugfs.h,
and convert all the users to include debugfs.h instead of debug.h.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-11 07:46:03 +10:00
Benjamin Herrenschmidt
d381d7caf8 powerpc: Consolidate variants of real-mode MMIOs
We have all sort of variants of MMIO accessors for the real mode
instructions. This creates a clean set of accessors based on
Linux normal naming conventions, replacing all occurrences of
the old ones in the tree.

I have purposefully removed the "out/in" variants in favor of
only including __raw variants. Any code using these is already
pretty much hand tuned to operate in a very specific environment.
I've fixed up the 2 users (only one of them actually needed
a barrier in the first place).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-10 21:43:16 +10:00
Benjamin Herrenschmidt
936774cd3f powerpc/kvm: Make kvmppc_xics_create_icp static
It's only used within the same file it's defined

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-10 21:43:15 +10:00
Benjamin Herrenschmidt
d3989143d0 powerpc/kvm: Massage order of #include
We traditionally have linux/ before asm/

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-10 21:43:15 +10:00
Benjamin Herrenschmidt
243e25112d powerpc/xive: Native exploitation of the XIVE interrupt controller
The XIVE interrupt controller is the new interrupt controller
found in POWER9. It supports advanced virtualization capabilities
among other things.

Currently we use a set of firmware calls that simulate the old
"XICS" interrupt controller but this is fairly inefficient.

This adds the framework for using XIVE along with a native
backend which OPAL for configuration. Later, a backend allowing
the use in a KVM or PowerVM guest will also be provided.

This disables some fast path for interrupts in KVM when XIVE is
enabled as these rely on the firmware emulation code which is no
longer available when the XIVE is used natively by Linux.

A latter patch will make KVM also directly exploit the XIVE, thus
recovering the lost performance (and more).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fixup pr_xxx("XIVE:"...), don't split pr_xxx() strings,
 tweak Kconfig so XIVE_NATIVE selects XIVE and depends on POWERNV,
 fix build errors when SMP=n, fold in fixes from Ben:
   Don't call cpu_online() on an invalid CPU number
   Fix irq target selection returning out of bounds cpu#
   Extra sanity checks on cpu numbers
 ]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-10 21:41:34 +10:00
Paolo Bonzini
3042255899 kvm: make KVM_CAP_COALESCED_MMIO architecture agnostic
Remove code from architecture files that can be moved to virt/kvm, since there
is already common code for coalesced MMIO.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Removed a pointless 'break' after 'return'.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Dan Carpenter
abd80dcbc4 KVM: PPC: Book3S HV: Check for kmalloc errors in ioctl
kzalloc() won't actually fail because sizeof(*resize) is small, but
static checkers complain.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-06 15:50:43 +10:00
Aneesh Kumar K.V
e6f81a9201 powerpc/mm/hash: Support 68 bit VA
Inorder to support large effective address range (512TB), we want to
increase the virtual address bits to 68. But we do have platforms like
p4 and p5 that can only do 65 bit VA. We support those platforms by
limiting context bits on them to 16.

The protovsid -> vsid conversion is verified to work with both 65 and 68
bit va values. I also documented the restrictions in a table format as
part of code comments.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:10:00 +11:00
Michael Ellerman
a336f2f5b0 powerpc/mm/hash: Abstract context id allocation for KVM
KVM wants to be able to allocate an MMU context id, which it does
currently by calling __init_new_context().

We're about to rework that code, so provide a wrapper for KVM so it
can not worry about the details.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31 23:09:57 +11:00
Aneesh Kumar K.V
d19469e841 power/mm: update pte_write and pte_wrprotect to handle savedwrite
We use pte_write() to check whethwer the pte entry is writable.  This is
mostly used to later mark the pte read only if it is writable.  The other
use of pte_write() is to check whether the pte_entry is writable so that
hardware page table entry can be marked accordingly.  This is used in kvm
where we look at qemu page table entry and update hardware hash page table
for the guest with correct write enable bit.

With the above, for the first usage we should also check the savedwrite
bit so that we can correctly clear the savedwite bit.  For the later, we
add a new variant __pte_write().

With this we can revert write_protect_page part of 595cd8f256 ("mm/ksm:
handle protnone saved writes when making page write protect").  But I left
it as it is as an example code for savedwrite check.

Fixes: c137a2757b ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write")
Link: http://lkml.kernel.org/r/1488203787-17849-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Linus Torvalds
2d62e0768d Second batch of KVM changes for 4.11 merge window
PPC:
  * correct assumption about ASDR on POWER9
  * fix MMIO emulation on POWER9
 
 x86:
  * add a simple test for ioperm
  * cleanup TSS
    (going through KVM tree as the whole undertaking was caused by VMX's
     use of TSS)
  * fix nVMX interrupt delivery
  * fix some performance counters in the guest
 
 And two cleanup patches.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJYuu5qAAoJEED/6hsPKofoRAUH/jkx/KFDcw3FggixysWVgRai
 iLSbbAZemnSLFSOkOU/t7Bz0fXCUgB0tAcMJd9ow01Dg1zObiTpuUIo6qEPaYHdX
 gqtUzlHuyECZEcgK0RXS9kDYLrvw7EFocxnDWQfV91qCZSS6nBSSLF3ST1rNV69W
 mUvcZG+MciDcZUe1lTexoswVTh1m7avvozEnQ5OHnZR9yicoXiadBQjzL6yqWoqf
 Ml/29zRk5+MvloTudxjkAKm3mh7psW88jNMh37TXbAA7i+Xwl9cU6GLR9mFWstoP
 7Ot7ecq9mNAUO3lTIQh7lqvB60LMFznS4IlYK7MbplC3kvJLkfzhTWaN1aGvh90=
 =cqHo
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Radim Krčmář:
 "Second batch of KVM changes for the 4.11 merge window:

  PPC:
   - correct assumption about ASDR on POWER9
   - fix MMIO emulation on POWER9

  x86:
   - add a simple test for ioperm
   - cleanup TSS (going through KVM tree as the whole undertaking was
     caused by VMX's use of TSS)
   - fix nVMX interrupt delivery
   - fix some performance counters in the guest

  ... and two cleanup patches"

* tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: nVMX: Fix pending events injection
  x86/kvm/vmx: remove unused variable in segment_base()
  selftests/x86: Add a basic selftest for ioperm
  x86/asm: Tidy up TSS limit code
  kvm: convert kvm.users_count from atomic_t to refcount_t
  KVM: x86: never specify a sample period for virtualized in_tx_cp counters
  KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9
  KVM: PPC: Book3S HV: Fix software walk of guest process page tables
2017-03-04 11:36:19 -08:00
Ingo Molnar
b2d0910310 sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.

But first update code that relied on the implicit header inclusion.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
03441a3482 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/stat.h>
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Paul Mackerras
4e5acdc23a KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9
In HPT mode on POWER9, the ASDR register is supposed to record
segment information for hypervisor page faults.  It turns out that
POWER9 DD1 does not record the page size information in the ASDR
for faults in guest real mode.  We have the necessary information
in memory already, so by moving the checks for real mode that already
existed, we can use the in-memory copy.  Since a load is likely to
be faster than reading an SPR, we do this unconditionally (not just
for POWER9 DD1).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-03-01 11:54:10 +11:00
Paul Mackerras
70cd4c10b2 KVM: PPC: Book3S HV: Fix software walk of guest process page tables
This fixes some bugs in the code that walks the guest's page tables.
These bugs cause MMIO emulation to fail whenever the guest is in
virtial mode (MMU on), leading to the guest hanging if it tried to
access a virtio device.

The first bug was that when reading the guest's process table, we were
using the whole of arch->process_table, not just the field that contains
the process table base address.  The second bug was that the mask used
when reading the process table entry to get the radix tree base address,
RPDB_MASK, had the wrong value.

Fixes: 9e04ba69be ("KVM: PPC: Book3S HV: Add basic infrastructure for radix guests")
Fixes: e99833448c ("powerpc/mm/radix: Add partition table format & callback")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-03-01 11:53:45 +11:00
Lucas Stach
e2f466e32f mm: cma_alloc: allow to specify GFP mask
Most users of this interface just want to use it with the default
GFP_KERNEL flags, but for cases where DMA memory is allocated it may be
called from a different context.

No functional change yet, just passing through the flag to the
underlying alloc_contig_range function.

Link: http://lkml.kernel.org/r/20170127172328.18574-2-l.stach@pengutronix.de
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alexander Graf <agraf@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Dave Jiang
11bac80004 mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Paolo Bonzini
a26d553400 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
Paul Mackerras writes:
"Please do a pull from my kvm-ppc-next branch to get some fixes which I
would like to have in 4.11.  There are four small commits there; two
are fixes for potential host crashes in the new HPT resizing code, and
the other two are changes to printks to make KVM on PPC a little less
noisy."
2017-02-20 11:54:22 +01:00
Paul Mackerras
bcd3bb63db KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
The new HPT resizing code added in commit b5baa68773 ("KVM: PPC:
Book3S HV: KVM-HV HPT resizing implementation", 2016-12-20) doesn't
have code to handle the new HPTE format which POWER9 uses.  Thus it
would be best not to advertise it to userspace on POWER9 systems
until it works properly.

Also, since resize_hpt_rehash_hpte() contains BUG_ON() calls that
could be hit on POWER9, let's prevent it from being called on POWER9
for now.

Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-18 14:22:47 +11:00
Paolo Bonzini
460df4c1fc KVM: race-free exit from KVM_RUN without POSIX signals
The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
a VCPU out of KVM_RUN through a POSIX signal.  A signal is attached
to a dummy signal handler; by blocking the signal outside KVM_RUN and
unblocking it inside, this possible race is closed:

          VCPU thread                     service thread
   --------------------------------------------------------------
        check flag
                                          set flag
                                          raise signal
        (signal handler does nothing)
        KVM_RUN

However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
tsk->sighand->siglock on every KVM_RUN.  This lock is often on a
remote NUMA node, because it is on the node of a thread's creator.
Taking this lock can be very expensive if there are many userspace
exits (as is the case for SMP Windows VMs without Hyper-V reference
time counter).

As an alternative, we can put the flag directly in kvm_run so that
KVM can see it:

          VCPU thread                     service thread
   --------------------------------------------------------------
                                          raise signal
        signal handler
          set run->immediate_exit
        KVM_RUN
          check run->immediate_exit

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:27:37 +01:00
Thomas Huth
3a4f176081 KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
The average user likely does not know what a "htab" or "LPID" is,
and it's annoying that these messages are quickly filling the dmesg
log when you're doing boot cycle tests, so let's turn it into a debug
message instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-17 22:05:56 +11:00
Vipin K Parashar
4da934dc65 KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
kvm_ppc_mmu_book3s_32/64 xlat() logs "KVM can't copy data" error
upon failing to copy user data to kernel space. This floods kernel
log once such fails occur in short time period. Ratelimit this
error to avoid flooding kernel logs upon copy data failures.

Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-17 14:03:35 +11:00
David Gibson
5b73d6347e KVM: PPC: Book3S HV: Prevent double-free on HPT resize commit path
resize_hpt_release(), called once the HPT resize of a KVM guest is
completed (successfully or unsuccessfully) frees the state structure for
the resize.  It is currently not safe to call with a NULL pointer.

However, one of the error paths in kvm_vm_ioctl_resize_hpt_commit() can
invoke it with a NULL pointer.  This will occur if userspace improperly
invokes KVM_PPC_RESIZE_HPT_COMMIT without previously calling
KVM_PPC_RESIZE_HPT_PREPARE, or if it calls COMMIT twice without an
intervening PREPARE.

To fix this potential crash bug - and maybe others like it, make it safe
(and a no-op) to call resize_hpt_release() with a NULL resize pointer.

Found by Dan Carpenter with a static checker.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-16 16:32:19 +11:00
Wei Yongjun
5982f0849e KVM: PPC: Book 3S: Fix error return in kvm_vm_ioctl_create_spapr_tce()
Fix to return error code -ENOMEM from the memory alloc error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-09 22:06:28 +11:00
Paul Mackerras
a4a741a048 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in a fix which touches both PPC and KVM code,
which was therefore put into a topic branch in the powerpc
tree.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-02-08 19:35:34 +11:00
Benjamin Herrenschmidt
ab9bad0ead powerpc/powernv: Remove separate entry for OPAL real mode calls
All entry points already read the MSR so they can easily do
the right thing.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-07 16:40:18 +11:00
David Gibson
050f23390f KVM: PPC: Book3S HV: Advertise availablity of HPT resizing on KVM HV
This updates the KVM_CAP_SPAPR_RESIZE_HPT capability to advertise the
presence of in-kernel HPT resizing on KVM HV.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 22:00:07 +11:00
David Gibson
b5baa68773 KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation
This adds the "guts" of the implementation for the HPT resizing PAPR
extension.  It has the code to allocate and clear a new HPT, rehash an
existing HPT's entries into it, and accomplish the switchover for a
KVM guest from the old HPT to the new one.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 22:00:00 +11:00
David Gibson
5e9859699a KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation
This adds a not yet working outline of the HPT resizing PAPR
extension.  Specifically it adds the necessary ioctl() functions,
their basic steps, the work function which will handle preparation for
the resize, and synchronization between these, the guest page fault
path and guest HPT update path.

The actual guts of the implementation isn't here yet, so for now the
calls will always fail.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:56 +11:00
David Gibson
639e459768 KVM: PPC: Book3S HV: Create kvmppc_unmap_hpte_helper()
The kvm_unmap_rmapp() function, called from certain MMU notifiers, is used
to force all guest mappings of a particular host page to be set ABSENT, and
removed from the reverse mappings.

For HPT resizing, we will have some cases where we want to set just a
single guest HPTE ABSENT and remove its reverse mappings.  To prepare with
this, we split out the logic from kvm_unmap_rmapp() to evict a single HPTE,
moving it to a new helper function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:50 +11:00
David Gibson
f98a8bf9ee KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size
The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
table (HPT) that userspace expects a guest VM to have, and is also used to
clear that HPT when necessary (e.g. guest reboot).

At present, once the ioctl() is called for the first time, the HPT size can
never be changed thereafter - it will be cleared but always sized as from
the first call.

With upcoming HPT resize implementation, we're going to need to allow
userspace to resize the HPT at reset (to change it back to the default size
if the guest changed it).

So, we need to allow this ioctl() to change the HPT size.

This patch also updates Documentation/virtual/kvm/api.txt to reflect
the new behaviour.  In fact the documentation was already slightly
incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to
smaller HPT size in allocation ioctl"

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:45 +11:00
David Gibson
aae0777f1e KVM: PPC: Book3S HV: Split HPT allocation from activation
Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
and sets it up as the active page table for a VM.  For the upcoming HPT
resize implementation we're going to want to allocate HPTs separately from
activating them.

So, split the allocation itself out into kvmppc_allocate_hpt() and perform
the activation with a new kvmppc_set_hpt() function.  Likewise we split
kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
which unsets it as an active HPT, then frees it.

We also move the logic to fall back to smaller HPT sizes if the first try
fails into the single caller which used that behaviour,
kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
that previously if the initial attempt at CMA allocation failed, we would
fall back to attempting smaller sizes with the page allocator.  Now, we
try first CMA, then the page allocator at each size.  As far as I can tell
this change should be harmless.

To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
call to kvmppc_free_lpid() that was there, we move to the single caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:39 +11:00
David Gibson
3d089f84c6 KVM: PPC: Book3S HV: Don't store values derivable from HPT order
Currently the kvm_hpt_info structure stores the hashed page table's order,
and also the number of HPTEs it contains and a mask for its size.  The
last two can be easily derived from the order, so remove them and just
calculate them as necessary with a couple of helper inlines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:34 +11:00
David Gibson
3f9d4f5a5f KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure
Currently, the powerpc kvm_arch structure contains a number of variables
tracking the state of the guest's hashed page table (HPT) in KVM HV.  This
patch gathers them all together into a single kvm_hpt_info substructure.
This makes life more convenient for the upcoming HPT resizing
implementation.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:28 +11:00
David Gibson
db9a290d9c KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarity
The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
all obvious from the name.  In practice kvmppc_alloc_hpt() allocates an HPT
by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
it with CMA only.

To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 21:59:23 +11:00
Paul Mackerras
167c76e055 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the POWER9 radix MMU host and guest support, which
was put into a topic branch because it touches both powerpc and
KVM code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-31 19:21:26 +11:00
Paul Mackerras
8cf4ecc0ca KVM: PPC: Book3S HV: Enable radix guest support
This adds a few last pieces of the support for radix guests:

* Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
  KVM_PPC_GET_RMMU_INFO ioctls for radix guests

* On POWER9, allow secondary threads to be on/off-lined while guests
  are running.

* Set up LPCR and the partition table entry for radix guests.

* Don't allocate the rmap array in the kvm_memory_slot structure
  on radix.

* Don't try to initialize the HPT for radix guests, since they don't
  have an HPT.

* Take out the code that prevents the HV KVM module from
  initializing on radix hosts.

At this stage, we only support radix guests if the host is running
in radix mode, and only support HPT guests if the host is running in
HPT mode.  Thus a guest cannot switch from one mode to the other,
which enables some simplifications.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:52 +11:00
Paul Mackerras
f11f6f79b6 KVM: PPC: Book3S HV: Invalidate ERAT on guest entry/exit for POWER9 DD1
On POWER9 DD1, we need to invalidate the ERAT (effective to real
address translation cache) when changing the PIDR register, which
we do as part of guest entry and exit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:52 +11:00
Paul Mackerras
53af3ba2e8 KVM: PPC: Book3S HV: Allow guest exit path to have MMU on
If we allow LPCR[AIL] to be set for radix guests, then interrupts from
the guest to the host can be delivered by the hardware with relocation
on, and thus the code path starting at kvmppc_interrupt_hv can be
executed in virtual mode (MMU on) for radix guests (previously it was
only ever executed in real mode).

Most of the code is indifferent to whether the MMU is on or off, but
the calls to OPAL that use the real-mode OPAL entry code need to
be switched to use the virtual-mode code instead.  The affected
calls are the calls to the OPAL XICS emulation functions in
kvmppc_read_one_intr() and related functions.  We test the MSR[IR]
bit to detect whether we are in real or virtual mode, and call the
opal_rm_* or opal_* function as appropriate.

The other place that depends on the MMU being off is the optimization
where the guest exit code jumps to the external interrupt vector or
hypervisor doorbell interrupt vector, or returns to its caller (which
is __kvmppc_vcore_entry).  If the MMU is on and we are returning to
the caller, then we don't need to use an rfid instruction since the
MMU is already on; a simple blr suffices.  If there is an external
or hypervisor doorbell interrupt to handle, we branch to the
relocation-on version of the interrupt vector.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:51 +11:00
Paul Mackerras
a29ebeaf55 KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions.  Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu.  However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another.  Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly.  The usual symptom of this
is random segfaults in userspace programs in the guest.

To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed.  If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest.  This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush.  The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:

	CPU 0			CPU 1			CPU 4
	(core 0)			(core 0)			(core 1)

	VCPU 0 runs task X      VCPU 1 runs
	core 0 TLB gets
	entries from task X
	VCPU 0 moves to CPU 4
							VCPU 0 runs task X
							Unmap pages of task X
							tlbiel

				(still VCPU 1)			task X moves to VCPU 1
				task X runs
				task X sees stale TLB
				entries

That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.

Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core.  To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it.  This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:51 +11:00
Paul Mackerras
65dae5403a KVM: PPC: Book3S HV: Make HPT-specific hypercalls return error in radix mode
If the guest is in radix mode, then it doesn't have a hashed page
table (HPT), so all of the hypercalls that manipulate the HPT can't
work and should return an error.  This adds checks to make them
return H_FUNCTION ("function not supported").

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:50 +11:00
Paul Mackerras
8f7b79b837 KVM: PPC: Book3S HV: Implement dirty page logging for radix guests
This adds code to keep track of dirty pages when requested (that is,
when memslot->dirty_bitmap is non-NULL) for radix guests.  We use the
dirty bits in the PTEs in the second-level (partition-scoped) page
tables, together with a bitmap of pages that were dirty when their
PTE was invalidated (e.g., when the page was paged out).  This bitmap
is stored in the first half of the memslot->dirty_bitmap area, and
kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
bitmap that gets returned to userspace.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:50 +11:00
Paul Mackerras
01756099e0 KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests
This adapts our implementations of the MMU notifier callbacks
(unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
to call radix functions when the guest is using radix.  These
implementations are much simpler than for HPT guests because we
have only one PTE to deal with, so we don't need to traverse
rmap chains.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:49 +11:00
Paul Mackerras
5a319350a4 KVM: PPC: Book3S HV: Page table construction and page faults for radix guests
This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU.  Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.

As well as hypervisor page faults for missing pages, we also get faults
for reference/change (RC) bits needing to be set, as well as various
other error conditions.  For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.

This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables.  There is no support for 1GB huge pages
yet.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:49 +11:00
Paul Mackerras
f4c51f841d KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests
This adds code to  branch around the parts that radix guests don't
need - clearing and loading the SLB with the guest SLB contents,
saving the guest SLB contents on exit, and restoring the host SLB
contents.

Since the host is now using radix, we need to save and restore the
host value for the PID register.

On hypervisor data/instruction storage interrupts, we don't do the
guest HPT lookup on radix, but just save the guest physical address
for the fault (from the ASDR register) in the vcpu struct.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:48 +11:00
Paul Mackerras
9e04ba69be KVM: PPC: Book3S HV: Add basic infrastructure for radix guests
This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:48 +11:00
Paul Mackerras
ef8c640cb9 KVM: PPC: Book3S HV: Use ASDR for HPT guests on POWER9
POWER9 adds a register called ASDR (Access Segment Descriptor
Register), which is set by hypervisor data/instruction storage
interrupts to contain the segment descriptor for the address
being accessed, assuming the guest is using HPT translation.
(For radix guests, it contains the guest real address of the
access.)

Thus, for HPT guests on POWER9, we can use this register rather
than looking up the SLB with the slbfee. instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:48 +11:00
Paul Mackerras
468808bd35 KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9
This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
for HPT guests on POWER9.  With this, we can return 1 for the
KVM_CAP_PPC_MMU_HASH_V3 capability.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:47 +11:00
Paul Mackerras
c927013227 KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU
This adds two capabilities and two ioctls to allow userspace to
find out about and configure the POWER9 MMU in a guest.  The two
capabilities tell userspace whether KVM can support a guest using
the radix MMU, or using the hashed page table (HPT) MMU with a
process table and segment tables.  (Note that the MMUs in the
POWER9 processor cores do not use the process and segment tables
when in HPT mode, but the nest MMU does).

The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
whether a guest will use the radix MMU or the HPT MMU, and to
specify the size and location (in guest space) of the process
table.

The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
the radix MMU.  It returns a list of supported radix tree geometries
(base page size and number of bits indexed at each level of the
radix tree) and the encoding used to specify the various page
sizes for the TLB invalidate entry instruction.

Initially, both capabilities return 0 and the ioctls return -EINVAL,
until the necessary infrastructure for them to operate correctly
is added.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:11:47 +11:00
Nicholas Piggin
a97a65d53d KVM: PPC: Book3S: 64-bit CONFIG_RELOCATABLE support for interrupts
64-bit Book3S exception handlers must find the dynamic kernel base
to add to the target address when branching beyond __end_interrupts,
in order to support kernel running at non-0 physical address.

Support this in KVM by branching with CTR, similarly to regular
interrupt handlers. The guest CTR saved in HSTATE_SCRATCH1 and
restored after the branch.

Without this, the host kernel hangs and crashes randomly when it is
running at a non-0 address and a KVM guest is started.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-31 19:07:39 +11:00
Thomas Huth
fcd4f3c6d1 KVM: PPC: Book3S PR: Refactor program interrupt related code into separate function
The function kvmppc_handle_exit_pr() is quite huge and thus hard to read,
and even contains a "spaghetti-code"-like goto between the different case
labels of the big switch statement. This can be made much more readable
by moving the code related to injecting program interrupts / instruction
emulation into a separate function instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 20:34:28 +11:00
Paul Mackerras
8464c8842d KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpu
The H_PROD hypercall is supposed to wake up an idle vcpu.  We have
an implementation, but because Linux doesn't use it except when
doing cpu hotplug, it was never tested properly.  AIX does use it,
and reported it broken.  It turns out we were waking the wrong
vcpu (the one doing H_PROD, not the target of the prod) and we
weren't handling the case where the target needs an IPI to wake
it.  Fix it by using the existing kvmppc_fast_vcpu_kick_hv()
function, which is intended for this kind of thing, and by using
the target vcpu not the current vcpu.

We were also not looking at the prodded flag when checking whether a
ceded vcpu should wake up, so this adds checks for the prodded flag
alongside the checks for pending exceptions.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 20:23:13 +11:00
Nicholas Piggin
d3918e7fd4 KVM: PPC: Book3S: Change interrupt call to reduce scratch space use on HV
Change the calling convention to put the trap number together with
CR in two halves of r12, which frees up HSTATE_SCRATCH2 in the HV
handler.

The 64-bit PR handler entry translates the calling convention back
to match the previous call convention (i.e., shared with 32-bit), for
simplicity.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-27 15:41:20 +11:00
Li Zhong
21acd0e4df KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend
This patch improves the code that takes lock twice to check the resend flag
and do the actual resending, by checking the resend flag locklessly, and
add a boolean parameter check_resend to icp_[rm_]deliver_irq(), so the
resend flag can be checked in the lock when doing the delivery.

We need make sure when we clear the ics's bit in the icp's resend_map, we
don't miss the resend flag of the irqs that set the bit. It could be
ordered through the barrier in test_and_clear_bit(), and a newly added
wmb between setting irq's resend flag, and icp's resend_map.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 10:27:21 +11:00
Li Zhong
17d48610ae KVM: PPC: Book 3S: XICS: Implement ICS P/Q states
This patch implements P(Presented)/Q(Queued) states for ICS irqs.

When the interrupt is presented, set P. Present if P was not set.
If P is already set, don't present again, set Q.
When the interrupt is EOI'ed, move Q into P (and clear Q). If it is
set, re-present.

The asserted flag used by LSI is also incorporated into the P bit.

When the irq state is saved, P/Q bits are also saved, they need some
qemu modifications to be recognized and passed around to be restored.
KVM_XICS_PENDING bit set and saved should also indicate
KVM_XICS_PRESENTED bit set and saved. But it is possible some old
code doesn't have/recognize the P bit, so when we restore, we set P
for PENDING bit, too.

The idea and much of the code come from Ben.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 10:27:02 +11:00
Li Zhong
bf5a71d538 KVM: PPC: Book 3S: XICS: Fix potential issue with duplicate IRQ resends
It is possible that in the following order, one irq is resent twice:

        CPU 1                                   CPU 2

ics_check_resend()
  lock ics_lock
    see resend set
  unlock ics_lock
                                       /* change affinity of the irq */
                                       kvmppc_xics_set_xive()
                                         write_xive()
                                           lock ics_lock
                                             see resend set
                                           unlock ics_lock

                                         icp_deliver_irq() /* resend */

  icp_deliver_irq() /* resend again */

It doesn't have any user-visible effect at present, but needs to be avoided
when the following patch implementing the P/Q stuff is applied.

This patch clears the resend flag before releasing the ics lock, when we
know we will do a re-delivery after checking the flag, or setting the flag.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 10:26:09 +11:00
Li Zhong
37451bc95d KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
Some counters are added in Commit 6e0365b782 ("KVM: PPC: Book3S HV:
Add ICP real mode counters"), to provide some performance statistics to
determine whether further optimizing is needed for real mode functions.

The n_reject counter counts how many times ICP rejects an irq because of
priority in real mode. The redelivery of an lsi that is still asserted
after eoi doesn't fall into this category, so the increasement there is
removed.

Also, it needs to be increased in icp_rm_deliver_irq() if it rejects
another one.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 10:25:44 +11:00
Li Zhong
5efa660515 KVM: PPC: Book 3S: XICS cleanup: remove XICS_RM_REJECT
Commit b0221556db ("KVM: PPC: Book3S HV: Move virtual mode ICP functions
 to real-mode") removed the setting of the XICS_RM_REJECT flag. And
since that commit, nothing else sets the flag any more, so we can remove
the flag and the remaining code that handles it, including the counter
that counts how many times it get set.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 10:25:06 +11:00
Paul Mackerras
3deda5e50c KVM: PPC: Book3S HV: Don't try to signal cpu -1
If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on
any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it
happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with
-1 as the cpu argument.  Although this is not meaningful, in the past,
before commit 1704a81cce ("KVM: PPC: Book3S HV: Use msgsnd for IPIs
to other cores on POWER9", 2016-11-18), it was harmless because CPU
-1 is not in the same core as any real CPU thread.  On a POWER9,
however, we don't do the "same core" check, so we were trying to
do a msgsnd to thread -1, which is invalid.  To avoid this, we add
a check to see that vcpu->arch.thread_cpu is >= 0 before calling
kvmppc_ipi_thread() with it.  Since vcpu->arch.thread_vcpu can change
asynchronously, we use READ_ONCE to ensure that the value we check is
the same value that we use as the argument to kvmppc_ipi_thread().

Fixes: 1704a81cce ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-01-27 08:58:34 +11:00
Thomas Gleixner
8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
93173b5bf2 Small release, the most interesting stuff is x86 nested virt improvements.
x86: userspace can now hide nested VMX features from guests; nested
 VMX can now run Hyper-V in a guest; support for AVX512_4VNNIW and
 AVX512_FMAPS in KVM; infrastructure support for virtual Intel GPUs.
 
 PPC: support for KVM guests on POWER9; improved support for interrupt
 polling; optimizations and cleanups.
 
 s390: two small optimizations, more stuff is in flight and will be
 in 4.11.
 
 ARM: support for the GICv3 ITS on 32bit platforms.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQExBAABCAAbBQJYTkP0FBxwYm9uemluaUByZWRoYXQuY29tAAoJEL/70l94x66D
 lZIH/iT1n9OQXcuTpYYnQhuCenzI3GZZOIMTbCvK2i5bo0FIJKxVn0EiAAqZSXvO
 nO185FqjOgLuJ1AD1kJuxzye5suuQp4HIPWWgNHcexLuy43WXWKZe0IQlJ4zM2Xf
 u31HakpFmVDD+Cd1qN3yDXtDrRQ79/xQn2kw7CWb8olp+pVqwbceN3IVie9QYU+3
 gCz0qU6As0aQIwq2PyalOe03sO10PZlm4XhsoXgWPG7P18BMRhNLTDqhLhu7A/ry
 qElVMANT7LSNLzlwNdpzdK8rVuKxETwjlc1UP8vSuhrwad4zM2JJ1Exk26nC2NaG
 D0j4tRSyGFIdx6lukZm7HmiSHZ0=
 =mkoB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small release, the most interesting stuff is x86 nested virt
  improvements.

  x86:
   - userspace can now hide nested VMX features from guests
   - nested VMX can now run Hyper-V in a guest
   - support for AVX512_4VNNIW and AVX512_FMAPS in KVM
   - infrastructure support for virtual Intel GPUs.

  PPC:
   - support for KVM guests on POWER9
   - improved support for interrupt polling
   - optimizations and cleanups.

  s390:
   - two small optimizations, more stuff is in flight and will be in
     4.11.

  ARM:
   - support for the GICv3 ITS on 32bit platforms"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
  arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
  KVM: arm/arm64: timer: Check for properly initialized timer on init
  KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
  KVM: x86: Handle the kthread worker using the new API
  KVM: nVMX: invvpid handling improvements
  KVM: nVMX: check host CR3 on vmentry and vmexit
  KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
  KVM: nVMX: propagate errors from prepare_vmcs02
  KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
  KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: generate non-true VMX MSRs based on true versions
  KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
  KVM: x86: Add kvm_skip_emulated_instruction and use it.
  KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
  KVM: VMX: Reorder some skip_emulated_instruction calls
  KVM: x86: Add a return value to kvm_emulate_cpuid
  KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
  ...
2016-12-13 15:47:02 -08:00
Anna-Maria Gleixner
3f7cd919f3 KVM/PPC/Book3S HV: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm-ppc@vger.kernel.org
Cc: Paul Mackerras <paulus@samba.org>
Cc: rt@linutronix.de
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alexander Graf <agraf@suse.com>
Link: http://lkml.kernel.org/r/20161126231350.10321-18-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02 00:52:38 +01:00
Paul Mackerras
e34af78490 KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
This moves the prototypes for functions that are only called from
assembler code out of asm/asm-prototypes.h into asm/kvm_ppc.h.
The prototypes were added in commit ebe4535fbe ("KVM: PPC:
Book3S HV: sparse: prototypes for functions called from assembler",
2016-10-10), but given that the functions are KVM functions,
having them in a KVM header will be better for long-term
maintenance.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-12-01 14:03:46 +11:00
Suraj Jitindar Singh
908a09359e KVM: PPC: Book3S HV: Comment style and print format fixups
Fix comment block to match kernel comment style.

Fix print format from signed to unsigned.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-28 11:48:47 +11:00
Suraj Jitindar Singh
e03f3921e5 KVM: PPC: Book3S HV: Add check for module parameter halt_poll_ns
The kvm module parameter halt_poll_ns defines the global maximum halt
polling interval and can be dynamically changed by writing to the
/sys/module/kvm/parameters/halt_poll_ns sysfs file. However in kvm-hv
this module parameter value is only ever checked when we grow the current
polling interval for the given vcore. This means that if we decrease the
halt_poll_ns value below the current polling interval we won't see any
effect unless we try to grow the polling interval above the new max at some
point or it happens to be shrunk below the halt_poll_ns value.

Update the halt polling code so that we always check for a new module param
value of halt_poll_ns and set the current halt polling interval to it if
it's currently greater than the new max. This means that it's redundant to
also perform this check in the grow_halt_poll_ns() function now.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-28 11:48:47 +11:00
Suraj Jitindar Singh
307d93e476 KVM: PPC: Book3S HV: Use generic kvm module parameters
The previous patch exported the variables which back the module parameters
of the generic kvm module. Now use these variables in the kvm-hv module
so that any change to the generic module parameters will also have the
same effect for the kvm-hv module. This removes the duplication of the
kvm module parameters which was redundant and should reduce confusion when
tuning them.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-28 11:48:47 +11:00
David Gibson
a8acaece5d KVM: PPC: Correctly report KVM_CAP_PPC_ALLOC_HTAB
At present KVM on powerpc always reports KVM_CAP_PPC_ALLOC_HTAB as enabled.
However, the ioctl() it advertises (KVM_PPC_ALLOCATE_HTAB) only actually
works on KVM HV.  On KVM PR it will fail with ENOTTY.

QEMU already has a workaround for this, so it's not breaking things in
practice, but it would be better to advertise this correctly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 14:23:05 +11:00
Paul Mackerras
e2702871b4 KVM: PPC: Book3S HV: Fix compilation with unusual configurations
This adds the "again" parameter to the dummy version of
kvmppc_check_passthru(), so that it matches the real version.
This fixes compilation with CONFIG_BOOK3S_64_HV set but
CONFIG_KVM_XICS=n.

This includes asm/smp.h in book3s_hv_builtin.c to fix compilation
with CONFIG_SMP=n.  The explicit inclusion is necessary to provide
definitions of hard_smp_processor_id() and get_hard_smp_processor_id()
in UP configs.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 14:20:00 +11:00
Suraj Jitindar Singh
2ee13be34b KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00
The function kvmppc_set_arch_compat() is used to determine the value of the
processor compatibility register (PCR) for a guest running in a given
compatibility mode. There is currently no support for v3.00 of the ISA.

Add support for v3.00 of the ISA which adds an ISA v2.07 compatilibity mode
to the PCR.

We also add a check to ensure the processor we are running on is capable of
emulating the chosen processor (for example a POWER7 cannot emulate a
POWER8, similarly with a POWER8 and a POWER9).

Based on work by: Paul Mackerras <paulus@ozlabs.org>

[paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
 guest_pcr_bit when arch_compat == 0, added comment.]

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
45c940ba49 KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores
With POWER9, each CPU thread has its own MMU context and can be
in the host or a guest independently of the other threads; there is
still however a restriction that all threads must use the same type
of address translation, either radix tree or hashed page table (HPT).

Since we only support HPT guests on a HPT host at this point, we
can treat the threads as being independent, and avoid all of the
work of coordinating the CPU threads.  To make this simpler, we
introduce a new threads_per_vcore() function that returns 1 on
POWER9 and threads_per_subcore on POWER7/8, and use that instead
of threads_per_subcore or threads_per_core in various places.

This also changes the value of the KVM_CAP_PPC_SMT capability on
POWER9 systems from 4 to 1, so that userspace will not try to
create VMs with multiple vcpus per vcore.  (If userspace did create
a VM that thought it was in an SMT mode, the VM might try to use
the msgsndp instruction, which will not work as expected.  In
future it may be possible to trap and emulate msgsndp in order to
allow VMs to think they are in an SMT mode, if only for the purpose
of allowing migration from POWER8 systems.)

With all this, we can now run guests on POWER9 as long as the host
is running with HPT translation.  Since userspace currently has no
way to request radix tree translation for the guest, the guest has
no choice but to use HPT translation.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
84f7139c06 KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest
The new XIVE interrupt controller on POWER9 can direct external
interrupts to the hypervisor or the guest.  The interrupts directed to
the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
come in as a "hypervisor virtualization interrupt".  This sets the
LPCR bit so that hypervisor virtualization interrupts can occur while
we are in the guest.  We then also need to cope with exiting the guest
because of a hypervisor virtualization interrupt.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
bf53c88e42 KVM: PPC: Book3S HV: Use stop instruction rather than nap on POWER9
POWER9 replaces the various power-saving mode instructions on POWER8
(doze, nap, sleep and rvwinkle) with a single "stop" instruction, plus
a register, PSSCR, which controls the depth of the power-saving mode.
This replaces the use of the nap instruction when threads are idle
during guest execution with the stop instruction, and adds code to
set PSSCR to a value which will allow an SMT mode switch while the
thread is idle (given that the core as a whole won't be idle in these
cases).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
f725758b89 KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9
POWER9 includes a new interrupt controller, called XIVE, which is
quite different from the XICS interrupt controller on POWER7 and
POWER8 machines.  KVM-HV accesses the XICS directly in several places
in order to send and clear IPIs and handle interrupts from PCI
devices being passed through to the guest.

In order to make the transition to XIVE easier, OPAL firmware will
include an emulation of XICS on top of XIVE.  Access to the emulated
XICS is via OPAL calls.  The one complication is that the EOI
(end-of-interrupt) function can now return a value indicating that
another interrupt is pending; in this case, the XIVE will not signal
an interrupt in hardware to the CPU, and software is supposed to
acknowledge the new interrupt without waiting for another interrupt
to be delivered in hardware.

This adapts KVM-HV to use the OPAL calls on machines where there is
no XICS hardware.  When there is no XICS, we look for a device-tree
node with "ibm,opal-intc" in its compatible property, which is how
OPAL indicates that it provides XICS emulation.

In order to handle the EOI return value, kvmppc_read_intr() has
become kvmppc_read_one_intr(), with a boolean variable passed by
reference which can be set by the EOI functions to indicate that
another interrupt is pending.  The new kvmppc_read_intr() keeps
calling kvmppc_read_one_intr() until there are no more interrupts
to process.  The return value from kvmppc_read_intr() is the
largest non-zero value of the returns from kvmppc_read_one_intr().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
1704a81cce KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9
On POWER9, the msgsnd instruction is able to send interrupts to
other cores, as well as other threads on the local core.  Since
msgsnd is generally simpler and faster than sending an IPI via the
XICS, we use msgsnd for all IPIs sent by KVM on POWER9.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
7c5b06cadf KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9
POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
and tlbiel (local tlbie) instructions.  Both instructions get a
set of new parameters (RIC, PRS and R) which appear as bits in the
instruction word.  The tlbiel instruction now has a second register
operand, which contains a PID and/or LPID value if needed, and
should otherwise contain 0.

This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
as well as older processors.  Since we only handle HPT guests so
far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
word as on previous processors, so we don't need to conditionally
execute different instructions depending on the processor.

The local flush on first entry to a guest in book3s_hv_rmhandlers.S
is a loop which depends on the number of TLB sets.  Rather than
using feature sections to set the number of iterations based on
which CPU we're on, we now work out this number at VM creation time
and store it in the kvm_arch struct.  That will make it possible to
get the number from the device tree in future, which will help with
compatibility with future processors.

Since mmu_partition_table_set_entry() does a global flush of the
whole LPID, we don't need to do the TLB flush on first entry to the
guest on each processor.  Therefore we don't set all bits in the
tlb_need_flush bitmap on VM startup on POWER9.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
e9cf1e0856 KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs
This adds code to handle two new guest-accessible special-purpose
registers on POWER9: TIDR (thread ID register) and PSSCR (processor
stop status and control register).  They are context-switched
between host and guest, and the guest values can be read and set
via the one_reg interface.

The PSSCR contains some fields which are guest-accessible and some
which are only accessible in hypervisor mode.  We only allow the
guest-accessible fields to be read or set by userspace.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
83677f551e KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9
Some special-purpose registers that were present and accessible
by guests on POWER8 no longer exist on POWER9, so this adds
feature sections to ensure that we don't try to context-switch
them when going into or out of a guest on POWER9.  These are
all relatively obscure, rarely-used registers, but we had to
context-switch them on POWER8 to avoid creating a covert channel.
They are: SPMC1, SPMC2, MMCRS, CSIGR, TACR, TCSCR, and ACOP.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
7a84084c60 KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9
On POWER9, the SDR1 register (hashed page table base address) is no
longer used, and instead the hardware reads the HPT base address
and size from the partition table.  The partition table entry also
contains the bits that specify the page size for the VRMA mapping,
which were previously in the LPCR.  The VPM0 bit of the LPCR is
now reserved; the processor now always uses the VRMA (virtual
real-mode area) mechanism for guest real-mode accesses in HPT mode,
and the RMO (real-mode offset) mechanism has been dropped.

When entering or exiting the guest, we now only have to set the
LPIDR (logical partition ID register), not the SDR1 register.
There is also no requirement now to transition via a reserved
LPID value.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Paul Mackerras
abb7c7ddba KVM: PPC: Book3S HV: Adapt to new HPTE format on POWER9
This adapts the KVM-HV hashed page table (HPT) code to read and write
HPT entries in the new format defined in Power ISA v3.00 on POWER9
machines.  The new format moves the B (segment size) field from the
first doubleword to the second, and trims some bits from the AVA
(abbreviated virtual address) and ARPN (abbreviated real page number)
fields.  As far as possible, the conversion is done when reading or
writing the HPT entries, and the rest of the code continues to use
the old format.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-24 09:24:23 +11:00
Geliang Tang
68b8b72bb2 KVM: PPC: Book3S HV: Drop duplicate header asm/iommu.h
Drop duplicate header asm/iommu.h from book3s_64_vio_hv.c.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:29:26 +11:00
Paul Mackerras
f064a0de15 KVM: PPC: Book3S HV: Don't lose hardware R/C bit updates in H_PROTECT
The hashed page table MMU in POWER processors can update the R
(reference) and C (change) bits in a HPTE at any time until the
HPTE has been invalidated and the TLB invalidation sequence has
completed.  In kvmppc_h_protect, which implements the H_PROTECT
hypercall, we read the HPTE, modify the second doubleword,
invalidate the HPTE in memory, do the TLB invalidation sequence,
and then write the modified value of the second doubleword back
to memory.  In doing so we could overwrite an R/C bit update done
by hardware between when we read the HPTE and when the TLB
invalidation completed.  To fix this we re-read the second
doubleword after the TLB invalidation and OR in the (possibly)
new values of R and C.  We can use an OR since hardware only ever
sets R and C, never clears them.

This race was found by code inspection.  In principle this bug could
cause occasional guest memory corruption under host memory pressure.

Fixes: a8606e20e4 ("KVM: PPC: Handle some PAPR hcalls in the kernel", 2011-06-29)
Cc: stable@vger.kernel.org # v3.19+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:29:20 +11:00
Paul Mackerras
0d808df06a KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state
When switching from/to a guest that has a transaction in progress,
we need to save/restore the checkpointed register state.  Although
XER is part of the CPU state that gets checkpointed, the code that
does this saving and restoring doesn't save/restore XER.

This fixes it by saving and restoring the XER.  To allow userspace
to read/write the checkpointed XER value, we also add a new ONE_REG
specifier.

The visible effect of this bug is that the guest may see its XER
value being corrupted when it uses transactions.

Fixes: e4e3812150 ("KVM: PPC: Book3S HV: Add transactional memory support")
Fixes: 0a8eccefcb ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:17:55 +11:00
Yongji Xie
a56ee9f8f0 KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().

In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:17:55 +11:00
Yongji Xie
f05859827d KVM: PPC: Book3S HV: Clear the key field of HPTE when the page is paged out
Currently we mark a HPTE for emulated MMIO with HPTE_V_ABSENT bit
set as well as key 0x1f. However, those HPTEs may be conflicted with
the HPTE for real guest RAM page HPTE with key 0x1f when the page
get paged out.

This patch clears the key field of HPTE when the page is paged out,
then recover it when HPTE is re-established.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:17:55 +11:00
Wei Yongjun
28d057c897 KVM: PPC: Book3S HV: Use list_move_tail instead of list_del/list_add_tail
Using list_move_tail() instead of list_del() + list_add_tail().

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:17:55 +11:00
Daniel Axtens
ebe4535fbe KVM: PPC: Book3S HV: sparse: prototypes for functions called from assembler
A bunch of KVM functions are only called from assembler.
Give them prototypes in asm-prototypes.h
This reduces sparse warnings.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:17:54 +11:00
Daniel Axtens
025c951138 KVM: PPC: Book3S HV: Fix sparse static warning
Squash a couple of sparse warnings by making things static.

Build tested.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-21 15:16:53 +11:00
Michael Ellerman
62623d5f91 KVM: PPC: Book3S HV: Fix build error when SMP=n
Commit 5d375199ea ("KVM: PPC: Book3S HV: Set server for passed-through
interrupts") broke the SMP=n build:

  arch/powerpc/kvm/book3s_hv_rm_xics.c:758:2: error: implicit declaration of function 'get_hard_smp_processor_id'

That is because we lost the implicit include of asm/smp.h, so include it
explicitly to get the definition for get_hard_smp_processor_id().

Fixes: 5d375199ea ("KVM: PPC: Book3S HV: Set server for passed-through interrupts")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-10-22 08:44:37 +11:00
Thomas Huth
fa73c3b25b KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
The MMCR2 register is available twice, one time with number 785
(privileged access), and one time with number 769 (unprivileged,
but it can be disabled completely). In former times, the Linux
kernel was using the unprivileged register 769 only, but since
commit 8dd75ccb57 ("powerpc: Use privileged SPR number
for MMCR2"), it uses the privileged register 785 instead.
The KVM-PR code then of course also switched to use the SPR 785,
but this is causing older guest kernels to crash, since these
kernels still access 769 instead. So to support older kernels
with KVM-PR again, we have to support register 769 in KVM-PR, too.

Fixes: 8dd75ccb57
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-27 15:14:29 +10:00
Thomas Huth
2365f6b67c KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
On POWER8E and POWER8NVL, KVM-PR does not announce support for
64kB page sizes and 1TB segments yet. Looks like this has just
been forgotton so far, since there is no reason why this should
be different to the normal POWER8 CPUs.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-27 15:14:29 +10:00
Dan Carpenter
ac0e89bb47 KVM: PPC: BookE: Fix a sanity check
We use logical negate where bitwise negate was intended.  It means that
we never return -EINVAL here.

Fixes: ce11e48b7f ('KVM: PPC: E500: Add userspace debug stub support')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-27 15:14:29 +10:00
Paul Mackerras
b009031f74 KVM: PPC: Book3S HV: Take out virtual core piggybacking code
This takes out the code that arranges to run two (or more) virtual
cores on a single subcore when possible, that is, when both vcores
are from the same VM, the VM is configured with one CPU thread per
virtual core, and all the per-subcore registers have the same value
in each vcore.  Since the VTB (virtual timebase) is a per-subcore
register, and will almost always differ between vcores, this code
is disabled on POWER8 machines, meaning that it is only usable on
POWER7 machines (which don't have VTB).  Given the tiny number of
POWER7 machines which have firmware that allows them to run HV KVM,
the benefit of simplifying the code outweighs the loss of this
feature on POWER7 machines.

Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-27 14:42:07 +10:00
Paul Mackerras
88b02cf97b KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
POWER8 has one virtual timebase (VTB) register per subcore, not one
per CPU thread.  The HV KVM code currently treats VTB as a per-thread
register, which can lead to spurious soft lockup messages from guests
which use the VTB as the time source for the soft lockup detector.
(CPUs before POWER8 did not have the VTB register.)

For HV KVM, this fixes the problem by making only the primary thread
in each virtual core save and restore the VTB value.  With this,
the VTB state becomes part of the kvmppc_vcore structure.  This
also means that "piggybacking" of multiple virtual cores onto one
subcore is not possible on POWER8, because then the virtual cores
would share a single VTB register.

PR KVM emulates a VTB register, which is per-vcpu because PR KVM
has no notion of CPU threads or SMT.  For PR KVM we move the VTB
state into the kvmppc_vcpu_book3s struct.

Cc: stable@vger.kernel.org # v3.14+
Reported-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-27 14:41:39 +10:00
Luiz Capitulino
235539b48a kvm: add stubs for arch specific debugfs support
Two stubs are added:

 o kvm_arch_has_vcpu_debugfs(): must return true if the arch
   supports creating debugfs entries in the vcpu debugfs dir
   (which will be implemented by the next commit)

 o kvm_arch_create_vcpu_debugfs(): code that creates debugfs
   entries in the vcpu debugfs dir

For x86, this commit introduces a new file to avoid growing
arch/x86/kvm/x86.c even more.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:47 +02:00
Markus Elfring
aad9e5ba24 KVM: PPC: e500: Rename jump labels in kvmppc_e500_tlb_init()
Adjust jump labels according to the current Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-13 14:32:47 +10:00
Markus Elfring
90235dc19e KVM: PPC: e500: Use kmalloc_array() in kvmppc_e500_tlb_init()
* A multiplication for the size determination of a memory allocation
  indicated that an array data structure should be processed.
  Thus use the corresponding function "kmalloc_array".

* Replace the specification of a data structure by a pointer dereference
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:13:01 +10:00
Markus Elfring
b0ac477bc4 KVM: PPC: e500: Replace kzalloc() calls by kcalloc() in two functions
* A multiplication for the size determination of a memory allocation
  indicated that an array data structure should be processed.
  Thus use the corresponding function "kcalloc".

  Suggested-by: Paolo Bonzini <pbonzini@redhat.com>

  This issue was detected also by using the Coccinelle software.

* Replace the specification of data structures by pointer dereferences
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:56 +10:00
Markus Elfring
cfb60813fb KVM: PPC: e500: Delete an unnecessary initialisation in kvm_vcpu_ioctl_config_tlb()
The local variable "g2h_bitmap" will be set to an appropriate value
a bit later. Thus omit the explicit initialisation at the beginning.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:51 +10:00
Markus Elfring
46d4e74792 KVM: PPC: e500: Less function calls in kvm_vcpu_ioctl_config_tlb() after error detection
The kfree() function was called in two cases by the
kvm_vcpu_ioctl_config_tlb() function during error handling
even if the passed data structure element contained a null pointer.

* Split a condition check for memory allocation failures.

* Adjust jump targets according to the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:46 +10:00
Markus Elfring
f3c0ce86a8 KVM: PPC: e500: Use kmalloc_array() in kvm_vcpu_ioctl_config_tlb()
* A multiplication for the size determination of a memory allocation
  indicated that an array data structure should be processed.
  Thus use the corresponding function "kmalloc_array".

  This issue was detected by using the Coccinelle software.

* Replace the specification of a data type by a pointer dereference
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:40 +10:00
Suresh Warrier
65e7026a6c KVM: PPC: Book3S HV: Counters for passthrough IRQ stats
Add VCPU stat counters to track affinity for passthrough
interrupts.

pthru_all: Counts all passthrough interrupts whose IRQ mappings are
           in the kvmppc_passthru_irq_map structure.
pthru_host: Counts all cached passthrough interrupts that were injected
	    from the host through kvm_set_irq (i.e. not handled in
	    real mode).
pthru_bad_aff: Counts how many cached passthrough interrupts have
               bad affinity (receiving CPU is not running VCPU that is
	       the target of the virtual interrupt in the guest).

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:34 +10:00
Paul Mackerras
5d375199ea KVM: PPC: Book3S HV: Set server for passed-through interrupts
When a guest has a PCI pass-through device with an interrupt, it
will direct the interrupt to a particular guest VCPU.  In fact the
physical interrupt might arrive on any CPU, and then get
delivered to the target VCPU in the emulated XICS (guest interrupt
controller), and eventually delivered to the target VCPU.

Now that we have code to handle device interrupts in real mode
without exiting to the host kernel, there is an advantage to having
the device interrupt arrive on the same sub(core) as the target
VCPU is running on.  In this situation, the interrupt can be
delivered to the target VCPU without any exit to the host kernel
(using a hypervisor doorbell interrupt between threads if
necessary).

This patch aims to get passed-through device interrupts arriving
on the correct core by setting the interrupt server in the real
hardware XICS for the interrupt to the first thread in the (sub)core
where its target VCPU is running.  We do this in the real-mode H_EOI
code because the H_EOI handler already needs to look at the
emulated ICS state for the interrupt (whereas the H_XIRR handler
doesn't), and we know we are running in the target VCPU context
at that point.

We set the server CPU in hardware using an OPAL call, regardless of
what the IRQ affinity mask for the interrupt says, and without
updating the affinity mask.  This amounts to saying that when an
interrupt is passed through to a guest, as a matter of policy we
allow the guest's affinity for the interrupt to override the host's.

This is inspired by an earlier patch from Suresh Warrier, although
none of this code came from that earlier patch.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:28 +10:00
Suresh Warrier
366274f59c KVM: PPC: Book3S HV: Update irq stats for IRQs handled in real mode
When a passthrough IRQ is handled completely within KVM real
mode code, it has to also update the IRQ stats since this
does not go through the generic IRQ handling code.

However, the per CPU kstat_irqs field is an allocated (not static)
field and so cannot be directly accessed in real mode safely.

The function this_cpu_inc_rm() is introduced to safely increment
per CPU fields (currently coded for unsigned integers only) that
are allocated and could thus be vmalloced also.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:23 +10:00
Suresh Warrier
644abbb254 KVM: PPC: Book3S HV: Tunable to disable KVM IRQ bypass
Add a  module parameter kvm_irq_bypass for kvm_hv.ko to
disable IRQ bypass for passthrough interrupts. The default
value of this tunable is 1 - that is enable the feature.

Since the tunable is used by built-in kernel code, we use
the module_param_cb macro to achieve this.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:18 +10:00
Suresh Warrier
af893c7dc9 KVM: PPC: Book3S HV: Dump irqmap in debugfs
Dump the passthrough irqmap structure associated with a
guest as part of /sys/kernel/debug/powerpc/kvm-xics-*.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:13 +10:00
Suresh Warrier
f7af5209b8 KVM: PPC: Book3S HV: Complete passthrough interrupt in host
In existing real mode ICP code, when updating the virtual ICP
state, if there is a required action that cannot be completely
handled in real mode, as for instance, a VCPU needs to be woken
up, flags are set in the ICP to indicate the required action.
This is checked when returning from hypercalls to decide whether
the call needs switch back to the host where the action can be
performed in virtual mode. Note that if h_ipi_redirect is enabled,
real mode code will first try to message a free host CPU to
complete this job instead of returning the host to do it ourselves.

Currently, the real mode PCI passthrough interrupt handling code
checks if any of these flags are set and simply returns to the host.
This is not good enough as the trap value (0x500) is treated as an
external interrupt by the host code. It is only when the trap value
is a hypercall that the host code searches for and acts on unfinished
work by calling kvmppc_xics_rm_complete.

This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD
which is returned by KVM if there is unfinished business to be
completed in host virtual mode after handling a PCI passthrough
interrupt. The host checks for this special interrupt condition
and calls into the kvmppc_xics_rm_complete, which is made an
exported function for this reason.

[paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
 in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:12:07 +10:00
Suresh Warrier
e3c13e56a4 KVM: PPC: Book3S HV: Handle passthrough interrupts in guest
Currently, KVM switches back to the host to handle any external
interrupt (when the interrupt is received while running in the
guest). This patch updates real-mode KVM to check if an interrupt
is generated by a passthrough adapter that is owned by this guest.
If so, the real mode KVM will directly inject the corresponding
virtual interrupt to the guest VCPU's ICS and also EOI the interrupt
in hardware. In short, the interrupt is handled entirely in real
mode in the guest context without switching back to the host.

In some rare cases, the interrupt cannot be completely handled in
real mode, for instance, a VCPU that is sleeping needs to be woken
up. In this case, KVM simply switches back to the host with trap
reason set to 0x500. This works, but it is clearly not very efficient.
A following patch will distinguish this case and handle it
correctly in the host. Note that we can use the existing
check_too_hard() routine even though we are not in a hypercall to
determine if there is unfinished business that needs to be
completed in host virtual mode.

The patch assumes that the mapping between hardware interrupt IRQ
and virtual IRQ to be injected to the guest already exists for the
PCI passthrough interrupts that need to be handled in real mode.
If the mapping does not exist, KVM falls back to the default
existing behavior.

The KVM real mode code reads mappings from the mapped array in the
passthrough IRQ map without taking any lock.  We carefully order the
loads and stores of the fields in the kvmppc_irq_map data structure
using memory barriers to avoid an inconsistent mapping being seen by
the reader. Thus, although it is possible to miss a map entry, it is
not possible to read a stale value.

[paulus@ozlabs.org - get irq_chip from irq_map rather than pimap,
 pulled out powernv eoi change into a separate patch, made
 kvmppc_read_intr get the vcpu from the paca rather than being
 passed in, rewrote the logic at the end of kvmppc_read_intr to
 avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S
 since we were always restoring SRR0/1 anyway, get rid of the cached
 array (just use the mapped array), removed the kick_all_cpus_sync()
 call, clear saved_xirr PACA field when we handle the interrupt in
 real mode, fix compilation with CONFIG_KVM_XICS=n.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-12 10:11:00 +10:00
Suresh Warrier
c57875f5f9 KVM: PPC: Book3S HV: Enable IRQ bypass
Add the irq_bypass_add_producer and irq_bypass_del_producer
functions. These functions get called whenever a GSI is being
defined for a guest. They create/remove the mapping between
host real IRQ numbers and the guest GSI.

Add the following helper functions to manage the
passthrough IRQ map.

kvmppc_set_passthru_irq()
  Creates a mapping in the passthrough IRQ map that maps a host
  IRQ to a guest GSI. It allocates the structure (one per guest VM)
  the first time it is called.

kvmppc_clr_passthru_irq()
  Removes the passthrough IRQ map entry given a guest GSI.
  The passthrough IRQ map structure is not freed even when the
  number of mapped entries goes to zero. It is only freed when
  the VM is destroyed.

[paulus@ozlabs.org - modified to use is_pnv_opal_msi() rather than
 requiring all passed-through interrupts to use the same irq_chip;
 changed deletion so it zeroes out the r_hwirq field rather than
 copying the last entry down and decrementing the number of entries.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:26:51 +10:00
Suresh Warrier
8daaafc88b KVM: PPC: Book3S HV: Introduce kvmppc_passthru_irqmap
This patch introduces an IRQ mapping structure, the
kvmppc_passthru_irqmap structure that is to be used
to map the real hardware IRQ in the host with the virtual
hardware IRQ (gsi) that is injected into a guest by KVM for
passthrough adapters.

Currently, we assume a separate IRQ mapping structure for
each guest. Each kvmppc_passthru_irqmap has a mapping arrays,
containing all defined real<->virtual IRQs.

[paulus@ozlabs.org - removed irq_chip field from struct
 kvmppc_passthru_irqmap; changed parameter for
 kvmppc_get_passthru_irqmap from struct kvm_vcpu * to struct
 kvm *, removed small cached array.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:26:19 +10:00
Suresh Warrier
9576730d0e KVM: PPC: select IRQ_BYPASS_MANAGER
Select IRQ_BYPASS_MANAGER for PPC when CONFIG_KVM is set.
Add the PPC producer functions for add and del producer.

[paulus@ozlabs.org - Moved new functions from book3s.c to powerpc.c
 so booke compiles; added kvm_arch_has_irq_bypass implementation.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:26:19 +10:00
Suresh Warrier
37f55d30df KVM: PPC: Book3S HV: Convert kvmppc_read_intr to a C function
Modify kvmppc_read_intr to make it a C function.  Because it is called
from kvmppc_check_wake_reason, any of the assembler code that calls
either kvmppc_read_intr or kvmppc_check_wake_reason now has to assume
that the volatile registers might have been modified.

This also adds in the optimization of clearing saved_xirr in the case
where we completely handle and EOI an IPI.  Without this, the next
device interrupt will require two trips through the host interrupt
handling code.

[paulus@ozlabs.org - made kvmppc_check_wake_reason create a stack frame
 when it is calling kvmppc_read_intr, which means we can set r12 to
 the trap number (0x500) after the call to kvmppc_read_intr, instead
 of using r31.  Also moved the deliver_guest_interrupt label so as to
 restore XER and CTR, plus other minor tweaks.]

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:25:25 +10:00
Paul Mackerras
99212c864e Merge branch 'kvm-ppc-infrastructure' into kvm-ppc-next
This merges the topic branch 'kvm-ppc-infrastructure' into kvm-ppc-next
so that I can then apply further patches that need the changes in the
kvm-ppc-infrastructure branch.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:24:23 +10:00
Paolo Bonzini
3f25777499 powerpc: move hmi.c to arch/powerpc/kvm/
hmi.c functions are unused unless sibling_subcore_state is nonzero, and
that in turn happens only if KVM is in use.  So move the code to
arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_HV_POSSIBLE
rather than CONFIG_PPC_BOOK3S_64.  The sibling_subcore_state is also
included in struct paca_struct only if KVM is supported by the kernel.

Cc: Daniel Axtens <dja@axtens.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm-ppc@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-09 16:18:07 +10:00
Suraj Jitindar Singh
2a27f514a4 KVM: PPC: Implement existing and add new halt polling vcpu stats
vcpu stats are used to collect information about a vcpu which can be viewed
in the debugfs. For example halt_attempted_poll and halt_successful_poll
are used to keep track of the number of times the vcpu attempts to and
successfully polls. These stats are currently not used on powerpc.

Implement incrementation of the halt_attempted_poll and
halt_successful_poll vcpu stats for powerpc. Since these stats are summed
over all the vcpus for all running guests it doesn't matter which vcpu
they are attributed to, thus we choose the current runner vcpu of the
vcore.

Also add new vcpu stats: halt_poll_success_ns, halt_poll_fail_ns and
halt_wait_ns to be used to accumulate the total time spend polling
successfully, polling unsuccessfully and waiting respectively, and
halt_successful_wait to accumulate the number of times the vcpu waits.
Given that halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns are
expressed in nanoseconds it is necessary to represent these as 64-bit
quantities, otherwise they would overflow after only about 4 seconds.

Given that the total time spend either polling or waiting will be known and
the number of times that each was done, it will be possible to determine
the average poll and wait times. This will give the ability to tune the kvm
module parameters based on the calculated average wait and poll times.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-08 12:25:37 +10:00
Suraj Jitindar Singh
0cda69dd7c KVM: PPC: Book3S HV: Implement halt polling
This patch introduces new halt polling functionality into the kvm_hv kernel
module. When a vcore is idle it will poll for some period of time before
scheduling itself out.

When all of the runnable vcpus on a vcore have ceded (and thus the vcore is
idle) we schedule ourselves out to allow something else to run. In the
event that we need to wake up very quickly (for example an interrupt
arrives), we are required to wait until we get scheduled again.

Implement halt polling so that when a vcore is idle, and before scheduling
ourselves, we poll for vcpus in the runnable_threads list which have
pending exceptions or which leave the ceded state. If we poll successfully
then we can get back into the guest very quickly without ever scheduling
ourselves, otherwise we schedule ourselves out as before.

There exists generic halt_polling code in virt/kvm_main.c, however on
powerpc the polling conditions are different to the generic case. It would
be nice if we could just implement an arch specific kvm_check_block()
function, but there is still other arch specific things which need to be
done for kvm_hv (for example manipulating vcore states) which means that a
separate implementation is the best option.

Testing of this patch with a TCP round robin test between two guests with
virtio network interfaces has found a decrease in round trip time of ~15us
on average. A performance gain is only seen when going out of and
back into the guest often and quickly, otherwise there is no net benefit
from the polling. The polling interval is adjusted such that when we are
often scheduled out for long periods of time it is reduced, and when we
often poll successfully it is increased. The rate at which the polling
interval increases or decreases, and the maximum polling interval, can
be set through module parameters.

Based on the implementation in the generic kvm module by Wanpeng Li and
Paolo Bonzini, and on direction from Paul Mackerras.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-08 12:21:45 +10:00
Suraj Jitindar Singh
7b5f8272c7 KVM: PPC: Book3S HV: Change vcore element runnable_threads from linked-list to array
The struct kvmppc_vcore is a structure used to store various information
about a virtual core for a kvm guest. The runnable_threads element of the
struct provides a list of all of the currently runnable vcpus on the core
(those in the KVMPPC_VCPU_RUNNABLE state). The previous implementation of
this list was a linked_list. The next patch requires that the list be able
to be iterated over without holding the vcore lock.

Reimplement the runnable_threads list in the kvmppc_vcore struct as an
array. Implement function to iterate over valid entries in the array and
update access sites accordingly.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-09-08 12:21:44 +10:00
Paul Mackerras
4b3d173d04 KVM: PPC: Always select KVM_VFIO, plus Makefile cleanup
As discussed recently on the kvm mailing list, David Gibson's
intention in commit 178a787502 ("vfio: Enable VFIO device for
powerpc", 2016-02-01) was to have the KVM VFIO device built in
on all powerpc platforms.  This patch adds the "select KVM_VFIO"
statement that makes this happen.

Currently, arch/powerpc/kvm/Makefile doesn't include vfio.o for
the 64-bit kvm module, because the list of objects doesn't use
the $(common-objs-y) list.  The reason it doesn't is because we
don't necessarily want coalesced_mmio.o or emulate.o (for example
if HV KVM is the only target), and common-objs-y includes both.

Since this is confusing, this patch adjusts the definitions so that
we now use $(common-objs-y) in the list for the 64-bit kvm.ko
module, emulate.o is removed from common-objs-y and added in the
places that need it, and the inclusion of coalesced_mmio.o now
depends on CONFIG_KVM_MMIO.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-08-25 22:25:49 +10:00
Paul Mackerras
34a75b0f63 KVM: PPC: Implement kvm_arch_intc_initialized() for PPC
It doesn't make sense to create irqfds for a VM that doesn't have
in-kernel interrupt controller emulation.  There is an existing
interface for architecture code to tell the irqfd code whether or
not any interrupt controller has been initialized, called
kvm_arch_intc_initialized(), so let's implement that for powerpc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-08-19 13:00:06 +10:00
Paul Mackerras
e48ba1cbce KVM: PPC: Book3S: Don't crash if irqfd used with no in-kernel XICS emulation
It turns out that if userspace creates a pseries-type VM without
in-kernel XICS (interrupt controller) emulation, and then connects
an eventfd to the VM as an irqfd, and the eventfd gets signalled,
that the code will try to deliver an interrupt via the non-existent
XICS object and crash the host kernel with a NULL pointer dereference.

To fix this, we check for the presence of the XICS object before
trying to deliver the interrupt, and return with an error if not.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-08-19 13:00:06 +10:00
Christoffer Dall
a28ebea2ad KVM: Protect device ops->create and list_add with kvm->lock
KVM devices were manipulating list data structures without any form of
synchronization, and some implementations of the create operations also
suffered from a lack of synchronization.

Now when we've split the xics create operation into create and init, we
can hold the kvm->lock mutex while calling the create operation and when
manipulating the devices list.

The error path in the generic code gets slightly ugly because we have to
take the mutex again and delete the device from the list, but holding
the mutex during anon_inode_getfd or releasing/locking the mutex in the
common non-error path seemed wrong.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12 12:01:27 +02:00
Christoffer Dall
023e9fddc3 KVM: PPC: Move xics_debugfs_init out of create
As we are about to hold the kvm->lock during the create operation on KVM
devices, we should move the call to xics_debugfs_init into its own
function, since holding a mutex over extended amounts of time might not
be a good idea.

Introduce an init operation on the kvm_device_ops struct which cannot
fail and call this, if configured, after the device has been created.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-12 12:01:26 +02:00
Linus Torvalds
f716a85cd6 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:

 - GCC plugin support by Emese Revfy from grsecurity, with a fixup from
   Kees Cook.  The plugins are meant to be used for static analysis of
   the kernel code.  Two plugins are provided already.

 - reduction of the gcc commandline by Arnd Bergmann.

 - IS_ENABLED / IS_REACHABLE macro enhancements by Masahiro Yamada

 - bin2c fix by Michael Tautschnig

 - setlocalversion fix by Wolfram Sang

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  gcc-plugins: disable under COMPILE_TEST
  kbuild: Abort build on bad stack protector flag
  scripts: Fix size mismatch of kexec_purgatory_size
  kbuild: make samples depend on headers_install
  Kbuild: don't add obj tree in additional includes
  Kbuild: arch: look for generated headers in obtree
  Kbuild: always prefix objtree in LINUXINCLUDE
  Kbuild: avoid duplicate include path
  Kbuild: don't add ../../ to include path
  vmlinux.lds.h: replace config_enabled() with IS_ENABLED()
  kconfig.h: allow to use IS_{ENABLE,REACHABLE} in macro expansion
  kconfig.h: use already defined macros for IS_REACHABLE() define
  export.h: use __is_defined() to check if __KSYM_* is defined
  kconfig.h: use __is_defined() to check if MODULE is defined
  kbuild: setlocalversion: print error to STDERR
  Add sancov plugin
  Add Cyclomatic complexity GCC plugin
  GCC plugin infrastructure
  Shared library support
2016-08-02 16:37:12 -04:00
Linus Torvalds
221bb8a46e - ARM: GICv3 ITS emulation and various fixes. Removal of the old
VGIC implementation.
 
 - s390: support for trapping software breakpoints, nested virtualization
 (vSIE), the STHYI opcode, initial extensions for CPU model support.
 
 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
 preliminary to this and the upcoming support for hardware virtualization
 extensions.
 
 - x86: support for execute-only mappings in nested EPT; reduced vmexit
 latency for TSC deadline timer (by about 30%) on Intel hosts; support for
 more than 255 vCPUs.
 
 - PPC: bugfixes.
 
 The ugly bit is the conflicts.  A couple of them are simple conflicts due
 to 4.7 fixes, but most of them are with other trees. There was definitely
 too much reliance on Acked-by here.  Some conflicts are for KVM patches
 where _I_ gave my Acked-by, but the worst are for this pull request's
 patches that touch files outside arch/*/kvm.  KVM submaintainers should
 probably learn to synchronize better with arch maintainers, with the
 latter providing topic branches whenever possible instead of Acked-by.
 This is what we do with arch/x86.  And I should learn to refuse pull
 requests when linux-next sends scary signals, even if that means that
 submaintainers have to rebase their branches.
 
 Anyhow, here's the list:
 
 - arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
 by the nvdimm tree.  This tree adds handle_preemption_timer and
 EXIT_REASON_PREEMPTION_TIMER at the same place.  In general all mentions
 of pcommit have to go.
 
 There is also a conflict between a stable fix and this patch, where the
 stable fix removed the vmx_create_pml_buffer function and its call.
 
 - virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
 This tree adds kvm_io_bus_get_dev at the same place.
 
 - virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
 file was completely removed for 4.8.
 
 - include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
 this is a change that should have gone in through the irqchip tree and
 pulled by kvm-arm.  I think I would have rejected this kvm-arm pull
 request.  The KVM version is the right one, except that it lacks
 GITS_BASER_PAGES_SHIFT.
 
 - arch/powerpc: what a mess.  For the idle_book3s.S conflict, the KVM
 tree is the right one; everything else is trivial.  In this case I am
 not quite sure what went wrong.  The commit that is causing the mess
 (fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
 path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
 and arch/powerpc/kvm/.  It's large, but at 396 insertions/5 deletions
 I guessed that it wasn't really possible to split it and that the 5
 deletions wouldn't conflict.  That wasn't the case.
 
 - arch/s390: also messy.  First is hypfs_diag.c where the KVM tree
 moved some code and the s390 tree patched it.  You have to reapply the
 relevant part of commits 6c22c98637, plus all of e030c1125e, to
 arch/s390/kernel/diag.c.  Or pick the linux-next conflict
 resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
 Second, there is a conflict in gmap.c between a stable fix and 4.8.
 The KVM version here is the correct one.
 
 I have pushed my resolution at refs/heads/merge-20160802 (commit
 3d1f534198) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
 SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
 U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
 x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
 qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
 69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
 =42ti
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:

 - ARM: GICv3 ITS emulation and various fixes.  Removal of the
   old VGIC implementation.

 - s390: support for trapping software breakpoints, nested
   virtualization (vSIE), the STHYI opcode, initial extensions
   for CPU model support.

 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
   of cleanups, preliminary to this and the upcoming support for
   hardware virtualization extensions.

 - x86: support for execute-only mappings in nested EPT; reduced
   vmexit latency for TSC deadline timer (by about 30%) on Intel
   hosts; support for more than 255 vCPUs.

 - PPC: bugfixes.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
  KVM: PPC: Introduce KVM_CAP_PPC_HTM
  MIPS: Select HAVE_KVM for MIPS64_R{2,6}
  MIPS: KVM: Reset CP0_PageMask during host TLB flush
  MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
  MIPS: KVM: Sign extend MFC0/RDHWR results
  MIPS: KVM: Fix 64-bit big endian dynamic translation
  MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
  MIPS: KVM: Use 64-bit CP0_EBase when appropriate
  MIPS: KVM: Set CP0_Status.KX on MIPS64
  MIPS: KVM: Make entry code MIPS64 friendly
  MIPS: KVM: Use kmap instead of CKSEG0ADDR()
  MIPS: KVM: Use virt_to_phys() to get commpage PFN
  MIPS: Fix definition of KSEGX() for 64-bit
  KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
  kvm: x86: nVMX: maintain internal copy of current VMCS
  KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
  KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
  KVM: arm64: vgic-its: Simplify MAPI error handling
  KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
  KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
  ...
2016-08-02 16:11:27 -04:00
Sam Bobroff
23528bb21e KVM: PPC: Introduce KVM_CAP_PPC_HTM
Introduce a new KVM capability, KVM_CAP_PPC_HTM, that can be queried to
determine if a PowerPC KVM guest should use HTM (Hardware Transactional
Memory).

This will be used by QEMU to populate the pa-features bits in the
guest's device tree.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-01 19:42:06 +02:00
Paul Mackerras
93d17397e4 KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
It turns out that if the guest does a H_CEDE while the CPU is in
a transactional state, and the H_CEDE does a nap, and the nap
loses the architected state of the CPU (which is is allowed to do),
then we lose the checkpointed state of the virtual CPU.  In addition,
the transactional-memory state recorded in the MSR gets reset back
to non-transactional, and when we try to return to the guest, we take
a TM bad thing type of program interrupt because we are trying to
transition from non-transactional to transactional with a hrfid
instruction, which is not permitted.

The result of the program interrupt occurring at that point is that
the host CPU will hang in an infinite loop with interrupts disabled.
Thus this is a denial of service vulnerability in the host which can
be triggered by any guest (and depending on the guest kernel, it can
potentially triggered by unprivileged userspace in the guest).

This vulnerability has been assigned the ID CVE-2016-5412.

To fix this, we save the TM state before napping and restore it
on exit from the nap, when handling a H_CEDE in real mode.  The
case where H_CEDE exits to host virtual mode is already OK (as are
other hcalls which exit to host virtual mode) because the exit
path saves the TM state.

Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-07-28 16:10:07 +10:00
Paul Mackerras
f024ee0984 KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
This moves the transactional memory state save and restore sequences
out of the guest entry/exit paths into separate procedures.  This is
so that these sequences can be used in going into and out of nap
in a subsequent patch.

The only code changes here are (a) saving and restore LR on the
stack, since these new procedures get called with a bl instruction,
(b) explicitly saving r1 into the PACA instead of assuming that
HSTATE_HOST_R1(r13) is already set, and (c) removing an unnecessary
and redundant setting of MSR[TM] that should have been removed by
commit 9d4d0bdd9e0a ("KVM: PPC: Book3S HV: Add transactional memory
support", 2013-09-24) but wasn't.

Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-07-28 16:09:34 +10:00
Benjamin Herrenschmidt
7025776ed1 powerpc/mm: Move hash table ops to a separate structure
Moving probe_machine() to after mmu init will cause the ppc_md
fields relative to the hash table management to be overwritten.

Since we have essentially disconnected the machine type from
the hash backend ops, finish the job by moving them to a different
structure.

The only callback that didn't quite fix is update_partition_table
since this is not specific to hash, so I moved it to a standalone
variable for now. We can revisit later if needed.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fix ppc64e build failure in kexec]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:59:09 +10:00
Benjamin Herrenschmidt
d3cbff1b5a powerpc: Put exception configuration in a common place
The various calls to establish exception endianness and AIL are
now done from a single point using already established CPU and FW
feature bits to decide what to do.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-21 18:56:31 +10:00
Arnd Bergmann
58ab5e0c2c Kbuild: arch: look for generated headers in obtree
There are very few files that need add an -I$(obj) gcc for the preprocessor
or the assembler. For C files, we add always these for both the objtree and
srctree, but for the other ones we require the Makefile to add them, and
Kbuild then adds it for both trees.

As a preparation for changing the meaning of the -I$(obj) directive to
only refer to the srctree, this changes the two instances in arch/x86 to use
an explictit $(objtree) prefix where needed, otherwise we won't find the
headers any more, as reported by the kbuild 0day builder.

arch/x86/realmode/rm/realmode.lds.S:75:20: fatal error: pasyms.h: No such file or directory

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michal Marek <mmarek@suse.com>
2016-07-18 21:31:35 +02:00
Shreyas B. Prabhu
5fa6b6bd7a powerpc/powernv: Rename reusable idle functions to hardware agnostic names
Functions like power7_wakeup_loss, power7_wakeup_noloss,
power7_wakeup_tb_loss are used by POWER7 and POWER8 hardware. They can
also be used by POWER9. Hence rename these functions hardware agnostic
names.

Suggested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-15 20:18:39 +10:00
Daniel Axtens
f8750513b7 powerpc/kvm: Clarify __user annotations
kvmppc_h_put_tce_indirect labels a u64 pointer as __user. It also
labelled the u64 where get_user puts the result as __user. This isn't
a pointer and so doesn't need to be labelled __user.

Split the u64 value definition onto a new line to make it clear that
it doesn't get the annotation.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-14 20:43:50 +10:00
Radim Krčmář
c63cf538eb KVM: pass struct kvm to kvm_set_routing_entry
Arch-specific code will use it.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:56 +02:00
Paolo Bonzini
6d5315b3a6 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD 2016-07-11 18:10:06 +02:00
Paolo Bonzini
6edaa5307f KVM: remove kvm_guest_enter/exit wrappers
Use the functions from context_tracking.h directly.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:21 +02:00
Mahesh Salgaonkar
fd7bacbca4 KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt
When a guest is assigned to a core it converts the host Timebase (TB)
into guest TB by adding guest timebase offset before entering into
guest. During guest exit it restores the guest TB to host TB. This means
under certain conditions (Guest migration) host TB and guest TB can differ.

When we get an HMI for TB related issues the opal HMI handler would
try fixing errors and restore the correct host TB value. With no guest
running, we don't have any issues. But with guest running on the core
we run into TB corruption issues.

If we get an HMI while in the guest, the current HMI handler invokes opal
hmi handler before forcing guest to exit. The guest exit path subtracts
the guest TB offset from the current TB value which may have already
been restored with host value by opal hmi handler. This leads to incorrect
host and guest TB values.

With split-core, things become more complex. With split-core, TB also gets
split and each subcore gets its own TB register. When a hmi handler fixes
a TB error and restores the TB value, it affects all the TB values of
sibling subcores on the same core. On TB errors all the thread in the core
gets HMI. With existing code, the individual threads call opal hmi handle
independently which can easily throw TB out of sync if we have guest
running on subcores. Hence we will need to co-ordinate with all the
threads before making opal hmi handler call followed by TB resync.

This patch introduces a sibling subcore state structure (shared by all
threads in the core) in paca which holds information about whether sibling
subcores are in Guest mode or host mode. An array in_guest[] of size
MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
The subcore id is used as index into in_guest[] array. Only primary
thread entering/exiting the guest is responsible to set/unset its
designated array element.

On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
this patch will now force guest to vacate the core/subcore. Primary
thread from each subcore will then turn off its respective bit
from the above bitmap during the guest exit path just after the
guest->host partition switch is complete.

All other threads that have just exited the guest OR were already in host
will wait until all other subcores clears their respective bit.
Once all the subcores turn off their respective bit, all threads will
will make call to opal hmi handler.

It is not necessary that opal hmi handler would resync the TB value for
every HMI interrupts. It would do so only for the HMI caused due to
TB errors. For rest, it would not touch TB value. Hence to make things
simpler, primary thread would call TB resync explicitly once for each
core immediately after opal hmi handler instead of subtracting guest
offset from TB. TB resync call will restore the TB with host value.
Thus we can be sure about the TB state.

One of the primary threads exiting the guest will take up the
responsibility of calling TB resync. It will use one of the top bits
(bit 63) from subcore state flags bitmap to make the decision. The first
primary thread (among the subcores) that is able to set the bit will
have to call the TB resync. Rest all other threads will wait until TB
resync is complete.  Once TB resync is complete all threads will then
proceed.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-06-20 14:11:25 +10:00
Thomas Huth
b69890d18f KVM: PPC: Book3S PR: Fix contents of SRR1 when injecting a program exception
vcpu->arch.shadow_srr1 only contains usable values for injecting
a program exception into the guest if we entered the function
kvmppc_handle_exit_pr() with exit_nr == BOOK3S_INTERRUPT_PROGRAM.
In other cases, the shadow_srr1 bits are zero. Since we want to
pass an illegal-instruction program check to the guest, set
"flags" to SRR1_PROGILL for these other cases.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-06-20 14:11:25 +10:00
Thomas Huth
708e75a3ee KVM: PPC: Book3S PR: Fix illegal opcode emulation
If kvmppc_handle_exit_pr() calls kvmppc_emulate_instruction() to emulate
one instruction (in the BOOK3S_INTERRUPT_H_EMUL_ASSIST case), it calls
kvmppc_core_queue_program() afterwards if kvmppc_emulate_instruction()
returned EMULATE_FAIL, so the guest gets an program interrupt for the
illegal opcode.
However, the kvmppc_emulate_instruction() also tried to inject a
program exception for this already, so the program interrupt gets
injected twice and the return address in srr0 gets destroyed.
All other callers of kvmppc_emulate_instruction() are also injecting
a program interrupt, and since the callers have the right knowledge
about the srr1 flags that should be used, it is the function
kvmppc_emulate_instruction() that should _not_ inject program
interrupts, so remove the kvmppc_core_queue_program() here.

This fixes the issue discovered by Laurent Vivier with kvm-unit-tests
where the logs are filled with these messages when the test tries
to execute an illegal instruction:

     Couldn't emulate instruction 0x00000000 (op 0 xop 0)
     kvmppc_handle_exit_pr: emulation at 700 failed (00000000)

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-06-20 14:11:25 +10:00
Michael Ellerman
f55d966536 powerpc: Define and use PPC64_ELF_ABI_v2/v1
We're approaching 20 locations where we need to check for ELF ABI v2.
That's fine, except the logic is a bit awkward, because we have to check
that _CALL_ELF is defined and then what its value is.

So check it once in asm/types.h and define PPC64_ELF_ABI_v2 when ELF ABI
v2 is detected.

We also have a few places where what we're really trying to check is
that we are using the 64-bit v1 ABI, ie. function descriptors. So also
add a #define for that, which simplifies several checks.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14 13:58:27 +10:00
Linus Torvalds
c04a588029 powerpc updates for 4.7
Highlights:
  - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
  - Live patching support for ppc64le (also merged via livepatching.git)
 
 Various cleanups & minor fixes from:
  - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
    Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
    Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
    Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
    Rothberg, Vipin K Parashar.
 
 General:
  - Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
  - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
  - Add support for userspace Power9 copy/paste from Chris Smart
  - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
  - Add mask of possible MMU features from Michael Ellerman
 
 PCI:
  - Enable pass through of NVLink to guests from Alexey Kardashevskiy
  - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
  - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
  - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
  - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
  - Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
 
 selftests:
  - Test cp_abort during context switch from Chris Smart
  - Add several tests for transactional memory support from Rashmica Gupta
 
 perf:
  - Add support for sampling interrupt register state from Anju T
  - Add support for unwinding perf-stackdump from Chandan Kumar
 
 cxl:
  - Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
  - Allow initialization on timebase sync failures from Frederic Barrat
  - Increase timeout for detection of AFU mmio hang from Frederic Barrat
  - Handle num_of_processes larger than can fit in the SPA from Ian Munsie
  - Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
  - Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
  - Check periodically the coherent platform function's state from Christophe Lombard
 
 Freescale:
  - Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
    workaround, and a kconfig dependency fix."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
 lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
 tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
 /AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
 iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
 /ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
 DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
 YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
 B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
 zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
 19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
 udQES+t3F/9gvtxgxtDe
 =YvyQ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:
   - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
   - Live patching support for ppc64le (also merged via livepatching.git)

  Various cleanups & minor fixes from:
   - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
     Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
     Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
     Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
     Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
     Bauermann, Valentin Rothberg, Vipin K Parashar.

  General:
   - Update LMB associativity index during DLPAR add/remove from Nathan
     Fontenot
   - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
   - Add support for userspace Power9 copy/paste from Chris Smart
   - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
   - Add mask of possible MMU features from Michael Ellerman

  PCI:
   - Enable pass through of NVLink to guests from Alexey Kardashevskiy
   - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
   - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
   - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
   - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
     from Guilherme G Piccoli
   - Remove the dependency on EEH struct in DDW mechanism from Guilherme
     G Piccoli

  selftests:
   - Test cp_abort during context switch from Chris Smart
   - Add several tests for transactional memory support from Rashmica
     Gupta

  perf:
   - Add support for sampling interrupt register state from Anju T
   - Add support for unwinding perf-stackdump from Chandan Kumar

  cxl:
   - Configure the PSL for two CAPI ports on POWER8NVL from Philippe
     Bergheaud
   - Allow initialization on timebase sync failures from Frederic Barrat
   - Increase timeout for detection of AFU mmio hang from Frederic
     Barrat
   - Handle num_of_processes larger than can fit in the SPA from Ian
     Munsie
   - Ensure PSL interrupt is configured for contexts with no AFU IRQs
     from Ian Munsie
   - Add kernel API to allow a context to operate with relocate disabled
     from Ian Munsie
   - Check periodically the coherent platform function's state from
     Christophe Lombard

  Freescale:
   - Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
     an erratum workaround, and a kconfig dependency fix."

* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
  powerpc/86xx: Fix PCI interrupt map definition
  powerpc/86xx: Move pci1 definition to the include file
  powerpc/fsl: Fix build of the dtb embedded kernel images
  powerpc/fsl: Fix rcpm compatible string
  powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
  powerpc/fsl-pci: Add a workaround for PCI 5 errata
  powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
  powerpc/powernv/npu: Add PE to PHB's list
  powerpc/powernv: Fix insufficient memory allocation
  powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
  Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
  powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
  powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
  powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
  powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
  Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
  powerpc/powernv/npu: Enable NVLink pass through
  powerpc/powernv/npu: Rework TCE Kill handling
  powerpc/powernv/npu: Add set/unset window helpers
  powerpc/powernv/ioda2: Export debug helper pe_level_printk()
  ...
2016-05-20 10:12:41 -07:00
Christian Borntraeger
3491caf275 KVM: halt_polling: provide a way to qualify wakeups during poll
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.

For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.

This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.

This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-13 17:29:23 +02:00
Paul Mackerras
b1a4286b8f KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts
Commit c9a5eccac1 ("kvm/eventfd: add arch-specific set_irq",
2015-10-16) added the possibility for architecture-specific code
to handle the generation of virtual interrupts in atomic context
where possible, without having to schedule a work function.

Since we can easily generate virtual interrupts on XICS without
having to do anything worse than take a spinlock, we define a
kvm_arch_set_irq_inatomic() for XICS.  We also remove kvm_set_msi()
since it is not used any more.

The one slightly tricky thing is that with the new interface, we
don't get told whether the interrupt is an MSI (or other edge
sensitive interrupt) vs. level-sensitive.  The difference as far
as interrupt generation is concerned is that for LSIs we have to
set the asserted flag so it will continue to fire until it is
explicitly cleared.

In fact the XICS code gets told which interrupts are LSIs by userspace
when it configures the interrupt via the KVM_DEV_XICS_GRP_SOURCES
attribute group on the XICS device.  To store this information, we add
a new "lsi" field to struct ics_irq_state.  With that we can also do a
better job of returning accurate values when reading the attribute
group.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-05-12 16:40:55 +10:00
Gavin Shan
07f8ab255f KVM: PPC: Book3S HV: Fix build error in book3s_hv.c
When CONFIG_KVM_XICS is enabled, CPU_UP_PREPARE and other macros for
CPU states in linux/cpu.h are needed by arch/powerpc/kvm/book3s_hv.c.
Otherwise, build error as below is seen:

   gwshan@gwshan:~/sandbox/l$ make arch/powerpc/kvm/book3s_hv.o
    :
   CC      arch/powerpc/kvm/book3s_hv.o
   arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_cpu_notify’:
   arch/powerpc/kvm/book3s_hv.c:3072:7: error: ‘CPU_UP_PREPARE’ \
   undeclared (first use in this function)

This fixes the issue introduced by commit <6f3bb80944> ("KVM: PPC:
Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs").

Fixes: 6f3bb80944
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-05-11 21:19:10 +10:00
Paul Mackerras
eb8b056016 KVM: PPC: Fix emulated MMIO sign-extension
When the guest does a sign-extending load instruction (such as lha
or lwa) to an emulated MMIO location, it results in a call to
kvmppc_handle_loads() in the host.  That function sets the
vcpu->arch.mmio_sign_extend flag and calls kvmppc_handle_load()
to do the rest of the work.  However, kvmppc_handle_load() sets
the mmio_sign_extend flag to 0 unconditionally, so the sign
extension never gets done.

To fix this, we rename kvmppc_handle_load to __kvmppc_handle_load
and add an explicit parameter to indicate whether sign extension
is required.  kvmppc_handle_load() and kvmppc_handle_loads() then
become 1-line functions that just call __kvmppc_handle_load()
with the extra parameter.

Reported-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-05-11 21:19:10 +10:00
Alexey Kardashevskiy
ade3ac660a KVM: PPC: Fix debug macros
When XICS_DBG is enabled, gcc produces format errors. This fixes
formats to match passed values types.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-05-11 21:19:10 +10:00
Laurent Vivier
11dd6ac025 KVM: PPC: Book3S PR: Manage single-step mode
Until now, when we connect gdb to the QEMU gdb-server, the
single-step mode is not managed.

This patch adds this, only for kvm-pr:

If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just after.
In kvmppc_handle_exit_pr, instead of routing the interrupt to
the guest, we return to host, with KVM_EXIT_DEBUG reason.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-05-11 21:19:10 +10:00
Aneesh Kumar K.V
50de596de8 powerpc/mm/hash: Add support for Power9 Hash
PowerISA 3.0 adds a parition table indexed by LPID. Parition table
allows us to specify the MMU model that will be used for guest and host
translation.

This patch adds support with SLB based hash model (UPRT = 0). What is
required with this model is to support the new hash page table entry
format and also setup partition table such that we use hash table for
address translation.

We don't have segment table support yet.

In order to make sure we don't load KVM module on Power9 (since we don't
have kvm support yet) this patch also disables KVM on Power9.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:40 +10:00
Aneesh Kumar K.V
30bda41aba powerpc/mm: Drop WIMG in favour of new constants
PowerISA 3.0 introduces two pte bits with the below meaning for radix:
  00 -> Normal Memory
  01 -> Strong Access Order (SAO)
  10 -> Non idempotent I/O (Cache inhibited and guarded)
  11 -> Tolerant I/O (Cache inhibited)

We drop the existing WIMG bits in the Linux page table in favour of the
above constants. We loose _PAGE_WRITETHRU with this conversion. We only
use writethru via pgprot_cached_wthru() which is used by
fbdev/controlfb.c which is Apple control display and also PPC32.

With respect to _PAGE_COHERENCE, we have been marking hpte always
coherent for some time now. htab_convert_pte_flags() always added
HPTE_R_M.

NOTE: KVM changes need closer review.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01 18:32:33 +10:00
Paolo Bonzini
0af574be32 KVM: PPC: do not compile in vfio.o unconditionally
Build on 32-bit PPC fails with the following error:

 int kvm_vfio_ops_init(void)
      ^
 In file included from arch/powerpc/kvm/../../../virt/kvm/vfio.c:21:0:
 arch/powerpc/kvm/../../../virt/kvm/vfio.h:8:90: note: previous definition of ‘kvm_vfio_ops_init’ was here
 arch/powerpc/kvm/../../../virt/kvm/vfio.c:292:6: error: redefinition of ‘kvm_vfio_ops_exit’
 void kvm_vfio_ops_exit(void)
             ^
 In file included from arch/powerpc/kvm/../../../virt/kvm/vfio.c:21:0:
 arch/powerpc/kvm/../../../virt/kvm/vfio.h:12:91: note: previous definition of ‘kvm_vfio_ops_exit’ was here
 scripts/Makefile.build:258: recipe for target arch/powerpc/kvm/../../../virt/kvm/vfio.o failed
 make[3]: *** [arch/powerpc/kvm/../../../virt/kvm/vfio.o] Error 1

Check whether CONFIG_KVM_VFIO is set before including vfio.o
in the build.

Reported-by: Pranith Kumar <bobby.prani@gmail.com>
Tested-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:38 +01:00
Lan Tianyu
489153c746 KVM/PPC: update the comment of memory barrier in the kvmppc_prepare_to_enter()
The barrier also orders the write to mode from any reads
to the page tables done and so update the comment.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:37 +01:00
Alexey Kardashevskiy
31217db720 KVM: PPC: Create a virtual-mode only TCE table handlers
Upcoming in-kernel VFIO acceleration needs different handling in real
and virtual modes which makes it hard to support both modes in
the same handler.

This creates a copy of kvmppc_rm_h_stuff_tce and kvmppc_rm_h_put_tce
in addition to the existing kvmppc_rm_h_put_tce_indirect.

This also fixes linker breakage when only PR KVM was selected (leaving
HV KVM off): the kvmppc_h_put_tce/kvmppc_h_stuff_tce functions
would not compile at all and the linked would fail.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:51 +01:00
Linus Torvalds
d5e2d00898 powerpc updates for 4.6
Highlights:
  - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras
  - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V
  - Add POWER9 cputable entry from Michael Neuling
  - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
  - Add support for new ftrace ABI on ppc64le from Torsten Duwe
 
 Various cleanups & minor fixes from:
  - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril
    Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey,
    Sukadev Bhattiprolu, Suraj Jitindar Singh.
 
 General:
  - atomics: Allow architectures to define their own __atomic_op_* helpers from
    Boqun Feng
  - Implement atomic{, 64}_*_return_* variants and acquire/release/relaxed
    variants for (cmp)xchg from Boqun Feng
  - Add powernv_defconfig from Jeremy Kerr
  - Fix BUG_ON() reporting in real mode from Balbir Singh
  - Add xmon command to dump OPAL msglog from Andrew Donnellan
  - Add xmon command to dump process/task similar to ps(1) from Douglas Miller
  - Clean up memory hotplug failure paths from David Gibson
 
 pci/eeh:
  - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei
    Yang.
  - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
  - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
  - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
  - MAINTAINERS: Update EEH details and maintainership from Russell Currey
 
 cxl:
  - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
  - Ignore probes for virtual afu pci devices from Vaibhav Jain
 
 perf:
  - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu
  - hv-24x7: Fix usage with chip events, display change in counter values,
    display domain indices in sysfs, eliminate domain suffix in event names,
    from Sukadev Bhattiprolu
 
 Freescale:
  - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum
    optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and
    other dt bits, and minor fixes/cleanup."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW69OrAAoJEFHr6jzI4aWAe5EQAJw/hE6WBQc6a7Tj70AnXOqR
 qk/m5pZjuTwQxfBteIvHR1pE5eXdlvtAjcD254LVkFkAbIn19W/h2k0VX/nlee7P
 n/VRHRifjtGmukqHrPYJJ7ua9mNlY7pxh3leGSixBFASnSWqMxNNNziNQtSTcuCs
 TjHiw6NkZ/kzeunA4bAfE4yHVUZjmL74oiS9JbLyaVHqoW4fqWLlh26AKo2yYMZI
 qPicBBG4HBi3FGvoexnKxlJNdcV4HO7LzDjJmCSfUKYCJi+Pw19T5qmhso0q0qVz
 vHg/A8HNeG4Hn83pNVmLeQSAIQRZ3DvTtcLgbjPo+TVwm/hzrRRBWipTeOVbkLW8
 2bcOXT4t7LWUq15EAJ1LYgYZGzcLrfRfUeOcuQ1TWd3+PcfY9pE7FmizsxAAfaVe
 E9j9mpz4XnIqBtWkFHneTIHkQ5OWptyKuZJEaYH0nut4VsP0k8NarkseafGqBPu7
 5eG83gbiQbCVixfOgblV9eocJ29JcwpjPAY4CZSGJimShg909FV7WRgZgJkKWrbK
 dBRco8Jcp4VglGfo2qymv7Uj4KwQoypBREOhiKUvrAsVlDxPfx+bcskhjGu9xGDC
 xs/+nme0/lKa/wg5K4C3mQ1GAlkMWHI0ojhJjsyODbetup5UbkEu03wjAaTdO9dT
 Y6ptGm0rYAJluPNlziFj
 =qkAt
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This was delayed a day or two by some build-breakage on old toolchains
  which we've now fixed.

  There's two PCI commits both acked by Bjorn.

  There's one commit to mm/hugepage.c which is (co)authored by Kirill.

  Highlights:
   - Restructure Linux PTE on Book3S/64 to Radix format from Paul
     Mackerras
   - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
     Kumar K.V
   - Add POWER9 cputable entry from Michael Neuling
   - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
   - Add support for new ftrace ABI on ppc64le from Torsten Duwe

  Various cleanups & minor fixes from:
   - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
     Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
     Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

  General:
   - atomics: Allow architectures to define their own __atomic_op_*
     helpers from Boqun Feng
   - Implement atomic{, 64}_*_return_* variants and acquire/release/
     relaxed variants for (cmp)xchg from Boqun Feng
   - Add powernv_defconfig from Jeremy Kerr
   - Fix BUG_ON() reporting in real mode from Balbir Singh
   - Add xmon command to dump OPAL msglog from Andrew Donnellan
   - Add xmon command to dump process/task similar to ps(1) from Douglas
     Miller
   - Clean up memory hotplug failure paths from David Gibson

  pci/eeh:
   - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
     from Wei Yang.
   - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
   - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
   - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
   - MAINTAINERS: Update EEH details and maintainership from Russell
     Currey

  cxl:
   - Support added to the CXL driver for running on both bare-metal and
     hypervisor systems, from Christophe Lombard and Frederic Barrat.
   - Ignore probes for virtual afu pci devices from Vaibhav Jain

  perf:
   - Export Power8 generic and cache events to sysfs from Sukadev
     Bhattiprolu
   - hv-24x7: Fix usage with chip events, display change in counter
     values, display domain indices in sysfs, eliminate domain suffix in
     event names, from Sukadev Bhattiprolu

  Freescale:
   - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
     checksum optimizations, 86xx consolidation, e5500/e6500 cpu
     hotplug, more fman and other dt bits, and minor fixes/cleanup"

* tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
  powerpc: Fix unrecoverable SLB miss during restore_math()
  powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
  powerpc/rcpm: Fix build break when SMP=n
  powerpc/book3e-64: Use hardcoded mttmr opcode
  powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
  powerpc/T104xRDB: add tdm riser card node to device tree
  powerpc32: PAGE_EXEC required for inittext
  powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
  powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
  powerpc/86xx: Introduce and use common dtsi
  powerpc/86xx: Update device tree
  powerpc/86xx: Move dts files to fsl directory
  powerpc/86xx: Switch to kconfig fragments approach
  powerpc/86xx: Update defconfigs
  powerpc/86xx: Consolidate common platform code
  powerpc32: Remove one insn in mulhdu
  powerpc32: small optimisation in flush_icache_range()
  powerpc: Simplify test in __dma_sync()
  powerpc32: move xxxxx_dcache_range() functions inline
  powerpc32: Remove clear_pages() and define clear_page() inline
  ...
2016-03-19 15:38:41 -07:00
Linus Torvalds
10dc374766 One of the largest releases for KVM... Hardly any generic improvement,
but lots of architecture-specific changes.
 
 * ARM:
 - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
 - PMU support for guests
 - 32bit world switch rewritten in C
 - various optimizations to the vgic save/restore code.
 
 * PPC:
 - enabled KVM-VFIO integration ("VFIO device")
 - optimizations to speed up IPIs between vcpus
 - in-kernel handling of IOMMU hypercalls
 - support for dynamic DMA windows (DDW).
 
 * s390:
 - provide the floating point registers via sync regs;
 - separated instruction vs. data accesses
 - dirty log improvements for huge guests
 - bugfixes and documentation improvements.
 
 * x86:
 - Hyper-V VMBus hypercall userspace exit
 - alternative implementation of lowest-priority interrupts using vector
 hashing (for better VT-d posted interrupt support)
 - fixed guest debugging with nested virtualizations
 - improved interrupt tracking in the in-kernel IOAPIC
 - generic infrastructure for tracking writes to guest memory---currently
 its only use is to speedup the legacy shadow paging (pre-EPT) case, but
 in the future it will be used for virtual GPUs as well
 - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJW5r3BAAoJEL/70l94x66D2pMH/jTSWWwdTUJMctrDjPVzKzG0
 yOzHW5vSLFoFlwEOY2VpslnXzn5TUVmCAfrdmFNmQcSw6hGb3K/xA/ZX/KLwWhyb
 oZpr123ycahga+3q/ht/dFUBCCyWeIVMdsLSFwpobEBzPL0pMgc9joLgdUC6UpWX
 tmN0LoCAeS7spC4TTiTTpw3gZ/L+aB0B6CXhOMjldb9q/2CsgaGyoVvKA199nk9o
 Ngu7ImDt7l/x1VJX4/6E/17VHuwqAdUrrnbqerB/2oJ5ixsZsHMGzxQ3sHCmvyJx
 WG5L00ubB1oAJAs9fBg58Y/MdiWX99XqFhdEfxq4foZEiQuCyxygVvq3JwZTxII=
 =OUZZ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "One of the largest releases for KVM...  Hardly any generic
  changes, but lots of architecture-specific updates.

  ARM:
   - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
   - PMU support for guests
   - 32bit world switch rewritten in C
   - various optimizations to the vgic save/restore code.

  PPC:
   - enabled KVM-VFIO integration ("VFIO device")
   - optimizations to speed up IPIs between vcpus
   - in-kernel handling of IOMMU hypercalls
   - support for dynamic DMA windows (DDW).

  s390:
   - provide the floating point registers via sync regs;
   - separated instruction vs.  data accesses
   - dirty log improvements for huge guests
   - bugfixes and documentation improvements.

  x86:
   - Hyper-V VMBus hypercall userspace exit
   - alternative implementation of lowest-priority interrupts using
     vector hashing (for better VT-d posted interrupt support)
   - fixed guest debugging with nested virtualizations
   - improved interrupt tracking in the in-kernel IOAPIC
   - generic infrastructure for tracking writes to guest
     memory - currently its only use is to speedup the legacy shadow
     paging (pre-EPT) case, but in the future it will be used for
     virtual GPUs as well
   - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
  KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
  KVM: x86: disable MPX if host did not enable MPX XSAVE features
  arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
  arm64: KVM: vgic-v3: Reset LRs at boot time
  arm64: KVM: vgic-v3: Do not save an LR known to be empty
  arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
  arm64: KVM: vgic-v3: Avoid accessing ICH registers
  KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
  KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
  KVM: arm/arm64: vgic-v2: Reset LRs at boot time
  KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
  KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
  KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
  KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
  KVM: s390: allocate only one DMA page per VM
  KVM: s390: enable STFLE interpretation only if enabled for the guest
  KVM: s390: wake up when the VCPU cpu timer expires
  KVM: s390: step the VCPU timer while in enabled wait
  KVM: s390: protect VCPU cpu timer with a seqcount
  KVM: s390: step VCPU cpu timer during kvm_run ioctl
  ...
2016-03-16 09:55:35 -07:00
Linus Torvalds
d4e796152a Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Make schedstats a runtime tunable (disabled by default) and
     optimize it via static keys.

     As most distributions enable CONFIG_SCHEDSTATS=y due to its
     instrumentation value, this is a nice performance enhancement.
     (Mel Gorman)

   - Implement 'simple waitqueues' (swait): these are just pure
     waitqueues without any of the more complex features of full-blown
     waitqueues (callbacks, wake flags, wake keys, etc.).  Simple
     waitqueues have less memory overhead and are faster.

     Use simple waitqueues in the RCU code (in 4 different places) and
     for handling KVM vCPU wakeups.

     (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
     Marcelo Tosatti)

   - sched/numa enhancements (Rik van Riel)

   - NOHZ performance enhancements (Rik van Riel)

   - Various sched/deadline enhancements (Steven Rostedt)

   - Various fixes (Peter Zijlstra)

   - ... and a number of other fixes, cleanups and smaller enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  sched/cputime: Fix steal_account_process_tick() to always return jiffies
  sched/deadline: Remove dl_new from struct sched_dl_entity
  Revert "kbuild: Add option to turn incompatible pointer check into error"
  sched/deadline: Remove superfluous call to switched_to_dl()
  sched/debug: Fix preempt_disable_ip recording for preempt_disable()
  sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
  time, acct: Drop irq save & restore from __acct_update_integrals()
  acct, time: Change indentation in __acct_update_integrals()
  sched, time: Remove non-power-of-two divides from __acct_update_integrals()
  sched/rt: Kick RT bandwidth timer immediately on start up
  sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
  sched/debug: Move sched_domain_sysctl to debug.c
  sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
  sched/rt: Fix PI handling vs. sched_setscheduler()
  sched/core: Remove duplicated sched_group_set_shares() prototype
  sched/fair: Consolidate nohz CPU load update code
  sched/fair: Avoid using decay_load_missed() with a negative value
  sched/deadline: Always calculate end of period on sched_yield()
  sched/cgroup: Fix cgroup entity load tracking tear-down
  rcu: Use simple wait queues where possible in rcutree
  ...
2016-03-14 19:14:06 -07:00
Paul Mackerras
ccec44563b KVM: PPC: Book3S HV: Sanitize special-purpose register values on guest exit
Thomas Huth discovered that a guest could cause a hard hang of a
host CPU by setting the Instruction Authority Mask Register (IAMR)
to a suitable value.  It turns out that this is because when the
code was added to context-switch the new special-purpose registers
(SPRs) that were added in POWER8, we forgot to add code to ensure
that they were restored to a sane value on guest exit.

This adds code to set those registers where a bad value could
compromise the execution of the host kernel to a suitable neutral
value on guest exit.

Cc: stable@vger.kernel.org # v3.14+
Fixes: b005255e12
Reported-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-03-08 13:36:42 +11:00
Aneesh Kumar K.V
f64e8084c9 powerpc/mm: Move hash related mmu-*.h headers to book3s/
No code changes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-03 21:19:21 +11:00
Alexey Kardashevskiy
58ded4201f KVM: PPC: Add support for 64bit TCE windows
The existing KVM_CREATE_SPAPR_TCE only supports 32bit windows which is not
enough for directly mapped windows as the guest can get more than 4GB.

This adds KVM_CREATE_SPAPR_TCE_64 ioctl and advertises it
via KVM_CAP_SPAPR_TCE_64 capability. The table size is checked against
the locked memory limit.

Since 64bit windows are to support Dynamic DMA windows (DDW), let's add
@bus_offset and @page_shift which are also required by DDW.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-03-02 09:56:50 +11:00
Alexey Kardashevskiy
14f853f1b2 KVM: PPC: Add @offset to kvmppc_spapr_tce_table
This enables userspace view of TCE tables to start from non-zero offset
on a bus. This will be used for huge DMA windows.

This only changes the internal structure, the user interface needs to
change in order to use an offset.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-03-02 09:56:50 +11:00
Alexey Kardashevskiy
fe26e52712 KVM: PPC: Add @page_shift to kvmppc_spapr_tce_table
At the moment the kvmppc_spapr_tce_table struct can only describe
4GB windows and handle fixed size (4K) pages. Dynamic DMA windows
support more so these limits need to be extended.

This replaces window_size (in bytes, 4GB max) with page_shift (32bit)
and size (64bit, in pages).

This should cause no behavioural change as this is changing
the internal structures only - the user interface still only
allows one to create a 32-bit table with 4KiB pages at this stage.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-03-02 09:56:50 +11:00
Adam Buchbinder
446957ba51 powerpc: Fix misspellings in comments.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-01 19:27:20 +11:00
Suresh E. Warrier
520fe9c607 KVM: PPC: Book3S HV: Add tunable to control H_IPI redirection
Redirecting the wakeup of a VCPU from the H_IPI hypercall to
a core running in the host is usually a good idea, most workloads
seemed to benefit. However, in one heavily interrupt-driven SMT1
workload, some regression was observed. This patch adds a kvm_hv
module parameter called h_ipi_redirect to control this feature.

The default value for this tunable is 1 - that is enable the feature.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-29 16:25:06 +11:00
Suresh E. Warrier
e17769eb8c KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU
This patch adds support to real-mode KVM to search for a core
running in the host partition and send it an IPI message with
VCPU to be woken. This avoids having to switch to the host
partition to complete an H_IPI hypercall when the VCPU which
is the target of the the H_IPI is not loaded (is not running
in the guest).

The patch also includes the support in the IPI handler running
in the host to do the wakeup by calling kvmppc_xics_ipi_action
for the PPC_MSG_RM_HOST_ACTION message.

When a guest is being destroyed, we need to ensure that there
are no pending IPIs waiting to wake up a VCPU before we free
the VCPUs of the guest. This is accomplished by:
- Forces a PPC_MSG_CALL_FUNCTION IPI to be completed by all CPUs
  before freeing any VCPUs in kvm_arch_destroy_vm().
- Any PPC_MSG_RM_HOST_ACTION messages must be executed first
  before any other PPC_MSG_CALL_FUNCTION messages.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-29 16:25:06 +11:00
Suresh Warrier
0c2a660624 KVM: PPC: Book3S HV: Host side kick VCPU when poked by real-mode KVM
This patch adds the support for the kick VCPU operation for
kvmppc_host_rm_ops. The kvmppc_xics_ipi_action() function
provides the function to be invoked for a host side operation
when poked by the real mode KVM. This is initiated by KVM by
sending an IPI to any free host core.

KVM real mode must set the rm_action to XICS_RM_KICK_VCPU and
rm_data to point to the VCPU to be woken up before sending the IPI.
Note that we have allocated one kvmppc_host_rm_core structure
per core. The above values need to be set in the structure
corresponding to the core to which the IPI will be sent.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-29 16:25:06 +11:00
Suresh Warrier
6f3bb80944 KVM: PPC: Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs
The kvmppc_host_rm_ops structure keeps track of which cores are
are in the host by maintaining a bitmask of active/runnable
online CPUs that have not entered the guest. This patch adds
support to manage the bitmask when a CPU is offlined or onlined
in the host.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-29 16:25:06 +11:00
Suresh Warrier
b8e6a87c82 KVM: PPC: Book3S HV: Manage core host state
Update the core host state in kvmppc_host_rm_ops whenever
the primary thread of the core enters the guest or returns
back.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-29 16:25:06 +11:00
Suresh Warrier
79b6c247e9 KVM: PPC: Book3S HV: Host-side RM data structures
This patch defines the data structures to support the setting up
of host side operations while running in real mode in the guest,
and also the functions to allocate and free it.

The operations are for now limited to virtual XICS operations.
Currently, we have only defined one operation in the data
structure:
         - Wake up a VCPU sleeping in the host when it
           receives a virtual interrupt

The operations are assigned at the core level because PowerKVM
requires that the host run in SMT off mode. For each core,
we will need to manage its state atomically - where the state
is defined by:
1. Is the core running in the host?
2. Is there a Real Mode (RM) operation pending on the host?

Currently, core state is only managed at the whole-core level
even when the system is in split-core mode. This just limits
the number of free or "available" cores in the host to perform
any host-side operations.

The kvmppc_host_rm_core.rm_data allows any data to be passed by
KVM in real mode to the host core along with the operation to
be performed.

The kvmppc_host_rm_ops structure is allocated the very first time
a guest VM is started. Initial core state is also set - all online
cores are in the host. This structure is never deleted, not even
when there are no active guests. However, it needs to be freed
when the module is unloaded because the kvmppc_host_rm_ops_hv
can contain function pointers to kvm-hv.ko functions for the
different supported host operations.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-29 16:25:06 +11:00
Marcelo Tosatti
8577370fb0 KVM: Use simple waitqueue for vcpu->wq
The problem:

On -rt, an emulated LAPIC timer instances has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.

cyclictest command line:

This patch reduces the average latency in my tests from 14us to 11us.

Daniel writes:
Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
benchmark on mainline. The test was run 1000 times on
tip/sched/core 4.4.0-rc8-01134-g0905f04:

  ./x86-run x86/tscdeadline_latency.flat -cpu host

with idle=poll.

The test seems not to deliver really stable numbers though most of
them are smaller. Paolo write:

"Anything above ~10000 cycles means that the host went to C1 or
lower---the number means more or less nothing in that case.

The mean shows an improvement indeed."

Before:

               min             max         mean           std
count  1000.000000     1000.000000  1000.000000   1000.000000
mean   5162.596000  2019270.084000  5824.491541  20681.645558
std      75.431231   622607.723969    89.575700   6492.272062
min    4466.000000    23928.000000  5537.926500    585.864966
25%    5163.000000  1613252.750000  5790.132275  16683.745433
50%    5175.000000  2281919.000000  5834.654000  23151.990026
75%    5190.000000  2382865.750000  5861.412950  24148.206168
max    5228.000000  4175158.000000  6254.827300  46481.048691

After
               min            max         mean           std
count  1000.000000     1000.00000  1000.000000   1000.000000
mean   5143.511000  2076886.10300  5813.312474  21207.357565
std      77.668322   610413.09583    86.541500   6331.915127
min    4427.000000    25103.00000  5529.756600    559.187707
25%    5148.000000  1691272.75000  5784.889825  17473.518244
50%    5160.000000  2308328.50000  5832.025000  23464.837068
75%    5172.000000  2393037.75000  5853.177675  24223.969976
max    5222.000000  3922458.00000  6186.720500  42520.379830

[Patch was originaly based on the swait implementation found in the -rt
 tree. Daniel ported it to mainline's version and gathered the
 benchmark numbers for tscdeadline_latency test.]

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
Alexey Kardashevskiy
d3695aa4f4 KVM: PPC: Add support for multiple-TCE hcalls
This adds real and virtual mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition between kernel and user space.

The current implementation of kvmppc_h_stuff_tce() allows it to be
executed in both real and virtual modes so there is one helper.
The kvmppc_rm_h_put_tce_indirect() needs to translate the guest address
to the host address and since the translation is different, there are
2 helpers - one for each mode.

This implements the KVM_CAP_PPC_MULTITCE capability. When present,
the kernel will try handling H_PUT_TCE_INDIRECT and H_STUFF_TCE if these
are enabled by the userspace via KVM_CAP_PPC_ENABLE_HCALL.
If they can not be handled by the kernel, they are passed on to
the user space. The user space still has to have an implementation
for these.

Both HV and PR-syle KVM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
Alexey Kardashevskiy
5ee7af1864 KVM: PPC: Move reusable bits of H_PUT_TCE handler to helpers
Upcoming multi-tce support (H_PUT_TCE_INDIRECT/H_STUFF_TCE hypercalls)
will validate TCE (not to have unexpected bits) and IO address
(to be within the DMA window boundaries).

This introduces helpers to validate TCE and IO address. The helpers are
exported as they compile into vmlinux (to work in realmode) and will be
used later by KVM kernel module in virtual mode.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
Alexey Kardashevskiy
462ee11e58 KVM: PPC: Replace SPAPR_TCE_SHIFT with IOMMU_PAGE_SHIFT_4K
SPAPR_TCE_SHIFT is used in few places only and since IOMMU_PAGE_SHIFT_4K
can be easily used instead, remove SPAPR_TCE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
Alexey Kardashevskiy
f8626985c7 KVM: PPC: Account TCE-containing pages in locked_vm
At the moment pages used for TCE tables (in addition to pages addressed
by TCEs) are not counted in locked_vm counter so a malicious userspace
tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as
RLIMIT_NOFILE and lock a lot of memory.

This adds counting for pages used for TCE tables.

This counts the number of pages required for a table plus pages for
the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.

This changes release_spapr_tce_table() to store @npages on stack to
avoid calling kvmppc_stt_npages() in the loop (tiny optimization,
probably).

This does not change the amount of used memory.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
Alexey Kardashevskiy
366baf28ee KVM: PPC: Use RCU for arch.spapr_tce_tables
At the moment only spapr_tce_tables updates are protected against races
but not lookups. This fixes missing protection by using RCU for the list.
As lookups also happen in real mode, this uses
list_for_each_entry_lockless() (which is expected not to access any
vmalloc'd memory).

This converts release_spapr_tce_table() to a RCU scheduled handler.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
Alexey Kardashevskiy
fcbb2ce672 KVM: PPC: Rework H_PUT_TCE/H_GET_TCE handlers
This reworks the existing H_PUT_TCE/H_GET_TCE handlers to have following
patches applied nicer.

This moves the ioba boundaries check to a helper and adds a check for
least bits which have to be zeros.

The patch is pretty mechanical (only check for least ioba bits is added)
so no change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-16 13:44:26 +11:00
David Gibson
178a787502 vfio: Enable VFIO device for powerpc
ec53500f "kvm: Add VFIO device" added a special KVM pseudo-device which is
used to handle any necessary interactions between KVM and VFIO.

Currently that device is built on x86 and ARM, but not powerpc, although
powerpc does support both KVM and VFIO.  This makes things awkward in
userspace

Currently qemu prints an alarming error message if you attempt to use VFIO
and it can't initialize the KVM VFIO device.  We don't want to remove the
warning, because lack of the KVM VFIO device could mean coherency problems
on x86.  On powerpc, however, the error is harmless but looks disturbing,
and a test based on host architecture in qemu would be ugly, and break if
we do need the KVM VFIO device for something important in future.

There's nothing preventing the KVM VFIO device from being built for
powerpc, so this patch turns it on.  It won't actually do anything, since
we don't define any of the arch_*() hooks, but it will make qemu happy and
we can extend it in future if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-02-15 13:42:25 +11:00
Linus Torvalds
f0ce3ff42e s390 and POWER bug fixes, plus enabling the KVM-VFIO interface on s390.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWp5FIAAoJEL/70l94x66DozoIAKWFLnwi6PQ7Sm+Tvr0Sl/mp
 yGDM+PuVyG4C6bPS66xd4XkXEFIwfpIJQzq1Zt7Xg0m27t50DRtmInLJU4ql7MFD
 vYn3h6lOudtUR0i5kPYSy6SMehFjx8wS0GX+O+iX9tMlYI0vGWEJ+7C06FmnqpGP
 RUUjnVvljAMih8wsAHLOjH8rwZHnO+EfHNi+V7Q+eFBHJnn2R06IqK8z5DmTBeTQ
 ZecNdZslX5qZvMNulTo7nC2P6VShkdRuMJkLEFeSaTlCqx5WcnzgwEiMMavUeYI8
 aIWuE51v0RMxRhsvRXUQOeRpRgU0AF2OYJEKXBRbO3P6IRikFohJq4QH67zN2HA=
 =54tv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "s390 and POWER bug fixes, plus enabling the KVM-VFIO interface on
  s390"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM doc: Fix KVM_SMI chapter number
  KVM: s390: fix memory overwrites when vx is disabled
  KVM: s390: Enable the KVM-VFIO device
  KVM: s390: fix guest fprs memory leak
  KVM: PPC: Fix ONE_REG AltiVec support
  KVM: PPC: Increase memslots to 512
  KVM: PPC: Book3S PR: Remove unused variable 'vcpu_book3s'
  KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
  KVM: PPC: Book3S HV: Handle unexpected traps in guest entry/exit code better
2016-01-27 10:50:42 -08:00
Paolo Bonzini
b8bc3bde9c KVM: s390: Fixes for kvm/master (targeting 4.5)
1. Fallout of some bigger floating point/vector rework in s390
 - memory leak -> stable 4.3+
 - memory overwrite -> stable 4.4+
 
 2. enable KVM-VFIO for s390
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJWp4qRAAoJEBF7vIC1phx8iwIP/0JSu7xL2X8hpU9vAqL0AfYH
 +EHyX/V5gOBav+otQmLG1hibMmg6wr5aWBoA8ICRec8EWGs0+hQEcUf+2Iv6z27t
 BiU+WWo6CoxZq3759pokCnFle8AafLDU3zmGYp85wNTvlRehDo1dJ3BEpKHHYqiZ
 zmZf45ruIUjSqX1aCQZlobxybb5nslGmfRoZcI/dYlknols33HbDz4brll1T1AiQ
 E0d0fPZwjWtWTOu2/wk8vlt5Bp76x+rVT2Vs81KCP4qJaUc1IOrMIemgnL4Sv2xu
 qQCSQeW2917Rv4pxSIpyRFW8GoTJ+1+NmsFNIzLjcngDmRhGiSoGp3mPPi08pTb5
 mJJ90yDS8RXKQD6FwSwcfjNuNnjabiGysuGxBlDyB8cFhq0608xKECQI/Zcz/ptd
 rm+MJIzVX09CR8uNgCSUHJ9w9EuwYlFgXP3Kbpq6QwZ9JDyIxMa3DwW3JhH8imZf
 e53oVlSWIW3ceu+yxFUQ9tNc7fxBO1Y7HTS4PXzIAYNkJofi3BtWm1ZmvPBPD58F
 9evrnxlKidU+MoWrZctmVmnVRcn7rTUXAS1YqHaE4lMCZWXnpHUCxHBRrUkuZSEl
 la96uPHrLLS9nzbTorHpUeG47Vf/vLt2Q5qbBma5kRmvleQdAmSAn5wUSXy+EGCE
 eHUWOdTnEc6HzhmWWv0n
 =PNSy
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-master-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: Fixes for kvm/master (targeting 4.5)

1. Fallout of some bigger floating point/vector rework in s390
- memory leak -> stable 4.3+
- memory overwrite -> stable 4.4+

2. enable KVM-VFIO for s390
2016-01-26 16:28:36 +01:00
Dan Williams
ba049e93ae kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver.  The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array.  Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory.  The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 18):

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Linus Torvalds
f689b742f2 powerpc updates for 4.5
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
  - Optimise FP/VMX/VSX context switching from Anton Blanchard
 
  - Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
    Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
  - Allow wrapper to work on non-english system from Laurent Vivier
  - Add rN aliases to the pt_regs_offset table from Rashmica Gupta
  - Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
  - Include KVM guest test in all interrupt vectors from Paul Mackerras
  - Fix DSCR inheritance over fork() from Anton Blanchard
  - Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
  - Print MSR TM bits in oops messages from Michael Neuling
  - Add TM signal return & invalid stack selftests from Michael Neuling
  - Limit EPOW reset event warnings from Vipin K Parashar
  - Remove the Cell QPACE code from Rashmica Gupta
  - Append linux_banner to exception information in xmon from Rashmica Gupta
  - Add selftest to check if VSRs are corrupted from Rashmica Gupta
  - Remove broken GregorianDay() from Daniel Axtens
  - Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
  - Add selftest script to test HMI functionality from Daniel Axtens
  - Remove obsolete OPAL v2 support from Stewart Smith
  - Make enter_rtas() private from Michael Ellerman
  - PPR exception cleanups from Michael Ellerman
  - Add page soft dirty tracking from Laurent Dufour
  - Add support for Nvlink NPUs from Alistair Popple
  - Add support for kexec on 476fpe from Alistair Popple
  - Enable kernel CPU dlpar from sysfs from Nathan Fontenot
  - Copy only required pieces of the mm_context_t to the paca from Michael Neuling
  - Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
  - Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
  - Add HWCAP bits for Power9 from Michael Ellerman
  - Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
  - Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
  - scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
  - Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
 
  - cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
  - cxl: use correct operator when writing pcie config space values from Andrew Donnellan
  - cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
  - cxl: fix build for GCC 4.6.x from Brian Norris
  - cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
  - cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
 
  - Freescale updates from Scott: Highlights include moving QE code out of
    arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
 fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
 SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
 XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
 sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
 QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
 l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
 BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
 qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
 zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
 BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
 TUxMR/38aI7bLcIQjwj9
 =R5bX
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Core:
   - Ground work for the new Power9 MMU from Aneesh Kumar K.V
   - Optimise FP/VMX/VSX context switching from Anton Blanchard

  Misc:
   - Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
     Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
     Andrew Donnellan
   - Allow wrapper to work on non-english system from Laurent Vivier
   - Add rN aliases to the pt_regs_offset table from Rashmica Gupta
   - Fix module autoload for rackmeter & axonram drivers from Luis de
     Bethencourt
   - Include KVM guest test in all interrupt vectors from Paul Mackerras
   - Fix DSCR inheritance over fork() from Anton Blanchard
   - Make value-returning atomics & {cmp}xchg* & their atomic_ versions
     fully ordered from Boqun Feng
   - Print MSR TM bits in oops messages from Michael Neuling
   - Add TM signal return & invalid stack selftests from Michael Neuling
   - Limit EPOW reset event warnings from Vipin K Parashar
   - Remove the Cell QPACE code from Rashmica Gupta
   - Append linux_banner to exception information in xmon from Rashmica
     Gupta
   - Add selftest to check if VSRs are corrupted from Rashmica Gupta
   - Remove broken GregorianDay() from Daniel Axtens
   - Import Anton's context_switch2 benchmark into selftests from
     Michael Ellerman
   - Add selftest script to test HMI functionality from Daniel Axtens
   - Remove obsolete OPAL v2 support from Stewart Smith
   - Make enter_rtas() private from Michael Ellerman
   - PPR exception cleanups from Michael Ellerman
   - Add page soft dirty tracking from Laurent Dufour
   - Add support for Nvlink NPUs from Alistair Popple
   - Add support for kexec on 476fpe from Alistair Popple
   - Enable kernel CPU dlpar from sysfs from Nathan Fontenot
   - Copy only required pieces of the mm_context_t to the paca from
     Michael Neuling
   - Add a kmsg_dumper that flushes OPAL console output on panic from
     Russell Currey
   - Implement save_stack_trace_regs() to enable kprobe stack tracing
     from Steven Rostedt
   - Add HWCAP bits for Power9 from Michael Ellerman
   - Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
   - Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
   - scripts/recordmcount.pl: support data in text section on powerpc
     from Ulrich Weigand
   - Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand

  cxl:
   - cxl: Fix possible idr warning when contexts are released from
     Vaibhav Jain
   - cxl: use correct operator when writing pcie config space values
     from Andrew Donnellan
   - cxl: Fix DSI misses when the context owning task exits from Vaibhav
     Jain
   - cxl: fix build for GCC 4.6.x from Brian Norris
   - cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
   - cxl: Enable PCI device ID for future IBM CXL adapter from Uma
     Krishnan

  Freescale:
   - Freescale updates from Scott: Highlights include moving QE code out
     of arch/powerpc (to be shared with arm), device tree updates, and
     minor fixes"

* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
  powerpc/module: Handle R_PPC64_ENTRY relocations
  scripts/recordmcount.pl: support data in text section on powerpc
  powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
  powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
  powerpc/mm: Fix _PAGE_PTE breaking swapoff
  cxl: Enable PCI device ID for future IBM CXL adapter
  cxl: use -Werror only with CONFIG_PPC_WERROR
  cxl: fix build for GCC 4.6.x
  powerpc: Add HWCAP bits for Power9
  powerpc/powernv: Reserve PE#0 on NPU
  powerpc/powernv: Change NPU PE# assignment
  powerpc/powernv: Fix update of NVLink DMA mask
  powerpc/powernv: Remove misleading comment in pci.c
  powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
  powerpc: Fix build break due to paca mm_context_t changes
  cxl: Fix DSI misses when the context owning task exits
  MAINTAINERS: Update Scott Wood's e-mail address
  powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
  powerpc: Fix style of self-test config prompts
  powerpc/powernv: Only delay opal_rtc_read() retry when necessary
  ...
2016-01-15 13:18:47 -08:00
Greg Kurz
b4d7f161fe KVM: PPC: Fix ONE_REG AltiVec support
The get and set operations got exchanged by mistake when moving the
code from book3s.c to powerpc.c.

Fixes: 3840edc803
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2016-01-14 14:43:35 +11:00
Linus Torvalds
1baa5efbeb * s390: Support for runtime instrumentation within guests,
support of 248 VCPUs.
 
 * ARM: rewrite of the arm64 world switch in C, support for
 16-bit VM identifiers.  Performance counter virtualization
 missed the boat.
 
 * x86: Support for more Hyper-V features (synthetic interrupt
 controller), MMU cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWlSKwAAoJEL/70l94x66DY0UIAK5vp4zfQoQOJC4KP4Xgxwdu
 kpnK2Boz3/74o1b0y5+eJZoUZCsXCVLtmP5uhmMxUYWDgByFG2X8ZDhPFwB5FYLT
 2dN+Lr4tsolgIfRdHZtrT6Svp9SDL039bWTdscnbR6l37/j9FRWvpKdhI3orloFD
 /i4CSW2dVIq1/9Xctwu/rtcOEesEx4Cad+6YV3/530eVAXFzE908nXfmqJNZTocY
 YCGcmrMVCOu0ng5QM4xSzmmYjKMLUcRs+QzZWkVBzdJtTgwZUr09yj7I2dZ1yj/i
 cxYrJy6shSwE74XkXsmvG+au3C5u3vX4tnXjBFErnPJ99oqzHatVnFWNRhj4dLQ=
 =PIj1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "PPC changes will come next week.

   - s390: Support for runtime instrumentation within guests, support of
     248 VCPUs.

   - ARM: rewrite of the arm64 world switch in C, support for 16-bit VM
     identifiers.  Performance counter virtualization missed the boat.

   - x86: Support for more Hyper-V features (synthetic interrupt
     controller), MMU cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (115 commits)
  kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
  kvm/x86: Hyper-V SynIC timers tracepoints
  kvm/x86: Hyper-V SynIC tracepoints
  kvm/x86: Update SynIC timers on guest entry only
  kvm/x86: Skip SynIC vector check for QEMU side
  kvm/x86: Hyper-V fix SynIC timer disabling condition
  kvm/x86: Reorg stimer_expiration() to better control timer restart
  kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
  kvm/x86: Drop stimer_stop() function
  kvm/x86: Hyper-V timers fix incorrect logical operation
  KVM: move architecture-dependent requests to arch/
  KVM: renumber vcpu->request bits
  KVM: document which architecture uses each request bit
  KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests
  kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
  KVM: s390: implement the RI support of guest
  kvm/s390: drop unpaired smp_mb
  kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
  KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
  arm/arm64: KVM: Detect vGIC presence at runtime
  ...
2016-01-12 13:22:12 -08:00
Paul Mackerras
c20875a3e6 KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR
Currently it is possible for userspace (e.g. QEMU) to set a value
for the MSR for a guest VCPU which has both of the TS bits set,
which is an illegal combination.  The result of this is that when
we execute a hrfid (hypervisor return from interrupt doubleword)
instruction to enter the guest, the CPU will take a TM Bad Thing
type of program interrupt (vector 0x700).

Now, if PR KVM is configured in the kernel along with HV KVM, we
actually handle this without crashing the host or giving hypervisor
privilege to the guest; instead what happens is that we deliver a
program interrupt to the guest, with SRR0 reflecting the address
of the hrfid instruction and SRR1 containing the MSR value at that
point.  If PR KVM is not configured in the kernel, then we try to
run the host's program interrupt handler with the MMU set to the
guest context, which almost certainly causes a host crash.

This closes the hole by making kvmppc_set_msr_hv() check for the
illegal combination and force the TS field to a safe value (00,
meaning non-transactional).

Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-12-10 11:34:27 +11:00
Geyslan G. Bem
edfaff269f KVM: PPC: Book3S PR: Remove unused variable 'vcpu_book3s'
The vcpu_book3s variable is assigned but never used. So remove it.
Found using cppcheck.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-12-09 16:06:32 +11:00
Thomas Huth
760a7364f2 KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
In the old DABR register, the BT (Breakpoint Translation) bit
is bit number 61. In the new DAWRX register, the WT (Watchpoint
Translation) bit is bit number 59. So to move the DABR-BT bit
into the position of the DAWRX-WT bit, it has to be shifted by
two, not only by one. This fixes hardware watchpoints in gdb of
older guests that only use the H_SET_DABR/X interface instead
of the new H_SET_MODE interface.

Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-12-09 16:05:01 +11:00
Paul Mackerras
1c9e3d51d5 KVM: PPC: Book3S HV: Handle unexpected traps in guest entry/exit code better
As we saw with the TM Bad Thing type of program interrupt occurring
on the hrfid that enters the guest, it is not completely impossible
to have a trap occurring in the guest entry/exit code, despite the
fact that the code has been written to avoid taking any traps.

This adds a check in the kvmppc_handle_exit_hv() function to detect
the case when a trap has occurred in the hypervisor-mode code, and
instead of treating it just like a trap in guest code, we now print
a message and return to userspace with a KVM_EXIT_INTERNAL_ERROR
exit reason.

Of the various interrupts that get handled in the assembly code in
the guest exit path and that can return directly to the guest, the
only one that can occur when MSR.HV=1 and MSR.EE=0 is machine check
(other than system call, which we can avoid just by not doing a sc
instruction).  Therefore this adds code to the machine check path to
ensure that if the MCE occurred in hypervisor mode, we exit to the
host rather than trying to continue the guest.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-12-09 15:46:14 +11:00
Anton Blanchard
579e633e76 powerpc: create flush_all_to_thread()
Create a single function that flushes everything (FP, VMX, VSX, SPE).
Doing this all at once means we only do one MSR write.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-02 19:34:40 +11:00
Anton Blanchard
c208505900 powerpc: create giveup_all()
Create a single function that gives everything up (FP, VMX, VSX, SPE).
Doing this all at once means we only do one MSR write.

A context switch microbenchmark using yield():

http://ozlabs.org/~anton/junkcode/context_switch2.c

./context_switch2 --test=yield --fp --altivec --vector 0 0

shows an improvement of 3% on POWER8.

Signed-off-by: Anton Blanchard <anton@samba.org>
[mpe: giveup_all() needs to be EXPORT_SYMBOL'ed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-02 19:34:26 +11:00
Anton Blanchard
dc4fbba11e powerpc: Create disable_kernel_{fp,altivec,vsx,spe}()
The enable_kernel_*() functions leave the relevant MSR bits enabled
until we exit the kernel sometime later. Create disable versions
that wrap the kernel use of FP, Altivec VSX or SPE.

While we don't want to disable it normally for performance reasons
(MSR writes are slow), it will be used for a debug boot option that
does this and catches bad uses in other areas of the kernel.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-01 13:52:25 +11:00
David Hildenbrand
e09fefdeeb KVM: Use common function for VCPU lookup by id
Let's reuse the new common function for VPCU lookup by id.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split out the new function into a separate patch]
2015-11-30 12:47:04 +01:00
Yaowei Bai
378b417d65 KVM: powerpc: kvmppc_visible_gpa can be boolean
In another patch kvm_is_visible_gfn is maken return bool due to this
function only returns zero or one as its return value, let's also make
kvmppc_visible_gpa return bool to keep consistent.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:24 +01:00
Linus Torvalds
3370b69eb0 Four changes:
- x86: work around two nasty cases where a benign exception occurs while
 another is being delivered.  The endless stream of exceptions causes an
 infinite loop in the processor, which not even NMIs or SMIs can interrupt;
 in the virt case, there is no possibility to exit to the host either.
 
 - x86: support for Skylake per-guest TSC rate.  Long supported by AMD,
 the patches mostly move things from there to common arch/x86/kvm/ code.
 
 - generic: remove local_irq_save/restore from the guest entry and exit
 paths when context tracking is enabled.  The patches are a few months
 old, but we discussed them again at kernel summit.  Andy will pick up
 from here and, in 4.5, try to remove it from the user entry/exit paths.
 
 - PPC: Two bug fixes, see merge commit 370289756b for details.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
 6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
 Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
 jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
 23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
 o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
 =Um9M
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second batch of kvm updates from Paolo Bonzini:
 "Four changes:

   - x86: work around two nasty cases where a benign exception occurs
     while another is being delivered.  The endless stream of exceptions
     causes an infinite loop in the processor, which not even NMIs or
     SMIs can interrupt; in the virt case, there is no possibility to
     exit to the host either.

   - x86: support for Skylake per-guest TSC rate.  Long supported by
     AMD, the patches mostly move things from there to common
     arch/x86/kvm/ code.

   - generic: remove local_irq_save/restore from the guest entry and
     exit paths when context tracking is enabled.  The patches are a few
     months old, but we discussed them again at kernel summit.  Andy
     will pick up from here and, in 4.5, try to remove it from the user
     entry/exit paths.

   - PPC: Two bug fixes, see merge commit 370289756b for details"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: x86: rename update_db_bp_intercept to update_bp_intercept
  KVM: svm: unconditionally intercept #DB
  KVM: x86: work around infinite loop in microcode when #AC is delivered
  context_tracking: avoid irq_save/irq_restore on guest entry and exit
  context_tracking: remove duplicate enabled check
  KVM: VMX: Dump TSC multiplier in dump_vmcs()
  KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
  KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
  KVM: VMX: Enable and initialize VMX TSC scaling
  KVM: x86: Use the correct vcpu's TSC rate to compute time scale
  KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
  KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
  KVM: x86: Replace call-back compute_tsc_offset() with a common function
  KVM: x86: Replace call-back set_tsc_khz() with a common function
  KVM: x86: Add a common TSC scaling function
  KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
  KVM: x86: Collect information for setting TSC scaling ratio
  KVM: x86: declare a few variables as __read_mostly
  KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
  KVM: PPC: Book3S HV: Don't dynamically split core when already split
  ...
2015-11-12 14:34:06 -08:00
Linus Torvalds
2f4bf528ec powerpc updates for 4.4
- Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng
  - Refresh ps3_defconfig from Geoff Levand
  - Emit GNU & SysV hashes for the vdso from Michael Ellerman
  - Define an enum for the bolted SLB indexes from Anshuman Khandual
  - Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman
  - Add gettimeofday() benchmark from Michael Neuling
  - Avoid link stack corruption in __get_datapage() from Michael Neuling
  - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V
  - Add ppc64le_defconfig from Michael Ellerman
  - pseries: extract of_helpers module from Andy Shevchenko
  - Correct string length in pseries_of_derive_parent() from Nathan Fontenot
  - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
  - Shorten irq_chip name for the SIU from Christophe Leroy
  - Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas
  - Fix _ALIGN_* errors due to type difference. from Aneesh Kumar K.V
  - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King
  - Disable hugepd for 64K page size. from Aneesh Kumar K.V
  - Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V
  - Make PCI non-optional for pseries from Michael Ellerman
  - Individual System V IPC system calls from Sam bobroff
  - Add selftest of unmuxed IPC calls from Michael Ellerman
  - discard .exit.data at runtime from Stephen Rothwell
  - Delete old orphaned PrPMC 280/2800 DTS and boot file. from Paul Gortmaker
  - Use of_get_next_parent to simplify code from Christophe Jaillet
  - Paginate some xmon output from Sam bobroff
  - Add some more elements to the xmon PACA dump from Michael Ellerman
  - Allow the tm-syscall selftest to build with old headers from Michael Ellerman
  - Run EBB selftests only on POWER8 from Denis Kirjanov
  - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman
  - Avoid reference to potentially freed memory in prom.c from Christophe Jaillet
  - Quieten boot wrapper output with run_cmd from Geoff Levand
  - EEH fixes and cleanups from Gavin Shan
  - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
  - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman
  - Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov
  - Fix ps3-lpm white space from Rudhresh Kumar J
  - Fix ps3-vuart null dereference from Colin King
  - nvram: Add missing kfree in error path from Christophe Jaillet
  - nvram: Fix function name in some errors messages. from Christophe Jaillet
  - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen
  - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
  - cxl: Free virtual PHB when removing from Andrew Donnellan
  - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman
  - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman
 
  - Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump
    support, a rework of the qoriq clock driver, device tree changes including
    qoriq fman nodes, support for a new 85xx board, and some fixes.
 
  - MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x
    LocalPlus Bus FIFO with its device tree binding documentation, mpc512x
    device tree updates and some minor fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWPEZgAAoJEFHr6jzI4aWANjYQAKX2Q/95hqKfCuF5FBcUmtMC
 Pu/Nff027MVzxZ2ApDcvvLGps5Nz2bn3nIhc9zjkXc5E8DuL6X3Yl8ce7qyNcc3g
 cJJ8RvtUo6J1OMWetXFehtPYniAAwKMhZYKnj0+WnLr2SyH/Vhl3ehDkFbGyPtuH
 r+2E7krFjfVgU+bzciIFnOaDekFuFN/pXWMb6e6zQyBJe9N8ZIp96uouGCebKVd0
 VDLItzdaKErT8JFfbymMPvZm3V0rMVx4WWu3kAbQX8LrD5a18NF1zrjAOHRXc61n
 kkk8/DPuNOon1PbXXyiS5BcFyZRe+KE3VBnoW5sOMqMIRg5WdO1oU3e2pEfXMO8+
 leXYwFLXiKzUZuOgQG2QiUhrzD2yC1o6/TJWATv0dSl9AwrecgPX+Vj6X357slAf
 A9E3eMy5tgnpndBWZmvZS3W7YDKH+NkeZ+Q40+NErAlqr++ErrTcKVndk5vWlYTT
 7mMZeTXagX66al/k5ATKqwB7iUSpnYHSAa9fcUYPSM2FnXsDxPyeJGkBbcoOmkGj
 QrpgNYOvJaUJd076goZCV39v0c1xpfV9/9kyVch8HUadf6JcjpVZwYnbGw2qlJjh
 ZanuBG2VOeSwaKQqXiRBSBetnpAg8CVpFjDmX9wOBfSek2wxEJqDX/vQExdbIDQQ
 pUs7vnUxLzhmW/x+ygOI
 =YwcM
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Kconfig: remove BE-only platforms from LE kernel build from Boqun
   Feng
 - Refresh ps3_defconfig from Geoff Levand
 - Emit GNU & SysV hashes for the vdso from Michael Ellerman
 - Define an enum for the bolted SLB indexes from Anshuman Khandual
 - Use a local to avoid multiple calls to get_slb_shadow() from Michael
   Ellerman
 - Add gettimeofday() benchmark from Michael Neuling
 - Avoid link stack corruption in __get_datapage() from Michael Neuling
 - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
   K.V
 - Add ppc64le_defconfig from Michael Ellerman
 - pseries: extract of_helpers module from Andy Shevchenko
 - Correct string length in pseries_of_derive_parent() from Nathan
   Fontenot
 - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
 - Shorten irq_chip name for the SIU from Christophe Leroy
 - Wait 1s for secondaries to enter OPAL during kexec from Samuel
   Mendoza-Jonas
 - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
 - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
   Colin Ian King
 - Disable hugepd for 64K page size, from Aneesh Kumar K.V
 - Differentiate between hugetlb and THP during page walk from Aneesh
   Kumar K.V
 - Make PCI non-optional for pseries from Michael Ellerman
 - Individual System V IPC system calls from Sam bobroff
 - Add selftest of unmuxed IPC calls from Michael Ellerman
 - discard .exit.data at runtime from Stephen Rothwell
 - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
   Gortmaker
 - Use of_get_next_parent to simplify code from Christophe Jaillet
 - Paginate some xmon output from Sam bobroff
 - Add some more elements to the xmon PACA dump from Michael Ellerman
 - Allow the tm-syscall selftest to build with old headers from Michael
   Ellerman
 - Run EBB selftests only on POWER8 from Denis Kirjanov
 - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
   Ellerman
 - Avoid reference to potentially freed memory in prom.c from Christophe
   Jaillet
 - Quieten boot wrapper output with run_cmd from Geoff Levand
 - EEH fixes and cleanups from Gavin Shan
 - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
 - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
   Ellerman
 - Fix section mismatch warning in msi_bitmap_alloc() from Denis
   Kirjanov
 - Fix ps3-lpm white space from Rudhresh Kumar J
 - Fix ps3-vuart null dereference from Colin King
 - nvram: Add missing kfree in error path from Christophe Jaillet
 - nvram: Fix function name in some errors messages, from Christophe
   Jaillet
 - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
   Koskinen
 - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
 - cxl: Free virtual PHB when removing from Andrew Donnellan
 - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
   Michael Ellerman
 - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
   with O= from Michael Ellerman
 - Freescale updates from Scott: Highlights include 64-bit book3e
   kexec/kdump support, a rework of the qoriq clock driver, device tree
   changes including qoriq fman nodes, support for a new 85xx board, and
   some fixes.
 - MPC5xxx updates from Anatolij: Highlights include a driver for
   MPC512x LocalPlus Bus FIFO with its device tree binding
   documentation, mpc512x device tree updates and some minor fixes.

* tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
  powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
  powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
  powerpc/pseries: Correct string length in pseries_of_derive_parent()
  powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
  powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
  powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
  powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
  powerpc: handle error case in cpm_muram_alloc()
  powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
  powerpc/book3e-64: Enable kexec
  powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
  powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
  powerpc/book3e-64/kexec: Enable SMP release
  powerpc/book3e-64/kexec: create an identity TLB mapping
  powerpc/book3e-64: Don't limit paca to 256 MiB
  powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
  powerpc/book3e: support CONFIG_RELOCATABLE
  powerpc/booke64: Fix args to copy_and_flush
  powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
  powerpc/e6500: kexec: Handle hardware threads
  ...
2015-11-05 23:38:43 -08:00
Paul Mackerras
f74f2e2e26 KVM: PPC: Book3S HV: Don't dynamically split core when already split
In static micro-threading modes, the dynamic micro-threading code
is supposed to be disabled, because subcores can't make independent
decisions about what micro-threading mode to put the core in - there is
only one micro-threading mode for the whole core.  The code that
implements dynamic micro-threading checks for this, except that the
check was missed in one case.  This means that it is possible for a
subcore in static 2-way micro-threading mode to try to put the core
into 4-way micro-threading mode, which usually leads to stuck CPUs,
spinlock lockups, and other stalls in the host.

The problem was in the can_split_piggybacked_subcores() function, which
should always return false if the system is in a static micro-threading
mode.  This fixes the problem by making can_split_piggybacked_subcores()
use subcore_config_ok() for its checks, as subcore_config_ok() includes
the necessary check for the static micro-threading modes.

Credit to Gautham Shenoy for working out that the reason for the hangs
and stalls we were seeing was that we were trying to do dynamic 4-way
micro-threading while we were in static 2-way mode.

Fixes: b4deba5c41
Cc: vger@stable.kernel.org # v4.3
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-11-06 16:02:59 +11:00
Paul Mackerras
cf29b21595 KVM: PPC: Book3S HV: Synthesize segment fault if SLB lookup fails
When handling a hypervisor data or instruction storage interrupt (HDSI
or HISI), we look up the SLB entry for the address being accessed in
order to translate the effective address to a virtual address which can
be looked up in the guest HPT.  This lookup can occasionally fail due
to the guest replacing an SLB entry without invalidating the evicted
SLB entry.  In this situation an ERAT (effective to real address
translation cache) entry can persist and be used by the hardware even
though there is no longer a corresponding SLB entry.

Previously we would just deliver a data or instruction storage interrupt
(DSI or ISI) to the guest in this case.  However, this is not correct
and has been observed to cause guests to crash, typically with a
data storage protection interrupt on a store to the vmemmap area.

Instead, what we do now is to synthesize a data or instruction segment
interrupt.  That should cause the guest to reload an appropriate entry
into the SLB and retry the faulting instruction.  If it still faults,
we should find an appropriate SLB entry next time and be able to handle
the fault.

Tested-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-11-06 15:40:42 +11:00
Linus Torvalds
933425fb00 s390: A bunch of fixes and optimizations for interrupt and time
handling.
 
 PPC: Mostly bug fixes.
 
 ARM: No big features, but many small fixes and prerequisites including:
 - a number of fixes for the arch-timer
 - introducing proper level-triggered semantics for the arch-timers
 - a series of patches to synchronously halt a guest (prerequisite for
   IRQ forwarding)
 - some tracepoint improvements
 - a tweak for the EL2 panic handlers
 - some more VGIC cleanups getting rid of redundant state
 
 x86: quite a few changes:
 
 - support for VT-d posted interrupts (i.e. PCI devices can inject
 interrupts directly into vCPUs).  This introduces a new component (in
 virt/lib/) that connects VFIO and KVM together.  The same infrastructure
 will be used for ARM interrupt forwarding as well.
 
 - more Hyper-V features, though the main one Hyper-V synthetic interrupt
 controller will have to wait for 4.5.  These will let KVM expose Hyper-V
 devices.
 
 - nested virtualization now supports VPID (same as PCID but for vCPUs)
 which makes it quite a bit faster
 
 - for future hardware that supports NVDIMM, there is support for clflushopt,
 clwb, pcommit
 
 - support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in
 userspace, which reduces the attack surface of the hypervisor
 
 - obligatory smattering of SMM fixes
 
 - on the guest side, stable scheduler clock support was rewritten to not
 require help from the hypervisor.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWO2IQAAoJEL/70l94x66D/K0H/3AovAgYmJQToZlimsktMk6a
 f2xhdIqfU5lIQQh5uNBCfL3o9o8H9Py1ym7aEw3fmztPHHJYc91oTatt2UEKhmEw
 VtZHp/dFHt3hwaIdXmjRPEXiYctraKCyrhaUYdWmUYkoKi7lW5OL5h+S7frG2U6u
 p/hFKnHRZfXHr6NSgIqvYkKqtnc+C0FWY696IZMzgCksOO8jB1xrxoSN3tANW3oJ
 PDV+4og0fN/Fr1capJUFEc/fejREHneANvlKrLaa8ht0qJQutoczNADUiSFLcMPG
 iHljXeDsv5eyjMtUuIL8+MPzcrIt/y4rY41ZPiKggxULrXc6H+JJL/e/zThZpXc=
 =iv2z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.4.

  s390:
     A bunch of fixes and optimizations for interrupt and time handling.

  PPC:
     Mostly bug fixes.

  ARM:
     No big features, but many small fixes and prerequisites including:

      - a number of fixes for the arch-timer

      - introducing proper level-triggered semantics for the arch-timers

      - a series of patches to synchronously halt a guest (prerequisite
        for IRQ forwarding)

      - some tracepoint improvements

      - a tweak for the EL2 panic handlers

      - some more VGIC cleanups getting rid of redundant state

  x86:
     Quite a few changes:

      - support for VT-d posted interrupts (i.e. PCI devices can inject
        interrupts directly into vCPUs).  This introduces a new
        component (in virt/lib/) that connects VFIO and KVM together.
        The same infrastructure will be used for ARM interrupt
        forwarding as well.

      - more Hyper-V features, though the main one Hyper-V synthetic
        interrupt controller will have to wait for 4.5.  These will let
        KVM expose Hyper-V devices.

      - nested virtualization now supports VPID (same as PCID but for
        vCPUs) which makes it quite a bit faster

      - for future hardware that supports NVDIMM, there is support for
        clflushopt, clwb, pcommit

      - support for "split irqchip", i.e.  LAPIC in kernel +
        IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
        the hypervisor

      - obligatory smattering of SMM fixes

      - on the guest side, stable scheduler clock support was rewritten
        to not require help from the hypervisor"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
  KVM: VMX: Fix commit which broke PML
  KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
  KVM: x86: allow RSM from 64-bit mode
  KVM: VMX: fix SMEP and SMAP without EPT
  KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
  KVM: device assignment: remove pointless #ifdefs
  KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
  KVM: x86: zero apic_arb_prio on reset
  drivers/hv: share Hyper-V SynIC constants with userspace
  KVM: x86: handle SMBASE as physical address in RSM
  KVM: x86: add read_phys to x86_emulate_ops
  KVM: x86: removing unused variable
  KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
  KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
  KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
  KVM: arm/arm64: Optimize away redundant LR tracking
  KVM: s390: use simple switch statement as multiplexer
  KVM: s390: drop useless newline in debugging data
  KVM: s390: SCA must not cross page boundaries
  KVM: arm: Do not indent the arguments of DECLARE_BITMAP
  ...
2015-11-05 16:26:26 -08:00
Paul Mackerras
23316316c1 powerpc: Revert "Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8"
This reverts commit 9678cdaae9 ("Use the POWER8 Micro Partition
Prefetch Engine in KVM HV on POWER8") because the original commit had
multiple, partly self-cancelling bugs, that could cause occasional
memory corruption.

In fact the logmpp instruction was incorrectly using register r0 as the
source of the buffer address and operation code, and depending on what
was in r0, it would either do nothing or corrupt the 64k page pointed to
by r0.

The logmpp instruction encoding and the operation code definitions could
be corrected, but then there is the problem that there is no clearly
defined way to know when the hardware has finished writing to the
buffer.

The original commit attempted to work around this by aborting the
write-out before starting the prefetch, but this is ineffective in the
case where the virtual core is now executing on a different physical
core from the one where the write-out was initiated.

These problems plus advice from the hardware designers not to use the
function (since the measured performance improvement from using the
feature was actually mostly negative), mean that reverting the code is
the best option.

Fixes: 9678cdaae9 ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8")
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-21 20:50:30 +11:00
Gautham R. Shenoy
70aa3961a1 KVM: PPC: Book3S HV: Handle H_DOORBELL on the guest exit path
Currently a CPU running a guest can receive a H_DOORBELL in the
following two cases:
1) When the CPU is napping due to CEDE or there not being a guest
vcpu.
2) The CPU is running the guest vcpu.

Case 1), the doorbell message is not cleared since we were waking up
from nap. Hence when the EE bit gets set on transition from guest to
host, the H_DOORBELL interrupt is delivered to the host and the
corresponding handler is invoked.

However in Case 2), the message gets cleared by the action of taking
the H_DOORBELL interrupt. Since the CPU was running a guest, instead
of invoking the doorbell handler, the code invokes the second-level
interrupt handler to switch the context from the guest to the host. At
this point the setting of the EE bit doesn't result in the CPU getting
the doorbell interrupt since it has already been delivered once. So,
the handler for this doorbell is never invoked!

This causes softlockups if the missed DOORBELL was an IPI sent from a
sibling subcore on the same CPU.

This patch fixes it by explitly invoking the doorbell handler on the
exit path if the exit reason is H_DOORBELL similar to the way an
EXTERNAL interrupt is handled. Since this will also handle Case 1), we
can unconditionally clear the doorbell message in
kvmppc_check_wake_reason.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-21 16:31:52 +11:00
Nikunj A Dadhania
bfec5c2cc0 KVM: PPC: Implement extension to report number of memslots
QEMU assumes 32 memslots if this extension is not implemented. Although,
current value of KVM_USER_MEM_SLOTS is 32, once KVM_USER_MEM_SLOTS
changes QEMU would take a wrong value.

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-21 16:31:46 +11:00
Paul Mackerras
c64dfe2af3 KVM: PPC: Book3S HV: Make H_REMOVE return correct HPTE value for absent HPTEs
This fixes a bug where the old HPTE value returned by H_REMOVE has
the valid bit clear if the HPTE was an absent HPTE, as happens for
HPTEs for emulated MMIO pages and for RAM pages that have been paged
out by the host.  If the absent bit is set, we clear it and set the
valid bit, because from the guest's point of view, the HPTE is valid.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-21 16:25:06 +11:00
Paul Mackerras
572abd563b KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation ioctl
Currently the KVM_PPC_ALLOCATE_HTAB will try to allocate the requested
size of HPT, and if that is not possible, then try to allocate smaller
sizes (by factors of 2) until either a minimum is reached or the
allocation succeeds.  This is not ideal for userspace, particularly in
migration scenarios, where the destination VM really does require the
size requested.  Also, the minimum HPT size of 256kB may be
insufficient for the guest to run successfully.

This removes the fallback to smaller sizes on allocation failure for
the KVM_PPC_ALLOCATE_HTAB ioctl.  The fallback still exists for the
case where the HPT is allocated at the time the first VCPU is run, if
no HPT has been allocated by ioctl by that time.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-21 16:24:57 +11:00
Mahesh Salgaonkar
966d713e86 KVM: PPC: Book3S HV: Deliver machine check with MSR(RI=0) to guest as MCE
For the machine check interrupt that happens while we are in the guest,
kvm layer attempts the recovery, and then delivers the machine check interrupt
directly to the guest if recovery fails. On successful recovery we go back to
normal functioning of the guest. But there can be cases where a machine check
interrupt can happen with MSR(RI=0) while we are in the guest. This means
MC interrupt is unrecoverable and we have to deliver a machine check to the
guest since the machine check interrupt might have trashed valid values in
SRR0/1. The current implementation do not handle this case, causing guest
to crash with Bad kernel stack pointer instead of machine check oops message.

[26281.490060] Bad kernel stack pointer 3fff9ccce5b0 at c00000000000490c
[26281.490434] Oops: Bad kernel stack pointer, sig: 6 [#1]
[26281.490472] SMP NR_CPUS=2048 NUMA pSeries

This patch fixes this issue by checking MSR(RI=0) in KVM layer and forwarding
unrecoverable interrupt to guest which then panics with proper machine check
Oops message.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-16 11:53:47 +11:00
Tudor Laurentiu
224f363246 KVM: PPC: e500: fix couple of shift operations on 64 bits
Fix couple of cases where we shift left a 32-bit
value thus might get truncated results on 64-bit
targets.

Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Suggested-by: Scott Wood <scotttwood@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-15 15:59:19 +11:00
Tudor Laurentiu
2daab50e17 KVM: PPC: e500: Emulate TMCFG0 TMRN register
Emulate TMCFG0 TMRN register exposing one HW thread per vcpu.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[Laurentiu.Tudor@freescale.com: rebased on latest kernel, use
 define instead of hardcoded value, moved code in own function]
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Acked-by: Scott Wood <scotttwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-15 15:58:16 +11:00
Andrzej Hajda
d4cd4f9586 KVM: PPC: e500: fix handling local_sid_lookup result
The function can return negative value.

The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2046107

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-10-15 15:58:16 +11:00
Aneesh Kumar K.V
891121e6c0 powerpc/mm: Differentiate between hugetlb and THP during page walk
We need to properly identify whether a hugepage is an explicit or
a transparent hugepage in follow_huge_addr(). We used to depend
on hugepage shift argument to do that. But in some case that can
result in wrong results. For ex:

On finding a transparent hugepage we set hugepage shift to PMD_SHIFT.
But we can end up clearing the thp pte, via pmdp_huge_get_and_clear.
We do prevent reusing the pfn page via the usage of
kick_all_cpus_sync(). But that happens after we updated the pte to 0.
Hence in follow_huge_addr() we can find hugepage shift set, but transparent
huge page check fail for a thp pte.

NOTE: We fixed a variant of this race against thp split in commit
691e95fd73
("powerpc/mm/thp: Make page table walk safe against thp split/collapse")

Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
follow_page_mask occasionally.

In the long term, we may want to switch ppc64 64k page size config to
enable CONFIG_ARCH_WANT_GENERAL_HUGETLB

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-12 15:30:09 +11:00
Thomas Huth
3eb4ee6825 KVM: PPC: Book3S: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()
Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
functions) has to be protected via the kvm->srcu lock.
The kvmppc_h_logical_ci_load() and -store() functions are missing
this lock so far, so let's add it there, too.
This fixes the problem that the kernel reports "suspicious RCU usage"
when lock debugging is enabled.

Cc: stable@vger.kernel.org # v4.1+
Fixes: 99342cf804
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-09-21 09:05:15 +10:00
Gautham R. Shenoy
7e022e717f KVM: PPC: Book3S HV: Pass the correct trap argument to kvmhv_commence_exit
In guest_exit_cont we call kvmhv_commence_exit which expects the trap
number as the argument. However r3 doesn't contain the trap number at
this point and as a result we would be calling the function with a
spurious trap number.

Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
r12 contains the trap number.

Cc: stable@vger.kernel.org # v4.1+
Fixes: eddb60fb14
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-09-21 09:05:12 +10:00
Paul Mackerras
5fc3e64f94 KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs
This fixes a bug which results in stale vcore pointers being left in
the per-cpu preempted vcore lists when a VM is destroyed.  The result
of the stale vcore pointers is usually either a crash or a lockup
inside collect_piggybacks() when another VM is run.  A typical
lockup message looks like:

[  472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
[  472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
[  472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
[  472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
[  472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
[  472.162337] REGS: c000001e41bff520 TRAP: 0901   Not tainted  (4.2.0-kvm+)
[  472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24848844  XER: 00000000
[  472.162588] CFAR: c00000000096b0ac SOFTE: 1
GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
GPR12: c000000000111130 c00000000fdae400
[  472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
[  472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
[  472.163179] Call Trace:
[  472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
[  472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
[  472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
[  472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
[  472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
[  472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
[  472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
[  472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
[  472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
[  472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
[  472.164060] Instruction dump:
[  472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
[  472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070

The bug is that kvmppc_run_vcpu does not correctly handle the case
where a vcpu task receives a signal while its guest vcpu is executing
in the guest as a result of being piggy-backed onto the execution of
another vcore.  In that case we need to wait for the vcpu to finish
executing inside the guest, and then remove this vcore from the
preempted vcores list.  That way, we avoid leaving this vcpu's vcore
on the preempted vcores list when the vcpu gets interrupted.

Fixes: ec25716508
Reported-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-09-21 09:00:47 +10:00
Paolo Bonzini
62bea5bff4 KVM: add halt_attempted_poll to VCPU stats
This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.

For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling.  This would show as an abnormally high number of
attempted polling compared to the successful polls.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-16 12:17:00 +02:00
Linus Torvalds
519f526d39 ARM:
- Full debug support for arm64
 - Active state switching for timer interrupts
 - Lazy FP/SIMD save/restore for arm64
 - Generic ARMv8 target
 
 PPC:
 - Book3S: A few bug fixes
 - Book3S: Allow micro-threading on POWER8
 
 x86:
 - Compiler warnings
 
 Generic:
 - Adaptive polling for guest halt
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJV7qd/AAoJEL/70l94x66DDBcH/2OLomKHjDOGXqJ/dpkqf4UU
 FYI1pVjs2zP4z3L7RYV/DeuEsD6XaWzS7EXQOS3mcb9d8GWahPrdofeVmpmhg/8y
 jmkuUEFHl2Ut6imk8qDlG3m42c86Mk8/1k38l1bp8S3lL0/Q7IyADyYAlHdwzpOx
 yEyOAE4VU4n+VyQH5dbnzc12QRTeHfRQc/dI3eQq238gf37SF/1qzOzeLIdbEa+N
 DCzqQ8SExbctiRaLzCY5Ogan+unZBQbFfhrDrUSryywrzo/8WRFVmbjuf5O5Ucxa
 +UTLMvmm1YgxvBvWhlcmA+HSzSVeWNvaHQ9illgE5+74G5CzaD2ukurmoz/+r+A=
 =XtrL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:
 "ARM:
   - Full debug support for arm64
   - Active state switching for timer interrupts
   - Lazy FP/SIMD save/restore for arm64
   - Generic ARMv8 target

  PPC:
   - Book3S: A few bug fixes
   - Book3S: Allow micro-threading on POWER8

  x86:
   - Compiler warnings

  Generic:
   - Adaptive polling for guest halt"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
  kvm: irqchip: fix memory leak
  kvm: move new trace event outside #ifdef CONFIG_KVM_ASYNC_PF
  KVM: trace kvm_halt_poll_ns grow/shrink
  KVM: dynamic halt-polling
  KVM: make halt_poll_ns per-vCPU
  Silence compiler warning in arch/x86/kvm/emulate.c
  kvm: compile process_smi_save_seg_64() only for x86_64
  KVM: x86: avoid uninitialized variable warning
  KVM: PPC: Book3S: Fix typo in top comment about locking
  KVM: PPC: Book3S: Fix size of the PSPB register
  KVM: PPC: Book3S HV: Exit on H_DOORBELL if HOST_IPI is set
  KVM: PPC: Book3S HV: Fix race in starting secondary threads
  KVM: PPC: Book3S: correct width in XER handling
  KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation
  KVM: PPC: Book3S HV: Fix preempted vcore list locking
  KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD
  KVM: PPC: Book3S HV: Fix bug in dirty page tracking
  KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE
  KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
  KVM: PPC: Book3S HV: Make use of unused threads when running guests
  ...
2015-09-10 16:42:49 -07:00
Greg Kurz
4e33d1f0a1 KVM: PPC: Book3S: Fix typo in top comment about locking
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-09-04 07:28:05 +10:00
Gautham R. Shenoy
06554d9f6c KVM: PPC: Book3S HV: Exit on H_DOORBELL if HOST_IPI is set
The code that handles the case when we receive a H_DOORBELL interrupt
has a comment which says "Hypervisor doorbell - exit only if host IPI
flag set".  However, the current code does not actually check if the
host IPI flag is set.  This is due to a comparison instruction that
got missed.

As a result, the current code performs the exit to host only
if some sibling thread or a sibling sub-core is exiting to the
host.  This implies that, an IPI sent to a sibling core in
(subcores-per-core != 1) mode will be missed by the host unless the
sibling core is on the exit path to the host.

This patch adds the missing comparison operation which will ensure
that when HOST_IPI flag is set, we unconditionally exit to the host.

Fixes: 66feed61cd
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-09-03 16:08:34 +10:00
Gautham R. Shenoy
7f23532866 KVM: PPC: Book3S HV: Fix race in starting secondary threads
The current dynamic micro-threading code has a race due to which a
secondary thread naps when it is supposed to be running a vcpu. As a
side effect of this, on a guest exit, the primary thread in
kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
its vcore pointer. This results in "CPU X seems to be stuck!"
warnings.

The race is possible since the primary thread on exiting the guests
only waits for all the secondaries to clear its vcore pointer. It
subsequently expects the secondary threads to enter nap while it
unsplits the core. A secondary thread which hasn't yet entered the nap
will loop in kvm_no_guest until its vcore pointer and the do_nap flag
are unset. Once the core has been unsplit, a new vcpu thread can grab
the core and set the do_nap flag *before* setting the vcore pointers
of the secondary. As a result, the secondary thread will now enter nap
via kvm_unsplit_nap instead of running the guest vcpu.

Fix this by setting the do_nap flag after setting the vcore pointer in
the PACA of the secondary in kvmppc_run_core. Also, ensure that a
secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
in its PACA struct is set.

Fixes: b4deba5c41
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2015-09-03 16:07:42 +10:00
Linus Torvalds
089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
Sam bobroff
c63517c2e3 KVM: PPC: Book3S: correct width in XER handling
In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
accessed as such.

This patch corrects places where it is accessed as a 32 bit field by a
64 bit kernel.  In some cases this is via a 32 bit load or store
instruction which, depending on endianness, will cause either the
lower or upper 32 bits to be missed.  In another case it is cast as a
u32, causing the upper 32 bits to be cleared.

This patch corrects those places by extending the access methods to
64 bits.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:19 +02:00
Paul Mackerras
563a1e93af KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation
Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen
time for it.  This currently isn't the case when we have a vcore that
no longer has any runnable threads in it but still has a runner task,
so we do an explicit call to kvmppc_core_start_stolen() in that case.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:19 +02:00
Paul Mackerras
402813fe39 KVM: PPC: Book3S HV: Fix preempted vcore list locking
When a vcore gets preempted, we put it on the preempted vcore list for
the current CPU.  The runner task then calls schedule() and comes back
some time later and takes itself off the list.  We need to be careful
to lock the list that it was put onto, which may not be the list for the
current CPU since the runner task may have moved to another CPU.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:19 +02:00
Paul Mackerras
cdeee51842 KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD
This adds implementations for the H_CLEAR_REF (test and clear reference
bit) and H_CLEAR_MOD (test and clear changed bit) hypercalls.

When clearing the reference or change bit in the guest view of the HPTE,
we also have to clear it in the real HPTE so that we can detect future
references or changes.  When we do so, we transfer the R or C bit value
to the rmap entry for the underlying host page so that kvm_age_hva_hv(),
kvm_test_age_hva_hv() and kvmppc_hv_get_dirty_log() know that the page
has been referenced and/or changed.

These hypercalls are not used by Linux guests.  These implementations
have been tested using a FreeBSD guest.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:18 +02:00
Paul Mackerras
08fe1e7bd2 KVM: PPC: Book3S HV: Fix bug in dirty page tracking
This fixes a bug in the tracking of pages that get modified by the
guest.  If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.

To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE.  Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.

The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages.  However, we also fix
this case for the sake of future-proofing.

The reference bit is also subject to the same loss of information.  We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified.  Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:18 +02:00
Paul Mackerras
1e5bf454f5 KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE
The reference (R) and change (C) bits in a HPT entry can be set by
hardware at any time up until the HPTE is invalidated and the TLB
invalidation sequence has completed.  This means that when removing
a HPTE, we need to read the HPTE after the invalidation sequence has
completed in order to obtain reliable values of R and C.  The code
in kvmppc_do_h_remove() used to do this.  However, commit 6f22bd3265
("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
read after invalidation as a side effect of other changes.  This
restores the read of the HPTE after invalidation.

The user-visible effect of this bug would be that when migrating a
guest, there is a small probability that a page modified by the guest
and then unmapped by the guest might not get re-transmitted and thus
the destination might end up with a stale copy of the page.

Fixes: 6f22bd3265
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:18 +02:00
Paul Mackerras
b4deba5c41 KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip.  Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core.  With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.

Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.

To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread.  In addition we extend the core_info struct to
have information on each subcore.  When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.

Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode.  It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.

This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap.  In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered.  The default is 6.

With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch.  These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.

It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest.  In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path.  In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:17 +02:00
Paul Mackerras
ec25716508 KVM: PPC: Book3S HV: Make use of unused threads when running guests
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused.  This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met.  This applies on POWER7, and on POWER8 to guests
with one thread per virtual core.  (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)

The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution.  If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.

After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.

This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on.  In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:17 +02:00
Tudor Laurentiu
845ac985cf KVM: PPC: add missing pt_regs initialization
On this switch branch the regs initialization
doesn't happen so add it.
This was found with the help of a static
code analysis tool.

Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:17 +02:00
Thomas Huth
5358a96341 KVM: PPC: Fix warnings from sparse
When compiling the KVM code for POWER with "make C=1", sparse
complains about functions missing proper prototypes and a 64-bit
constant missing the ULL prefix. Let's fix this by making the
functions static or by including the proper header with the
prototypes, and by appending a ULL prefix to the constant
PPC_MPPE_ADDRESS_MASK.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:16 +02:00
Thomas Huth
129fd4233b KVM: PPC: Remove PPC970 from KVM_BOOK3S_64_HV text in Kconfig
Since the PPC970 support has been removed from the kvm-hv kernel
module recently, we should also reflect this change in the help
text of the corresponding Kconfig option.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:16 +02:00
Tudor Laurentiu
f5ffe330f5 KVM: PPC: fix suspicious use of conditional operator
This was signaled by a static code analysis tool.

Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-08-22 11:16:16 +02:00
Laurent Pinchart
60acc4ebe7 treewide: Fix typo compatability -> compatibility
Even though 'compatability' has a dedicated entry in the Wiktionary,
it's listed as 'Mispelling of compatibility'. Fix it.

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> for the atomic_helper.c
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 14:01:39 +02:00
Konstantin Khlebnikov
c56dadf397 sched/preempt, powerpc, kvm: Use need_resched() instead of should_resched()
Function should_resched() is equal to (!preempt_count() && need_resched()).
In preemptive kernel preempt_count here is non-zero because of vc->lock.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150715095203.12246.72922.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:24 +02:00
Linus Torvalds
4e241557fc The bulk of the changes here is for x86. And for once it's not
for silicon that no one owns: these are really new features for
 everyone.
 
 * ARM: several features are in progress but missed the 4.2 deadline.
 So here is just a smattering of bug fixes, plus enabling the VFIO
 integration.
 
 * s390: Some fixes/refactorings/optimizations, plus support for
 2GB pages.
 
 * x86: 1) host and guest support for marking kvmclock as a stable
 scheduler clock. 2) support for write combining. 3) support for
 system management mode, needed for secure boot in guests. 4) a bunch
 of cleanups required for 2+3.  5) support for virtualized performance
 counters on AMD; 6) legacy PCI device assignment is deprecated and
 defaults to "n" in Kconfig; VFIO replaces it.  On top of this there are
 also bug fixes and eager FPU context loading for FPU-heavy guests.
 
 * Common code: Support for multiple address spaces; for now it is
 used only for x86 SMM but the s390 folks also have plans.
 
 There are some x86 conflicts, one with the rc8 pull request and
 the rest with Ingo's FPU rework.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJViYzhAAoJEL/70l94x66Dda0H/1IepMbfEy+o849d5G71fNTs
 F8Y8qUP2GZuL7T53FyFUGSBw+AX7kimu9ia4gR/PmDK+QYsdosYeEjwlsolZfTBf
 sHuzNtPoJhi5o1o/ur4NGameo0WjGK8f1xyzr+U8z74QDQyQv/QYCdK/4isp4BJL
 ugHNHkuROX6Zng4i7jc9rfaSRg29I3GBxQUYpMkEnD3eMYMUBWGm6Rs8pHgGAMvL
 vqzntgW00WNxehTqcAkmD/Wv+txxhkvIadZnjgaxH49e9JeXeBKTIR5vtb7Hns3s
 SuapZUyw+c95DIipXq4EznxxaOrjbebOeFgLCJo8+XMXZum8RZf/ob24KroYad0=
 =YsAR
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull first batch of KVM updates from Paolo Bonzini:
 "The bulk of the changes here is for x86.  And for once it's not for
  silicon that no one owns: these are really new features for everyone.

  Details:

   - ARM:
        several features are in progress but missed the 4.2 deadline.
        So here is just a smattering of bug fixes, plus enabling the
        VFIO integration.

   - s390:
        Some fixes/refactorings/optimizations, plus support for 2GB
        pages.

   - x86:
        * host and guest support for marking kvmclock as a stable
          scheduler clock.
        * support for write combining.
        * support for system management mode, needed for secure boot in
          guests.
        * a bunch of cleanups required for the above
        * support for virtualized performance counters on AMD
        * legacy PCI device assignment is deprecated and defaults to "n"
          in Kconfig; VFIO replaces it

        On top of this there are also bug fixes and eager FPU context
        loading for FPU-heavy guests.

   - Common code:
        Support for multiple address spaces; for now it is used only for
        x86 SMM but the s390 folks also have plans"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  KVM: s390: clear floating interrupt bitmap and parameters
  KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
  KVM: x86/vPMU: Implement AMD vPMU code for KVM
  KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
  KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
  KVM: x86/vPMU: reorder PMU functions
  KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
  KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
  KVM: x86/vPMU: introduce pmu.h header
  KVM: x86/vPMU: rename a few PMU functions
  KVM: MTRR: do not map huge page for non-consistent range
  KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
  KVM: MTRR: introduce mtrr_for_each_mem_type
  KVM: MTRR: introduce fixed_mtrr_addr_* functions
  KVM: MTRR: sort variable MTRRs
  KVM: MTRR: introduce var_mtrr_range
  KVM: MTRR: introduce fixed_mtrr_segment table
  KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
  KVM: MTRR: do not split 64 bits MSR content
  KVM: MTRR: clean up mtrr default type
  ...
2015-06-24 09:36:49 -07:00
Linus Torvalds
08d183e3c1 powerpc updates for 4.2
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
    toolchain.
  - EEH fixes from Gavin & Richard.
  - Enable the sys_kcmp syscall from Laurent.
  - Sysfs control for fastsleep workaround from Shreyas.
  - Expose OPAL events as an irq chip by Alistair.
  - MSI ops moved to pci_controller_ops by Daniel.
  - Fix for kernel to userspace backtraces for perf from Anton.
  - Merge pseries and pseries_le defconfigs from Cyril.
  - CXL in-kernel API from Mikey.
  - OPAL prd driver from Jeremy.
  - Fix for DSCR handling & tests from Anshuman.
  - Powernv flash mtd driver from Cyril.
  - Dynamic DMA Window support on powernv from Alexey.
  - LLVM clang fixes & workarounds from Anton.
  - Reworked version of the patch to abort syscalls when transactional.
  - Fix the swap encoding to support 4TB, from Aneesh.
  - Various fixes as usual.
  - Freescale updates from Scott: Highlights include more 8xx optimizations, an
    e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
    various fixes and cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
 vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
 uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
 Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
 RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
 r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
 lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
 9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
 oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
 PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
 Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
 WID2bc5yVtIL6p6x5ICH
 =d9Wh
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - disable the 32-bit vdso when building LE, so we can build with a
   64-bit only toolchain.

 - EEH fixes from Gavin & Richard.

 - enable the sys_kcmp syscall from Laurent.

 - sysfs control for fastsleep workaround from Shreyas.

 - expose OPAL events as an irq chip by Alistair.

 - MSI ops moved to pci_controller_ops by Daniel.

 - fix for kernel to userspace backtraces for perf from Anton.

 - merge pseries and pseries_le defconfigs from Cyril.

 - CXL in-kernel API from Mikey.

 - OPAL prd driver from Jeremy.

 - fix for DSCR handling & tests from Anshuman.

 - Powernv flash mtd driver from Cyril.

 - dynamic DMA Window support on powernv from Alexey.

 - LLVM clang fixes & workarounds from Anton.

 - reworked version of the patch to abort syscalls when transactional.

 - fix the swap encoding to support 4TB, from Aneesh.

 - various fixes as usual.

 - Freescale updates from Scott: Highlights include more 8xx
   optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
   t1024/t1023 support, and various fixes and cleanup.

* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
  cxl: Fix typo in debug print
  cxl: Add CXL_KERNEL_API config option
  powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
  powerpc/mm: Change the swap encoding in pte.
  powerpc/mm: PTE_RPN_MAX is not used, remove the same
  powerpc/tm: Abort syscalls in active transactions
  powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
  powerpc/include: Add opal-prd to installed uapi headers
  powerpc/powernv: fix construction of opal PRD messages
  powerpc/powernv: Increase opal-irqchip initcall priority
  powerpc: Make doorbell check preemption safe
  powerpc/powernv: pnv_init_idle_states() should only run on powernv
  macintosh/nvram: Remove as unused
  powerpc: Don't use gcc specific options on clang
  powerpc: Don't use -mno-strict-align on clang
  powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
  powerpc: Only use -mabi=altivec if toolchain supports it
  powerpc: Fix duplicate const clang warning in user access code
  vfio: powerpc/spapr: Support Dynamic DMA windows
  vfio: powerpc/spapr: Register memory and define IOMMU v2
  ...
2015-06-24 08:46:32 -07:00
Anshuman Khandual
1db365258a powerpc/kernel: Rename PACA_DSCR to PACA_DSCR_DEFAULT
PACA_DSCR offset macro tracks dscr_default element in the paca
structure. Better change the name of this macro to match that of the
data element it tracks. Makes the code more readable.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-07 19:29:00 +10:00
Paolo Bonzini
f36f3f2846 KVM: add "new" argument to kvm_arch_commit_memory_region
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot.  It will simplify the code when more
than one address space will be supported.

Unfortunately, the "const"ness of the new argument must be casted
away in two places.  Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-28 10:42:58 +02:00
Paolo Bonzini
09170a4942 KVM: const-ify uses of struct kvm_userspace_memory_region
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents.  Add const to
enforce this.

In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-26 12:40:13 +02:00
Paolo Bonzini
9f6b802978 KVM: use kvm_memslots whenever possible
kvm_memslots provides lockdep checking.  Use it consistently instead of
explicit dereferencing of kvm->memslots.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-26 12:40:08 +02:00
Paolo Bonzini
a9b4fb7e79 Merge branch 'kvm-master' into kvm-next
Grab MPX bugfix, and fix conflicts against Rik's adaptive FPU
deactivation patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:31:37 +02:00
Paul Mackerras
17d489019c KVM: PPC: Book3S HV: Fix list traversal in error case
This fixes a regression introduced in commit 25fedfca94, "KVM: PPC:
Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu", which
leads to a user-triggerable oops.

In the case where we try to run a vcore on a physical core that is
not in single-threaded mode, or the vcore has too many threads for
the physical core, we iterate the list of runnable vcpus to make
each one return an EBUSY error to userspace.  Since this involves
taking each vcpu off the runnable_threads list for the vcore, we
need to use list_for_each_entry_safe rather than list_for_each_entry
to traverse the list.  Otherwise the kernel will crash with an oops
message like this:

Unable to handle kernel paging request for data at address 0x000fff88
Faulting instruction address: 0xd00000001e635dc8
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=1024 NUMA PowerNV
...
CPU: 48 PID: 91256 Comm: qemu-system-ppc Tainted: G      D        3.18.0 #1
task: c00000274e507500 ti: c0000027d1924000 task.ti: c0000027d1924000
NIP: d00000001e635dc8 LR: d00000001e635df8 CTR: c00000000011ba50
REGS: c0000027d19275b0 TRAP: 0300   Tainted: G      D         (3.18.0)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 22002824  XER: 00000000
CFAR: c000000000008468 DAR: 00000000000fff88 DSISR: 40000000 SOFTE: 1
GPR00: d00000001e635df8 c0000027d1927830 d00000001e64c850 0000000000000001
GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
GPR08: 0000000000200200 0000000000000000 0000000000000000 d00000001e63e588
GPR12: 0000000000002200 c000000007dbc800 c000000fc7800000 000000000000000a
GPR16: fffffffffffffffc c000000fd5439690 c000000fc7801c98 0000000000000001
GPR20: 0000000000000003 c0000027d1927aa8 c000000fd543b348 c000000fd543b350
GPR24: 0000000000000000 c000000fa57f0000 0000000000000030 0000000000000000
GPR28: fffffffffffffff0 c000000fd543b328 00000000000fe468 c000000fd543b300
NIP [d00000001e635dc8] kvmppc_run_core+0x198/0x17c0 [kvm_hv]
LR [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv]
Call Trace:
[c0000027d1927830] [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] (unreliable)
[c0000027d1927a30] [d00000001e638350] kvmppc_vcpu_run_hv+0x5b0/0xdd0 [kvm_hv]
[c0000027d1927b70] [d00000001e510504] kvmppc_vcpu_run+0x44/0x60 [kvm]
[c0000027d1927ba0] [d00000001e50d4a4] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
[c0000027d1927be0] [d00000001e504be8] kvm_vcpu_ioctl+0x5e8/0x7a0 [kvm]
[c0000027d1927d40] [c0000000002d6720] do_vfs_ioctl+0x490/0x780
[c0000027d1927de0] [c0000000002d6ae4] SyS_ioctl+0xd4/0xf0
[c0000027d1927e30] [c000000000009358] syscall_exit+0x0/0x98
Instruction dump:
60000000 60420000 387e1b30 38800003 38a00001 38c00000 480087d9 e8410018
ebde1c98 7fbdf040 3bdee368 419e0048 <813e1b20> 939e1b18 2f890001 409effcc
---[ end trace 8cdf50251cca6680 ]---

Fixes: 25fedfca94
Signed-off-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-10 17:26:05 +02:00
Paolo Bonzini
e233d54d4d KVM: booke: use __kvm_guest_exit
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:28:22 +02:00
Christian Borntraeger
ccf73aaf5a KVM: arm/mips/x86/power use __kvm_guest_{enter|exit}
Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
where interrupts are disabled.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:28:22 +02:00
Michael Ellerman
433c5c20c5 powerpc/kvm: Fix SMP=n build error in book3s_xics.c
Commit 34cb7954c0 "Convert ICS mutex lock to spin lock" added an
include of asm/spinlock.h, which does not work in the SMP=n case.

It should instead include linux/spinlock.h

Fixes: 34cb7954c0 ("KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock")
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-29 08:06:32 +10:00
Linus Torvalds
63905bba5b powerpc fixes for 4.1
- Fix for mm_dec_nr_pmds() from Scott.
 - Fixes for oopses seen with KVM + THP from Aneesh.
 - Build fixes from Aneesh & Shreyas.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVOsk5AAoJEFHr6jzI4aWACWMP/3EaNoeA1g8VbZWZEdRaoLvX
 W7D08DI3Dt8HLxyn2JR08jYZF0gr68XrF6OiscVVki7wVXT8fbH4jSNmBkbzNH95
 d9taScJyR1CUavkhsXivnR1qEE1Fi2KA2OW9RaNfoSt1MVtdsvOK6xXklUGksuJQ
 XygzyrRr4Dj82kuMUAMO0YDMvknMlzi3a8dzyrWZBXBZOOTWavGB6bQKtCTaOQ99
 3OFGLQ10uY7lmdHDi0t0tQ99FuYfLiJpg5fTLoUni4J5tFp8JlZ+x0Gwc0apN0cy
 Ym8EO6++qWDv8FXvYEPfVUEjbF1fyPiawUgpkMnyvXgd8K5G85SIrtkGW0Ml+6sX
 GfJH8w9hpDbF5EnWlC9bn/jT7sHBHFdrxZuQUc0L4M2OtM73R2a0Xr3b7ZxFCD1q
 7RpYu8MKKcyvaIXNg7VBJjj8zL+WmUJKF6J5uX5bGU2xH0khmp0vTknyyjbwrlcF
 uHidv5ZhMt3aAI70v14jA5BTEmLyOYRu58Ei6cT/VT/DjdbpEApdK8BMAvKSEeib
 +hzh6oDFT92AM0tbg15bNmqGbGfgqtVKe4GDS2QyGaHGAFOGs1nPuSa9se1xYDcM
 CCtRyABwpzJsrCfwra2fsTU6FxlatK4ONViyWFBXa6mEjBNSZ4XmyZvdWUqlwpSC
 F5jNGppm5Ama6xxcLphA
 =6yQx
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc fixes from Michael Ellerman:

 - fix for mm_dec_nr_pmds() from Scott.

 - fixes for oopses seen with KVM + THP from Aneesh.

 - build fixes from Aneesh & Shreyas.

* tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
  powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabled
  powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error
  powerpc/mm/thp: Return pte address if we find trans_splitting.
  powerpc/mm/thp: Make page table walk safe against thp split/collapse
  KVM: PPC: Remove page table walk helpers
  KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer
  powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()
2015-04-26 13:23:15 -07:00
Linus Torvalds
eadf16a912 This mostly includes the PPC changes for 4.1, which this time cover
Book3S HV only (debugging aids, minor performance improvements and some
 cleanups).  But there are also bug fixes and small cleanups for ARM,
 x86 and s390.
 
 The task_migration_notifier revert and real fix is still pending review,
 but I'll send it as soon as possible after -rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJVOONLAAoJEL/70l94x66DbsMIAIpZPsaqgXOC1sDEiZuYay+6
 rD4n4id7j8hIAzcf3AlZdyf5XgLlr6I1Zyt62s1WcoRq/CCnL7k9EljzSmw31WFX
 P2y7/J0iBdkn0et+PpoNThfL2GsgTqNRCLOOQlKgEQwMP9Dlw5fnUbtC1UchOzTg
 eAMeBIpYwufkWkXhdMw4PAD4lJ9WxUZ1eXHEBRzJb0o0ZxAATJ1tPZGrFJzoUOSM
 WsVNTuBsNd7upT02kQdvA1TUo/OPjseTOEoksHHwfcORt6bc5qvpctL3jYfcr7sk
 /L6sIhYGVNkjkuredjlKGLfT2DDJjSEdJb1k2pWrDRsY76dmottQubAE9J9cDTk=
 =OAi2
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second batch of KVM changes from Paolo Bonzini:
 "This mostly includes the PPC changes for 4.1, which this time cover
  Book3S HV only (debugging aids, minor performance improvements and
  some cleanups).  But there are also bug fixes and small cleanups for
  ARM, x86 and s390.

  The task_migration_notifier revert and real fix is still pending
  review, but I'll send it as soon as possible after -rc1"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
  KVM: arm/arm64: check IRQ number on userland injection
  KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi
  KVM: VMX: Preserve host CR4.MCE value while in guest mode.
  KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
  KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C
  KVM: PPC: Book3S HV: Streamline guest entry and exit
  KVM: PPC: Book3S HV: Use bitmap of active threads rather than count
  KVM: PPC: Book3S HV: Use decrementer to wake napping threads
  KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI
  KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken
  KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu
  KVM: PPC: Book3S HV: Minor cleanups
  KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update
  KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
  KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT
  KVM: PPC: Book3S HV: Add ICP real mode counters
  KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode
  KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock
  KVM: PPC: Book3S HV: Add guest->host real mode completion counters
  KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte
  ...
2015-04-26 13:06:22 -07:00
Paul Mackerras
66feed61cd KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller.  This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.

Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:

 rm_entry:     3387.6ns (228 - 86600, 1008969 samples)
  rm_exit:     4561.5ns (12 - 3477452, 1009402 samples)
  rm_intr:     1660.0ns (12 - 553050, 3600051 samples)

and this after the change:

 rm_entry:     3060.1ns (212 - 65138, 953873 samples)
  rm_exit:     4244.1ns (12 - 9693408, 954331 samples)
  rm_intr:     1342.3ns (12 - 1104718, 3405326 samples)

for a test of booting Fedora 20 big-endian to the login prompt.

The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.

The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:34 +02:00
Paul Mackerras
eddb60fb14 KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C
This replaces the assembler code for kvmhv_commence_exit() with C code
in book3s_hv_builtin.c.  It also moves the IPI sending code that was
in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it
can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:34 +02:00
Paul Mackerras
6af27c847a KVM: PPC: Book3S HV: Streamline guest entry and exit
On entry to the guest, secondary threads now wait for the primary to
switch the MMU after loading up most of their state, rather than before.
This means that the secondary threads get into the guest sooner, in the
common case where the secondary threads get to kvmppc_hv_entry before
the primary thread.

On exit, the first thread out increments the exit count and interrupts
the other threads (to get them out of the guest) before saving most
of its state, rather than after.  That means that the other threads
exit sooner and means that the first thread doesn't spend so much
time waiting for the other threads at the point where the MMU gets
switched back to the host.

This pulls out the code that increments the exit count and interrupts
other threads into a separate function, kvmhv_commence_exit().
This also makes sure that r12 and vcpu->arch.trap are set correctly
in some corner cases.

Statistics from /sys/kernel/debug/kvm/vm*/vcpu*/timings show the
improvement.  Aggregating across vcpus for a guest with 32 vcpus,
8 threads/vcore, running on a POWER8, gives this before the change:

 rm_entry:     avg 4537.3ns (222 - 48444, 1068878 samples)
  rm_exit:     avg 4787.6ns (152 - 165490, 1010717 samples)
  rm_intr:     avg 1673.6ns (12 - 341304, 3818691 samples)

and this after the change:

 rm_entry:     avg 3427.7ns (232 - 68150, 1118921 samples)
  rm_exit:     avg 4716.0ns (12 - 150720, 1119477 samples)
  rm_intr:     avg 1614.8ns (12 - 522436, 3850432 samples)

showing a substantial reduction in the time spent per guest entry in
the real-mode guest entry code, and smaller reductions in the real
mode guest exit and interrupt handling times.  (The test was to start
the guest and boot Fedora 20 big-endian to the login prompt.)

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:33 +02:00
Paul Mackerras
7d6c40da19 KVM: PPC: Book3S HV: Use bitmap of active threads rather than count
Currently, the entry_exit_count field in the kvmppc_vcore struct
contains two 8-bit counts, one of the threads that have started entering
the guest, and one of the threads that have started exiting the guest.
This changes it to an entry_exit_map field which contains two bitmaps
of 8 bits each.  The advantage of doing this is that it gives us a
bitmap of which threads need to be signalled when exiting the guest.
That means that we no longer need to use the trick of setting the
HDEC to 0 to pull the other threads out of the guest, which led in
some cases to a spurious HDEC interrupt on the next guest entry.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:33 +02:00
Paul Mackerras
fd6d53b124 KVM: PPC: Book3S HV: Use decrementer to wake napping threads
This arranges for threads that are napping due to their vcpu having
ceded or due to not having a vcpu to wake up at the end of the guest's
timeslice without having to be poked with an IPI.  We do that by
arranging for the decrementer to contain a value no greater than the
number of timebase ticks remaining until the end of the timeslice.
In the case of a thread with no vcpu, this number is in the hypervisor
decrementer already.  In the case of a ceded vcpu, we use the smaller
of the HDEC value and the DEC value.

Using the DEC like this when ceded means we need to save and restore
the guest decrementer value around the nap.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:33 +02:00
Paul Mackerras
ccc07772c9 KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI
When running a multi-threaded guest and vcpu 0 in a virtual core
is not running in the guest (i.e. it is busy elsewhere in the host),
thread 0 of the physical core will switch the MMU to the guest and
then go to nap mode in the code at kvm_do_nap.  If the guest sends
an IPI to thread 0 using the msgsndp instruction, that will wake
up thread 0 and cause all the threads in the guest to exit to the
host unnecessarily.  To avoid the unnecessary exit, this arranges
for the PECEDP bit to be cleared in this situation.  When napping
due to a H_CEDE from the guest, we still set PECEDP so that the
thread will wake up on an IPI sent using msgsndp.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:32 +02:00
Paul Mackerras
5d5b99cd68 KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken
We can tell when a secondary thread has finished running a guest by
the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
is no real need for the nap_count field in the kvmppc_vcore struct.
This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
pointers of the secondary threads rather than polling vc->nap_count.
Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
this also means that we can tell which secondary threads have got
stuck and thus print a more informative error message.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:32 +02:00
Paul Mackerras
25fedfca94 KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu
Rather than calling cond_resched() in kvmppc_run_core() before doing
the post-processing for the vcpus that we have just run (that is,
calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do
that post-processing before calling cond_resched(), and that post-
processing is moved out into its own function, post_guest_process().

The reschedule point is now in kvmppc_run_vcpu() and we define a new
vcore state, VCORE_PREEMPT, to indicate that that the vcore's runner
task is runnable but not running.  (Doing the reschedule with the
vcore in VCORE_INACTIVE state would be bad because there are potentially
other vcpus waiting for the runner in kvmppc_wait_for_exec() which
then wouldn't get woken up.)

Also, we make use of the handy cond_resched_lock() function, which
unlocks and relocks vc->lock for us around the reschedule.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:32 +02:00
Paul Mackerras
1f09c3ed86 KVM: PPC: Book3S HV: Minor cleanups
* Remove unused kvmppc_vcore::n_busy field.
* Remove setting of RMOR, since it was only used on PPC970 and the
  PPC970 KVM support has been removed.
* Don't use r1 or r2 in setting the runlatch since they are
  conventionally reserved for other things; use r0 instead.
* Streamline the code a little and remove the ext_interrupt_to_host
  label.
* Add some comments about register usage.
* hcall_try_real_mode doesn't need to be global, and can't be
  called from C code anyway.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:32 +02:00
Paul Mackerras
d911f0bedd KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update
Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
update (i.e. one of its 3 virtual processor areas needed to be pinned
in memory so the host real mode code can update it on guest entry and
exit), we would drop the vcore lock and do the update there and then.
Future changes will make it inconvenient to drop the lock, so instead
we now remove it from the list of runnable VCPUs and wake up its
VCPU task.  This will have the effect that the VCPU task will exit
kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
to kvmppc_update_vpas() and then rejoin the vcore.

The one complication is that the runner VCPU (whose VCPU task is the
current task) might be one of the ones that gets removed from the
runnable list.  In that case we just return from kvmppc_run_core()
and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
the runner if necessary.

This all means that the VCORE_STARTING state is no longer used, so we
remove it.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:31 +02:00
Paul Mackerras
b6c295df31 KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code.  Currently these
times are accumulated per vcpu in 5 parts of the code:

* rm_entry - time taken from the start of kvmppc_hv_entry() until
  just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
  guest until we either re-enter the guest or decide to exit to the
  host.  This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
  return from kvmppc_hv_entry().
* guest - time spend in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
  while other threads in the same vcore are active.

These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings".  This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.

The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:31 +02:00
Paul Mackerras
e23a808b16 KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT
This creates a debugfs directory for each HV guest (assuming debugfs
is enabled in the kernel config), and within that directory, a file
by which the contents of the guest's HPT (hashed page table) can be
read.  The directory is named vmnnnn, where nnnn is the PID of the
process that created the guest.  The file is named "htab".  This is
intended to help in debugging problems in the host's management
of guest memory.

The contents of the file consist of a series of lines like this:

  3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196

The first field is the index of the entry in the HPT, the second and
third are the HPT entry, so the third entry contains the real page
number that is mapped by the entry if the entry's valid bit is set.
The fourth field is the guest's view of the second doubleword of the
entry, so it contains the guest physical address.  (The format of the
second through fourth fields are described in the Power ISA and also
in arch/powerpc/include/asm/mmu-hash64.h.)

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:31 +02:00
Suresh Warrier
6e0365b782 KVM: PPC: Book3S HV: Add ICP real mode counters
Add two counters to count how often we generate real-mode ICS resend
and reject events. The counters provide some performance statistics
that could be used in the future to consider if the real mode functions
need further optimizing. The counters are displayed as part of IPC and
ICP state provided by /sys/debug/kernel/powerpc/kvm* for each VM.

Also added two counters that count (approximately) how many times we
don't find an ICP or ICS we're looking for. These are not currently
exposed through sysfs, but can be useful when debugging crashes.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:30 +02:00
Suresh Warrier
b0221556db KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode
Interrupt-based hypercalls return H_TOO_HARD to inform KVM that it needs
to switch to the host to complete the rest of hypercall function in
virtual mode. This patch ports the virtual mode ICS/ICP reject and resend
functions to be runnable in hypervisor real mode, thus avoiding the need
to switch to the host to execute these functions in virtual mode. However,
the hypercalls continue to return H_TOO_HARD for vcpu_wakeup and notify
events - these events cannot be done in real mode and they will still need
a switch to host virtual mode.

There are sufficient differences between the real mode code and the
virtual mode code for the ICS/ICP resend and reject functions that
for now the code has been duplicated instead of sharing common code.
In the future, we can look at creating common functions.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:30 +02:00
Suresh Warrier
34cb7954c0 KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock
Replaces the ICS mutex lock with a spin lock since we will be porting
these routines to real mode. Note that we need to disable interrupts
before we take the lock in anticipation of the fact that on the guest
side, we are running in the context of a hard irq and interrupts are
disabled (EE bit off) when the lock is acquired. Again, because we
will be acquiring the lock in hypervisor real mode, we need to use
an arch_spinlock_t instead of a normal spinlock here as we want to
avoid running any lockdep code (which may not be safe to execute in
real mode).

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:30 +02:00
Suresh E. Warrier
878610fe98 KVM: PPC: Book3S HV: Add guest->host real mode completion counters
Add counters to track number of times we switch from guest real mode
to host virtual mode during an interrupt-related hyper call because the
hypercall requires actions that cannot be completed in real mode. This
will help when making optimizations that reduce guest-host transitions.

It is safe to use an ordinary increment rather than an atomic operation
because there is one ICP per virtual CPU and kvmppc_xics_rm_complete()
only works on the ICP for the current VCPU.

The counters are displayed as part of IPC and ICP state provided by
/sys/debug/kernel/powerpc/kvm* for each VM.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:30 +02:00
Aneesh Kumar K.V
a4bd6eb07c KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte
This adds helper routines for locking and unlocking HPTEs, and uses
them in the rest of the code.  We don't change any locking rules in
this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:29 +02:00
Aneesh Kumar K.V
31037ecad2 KVM: PPC: Book3S HV: Remove RMA-related variables from code
We don't support real-mode areas now that 970 support is removed.
Remove the remaining details of rma from the code.  Also rename
rma_setup_done to hpte_setup_done to better reflect the changes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:29 +02:00
Michael Ellerman
e928e9cb36 KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation.
Some PowerNV systems include a hardware random-number generator.
This HWRNG is present on POWER7+ and POWER8 chips and is capable of
generating one 64-bit random number every microsecond.  The random
numbers are produced by sampling a set of 64 unstable high-frequency
oscillators and are almost completely entropic.

PAPR defines an H_RANDOM hypercall which guests can use to obtain one
64-bit random sample from the HWRNG.  This adds a real-mode
implementation of the H_RANDOM hypercall.  This hypercall was
implemented in real mode because the latency of reading the HWRNG is
generally small compared to the latency of a guest exit and entry for
all the threads in the same virtual core.

Userspace can detect the presence of the HWRNG and the H_RANDOM
implementation by querying the KVM_CAP_PPC_HWRNG capability.  The
H_RANDOM hypercall implementation will only be invoked when the guest
does an H_RANDOM hypercall if userspace first enables the in-kernel
H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:29 +02:00
David Gibson
99342cf804 kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM
On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.

This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode).  Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent.  This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.

To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address).  SLOF uses these for IO.

However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself.  The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled.  The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.

This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.

Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[agraf: fix compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-04-21 15:21:28 +02:00
Shreyas B. Prabhu
a7e73e711e powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error
kvm_no_guest() calls power7_wakeup_loss() to put the thread into the
deepest supported idle state. power7_wakeup_loss() is defined in
arch/powerpc/kernel/idle_power7.S, which is compiled only when
PPC_P7_NAP=y.

And PPC_P7_NAP is selected when PPC_POWERNV=y.

Hence in cases where PPC_POWERNV=n and KVM_BOOK3S_64_HV=y we see the
following error:

  arch/powerpc/kvm/built-in.o: In function `kvm_no_guest':
  arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x42c): undefined reference to `power7_wakeup_loss'

Fix this by adding PPC_POWERNV as a dependency for KVM_BOOK3S_64_HV.

Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17 11:57:36 +10:00
Aneesh Kumar K.V
7d6e7f7ffa powerpc/mm/thp: Return pte address if we find trans_splitting.
For THP that is marked trans splitting, we return the pte.
This require the callers to handle the pmd_trans_splitting scenario,
if they care. All the current callers are either looking at pfn or
write_ok, hence we don't need to update them.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17 11:23:40 +10:00
Aneesh Kumar K.V
691e95fd73 powerpc/mm/thp: Make page table walk safe against thp split/collapse
We can disable a THP split or a hugepage collapse by disabling irq.
We do send IPI to all the cpus in the early part of split/collapse,
and disabling local irq ensure we don't make progress with
split/collapse. If the THP is getting split we return NULL from
find_linux_pte_or_hugepte(). For all the current callers it should be ok.
We need to be careful if we want to use returned pte_t pointer outside
the irq disabled region. W.r.t to THP split, the pfn remains the same,
but then a hugepage collapse will result in a pfn change. There are
few steps we can take to avoid a hugepage collapse.One way is to take page
reference inside the irq disable region. Other option is to take
mmap_sem so that a parallel collapse will not happen. We can also
disable collapse by taking pmd_lock. Another method used by kvm
subsystem is to check whether we had a mmu_notifer update in between
using mmu_notifier_retry().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17 11:23:39 +10:00
Aneesh Kumar K.V
dac5657067 KVM: PPC: Remove page table walk helpers
This patch remove helpers which we had used only once in the code.
Limiting page table walk variants help in ensuring that we won't
end up with code walking page table with wrong assumptions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17 11:23:24 +10:00
Aneesh Kumar K.V
5e1d44aef1 KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer
pte can get updated from other CPUs as part of multiple activities
like THP split, huge page collapse, unmap. We need to make sure we
don't reload the pte value again and again for different checks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-17 11:23:24 +10:00
Linus Torvalds
d19d5efd8c powerpc updates for 4.1
- Numerous minor fixes, cleanups etc.
 - More EEH work from Gavin to remove its dependency on device_nodes.
 - Memory hotplug implemented entirely in the kernel from Nathan Fontenot.
 - Removal of redundant CONFIG_PPC_OF by Kevin Hao.
 - Rewrite of VPHN parsing logic & tests from Greg Kurz.
 - A fix from Nish Aravamudan to reduce memory usage by clamping
   nodes_possible_map.
 - Support for pstore on powernv from Hari Bathini.
 - Removal of old powerpc specific byte swap routines by David Gibson.
 - Fix from Vasant Hegde to prevent the flash driver telling you it was flashing
   your firmware when it wasn't.
 - Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
 - Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan Stancek.
 - Some fixes for migration from Tyrel Datwyler.
 - A new syscall to switch the cpu endian by Michael Ellerman.
 - Large series from Wei Yang to implement SRIOV, reviewed and acked by Bjorn.
 - A fix for the OPAL sensor driver from Cédric Le Goater.
 - Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
 - Large series from Daniel Axtens to make our PCI hooks per PHB rather than per
   machine.
 - Small patch from Sam Bobroff to explicitly abort non-suspended transactions
   on syscalls, plus a test to exercise it.
 - Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
 - Small patch to enable the hard lockup detector from Anton Blanchard.
 - Fix from Dave Olson for missing L2 cache information on some CPUs.
 - Some fixes from Michael Ellerman to get Cell machines booting again.
 - Freescale updates from Scott: Highlights include BMan device tree nodes, an
   MSI erratum workaround, a couple minor performance improvements, config
   updates, and misc fixes/cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVL2cxAAoJEFHr6jzI4aWAR8cP/19VTo/CzCE4ffPSx7qR464n
 F+WFZcbNjIMXu6+B0YLuJZEsuWtKKrCit/MCg3+mSgE4iqvxmtI+HDD0445Buszj
 UD4E4HMdPrXQ+KUSUDORvRjv/FFUXIa94LSv/0g2UeMsPz/HeZlhMxEu7AkXw9Nf
 rTxsmRTsOWME85Y/c9ss7XHuWKXT3DJV7fOoK9roSaN3dJAuWTtG3WaKS0nUu0ok
 0M81D6ZczoD6ybwh2DUMPD9K6SGxLdQ4OzQwtW6vWzcQIBDfy5Pdeo0iAFhGPvXf
 T4LLPkv4cF4AwHsAC4rKDPHQNa+oZBoLlScrHClaebAlDiv+XYKNdMogawUObvSh
 h7avKmQr0Ygp1OvvZAaXLhuDJI9FJJ8lf6AOIeULgHsDR9SyKMjZWxRzPe11uarO
 Fyi0qj3oJaQu6LjazZraApu8mo+JBtQuD3z3o5GhLxeFtBBF60JXj6zAXJikufnl
 kk1/BUF10nKUhtKcDX767AMUCtMH3fp5hx8K/z9T5v+pobJB26Wup1bbdT68pNBT
 NjdKUppV6QTjZvCsA6U2/ECu6E9KeIaFtFSL2IRRoiI0dWBN5/5eYn3RGkO2ZFoL
 1NdwKA2XJcchwTPkpSRrUG70sYH0uM2AldNYyaLfjzrQqza7Y6lF699ilxWmCN/H
 OplzJAE5cQ8Am078veTW
 =03Yh
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - Numerous minor fixes, cleanups etc.

 - More EEH work from Gavin to remove its dependency on device_nodes.

 - Memory hotplug implemented entirely in the kernel from Nathan
   Fontenot.

 - Removal of redundant CONFIG_PPC_OF by Kevin Hao.

 - Rewrite of VPHN parsing logic & tests from Greg Kurz.

 - A fix from Nish Aravamudan to reduce memory usage by clamping
   nodes_possible_map.

 - Support for pstore on powernv from Hari Bathini.

 - Removal of old powerpc specific byte swap routines by David Gibson.

 - Fix from Vasant Hegde to prevent the flash driver telling you it was
   flashing your firmware when it wasn't.

 - Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.

 - Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan
   Stancek.

 - Some fixes for migration from Tyrel Datwyler.

 - A new syscall to switch the cpu endian by Michael Ellerman.

 - Large series from Wei Yang to implement SRIOV, reviewed and acked by
   Bjorn.

 - A fix for the OPAL sensor driver from Cédric Le Goater.

 - Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.

 - Large series from Daniel Axtens to make our PCI hooks per PHB rather
   than per machine.

 - Small patch from Sam Bobroff to explicitly abort non-suspended
   transactions on syscalls, plus a test to exercise it.

 - Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.

 - Small patch to enable the hard lockup detector from Anton Blanchard.

 - Fix from Dave Olson for missing L2 cache information on some CPUs.

 - Some fixes from Michael Ellerman to get Cell machines booting again.

 - Freescale updates from Scott: Highlights include BMan device tree
   nodes, an MSI erratum workaround, a couple minor performance
   improvements, config updates, and misc fixes/cleanup.

* tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (196 commits)
  powerpc/powermac: Fix build error seen with powermac smp builds
  powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE
  powerpc: Remove PPC32 code from pseries specific find_and_init_phbs()
  powerpc/cell: Fix iommu breakage caused by controller_ops change
  powerpc/eeh: Fix crash in eeh_add_device_early() on Cell
  powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
  powerpc/perf/hv-24x7: Fail 24x7 initcall if create_events_from_catalog() fails
  powerpc/pseries: Correct memory hotplug locking
  powerpc: Fix missing L2 cache size in /sys/devices/system/cpu
  powerpc: Add ppc64 hard lockup detector support
  oprofile: Disable oprofile NMI timer on ppc64
  powerpc/perf/hv-24x7: Add missing put_cpu_var()
  powerpc/perf/hv-24x7: Break up single_24x7_request
  powerpc/perf/hv-24x7: Define update_event_count()
  powerpc/perf/hv-24x7: Whitespace cleanup
  powerpc/perf/hv-24x7: Define add_event_to_24x7_request()
  powerpc/perf/hv-24x7: Rename hv_24x7_event_update
  powerpc/perf/hv-24x7: Move debug prints to separate function
  powerpc/perf/hv-24x7: Drop event_24x7_request()
  powerpc/perf/hv-24x7: Use pr_devel() to log message
  ...

Conflicts:
	tools/testing/selftests/powerpc/Makefile
	tools/testing/selftests/powerpc/tm/Makefile
2015-04-16 13:53:32 -05:00
Linus Torvalds
9003601310 The most interesting bit here is irqfd/ioeventfd support for ARM and ARM64.
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
 vhost, too), page aging
 
 s390: interrupt handling rework, allowing to inject all local interrupts
 via new ioctl and to get/set the full local irq state for migration
 and introspection.  New ioctls to access memory by virtual address,
 and to get/set the guest storage keys.  SIMD support.
 
 MIPS: FPU and MIPS SIMD Architecture (MSA) support.  Includes some patches
 from Ralf Baechle's MIPS tree.
 
 x86: bugfixes (notably for pvclock, the others are small) and cleanups.
 Another small latency improvement for the TSC deadline timer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
 k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
 Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
 wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
 RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
 KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
 =yu2b
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.1

  The most interesting bit here is irqfd/ioeventfd support for ARM and
  ARM64.

  Summary:

  ARM/ARM64:
     fixes for live migration, irqfd and ioeventfd support (enabling
     vhost, too), page aging

  s390:
     interrupt handling rework, allowing to inject all local interrupts
     via new ioctl and to get/set the full local irq state for migration
     and introspection.  New ioctls to access memory by virtual address,
     and to get/set the guest storage keys.  SIMD support.

  MIPS:
     FPU and MIPS SIMD Architecture (MSA) support.  Includes some
     patches from Ralf Baechle's MIPS tree.

  x86:
     bugfixes (notably for pvclock, the others are small) and cleanups.
     Another small latency improvement for the TSC deadline timer"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
  KVM: use slowpath for cross page cached accesses
  kvm: mmu: lazy collapse small sptes into large sptes
  KVM: x86: Clear CR2 on VCPU reset
  KVM: x86: DR0-DR3 are not clear on reset
  KVM: x86: BSP in MSR_IA32_APICBASE is writable
  KVM: x86: simplify kvm_apic_map
  KVM: x86: avoid logical_map when it is invalid
  KVM: x86: fix mixed APIC mode broadcast
  KVM: x86: use MDA for interrupt matching
  kvm/ppc/mpic: drop unused IRQ_testbit
  KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
  KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
  KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
  KVM: vmx: pass error code with internal error #2
  x86: vdso: fix pvclock races with task migration
  KVM: remove kvm_read_hva and kvm_read_hva_atomic
  KVM: x86: optimize delivery of TSC deadline timer interrupt
  KVM: x86: extract blocking logic from __vcpu_run
  kvm: x86: fix x86 eflags fixed bit
  KVM: s390: migrate vcpu interrupt state
  ...
2015-04-13 09:47:01 -07:00
Arseny Solokha
1945606031 kvm/ppc/mpic: drop unused IRQ_testbit
Drop unused static procedure which doesn't have callers within its
translation unit. It had been already removed independently in QEMU[1]
from the OpenPIC implementation borrowed from the kernel.

[1] https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg01812.html

Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru>
Cc: Alexander Graf <agraf@suse.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1424768706-23150-3-git-send-email-asolokha@kb.kras.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:46:58 +02:00
Michael Ellerman
b7f859dda9 Merge branch 'next-remove-ldst' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc into next 2015-04-07 13:25:14 +10:00
Andre Przywara
af669ac6dc KVM: move iodev.h from virt/kvm/ to include/kvm
iodev.h contains definitions for the kvm_io_bus framework. This is
needed both by the generic KVM code in virt/kvm as well as by
architecture specific code under arch/. Putting the header file in
virt/kvm and using local includes in the architecture part seems at
least dodgy to me, so let's move the file into include/kvm, so that a
more natural "#include <kvm/iodev.h>" can be used by all of the code.
This also solves a problem later when using struct kvm_io_device
in arm_vgic.h.
Fixing up the FSF address in the GPL header and a wrong include path
on the way.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-03-26 21:43:12 +00:00
Nikolay Nikolaev
e32edf4fd0 KVM: Redesign kvm_io_bus_ API to pass VCPU structure to the callbacks.
This is needed in e.g. ARM vGIC emulation, where the MMIO handling
depends on the VCPU that does the access.

Signed-off-by: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-03-26 21:43:11 +00:00
David Gibson
d078eed35d powerpc: Cleanup KVM emulated load/store endian handling
Sometimes the KVM code on powerpc needs to emulate load or store
instructions from the guest, which can include both normal and byte
reversed forms.

We currently (AFAICT) handle this correctly, but some variable names are
very misleading.  In particular we use "is_bigendian" in several places to
actually mean "is the IO the same endian as the host", but we now support
little-endian powerpc hosts.  This also ties into the misleadingly named
ld_le*() and st_le*() functions, which in fact always byteswap, even on
an LE host.

This patch cleans this up by renaming to more accurate "host_swabbed", and
uses the generic swab*() functions instead of the powerpc specific and
misleadingly named ld_le*() and st_le*() functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2015-03-24 13:33:51 +11:00
Paul Mackerras
2bf27601c7 KVM: PPC: Book3S HV: Fix instruction emulation
Commit 4a157d61b4 ("KVM: PPC: Book3S HV: Fix endianness of
instruction obtained from HEIR register") had the side effect that
we no longer reset vcpu->arch.last_inst to -1 on guest exit in
the cases where the instruction is not fetched from the guest.
This means that if instruction emulation turns out to be required
in those cases, the host will emulate the wrong instruction, since
vcpu->arch.last_inst will contain the last instruction that was
emulated.

This fixes it by making sure that vcpu->arch.last_inst is reset
to -1 in those cases.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-20 11:42:33 +01:00
Paul Mackerras
ecb6d6185b KVM: PPC: Book3S HV: Endian fix for accessing VPA yield count
The VPA (virtual processor area) is defined by PAPR and is therefore
big-endian, so we need a be32_to_cpu when reading it in
kvmppc_get_yield_count().  Without this, H_CONFER always fails on a
little-endian host, causing SMP guests to waste time spinning on
spinlocks.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-20 11:42:08 +01:00
Paul Mackerras
8f902b005e KVM: PPC: Book3S HV: Fix spinlock/mutex ordering issue in kvmppc_set_lpcr()
Currently, kvmppc_set_lpcr() has a spinlock around the whole function,
and inside that does mutex_lock(&kvm->lock).  It is not permitted to
take a mutex while holding a spinlock, because the mutex_lock might
call schedule().  In addition, this causes lockdep to warn about a
lock ordering issue:

======================================================
[ INFO: possible circular locking dependency detected ]
3.18.0-kvm-04645-gdfea862-dirty #131 Not tainted
-------------------------------------------------------
qemu-system-ppc/8179 is trying to acquire lock:
 (&kvm->lock){+.+.+.}, at: [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv]

but task is already holding lock:
 (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&vcore->lock)->rlock){+.+...}:
       [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570
       [<d00000000ecc7a14>] .kvmppc_vcpu_run_hv+0xc4/0xe40 [kvm_hv]
       [<d00000000eb9f5cc>] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
       [<d00000000eb9cb24>] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
       [<d00000000eb94478>] .kvm_vcpu_ioctl+0x4a8/0x7b0 [kvm]
       [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770
       [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0
       [<c000000000009264>] syscall_exit+0x0/0x98

-> #0 (&kvm->lock){+.+.+.}:
       [<c0000000000ff28c>] .lock_acquire+0xcc/0x1a0
       [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570
       [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv]
       [<d00000000ecc510c>] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv]
       [<d00000000eb9f234>] .kvmppc_set_one_reg+0x44/0x330 [kvm]
       [<d00000000eb9c9dc>] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm]
       [<d00000000eb9ced4>] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm]
       [<d00000000eb940b0>] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm]
       [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770
       [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0
       [<c000000000009264>] syscall_exit+0x0/0x98

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&vcore->lock)->rlock);
                               lock(&kvm->lock);
                               lock(&(&vcore->lock)->rlock);
  lock(&kvm->lock);

 *** DEADLOCK ***

2 locks held by qemu-system-ppc/8179:
 #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f18>] .vcpu_load+0x28/0x90 [kvm]
 #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv]

stack backtrace:
CPU: 4 PID: 8179 Comm: qemu-system-ppc Not tainted 3.18.0-kvm-04645-gdfea862-dirty #131
Call Trace:
[c000001a66c0f310] [c000000000b486ac] .dump_stack+0x88/0xb4 (unreliable)
[c000001a66c0f390] [c0000000000f8bec] .print_circular_bug+0x27c/0x3d0
[c000001a66c0f440] [c0000000000fe9e8] .__lock_acquire+0x2028/0x2190
[c000001a66c0f5d0] [c0000000000ff28c] .lock_acquire+0xcc/0x1a0
[c000001a66c0f6a0] [c000000000b3c120] .mutex_lock_nested+0x80/0x570
[c000001a66c0f7c0] [d00000000ecc1f54] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv]
[c000001a66c0f860] [d00000000ecc510c] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv]
[c000001a66c0f8d0] [d00000000eb9f234] .kvmppc_set_one_reg+0x44/0x330 [kvm]
[c000001a66c0f960] [d00000000eb9c9dc] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm]
[c000001a66c0f9f0] [d00000000eb9ced4] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm]
[c000001a66c0faf0] [d00000000eb940b0] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm]
[c000001a66c0fcb0] [c00000000026cbb4] .do_vfs_ioctl+0x444/0x770
[c000001a66c0fd90] [c00000000026cfa4] .SyS_ioctl+0xc4/0xe0
[c000001a66c0fe30] [c000000000009264] syscall_exit+0x0/0x98

This fixes it by moving the mutex_lock()/mutex_unlock() pair outside
the spin-locked region.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-03-20 11:41:14 +01:00
Mahesh Salgaonkar
45706bb53d powerpc/book3s: Fix flush_tlb cpu_spec hook to take a generic argument.
The flush_tlb hook in cpu_spec was introduced as a generic function hook
to invalidate TLBs. But the current implementation of flush_tlb hook
takes IS (invalidation selector) as an argument which is architecture
dependent. Hence, It is not right to have a generic routine where caller
has to pass non-generic argument.

This patch fixes this and makes flush_tlb hook as high level API.

Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-17 07:52:48 +11:00
Linus Torvalds
c833e17e27 Tighten rules for ACCESS_ONCE
This series tightens the rules for ACCESS_ONCE to only work
 on scalar types. It also contains the necessary fixups as
 indicated by build bots of linux-next.
 Now everything is in place to prevent new non-scalar users
 of ACCESS_ONCE and we can continue to convert code to
 READ_ONCE/WRITE_ONCE.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJU2H5MAAoJEBF7vIC1phx8Jm4QALPqKOMDSUBCrqJFWJeujtv2
 ILxJKsnjrAlt3dxnlVI3q6e5wi896hSce75PcvZ/vs/K3GdgMxOjrakBJGTJ2Qjg
 5njW9aGJDDr/SYFX33MLWfqy222TLtpxgSz379UgXjEzB0ymMWbJJ3FnGjVqQJdp
 RXDutpncRySc/rGHh9UPREIRR5GvimONsWE2zxgXjUzB8vIr2fCGvHTXfIb6RKbQ
 yaFoihzn0m+eisc5Gy4tQ1qhhnaYyWEGrINjHTjMFTQOWTlH80BZAyQeLdbyj2K5
 qloBPS/VhBTr/5TxV5onM+nVhu0LiblVNrdMHVeb7jyST4LeFOCaWK98lB3axSB5
 v/2D1YKNb3g1U1x3In/oNGQvs36zGiO1uEdMF1l8ZFXgCvHmATSFSTWBtqUhb5Ew
 JA3YyqMTG6dpRTMSnmu3/frr4wDqnxlB/ktQC1pf3tDp87mr1ZYEy/dQld+tltjh
 9Z5GSdrw0nf91wNI3DJf+26ZDdz5B+EpDnPnOKG8anI1lc/mQneI21/K/xUteFXw
 UZ1XGPLV2vbv9/a13u44SdjenHvQs1egsGeebMxVPoj6WmDLVmcIqinyS6NawYzn
 IlDGy/b3bSnXWMBP0ZVBX94KWLxqDDc4a/ayxsmxsP1tPZ+jDXjVDa7E3zskcHxG
 Uj5ULCPyU087t8Sl76mv
 =Dj70
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux

Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
 "Tighten rules for ACCESS_ONCE

  This series tightens the rules for ACCESS_ONCE to only work on scalar
  types.  It also contains the necessary fixups as indicated by build
  bots of linux-next.  Now everything is in place to prevent new
  non-scalar users of ACCESS_ONCE and we can continue to convert code to
  READ_ONCE/WRITE_ONCE"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
  kernel: Fix sparse warning for ACCESS_ONCE
  next: sh: Fix compile error
  kernel: tighten rules for ACCESS ONCE
  mm/gup: Replace ACCESS_ONCE with READ_ONCE
  x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
  x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
  ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
  ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
2015-02-14 10:54:28 -08:00
Linus Torvalds
b9085bcbf5 Fairly small update, but there are some interesting new features.
Common: Optional support for adding a small amount of polling on each HLT
 instruction executed in the guest (or equivalent for other architectures).
 This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes
 or TCP_RR netperf tests).  This also has to be enabled manually for now,
 but the plan is to auto-tune this in the future.
 
 ARM/ARM64: the highlights are support for GICv3 emulation and dirty page
 tracking
 
 s390: several optimizations and bugfixes.  Also a first: a feature
 exposed by KVM (UUID and long guest name in /proc/sysinfo) before
 it is available in IBM's hypervisor! :)
 
 MIPS: Bugfixes.
 
 x86: Support for PML (page modification logging, a new feature in
 Broadwell Xeons that speeds up dirty page tracking), nested virtualization
 improvements (nested APICv---a nice optimization), usual round of emulation
 fixes.  There is also a new option to reduce latency of the TSC deadline
 timer in the guest; this needs to be tuned manually.
 
 Some commits are common between this pull and Catalin's; I see you
 have already included his tree.
 
 ARM has other conflicts where functions are added in the same place
 by 3.19-rc and 3.20 patches.  These are not large though, and entirely
 within KVM.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJU28rkAAoJEL/70l94x66DXqQH/1TDOfJIjW7P2kb0Sw7Fy1wi
 cEX1KO/VFxAqc8R0E/0Wb55CXyPjQJM6xBXuFr5cUDaIjQ8ULSktL4pEwXyyv/s5
 DBDkN65mriry2w5VuEaRLVcuX9Wy+tqLQXWNkEySfyb4uhZChWWHvKEcgw5SqCyg
 NlpeHurYESIoNyov3jWqvBjr4OmaQENyv7t2c6q5ErIgG02V+iCux5QGbphM2IC9
 LFtPKxoqhfeB2xFxTOIt8HJiXrZNwflsTejIlCl/NSEiDVLLxxHCxK2tWK/tUXMn
 JfLD9ytXBWtNMwInvtFm4fPmDouv2VDyR0xnK2db+/axsJZnbxqjGu1um4Dqbak=
 =7gdx
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM update from Paolo Bonzini:
 "Fairly small update, but there are some interesting new features.

  Common:
     Optional support for adding a small amount of polling on each HLT
     instruction executed in the guest (or equivalent for other
     architectures).  This can improve latency up to 50% on some
     scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests).  This
     also has to be enabled manually for now, but the plan is to
     auto-tune this in the future.

  ARM/ARM64:
     The highlights are support for GICv3 emulation and dirty page
     tracking

  s390:
     Several optimizations and bugfixes.  Also a first: a feature
     exposed by KVM (UUID and long guest name in /proc/sysinfo) before
     it is available in IBM's hypervisor! :)

  MIPS:
     Bugfixes.

  x86:
     Support for PML (page modification logging, a new feature in
     Broadwell Xeons that speeds up dirty page tracking), nested
     virtualization improvements (nested APICv---a nice optimization),
     usual round of emulation fixes.

     There is also a new option to reduce latency of the TSC deadline
     timer in the guest; this needs to be tuned manually.

     Some commits are common between this pull and Catalin's; I see you
     have already included his tree.

  Powerpc:
     Nothing yet.

     The KVM/PPC changes will come in through the PPC maintainers,
     because I haven't received them yet and I might end up being
     offline for some part of next week"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
  KVM: ia64: drop kvm.h from installed user headers
  KVM: x86: fix build with !CONFIG_SMP
  KVM: x86: emulate: correct page fault error code for NoWrite instructions
  KVM: Disable compat ioctl for s390
  KVM: s390: add cpu model support
  KVM: s390: use facilities and cpu_id per KVM
  KVM: s390/CPACF: Choose crypto control block format
  s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
  KVM: s390: reenable LPP facility
  KVM: s390: floating irqs: fix user triggerable endless loop
  kvm: add halt_poll_ns module parameter
  kvm: remove KVM_MMIO_SIZE
  KVM: MIPS: Don't leak FPU/DSP to guest
  KVM: MIPS: Disable HTW while in guest
  KVM: nVMX: Enable nested posted interrupt processing
  KVM: nVMX: Enable nested virtual interrupt delivery
  KVM: nVMX: Enable nested apic register virtualization
  KVM: nVMX: Make nested control MSRs per-cpu
  KVM: nVMX: Enable nested virtualize x2apic mode
  KVM: nVMX: Prepare for using hardware MSR bitmap
  ...
2015-02-13 09:55:09 -08:00
Mel Gorman
8a0516ed8b mm: convert p[te|md]_numa users to p[te|md]_protnone_numa
Convert existing users of pte_numa and friends to the new helper.  Note
that the kernel is broken after this patch is applied until the other page
table modifiers are also altered.  This patch layout is to make review
easier.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
Linus Torvalds
d3f180ea1a powerpc updates for 3.20
Including:
 
 - Update of all defconfigs
 - Addition of a bunch of config options to modernise our defconfigs
 - Some PS3 updates from Geoff
 - Optimised memcmp for 64 bit from Anton
 - Fix for kprobes that allows 'perf probe' to work from Naveen
 - Several cxl updates from Ian & Ryan
 - Expanded support for the '24x7' PMU from Cody & Sukadev
 - Freescale updates from Scott:
   "Highlights include 8xx optimizations, some more work on datapath device
    tree content, e300 machine check support, t1040 corenet error reporting,
    and various cleanups and fixes."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU2/LSAAoJEFHr6jzI4aWATDAQAKPU6v2Mq0sLnGst69waHU/Q
 vvpIq9hqVeSr6znHhrnazc3iQTLk0acqIdxUl/dT+5ADhi9+FxGD5Ckk+BH1DDve
 g6mQelSMlVZF9hKonHsbr4iUuTUyZyx2vj2qjdgOaRiv9Xubq6vUFNeolq3AeHxv
 J33vqRTmowj3VJ52u+V1dmzXQGfUye7DG2jHpjXoBieZsroTvyuYm5GoIPblWFO6
 zbYRh6IitALnQRtXfwIManPyWMkJti9JX8PwDkmvacr+V+MXbrksHpIOITMhNlo1
 WsVnFMpxuk80XuUfhaKZgISgBSfCqBckvKDn2QwztF2/kBnV6Su5xiOKVgouzM6B
 myy+maiMZlNJlNjqdMK5v2bqMXICP048zgfMbDN2e1K25jSSlRawt0RngoCQO2EP
 7aWmEDAlL3shgzkl68pj1fevQokxC/40C1yExIgAa9C31+bjtMz4Xb1SfN1SSveW
 7uWEY/eG9eLsrSE1CeBDvh6B8BRdyuIHgPhux4Tgc/bUtBGFQ29NuXwKh3QCeEy9
 9wWrRGx3U69eP06Ey7P5js3jPTQs80bjJewyGaiPQF5XHB89To8Dg8VfXjEV49Dx
 Pa3OLL5QsQloKfEBiEhQeGfKYImC00pVYAxc0qpmnr9T+25Ri1TLdF1EBAwriSYE
 5p9kSW+ZIht0lvzsdPNm
 =xDU3
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - Update of all defconfigs

 - Addition of a bunch of config options to modernise our defconfigs

 - Some PS3 updates from Geoff

 - Optimised memcmp for 64 bit from Anton

 - Fix for kprobes that allows 'perf probe' to work from Naveen

 - Several cxl updates from Ian & Ryan

 - Expanded support for the '24x7' PMU from Cody & Sukadev

 - Freescale updates from Scott:
    "Highlights include 8xx optimizations, some more work on datapath
     device tree content, e300 machine check support, t1040 corenet
     error reporting, and various cleanups and fixes"

* tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
  cxl: Add missing return statement after handling AFU errror
  cxl: Fail AFU initialisation if an invalid configuration record is found
  cxl: Export optional AFU configuration record in sysfs
  powerpc/mm: Warn on flushing tlb page in kernel context
  powerpc/powernv: Add OPAL soft-poweroff routine
  powerpc/perf/hv-24x7: Document sysfs event description entries
  powerpc/perf/hv-gpci: add the remaining gpci requests
  powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
  powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
  perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
  perf: add PMU_EVENT_ATTR_STRING() helper
  perf: provide sysfs_show for struct perf_pmu_events_attr
  powerpc/kernel: Avoid initializing device-tree pointer twice
  powerpc: Remove old compile time disabled syscall tracing code
  powerpc/kernel: Make syscall_exit a local label
  cxl: Fix device_node reference counting
  powerpc/mm: bail out early when flushing TLB page
  powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
  perf/powerpc: reset event hw state when adding it to the PMU
  powerpc/qe: Use strlcpy()
  ...
2015-02-11 18:15:38 -08:00
Paolo Bonzini
f781951299 kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.

This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest.  KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too.  When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.

With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest.  This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.

Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host.  The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter.  It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.

While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls.  During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark.  The wasted time is thus very low.  Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.

The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer.  Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-06 13:08:37 +01:00
Dominik Dingel
31928aa586 KVM: remove unneeded return value of vcpu_postcreate
The return value of kvm_arch_vcpu_postcreate is not checked in its
caller.  This is okay, because only x86 provides vcpu_postcreate right
now and it could only fail if vcpu_load failed.  But that is not
possible during KVM_CREATE_VCPU (kvm_arch_vcpu_load is void, too), so
just get rid of the unchecked return value.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2015-01-23 13:24:52 +01:00
Christian Borntraeger
5ee07612e9 ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
ACCESS_ONCE does not work reliably on non-scalar types. For
example gcc 4.6 and 4.7 might remove the volatile tag for such
accesses during the SRA (scalar replacement of aggregates) step
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
2015-01-19 14:14:07 +01:00
Pranith Kumar
83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
Michael Ellerman
9a4fc4eaf1 powerpc/kvm: Create proper names for the kvm_host_state PMU fields
We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.

Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Alexander Graf <agraf@suse.de>
2014-12-29 15:45:55 +11:00
Linus Torvalds
66dcff86ba 3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-assisted
 virtualization on the PPC970
 - ARM, PPC, s390 all had only small fixes
 
 For x86:
 - small performance improvements (though only on weird guests)
 - usual round of hardware-compliancy fixes from Nadav
 - APICv fixes
 - XSAVES support for hosts and guests.  XSAVES hosts were broken because
 the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
 ABI whenever XSAVES was enabled; hence, this part is going to stable.
 Guest support is just a matter of exposing the feature and CPUID leaves
 support.
 
 Right now KVM is broken for PPC BookE in your tree (doesn't compile).
 I'll reply to the pull request with a patch, please apply it either
 before the pull request or in the merge commit, in order to preserve
 bisectability somewhat.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
 lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
 yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
 DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
 MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
 6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
 =Zwq1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM update from Paolo Bonzini:
 "3.19 changes for KVM:

   - spring cleaning: removed support for IA64, and for hardware-
     assisted virtualization on the PPC970

   - ARM, PPC, s390 all had only small fixes

  For x86:
   - small performance improvements (though only on weird guests)
   - usual round of hardware-compliancy fixes from Nadav
   - APICv fixes
   - XSAVES support for hosts and guests.  XSAVES hosts were broken
     because the (non-KVM) XSAVES patches inadvertently changed the KVM
     userspace ABI whenever XSAVES was enabled; hence, this part is
     going to stable.  Guest support is just a matter of exposing the
     feature and CPUID leaves support"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
  KVM: move APIC types to arch/x86/
  KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
  KVM: PPC: Book3S HV: Improve H_CONFER implementation
  KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
  KVM: PPC: Book3S HV: Remove code for PPC970 processors
  KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
  KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
  arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
  arch: powerpc: kvm: book3s_pr.c: Remove unused function
  arch: powerpc: kvm: book3s.c: Remove some unused functions
  arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
  KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
  KVM: PPC: Book3S HV: ptes are big endian
  KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
  KVM: PPC: Book3S HV: Fix KSM memory corruption
  KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
  KVM: PPC: Book3S HV: Fix computation of tlbie operand
  KVM: PPC: Book3S HV: Add missing HPTE unlock
  KVM: PPC: BookE: Improve irq inject tracepoint
  arm/arm64: KVM: Require in-kernel vgic for the arch timers
  ...
2014-12-18 16:05:28 -08:00
Alexander Graf
91ed9e8a32 KVM: PPC: E500: Compile fix in this_cpu_write
Commit 69111bac42 ("powerpc: Replace __get_cpu_var uses") introduced
compile breakage to the e500 target by introducing invalid automatically
created C syntax.

Fix up the breakage and make the code compile again.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 16:02:26 -08:00
Anton Blanchard
476ce5ef09 KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
The in-kernel XICS emulation is faster than doing it all in QEMU
and it has got a lot of testing, so enable it by default.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 22:23:22 +01:00
Sam Bobroff
90fd09f804 KVM: PPC: Book3S HV: Improve H_CONFER implementation
Currently the H_CONFER hcall is implemented in kernel virtual mode,
meaning that whenever a guest thread does an H_CONFER, all the threads
in that virtual core have to exit the guest.  This is bad for
performance because it interrupts the other threads even if they
are doing useful work.

The H_CONFER hcall is called by a guest VCPU when it is spinning on a
spinlock and it detects that the spinlock is held by a guest VCPU that
is currently not running on a physical CPU.  The idea is to give this
VCPU's time slice to the holder VCPU so that it can make progress
towards releasing the lock.

To avoid having the other threads exit the guest unnecessarily,
we add a real-mode implementation of H_CONFER that checks whether
the other threads are doing anything.  If all the other threads
are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
it returns H_TOO_HARD which causes a guest exit and allows the
H_CONFER to be handled in virtual mode.

Otherwise it spins for a short time (up to 10 microseconds) to give
other threads the chance to observe that this thread is trying to
confer.  The spin loop also terminates when any thread exits the guest
or when all other threads are idle or trying to confer.  If the
timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
guest VCPU will recheck the spinlock word and most likely call
H_CONFER again.

This also improves the implementation of the H_CONFER virtual mode
handler.  If the VCPU is part of a virtual core (vcore) which is
runnable, there will be a 'runner' VCPU which has taken responsibility
for running the vcore.  In this case we yield to the runner VCPU
rather than the target VCPU.

We also introduce a check on the target VCPU's yield count: if it
differs from the yield count passed to H_CONFER, the target VCPU
has run since H_CONFER was called and may have already released
the lock.  This check is required by PAPR.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:53:39 +01:00
Paul Mackerras
4a157d61b4 KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register).  If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.

Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness.  The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.

To fix this, we define a new vcpu field to store the HEIR value.  Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:50:39 +01:00
Paul Mackerras
c17b98cf60 KVM: PPC: Book3S HV: Remove code for PPC970 processors
This removes the code that was added to enable HV KVM to work
on PPC970 processors.  The PPC970 is an old CPU that doesn't
support virtualizing guest memory.  Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.

Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:44:03 +01:00
Suresh E. Warrier
3c78f78af9 KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
This patch adds trace points in the guest entry and exit code and also
for exceptions handled by the host in kernel mode - hypercalls and page
faults. The new events are added to /sys/kernel/debug/tracing/events
under a new subsystem called kvm_hv.

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:29:27 +01:00
Paul Mackerras
2711e248a3 KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
Currently the calculations of stolen time for PPC Book3S HV guests
uses fields in both the vcpu struct and the kvmppc_vcore struct.  The
fields in the kvmppc_vcore struct are protected by the
vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
running the virtual core.  This works correctly but confuses lockdep,
because it sees that the code takes the tbacct_lock for a vcpu in
kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
vcore_stolen_time(), and it thinks there is a possibility of deadlock,
causing it to print reports like this:

=============================================
[ INFO: possible recursive locking detected ]
3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
---------------------------------------------
qemu-system-ppc/6188 is trying to acquire lock:
 (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]

but task is already holding lock:
 (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&vcpu->arch.tbacct_lock)->rlock);
  lock(&(&vcpu->arch.tbacct_lock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by qemu-system-ppc/6188:
 #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
 #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
 #2:  (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]

stack backtrace:
CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
Call Trace:
[c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
[c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
[c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
[c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
[c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
[c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
[c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
[c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
[c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
[c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
[c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
[c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
[c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98

In order to make the locking easier to analyse, we change the code to
use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
preempt_tb fields.  This lock needs to be an irq-safe lock since it is
used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
functions, which are called with the scheduler rq lock held, which is
an irq-safe lock.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:20:09 +01:00
Rickard Strandqvist
a0499cf746 arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
Remove the function inst_set_field() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:13:29 +01:00
Rickard Strandqvist
6178839b01 arch: powerpc: kvm: book3s_pr.c: Remove unused function
Remove the function get_fpr_index() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:13:16 +01:00
Rickard Strandqvist
54ca162a0c arch: powerpc: kvm: book3s.c: Remove some unused functions
Removes some functions that are not used anywhere:
kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:12:42 +01:00
Rickard Strandqvist
24aaaf22ea arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
Remove the function sr_nx() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-17 13:12:25 +01:00
Suresh E. Warrier
1bc5d59c35 KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
The kvmppc_vcore_blocked() code does not check for the wait condition
after putting the process on the wait queue. This means that it is
possible for an external interrupt to become pending, but the vcpu to
remain asleep until the next decrementer interrupt.  The fix is to
make one last check for pending exceptions and ceded state before
calling schedule().

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:25 +01:00
Cédric Le Goater
ffada016fb KVM: PPC: Book3S HV: ptes are big endian
When being restored from qemu, the kvm_get_htab_header are in native
endian, but the ptes are big endian.

This patch fixes restore on a KVM LE host. Qemu also needs a fix for
this :

     http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg00008.html

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:24 +01:00
Suresh E. Warrier
5b88cda665 KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
This fixes some inaccuracies in the state machine for the virtualized
ICP when implementing the H_IPI hcall (Set_MFFR and related states):

1. The old code wipes out any pending interrupts when the new MFRR is
   more favored than the CPPR but less favored than a pending
   interrupt (by always modifying xisr and the pending_pri). This can
   cause us to lose a pending external interrupt.

   The correct code here is to only modify the pending_pri and xisr in
   the ICP if the MFRR is equal to or more favored than the current
   pending pri (since in this case, it is guaranteed that that there
   cannot be a pending external interrupt). The code changes are
   required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.

2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check
   for whether MFRR is being made less favored AND further if new MFFR
   is also less favored than the current CPPR, we check for any
   resends pending in the ICP. These checks look like they are
   designed to cover the case where if the MFRR is being made less
   favored, we opportunistically trigger a resend of any interrupts
   that had been previously rejected. Although, this is not a state
   described by PAPR, this is an action we actually need to do
   especially if the CPPR is already at 0xFF.  Because in this case,
   the resend bit will stay on until another ICP state change which
   may be a long time coming and the interrupt stays pending until
   then. The current code which checks for MFRR < CPPR is broken when
   CPPR is 0xFF since it will not get triggered in that case.

   Ideally, we would want to do a resend only if

   	prio(pending_interrupt) < mfrr && prio(pending_interrupt) < cppr

   where pending interrupt is the one that was rejected. But we don't
   have the priority of the pending interrupt state saved, so we
   simply trigger a resend whenever the MFRR is made less favored.

3. In kvmppc_rm_h_ipi, where we save state to pass resends to the
   virtual mode, we also need to save the ICP whose need_resend we
   reset since this does not need to be my ICP (vcpu->arch.icp) as is
   incorrectly assumed by the current code. A new field rm_resend_icp
   is added to the kvmppc_icp structure for this purpose.

Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:24 +01:00
Paul Mackerras
b4a839009a KVM: PPC: Book3S HV: Fix KSM memory corruption
Testing with KSM active in the host showed occasional corruption of
guest memory.  Typically a page that should have contained zeroes
would contain values that look like the contents of a user process
stack (values such as 0x0000_3fff_xxxx_xxx).

Code inspection in kvmppc_h_protect revealed that there was a race
condition with the possibility of granting write access to a page
which is read-only in the host page tables.  The code attempts to keep
the host mapping read-only if the host userspace PTE is read-only, but
if that PTE had been temporarily made invalid for any reason, the
read-only check would not trigger and the host HPTE could end up
read-write.  Examination of the guest HPT in the failure situation
revealed that there were indeed shared pages which should have been
read-only that were mapped read-write.

To close this race, we don't let a page go from being read-only to
being read-write, as far as the real HPTE mapping the page is
concerned (the guest view can go to read-write, but the actual mapping
stays read-only).  When the guest tries to write to the page, we take
an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a
writable HPTE for the page.

This eliminates the occasional corruption of shared pages
that was previously seen with KSM active.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:24 +01:00
Mahesh Salgaonkar
dee6f24c33 KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
When we get an HMI (hypervisor maintenance interrupt) while in a
guest, we see that guest enters into paused state.  The reason is, in
kvmppc_handle_exit_hv it falls through default path and returns to
host instead of resuming guest.  This causes guest to enter into
paused state.  HMI is a hypervisor only interrupt and it is safe to
resume the guest since the host has handled it already.  This patch
adds a switch case to resume the guest.

Without this patch we see guest entering into paused state with following
console messages:

[ 3003.329351] Severe Hypervisor Maintenance interrupt [Recovered]
[ 3003.329356]  Error detail: Timer facility experienced an error
[ 3003.329359] 	HMER: 0840000000000000
[ 3003.329360] 	TFMR: 4a12000980a84000
[ 3003.329366] vcpu c0000007c35094c0 (40):
[ 3003.329368] pc  = c0000000000c2ba0  msr = 8000000000009032  trap = e60
[ 3003.329370] r 0 = c00000000021ddc0  r16 = 0000000000000046
[ 3003.329372] r 1 = c00000007a02bbd0  r17 = 00003ffff27d5d98
[ 3003.329375] r 2 = c0000000010980b8  r18 = 00001fffffc9a0b0
[ 3003.329377] r 3 = c00000000142d6b8  r19 = c00000000142d6b8
[ 3003.329379] r 4 = 0000000000000002  r20 = 0000000000000000
[ 3003.329381] r 5 = c00000000524a110  r21 = 0000000000000000
[ 3003.329383] r 6 = 0000000000000001  r22 = 0000000000000000
[ 3003.329386] r 7 = 0000000000000000  r23 = c00000000524a110
[ 3003.329388] r 8 = 0000000000000000  r24 = 0000000000000001
[ 3003.329391] r 9 = 0000000000000001  r25 = c00000007c31da38
[ 3003.329393] r10 = c0000000014280b8  r26 = 0000000000000002
[ 3003.329395] r11 = 746f6f6c2f68656c  r27 = c00000000524a110
[ 3003.329397] r12 = 0000000028004484  r28 = c00000007c31da38
[ 3003.329399] r13 = c00000000fe01400  r29 = 0000000000000002
[ 3003.329401] r14 = 0000000000000046  r30 = c000000003011e00
[ 3003.329403] r15 = ffffffffffffffba  r31 = 0000000000000002
[ 3003.329404] ctr = c00000000041a670  lr  = c000000000272520
[ 3003.329405] srr0 = c00000000007e8d8 srr1 = 9000000000001002
[ 3003.329406] sprg0 = 0000000000000000 sprg1 = c00000000fe01400
[ 3003.329407] sprg2 = c00000000fe01400 sprg3 = 0000000000000005
[ 3003.329408] cr = 48004482  xer = 2000000000000000  dsisr = 42000000
[ 3003.329409] dar = 0000010015020048
[ 3003.329410] fault dar = 0000010015020048 dsisr = 42000000
[ 3003.329411] SLB (8 entries):
[ 3003.329412]   ESID = c000000008000000 VSID = 40016e7779000510
[ 3003.329413]   ESID = d000000008000001 VSID = 400142add1000510
[ 3003.329414]   ESID = f000000008000004 VSID = 4000eb1a81000510
[ 3003.329415]   ESID = 00001f000800000b VSID = 40004fda0a000d90
[ 3003.329416]   ESID = 00003f000800000c VSID = 400039f536000d90
[ 3003.329417]   ESID = 000000001800000d VSID = 0001251b35150d90
[ 3003.329417]   ESID = 000001000800000e VSID = 4001e46090000d90
[ 3003.329418]   ESID = d000080008000019 VSID = 40013d349c000400
[ 3003.329419] lpcr = c048800001847001 sdr1 = 0000001b19000006 last_inst = ffffffff
[ 3003.329421] trap=0xe60 | pc=0xc0000000000c2ba0 | msr=0x8000000000009032
[ 3003.329524] Severe Hypervisor Maintenance interrupt [Recovered]
[ 3003.329526]  Error detail: Timer facility experienced an error
[ 3003.329527] 	HMER: 0840000000000000
[ 3003.329527] 	TFMR: 4a12000980a94000
[ 3006.359786] Severe Hypervisor Maintenance interrupt [Recovered]
[ 3006.359792]  Error detail: Timer facility experienced an error
[ 3006.359795] 	HMER: 0840000000000000
[ 3006.359797] 	TFMR: 4a12000980a84000

 Id    Name                           State
----------------------------------------------------
 2     guest2                         running
 3     guest3                         paused
 4     guest4                         running

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:24 +01:00
Aneesh Kumar K.V
f6fb9e848c KVM: PPC: Book3S HV: Add missing HPTE unlock
In kvm_test_clear_dirty(), if we find an invalid HPTE we move on to the
next HPTE without unlocking the invalid one.  In fact we should never
find an invalid and unlocked HPTE in the rmap chain, but for robustness
we should unlock it.  This adds the missing unlock.

Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:23 +01:00
Alexander Graf
b6b612571e KVM: PPC: BookE: Improve irq inject tracepoint
When injecting an IRQ, we only document which IRQ priority (which translates
to IRQ type) gets injected. However, when reading traces you don't necessarily
have all the numbers in your head to know which IRQ really is meant.

This patch converts the IRQ number field to a symbolic name that is in sync
with the respective define. That way it's a lot easier for readers to figure
out what interrupt gets injected.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-12-15 13:27:23 +01:00
Linus Torvalds
a7cb7bb664 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree update from Jiri Kosina:
 "Usual stuff: documentation updates, printk() fixes, etc"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
  intel_ips: fix a type in error message
  cpufreq: cpufreq-dt: Move newline to end of error message
  ps3rom: fix error return code
  treewide: fix typo in printk and Kconfig
  ARM: dts: bcm63138: change "interupts" to "interrupts"
  Replace mentions of "list_struct" to "list_head"
  kernel: trace: fix printk message
  scsi: mpt2sas: fix ioctl in comment
  zbud, zswap: change module author email
  clocksource: Fix 'clcoksource' typo in comment
  arm: fix wording of "Crotex" in CONFIG_ARCH_EXYNOS3 help
  gpio: msm-v1: make boolean argument more obvious
  usb: Fix typo in usb-serial-simple.c
  PCI: Fix comment typo 'COMFIG_PM_OPS'
  powerpc: Fix comment typo 'CONIFG_8xx'
  powerpc: Fix comment typos 'CONFiG_ALTIVEC'
  clk: st: Spelling s/stucture/structure/
  isci: Spelling s/stucture/structure/
  usb: gadget: zero: Spelling s/infrastucture/infrastructure/
  treewide: Fix company name in module descriptions
  ...
2014-12-12 10:08:06 -08:00
Paul Mackerras
56548fc0e8 powerpc/powernv: Return to cpu offline loop when finished in KVM guest
When a secondary hardware thread has finished running a KVM guest, we
currently put that thread into nap mode using a nap instruction in
the KVM code.  This changes the code so that instead of doing a nap
instruction directly, we instead cause the call to power7_nap() that
put the thread into nap mode to return.  The reason for doing this is
to avoid having the KVM code having to know what low-power mode to
put the thread into.

In the case of a secondary thread used to run a KVM guest, the thread
will be offline from the point of view of the host kernel, and the
relevant power7_nap() call is the one in pnv_smp_cpu_disable().
In this case we don't want to clear pending IPIs in the offline loop
in that function, since that might cause us to miss the wakeup for
the next time the thread needs to run a guest.  To tell whether or
not to clear the interrupt, we use the SRR1 value returned from
power7_nap(), and check if it indicates an external interrupt.  We
arrange that the return from power7_nap() when we have finished running
a guest returns 0, so pending interrupts don't get flushed in that
case.

Note that it is important a secondary thread that has finished
executing in the guest, or that didn't have a guest to run, should
not return to power7_nap's caller while the kvm_hstate.hwthread_req
flag in the PACA is non-zero, because the return from power7_nap
will reenable the MMU, and the MMU might still be in guest context.
In this situation we spin at low priority in real mode waiting for
hwthread_req to become zero.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-08 13:16:31 +11:00
Masanari Iida
6774def642 treewide: fix typo in printk and Kconfig
This patch fix spelling typo in printk and Kconfig within
various part of kernel sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-11-20 14:56:11 +01:00
Anton Blanchard
68cf0d642f powerpc: Remove superfluous bootmem includes
Lots of places included bootmem.h even when not using bootmem.

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-10 09:59:26 +11:00
Anton Blanchard
14ed740957 powerpc: Remove some old bootmem related comments
Now bootmem is gone from powerpc we can remove comments mentioning it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-10 09:59:25 +11:00
Christoph Lameter
69111bac42 powerpc: Replace __get_cpu_var uses
This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.

V2->V2
  - Fix up to work against 3.18-rc1

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e.  using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	__this_cpu_inc(y)

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
      assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-03 12:12:32 +11:00
Linus Torvalds
e4e65676f2 Fixes and features for 3.18.
Apart from the usual cleanups, here is the summary of new features:
 
 - s390 moves closer towards host large page support
 
 - PowerPC has improved support for debugging (both inside the guest and
   via gdbstub) and support for e6500 processors
 
 - ARM/ARM64 support read-only memory (which is necessary to put firmware
   in emulated NOR flash)
 
 - x86 has the usual emulator fixes and nested virtualization improvements
   (including improved Windows support on Intel and Jailhouse hypervisor
   support on AMD), adaptive PLE which helps overcommitting of huge guests.
   Also included are some patches that make KVM more friendly to memory
   hot-unplug, and fixes for rare caching bugs.
 
 Two patches have trivial mm/ parts that were acked by Rik and Andrew.
 
 Note: I will soon switch to a subkey for signing purposes.  To verify
 future signed pull requests from me, please update my key with
 "gpg --recv-keys 9B4D86F2".  You should see 3 new subkeys---the
 one for signing will be a 2048-bit RSA key, 4E6B09D7.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJUL5sPAAoJEBvWZb6bTYbyfkEP/3MNhSyn6HCjPjtjLNPAl9KL
 WpExZSUFL2+4CztpdGIsek1BeJYHmqv3+c5S+WvaWVA1aqh2R7FT1D1ErBLjgLQq
 lq23IOr+XxmC3dXQUEEk+TlD+283UzypzEG4l4UD3JYg79fE3UrXAz82SeyewJDY
 x7aPYhkZG3RHu+wAyMPasG6E3zS5LySdUtGWbiPwz5BejrhBJoJdeb2WIL/RwnUK
 7ppSLB5EoFj/uMkuyeAAdAbdfSrhHA6faDZxNdxS9k9wGutrhhfUoQ49ONrKG4dV
 sFo1tSPTVgRs8QFYUZ2fJUPBAmUVddsgqh2K9d0NftGTq7b8YszaCsfFrs2/Y4MU
 YxssWEhxsfszerCu12bbAJrv6JBZYQ7TwGvI9L7P0iFU6IVw/djmukU4AkM9/e91
 YS/cue/PN+9Pn2ccXzL9J7xRtZb8FsOuRsCXTCmbOwDkLmrKPDBN2t3RUbeF+Eam
 ABrpWnLKX13kZSo4LKU+/niarzmPMp7odQfHVdr8ea0fiYLp4iN8puA20WaSPIgd
 CLvm+RAvXe5Lm91L4mpFotJ2uFyK6QlIYJV4FsgeWv/0D0qppWQi0Utb/aCNHCgy
 z8MyUMD48y7EpoQrFYr/7cddXIu0/NegnM8I1coVjIPEk4NfeebGUlCJ/V3D8wMG
 BgEfS2x6jRc5zB3hjwDr
 =iEVi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Fixes and features for 3.18.

  Apart from the usual cleanups, here is the summary of new features:

   - s390 moves closer towards host large page support

   - PowerPC has improved support for debugging (both inside the guest
     and via gdbstub) and support for e6500 processors

   - ARM/ARM64 support read-only memory (which is necessary to put
     firmware in emulated NOR flash)

   - x86 has the usual emulator fixes and nested virtualization
     improvements (including improved Windows support on Intel and
     Jailhouse hypervisor support on AMD), adaptive PLE which helps
     overcommitting of huge guests.  Also included are some patches that
     make KVM more friendly to memory hot-unplug, and fixes for rare
     caching bugs.

  Two patches have trivial mm/ parts that were acked by Rik and Andrew.

  Note: I will soon switch to a subkey for signing purposes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (157 commits)
  kvm: do not handle APIC access page if in-kernel irqchip is not in use
  KVM: s390: count vcpu wakeups in stat.halt_wakeup
  KVM: s390/facilities: allow TOD-CLOCK steering facility bit
  KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
  arm/arm64: KVM: Report correct FSC for unsupported fault types
  arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
  kvm: Fix kvm_get_page_retry_io __gup retval check
  arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
  kvm: x86: Unpin and remove kvm_arch->apic_access_page
  kvm: vmx: Implement set_apic_access_page_addr
  kvm: x86: Add request bit to reload APIC access page address
  kvm: Add arch specific mmu notifier for page invalidation
  kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
  kvm: Fix page ageing bugs
  kvm/x86/mmu: Pass gfn and level to rmapp callback.
  x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
  kvm: x86: use macros to compute bank MSRs
  KVM: x86: Remove debug assertion of non-PAE reserved bits
  kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
  kvm: Faults which trigger IO release the mmap_sem
  ...
2014-10-08 05:27:39 -04:00
Aneesh Kumar K.V
cec26bc3c1 KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
We use cma reserved area for creating guest hash page table.
Don't do the reservation in non-hypervisor mode. This avoids unnecessary
CMA reservation when booting with limited memory configs like
fadump and kdump.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-29 15:36:33 +02:00
Paolo Bonzini
00c027db0c Patch queue for ppc - 2014-09-24
New awesome things in this release:
 
   - E500: e6500 core support
   - E500: guest and remote debug support
   - Book3S: remote sw breakpoint support
   - Book3S: HV: Minor bugfixes
 
 Alexander Graf (1):
       KVM: PPC: Pass enum to kvmppc_get_last_inst
 
 Bharat Bhushan (8):
       KVM: PPC: BOOKE: allow debug interrupt at "debug level"
       KVM: PPC: BOOKE : Emulate rfdi instruction
       KVM: PPC: BOOKE: Allow guest to change MSR_DE
       KVM: PPC: BOOKE: Clear guest dbsr in userspace exit KVM_EXIT_DEBUG
       KVM: PPC: BOOKE: Guest and hardware visible debug registers are same
       KVM: PPC: BOOKE: Add one reg interface for DBSR
       KVM: PPC: BOOKE: Add one_reg documentation of SPRG9 and DBSR
       KVM: PPC: BOOKE: Emulate debug registers and exception
 
 Madhavan Srinivasan (2):
       powerpc/kvm: support to handle sw breakpoint
       powerpc/kvm: common sw breakpoint instr across ppc
 
 Michael Neuling (1):
       KVM: PPC: Book3S HV: Add register name when loading toc
 
 Mihai Caraman (10):
       powerpc/booke: Restrict SPE exception handlers to e200/e500 cores
       powerpc/booke: Revert SPE/AltiVec common defines for interrupt numbers
       KVM: PPC: Book3E: Increase FPU laziness
       KVM: PPC: Book3e: Add AltiVec support
       KVM: PPC: Make ONE_REG powerpc generic
       KVM: PPC: Move ONE_REG AltiVec support to powerpc
       KVM: PPC: Remove the tasklet used by the hrtimer
       KVM: PPC: Remove shared defines for SPE and AltiVec interrupts
       KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core
       KVM: PPC: Book3E: Enable e6500 core
 
 Paul Mackerras (2):
       KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads
       KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJUIyyEAAoJECszeR4D/txgiV8P/AnSRcjxrlW+ITsimZezDaj5
 MfFv2ZyQKlVjp4cfzfCTW5otQT/K2rSfJzB/V6l1xGcM/UEO+snmPddokvFLMsp9
 dLvPjZI6ivZu/rjRZ8eqnTQIAwid0K5Yss870Y8YWfRBByKVDs7rRx75gj6q8kek
 jG3wLQQxDYEapkGXiaIcX2Mbf6GAZKNhGf6M5Khn/v3RE0+mNg9J+nffBZXOxEYo
 WDe20KNSuDqDEnWIc82uibTbH1Wnxmetc5jf21DWaquLs9VGbON1X9Myl+aBNQuP
 wDt6D04rgtBZbwyHKsSO/0poK0eIms+5jiW8c+XPO2QOLXQwwNKBNmRKePyk1bt5
 gRxd+u9OGzRGHKwIS1vqHLKCdr5HiTN0uE+nZ+oDWjXVJQRMc8HCx0tWxzZg46yd
 kIIRuDrIQQUH3j2L/PnY3Nx3yKNhg97Ysek0ToIsxlkqczrAUewnXuOj9Ijf+/Cz
 Y3cVsQEhepcO3xyz5uyWJQwmFZkwJVOclzGaNgXKeKl5fkpXwLPxc6vmI2K+hnU9
 TRFoQgbknPxQe2qv9cXeMBFhZwNRKpcYW7w3G81ko/7foVmwP3CjnNulXMKiNuVH
 i8pVd8zxiJuTWVQSksGWuWCxueLmc86L4khSF5YBzg9pid7ajmxcfEDWCQGdN+Fe
 Oh4HUW0860IJYOQRIKJv
 =CR/Z
 -----END PGP SIGNATURE-----

Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm-next

Patch queue for ppc - 2014-09-24

New awesome things in this release:

  - E500: e6500 core support
  - E500: guest and remote debug support
  - Book3S: remote sw breakpoint support
  - Book3S: HV: Minor bugfixes

Alexander Graf (1):
      KVM: PPC: Pass enum to kvmppc_get_last_inst

Bharat Bhushan (8):
      KVM: PPC: BOOKE: allow debug interrupt at "debug level"
      KVM: PPC: BOOKE : Emulate rfdi instruction
      KVM: PPC: BOOKE: Allow guest to change MSR_DE
      KVM: PPC: BOOKE: Clear guest dbsr in userspace exit KVM_EXIT_DEBUG
      KVM: PPC: BOOKE: Guest and hardware visible debug registers are same
      KVM: PPC: BOOKE: Add one reg interface for DBSR
      KVM: PPC: BOOKE: Add one_reg documentation of SPRG9 and DBSR
      KVM: PPC: BOOKE: Emulate debug registers and exception

Madhavan Srinivasan (2):
      powerpc/kvm: support to handle sw breakpoint
      powerpc/kvm: common sw breakpoint instr across ppc

Michael Neuling (1):
      KVM: PPC: Book3S HV: Add register name when loading toc

Mihai Caraman (10):
      powerpc/booke: Restrict SPE exception handlers to e200/e500 cores
      powerpc/booke: Revert SPE/AltiVec common defines for interrupt numbers
      KVM: PPC: Book3E: Increase FPU laziness
      KVM: PPC: Book3e: Add AltiVec support
      KVM: PPC: Make ONE_REG powerpc generic
      KVM: PPC: Move ONE_REG AltiVec support to powerpc
      KVM: PPC: Remove the tasklet used by the hrtimer
      KVM: PPC: Remove shared defines for SPE and AltiVec interrupts
      KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core
      KVM: PPC: Book3E: Enable e6500 core

Paul Mackerras (2):
      KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads
      KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR
2014-09-24 23:19:45 +02:00
Andres Lagar-Cavilla
5712846808 kvm: Fix page ageing bugs
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:58 +02:00
Alexander Graf
8d0eff6385 KVM: PPC: Pass enum to kvmppc_get_last_inst
The kvmppc_get_last_inst function recently received a facelift that allowed
us to pass an enum of the type of instruction we want to read into it rather
than an unreadable boolean.

Unfortunately, not all callers ended up passing the enum. This wasn't really
an issue as "true" and "false" happen to match the two enum values we have,
but it's still hard to read.

Update all callers of kvmppc_get_last_inst() to follow the new calling
convention.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:36 +02:00
Madhavan Srinivasan
033aaa14af powerpc/kvm: common sw breakpoint instr across ppc
This patch extends the use of illegal instruction as software
breakpoint instruction across the ppc platform. Patch extends
booke program interrupt code to support software breakpoint.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[agraf: Fix bookehv]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:36 +02:00
Madhavan Srinivasan
a59c1d9e60 powerpc/kvm: support to handle sw breakpoint
This patch adds kernel side support for software breakpoint.
Design is that, by using an illegal instruction, we trap to hypervisor
via Emulation Assistance interrupt, where we check for the illegal instruction
and accordingly we return to Host or Guest. Patch also adds support for
software breakpoint in PR KVM.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:35 +02:00
Mihai Caraman
d2ca32a2d4 KVM: PPC: Book3E: Enable e6500 core
Now that AltiVec and hardware thread support is in place enable e6500 core.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:35 +02:00
Mihai Caraman
188e267ce2 KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core
ePAPR represents hardware threads as cpu node properties in device tree.
So with existing QEMU, hardware threads are simply exposed as vcpus with
one hardware thread.

The e6500 core shares TLBs between hardware threads. Without tlb write
conditional instruction, the Linux kernel uses per core mechanisms to
protect against duplicate TLB entries.

The guest is unable to detect real siblings threads, so it can't use the
TLB protection mechanism. An alternative solution is to use the hypervisor
to allocate different lpids to guest's vcpus that runs simultaneous on real
siblings threads. On systems with two threads per core this patch halves
the size of the lpid pool that the allocator sees and use two lpids per VM.
Use even numbers to speedup vcpu lpid computation with consecutive lpids
per VM: vm1 will use lpids 2 and 3, vm2 lpids 4 and 5, and so on.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: fix spelling]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:35 +02:00
Paul Mackerras
9333e6c4c1 KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR
Since the guest can read the machine's PVR (Processor Version Register)
directly and see the real value, we should disallow userspace from
setting any value for the guest's PVR other than the real host value.
Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied
PVR value and return an error if it is different from the host value,
which has been put into vcpu->arch.pvr at vcpu creation time.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:34 +02:00
Paul Mackerras
b754c739ee KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads
Occasional failures have been seen with split-core mode and migration
where the message "KVM: couldn't grab cpu" appears.  This increases
the length of time that we wait from 1ms to 10ms, which seems to
work around the issue.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:34 +02:00
Mihai Caraman
d02d4d156e KVM: PPC: Remove the tasklet used by the hrtimer
Powerpc timer implementation is a copycat version of s390. Now that they removed
the tasklet with commit ea74c0ea1b follow this
optimization.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:34 +02:00
Bharat Bhushan
2f699a59f3 KVM: PPC: BOOKE: Emulate debug registers and exception
This patch emulates debug registers and debug exception
to support guest using debug resource. This enables running
gdb/kgdb etc in guest.

On BOOKE architecture we cannot share debug resources between QEMU and
guest because:
    When QEMU is using debug resources then debug exception must
    be always enabled. To achieve this we set MSR_DE and also set
    MSRP_DEP so guest cannot change MSR_DE.

    When emulating debug resource for guest we want guest
    to control MSR_DE (enable/disable debug interrupt on need).

    So above mentioned two configuration cannot be supported
    at the same time. So the result is that we cannot share
    debug resources between QEMU and Guest on BOOKE architecture.

In the current design QEMU gets priority over guest, this means that if
QEMU is using debug resources then guest cannot use them and if guest is
using debug resource then QEMU can overwrite them.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:33 +02:00
Mihai Caraman
3840edc803 KVM: PPC: Move ONE_REG AltiVec support to powerpc
Move ONE_REG AltiVec support to powerpc generic layer.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:33 +02:00
Mihai Caraman
8a41ea53b3 KVM: PPC: Make ONE_REG powerpc generic
Make ONE_REG generic for server and embedded architectures by moving
kvm_vcpu_ioctl_get_one_reg() and kvm_vcpu_ioctl_set_one_reg() functions
to powerpc layer.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:33 +02:00
Mihai Caraman
95d80a294b KVM: PPC: Book3e: Add AltiVec support
Add AltiVec support in KVM for Book3e. FPU support gracefully reuse host
infrastructure so follow the same approach for AltiVec.

Book3e specification defines shared interrupt numbers for SPE and AltiVec
units. Still SPE is present in e200/e500v2 cores while AltiVec is present in
e6500 core. So we can currently decide at compile-time which of the SPE or
AltiVec units to support exclusively by using CONFIG_SPE_POSSIBLE and
CONFIG_PPC_E500MC defines. As Alexander Graf suggested, keep SPE and AltiVec
exception handlers distinct to improve code readability.

Guests have the privilege to enable AltiVec, so we always need to support
AltiVec in KVM and implicitly in host to reflect interrupts and to save/restore
the unit context. KVM will be loaded on cores with AltiVec unit only if
CONFIG_ALTIVEC is defined. Use this define to guard KVM AltiVec logic.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:32 +02:00
Mihai Caraman
3efc7da61f KVM: PPC: Book3E: Increase FPU laziness
Increase FPU laziness by loading the guest state into the unit before entering
the guest instead of doing it on each vcpu schedule. Without this improvement
an interrupt may claim floating point corrupting guest state.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:32 +02:00
Michael Neuling
06a29e4274 KVM: PPC: Book3S HV: Add register name when loading toc
Add 'r' to register name r2 in kvmppc_hv_enter.

Also update comment at the top of kvmppc_hv_enter to indicate that R2/TOC is
non-volatile.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:31 +02:00
Bharat Bhushan
2c5096720f KVM: PPC: BOOKE: Add one reg interface for DBSR
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:30 +02:00
Bharat Bhushan
348ba71081 KVM: PPC: BOOKE: Guest and hardware visible debug registers are same
Guest visible debug register and hardware visible debug registers are
same, so ther is no need to have arch->shadow_dbg_reg, instead use
arch->dbg_reg.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:30 +02:00
Bharat Bhushan
2190991e7c KVM: PPC: BOOKE: Clear guest dbsr in userspace exit KVM_EXIT_DEBUG
Dbsr is not visible to userspace and we do not think any need to
expose this to userspace because:
  Userspace cannot inject debug interrupt to guest (as this
  does not know guest ability to handle debug interrupt), so
  userspace will always clear DBSR.
  Now if userspace has to always clear DBSR in KVM_EXIT_DEBUG
  handling then clearing dbsr in kernel looks simple as this
  avoid doing SET_SREGS/set_one_reg() to clear DBSR

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:30 +02:00
Bharat Bhushan
37277b1129 KVM: PPC: BOOKE: Allow guest to change MSR_DE
This patch changes the default behavior of MSRP_DEP, that is
guest is not allowed to change the MSR_DE, to guest can change
MSR_DE. When userspace is debugging guest then it override the
default behavior and set MSRP_DEP. This stops guest to change
MSR_DE when userspace is debugging guest.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:29 +02:00
Bharat Bhushan
c8ca97ca9b KVM: PPC: BOOKE : Emulate rfdi instruction
This patch adds "rfdi" instruction emulation which is required for
guest debug hander on BOOKE-HV

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:29 +02:00
Bharat Bhushan
9fee7563cd KVM: PPC: BOOKE: allow debug interrupt at "debug level"
Debug interrupt can be either "critical level" or "debug level".
There are separate set of save/restore registers used for different level.
Example: DSRR0/DSRR1 are used for "debug level" and CSRR0/CSRR1
are used for critical level debug interrupt.

Using CPU_FTR_DEBUG_LVL_EXC to decide which interrupt level to be used.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-09-22 10:11:29 +02:00
Laurent Dufour
02a68d0503 powerpc/kvm/cma: Fix panic introduces by signed shift operation
fc95ca7284 introduces a memset in
kvmppc_alloc_hpt since the general CMA doesn't clear the memory it
allocates.

However, the size argument passed to memset is computed from a signed value
and its signed bit is extended by the cast the compiler is doing. This lead
to extremely large size value when dealing with order value >= 31, and
almost all the memory following the allocated space is cleaned. As a
consequence, the system is panicing and may even fail spawning the kdump
kernel.

This fix makes use of an unsigned value for the memset's size argument to
avoid sign extension. Among this fix, another shift operation which may
lead to signed extended value too is also fixed.

Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-03 10:34:07 +02:00
Radim Krčmář
13a34e067e KVM: remove garbage arg to *hardware_{en,dis}able
In the beggining was on_each_cpu(), which required an unused argument to
kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten.

Remove unnecessary arguments that stem from this.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 16:35:55 +02:00
Radim Krčmář
0865e636ae KVM: static inline empty kvm_arch functions
Using static inline is going to save few bytes and cycles.
For example on powerpc, the difference is 700 B after stripping.
(5 kB before)

This patch also deals with two overlooked empty functions:
kvm_arch_flush_shadow was not removed from arch/mips/kvm/mips.c
  2df72e9bc KVM: split kvm_arch_flush_shadow
and kvm_arch_sched_in never made it into arch/ia64/kvm/kvm-ia64.c.
  e790d9ef6 KVM: add kvm_arch_sched_in

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 16:35:55 +02:00
Radim Krčmář
e790d9ef64 KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly simpler
code and guaranteed call order with respect to kvm_sched_in.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:21 +02:00
Alexey Kardashevskiy
c04fa5831d PC, KVM, CMA: Fix regression caused by wrong get_order() use
fc95ca7284 claims that there is no
functional change but this is not true as it calls get_order() (which
takes bytes) where it should have called order_base_2() and the kernel
stops on VM_BUG_ON().

This replaces get_order() with order_base_2() (round-up version of ilog2).

Suggested-by: Paul Mackerras <paulus@samba.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:11:57 +02:00
Linus Torvalds
66bb0aa077 Here are the PPC and ARM changes for KVM, which I separated because
they had small conflicts (respectively within KVM documentation,
 and with 3.16-rc changes).  Since they were all within the subsystem,
 I took care of them.
 
 Stephen Rothwell reported some snags in PPC builds, but they are all
 fixed now; the latest linux-next report was clean.
 
 New features for ARM include:
 - KVM VGIC v2 emulation on GICv3 hardware
 - Big-Endian support for arm/arm64 (guest and host)
 - Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
 
 And for PPC:
 - Book3S: Good number of LE host fixes, enable HV on LE
 - Book3S HV: Add in-guest debug support
 
 This release drops support for KVM on the PPC440.  As a result, the
 PPC merge removes more lines than it adds. :)
 
 I also included an x86 change, since Davidlohr tied it to an independent
 bug report and the reporter quickly provided a Tested-by; there was no
 reason to wait for -rc2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJT4iIJAAoJEBvWZb6bTYbyZqoP/3Wxy8NWPFJ8HGt81NHlGnDS
 a9UbL7EibcOEG+aaKqmtBglTD5YDiGBDNCxxiSJaDHt+grLN4fsWIliJob1nJFoO
 90f89EWN2XjeCrJXA5nUoeg5tpc5OoYKsiP6pTgzIwkP8vvs/H1+zpcTS/UmYsr/
 qipVMMsM+zZeHWZcSbqjW88z7YqIn1sr5282wJ85cbyv4KGizb/G4dyPuDqLb6np
 hkAD8Ah6VV2suQ2FSy7G2fg20R0vglUi60hkEHLoCBPVqJCl7SmC8MvxNbjBnP8S
 J36R0R0u1wHYKzAGooLJGVOZ/o/gSiVqKX+++L2EvJBN+kuA6u/7fxLyBT+LwDAE
 IF/Aln5rpg1fe+eywvhz86WljTVEQ8bO1zVsIQUPY+/ZOPedZHMwyvXft8ogbjSp
 2m9OJ/3e8Aggh0OeHpCDoeow+QDUXvX0YdCw+2Yh0p+7VMXqkyp0QEiBu38jrusC
 rB3VNifJbDSWLKdG9LfCAPHnxZD2XYEwv2WFBo6KQOGMGHfx0GXpCOL/jQihrhA6
 HtEG5Bs3lvnHQemdpUZ58xojiABbMaUPdcnPXQQEp23WhZzrfLMLzqVG0VYnhSsC
 9pi7MJj8c31rqx5WU2oRM28i/BvNxN0NCtkDpineO5s3f89Ws1xnwxqlm38AKP0J
 irJQTYFEqec+GM9JK1rG
 =hyQP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second round of KVM changes from Paolo Bonzini:
 "Here are the PPC and ARM changes for KVM, which I separated because
  they had small conflicts (respectively within KVM documentation, and
  with 3.16-rc changes).  Since they were all within the subsystem, I
  took care of them.

  Stephen Rothwell reported some snags in PPC builds, but they are all
  fixed now; the latest linux-next report was clean.

  New features for ARM include:
   - KVM VGIC v2 emulation on GICv3 hardware
   - Big-Endian support for arm/arm64 (guest and host)
   - Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)

  And for PPC:
   - Book3S: Good number of LE host fixes, enable HV on LE
   - Book3S HV: Add in-guest debug support

  This release drops support for KVM on the PPC440.  As a result, the
  PPC merge removes more lines than it adds.  :)

  I also included an x86 change, since Davidlohr tied it to an
  independent bug report and the reporter quickly provided a Tested-by;
  there was no reason to wait for -rc2"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (122 commits)
  KVM: Move more code under CONFIG_HAVE_KVM_IRQFD
  KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
  KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
  KVM: PPC: Enable IRQFD support for the XICS interrupt controller
  KVM: Give IRQFD its own separate enabling Kconfig option
  KVM: Move irq notifier implementation into eventfd.c
  KVM: Move all accesses to kvm::irq_routing into irqchip.c
  KVM: irqchip: Provide and use accessors for irq routing table
  KVM: Don't keep reference to irq routing table in irqfd struct
  KVM: PPC: drop duplicate tracepoint
  arm64: KVM: fix 64bit CP15 VM access for 32bit guests
  KVM: arm64: GICv3: mandate page-aligned GICV region
  arm64: KVM: GICv3: move system register access to msr_s/mrs_s
  KVM: PPC: PR: Handle FSCR feature deselects
  KVM: PPC: HV: Remove generic instruction emulation
  KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
  KVM: PPC: Remove DCR handling
  KVM: PPC: Expose helper functions for data/inst faults
  KVM: PPC: Separate loadstore emulation from priv emulation
  KVM: PPC: Handle magic page in kvmppc_ld/st
  ...
2014-08-07 11:35:30 -07:00
Linus Torvalds
f536b3cae8 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "This is the powerpc new goodies for 3.17.  The short story:

  The biggest bit is Michael removing all of pre-POWER4 processor
  support from the 64-bit kernel.  POWER3 and rs64.  This gets rid of a
  ton of old cruft that has been bitrotting in a long while.  It was
  broken for quite a few versions already and nobody noticed.  Nobody
  uses those machines anymore.  While at it, he cleaned up a bunch of
  old dusty cabinets, getting rid of a skeletton or two.

  Then, we have some base VFIO support for KVM, which allows assigning
  of PCI devices to KVM guests, support for large 64-bit BARs on
  "powernv" platforms, support for HMI (Hardware Management Interrupts)
  on those same platforms, some sparse-vmemmap improvements (for memory
  hotplug),

  There is the usual batch of Freescale embedded updates (summary in the
  merge commit) and fixes here or there, I think that's it for the
  highlights"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (102 commits)
  powerpc/eeh: Export eeh_iommu_group_to_pe()
  powerpc/eeh: Add missing #ifdef CONFIG_IOMMU_API
  powerpc: Reduce scariness of interrupt frames in stack traces
  powerpc: start loop at section start of start in vmemmap_populated()
  powerpc: implement vmemmap_free()
  powerpc: implement vmemmap_remove_mapping() for BOOK3S
  powerpc: implement vmemmap_list_free()
  powerpc: Fail remap_4k_pfn() if PFN doesn't fit inside PTE
  powerpc/book3s: Fix endianess issue for HMI handling on napping cpus.
  powerpc/book3s: handle HMIs for cpus in nap mode.
  powerpc/powernv: Invoke opal call to handle hmi.
  powerpc/book3s: Add basic infrastructure to handle HMI in Linux.
  powerpc/iommu: Fix comments with it_page_shift
  powerpc/powernv: Handle compound PE in config accessors
  powerpc/powernv: Handle compound PE for EEH
  powerpc/powernv: Handle compound PE
  powerpc/powernv: Split ioda_eeh_get_state()
  powerpc/powernv: Allow to freeze PE
  powerpc/powernv: Enable M64 aperatus for PHB3
  powerpc/eeh: Aux PE data for error log
  ...
2014-08-07 08:50:34 -07:00
Joonsoo Kim
c1f733aaaf mm, CMA: change cma_declare_contiguous() to obey coding convention
Conventionally, we put output param to the end of param list and put the
'base' ahead of 'size', but cma_declare_contiguous() doesn't look like
that, so change it.

Additionally, move down cma_areas reference code to the position where
it is really needed.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:16 -07:00
Joonsoo Kim
fc95ca7284 PPC, KVM, CMA: use general CMA reserved area management framework
Now, we have general CMA reserved area management framework, so use it
for future maintainabilty.  There is no functional change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:16 -07:00
Paul Mackerras
25a2150bee KVM: PPC: Enable IRQFD support for the XICS interrupt controller
This makes it possible to use IRQFDs on platforms that use the XICS
interrupt controller.  To do this we implement kvm_irq_map_gsi() and
kvm_irq_map_chip_pin() in book3s_xics.c, so as to provide a 1-1 mapping
between global interrupt numbers and XICS interrupt source numbers.
For now, all interrupts are mapped as "IRQCHIP" interrupts, and no
MSI support is provided.

This means that kvm_set_irq can now get called with level == 0 or 1
as well as the powerpc-specific values KVM_INTERRUPT_SET,
KVM_INTERRUPT_UNSET and KVM_INTERRUPT_SET_LEVEL.  We change
ics_deliver_irq() to accept all those values, and remove its
report_status argument, as it is always false, given that we don't
support KVM_IRQ_LINE_STATUS.

This also adds support for interrupt ack notifiers to the XICS code
so that the IRQFD resampler functionality can be supported.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-05 14:26:33 +02:00
Paul Mackerras
297e21053a KVM: Give IRQFD its own separate enabling Kconfig option
Currently, the IRQFD code is conditional on CONFIG_HAVE_KVM_IRQ_ROUTING.
So that we can have the IRQFD code compiled in without having the
IRQ routing code, this creates a new CONFIG_HAVE_KVM_IRQFD, makes
the IRQFD code conditional on it instead of CONFIG_HAVE_KVM_IRQ_ROUTING,
and makes all the platforms that currently select HAVE_KVM_IRQ_ROUTING
also select HAVE_KVM_IRQFD.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-05 14:26:28 +02:00
Paul Mackerras
8ba918d488 KVM: irqchip: Provide and use accessors for irq routing table
This provides accessor functions for the KVM interrupt mappings, in
order to reduce the amount of code that accesses the fields of the
kvm_irq_routing_table struct, and restrict that code to one file,
virt/kvm/irqchip.c.  The new functions are kvm_irq_map_gsi(), which
maps from a global interrupt number to a set of IRQ routing entries,
and kvm_irq_map_chip_pin, which maps from IRQ chip and pin numbers to
a global interrupt number.

This also moves the update of kvm_irq_routing_table::chip[][]
into irqchip.c, out of the various kvm_set_routing_entry
implementations.  That means that none of the kvm_set_routing_entry
implementations need the kvm_irq_routing_table argument anymore,
so this removes it.

This does not change any locking or data lifetime rules.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-05 14:26:16 +02:00
Paolo Bonzini
478d668625 KVM: PPC: drop duplicate tracepoint
Commit 29577fc00b ("KVM: PPC: HV: Remove generic instruction emulation")
caused a build failure with allyesconfig:

arch/powerpc/kvm/kvm-pr.o:(__tracepoints+0xa8): multiple definition of `__tracepoint_kvm_ppc_instr'
arch/powerpc/kvm/kvm.o:(__tracepoints+0x1c0): first defined here

due to a duplicate definition of the tracepoint in trace.h and
trace_pr.h.  Because the tracepoint is still used by Book3S HV
code, and because the PR code does include trace.h, just remove
the duplicate definition from trace_pr.h, and export it from
kvm.o.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-05 14:23:41 +02:00
Paolo Bonzini
cc568ead3c Patch queue for ppc - 2014-08-01
Highlights in this release include:
 
   - BookE: Rework instruction fetch, not racy anymore now
   - BookE HV: Fix ONE_REG accessors for some in-hardware registers
   - Book3S: Good number of LE host fixes, enable HV on LE
   - Book3S: Some misc bug fixes
   - Book3S HV: Add in-guest debug support
   - Book3S HV: Preload cache lines on context switch
   - Remove 440 support
 
 Alexander Graf (31):
       KVM: PPC: Book3s PR: Disable AIL mode with OPAL
       KVM: PPC: Book3s HV: Fix tlbie compile error
       KVM: PPC: Book3S PR: Handle hyp doorbell exits
       KVM: PPC: Book3S PR: Fix ABIv2 on LE
       KVM: PPC: Book3S PR: Fix sparse endian checks
       PPC: Add asm helpers for BE 32bit load/store
       KVM: PPC: Book3S HV: Make HTAB code LE host aware
       KVM: PPC: Book3S HV: Access guest VPA in BE
       KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
       KVM: PPC: Book3S HV: Access XICS in BE
       KVM: PPC: Book3S HV: Fix ABIv2 on LE
       KVM: PPC: Book3S HV: Enable for little endian hosts
       KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
       KVM: PPC: Deflect page write faults properly in kvmppc_st
       KVM: PPC: Book3S: Stop PTE lookup on write errors
       KVM: PPC: Book3S: Add hack for split real mode
       KVM: PPC: Book3S: Make magic page properly 4k mappable
       KVM: PPC: Remove 440 support
       KVM: Rename and add argument to check_extension
       KVM: Allow KVM_CHECK_EXTENSION on the vm fd
       KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
       KVM: PPC: Implement kvmppc_xlate for all targets
       KVM: PPC: Move kvmppc_ld/st to common code
       KVM: PPC: Remove kvmppc_bad_hva()
       KVM: PPC: Use kvm_read_guest in kvmppc_ld
       KVM: PPC: Handle magic page in kvmppc_ld/st
       KVM: PPC: Separate loadstore emulation from priv emulation
       KVM: PPC: Expose helper functions for data/inst faults
       KVM: PPC: Remove DCR handling
       KVM: PPC: HV: Remove generic instruction emulation
       KVM: PPC: PR: Handle FSCR feature deselects
 
 Alexey Kardashevskiy (1):
       KVM: PPC: Book3S: Fix LPCR one_reg interface
 
 Aneesh Kumar K.V (4):
       KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
       KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
       KVM: PPC: BOOK3S: PR: Emulate instruction counter
       KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
 
 Anton Blanchard (2):
       KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
       KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
 
 Bharat Bhushan (10):
       kvm: ppc: bookehv: Added wrapper macros for shadow registers
       kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
       kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
       kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
       kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
       kvm: ppc: Add SPRN_EPR get helper function
       kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
       KVM: PPC: Booke-hv: Add one reg interface for SPRG9
       KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
       KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
 
 Michael Neuling (1):
       KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
 
 Mihai Caraman (8):
       KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
       KVM: PPC: e500: Fix default tlb for victim hint
       KVM: PPC: e500: Emulate power management control SPR
       KVM: PPC: e500mc: Revert "add load inst fixup"
       KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
       KVM: PPC: Book3s: Remove kvmppc_read_inst() function
       KVM: PPC: Allow kvmppc_get_last_inst() to fail
       KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
 
 Paul Mackerras (4):
       KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
       KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
       KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
       KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
 
 Stewart Smith (2):
       Split out struct kvmppc_vcore creation to separate function
       Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJT21skAAoJECszeR4D/txgeFEP/AzJopN7s//W33CfyBqURHXp
 XALCyAw+S67gtcaTZbxomcG1xuT8Lj9WEw28iz3rCtAnJwIxsY63xrI1nXMzTaI2
 p1rC0ai5Qy+nlEbd6L78spZy/Nzh8DFYGWx78iUSO1mYD8xywJwtoiBA539pwp8j
 8N+mgn61Hwhv31bKtsZlmzXymVr/jbTp5LVuxsBLJwD2lgT49g+4uBnX2cG/iXkg
 Rzbh7LxoNNXrSPI8sYmTWu/81aeXteeX70ja6DHuV5dWLNTuAXJrh5EUfeAZqBrV
 aYcLWUYmIyB87txNmt6ZGVar2p3jr2Xhb9mKx+EN4dbehblanLc1PUqlHd0q3dKc
 Nt60ByqpZn+qDAK86dShSZLEe+GT3lovvE76CqVXD4Er+OUEkc9JoxhN1cof/Gb0
 o6uwZ2isXHRdGoZx5vb4s3UTOlwZGtoL/CyY/HD/ujYDSURkCGbxLj3kkecSY8ut
 QdDAWsC15BwsHtKLr5Zwjp2w+0eGq2QJgfvO0zqWFiz9k33SCBCUpwluFeqh27Hi
 aR5Wir3j+MIw9G8XlYlDJWYfi0h/SZ4G7hh7jSu26NBNBzQsDa8ow/cLzdMhdUwH
 OYSaeqVk5wiRb9to1uq1NQWPA0uRAx3BSjjvr9MCGRqmvn+FV5nj637YWUT+53Hi
 aSvg/U2npghLPPG2cihu
 =JuLr
 -----END PGP SIGNATURE-----

Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm

Patch queue for ppc - 2014-08-01

Highlights in this release include:

  - BookE: Rework instruction fetch, not racy anymore now
  - BookE HV: Fix ONE_REG accessors for some in-hardware registers
  - Book3S: Good number of LE host fixes, enable HV on LE
  - Book3S: Some misc bug fixes
  - Book3S HV: Add in-guest debug support
  - Book3S HV: Preload cache lines on context switch
  - Remove 440 support

Alexander Graf (31):
      KVM: PPC: Book3s PR: Disable AIL mode with OPAL
      KVM: PPC: Book3s HV: Fix tlbie compile error
      KVM: PPC: Book3S PR: Handle hyp doorbell exits
      KVM: PPC: Book3S PR: Fix ABIv2 on LE
      KVM: PPC: Book3S PR: Fix sparse endian checks
      PPC: Add asm helpers for BE 32bit load/store
      KVM: PPC: Book3S HV: Make HTAB code LE host aware
      KVM: PPC: Book3S HV: Access guest VPA in BE
      KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
      KVM: PPC: Book3S HV: Access XICS in BE
      KVM: PPC: Book3S HV: Fix ABIv2 on LE
      KVM: PPC: Book3S HV: Enable for little endian hosts
      KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
      KVM: PPC: Deflect page write faults properly in kvmppc_st
      KVM: PPC: Book3S: Stop PTE lookup on write errors
      KVM: PPC: Book3S: Add hack for split real mode
      KVM: PPC: Book3S: Make magic page properly 4k mappable
      KVM: PPC: Remove 440 support
      KVM: Rename and add argument to check_extension
      KVM: Allow KVM_CHECK_EXTENSION on the vm fd
      KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
      KVM: PPC: Implement kvmppc_xlate for all targets
      KVM: PPC: Move kvmppc_ld/st to common code
      KVM: PPC: Remove kvmppc_bad_hva()
      KVM: PPC: Use kvm_read_guest in kvmppc_ld
      KVM: PPC: Handle magic page in kvmppc_ld/st
      KVM: PPC: Separate loadstore emulation from priv emulation
      KVM: PPC: Expose helper functions for data/inst faults
      KVM: PPC: Remove DCR handling
      KVM: PPC: HV: Remove generic instruction emulation
      KVM: PPC: PR: Handle FSCR feature deselects

Alexey Kardashevskiy (1):
      KVM: PPC: Book3S: Fix LPCR one_reg interface

Aneesh Kumar K.V (4):
      KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
      KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
      KVM: PPC: BOOK3S: PR: Emulate instruction counter
      KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page

Anton Blanchard (2):
      KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
      KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()

Bharat Bhushan (10):
      kvm: ppc: bookehv: Added wrapper macros for shadow registers
      kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
      kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
      kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
      kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
      kvm: ppc: Add SPRN_EPR get helper function
      kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
      KVM: PPC: Booke-hv: Add one reg interface for SPRG9
      KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
      KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr

Michael Neuling (1):
      KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling

Mihai Caraman (8):
      KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
      KVM: PPC: e500: Fix default tlb for victim hint
      KVM: PPC: e500: Emulate power management control SPR
      KVM: PPC: e500mc: Revert "add load inst fixup"
      KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
      KVM: PPC: Book3s: Remove kvmppc_read_inst() function
      KVM: PPC: Allow kvmppc_get_last_inst() to fail
      KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

Paul Mackerras (4):
      KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
      KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
      KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
      KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication

Stewart Smith (2):
      Split out struct kvmppc_vcore creation to separate function
      Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

Conflicts:
	Documentation/virtual/kvm/api.txt
2014-08-05 09:58:11 +02:00
Mahesh Salgaonkar
0869b6fd20 powerpc/book3s: Add basic infrastructure to handle HMI in Linux.
Handle Hypervisor Maintenance Interrupt (HMI) in Linux. This patch implements
basic infrastructure to handle HMI in Linux host. The design is to invoke
opal handle hmi in real mode for recovery and set irq_pending when we hit HMI.
During check_irq_replay pull opal hmi event and print hmi info on console.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-05 16:33:48 +10:00
Alexander Graf
8e6afa36e7 KVM: PPC: PR: Handle FSCR feature deselects
We handle FSCR feature bits (well, TAR only really today) lazily when the guest
starts using them. So when a guest activates the bit and later uses that feature
we enable it for real in hardware.

However, when the guest stops using that bit we don't stop setting it in
hardware. That means we can potentially lose a trap that the guest expects to
happen because it thinks a feature is not active.

This patch adds support to drop TAR when then guest turns it off in FSCR. While
at it it also restricts FSCR access to 64bit systems - 32bit ones don't have it.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-31 10:23:46 +02:00
Alexander Graf
29577fc00b KVM: PPC: HV: Remove generic instruction emulation
Now that we have properly split load/store instruction emulation and generic
instruction emulation, we can move the generic one from kvm.ko to kvm-pr.ko
on book3s_64.

This reduces the attack surface and amount of code loaded on HV KVM kernels.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-30 15:25:49 +02:00
Alexander Graf
ce91ddc471 KVM: PPC: Remove DCR handling
DCR handling was only needed for 440 KVM. Since we removed it, we can also
remove handling of DCR accesses.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 19:29:15 +02:00
Alexander Graf
8de12015ff KVM: PPC: Expose helper functions for data/inst faults
We're going to implement guest code interpretation in KVM for some rare
corner cases. This code needs to be able to inject data and instruction
faults into the guest when it encounters them.

Expose generic APIs to do this in a reasonably subarch agnostic fashion.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 18:30:18 +02:00
Alexander Graf
d69614a295 KVM: PPC: Separate loadstore emulation from priv emulation
Today the instruction emulator can get called via 2 separate code paths. It
can either be called by MMIO emulation detection code or by privileged
instruction traps.

This is bad, as both code paths prepare the environment differently. For MMIO
emulation we already know the virtual address we faulted on, so instructions
there don't have to actually fetch that information.

Split out the two separate use cases into separate files.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 18:30:10 +02:00
Alexander Graf
c12fb43c2f KVM: PPC: Handle magic page in kvmppc_ld/st
We use kvmppc_ld and kvmppc_st to emulate load/store instructions that may as
well access the magic page. Special case it out so that we can properly access
it.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 16:35:53 +02:00
Alexander Graf
c45c551403 KVM: PPC: Use kvm_read_guest in kvmppc_ld
We have a nice and handy helper to read from guest physical address space,
so we should make use of it in kvmppc_ld as we already do for its counterpart
in kvmppc_st.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 16:33:54 +02:00
Alexander Graf
9897e88a79 KVM: PPC: Remove kvmppc_bad_hva()
We have a proper define for invalid HVA numbers. Use those instead of the
ppc specific kvmppc_bad_hva().

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 16:28:51 +02:00
Alexander Graf
35c4a7330d KVM: PPC: Move kvmppc_ld/st to common code
We have enough common infrastructure now to resolve GVA->GPA mappings at
runtime. With this we can move our book3s specific helpers to load / store
in guest virtual address space to common code as well.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 16:27:12 +02:00
Alexander Graf
7d15c06f1a KVM: PPC: Implement kvmppc_xlate for all targets
We have a nice API to find the translated GPAs of a GVA including protection
flags. So far we only use it on Book3S, but there's no reason the same shouldn't
be used on BookE as well.

Implement a kvmppc_xlate() version for BookE and clean it up to make it more
readable in general.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 16:15:50 +02:00
Aneesh Kumar K.V
63fff5c1e3 KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
When calculating the lower bits of AVA field, use the shift
count based on the base page size. Also add the missing segment
size and remove stale comment.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 16:09:17 +02:00
Alexander Graf
7a58777a33 KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
With Book3S KVM we can create both PR and HV VMs in parallel on the same
machine. That gives us new challenges on the CAPs we return - both have
different capabilities.

When we get asked about CAPs on the kvm fd, there's nothing we can do. We
can try to be smart and assume we're running HV if HV is available, PR
otherwise. However with the newly added VM CHECK_EXTENSION we can now ask
for capabilities directly on a VM which knows whether it's PR or HV.

With this patch I can successfully expose KVM PVINFO data to user space
in the PR case, fixing magic page mapping for PAPR guests.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-28 15:23:18 +02:00
Alexander Graf
784aa3d7fb KVM: Rename and add argument to check_extension
In preparation to make the check_extension function available to VM scope
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-28 15:23:17 +02:00
Stewart Smith
9678cdaae9 Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has way to store and load contents of L2 or
L2+MRU way of L3 cache". We initiate the storing of the log (list of
addresses) using the logmpp instruction and start restore by writing
to a SPR.

The logmpp instruction takes parameters in a single 64bit register:
- starting address of the table to store log of L2/L2+L3 cache contents
  - 32kb for L2
  - 128kb for L2+L3
  - Aligned relative to maximum size of the table (32kb or 128kb)
- Log control (no-op, L2 only, L2 and L3, abort logout)

We should abort any ongoing logging before initiating one.

To initiate restore, we write to the MPPR SPR. The format of what to write
to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, until end of table)
- prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
  32 cycles)

The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.

The best case scenario for doing this is when some vcores are doing very
cache heavy workloads. The worst case is when they have about 0 cache hits,
so we just generate needless memory operations.

This implementation just does L2 store/load. In my benchmarks this proves
to be useful.

Benchmark 1:
 - 16 core POWER8
 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
 - No split core/SMT
 - two guests running sysbench memory test.
   sysbench --test=memory --num-threads=8 run
 - one guest running apache bench (of default HTML page)
   ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure performance of real world application (apache)
where other guests are cache hot with their own workloads. The sysbench memory
benchmark does pointer sized writes to a (small) memory buffer in a loop.

In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times were largely unchanged.

benchmark 2:
 - Same VM config as benchmark 1
 - all three guests running sysbench memory benchmark

This benchmark aims to see if there is a positive or negative affect to this
cache heavy benchmark. Although due to the nature of the benchmark (stores) we
may not see a difference in performance, but rather hopefully an improvement
in consistency of performance (when vcore switched in, don't have to wait
many times for cachelines to be pulled in)

The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.

benchmark 3:
 - same 3 guests and CPU configuration as benchmark 1 and 2.
 - two idle guests
 - 1 guest running STREAM benchmark

This scenario also saw performance improvement with this patch. On Copy and
Scale workloads from STREAM, I got 5-6% improvement with this patch. For
Add and triad, it was around 10% (or more).

benchmark 4:
 - same 3 guests as previous benchmarks
 - two guests running sysbench --memory, distinctly different cache heavy
   workload
 - one guest running STREAM benchmark.

Similar improvements to benchmark 3.

benchmark 5:
 - 1 guest, 8 VCPUs, Ubuntu 14.04
 - Host configured with split core (SMT8, subcores-per-core=4)
 - STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board
of STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:17 +02:00
Stewart Smith
de9bdd1a60 Split out struct kvmppc_vcore creation to separate function
No code changes, just split it out to a function so that with the addition
of micro partition prefetch buffer allocation (in subsequent patch) looks
neater and doesn't require excessive indentation.

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:16 +02:00
Paul Mackerras
1b2e33b071 KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
At present, kvmppc_ld calls kvmppc_xlate, and if kvmppc_xlate returns
any error indication, it returns -ENOENT, which is taken to mean an
HPTE not found error.  However, the error could have been a segment
found (no SLB entry) or a permission error.  Similarly,
kvmppc_pte_to_hva currently does permission checking, but any error
from it is taken by kvmppc_ld to mean that the access is an emulated
MMIO access.  Also, kvmppc_ld does no execute permission checking.

This fixes these problems by (a) returning any error from kvmppc_xlate
directly, (b) moving the permission check from kvmppc_pte_to_hva
into kvmppc_ld, and (c) adding an execute permission check to kvmppc_ld.

This is similar to what was done for kvmppc_st() by commit 82ff911317c3
("KVM: PPC: Deflect page write faults properly in kvmppc_st").

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:16 +02:00
Paul Mackerras
ef1af2e296 KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
This does for PR KVM what c9438092ca ("KVM: PPC: Book3S HV: Take SRCU
read lock around kvm_read_guest() call") did for HV KVM, that is,
eliminate a "suspicious rcu_dereference_check() usage!" warning by
taking the SRCU lock around the call to kvmppc_rtas_hcall().

It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead,
since kvmppc_h_pr() is supposed to return EMULATE_* values.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:16 +02:00
Alexey Kardashevskiy
a0840240c0 KVM: PPC: Book3S: Fix LPCR one_reg interface
Unfortunately, the LPCR got defined as a 32-bit register in the
one_reg interface.  This is unfortunate because KVM allows userspace
to control the DPFD (default prefetch depth) field, which is in the
upper 32 bits.  The result is that DPFD always get set to 0, which
reduces performance in the guest.

We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID,
since that would break existing userspace binaries.  Instead we define
a new KVM_REG_PPC_LPCR_64 id which is 64-bit.  Userspace can still use
the old KVM_REG_PPC_LPCR id, but it now only modifies those fields in
the bottom 32 bits that userspace can modify (ILE, TC and AIL).
If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD
as well.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:16 +02:00
Alexander Graf
b2677b8dd8 KVM: PPC: Remove 440 support
The 440 target hasn't been properly functioning for a few releases and
before I was the only one who fixes a very serious bug that indicates to
me that nobody used it before either.

Furthermore KVM on 440 is slow to the extent of unusable.

We don't have to carry along completely unused code. Remove 440 and give
us one less thing to worry about.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:15 +02:00
Bharat Bhushan
28d2f421bc KVM: PPC: Booke-hv: Add one reg interface for SPRG9
We now support SPRG9 for guest, so also add a one reg interface for same
Note: Changes are in bookehv code only as we do not have SPRG9 on booke-pr.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:15 +02:00
Bharat Bhushan
99e99d19a8 kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
SPRN_SPRG is used by debug interrupt handler, so this is required for
debug support.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:14 +02:00
Mihai Caraman
f5250471b2 KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On book3e, KVM uses load external pid (lwepx) dedicated instruction to read
guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI
and LRAT), generated by loading a guest address, needs to be handled by KVM.
These exceptions are generated in a substituted guest translation context
(EPLC[EGS] = 1) from host context (MSR[GS] = 0).

Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1),
doing minimal checks on the fast path to avoid host performance degradation.
lwepx exceptions originate from host state (MSR[GS] = 0) which implies
additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking
at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context
Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious
too intrusive for the host.

Read guest last instruction from kvmppc_load_last_inst() by searching for the
physical address and kmap it. This address the TODO for TLB eviction and
execute-but-not-read entries, and allow us to get rid of lwepx until we are
able to handle failures.

A simple stress benchmark shows a 1% sys performance degradation compared with
previous approach (lwepx without failure handling):

time for i in `seq 1 10000`; do /bin/echo > /dev/null; done

real    0m 8.85s
user    0m 4.34s
sys     0m 4.48s

vs

real    0m 8.84s
user    0m 4.36s
sys     0m 4.44s

A solution to use lwepx and to handle its exceptions in KVM would be to temporary
highjack the interrupt vector from host. This imposes additional synchronizations
for cores like FSL e6500 that shares host IVOR registers between hardware threads.
This optimized solution can be later developed on top of this patch.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:14 +02:00
Mihai Caraman
51f047261e KVM: PPC: Allow kvmppc_get_last_inst() to fail
On book3e, guest last instruction is read on the exit path using load
external pid (lwepx) dedicated instruction. This load operation may fail
due to TLB eviction and execute-but-not-read entries.

This patch lay down the path for an alternative solution to read the guest
last instruction, by allowing kvmppc_get_lat_inst() function to fail.
Architecture specific implmentations of kvmppc_load_last_inst() may read
last guest instruction and instruct the emulation layer to re-execute the
guest in case of failure.

Make kvmppc_get_last_inst() definition common between architectures.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:14 +02:00
Mihai Caraman
9a26af64d6 KVM: PPC: Book3s: Remove kvmppc_read_inst() function
In the context of replacing kvmppc_ld() function calls with a version of
kvmppc_get_last_inst() which allow to fail, Alex Graf suggested this:

"If we get EMULATE_AGAIN, we just have to make sure we go back into the guest.
No need to inject an ISI into  the guest - it'll do that all by itself.
With an error returning kvmppc_get_last_inst we can just use completely
get rid of kvmppc_read_inst() and only use kvmppc_get_last_inst() instead."

As a intermediate step get rid of kvmppc_read_inst() and only use kvmppc_ld()
instead.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:13 +02:00
Mihai Caraman
b5741bb3d4 KVM: PPC: e500mc: Revert "add load inst fixup"
The commit 1d628af7 "add load inst fixup" made an attempt to handle
failures generated by reading the guest current instruction. The fixup
code that was added works by chance hiding the real issue.

Load external pid (lwepx) instruction, used by KVM to read guest
instructions, is executed in a subsituted guest translation context
(EPLC[EGS] = 1). In consequence lwepx's TLB error and data storage
interrupts need to be handled by KVM, even though these interrupts
are generated from host context (MSR[GS] = 0) where lwepx is executed.

Currently, KVM hooks only interrupts generated from guest context
(MSR[GS] = 1), doing minimal checks on the fast path to avoid host
performance degradation. As a result, the host kernel handles lwepx
faults searching the faulting guest data address (loaded in DEAR) in
its own Logical Partition ID (LPID) 0 context. In case a host translation
is found the execution returns to the lwepx instruction instead of the
fixup, the host ending up in an infinite loop.

Revert the commit "add load inst fixup". lwepx issue will be addressed
in a subsequent patch without needing fixup code.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:13 +02:00
Bharat Bhushan
34f754b99e kvm: ppc: Add SPRN_EPR get helper function
kvmppc_set_epr() is already defined in asm/kvm_ppc.h, So
rename and move get_epr helper function to same file.

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
[agraf: remove duplicate return]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:13 +02:00
Bharat Bhushan
c1b8a01bf9 kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
Use kvmppc_set_sprg[0-7]() and kvmppc_get_sprg[0-7]() helper
functions

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:12 +02:00
Bharat Bhushan
dc168549d9 kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
Add and use kvmppc_set_esr() and kvmppc_get_esr() helper functions

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:12 +02:00
Bharat Bhushan
a5414d4b5e kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
Uses kvmppc_set_dar() and kvmppc_get_dar() helper functions

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:12 +02:00
Bharat Bhushan
31579eea69 kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
Use kvmppc_set_srr0/srr1() and kvmppc_get_srr0/srr1() helper functions

Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:11 +02:00
Alexander Graf
89b68c96a2 KVM: PPC: Book3S: Make magic page properly 4k mappable
The magic page is defined as a 4k page of per-vCPU data that is shared
between the guest and the host to accelerate accesses to privileged
registers.

However, when the host is using 64k page size granularity we weren't quite
as strict about that rule anymore. Instead, we partially treated all of the
upper 64k as magic page and mapped only the uppermost 4k with the actual
magic contents.

This works well enough for Linux which doesn't use any memory in kernel
space in the upper 64k, but Mac OS X got upset. So this patch makes magic
page actually stay in a 4k range even on 64k page size hosts.

This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE
hosts for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:11 +02:00
Alexander Graf
c01e3f66cd KVM: PPC: Book3S: Add hack for split real mode
Today we handle split real mode by mapping both instruction and data faults
into a special virtual address space that only exists during the split mode
phase.

This is good enough to catch 32bit Linux guests that use split real mode for
copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our
instruction pointer and can map the user space process freely below there.

However, that approach fails when we're running KVM inside of KVM. Here the 1st
level last_inst reader may well be in the same virtual page as a 2nd level
interrupt handler.

It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
kernel copy_from/to_user implementation can easily overlap with user space
addresses.

The architecturally correct way to fix this would be to implement an instruction
interpreter in KVM that kicks in whenever we go into split real mode. This
interpreter however would not receive a great amount of testing and be a lot of
bloat for a reasonably isolated corner case.

So I went back to the drawing board and tried to come up with a way to make
split real mode work with a single flat address space. And then I realized that
we could get away with the same trick that makes it work for Linux:

Whenever we see an instruction address during split real mode that may collide,
we just move it higher up the virtual address space to a place that hopefully
does not collide (keep your fingers crossed!).

That approach does work surprisingly well. I am able to successfully run
Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
broken split real mode :).

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:10 +02:00
Alexander Graf
2e27ecc961 KVM: PPC: Book3S: Stop PTE lookup on write errors
When a page lookup failed because we're not allowed to write to the page, we
should not overwrite that value with another lookup on the second PTEG which
will return "page not found". Instead, we should just tell the caller that we
had a permission problem.

This fixes Mac OS X guests looping endlessly in page lookup code for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:10 +02:00
Alexander Graf
17824b5afc KVM: PPC: Deflect page write faults properly in kvmppc_st
When we have a page that we're not allowed to write to, xlate() will already
tell us -EPERM on lookup of that page. With the code as is we change it into
a "page missing" error which a guest may get confused about. Instead, just
tell the caller about the -EPERM directly.

This fixes Mac OS X guests when run with DCBZ32 emulation.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:10 +02:00
Mihai Caraman
debf27d6b9 KVM: PPC: e500: Emulate power management control SPR
For FSL e6500 core the kernel uses power management SPR register (PWRMGTCR0)
to enable idle power down for cores and devices by setting up the idle count
period at boot time. With the host already controlling the power management
configuration the guest could simply benefit from it, so emulate guest request
as a general store.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:27 +02:00
Alexander Graf
6947f948f0 KVM: PPC: Book3S HV: Enable for little endian hosts
Now that we've fixed all the issues that HV KVM code had on little endian
hosts, we can enable it in the kernel configuration for users to play with.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:26 +02:00
Alexander Graf
9bf163f86d KVM: PPC: Book3S HV: Fix ABIv2 on LE
For code that doesn't live in modules we can just branch to the real function
names, giving us compatibility with ABIv1 and ABIv2.

Do this for the compiled-in code of HV KVM.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:25 +02:00
Alexander Graf
76d072fb05 KVM: PPC: Book3S HV: Access XICS in BE
On the exit path from the guest we check what type of interrupt we received
if we received one. This means we're doing hardware access to the XICS interrupt
controller.

However, when running on a little endian system, this access is byte reversed.

So let's make sure to swizzle the bytes back again and virtually make XICS
accesses big endian.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:24 +02:00
Alexander Graf
0865a583a4 KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
Some data structures are always stored in big endian. Among those are the LPPACA
fields as well as the shadow slb. These structures might be shared with a
hypervisor.

So whenever we access those fields, make sure we do so in big endian byte order.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:23 +02:00
Alexander Graf
0240755225 KVM: PPC: Book3S HV: Access guest VPA in BE
There are a few shared data structures between the host and the guest. Most
of them get registered through the VPA interface.

These data structures are defined to always be in big endian byte order, so
let's make sure we always access them in big endian.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:22 +02:00
Alexander Graf
6f22bd3265 KVM: PPC: Book3S HV: Make HTAB code LE host aware
When running on an LE host all data structures are kept in little endian
byte order. However, the HTAB still needs to be maintained in big endian.

So every time we access any HTAB we need to make sure we do so in the right
byte order. Fix up all accesses to manually byte swap.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:22 +02:00
Mihai Caraman
d57cef91a0 KVM: PPC: e500: Fix default tlb for victim hint
Tlb search operation used for victim hint relies on the default tlb set by the
host. When hardware tablewalk support is enabled in the host, the default tlb is
TLB1 which leads KVM to evict the bolted entry. Set and restore the default tlb
when searching for victim hint.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:20 +02:00
Michael Neuling
9642382e82 KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
This adds support for the H_SET_MODE hcall.  This hcall is a
multiplexer that has several functions, some of which are called
rarely, and some which are potentially called very frequently.
Here we add support for the functions that set the debug registers
CIABR (Completed Instruction Address Breakpoint Register) and
DAWR/DAWRX (Data Address Watchpoint Register and eXtension),
since they could be updated by the guest as often as every context
switch.

This also adds a kvmppc_power8_compatible() function to test to see
if a guest is compatible with POWER8 or not.  The CIABR and DAWR/X
only exist on POWER8.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:19 +02:00
Paul Mackerras
ae2113a4f1 KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.

This also checks the default-enabled list of hcalls and prints a
warning if any hcall there is not actually implemented.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:18 +02:00
Paul Mackerras
699a0ea082 KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
This provides a way for userspace controls which sPAPR hcalls get
handled in the kernel.  Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS.  The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.

Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
The args field of the struct kvm_enable_cap specifies the hcall number
in args[0] and the enable/disable flag in args[1]; 0 means disable
in-kernel handling (so that the hcall will always cause an exit to
userspace) and 1 means enable.  Enabling or disabling in-kernel
handling of an hcall is effective across the whole VM.

The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
on PowerPC is new, added by this commit.  The KVM_CAP_ENABLE_CAP_VM
capability advertises that this ability exists.

When a VM is created, an initial set of hcalls are enabled for
in-kernel handling.  The set that is enabled is the set that have
an in-kernel implementation at this point.  Any new hcall
implementations from this point onwards should not be added to the
default set without a good reason.

No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:17 +02:00
Mihai Caraman
1f0eeb7e1a KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
On vcpu schedule, the condition checked for tlb pollution is too loose.
The tlb entries of a vcpu become polluted (vs stale) only when a different
vcpu within the same logical partition runs in-between. Optimize the tlb
invalidation condition keeping last_vcpu per logical partition id.

With the new invalidation condition, a guest shows 4% performance improvement
on P5020DS while running a memory stress application with the cpu oversubscribed,
the other guest running a cpu intensive workload.

Guest - old invalidation condition
  real 3.89
  user 3.87
  sys 0.01

Guest - enhanced invalidation condition
  real 3.75
  user 3.73
  sys 0.01

Host
  real 3.70
  user 1.85
  sys 0.00

The memory stress application accesses 4KB pages backed by 75% of available
TLB0 entries:

char foo[ENTRIES][4096] __attribute__ ((aligned (4096)));

int main()
{
	char bar;
	int i, j;

	for (i = 0; i < ITERATIONS; i++)
        	for (j = 0; j < ENTRIES; j++)
            		bar = foo[j][0];

	return 0;
}

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:16 +02:00
Alexander Graf
f396df3518 KVM: PPC: Book3S PR: Fix sparse endian checks
While sending sparse with endian checks over the code base, it triggered at
some places that were missing casts or had wrong types. Fix them up.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:16 +02:00
Alexander Graf
da166facd4 KVM: PPC: Book3S PR: Fix ABIv2 on LE
We switched to ABIv2 on Little Endian systems now which gets rid of the
dotted function names. Branch to the actual functions when we see such
a system.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:15 +02:00
Anton Blanchard
ad7d4584a2 KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
Both kvmppc_hv_entry_trampoline and kvmppc_entry_trampoline are
assembly functions that are exported to modules and also require
a valid r2.

As such we need to use _GLOBAL_TOC so we provide a global entry
point that establishes the TOC (r2).

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:14 +02:00
Anton Blanchard
05a308c722 KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
To establish addressability quickly, ABIv2 requires the target
address of the function being called to be in r12.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:13 +02:00
Alexander Graf
568fccc43f KVM: PPC: Book3S PR: Handle hyp doorbell exits
If we're running PR KVM in HV mode, we may get hypervisor doorbell interrupts.
Handle those the same way we treat normal doorbells.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:12 +02:00
Alexander Graf
fb4188bad0 KVM: PPC: Book3s PR: Disable AIL mode with OPAL
When we're using PR KVM we must not allow the CPU to take interrupts
in virtual mode, as the SLB does not contain host kernel mappings
when running inside the guest context.

To make sure we get good performance for non-KVM tasks but still
properly functioning PR KVM, let's just disable AIL whenever a vcpu
is scheduled in.

This is fundamentally different from how we deal with AIL on pSeries
type machines where we disable AIL for the whole machine as soon as
a single KVM VM is up.

The reason for that is easy - on pSeries we do not have control over
per-cpu configuration of AIL. We also don't want to mess with CPU hotplug
races and AIL configuration, so setting it per CPU is easier and more
flexible.

This patch fixes running PR KVM on POWER8 bare metal for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
2014-07-28 15:22:11 +02:00
Aneesh Kumar K.V
06da28e76b KVM: PPC: BOOK3S: PR: Emulate instruction counter
Writing to IC is not allowed in the privileged mode.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:22:10 +02:00
Aneesh Kumar K.V
8f42ab2749 KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
virtual time base register is a per VM, per cpu register that needs
to be saved and restored on vm exit and entry. Writing to VTB is not
allowed in the privileged mode.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: fix compile error]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:21:50 +02:00
Linus Torvalds
5b2b9d7761 These are mostly PPC changes for 3.16-new things. However, there is
an x86 change too and it is a regression from 3.14.  As it only affects
 nested virtualization and there were other changes in this area in 3.16,
 I am not nominating it for 3.15-stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTyTTfAAoJEBvWZb6bTYbyytIQAJare/EWQmNBDK57EcJBIlJS
 6MW2XnASEW+KCoUw0+u3sm9eaRXQdmJRb1Aw5zxTiUIR3ZSI8MDSQr1XxEgTAOtE
 vFZjonPwlbnE8edLMhH3v/6/v9oO7bwNTDYeOE2pKPRfgPRjFmj1QUOJkvzRnRwj
 kS5M4RtI+VqhdyJW8f4HaWqoRaOAISp3ZjQUJQdab3DWsf9ZpNjwLNjKzGZKNvIN
 Klcpi7JH32zawUfqnAvph/7NsrBGrpFRE+j+JU9LLnD9PehuXwqZbWh01g2Anbq2
 TKVrmXW+YnoD1IZsDw7r/14FaeRweV7yALA/eA9F4KfSyF2Qm9RbjVVdrUYz0CHV
 aIl0cZeZM8xRCLy/ZWj+dOQ23RWelZaslHSpshKOznoRsuuvVwpx93zVtRwlw2dx
 4WJ2A5gYA+ZUQ7eWjk83381JXkbRDUb3cO+NL8t9GnFctCJzT/gQHjqu15f7TJ2Q
 gKhmeciKOS3xY4sQ+ti6gv8CwIFYqgdTzkxedxSgS9xpiAmw9v57V7WukXoXB6zl
 AyjEAk9FFOeBZ5nXs0ObK5LKjI+MJoZ3X0bin7PCuT6dFrIA2yHvo5EgMvOcUua9
 8Tu9L8sWv/JsKjuqebkKxekAKvv0CV35Q8OsLpEF6RI0eXyiXy2extk1LzUuK9cx
 ZVYbN263++En/tgH2AJM
 =Vdqn
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "These are mostly PPC changes for 3.16-new things.  However, there is
  an x86 change too and it is a regression from 3.14.  As it only
  affects nested virtualization and there were other changes in this
  area in 3.16, I am not nominating it for 3.15-stable"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Check for nested events if there is an injectable interrupt
  KVM: PPC: RTAS: Do byte swaps explicitly
  KVM: PPC: Book3S PR: Fix ABIv2 on LE
  KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
  PPC: Add _GLOBAL_TOC for 32bit
  KVM: PPC: BOOK3S: HV: Use base page size when comparing against slb value
  KVM: PPC: Book3E: Unlock mmu_lock when setting caching atttribute
2014-07-21 11:19:18 -07:00
Joel Stanley
f73128f4f6 powerpc/kvm: Remove redundant save of SIER AND MMCR2
These two registers are already saved in the block above. Aside from
being unnecessary, by the time we get down to the second save location
r8 no longer contains MMCR2, so we are clobbering the saved value with
PMC5.

MMCR2 primarily consists of counter freeze bits. So restoring the value
of PMC5 into MMCR2 will most likely have the effect of freezing
counters.

Fixes: 72cde5a88d ("KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8")
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-11 12:55:07 +10:00
Paolo Bonzini
bb18b526a9 Patch queue for 3.16 - 2014-07-08
A few bug fixes to make 3.16 work well with KVM on PowerPC:
 
   - Fix ppc32 module builds
   - Fix Little Endian hosts
   - Fix Book3S HV HPTE lookup with huge pages in guest
   - Fix BookE lock leak
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJTu8HqAAoJECszeR4D/txg7u0QAMk3SC2+yBVsOAusB0YvERyU
 Y9X4Lz9fALdeZf2Fd2Qk9BD5y283LDppZqyoy+dmef9DXopfCv0Kh4rl/GrlG9ny
 aHeiBfJGpIpjqZnvkZP0Ln9zpyg7gMLRVNfNJvZWji8RcHly6m9/bxEkG0HnX6Hn
 /2UkUzOdk2aymjzMqFXdHODdC0JsGtWtGBiVC+HOtIf1D3TX42R4KI+ieOSKjGDp
 OYgN2XskOMgiXvPtEx2yMyHAAw5OTCVNdFt6Co1x0qUsz560Wy3Hy6QCwiroLrPH
 rjxkHhcQN0GJJLXs/jajdDJoEp5wYLRomReZbdrKgBj+zGvQQgGRD+RO9iyfedlm
 4hTw98tgmHcPgFTIXQlG5U8Cn0/oPr/k7FWBZJDpiUCTNRI/rsL6eHX7Wu/ylUfm
 uvcwdl5tXdM2OMHE2wEB4pEwSAK4TNGjx237txNgaeLu4ZT8yk4TQnOXlxyMJQe7
 /Bfh8oUKBqRlWAymwut8y/cazZCRDFAx88ovwqAW9GXxgB+tiCeIDLNnLYEkjEmV
 8l+viAjZz3LbzLeFxCxHnNha9WhK7A7kNGhYaWn1+N2Zlz1F3u3mQm5QoZ1UJgIH
 TtbwWsfM7jYrlUsJB1xTeL5Hs8JhOTp+kgLpMbRXe1sNX1xqh+OQZHsJ16VB6zU9
 RiOjHnv2D9/icH0B2DsW
 =+sQF
 -----END PGP SIGNATURE-----

Merge tag 'signed-for-3.16' of git://github.com/agraf/linux-2.6 into kvm-master

Patch queue for 3.16 - 2014-07-08

A few bug fixes to make 3.16 work well with KVM on PowerPC:

  - Fix ppc32 module builds
  - Fix Little Endian hosts
  - Fix Book3S HV HPTE lookup with huge pages in guest
  - Fix BookE lock leak
2014-07-08 12:08:58 +02:00
Alexander Graf
19a44ecff5 KVM: PPC: RTAS: Do byte swaps explicitly
In commit b59d9d26b we introduced implicit byte swaps for RTAS calls.
Unfortunately we messed up and didn't swizzle return values properly.

Also the old approach wasn't "sparse" compatible - we were randomly
reading __be32 values on an LE system.

Let's just do all of the swizzling explicitly with byte swaps right
where values get used. That way we can at least catch bugs using sparse.

This patch fixes XICS RTAS emulation on little endian hosts for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-07 23:17:20 +02:00
Alexander Graf
55ab169b7b KVM: PPC: Book3S PR: Fix ABIv2 on LE
We switched to ABIv2 on Little Endian systems now which gets rid of the
dotted function names. Branch to the actual functions when we see such
a system.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-07 12:02:51 +02:00
Anton Blanchard
6ed179b67c KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
Both kvmppc_hv_entry_trampoline and kvmppc_entry_trampoline are
assembly functions that are exported to modules and also require
a valid r2.

As such we need to use _GLOBAL_TOC so we provide a global entry
point that establishes the TOC (r2).

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-07 12:02:32 +02:00
Aneesh Kumar K.V
3cd60e3118 KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
We use time base for PURR and SPURR emulation with PR KVM since we
are emulating a single threaded core. When using time base
we need to make sure that we don't accumulate time spent in the host
in PURR and SPURR value.

Also we don't need to emulate mtspr because both the registers are
hypervisor resource.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-06 13:56:49 +02:00
Aneesh Kumar K.V
341acbb3aa KVM: PPC: BOOK3S: HV: Use base page size when comparing against slb value
With guests supporting Multiple page size per segment (MPSS),
hpte_page_size returns the actual page size used. Add a new function to
return base page size and use that to compare against the the page size
calculated from SLB. Without this patch a hpte lookup can fail since
we are comparing wrong page size in kvmppc_hv_find_lock_hpte.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-06-25 14:07:06 +02:00
Mihai Caraman
511c66818d KVM: PPC: Book3E: Unlock mmu_lock when setting caching atttribute
The patch 08c9a188d0
  	kvm: powerpc: use caching attributes as per linux pte
do not handle properly the error case, letting mmu_lock locked. The lock
will further generate a RCU stall from kvmppc_e500_emul_tlbwe() caller.

In case of an error go to out label.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-06-24 15:37:25 +02:00
Mahesh Salgaonkar
74845bc2fa powerpc/book3s: Fix guest MC delivery mechanism to avoid soft lockups in guest.
Currently we forward MCEs to guest which have been recovered by guest.
And for unhandled errors we do not deliver the MCE to guest. It looks like
with no support of FWNMI in qemu, guest just panics whenever we deliver the
recovered MCEs to guest. Also, the existig code used to return to host for
unhandled errors which was casuing guest to hang with soft lockups inside
guest and makes it difficult to recover guest instance.

This patch now forwards all fatal MCEs to guest causing guest to crash/panic.
And, for recovered errors we just go back to normal functioning of guest
instead of returning to host. This fixes soft lockup issues in guest.
This patch also fixes an issue where guest MCE events were not logged to
host console.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-06-11 19:15:15 +10:00
Linus Torvalds
c5aec4c76a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "Here is the bulk of the powerpc changes for this merge window.  It got
  a bit delayed in part because I wasn't paying attention, and in part
  because I discovered I had a core PCI change without a PCI maintainer
  ack in it.  Bjorn eventually agreed it was ok to merge it though we'll
  probably improve it later and I didn't want to rebase to add his ack.

  There is going to be a bit more next week, essentially fixes that I
  still want to sort through and test.

  The biggest item this time is the support to build the ppc64 LE kernel
  with our new v2 ABI.  We previously supported v2 userspace but the
  kernel itself was a tougher nut to crack.  This is now sorted mostly
  thanks to Anton and Rusty.

  We also have a fairly big series from Cedric that add support for
  64-bit LE zImage boot wrapper.  This was made harder by the fact that
  traditionally our zImage wrapper was always 32-bit, but our new LE
  toolchains don't really support 32-bit anymore (it's somewhat there
  but not really "supported") so we didn't want to rely on it.  This
  meant more churn that just endian fixes.

  This brings some more LE bits as well, such as the ability to run in
  LE mode without a hypervisor (ie. under OPAL firmware) by doing the
  right OPAL call to reinitialize the CPU to take HV interrupts in the
  right mode and the usual pile of endian fixes.

  There's another series from Gavin adding EEH improvements (one day we
  *will* have a release with less than 20 EEH patches, I promise!).

  Another highlight is the support for the "Split core" functionality on
  P8 by Michael.  This allows a P8 core to be split into "sub cores" of
  4 threads which allows the subcores to run different guests under KVM
  (the HW still doesn't support a partition per thread).

  And then the usual misc bits and fixes ..."

[ Further delayed by gmail deciding that BenH is a dirty spammer.
  Google knows.  ]

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits)
  powerpc/powernv: Add missing include to LPC code
  selftests/powerpc: Test the THP bug we fixed in the previous commit
  powerpc/mm: Check paca psize is up to date for huge mappings
  powerpc/powernv: Pass buffer size to OPAL validate flash call
  powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC()
  powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC()
  powerpc/powernv: Set memory_block_size_bytes to 256MB
  powerpc: Allow ppc_md platform hook to override memory_block_size_bytes
  powerpc/powernv: Fix endian issues in memory error handling code
  powerpc/eeh: Skip eeh sysfs when eeh is disabled
  powerpc: 64bit sendfile is capped at 2GB
  powerpc/powernv: Provide debugfs access to the LPC bus via OPAL
  powerpc/serial: Use saner flags when creating legacy ports
  powerpc: Add cpu family documentation
  powerpc/xmon: Fix up xmon format strings
  powerpc/powernv: Add calls to support little endian host
  powerpc: Document sysfs DSCR interface
  powerpc: Fix regression of per-CPU DSCR setting
  powerpc: Split __SYSFS_SPRSETUP macro
  arch: powerpc/fadump: Cleaning up inconsistent NULL checks
  ...
2014-06-10 18:54:22 -07:00
Linus Torvalds
b05d59dfce At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM.  Changes include:
 
 - a lot of s390 changes: optimizations, support for migration,
   GDB support and more
 
 - ARM changes are pretty small: support for the PSCI 0.2 hypercall
   interface on both the guest and the host (the latter acked by Catalin)
 
 - initial POWER8 and little-endian host support
 
 - support for running u-boot on embedded POWER targets
 
 - pretty large changes to MIPS too, completing the userspace interface
   and improving the handling of virtualized timer hardware
 
 - for x86, a larger set of changes is scheduled for 3.17.  Still,
   we have a few emulator bugfixes and support for running nested
   fully-virtualized Xen guests (para-virtualized Xen guests have
   always worked).  And some optimizations too.
 
 The only missing architecture here is ia64.  It's not a coincidence
 that support for KVM on ia64 is scheduled for removal in 3.17.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
 BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
 j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
 W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
 0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
 CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
 nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
 TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
 ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
 VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
 3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
 bAknaZpPiOrW11Et1htY
 =j5Od
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next

Pull KVM updates from Paolo Bonzini:
 "At over 200 commits, covering almost all supported architectures, this
  was a pretty active cycle for KVM.  Changes include:

   - a lot of s390 changes: optimizations, support for migration, GDB
     support and more

   - ARM changes are pretty small: support for the PSCI 0.2 hypercall
     interface on both the guest and the host (the latter acked by
     Catalin)

   - initial POWER8 and little-endian host support

   - support for running u-boot on embedded POWER targets

   - pretty large changes to MIPS too, completing the userspace
     interface and improving the handling of virtualized timer hardware

   - for x86, a larger set of changes is scheduled for 3.17.  Still, we
     have a few emulator bugfixes and support for running nested
     fully-virtualized Xen guests (para-virtualized Xen guests have
     always worked).  And some optimizations too.

  The only missing architecture here is ia64.  It's not a coincidence
  that support for KVM on ia64 is scheduled for removal in 3.17"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
  KVM: add missing cleanup_srcu_struct
  KVM: PPC: Book3S PR: Rework SLB switching code
  KVM: PPC: Book3S PR: Use SLB entry 0
  KVM: PPC: Book3S HV: Fix machine check delivery to guest
  KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
  KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
  KVM: PPC: Book3S HV: Fix dirty map for hugepages
  KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
  KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
  KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
  KVM: PPC: Book3S: Add ONE_REG register names that were missed
  KVM: PPC: Add CAP to indicate hcall fixes
  KVM: PPC: MPIC: Reset IRQ source private members
  KVM: PPC: Graciously fail broken LE hypercalls
  PPC: ePAPR: Fix hypercall on LE guest
  KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
  KVM: PPC: BOOK3S: Always use the saved DAR value
  PPC: KVM: Make NX bit available with magic page
  KVM: PPC: Disable NX for old magic page using guests
  KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
  ...
2014-06-04 08:47:12 -07:00
Alexander Graf
d8d164a985 KVM: PPC: Book3S PR: Rework SLB switching code
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:30 +02:00
Alexander Graf
207438d4e2 KVM: PPC: Book3S PR: Use SLB entry 0
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.

Just enable use of SLB entry 0 for our shadow SLB code.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:30 +02:00
Paul Mackerras
000a25ddb7 KVM: PPC: Book3S HV: Fix machine check delivery to guest
The code that delivered a machine check to the guest after handling
it in real mode failed to load up r11 before calling kvmppc_msr_interrupt,
which needs the old MSR value in r11 so it can see the transactional
state there.  This adds the missing load.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Paul Mackerras
9bc01a9bc7 KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation.  The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.

The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't.  The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.

The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear).  The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Paul Mackerras
6c576e74fd KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
Current, when testing whether a page is dirty (when constructing the
bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
in the HPT entries mapping the page, and if it is 0, we consider the
page to be clean.  However, the Power ISA doesn't require processors
to set the C bit to 1 immediately when writing to a page, and in fact
allows them to delay the writeback of the C bit until they receive a
TLB invalidation for the page.  Thus it is possible that the page
could be dirty and we miss it.

Now, if there are vcpus running, this is not serious since the
collection of the dirty log is racy already - some vcpu could dirty
the page just after we check it.  But if there are no vcpus running we
should return definitive results, in case we are in the final phase of
migrating the guest.

Also, if the permission bits in the HPTE don't allow writing, then we
know that no CPU can set C.  If the HPTE was previously writable and
the page was modified, any C bit writeback would have been flushed out
by the tlbie that we did when changing the HPTE to read-only.

Otherwise we need to do a TLB invalidation even if the C bit is 0, and
then check the C bit.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Alexey Kardashevskiy
687414bebe KVM: PPC: Book3S HV: Fix dirty map for hugepages
The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has
one bit per system page (4K/64K).  Currently, we only set one bit in
the map for each HPT entry with the Change bit set, even if the HPT is
for a large page (e.g., 16MB).  Userspace then considers only the
first system page dirty, though in fact the guest may have modified
anywhere in the large page.

To fix this, we make kvm_test_clear_dirty() return the actual number
of pages that are dirty (and rename it to kvm_test_clear_dirty_npages()
to emphasize that that's what it returns).  In kvmppc_hv_get_dirty_log()
we then set that many bits in the dirty map.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:29 +02:00
Paul Mackerras
1066f7724c KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
Currently, when a huge page is faulted in for a guest, we select the
rmap chain to insert the HPTE into based on the guest physical address
that the guest tried to access.  Since there is an rmap chain for each
system page, there are many rmap chains for the area covered by a huge
page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page
HPTE could end up in any one of them.

For consistency, and to make the huge-page HPTEs easier to find, we now
put huge-page HPTEs in the rmap chain corresponding to the base address
of the huge page.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:28 +02:00
Paul Mackerras
55765483e1 KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest.  The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context.  The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest.  To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().

The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.

This should make migration more reliable.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:28 +02:00
Alexander Graf
f2e91042a8 KVM: PPC: Add CAP to indicate hcall fixes
We worked around some nasty KVM magic page hcall breakages:

  1) NX bit not honored, so ignore NX when we detect it
  2) LE guests swizzle hypercall instruction

Without these fixes in place, there's no way it would make sense to expose kvm
hypercalls to a guest. Chances are immensely high it would trip over and break.

So add a new CAP that gives user space a hint that we have workarounds for the
bugs above in place. It can use those as hint to disable PV hypercalls when
the guest CPU is anything POWER7 or higher and the host does not have fixes
in place.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:27 +02:00
Alexander Graf
aae6559651 KVM: PPC: MPIC: Reset IRQ source private members
When we reset the in-kernel MPIC controller, we forget to reset some hidden
state such as destmask and output. This state is usually set when the guest
writes to the IDR register for a specific IRQ line.

To make sure we stay in sync and don't forget hidden state, treat reset of
the IDR register as a simple write of the IDR register. That automatically
updates all the hidden state as well.

Reported-by: Paul Janzen <pcj@pauljanzen.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:26 +02:00
Alexander Graf
42188365f9 KVM: PPC: Graciously fail broken LE hypercalls
There are LE Linux guests out there that don't handle hypercalls correctly.
Instead of interpreting the instruction stream from device tree as big endian
they assume it's a little endian instruction stream and fail.

When we see an illegal instruction from such a byte reversed instruction stream,
bail out graciously and just declare every hcall as error.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:26 +02:00
Aneesh Kumar K.V
ddca156ae6 KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
Use make_dsisr instead of open coding it. This also have
the added benefit of handling alignment interrupt on additional
instructions.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:25 +02:00
Aneesh Kumar K.V
7310f3a5b0 KVM: PPC: BOOK3S: Always use the saved DAR value
Although it's optional, IBM POWER cpus always had DAR value set on
alignment interrupt. So don't try to compute these values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:25 +02:00
Alexander Graf
f3383cf80e KVM: PPC: Disable NX for old magic page using guests
Old guests try to use the magic page, but map their trampoline code inside
of an NX region.

Since we can't fix those old kernels, try to detect whether the guest is sane
or not. If not, just disable NX functionality in KVM so that old guests at
least work at all. For newer guests, add a bit that we can set to keep NX
functionality available.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:24 +02:00
Aneesh Kumar K.V
1f365bb0de KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].

We use "ibm,segment-page-sizes" device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.

This patch exposes MPSS support to KVM guest by advertising the
feature via "ibm,segment-page-sizes". It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:24 +02:00
Aneesh Kumar K.V
792fc49787 KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation
Today when KVM tries to reserve memory for the hash page table it
allocates from the normal page allocator first. If that fails it
falls back to CMA's reserved region. One of the side effects of
this is that we could end up exhausting the page allocator and
get linux into OOM conditions while we still have plenty of space
available in CMA.

This patch addresses this issue by first trying hash page table
allocation from CMA's reserved region before falling back to the normal
page allocator. So if we run out of memory, we really are out of memory.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:24 +02:00
Alexander Graf
9916d57e64 KVM: PPC: Book3S PR: Expose TM registers
POWER8 introduces transactional memory which brings along a number of new
registers and MSR bits.

Implementing all of those is a pretty big headache, so for now let's at least
emulate enough to make Linux's context switching code happy.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:23 +02:00
Alexander Graf
2e23f54413 KVM: PPC: Book3S PR: Expose EBB registers
POWER8 introduces a new facility called the "Event Based Branch" facility.
It contains of a few registers that indicate where a guest should branch to
when a defined event occurs and it's in PR mode.

We don't want to really enable EBB as it will create a big mess with !PR guest
mode while hardware is in PR and we don't really emulate the PMU anyway.

So instead, let's just leave it at emulation of all its registers.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:23 +02:00
Alexander Graf
e14e7a1e53 KVM: PPC: Book3S PR: Expose TAR facility to guest
POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.

This patch enables the guest to use TAR.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:23 +02:00
Alexander Graf
616dff8602 KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR
POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.

Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:22 +02:00
Alexander Graf
a5948fa092 KVM: PPC: Book3S PR: Emulate TIR register
In parallel to the Processor ID Register (PIR) threaded POWER8 also adds a
Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread
per core, we can just always expose 0 here.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:22 +02:00
Alexander Graf
f8f6eb0d18 KVM: PPC: Book3S PR: Ignore PMU SPRs
When we expose a POWER8 CPU into the guest, it will start accessing PMU SPRs
that we don't emulate. Just ignore accesses to them.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:22 +02:00
Alexander Graf
f24bc1ed45 KVM: PPC: Book3S: Move little endian conflict to HV KVM
With the previous patches applied, we can now successfully use PR KVM on
little endian hosts which means we can now allow users to select it.

However, HV KVM still needs some work, so let's keep the kconfig conflict
on that one.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:21 +02:00
Alexander Graf
cd087eefe6 KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions
When the host CPU we're running on doesn't support dcbz32 itself, but the
guest wants to have dcbz only clear 32 bytes of data, we loop through every
executable mapped page to search for dcbz instructions and patch them with
a special privileged instruction that we emulate as dcbz32.

The only guests that want to see dcbz act as 32byte are book3s_32 guests, so
we don't have to worry about little endian instruction ordering. So let's
just always search for big endian dcbz instructions, also when we're on a
little endian host.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:21 +02:00
Alexander Graf
5deb8e7ad8 KVM: PPC: Make shared struct aka magic page guest endian
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.

When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.

Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.

For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.

For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:21 +02:00
Alexander Graf
2743103f91 KVM: PPC: PR: Fill pvinfo hcall instructions in big endian
We expose a blob of hypercall instructions to user space that it gives to
the guest via device tree again. That blob should contain a stream of
instructions necessary to do a hypercall in big endian, as it just gets
passed into the guest and old guests use them straight away.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00
Alexander Graf
b59d9d26be KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian
When the guest does an RTAS hypercall it keeps all RTAS variables inside a
big endian data structure.

To make sure we don't have to bother about endianness inside the actual RTAS
handlers, let's just convert the whole structure to host endian before we
call our RTAS handlers and back to big endian when we return to the guest.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00
Alexander Graf
1692aa3faa KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian
The HTAB on PPC is always in big endian. When we access it via hypercalls
on behalf of the guest and we're running on a little endian host, we need
to make sure we swap the bits accordingly.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00
Alexander Graf
94810ba4ed KVM: PPC: Book3S PR: Default to big endian guest
The default MSR when user space does not define anything should be identical
on little and big endian hosts, so remove MSR_LE from it.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:20 +02:00
Alexander Graf
14a7d41dad KVM: PPC: Book3S_64 PR: Access shadow slb in big endian
The "shadow SLB" in the PACA is shared with the hypervisor, so it has to
be big endian. We access the shadow SLB during world switch, so let's make
sure we access it in big endian even when we're on a little endian host.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:19 +02:00
Alexander Graf
4e509af9f8 KVM: PPC: Book3S_64 PR: Access HTAB in big endian
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.

Wrap all accesses to the guest HTAB with big endian accessors.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:19 +02:00
Alexander Graf
860540bc50 KVM: PPC: Book3S_32: PR: Access HTAB in big endian
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.

Wrap all accesses to the guest HTAB with big endian accessors.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:19 +02:00
Alexander Graf
740f834eb2 KVM: PPC: Book3S: PR: Fix C/R bit setting
Commit 9308ab8e2d made C/R HTAB updates go byte-wise into the target HTAB.
However, it didn't update the guest's copy of the HTAB, but instead the
host local copy of it.

Write to the guest's HTAB instead.

Signed-off-by: Alexander Graf <agraf@suse.de>
CC: Paul Mackerras <paulus@samba.org>
Acked-by: Paul Mackerras <paulus@samba.org>
2014-05-30 14:26:18 +02:00
Aneesh Kumar K.V
7562c4fded KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
With debug option "sleep inside atomic section checking" enabled we get
the below WARN_ON during a PR KVM boot. This is because upstream now
have PREEMPT_COUNT enabled even if we have preempt disabled. Fix the
warning by adding preempt_disable/enable around floating point and altivec
enable.

WARNING: at arch/powerpc/kernel/process.c:156
Modules linked in: kvm_pr kvm
CPU: 1 PID: 3990 Comm: qemu-system-ppc Tainted: G        W     3.15.0-rc1+ #4
task: c0000000eb85b3a0 ti: c0000000ec59c000 task.ti: c0000000ec59c000
NIP: c000000000015c84 LR: d000000003334644 CTR: c000000000015c00
REGS: c0000000ec59f140 TRAP: 0700   Tainted: G        W      (3.15.0-rc1+)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 42000024  XER: 20000000
CFAR: c000000000015c24 SOFTE: 1
GPR00: d000000003334644 c0000000ec59f3c0 c000000000e2fa40 c0000000e2f80000
GPR04: 0000000000000800 0000000000002000 0000000000000001 8000000000000000
GPR08: 0000000000000001 0000000000000001 0000000000002000 c000000000015c00
GPR12: d00000000333da18 c00000000fb80900 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 00003fffce4e0fa1
GPR20: 0000000000000010 0000000000000001 0000000000000002 00000000100b9a38
GPR24: 0000000000000002 0000000000000000 0000000000000000 0000000000000013
GPR28: 0000000000000000 c0000000eb85b3a0 0000000000002000 c0000000e2f80000
NIP [c000000000015c84] .enable_kernel_fp+0x84/0x90
LR [d000000003334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr]
Call Trace:
[c0000000ec59f3c0] [0000000000000010] 0x10 (unreliable)
[c0000000ec59f430] [d000000003334644] .kvmppc_handle_ext+0x134/0x190 [kvm_pr]
[c0000000ec59f4c0] [d00000000324b380] .kvmppc_set_msr+0x30/0x50 [kvm]
[c0000000ec59f530] [d000000003337cac] .kvmppc_core_emulate_op_pr+0x16c/0x5e0 [kvm_pr]
[c0000000ec59f5f0] [d00000000324a944] .kvmppc_emulate_instruction+0x284/0xa80 [kvm]
[c0000000ec59f6c0] [d000000003336888] .kvmppc_handle_exit_pr+0x488/0xb70 [kvm_pr]
[c0000000ec59f790] [d000000003338d34] kvm_start_lightweight+0xcc/0xdc [kvm_pr]
[c0000000ec59f960] [d000000003336288] .kvmppc_vcpu_run_pr+0xc8/0x190 [kvm_pr]
[c0000000ec59f9f0] [d00000000324c880] .kvmppc_vcpu_run+0x30/0x50 [kvm]
[c0000000ec59fa60] [d000000003249e74] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0 [kvm]
[c0000000ec59faf0] [d000000003244948] .kvm_vcpu_ioctl+0x478/0x760 [kvm]
[c0000000ec59fcb0] [c000000000224e34] .do_vfs_ioctl+0x4d4/0x790
[c0000000ec59fd90] [c000000000225148] .SyS_ioctl+0x58/0xb0
[c0000000ec59fe30] [c00000000000a1e4] syscall_exit+0x0/0x98

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:18 +02:00
Aneesh Kumar K.V
e5ee5422f8 KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest
This patch make sure we inherit the LE bit correctly in different case
so that we can run Little Endian distro in PR mode

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:18 +02:00
Alexander Graf
8f20a3ab27 KVM: PPC: E500: Add dcbtls emulation
The dcbtls instruction is able to lock data inside the L1 cache.

We don't want to give the guest actual access to hardware cache locks,
as that could influence other VMs on the same system. But we can tell
the guest that its locking attempt failed.

By implementing the instruction we at least don't give the guest a
program exception which it definitely does not expect.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:17 +02:00
Alexander Graf
07fec1c2e7 KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR
The L1 instruction cache control register contains bits that indicate
that we're still handling a request. Mask those out when we set the SPR
so that a read doesn't assume we're still doing something.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-30 14:26:17 +02:00
Sam bobroff
1739ea9e13 powerpc: Fix regression of per-CPU DSCR setting
Since commit "efcac65 powerpc: Per process DSCR + some fixes (try#4)"
it is no longer possible to set the DSCR on a per-CPU basis.

The old behaviour was to minipulate the DSCR SPR directly but this is no
longer sufficient: the value is quickly overwritten by context switching.

This patch stores the per-CPU DSCR value in a kernel variable rather than
directly in the SPR and it is used whenever a process has not set the DSCR
itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.

Writes to the old global default (/sys/devices/system/cpu/dscr_default)
now set all of the per-CPU values and reads return the last written value.

The new per-CPU default is added to the paca_struct and is used everywhere
outside of sysfs.c instead of the old global default.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-28 13:35:40 +10:00
Michael Ellerman
3102f7843c powerpc/kvm/book3s_hv: Use threads_per_subcore in KVM
To support split core on POWER8 we need to modify various parts of the
KVM code to use threads_per_subcore instead of threads_per_core. On
systems that do not support split core threads_per_subcore ==
threads_per_core and these changes are a nop.

We use threads_per_subcore as the value reported by KVM_CAP_PPC_SMT.
This communicates to userspace that guests can only be created with
a value of threads_per_core that is less than or equal to the current
threads_per_subcore. This ensures that guests can only be created with a
thread configuration that we are able to run given the current split
core mode.

Although threads_per_subcore can change during the life of the system,
the commit that enables that will ensure that threads_per_subcore does
not change during the life of a KVM VM.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-28 13:35:37 +10:00
Michael Ellerman
441c19c8a2 powerpc/kvm/book3s_hv: Rework the secondary inhibit code
As part of the support for split core on POWER8, we want to be able to
block splitting of the core while KVM VMs are active.

The logic to do that would be exactly the same as the code we currently
have for inhibiting onlining of secondaries.

Instead of adding an identical mechanism to block split core, rework the
secondary inhibit code to be a "HV KVM is active" check. We can then use
that in both the cpu hotplug code and the upcoming split core code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-28 13:35:34 +10:00
Paolo Bonzini
5367742ad5 Patch queue for 3.15 - 2014-05-12
This request includes a few bug fixes that really shouldn't wait for the next
 release.
 
 It fixes KVM on 32bit PowerPC when built as module. It also fixes the PV KVM
 acceleration when NX gets honored by the host. Furthermore we fix transactional
 memory support and numa support on HV KVM.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJTcKFaAAoJECszeR4D/txg7qYP/RX3V32i2zQYH2NpjQrDCwtY
 Wur+CQrn/VA6xhtTK1rT2zH5rNFLt6ClhtxCMkZFfBdUE4sHi3OTlEdcvXBZjbls
 JqQ/7lBkUPN8pTpz2NHP9gvH7g6v07EruysRQNa/JZMzlwhpzWk8D7yXakaCPNY/
 JZRgVTrfKnhQ8OtXt48Bp4EmEKllbNqi9kNN7dewD2dEb3fAco3Jpk6WoeG+1f0o
 jv3NmeTsp87KaRpjvDzPb7iCe6PA7GVqvJIQpir3Rpk2Kpx0yj58AfacF+f72GOf
 CPlJGepiumJCaANhV6dbvtS49vaiiAnSvbqCil2USNl0LIGWQXdSjs5lztEuiMyr
 tAav0YSVpnIcw0HJxXug/M31VwfRjYCX3hnCCIOd3Xj2jgAqwD+Lo95uUrRGJ9TP
 75zKh8E093tOXIC9CyMaiYajpFMUrCSMgnpJ+7fpeHiyigB6yc8juFxahIHsw8q1
 NgHggroJm6QNIm8JSY/tG/YET4AT7H4ZetGP8MeeRUg0TpqQXvYpkMGB8YDouaBA
 XzxjwyTq57BOYgLGExnwW3Jj0kbqVY+ts0aDGQVGrl5YFzooGqrQ61CRmwG5BvI8
 sou3l6TJ2ng8qrc7Maw9MHca1QB3mtXD7I26T/QEfQm9NLRTTqJyaxH5J1q9siRI
 PpHVE5FKnmWPNr8JlxtC
 =t2S+
 -----END PGP SIGNATURE-----

Merge tag 'signed-for-3.15' of git://github.com/agraf/linux-2.6 into kvm-master

Patch queue for 3.15 - 2014-05-12

This request includes a few bug fixes that really shouldn't wait for the next
release.

It fixes KVM on 32bit PowerPC when built as module. It also fixes the PV KVM
acceleration when NX gets honored by the host. Furthermore we fix transactional
memory support and numa support on HV KVM.
2014-05-13 18:15:16 +02:00
Benjamin Herrenschmidt
f6869e7fe6 Merge remote-tracking branch 'anton/abiv2' into next
This series adds support for building the powerpc 64-bit
LE kernel using the new ABI v2. We already supported
running ABI v2 userspace programs but this adds support
for building the kernel itself using the new ABI.
2014-05-05 20:57:12 +10:00
Alexander Graf
ab78475c76 KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit
The book3s_32 target can get built as module which means we don't see the
config define for it in code. Instead, check on the bool define
CONFIG_KVM_BOOK3S_32_HANDLER whenever we want to know whether we're building
for a book3s_32 host.

This fixes running book3s_32 kvm as a module for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-04-28 12:35:42 +02:00
Paul Mackerras
0a8eccefcb KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit
Testing by Michael Neuling revealed that commit e4e3812150 ("KVM:
PPC: Book3S HV: Add transactional memory support") is missing the code
that saves away the checkpointed state of the guest when switching to
the host.  This adds that code, which was in earlier versions of the
patch but went missing somehow.

Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-04-28 12:35:41 +02:00
pingfank@linux.vnet.ibm.com
1ad9f23873 KVM: PPC: Book3S: HV: make _PAGE_NUMA take effect
Numa fault is a method which help to achieve auto numa balancing.
When such a page fault takes place, the page fault handler will check
whether the page is placed correctly. If not, migration should be
involved to cut down the distance between the cpu and pages.

A pte with _PAGE_NUMA help to implement numa fault. It means not to
allow the MMU to access the page directly. So a page fault is triggered
and numa fault handler gets the opportunity to run checker.

As for the access of MMU, we need special handling for the powernv's guest.
When we mark a pte with _PAGE_NUMA, we already call mmu_notifier to
invalidate it in guest's htab, but when we tried to re-insert them,
we firstly try to map it in real-mode. Only after this fails, we fallback
to virt mode, and most of important, we run numa fault handler in virt
mode.  This patch guards the way of real-mode to ensure that if a pte is
marked with _PAGE_NUMA, it will NOT be mapped in real mode, instead, it will
be mapped in virt mode and have the opportunity to be checked with placement.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-04-28 12:35:41 +02:00
Preeti U Murthy
582b910eda ppc/kvm: Clear the runlatch bit of a vcpu before napping
When the guest cedes the vcpu or the vcpu has no guest to
run it naps. Clear the runlatch bit of the vcpu before
napping to indicate an idle cpu.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-28 16:32:49 +10:00
Preeti U Murthy
fd17dc7b9a ppc/kvm: Set the runlatch bit of a CPU just before starting guest
The secondary threads in the core are kept offline before launching guests
in kvm on powerpc: "371fefd6f2dc4666:KVM: PPC: Allow book3s_hv guests to use
SMT processor modes."

Hence their runlatch bits are cleared. When the secondary threads are called
in to start a guest, their runlatch bits need to be set to indicate that they
are busy. The primary thread has its runlatch bit set though, but there is no
harm in setting this bit once again. Hence set the runlatch bit for all
threads before they start guest.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-28 16:32:45 +10:00
Anton Blanchard
c1fb019477 powerpc: Create DOTSYM to wrap dot symbol usage
There are a few places we have to use dot symbols with the
current ABI - the syscall table and the kvm hcall table.

Wrap both of these with a new macro called DOTSYM so it will
be easy to transition away from dot symbols in a future ABI.

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:19 +10:00
Anton Blanchard
b1576fec7f powerpc: No need to use dot symbols when branching to a function
binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the functions text address.

Alan tells me that binutils has been doing this for 9 years.

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:16 +10:00
Linus Torvalds
7cbb39d4d4 Merge tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "PPC and ARM do not have much going on this time.  Most of the cool
  stuff, instead, is in s390 and (after a few releases) x86.

  ARM has some caching fixes and PPC has transactional memory support in
  guests.  MIPS has some fixes, with more probably coming in 3.16 as
  QEMU will soon get support for MIPS KVM.

  For x86 there are optimizations for debug registers, which trigger on
  some Windows games, and other important fixes for Windows guests.  We
  now expose to the guest Broadwell instruction set extensions and also
  Intel MPX.  There's also a fix/workaround for OS X guests, nested
  virtualization features (preemption timer), and a couple kvmclock
  refinements.

  For s390, the main news is asynchronous page faults, together with
  improvements to IRQs (floating irqs and adapter irqs) that speed up
  virtio devices"

* tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits)
  KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
  KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
  KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
  KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
  KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
  KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
  KVM: PPC: Book3S HV: Add transactional memory support
  KVM: Specify byte order for KVM_EXIT_MMIO
  KVM: vmx: fix MPX detection
  KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
  KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
  KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
  KVM: s390: clear local interrupts at cpu initial reset
  KVM: s390: Fix possible memory leak in SIGP functions
  KVM: s390: fix calculation of idle_mask array size
  KVM: s390: randomize sca address
  KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
  KVM: Bump KVM_MAX_IRQ_ROUTES for s390
  KVM: s390: irq routing for adapter interrupts.
  KVM: s390: adapter interrupt sources
  ...
2014-04-02 14:50:10 -07:00
Linus Torvalds
235c7b9feb Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull main powerpc updates from Ben Herrenschmidt:
 "This time around, the powerpc merges are going to be a little bit more
  complicated than usual.

  This is the main pull request with most of the work for this merge
  window.  I will describe it a bit more further down.

  There is some additional cpuidle driver work, however I haven't
  included it in this tree as it depends on some work in tip/timer-core
  which Thomas accidentally forgot to put in a topic branch.  Since I
  didn't want to carry all of that tip timer stuff in powerpc -next, I
  setup a separate branch on top of Thomas tree with just that cpuidle
  driver in it, and Stephen has been carrying that in next separately
  for a while now.  I'll send a separate pull request for it.

  Additionally, two new pieces in this tree add users for a sysfs API
  that Tejun and Greg have been deprecating in drivers-core-next.
  Thankfully Greg reverted the patch that removes the old API so this
  merge can happen cleanly, but once merged, I will send a patch
  adjusting our new code to the new API so that Greg can send you the
  removal patch.

  Now as for the content of this branch, we have a lot of perf work for
  power8 new counters including support for our new "nest" counters
  (also called 24x7) under pHyp (not natively yet).

  We have new functionality when running under the OPAL firmware
  (non-virtualized or KVM host), such as access to the firmware error
  logs and service processor dumps, system parameters and sensors, along
  with a hwmon driver for the latter.

  There's also a bunch of bug fixes accross the board, some LE fixes,
  and a nice set of selftests for validating our various types of copy
  loops.

  On the Freescale side, we see mostly new chip/board revisions, some
  clock updates, better support for machine checks and debug exceptions,
  etc..."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (70 commits)
  powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
  powerpc/compat: 32-bit little endian machine name is ppcle, not ppc
  powerpc/le: Big endian arguments for ppc_rtas()
  powerpc: Use default set of netfilter modules (CONFIG_NETFILTER_ADVANCED=n)
  powerpc/defconfigs: Enable THP in pseries defconfig
  powerpc/mm: Make sure a local_irq_disable prevent a parallel THP split
  powerpc: Rate-limit users spamming kernel log buffer
  powerpc/perf: Fix handling of L3 events with bank == 1
  powerpc/perf/hv_{gpci, 24x7}: Add documentation of device attributes
  powerpc/perf: Add kconfig option for hypervisor provided counters
  powerpc/perf: Add support for the hv 24x7 interface
  powerpc/perf: Add support for the hv gpci (get performance counter info) interface
  powerpc/perf: Add macros for defining event fields & formats
  powerpc/perf: Add a shared interface to get gpci version and capabilities
  powerpc/perf: Add 24x7 interface headers
  powerpc/perf: Add hv_gpci interface header
  powerpc: Add hvcalls for 24x7 and gpci (Get Performance Counter Info)
  sysfs: create bin_attributes under the requested group
  powerpc/perf: Enable BHRB access for EBB events
  powerpc/perf: Add BHRB constraint and IFM MMCRA handling for EBB
  ...
2014-04-02 13:42:59 -07:00
Paolo Bonzini
7227fc0666 Merge branch 'kvm-ppchv-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next 2014-03-29 15:44:05 +01:00
Paul Mackerras
72cde5a88d KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
Currently we save the host PMU configuration, counter values, etc.,
when entering a guest, and restore it on return from the guest.
(We have to do this because the guest has control of the PMU while
it is executing.)  However, we missed saving/restoring the SIAR and
SDAR registers, as well as the registers which are new on POWER8,
namely SIER and MMCR2.

This adds code to save the values of these registers when entering
the guest and restore them on exit.  This also works around the bug
in POWER8 where setting PMAE with a counter already negative doesn't
generate an interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:52 +11:00
Paul Mackerras
c5fb80d3b2 KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
Commit c7699822bc21 ("KVM: PPC: Book3S HV: Make physical thread 0 do
the MMU switching") reordered the guest entry/exit code so that most
of the guest register save/restore code happened in guest MMU context.
A side effect of that is that the timebase still contains the guest
timebase value at the point where we compute and use vcpu->arch.dec_expires,
and therefore that is now a guest timebase value rather than a host
timebase value.  That in turn means that the timeouts computed in
kvmppc_set_timer() are wrong if the timebase offset for the guest is
non-zero.  The consequence of that is things such as "sleep 1" in a
guest after migration may sleep for much longer than they should.

This fixes the problem by converting between guest and host timebase
values as necessary, by adding or subtracting the timebase offset.
This also fixes an incorrect comment.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:39 +11:00
Paul Mackerras
797f9c07eb KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running.  However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.

To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode.  kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
kvm_dereference_check().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:35 +11:00
Paul Mackerras
739e2425fe KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
If an attempt is made to load the kvm-hv module on a machine which
doesn't have hypervisor mode available, return an ENODEV error,
which is the conventional thing to return to indicate that this
module is not applicable to the hardware of the current machine,
rather than EIO, which causes a warning to be printed.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:29 +11:00
Paul Mackerras
b24f36f33e KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
The in-kernel emulation of RTAS functions needs to read the argument
buffer from guest memory in order to find out what function is being
requested.  The guest supplies the guest physical address of the buffer,
and on a real system the code that reads that buffer would run in guest
real mode.  In guest real mode, the processor ignores the top 4 bits
of the address specified in load and store instructions.  In order to
emulate that behaviour correctly, we need to mask off those bits
before calling kvm_read_guest() or kvm_write_guest().  This adds that
masking.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:23 +11:00
Michael Neuling
a7d80d01c6 KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
This adds code to get/set_one_reg to read and write the new transactional
memory (TM) state.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:17 +11:00
Michael Neuling
e4e3812150 KVM: PPC: Book3S HV: Add transactional memory support
This adds saving of the transactional memory (TM) checkpointed state
on guest entry and exit.  We only do this if we see that the guest has
an active transaction.

It also adds emulation of the TM state changes when delivering IRQs
into the guest.  According to the architecture, if we are
transactional when an IRQ occurs, the TM state is changed to
suspended, otherwise it's left unchanged.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-29 19:58:02 +11:00
Anton Blanchard
7505258c5f KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
I noticed KVM is broken when KVM in-kernel XICS emulation
(CONFIG_KVM_XICS) is disabled.

The problem was introduced in 48eaef05 (KVM: PPC: Book3S HV: use
xics_wake_cpu only when defined). It used CONFIG_KVM_XICS to wrap
xics_wake_cpu, where CONFIG_PPC_ICP_NATIVE should have been
used.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-26 23:34:56 +11:00
Laurent Dufour
69e9fbb278 KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
This introduces the H_GET_TCE hypervisor call, which is basically the
reverse of H_PUT_TCE, as defined in the Power Architecture Platform
Requirements (PAPR).

The hcall H_GET_TCE is required by the kdump kernel, which uses it to
retrieve TCEs set up by the previous (panicked) kernel.

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-26 23:34:27 +11:00
Greg Kurz
e59d24e612 KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
When the guest does an MMIO write which is handled successfully by an
ioeventfd, ioeventfd_write() returns 0 (success) and
kvmppc_handle_store() returns EMULATE_DONE.  Then
kvmppc_emulate_mmio() converts EMULATE_DONE to RESUME_GUEST_NV and
this causes an exit from the loop in kvmppc_vcpu_run_hv(), causing an
exit back to userspace with a bogus exit reason code, typically
causing userspace (e.g. qemu) to crash with a message about an unknown
exit code.

This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv() in order
to fix that.  For generality, we define a helper to check for either
of the return-to-guest codes we use, RESUME_GUEST and RESUME_GUEST_NV,
to make it easy to check for either and provide one place to update if
any other return-to-guest code gets defined in future.

Since it only affects Book3S HV for now, the helper is added to
the kvm_book3s.h header file.

We use the helper in two places in kvmppc_run_core() as well for
future-proofing, though we don't see RESUME_GUEST_NV in either place
at present.

[paulus@samba.org - combined 4 patches into one, rewrote description]

Suggested-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-26 23:33:44 +11:00
Scott Wood
a3dc620743 powerpc/booke64: Use SPRG_TLB_EXFRAME on bolted handlers
While bolted handlers (including e6500) do not need to deal with a TLB
miss recursively causing another TLB miss, nested TLB misses can still
happen with crit/mc/debug exceptions -- so we still need to honor
SPRG_TLB_EXFRAME.

We don't need to spend time modifying it in the TLB miss fastpath,
though -- the special level exception will handle that.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: kvm-ppc@vger.kernel.org
2014-03-19 19:57:15 -05:00
Scott Wood
9d378dfac8 powerpc/booke64: Use SPRG7 for VDSO
Previously SPRG3 was marked for use by both VDSO and critical
interrupts (though critical interrupts were not fully implemented).

In commit 8b64a9dfb0 ("powerpc/booke64:
Use SPRG0/3 scratch for bolted TLB miss & crit int"), Mihai Caraman
made an attempt to resolve this conflict by restoring the VDSO value
early in the critical interrupt, but this has some issues:

 - It's incompatible with EXCEPTION_COMMON which restores r13 from the
   by-then-overwritten scratch (this cost me some debugging time).
 - It forces critical exceptions to be a special case handled
   differently from even machine check and debug level exceptions.
 - It didn't occur to me that it was possible to make this work at all
   (by doing a final "ld r13, PACA_EXCRIT+EX_R13(r13)") until after
   I made (most of) this patch. :-)

It might be worth investigating using a load rather than SPRG on return
from all exceptions (except TLB misses where the scratch never leaves
the SPRG) -- it could save a few cycles.  Until then, let's stick with
SPRG for all exceptions.

Since we cannot use SPRG4-7 for scratch without corrupting the state of
a KVM guest, move VDSO to SPRG7 on book3e.  Since neither SPRG4-7 nor
critical interrupts exist on book3s, SPRG3 is still used for VDSO
there.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: kvm-ppc@vger.kernel.org
2014-03-19 19:57:14 -05:00
Linus Torvalds
4907cdca72 A fix for a PowerPC bug that was introduced during the 3.14 merge window.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTItl6AAoJEBvWZb6bTYby/EMP/A+/l9rU9oji5nns2ei2YVf7
 kmnpshb2rgDP6gTtPet3qPKcSUNq4bG7zPEZ04FG5H70PduDYBWchfT/jTZKxcBB
 rZgtrsMAzGjoxvtsQAc1yQiMtlkG28HD3yn0TC7BvazTQlZnzHVZTmiCZc9r0Qo7
 SjLslo00ITbhRVwkLi1cuSeuaNcim1WDgxeUFhpRd48HoywtkzGLGuU8dWmakwb0
 IDoo6X9srNlmZgrRdMaBS3agQtt1GoOE2T/KaJbnJ6lmnkGzBVKrw8kIZamQlwnK
 GnmJAM//1zu4eoWJquNuc9wqsZK2t0e68LXAL9HLeMQGyl1YBh5hTiInTUnXiCAI
 D/r/rkF2lqPgKotDBY/Q1l0bHABR53MOwa5ooZANLatnTwTv4gSpM8L1CAnekGhq
 dxqVYqeJr1+ZemhalO1pbi/V9CfpJHBBye9JDan8rJmkj25kCH7PF3BxLYgKckQs
 OtfZyOqpsZSCvpzFQa70VmWTGDqOxk686/R9ql88sdGbgyzZOHxHCeMIoAgCah6c
 o5s1CktKKLOmC2M6NNn1z44V1/LmmC2fgKuaSjI/y7rjgGwRoGrAV1giVMdnTc5y
 hGo4fD83/zHJnkafVdBQbmkKP69bC65a0w4iLXSxOKkPtHCI3c2IrZpovAC42aTp
 iREImat0+YzW4hDRI+qa
 =uZp5
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull another kvm fix from Paolo Bonzini:
 "A fix for a PowerPC bug that was introduced during the 3.14 merge
  window"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVE
  KVM: PPC: Book3S HV: Remove bogus duplicate code
2014-03-18 11:32:08 -07:00
Paolo Bonzini
8fbb1daf3e Merge branch 'kvm-ppc-fix' into HEAD 2014-03-14 16:06:30 +01:00
Paul Mackerras
e724f080f5 KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVE
Commit 595e4f7e69 ("KVM: PPC: Book3S HV: Use load/store_fp_state
functions in HV guest entry/exit") changed the register usage in
kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the
instructions that load and save VRSAVE.  The result is that the
VRSAVE value was loaded from a constant address, and saved to a
location past the end of the vcpu struct, causing host kernel
memory corruption and various kinds of host kernel crashes.

This fixes the problem by using register r31, which contains the
vcpu pointer, instead of r3 and r4.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-13 10:47:01 +01:00
Paul Mackerras
a5b0ccb0b5 KVM: PPC: Book3S HV: Remove bogus duplicate code
Commit 7b490411c3 ("KVM: PPC: Book3S HV: Add new state for
transactional memory") incorrectly added some duplicate code to the
guest exit path because I didn't manage to clean up after a rebase
correctly.  This removes the extraneous material.  The presence of
this extraneous code causes host crashes whenever a guest is run.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-13 10:46:52 +01:00
Linus Torvalds
e2a0f813e0 Second batch of KVM updates. Some minor x86 fixes,
two s390 guest features that need some handling in the host,
 and all the PPC changes.  The PPC changes include support for
 little-endian guests and enablement for new POWER8 features.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS6UF5AAoJEBvWZb6bTYby55kP/AgTJnyu7avN653/2aSHkjkx
 KgYSMYhZPIFoY5LyZuNetXaoXFRvCykux1VYSZ6V6s35h2PZ+hdJNbHGjFYKPGTq
 FQ92xQVNuWCAPxmFCjDNuDV/0BauG5y08/Orh/jpjz+GAfH43LruUQGbtXUuyJ8u
 vf+yTHniU5gguqsAmodqjHUgbf+GoPJ1j7hmRoWwt8IWm7Ns3v/IK4l0p6G0h26a
 RjE6aK+Tm208Yr5hD/dRAqeTbBNt3c4xub+QPsKoiEMaZBSuAOiux7D3Kx+If1gp
 WsmqEQxoymihVtkZhUFO9ONLJepvmG2QwJVVyMSUW9iqxX9rraXsvVyVMwcQAhog
 JuOAYxBftH07xu6Fs4eym5KvCFghM+EaJvxxt+kgnvdD4htK1+eK5trntc2zygSi
 /qGiIrkqjXpkskW8kujLayF0eAU3CrZvFWveEPBfFgYiOGX/2wzJCtSm/bt9Jo0M
 v60qgNFK3LNqAyeEfnm9VtlwGr6ZgsAB6DHNPX4fM5s2IBjL+qloXk/e/+aVKkW0
 I3yeRdy/ExhLAab6w81JtMeR7G3YS0UNuAEVvcoxzNb5wIBY8qnpfUzTKyMxQR94
 64EVpxWEYO1s55eCCyMujWrSvc+YAwhJcWHGKgC4K7mxxLD3FVyQXX6YZvgRozMX
 HjQju+DToj9CskyrFlRL
 =yd0Z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "Second batch of KVM updates.  Some minor x86 fixes, two s390 guest
  features that need some handling in the host, and all the PPC changes.

  The PPC changes include support for little-endian guests and
  enablement for new POWER8 features"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (45 commits)
  x86, kvm: correctly access the KVM_CPUID_FEATURES leaf at 0x40000101
  x86, kvm: cache the base of the KVM cpuid leaves
  kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef
  KVM: PPC: Book3S PR: Cope with doorbell interrupts
  KVM: PPC: Book3S HV: Add software abort codes for transactional memory
  KVM: PPC: Book3S HV: Add new state for transactional memory
  powerpc/Kconfig: Make TM select VSX and VMX
  KVM: PPC: Book3S HV: Basic little-endian guest support
  KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
  KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
  KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
  KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
  KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
  KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
  KVM: PPC: Book3S HV: Add handler for HV facility unavailable
  KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
  KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
  KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
  KVM: PPC: Book3S HV: Don't set DABR on POWER8
  kvm/ppc: IRQ disabling cleanup
  ...
2014-01-31 08:37:32 -08:00
Paolo Bonzini
b73117c493 Merge branch 'kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm-queue
Conflicts:
	arch/powerpc/kvm/book3s_hv_rmhandlers.S
	arch/powerpc/kvm/booke.c
2014-01-29 18:29:01 +01:00
Linus Torvalds
1b17366d69 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "So here's my next branch for powerpc.  A bit late as I was on vacation
  last week.  It's mostly the same stuff that was in next already, I
  just added two patches today which are the wiring up of lockref for
  powerpc, which for some reason fell through the cracks last time and
  is trivial.

  The highlights are, in addition to a bunch of bug fixes:

   - Reworked Machine Check handling on kernels running without a
     hypervisor (or acting as a hypervisor).  Provides hooks to handle
     some errors in real mode such as TLB errors, handle SLB errors,
     etc...

   - Support for retrieving memory error information from the service
     processor on IBM servers running without a hypervisor and routing
     them to the memory poison infrastructure.

   - _PAGE_NUMA support on server processors

   - 32-bit BookE relocatable kernel support

   - FSL e6500 hardware tablewalk support

   - A bunch of new/revived board support

   - FSL e6500 deeper idle states and altivec powerdown support

  You'll notice a generic mm change here, it has been acked by the
  relevant authorities and is a pre-req for our _PAGE_NUMA support"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
  powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
  powerpc: Add support for the optimised lockref implementation
  powerpc/powernv: Call OPAL sync before kexec'ing
  powerpc/eeh: Escalate error on non-existing PE
  powerpc/eeh: Handle multiple EEH errors
  powerpc: Fix transactional FP/VMX/VSX unavailable handlers
  powerpc: Don't corrupt transactional state when using FP/VMX in kernel
  powerpc: Reclaim two unused thread_info flag bits
  powerpc: Fix races with irq_work
  Move precessing of MCE queued event out from syscall exit path.
  pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
  powerpc: Make add_system_ram_resources() __init
  powerpc: add SATA_MV to ppc64_defconfig
  powerpc/powernv: Increase candidate fw image size
  powerpc: Add debug checks to catch invalid cpu-to-node mappings
  powerpc: Fix the setup of CPU-to-Node mappings during CPU online
  powerpc/iommu: Don't detach device without IOMMU group
  powerpc/eeh: Hotplug improvement
  powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
  powerpc/eeh: Add restore_config operation
  ...
2014-01-27 21:11:26 -08:00
Paul Mackerras
4068890931 KVM: PPC: Book3S PR: Cope with doorbell interrupts
When the PR host is running on a POWER8 machine in POWER8 mode, it
will use doorbell interrupts for IPIs.  If one of them arrives while
we are in the guest, we pop out of the guest with trap number 0xA00,
which isn't handled by kvmppc_handle_exit_pr, leading to the following
BUG_ON:

[  331.436215] exit_nr=0xa00 | pc=0x1d2c | msr=0x800000000000d032
[  331.437522] ------------[ cut here ]------------
[  331.438296] kernel BUG at arch/powerpc/kvm/book3s_pr.c:982!
[  331.439063] Oops: Exception in kernel mode, sig: 5 [#2]
[  331.439819] SMP NR_CPUS=1024 NUMA pSeries
[  331.440552] Modules linked in: tun nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw virtio_net kvm binfmt_misc ibmvscsi scsi_transport_srp scsi_tgt virtio_blk
[  331.447614] CPU: 11 PID: 1296 Comm: qemu-system-ppc Tainted: G      D      3.11.7-200.2.fc19.ppc64p7 #1
[  331.448920] task: c0000003bdc8c000 ti: c0000003bd32c000 task.ti: c0000003bd32c000
[  331.450088] NIP: d0000000025d6b9c LR: d0000000025d6b98 CTR: c0000000004cfdd0
[  331.451042] REGS: c0000003bd32f420 TRAP: 0700   Tainted: G      D       (3.11.7-200.2.fc19.ppc64p7)
[  331.452331] MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>  CR: 28004824  XER: 20000000
[  331.454616] SOFTE: 1
[  331.455106] CFAR: c000000000848bb8
[  331.455726]
GPR00: d0000000025d6b98 c0000003bd32f6a0 d0000000026017b8 0000000000000032
GPR04: c0000000018627f8 c000000001873208 320d0a3030303030 3030303030643033
GPR08: c000000000c490a8 0000000000000000 0000000000000000 0000000000000002
GPR12: 0000000028004822 c00000000fdc6300 0000000000000000 00000100076ec310
GPR16: 000000002ae343b8 00003ffffd397398 0000000000000000 0000000000000000
GPR20: 00000100076f16f4 00000100076ebe60 0000000000000008 ffffffffffffffff
GPR24: 0000000000000000 0000008001041e60 0000000000000000 0000008001040ce8
GPR28: c0000003a2d80000 0000000000000a00 0000000000000001 c0000003a2681810
[  331.466504] NIP [d0000000025d6b9c] .kvmppc_handle_exit_pr+0x75c/0xa80 [kvm]
[  331.466999] LR [d0000000025d6b98] .kvmppc_handle_exit_pr+0x758/0xa80 [kvm]
[  331.467517] Call Trace:
[  331.467909] [c0000003bd32f6a0] [d0000000025d6b98] .kvmppc_handle_exit_pr+0x758/0xa80 [kvm] (unreliable)
[  331.468553] [c0000003bd32f750] [d0000000025d98f0] kvm_start_lightweight+0xb4/0xc4 [kvm]
[  331.469189] [c0000003bd32f920] [d0000000025d7648] .kvmppc_vcpu_run_pr+0xd8/0x270 [kvm]
[  331.469838] [c0000003bd32f9c0] [d0000000025cf748] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
[  331.470790] [c0000003bd32fa50] [d0000000025cc19c] .kvm_arch_vcpu_ioctl_run+0x5c/0x1b0 [kvm]
[  331.471401] [c0000003bd32fae0] [d0000000025c4888] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
[  331.472026] [c0000003bd32fc90] [c00000000026192c] .do_vfs_ioctl+0x4dc/0x7a0
[  331.472561] [c0000003bd32fd80] [c000000000261cc4] .SyS_ioctl+0xd4/0xf0
[  331.473095] [c0000003bd32fe30] [c000000000009ed8] syscall_exit+0x0/0x98
[  331.473633] Instruction dump:
[  331.473766] 4bfff9b4 2b9d0800 419efc18 60000000 60420000 3d220000 e8bf11a0 e8df12a8
[  331.474733] 7fa4eb78 e8698660 48015165 e8410028 <0fe00000> 813f00e4 3ba00000 39290001
[  331.475386] ---[ end trace 49fc47d994c1f8f2 ]---
[  331.479817]

This fixes the problem by making kvmppc_handle_exit_pr() recognize the
interrupt.  We also need to jump to the doorbell interrupt handler in
book3s_segment.S to handle the interrupt on the way out of the guest.
Having done that, there's nothing further to be done in
kvmppc_handle_exit_pr().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:23 +01:00
Michael Neuling
7b490411c3 KVM: PPC: Book3S HV: Add new state for transactional memory
Add new state for transactional memory (TM) to kvm_vcpu_arch.  Also add
asm-offset bits that are going to be required.

This also moves the existing TFHAR, TFIAR and TEXASR SPRs into a
CONFIG_PPC_TRANSACTIONAL_MEM section.  This requires some code changes to
ensure we still compile with CONFIG_PPC_TRANSACTIONAL_MEM=N.  Much of the added
the added #ifdefs are removed in a later patch when the bulk of the TM code is
added.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix merge conflict]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:20 +01:00
Anton Blanchard
d682916a38 KVM: PPC: Book3S HV: Basic little-endian guest support
We create a guest MSR from scratch when delivering exceptions in
a few places.  Instead of extracting LPCR[ILE] and inserting it
into MSR_LE each time, we simply create a new variable intr_msr which
contains the entire MSR to use.  For a little-endian guest, userspace
needs to set the ILE (interrupt little-endian) bit in the LPCR for
each vcpu (or at least one vcpu in each virtual core).

[paulus@samba.org - removed H_SET_MODE implementation from original
version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.]

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:16 +01:00
Paul Mackerras
8563bf52d5 KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
The DABRX (DABR extension) register on POWER7 processors provides finer
control over which accesses cause a data breakpoint interrupt.  It
contains 3 bits which indicate whether to enable accesses in user,
kernel and hypervisor modes respectively to cause data breakpoint
interrupts, plus one bit that enables both real mode and virtual mode
accesses to cause interrupts.  Currently, KVM sets DABRX to allow
both kernel and user accesses to cause interrupts while in the guest.

This adds support for the guest to specify other values for DABRX.
PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR
and DABRX with one call.  This adds a real-mode implementation of
H_SET_XDABR, which shares most of its code with the existing H_SET_DABR
implementation.  To support this, we add a per-vcpu field to store the
DABRX value plus code to get and set it via the ONE_REG interface.

For Linux guests to use this new hcall, userspace needs to add
"hcall-xdabr" to the set of strings in the /chosen/hypertas-functions
property in the device tree.  If userspace does this and then migrates
the guest to a host where the kernel doesn't include this patch, then
userspace will need to implement H_SET_XDABR by writing the specified
DABR value to the DABR using the ONE_REG interface.  In that case, the
old kernel will set DABRX to DABRX_USER | DABRX_KERNEL.  That should
still work correctly, at least for Linux guests, since Linux guests
cope with getting data breakpoint interrupts in modes that weren't
requested by just ignoring the interrupt, and Linux guests never set
DABRX_BTI.

The other thing this does is to make H_SET_DABR and H_SET_XDABR work
on POWER8, which has the DAWR and DAWRX instead of DABR/X.  Guests that
know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but
guests running in POWER7 compatibility mode will still use H_SET_[X]DABR.
For them, this adds the logic to convert DABR/X values into DAWR/X values
on POWER8.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:15 +01:00
Paul Mackerras
5d00f66b86 KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
POWER8 has support for hypervisor doorbell interrupts.  Though the
kernel doesn't use them for IPIs on the powernv platform yet, it
probably will in future, so this makes KVM cope gracefully if a
hypervisor doorbell interrupt arrives while in a guest.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:13 +01:00
Paul Mackerras
e0622bd9f2 KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
POWER8 has a bit in the LPCR to enable or disable the PURR and SPURR
registers to count when in the guest.  Set this bit.

POWER8 has a field in the LPCR called AIL (Alternate Interrupt Location)
which is used to enable relocation-on interrupts.  Allow userspace to
set this field.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:11 +01:00
Paul Mackerras
aa31e84322 KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
* SRR1 wake reason field for system reset interrupt on wakeup from nap
  is now a 4-bit field on P8, compared to 3 bits on P7.

* Set PECEDP in LPCR when napping because of H_CEDE so guest doorbells
  will wake us up.

* Waking up from nap because of a guest doorbell interrupt is not a
  reason to exit the guest.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:10 +01:00
Paul Mackerras
e3bbbbfa13 KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
Currently in book3s_hv_rmhandlers.S we have three places where we
have woken up from nap mode and we check the reason field in SRR1
to see what event woke us up.  This consolidates them into a new
function, kvmppc_check_wake_reason.  It looks at the wake reason
field in SRR1, and if it indicates that an external interrupt caused
the wakeup, calls kvmppc_read_intr to check what sort of interrupt
it was.

This also consolidates the two places where we synthesize an external
interrupt (0x500 vector) for the guest.  Now, if the guest exit code
finds that there was an external interrupt which has been handled
(i.e. it was an IPI indicating that there is now an interrupt pending
for the guest), it jumps to deliver_guest_interrupt, which is in the
last part of the guest entry code, where we synthesize guest external
and decrementer interrupts.  That code has been streamlined a little
and now clears LPCR[MER] when appropriate as well as setting it.

The extra clearing of any pending IPI on a secondary, offline CPU
thread before going back to nap mode has been removed.  It is no longer
necessary now that we have code to read and acknowledge IPIs in the
guest exit path.

This fixes a minor bug in the H_CEDE real-mode handling - previously,
if we found that other threads were already exiting the guest when we
were about to go to nap mode, we would branch to the cede wakeup path
and end up looking in SRR1 for a wakeup reason.  Now we branch to a
point after we have checked the wakeup reason.

This also fixes a minor bug in kvmppc_read_intr - previously it could
return 0xff rather than 1, in the case where we find that a host IPI
is pending after we have cleared the IPI.  Now it returns 1.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:08 +01:00
Paul Mackerras
5557ae0ec7 KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
This allows us to select architecture 2.05 (POWER6) or 2.06 (POWER7)
compatibility modes on a POWER8 processor.  (Note that transactional
memory is disabled for usermode if either or both of the PCR_TM_DIS
and PCR_ARCH_206 bits are set.)

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:06 +01:00
Michael Ellerman
bd3048b80c KVM: PPC: Book3S HV: Add handler for HV facility unavailable
At present this should never happen, since the host kernel sets
HFSCR to allow access to all facilities.  It's better to be prepared
to handle it cleanly if it does ever happen, though.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:04 +01:00
Paul Mackerras
ca25205513 KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
POWER8 has 512 sets in the TLB, compared to 128 for POWER7, so we need
to do more tlbiel instructions when flushing the TLB on POWER8.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:02 +01:00
Michael Neuling
b005255e12 KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
This adds fields to the struct kvm_vcpu_arch to store the new
guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
functions to allow userspace to access this state, and adds code to
the guest entry and exit to context-switch these SPRs between host
and guest.

Note that DPDES (Directed Privileged Doorbell Exception State) is
shared between threads on a core; hence we store it in struct
kvmppc_vcore and have the master thread save and restore it.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:01:00 +01:00
Paul Mackerras
e0b7ec058c KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core.  Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers.  Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).

POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core.  Since
the instruction takes the destination physical thread ID as a parameter,
it becomes necessary to align the physical thread IDs with the virtual
thread IDs, that is, to make sure virtual thread N within a virtual
core always runs on physical thread N.

This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the
one whose task called kvmppc_run_core(), or it may end up running
no vcpu at all, if for example thread 0 of the virtual core is
currently executing in userspace.  However, we do need thread 0
to be responsible for switching the MMU -- a previous version of
this patch that had other threads switching the MMU was found to
be responsible for occasional memory corruption and machine check
interrupts in the guest on POWER7 machines.

To accommodate this, we no longer pass the vcpu pointer to
__kvmppc_vcore_entry, but instead let the assembly code load it from
the PACA.  Since the assembly code will need to know the kvm pointer
and the thread ID for threads which don't have a vcpu, we move the
thread ID into the PACA and we add a kvm pointer to the virtual core
structure.

In the case where thread 0 has no vcpu to run, it still calls into
kvmppc_hv_entry in order to do the MMU switch, and then naps until
either its vcpu is ready to run in the guest, or some other thread
needs to exit the guest.  In the latter case, thread 0 jumps to the
code that switches the MMU back to the host.  This control flow means
that now we switch the MMU before loading any guest vcpu state.
Similarly, on guest exit we now save all the guest vcpu state before
switching the MMU back to the host.  This has required substantial
code movement, making the diff rather large.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:00:59 +01:00
Michael Neuling
eee7ff9d2c KVM: PPC: Book3S HV: Don't set DABR on POWER8
POWER8 doesn't have the DABR and DABRX registers; instead it has
new DAWR/DAWRX registers, which will be handled in a later patch.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:00:57 +01:00
Scott Wood
6c85f52b10 kvm/ppc: IRQ disabling cleanup
Simplify the handling of lazy EE by going directly from fully-enabled
to hard-disabled.  This replaces the lazy_irq_pending() check
(including its misplaced kvm_guest_exit() call).

As suggested by Tiejun Chen, move the interrupt disabling into
kvmppc_prepare_to_enter() rather than have each caller do it.  Also
move the IRQ enabling on heavyweight exit into
kvmppc_prepare_to_enter().

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:00:55 +01:00
Mihai Caraman
70713fe315 KVM: PPC: e500: Fix bad address type in deliver_tlb_misss()
Use gva_t instead of unsigned int for eaddr in deliver_tlb_miss().

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
CC: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:00:54 +01:00
Andreas Schwab
48eaef0518 KVM: PPC: Book3S HV: use xics_wake_cpu only when defined
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
CC: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:00:52 +01:00
Cédric Le Goater
736017752d KVM: PPC: Book3S: MMIO emulation support for little endian guests
MMIO emulation reads the last instruction executed by the guest
and then emulates. If the guest is running in Little Endian order,
or more generally in a different endian order of the host, the
instruction needs to be byte-swapped before being emulated.

This patch adds a helper routine which tests the endian order of
the host and the guest in order to decide whether a byteswap is
needed or not. It is then used to byteswap the last instruction
of the guest in the endian order of the host before MMIO emulation
is performed.

Finally, kvmppc_handle_load() of kvmppc_handle_store() are modified
to reverse the endianness of the MMIO if required.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
[agraf: add booke handling]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-27 16:00:39 +01:00
Linus Torvalds
7ebd3faa9b First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place.  The most
 interesting part is the ARM guys' virtualized interrupt controller
 overhaul, which lets userspace get/set the state and thus enables
 migration of ARM VMs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS3TVKAAoJEBvWZb6bTYbyIFgP/2cmt4ifCuFMaZv4+G1S8jZU
 uC9ZB/+7vzht/p6zAy+4BxurKbHmSBFkC1OKcxYuy7yB4CQkHabzj4V2vRtqFdwH
 5lExP9qh3kqaVLuhnvxLTmkktR3EW4PFy6OI53l5kRNktOXSuZ0aN6K3V7tCg/X0
 iL7ASo4bJKlxeWcDpmuVrNgAajmZVfXrjKY7robgBQno+yIsgKhRZRBQHjozA6B8
 FpCo/k48RZd/EzIbV/PDDRI4hmmry/lgrO9SKjzq56wSqff2bd/k/KYze4dbAPfd
 Ps60enPTuHmeEjjb4MMMU4EKHVdTQFUMx/xZCmT4xzoh8s4of6RHphXbfE0SUznQ
 dTveyEQAR7E3JNS0k1+3WEX5fWlFesp0hO2NeE0wzUq4TAr9ztgVO9NQ6Si15e7Z
 2HysO0T5Ojtt0lY08/PvS6i48eCAuuBomrejJS8hLW4SUZ5adn+yW4Qo7Fp9JeBR
 l9a3LsVT8BZMtUWrUuFcVhlM4MbzElUPjDbgWhR8UYU/kpfVZOQu8qWgGKR4UWXy
 X7/t9l/tjR99CmfMJBAOzJid+ScSpAfg77BdaKiQrVfVIJmsjEjlO8vUMyj5b1HF
 hPX5wNyJjHAOfridLeHSs4Rdm4a8sk8Az5d4h76pLVz8M4jyTi2v0rO3N4/dU/pu
 x7N8KR5hAj+mLBoM9/Al
 =8sYU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First round of KVM updates for 3.14; PPC parts will come next week.

  Nothing major here, just bugfixes all over the place.  The most
  interesting part is the ARM guys' virtualized interrupt controller
  overhaul, which lets userspace get/set the state and thus enables
  migration of ARM VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  kvm: make KVM_MMU_AUDIT help text more readable
  KVM: s390: Fix memory access error detection
  KVM: nVMX: Update guest activity state field on L2 exits
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: SVM: Fix reading of DR6
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  add support for Hyper-V reference time counter
  KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
  KVM: x86: fix tsc catchup issue with tsc scaling
  KVM: x86: limit PIT timer frequency
  KVM: x86: handle invalid root_hpa everywhere
  kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
  kvm: vfio: silence GCC warning
  KVM: ARM: Remove duplicate include
  arm/arm64: KVM: relax the requirements of VMA alignment for THP
  ...
2014-01-22 21:40:43 -08:00
Zhouyi Zhou
47d45d9f53 KVM: PPC: NULL return of kvmppc_mmu_hpte_cache_next should be handled
NULL return of kvmppc_mmu_hpte_cache_next should be handled

Signed-off-by: Zhouyi Zhou <yizhouzhou@ict.ac.cn>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:11 +01:00
Tiejun Chen
9bd880a2c8 KVM: PPC: Book3E HV: call RECONCILE_IRQ_STATE to sync the software state
Rather than calling hard_irq_disable() when we're back in C code
we can just call RECONCILE_IRQ_STATE to soft disable IRQs while
we're already in hard disabled state.

This should be functionally equivalent to the code before, but
cleaner and faster.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
[agraf: fix comment, commit message]
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:10 +01:00
Bharat Bhushan
08c9a188d0 kvm: powerpc: use caching attributes as per linux pte
KVM uses same WIM tlb attributes as the corresponding qemu pte.
For this we now search the linux pte for the requested page and
get these cache caching/coherency attributes from pte.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:08 +01:00
Bharat Bhushan
7c85e6b39c kvm: book3s: rename lookup_linux_pte() to lookup_linux_pte_and_update()
lookup_linux_pte() is doing more than lookup, updating the pte,
so for clarity it is renamed to lookup_linux_pte_and_update()

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:06 +01:00
Bharat Bhushan
30a91fe24b kvm: booke: clear host tlb reference flag on guest tlb invalidation
On booke, "struct tlbe_ref" contains host tlb mapping information
(pfn: for guest-pfn to pfn, flags: attribute associated with this mapping)
for a guest tlb entry. So when a guest creates a TLB entry then
"struct tlbe_ref" is set to point to valid "pfn" and set attributes in
"flags" field of the above said structure. When a guest TLB entry is
invalidated then flags field of corresponding "struct tlbe_ref" is
updated to point that this is no more valid, also we selectively clear
some other attribute bits, example: if E500_TLB_BITMAP was set then we clear
E500_TLB_BITMAP, if E500_TLB_TLB0 is set then we clear this.

Ideally we should clear complete "flags" as this entry is invalid and does not
have anything to re-used. The other part of the problem is that when we use
the same entry again then also we do not clear (started doing or-ing etc).

So far it was working because the selectively clearing mentioned above
actually clears "flags" what was set during TLB mapping. But the problem
starts coming when we add more attributes to this then we need to selectively
clear them and which is not needed.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:04 +01:00
Paul Mackerras
595e4f7e69 KVM: PPC: Book3S HV: Use load/store_fp_state functions in HV guest entry/exit
This modifies kvmppc_load_fp and kvmppc_save_fp to use the generic
FP/VSX and VMX load/store functions instead of open-coding the
FP/VSX/VMX load/store instructions.  Since kvmppc_load/save_fp don't
follow C calling conventions, we make them private symbols within
book3s_hv_rmhandlers.S.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:03 +01:00
Paul Mackerras
99dae3bad2 KVM: PPC: Load/save FP/VMX/VSX state directly to/from vcpu struct
Now that we have the vcpu floating-point and vector state stored in
the same type of struct as the main kernel uses, we can load that
state directly from the vcpu struct instead of having extra copies
to/from the thread_struct.  Similarly, when the guest state needs to
be saved, we can have it saved it directly to the vcpu struct by
setting the current->thread.fp_save_area and current->thread.vr_save_area
pointers.  That also means that we don't need to back up and restore
userspace's FP/vector state.  This all makes the code simpler and
faster.

Note that it's not necessary to save or modify current->thread.fpexc_mode,
since nothing in KVM uses or is affected by its value.  Nor is it
necessary to touch used_vr or used_vsr.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:02 +01:00
Paul Mackerras
efff191223 KVM: PPC: Store FP/VSX/VMX state in thread_fp/vr_state structures
This uses struct thread_fp_state and struct thread_vr_state to store
the floating-point, VMX/Altivec and VSX state, rather than flat arrays.
This makes transferring the state to/from the thread_struct simpler
and allows us to unify the get/set_one_reg implementations for the
VSX registers.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:15:00 +01:00
Paul Mackerras
09548fdaf3 KVM: PPC: Use load_fp/vr_state rather than load_up_fpu/altivec
The load_up_fpu and load_up_altivec functions were never intended to
be called from C, and do things like modifying the MSR value in their
callers' stack frames, which are assumed to be interrupt frames.  In
addition, on 32-bit Book S they require the MMU to be off.

This makes KVM use the new load_fp_state() and load_vr_state() functions
instead of load_up_fpu/altivec.  This means we can remove the assembler
glue in book3s_rmhandlers.S, and potentially fixes a bug on Book E,
where load_up_fpu was called directly from C.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:14:59 +01:00
Gleb Natapov
458ff3c099 KVM: PPC: fix couple of memory leaks in MPIC/XICS devices
XICS failed to free xics structure on error path. MPIC destroy handler
forgot to delete kvm_device structure.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:14:54 +01:00
Alexander Graf
398a76c677 KVM: PPC: Add devname:kvm aliases for modules
Systems that support automatic loading of kernel modules through
device aliases should try and automatically load kvm when /dev/kvm
gets opened.

Add code to support that magic for all PPC kvm targets, even the
ones that don't support modules yet.

Signed-off-by: Alexander Graf <agraf@suse.de>
2014-01-09 10:14:00 +01:00
Mihai Caraman
228b1a4730 powerpc/booke64: Add LRAT error exception handler
LRAT (Logical to Real Address Translation) present in MMU v2 provides hardware
translation from a logical page number (LPN) to a real page number (RPN) when
tlbwe is executed by a guest or when a page table translation occurs from a
guest virtual address.

Add LRAT error exception handler to Booke3E 64-bit kernel and the basic KVM
handler to avoid build breakage. This is a prerequisite for KVM LRAT support
that will follow.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
2014-01-07 18:15:29 -06:00
Paul Mackerras
df9059bb64 KVM: PPC: Book3S HV: Don't drop low-order page address bits
Commit caaa4c804f ("KVM: PPC: Book3S HV: Fix physical address
calculations") unfortunately resulted in some low-order address bits
getting dropped in the case where the guest is creating a 4k HPTE
and the host page size is 64k.  By getting the low-order bits from
hva rather than gpa we miss out on bits 12 - 15 in this case, since
hva is at page granularity.  This puts the missing bits back in.

Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-12-18 11:30:35 +01:00
Aneesh Kumar K.V
36e7bb3802 powerpc: book3s: kvm: Don't abuse host r2 in exit path
We don't use PACATOC for PR. Avoid updating HOST_R2 with PR
KVM mode when both HV and PR are enabled in the kernel. Without this we
get the below crash

(qemu)
Unable to handle kernel paging request for data at address 0xffffffffffff8310
Faulting instruction address: 0xc00000000001d5a4
cpu 0x2: Vector: 300 (Data Access) at [c0000001dc53aef0]
    pc: c00000000001d5a4: .vtime_delta.isra.1+0x34/0x1d0
    lr: c00000000001d760: .vtime_account_system+0x20/0x60
    sp: c0000001dc53b170
   msr: 8000000000009032
   dar: ffffffffffff8310
 dsisr: 40000000
  current = 0xc0000001d76c62d0
  paca    = 0xc00000000fef1100   softe: 0        irq_happened: 0x01
    pid   = 4472, comm = qemu-system-ppc
enter ? for help
[c0000001dc53b200] c00000000001d760 .vtime_account_system+0x20/0x60
[c0000001dc53b290] c00000000008d050 .kvmppc_handle_exit_pr+0x60/0xa50
[c0000001dc53b340] c00000000008f51c kvm_start_lightweight+0xb4/0xc4
[c0000001dc53b510] c00000000008cdf0 .kvmppc_vcpu_run_pr+0x150/0x2e0
[c0000001dc53b9e0] c00000000008341c .kvmppc_vcpu_run+0x2c/0x40
[c0000001dc53ba50] c000000000080af4 .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
[c0000001dc53bae0] c00000000007b4c8 .kvm_vcpu_ioctl+0x478/0x730
[c0000001dc53bca0] c0000000002140cc .do_vfs_ioctl+0x4ac/0x770
[c0000001dc53bd80] c0000000002143e8 .SyS_ioctl+0x58/0xb0
[c0000001dc53be30] c000000000009e58 syscall_exit+0x0/0x98

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-12-18 11:29:31 +01:00
Takuya Yoshikawa
c08ac06ab3 KVM: Use cond_resched() directly and remove useless kvm_resched()
Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers
to make kvm preemptible"), the remaining stuff in this function is a
simple cond_resched() call with an extra need_resched() check which was
there to avoid dropping VCPUs unnecessarily.  Now it is meaningless.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-13 14:23:45 +01:00
Scott Wood
f5f972102d powerpc/kvm/booke: Fix build break due to stack frame size warning
Commit ce11e48b7f ("KVM: PPC: E500: Add
userspace debug stub support") added "struct thread_struct" to the
stack of kvmppc_vcpu_run().  thread_struct is 1152 bytes on my build,
compared to 48 bytes for the recently-introduced "struct debug_reg".
Use the latter instead.

This fixes the following error:

cc1: warnings being treated as errors
arch/powerpc/kvm/booke.c: In function 'kvmppc_vcpu_run':
arch/powerpc/kvm/booke.c:760:1: error: the frame size of 1424 bytes is larger than 1024 bytes
make[2]: *** [arch/powerpc/kvm/booke.o] Error 1
make[1]: *** [arch/powerpc/kvm] Error 2
make[1]: *** Waiting for unfinished jobs....

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-12-11 00:12:44 +01:00
Alexander Graf
3d3319b45e KVM: PPC: Book3S: PR: Enable interrupts earlier
Now that the svcpu sync is interrupt aware we can enable interrupts
earlier in the exit code path again, moving 32bit and 64bit closer
together.

While at it, document the fact that we're always executing the exit
path with interrupts enabled so that the next person doesn't trap
over this.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-12-09 09:41:41 +01:00
Alexander Graf
40fdd8c88c KVM: PPC: Book3S: PR: Make svcpu -> vcpu store preempt savvy
As soon as we get back to our "highmem" handler in virtual address
space we may get preempted. Today the reason we can get preempted is
that we replay interrupts and all the lazy logic thinks we have
interrupts enabled.

However, it's not hard to make the code interruptible and that way
we can enable and handle interrupts even earlier.

This fixes random guest crashes that happened with CONFIG_PREEMPT=y
for me.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-12-09 09:41:39 +01:00
Alexander Graf
d825a04387 KVM: PPC: Book3S: PR: Don't clobber our exit handler id
We call a C helper to save all svcpu fields into our vcpu. The C
ABI states that r12 is considered volatile. However, we keep our
exit handler id in r12 currently.

So we need to save it away into a non-volatile register instead
that definitely does get preserved across the C call.

This bug usually didn't hit anyone yet since gcc is smart enough
to generate code that doesn't even need r12 which means it stayed
identical throughout the call by sheer luck. But we can't rely on
that.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-12-09 09:41:26 +01:00
Mahesh Salgaonkar
36df96f8ac powerpc/book3s: Decode and save machine check event.
Now that we handle machine check in linux, the MCE decoding should also
take place in linux host. This info is crucial to log before we go down
in case we can not handle the machine check errors. This patch decodes
and populates a machine check event which contain high level meaning full
MCE information.

We do this in real mode C code with ME bit on. The MCE information is still
available on emergency stack (in pt_regs structure format). Even if we take
another exception at this point the MCE early handler will allocate a new
stack frame on top of current one. So when we return back here we still have
our MCE information safe on current stack.

We use per cpu buffer to save high level MCE information. Each per cpu buffer
is an array of machine check event structure indexed by per cpu counter
mce_nest_count. The mce_nest_count is incremented every time we enter
machine check early handler in real mode to get the current free slot
(index = mce_nest_count - 1). The mce_nest_count is decremented once the
MCE info is consumed by virtual mode machine exception handler.

This patch provides save_mce_event(), get_mce_event() and release_mce_event()
generic routines that can be used by machine check handlers to populate and
retrieve the event. The routine release_mce_event() will free the event slot so
that it can be reused. Caller can invoke get_mce_event() with a release flag
either to release the event slot immediately OR keep it so that it can be
fetched again. The event slot can be also released anytime by invoking
release_mce_event().

This patch also updates kvm code to invoke get_mce_event to retrieve generic
mce event rather than paca->opal_mce_evt.

The KVM code always calls get_mce_event() with release flags set to false so
that event is available for linus host machine

If machine check occurs while we are in guest, KVM tries to handle the error.
If KVM is able to handle MC error successfully, it enters the guest and
delivers the machine check to guest. If KVM is not able to handle MC error, it
exists the guest and passes the control to linux host machine check handler
which then logs MC event and decides how to handle it in linux host. In failure
case, KVM needs to make sure that the MC event is available for linux host to
consume. Hence KVM always calls get_mce_event() with release flags set to false
and later it invokes release_mce_event() only if it succeeds to handle error.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-05 16:05:20 +11:00
Mahesh Salgaonkar
0440705049 powerpc/book3s: Add flush_tlb operation in cpu_spec.
This patch introduces flush_tlb operation in cpu_spec structure. This will
help us to invoke appropriate CPU-side flush tlb routine. This patch
adds the foundation to invoke CPU specific flush routine for respective
architectures. Currently this patch introduce flush_tlb for p7 and p8.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-05 16:04:38 +11:00
Liu Ping Fan
27025a602c powerpc: kvm: optimize "sc 1" as fast return
In some scene, e.g openstack CI, PR guest can trigger "sc 1" frequently,
this patch optimizes the path by directly delivering BOOK3S_INTERRUPT_SYSCALL
to HV guest, so powernv can return to HV guest without heavy exit, i.e,
no need to swap TLB, HTAB,.. etc

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-21 14:56:45 +01:00
pingfan liu
91648ec09c powerpc: kvm: fix rare but potential deadlock scene
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
realmode, so it can trigger the deadlock.

Suppose the following scene:

Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of
vcpus.

If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched
out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be
caught in realmode for a long time.

What makes things even worse if the following happens,
  On cpuM, bitlockX is hold, on cpuN, Y is hold.
  vcpu_B_2 try to lock Y on cpuM in realmode
  vcpu_A_2 try to lock X on cpuN in realmode

Oops! deadlock happens

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@samba.org>
CC: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-18 22:41:57 +01:00
Paul Mackerras
c9438092ca KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call
Running a kernel with CONFIG_PROVE_RCU=y yields the following diagnostic:

===============================
[ INFO: suspicious RCU usage. ]
3.12.0-rc5-kvm+ #9 Not tainted
-------------------------------

include/linux/kvm_host.h:473 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-ppc/4831:

stack backtrace:
CPU: 28 PID: 4831 Comm: qemu-system-ppc Not tainted 3.12.0-rc5-kvm+ #9
Call Trace:
[c000000be462b2a0] [c00000000001644c] .show_stack+0x7c/0x1f0 (unreliable)
[c000000be462b370] [c000000000ad57c0] .dump_stack+0x88/0xb4
[c000000be462b3f0] [c0000000001315e8] .lockdep_rcu_suspicious+0x138/0x180
[c000000be462b480] [c00000000007862c] .gfn_to_memslot+0x13c/0x170
[c000000be462b510] [c00000000007d384] .gfn_to_hva_prot+0x24/0x90
[c000000be462b5a0] [c00000000007d420] .kvm_read_guest_page+0x30/0xd0
[c000000be462b630] [c00000000007d528] .kvm_read_guest+0x68/0x110
[c000000be462b6e0] [c000000000084594] .kvmppc_rtas_hcall+0x34/0x180
[c000000be462b7d0] [c000000000097934] .kvmppc_pseries_do_hcall+0x74/0x830
[c000000be462b880] [c0000000000990e8] .kvmppc_vcpu_run_hv+0xff8/0x15a0
[c000000be462b9e0] [c0000000000839cc] .kvmppc_vcpu_run+0x2c/0x40
[c000000be462ba50] [c0000000000810b4] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
[c000000be462bae0] [c00000000007b508] .kvm_vcpu_ioctl+0x478/0x730
[c000000be462bca0] [c00000000025532c] .do_vfs_ioctl+0x4dc/0x7a0
[c000000be462bd80] [c0000000002556b4] .SyS_ioctl+0xc4/0xe0
[c000000be462be30] [c000000000009ee4] syscall_exit+0x0/0x98

To fix this, we take the SRCU read lock around the kvmppc_rtas_hcall()
call.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-18 22:41:20 +01:00
Paul Mackerras
bf3d32e115 KVM: PPC: Book3S HV: Make tbacct_lock irq-safe
Lockdep reported that there is a potential for deadlock because
vcpu->arch.tbacct_lock is not irq-safe, and is sometimes taken inside
the rq_lock (run-queue lock) in the scheduler, which is taken within
interrupts.  The lockdep splat looks like:

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
3.12.0-rc5-kvm+ #8 Not tainted
------------------------------------------------------
qemu-system-ppc/4803 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
(&(&vcpu->arch.tbacct_lock)->rlock){+.+...}, at: [<c0000000000947ac>] .kvmppc_core_vcpu_put_hv+0x2c/0xa0

and this task is already holding:
(&rq->lock){-.-.-.}, at: [<c000000000ac16c0>] .__schedule+0x180/0xaa0
which would create a new lock dependency:
(&rq->lock){-.-.-.} -> (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}

but this new dependency connects a HARDIRQ-irq-safe lock:
(&rq->lock){-.-.-.}
... which became HARDIRQ-irq-safe at:
 [<c00000000013797c>] .lock_acquire+0xbc/0x190
 [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
 [<c0000000000f8564>] .scheduler_tick+0x54/0x180
 [<c0000000000c2610>] .update_process_times+0x70/0xa0
 [<c00000000012cdfc>] .tick_periodic+0x3c/0xe0
 [<c00000000012cec8>] .tick_handle_periodic+0x28/0xb0
 [<c00000000001ef40>] .timer_interrupt+0x120/0x2e0
 [<c000000000002868>] decrementer_common+0x168/0x180
 [<c0000000001c7ca4>] .get_page_from_freelist+0x924/0xc10
 [<c0000000001c8e00>] .__alloc_pages_nodemask+0x200/0xba0
 [<c0000000001c9eb8>] .alloc_pages_exact_nid+0x68/0x110
 [<c000000000f4c3ec>] .page_cgroup_init+0x1e0/0x270
 [<c000000000f24480>] .start_kernel+0x3e0/0x4e4
 [<c000000000009d30>] .start_here_common+0x20/0x70

to a HARDIRQ-irq-unsafe lock:
(&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
... which became HARDIRQ-irq-unsafe at:
...  [<c00000000013797c>] .lock_acquire+0xbc/0x190
 [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
 [<c0000000000946ac>] .kvmppc_core_vcpu_load_hv+0x2c/0x100
 [<c00000000008394c>] .kvmppc_core_vcpu_load+0x2c/0x40
 [<c000000000081000>] .kvm_arch_vcpu_load+0x10/0x30
 [<c00000000007afd4>] .vcpu_load+0x64/0xd0
 [<c00000000007b0f8>] .kvm_vcpu_ioctl+0x68/0x730
 [<c00000000025530c>] .do_vfs_ioctl+0x4dc/0x7a0
 [<c000000000255694>] .SyS_ioctl+0xc4/0xe0
 [<c000000000009ee4>] syscall_exit+0x0/0x98

Some users have reported this deadlock occurring in practice, though
the reports have been primarily on 3.10.x-based kernels.

This fixes the problem by making tbacct_lock be irq-safe.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-18 22:39:23 +01:00
Paul Mackerras
f019b7ad76 KVM: PPC: Book3S HV: Refine barriers in guest entry/exit
Some users have reported instances of the host hanging with secondary
threads of a core waiting for the primary thread to exit the guest,
and the primary thread stuck in nap mode.  This prompted a review of
the memory barriers in the guest entry/exit code, and this is the
result.  Most of these changes are the suggestions of Dean Burdick
<deanburdick@us.ibm.com>.

The barriers between updating napping_threads and reading the
entry_exit_count on the one hand, and updating entry_exit_count and
reading napping_threads on the other, need to be isync not lwsync,
since we need to ensure that either the napping_threads update or the
entry_exit_count update get seen.  It is not sufficient to order the
load vs. lwarx, as lwsync does; we need to order the load vs. the
stwcx., so we need isync.

In addition, we need a full sync before sending IPIs to wake other
threads from nap, to ensure that the write to the entry_exit_count is
visible before the IPI occurs.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-18 22:38:30 +01:00
Paul Mackerras
caaa4c804f KVM: PPC: Book3S HV: Fix physical address calculations
This fixes a bug in kvmppc_do_h_enter() where the physical address
for a page can be calculated incorrectly if transparent huge pages
(THP) are active.  Until THP came along, it was true that if we
encountered a large (16M) page in kvmppc_do_h_enter(), then the
associated memslot must be 16M aligned for both its guest physical
address and the userspace address, and the physical address
calculations in kvmppc_do_h_enter() assumed that.  With THP, that
is no longer true.

In the case where we are using MMU notifiers and the page size that
we get from the Linux page tables is larger than the page being mapped
by the guest, we need to fill in some low-order bits of the physical
address.  Without THP, these bits would be the same in the guest
physical address (gpa) and the host virtual address (hva).  With THP,
they can be different, and we need to use the bits from hva rather
than gpa.

In the case where we are not using MMU notifiers, the host physical
address we get from the memslot->arch.slot_phys[] array already
includes the low-order bits down to the PAGE_SIZE level, even if
we are using large pages.  Thus we can simplify the calculation in
this case to just add in the remaining bits in the case where
PAGE_SIZE is 64k and the guest is mapping a 4k page.

The same bug exists in kvmppc_book3s_hv_page_fault().  The basic fix
is to use psize (the page size from the HPTE) rather than pte_size
(the page size from the Linux PTE) when updating the HPTE low word
in r.  That means that pfn needs to be computed to PAGE_SIZE
granularity even if the Linux PTE is a huge page PTE.  That can be
arranged simply by doing the page_to_pfn() before setting page to
the head of the compound page.  If psize is less than PAGE_SIZE,
then we need to make sure we only update the bits from PAGE_SIZE
upwards, in order not to lose any sub-page offset bits in r.
On the other hand, if psize is greater than PAGE_SIZE, we need to
make sure we don't bring in non-zero low order bits in pfn, hence
we mask (pfn << PAGE_SHIFT) with ~(psize - 1).

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-18 22:36:09 +01:00
Linus Torvalds
f080480488 Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
 is probably the most interesting change from a user point of view.
 On the x86 side there are nested virtualization improvements and a
 few bugfixes.  ARM got transparent huge page support, improved
 overcommit, and support for big endian guests.
 
 Finally, there is a new interface to connect KVM with VFIO.  This
 helps with devices that use NoSnoop PCI transactions, letting the
 driver in the guest execute WBINVD instructions.  This includes
 some nVidia cards on Windows, that fail to start without these
 patches and the corresponding userspace changes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
 L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
 XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
 db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
 w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
 vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
 dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
 t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
 0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
 7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
 qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
 NTmT/cfmYp2BRkiCfCiS
 =rWNf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM changes from Paolo Bonzini:
 "Here are the 3.13 KVM changes.  There was a lot of work on the PPC
  side: the HV and emulation flavors can now coexist in a single kernel
  is probably the most interesting change from a user point of view.

  On the x86 side there are nested virtualization improvements and a few
  bugfixes.

  ARM got transparent huge page support, improved overcommit, and
  support for big endian guests.

  Finally, there is a new interface to connect KVM with VFIO.  This
  helps with devices that use NoSnoop PCI transactions, letting the
  driver in the guest execute WBINVD instructions.  This includes some
  nVidia cards on Windows, that fail to start without these patches and
  the corresponding userspace changes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
  kvm, vmx: Fix lazy FPU on nested guest
  arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
  arm/arm64: KVM: MMIO support for BE guest
  kvm, cpuid: Fix sparse warning
  kvm: Delete prototype for non-existent function kvm_check_iopl
  kvm: Delete prototype for non-existent function complete_pio
  hung_task: add method to reset detector
  pvclock: detect watchdog reset at pvclock read
  kvm: optimize out smp_mb after srcu_read_unlock
  srcu: API for barrier after srcu read unlock
  KVM: remove vm mmap method
  KVM: IOMMU: hva align mapping page size
  KVM: x86: trace cpuid emulation when called from emulator
  KVM: emulator: cleanup decode_register_operand() a bit
  KVM: emulator: check rex prefix inside decode_register()
  KVM: x86: fix emulation of "movzbl %bpl, %eax"
  kvm_host: typo fix
  KVM: x86: emulate SAHF instruction
  MAINTAINERS: add tree for kvm.git
  Documentation/kvm: add a 00-INDEX file
  ...
2013-11-15 13:51:36 +09:00
Linus Torvalds
66a173b926 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "The bulk of this is LE updates.  One should now be able to build an LE
  kernel and even run some things in it.

  I'm still sitting on a handful of patches to enable the new ABI that I
  *might* still send this merge window around, but due to the
  incertainty (they are pretty fresh) I want to keep them separate.

  Other notable changes are some infrastructure bits to better handle
  PCI pass-through under KVM, some bits and pieces added to the new
  PowerNV platform support such as access to the CPU SCOM bus via sysfs,
  and support for EEH error handling on PHB3 (Power8 PCIe).

  We also grew arch_get_random_long() for both pseries and powernv when
  running on P7+ and P8, exploiting the HW rng.

  And finally various embedded updates from freescale"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (154 commits)
  powerpc: Fix fatal SLB miss when restoring PPR
  powerpc/powernv: Reserve the correct PE number
  powerpc/powernv: Add PE to its own PELTV
  powerpc/powernv: Add support for indirect XSCOM via debugfs
  powerpc/scom: Improve debugfs interface
  powerpc/scom: Enable 64-bit addresses
  powerpc/boot: Properly handle the base "of" boot wrapper
  powerpc/bpf: Support MOD operation
  powerpc/bpf: Fix DIVWU instruction opcode
  of: Move definition of of_find_next_cache_node into common code.
  powerpc: Remove big endianness assumption in of_find_next_cache_node
  powerpc/tm: Remove interrupt disable in __switch_to()
  powerpc: word-at-a-time optimization for 64-bit Little Endian
  powerpc/bpf: BPF JIT compiler for 64-bit Little Endian
  powerpc: Only save/restore SDR1 if in hypervisor mode
  powerpc/pmu: Fix ADB_PMU_LED_IDE dependencies
  powerpc/nvram: Fix endian issue when using the partition length
  powerpc/nvram: Fix endian issue when reading the NVRAM size
  powerpc/nvram: Scan partitions only once
  powerpc/mpc512x: remove unnecessary #if
  ...
2013-11-12 14:34:19 +09:00
Gleb Natapov
95f328d3ad Merge branch 'kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into queue
Conflicts:
	arch/powerpc/include/asm/processor.h
2013-11-04 10:20:57 +02:00
Aneesh Kumar K.V
a78b55d1c0 kvm: powerpc: book3s: drop is_hv_enabled
drop is_hv_enabled, because that should not be a callback property

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 18:43:34 +02:00
Aneesh Kumar K.V
cbbc58d4fd kvm: powerpc: book3s: Allow the HV and PR selection per virtual machine
This moves the kvmppc_ops callbacks to be a per VM entity. This
enables us to select HV and PR mode when creating a VM. We also
allow both kvm-hv and kvm-pr kernel module to be loaded. To
achieve this we move /dev/kvm ownership to kvm.ko module. Depending on
which KVM mode we select during VM creation we take a reference
count on respective module

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: fix coding style]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 18:42:36 +02:00
Aneesh Kumar K.V
5587027ce9 kvm: Add struct kvm arg to memslot APIs
We will use that in the later patch to find the kvm ops handler

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:49:23 +02:00
Aneesh Kumar K.V
2ba9f0d887 kvm: powerpc: book3s: Support building HV and PR KVM as module
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: squash in compile fix]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:45:35 +02:00
Aneesh Kumar K.V
dba291f2ce kvm: powerpc: booke: Move booke related tracepoints to separate header
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:37:16 +02:00
Aneesh Kumar K.V
72c1253574 kvm: powerpc: book3s: pr: move PR related tracepoints to a separate header
This patch moves PR related tracepoints to a separate header. This
enables in converting PR to a kernel module which will be done in
later patches

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:36:22 +02:00
Aneesh Kumar K.V
699cc87641 kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_ops
This help us to identify whether we are running with hypervisor mode KVM
enabled. The change is needed so that we can have both HV and PR kvm
enabled in the same kernel.

If both HV and PR KVM are included, interrupts come in to the HV version
of the kvmppc_interrupt code, which then jumps to the PR handler,
renamed to kvmppc_interrupt_pr, if the guest is a PR guest.

Allowing both PR and HV in the same kernel required some changes to
kvm_dev_ioctl_check_extension(), since the values returned now can't
be selected with #ifdefs as much as previously. We look at is_hv_enabled
to return the right value when checking for capabilities.For capabilities that
are only provided by HV KVM, we return the HV value only if
is_hv_enabled is true. For capabilities provided by PR KVM but not HV,
we return the PR value only if is_hv_enabled is false.

NOTE: in later patch we replace is_hv_enabled with a static inline
function comparing kvm_ppc_ops

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:29:09 +02:00
Aneesh Kumar K.V
dd96b2c2dc kvm: powerpc: book3s: Cleanup interrupt handling code
With this patch if HV is included, interrupts come in to the HV version
of the kvmppc_interrupt code, which then jumps to the PR handler,
renamed to kvmppc_interrupt_pr, if the guest is a PR guest. This helps
in enabling both HV and PR, which we do in later patch

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:26:31 +02:00
Aneesh Kumar K.V
3a167beac0 kvm: powerpc: Add kvmppc_ops callback
This patch add a new callback kvmppc_ops. This will help us in enabling
both HV and PR KVM together in the same kernel. The actual change to
enable them together is done in the later patch in the series.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[agraf: squash in booke changes]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:24:26 +02:00
Aneesh Kumar K.V
9975f5e369 kvm: powerpc: book3s: Add a new config variable CONFIG_KVM_BOOK3S_HV_POSSIBLE
This help ups to select the relevant code in the kernel code
when we later move HV and PR bits as seperate modules. The patch
also makes the config options for PR KVM selectable

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:18:28 +02:00
Aneesh Kumar K.V
7aa79938f7 kvm: powerpc: book3s: pr: Rename KVM_BOOK3S_PR to KVM_BOOK3S_PR_POSSIBLE
With later patches supporting PR kvm as a kernel module, the changes
that has to be built into the main kernel binary to enable PR KVM module
is now selected via KVM_BOOK3S_PR_POSSIBLE

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:17:49 +02:00
Paul Mackerras
066212e02a kvm: powerpc: book3s: move book3s_64_vio_hv.c into the main kernel binary
Since the code in book3s_64_vio_hv.c is called from real mode with HV
KVM, and therefore has to be built into the main kernel binary, this
makes it always built-in rather than part of the KVM module.  It gets
called from the KVM module by PR KVM, so this adds an EXPORT_SYMBOL_GPL().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:17:25 +02:00
Paul Mackerras
178db620ee kvm: powerpc: book3s: remove kvmppc_handler_highmem label
This label is not used now.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:15:56 +02:00
Bharat Bhushan
ce11e48b7f KVM: PPC: E500: Add userspace debug stub support
This patch adds the debug stub support on booke/bookehv.
Now QEMU debug stub can use hw breakpoint, watchpoint and
software breakpoint to debug guest.

This is how we save/restore debug register context when switching
between guest, userspace and kernel user-process:

When QEMU is running
 -> thread->debug_reg == QEMU debug register context.
 -> Kernel will handle switching the debug register on context switch.
 -> no vcpu_load() called

QEMU makes ioctls (except RUN)
 -> This will call vcpu_load()
 -> should not change context.
 -> Some ioctls can change vcpu debug register, context saved in vcpu->debug_regs

QEMU Makes RUN ioctl
 -> Save thread->debug_reg on STACK
 -> Store thread->debug_reg == vcpu->debug_reg
 -> load thread->debug_reg
 -> RUN VCPU ( So thread points to vcpu context )

Context switch happens When VCPU running
 -> makes vcpu_load() should not load any context
 -> kernel loads the vcpu context as thread->debug_regs points to vcpu context.

On heavyweight_exit
 -> Load the context saved on stack in thread->debug_reg

Currently we do not support debug resource emulation to guest,
On debug exception, always exit to user space irrespective of
user space is expecting the debug exception or not. If this is
unexpected exception (breakpoint/watchpoint event not set by
userspace) then let us leave the action on user space. This
is similar to what it was before, only thing is that now we
have proper exit state available to user space.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:40 +02:00
Bharat Bhushan
547465ef8b KVM: PPC: E500: Using "struct debug_reg"
For KVM also use the "struct debug_reg" defined in asm/processor.h

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:39 +02:00
Bharat Bhushan
b12c784123 KVM: PPC: E500: exit to user space on "ehpriv 1" instruction
"ehpriv 1" instruction is used for setting software breakpoints
by user space. This patch adds support to exit to user space
with "run->debug" have relevant information.

As this is the first point we are using run->debug, also defined
the run->debug structure.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:39 +02:00
Bharat Bhushan
84e4d632b5 kvm: powerpc: e500: mark page accessed when mapping a guest page
Mark the guest page as accessed so that there is likely
less chances of this page getting swap-out.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:38 +02:00
Bharat Bhushan
ca8ccbd41d kvm: powerpc: allow guest control "G" attribute in mas2
"G" bit in MAS2 indicates whether the page is Guarded.
There is no reason to stop guest setting  "G", so allow him.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:37 +02:00
Bharat Bhushan
fd75cb51f2 kvm: powerpc: allow guest control "E" attribute in mas2
"E" bit in MAS2 bit indicates whether the page is accessed
in Little-Endian or Big-Endian byte order.
There is no reason to stop guest setting  "E", so allow him."

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:37 +02:00
Paul Mackerras
44a3add863 KVM: PPC: Book3S HV: Better handling of exceptions that happen in real mode
When an interrupt or exception happens in the guest that comes to the
host, the CPU goes to hypervisor real mode (MMU off) to handle the
exception but doesn't change the MMU context.  After saving a few
registers, we then clear the "in guest" flag.  If, for any reason,
we get an exception in the real-mode code, that then gets handled
by the normal kernel exception handlers, which turn the MMU on.  This
is disastrous if the MMU is still set to the guest context, since we
end up executing instructions from random places in the guest kernel
with hypervisor privilege.

In order to catch this situation, we define a new value for the "in guest"
flag, KVM_GUEST_MODE_HOST_HV, to indicate that we are in hypervisor real
mode with guest MMU context.  If the "in guest" flag is set to this value,
we branch off to an emergency handler.  For the moment, this just does
a branch to self to stop the CPU from doing anything further.

While we're here, we define another new flag value to indicate that we
are in a HV guest, as distinct from a PR guest.  This will be useful
when we have a kernel that can support both PR and HV guests concurrently.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:37 +02:00
Paul Mackerras
f1378b1c0b kvm: powerpc: book3s hv: Fix vcore leak
add kvmppc_free_vcores() to free the kvmppc_vcore structures
that we allocate for a guest, which are currently being leaked.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:36 +02:00
Paul Mackerras
491d6ecc17 KVM: PPC: Book3S PR: Reduce number of shadow PTEs invalidated by MMU notifiers
Currently, whenever any of the MMU notifier callbacks get called, we
invalidate all the shadow PTEs.  This is inefficient because it means
that we typically then get a lot of DSIs and ISIs in the guest to fault
the shadow PTEs back in.  We do this even if the address range being
notified doesn't correspond to guest memory.

This commit adds code to scan the memslot array to find out what range(s)
of guest physical addresses corresponds to the host virtual address range
being affected.  For each such range we flush only the shadow PTEs
for the range, on all cpus.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:36 +02:00
Paul Mackerras
adc0bafe00 KVM: PPC: Book3S PR: Mark pages accessed, and dirty if being written
The mark_page_dirty() function, despite what its name might suggest,
doesn't actually mark the page as dirty as far as the MM subsystem is
concerned.  It merely sets a bit in KVM's map of dirty pages, if
userspace has requested dirty tracking for the relevant memslot.
To tell the MM subsystem that the page is dirty, we have to call
kvm_set_pfn_dirty() (or an equivalent such as SetPageDirty()).

This adds a call to kvm_set_pfn_dirty(), and while we are here, also
adds a call to kvm_set_pfn_accessed() to tell the MM subsystem that
the page has been accessed.  Since we are now using the pfn in
several places, this adds a 'pfn' variable to store it and changes
the places that used hpaddr >> PAGE_SHIFT to use pfn instead, which
is the same thing.

This also changes a use of HPTE_R_PP to PP_RXRX.  Both are 3, but
PP_RXRX is more informative as being the read-only page permission
bit setting.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:36 +02:00
Paul Mackerras
d78bca7296 KVM: PPC: Book3S PR: Use mmu_notifier_retry() in kvmppc_mmu_map_page()
When the MM code is invalidating a range of pages, it calls the KVM
kvm_mmu_notifier_invalidate_range_start() notifier function, which calls
kvm_unmap_hva_range(), which arranges to flush all the existing host
HPTEs for guest pages.  However, the Linux PTEs for the range being
flushed are still valid at that point.  We are not supposed to establish
any new references to pages in the range until the ...range_end()
notifier gets called.  The PPC-specific KVM code doesn't get any
explicit notification of that; instead, we are supposed to use
mmu_notifier_retry() to test whether we are or have been inside a
range flush notifier pair while we have been getting a page and
instantiating a host HPTE for the page.

This therefore adds a call to mmu_notifier_retry inside
kvmppc_mmu_map_page().  This call is inside a region locked with
kvm->mmu_lock, which is the same lock that is called by the KVM
MMU notifier functions, thus ensuring that no new notification can
proceed while we are in the locked region.  Inside this region we
also create the host HPTE and link the corresponding hpte_cache
structure into the lists used to find it later.  We cannot allocate
the hpte_cache structure inside this locked region because that can
lead to deadlock, so we allocate it outside the region and free it
if we end up not using it.

This also moves the updates of vcpu3s->hpte_cache_count inside the
regions locked with vcpu3s->mmu_lock, and does the increment in
kvmppc_mmu_hpte_cache_map() when the pte is added to the cache
rather than when it is allocated, in order that the hpte_cache_count
is accurate.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:35 +02:00
Paul Mackerras
93b159b466 KVM: PPC: Book3S PR: Better handling of host-side read-only pages
Currently we request write access to all pages that get mapped into the
guest, even if the guest is only loading from the page.  This reduces
the effectiveness of KSM because it means that we unshare every page we
access.  Also, we always set the changed (C) bit in the guest HPTE if
it allows writing, even for a guest load.

This fixes both these problems.  We pass an 'iswrite' flag to the
mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether
the access is a load or a store.  The mmu.xlate() functions now only
set C for stores.  kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot()
instead of gfn_to_pfn() so that it can indicate whether we need write
access to the page, and get back a 'writable' flag to indicate whether
the page is writable or not.  If that 'writable' flag is clear, we then
make the host HPTE read-only even if the guest HPTE allowed writing.

This means that we can get a protection fault when the guest writes to a
page that it has mapped read-write but which is read-only on the host
side (perhaps due to KSM having merged the page).  Thus we now call
kvmppc_handle_pagefault() for protection faults as well as HPTE not found
faults.  In kvmppc_handle_pagefault(), if the access was allowed by the
guest HPTE and we thus need to install a new host HPTE, we then need to
remove the old host HPTE if there is one.  This is done with a new
function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to
find and remove the old host HPTE.

Since the memslot-related functions require the KVM SRCU read lock to
be held, this adds srcu_read_lock/unlock pairs around the calls to
kvmppc_handle_pagefault().

Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore
guest HPTEs that don't permit access, and to return -EPERM for accesses
that are not permitted by the page protections.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:35 +02:00
Paul Mackerras
4f6c11db10 KVM: PPC: Book3S: Move skip-interrupt handlers to common code
Both PR and HV KVM have separate, identical copies of the
kvmppc_skip_interrupt and kvmppc_skip_Hinterrupt handlers that are
used for the situation where an interrupt happens when loading the
instruction that caused an exit from the guest.  To eliminate this
duplication and make it easier to compile in both PR and HV KVM,
this moves this code to arch/powerpc/kernel/exceptions-64s.S along
with other kernel interrupt handler code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:49:35 +02:00
Paul Mackerras
3ff955024d KVM: PPC: Book3S PR: Allocate kvm_vcpu structs from kvm_vcpu_cache
This makes PR KVM allocate its kvm_vcpu structs from the kvm_vcpu_cache
rather than having them embedded in the kvmppc_vcpu_book3s struct,
which is allocated with vzalloc.  The reason is to reduce the
differences between PR and HV KVM in order to make is easier to have
them coexist in one kernel binary.

With this, the kvm_vcpu struct has a pointer to the kvmppc_vcpu_book3s
struct.  The pointer to the kvmppc_book3s_shadow_vcpu struct has moved
from the kvmppc_vcpu_book3s struct to the kvm_vcpu struct, and is only
present for 32-bit, since it is only used for 32-bit.

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: squash in compile fix from Aneesh]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:05 +02:00
Paul Mackerras
9308ab8e2d KVM: PPC: Book3S PR: Make HPT accesses and updates SMP-safe
This adds a per-VM mutex to provide mutual exclusion between vcpus
for accesses to and updates of the guest hashed page table (HPT).
This also makes the code use single-byte writes to the HPT entry
when updating of the reference (R) and change (C) bits.  The reason
for doing this, rather than writing back the whole HPTE, is that on
non-PAPR virtual machines, the guest OS might be writing to the HPTE
concurrently, and writing back the whole HPTE might conflict with
that.  Also, real hardware does single-byte writes to update R and C.

The new mutex is taken in kvmppc_mmu_book3s_64_xlate() when reading
the HPT and updating R and/or C, and in the PAPR HPT update hcalls
(H_ENTER, H_REMOVE, etc.).  Having the mutex means that we don't need
to use a hypervisor lock bit in the HPT update hcalls, and we don't
need to be careful about the order in which the bytes of the HPTE are
updated by those hcalls.

The other change here is to make emulated TLB invalidations (tlbie)
effective across all vcpus.  To do this we call kvmppc_mmu_pte_vflush
for all vcpus in kvmppc_ppc_book3s_64_tlbie().

For 32-bit, this makes the setting of the accessed and dirty bits use
single-byte writes, and makes tlbie invalidate shadow HPTEs for all
vcpus.

With this, PR KVM can successfully run SMP guests.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:04 +02:00
Paul Mackerras
5cd92a9521 KVM: PPC: Book3S PR: Correct errors in H_ENTER implementation
The implementation of H_ENTER in PR KVM has some errors:

* With H_EXACT not set, if the HPTEG is full, we return H_PTEG_FULL
  as the return value of kvmppc_h_pr_enter, but the caller is expecting
  one of the EMULATE_* values.  The H_PTEG_FULL needs to go in the
  guest's R3 instead.

* With H_EXACT set, if the selected HPTE is already valid, the H_ENTER
  call should return a H_PTEG_FULL error.

This fixes these errors and also makes it write only the selected HPTE,
not the whole group, since only the selected HPTE has been modified.
This also micro-optimizes the calculations involving pte_index and i.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:04 +02:00
Paul Mackerras
03a9c90334 KVM: PPC: Book3S PR: Handle PP0 page-protection bit in guest HPTEs
64-bit POWER processors have a three-bit field for page protection in
the hashed page table entry (HPTE).  Currently we only interpret the two
bits that were present in older versions of the architecture.  The only
defined combination that has the new bit set is 110, meaning read-only
for supervisor and no access for user mode.

This adds code to kvmppc_mmu_book3s_64_xlate() to interpret the extra
bit appropriately.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:04 +02:00
Paul Mackerras
c9029c341d KVM: PPC: Book3S PR: Use 64k host pages where possible
Currently, PR KVM uses 4k pages for the host-side mappings of guest
memory, regardless of the host page size.  When the host page size is
64kB, we might as well use 64k host page mappings for guest mappings
of 64kB and larger pages and for guest real-mode mappings.  However,
the magic page has to remain a 4k page.

To implement this, we first add another flag bit to the guest VSID
values we use, to indicate that this segment is one where host pages
should be mapped using 64k pages.  For segments with this bit set
we set the bits in the shadow SLB entry to indicate a 64k base page
size.  When faulting in host HPTEs for this segment, we make them
64k HPTEs instead of 4k.  We record the pagesize in struct hpte_cache
for use when invalidating the HPTE.

For now we restrict the segment containing the magic page (if any) to
4k pages.  It should be possible to lift this restriction in future
by ensuring that the magic 4k page is appropriately positioned within
a host 64k page.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:03 +02:00
Paul Mackerras
a4a0f2524a KVM: PPC: Book3S PR: Allow guest to use 64k pages
This adds the code to interpret 64k HPTEs in the guest hashed page
table (HPT), 64k SLB entries, and to tell the guest about 64k pages
in kvm_vm_ioctl_get_smmu_info().  Guest 64k pages are still shadowed
by 4k pages.

This also adds another hash table to the four we have already in
book3s_mmu_hpte.c to allow us to find all the PTEs that we have
instantiated that match a given 64k guest page.

The tlbie instruction changed starting with POWER6 to use a bit in
the RB operand to indicate large page invalidations, and to use other
RB bits to indicate the base and actual page sizes and the segment
size.  64k pages came in slightly earlier, with POWER5++.
We use one bit in vcpu->arch.hflags to indicate that the emulated
cpu supports 64k pages, and another to indicate that it has the new
tlbie definition.

The KVM_PPC_GET_SMMU_INFO ioctl presents a bit of a problem, because
the MMU capabilities depend on which CPU model we're emulating, but it
is a VM ioctl not a VCPU ioctl and therefore doesn't get passed a VCPU
fd.  In addition, commonly-used userspace (QEMU) calls it before
setting the PVR for any VCPU.  Therefore, as a best effort we look at
the first vcpu in the VM and return 64k pages or not depending on its
capabilities.  We also make the PVR default to the host PVR on recent
CPUs that support 1TB segments (and therefore multiple page sizes as
well) so that KVM_PPC_GET_SMMU_INFO will include 64k page and 1TB
segment support on those CPUs.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:03 +02:00
Paul Mackerras
a2d56020d1 KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu
Currently PR-style KVM keeps the volatile guest register values
(R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
the main kvm_vcpu struct.  For 64-bit, the shadow_vcpu exists in two
places, a kmalloc'd struct and in the PACA, and it gets copied back
and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
can't rely on being able to access the kmalloc'd struct.

This changes the code to copy the volatile values into the shadow_vcpu
as one of the last things done before entering the guest.  Similarly
the values are copied back out of the shadow_vcpu to the kvm_vcpu
immediately after exiting the guest.  We arrange for interrupts to be
still disabled at this point so that we can't get preempted on 64-bit
and end up copying values from the wrong PACA.

This means that the accessor functions in kvm_book3s.h for these
registers are greatly simplified, and are same between PR and HV KVM.
In places where accesses to shadow_vcpu fields are now replaced by
accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.

With this, the time to read the PVR one million times in a loop went
from 567.7ms to 575.5ms (averages of 6 values), an increase of about
1.4% for this worse-case test for guest entries and exits.  The
standard deviation of the measurements is about 11ms, so the
difference is only marginally significant statistically.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:03 +02:00
Paul Mackerras
f24817716e KVM: PPC: Book3S PR: Fix compilation without CONFIG_ALTIVEC
Commit 9d1ffdd8f3 ("KVM: PPC: Book3S PR: Don't corrupt guest state
when kernel uses VMX") added a call to kvmppc_load_up_altivec() that
isn't guarded by CONFIG_ALTIVEC, causing a link failure when building
a kernel without CONFIG_ALTIVEC set.  This adds an #ifdef to fix this.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:03 +02:00
Paul Mackerras
f3271d4c90 KVM: PPC: Book3S HV: Don't crash host on unknown guest interrupt
If we come out of a guest with an interrupt that we don't know about,
instead of crashing the host with a BUG(), we now return to userspace
with the exit reason set to KVM_EXIT_UNKNOWN and the trap vector in
the hw.hardware_exit_reason field of the kvm_run structure, as is done
on x86.  Note that run->exit_reason is already set to KVM_EXIT_UNKNOWN
at the beginning of kvmppc_handle_exit().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:02 +02:00
Paul Mackerras
388cc6e133 KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7
This enables us to use the Processor Compatibility Register (PCR) on
POWER7 to put the processor into architecture 2.05 compatibility mode
when running a guest.  In this mode the new instructions and registers
that were introduced on POWER7 are disabled in user mode.  This
includes all the VSX facilities plus several other instructions such
as ldbrx, stdbrx, popcntw, popcntd, etc.

To select this mode, we have a new register accessible through the
set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
this to zero gives the full set of capabilities of the processor.
Setting it to one of the "logical" PVR values defined in PAPR puts
the vcpu into the compatibility mode for the corresponding
architecture level.  The supported values are:

0x0f000002	Architecture 2.05 (POWER6)
0x0f000003	Architecture 2.06 (POWER7)
0x0f100003	Architecture 2.06+ (POWER7+)

Since the PCR is per-core, the architecture compatibility level and
the corresponding PCR value are stored in the struct kvmppc_vcore, and
are therefore shared between all vcpus in a virtual core.

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: squash in fix to add missing break statements and documentation]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:02 +02:00
Paul Mackerras
4b8473c9c1 KVM: PPC: Book3S HV: Add support for guest Program Priority Register
POWER7 and later IBM server processors have a register called the
Program Priority Register (PPR), which controls the priority of
each hardware CPU SMT thread, and affects how fast it runs compared
to other SMT threads.  This priority can be controlled by writing to
the PPR or by use of a set of instructions of the form or rN,rN,rN
which are otherwise no-ops but have been defined to set the priority
to particular levels.

This adds code to context switch the PPR when entering and exiting
guests and to make the PPR value accessible through the SET/GET_ONE_REG
interface.  When entering the guest, we set the PPR as late as
possible, because if we are setting a low thread priority it will
make the code run slowly from that point on.  Similarly, the
first-level interrupt handlers save the PPR value in the PACA very
early on, and set the thread priority to the medium level, so that
the interrupt handling code runs at a reasonable speed.

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:02 +02:00
Paul Mackerras
a0144e2a6b KVM: PPC: Book3S HV: Store LPCR value for each virtual core
This adds the ability to have a separate LPCR (Logical Partitioning
Control Register) value relating to a guest for each virtual core,
rather than only having a single value for the whole VM.  This
corresponds to what real POWER hardware does, where there is a LPCR
per CPU thread but most of the fields are required to have the same
value on all active threads in a core.

The per-virtual-core LPCR can be read and written using the
GET/SET_ONE_REG interface.  Userspace can can only modify the
following fields of the LPCR value:

DPFD	Default prefetch depth
ILE	Interrupt little-endian
TC	Translation control (secondary HPT hash group search disable)

We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
contains bits relating to memory management, i.e. the Virtualized
Partition Memory (VPM) bits and the bits relating to guest real mode.
When this default value is updated, the update needs to be propagated
to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
that.

Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix whitespace]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:01 +02:00
Paul Mackerras
8b75cbbe64 KVM: PPC: BookE: Add GET/SET_ONE_REG interface for VRSAVE
This makes the VRSAVE register value for a vcpu accessible through
the GET/SET_ONE_REG interface on Book E systems (in addition to the
existing GET/SET_SREGS interface), for consistency with Book 3S.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:01 +02:00
Paul Mackerras
8c2dbb79c6 KVM: PPC: Book3S HV: Avoid unbalanced increments of VPA yield count
The yield count in the VPA is supposed to be incremented every time
we enter the guest, and every time we exit the guest, so that its
value is even when the vcpu is running in the guest and odd when it
isn't.  However, it's currently possible that we increment the yield
count on the way into the guest but then find that other CPU threads
are already exiting the guest, so we go back to nap mode via the
secondary_too_late label.  In this situation we don't increment the
yield count again, breaking the relationship between the LSB of the
count and whether the vcpu is in the guest.

To fix this, we move the increment of the yield count to a point
after we have checked whether other CPU threads are exiting.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:01 +02:00
Paul Mackerras
c934243ca0 KVM: PPC: Book3S HV: Pull out interrupt-reading code into a subroutine
This moves the code in book3s_hv_rmhandlers.S that reads any pending
interrupt from the XICS interrupt controller, and works out whether
it is an IPI for the guest, an IPI for the host, or a device interrupt,
into a new function called kvmppc_read_intr.  Later patches will
need this.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:00 +02:00
Paul Mackerras
218309b75b KVM: PPC: Book3S HV: Restructure kvmppc_hv_entry to be a subroutine
We have two paths into and out of the low-level guest entry and exit
code: from a vcpu task via kvmppc_hv_entry_trampoline, and from the
system reset vector for an offline secondary thread on POWER7 via
kvm_start_guest.  Currently both just branch to kvmppc_hv_entry to
enter the guest, and on guest exit, we test the vcpu physical thread
ID to detect which way we came in and thus whether we should return
to the vcpu task or go back to nap mode.

In order to make the code flow clearer, and to keep the code relating
to each flow together, this turns kvmppc_hv_entry into a subroutine
that follows the normal conventions for call and return.  This means
that kvmppc_hv_entry_trampoline() and kvmppc_hv_entry() now establish
normal stack frames, and we use the normal stack slots for saving
return addresses rather than local_paca->kvm_hstate.vmhandler.  Apart
from that this is mostly moving code around unchanged.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:00 +02:00
Paul Mackerras
42d7604d0c KVM: PPC: Book3S HV: Implement H_CONFER
The H_CONFER hypercall is used when a guest vcpu is spinning on a lock
held by another vcpu which has been preempted, and the spinning vcpu
wishes to give its timeslice to the lock holder.  We implement this
in the straightforward way using kvm_vcpu_yield_to().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:45:00 +02:00
Paul Mackerras
c0867fd509 KVM: PPC: Book3S: Add GET/SET_ONE_REG interface for VRSAVE
The VRSAVE register value for a vcpu is accessible through the
GET/SET_SREGS interface for Book E processors, but not for Book 3S
processors.  In order to make this accessible for Book 3S processors,
this adds a new register identifier for GET/SET_ONE_REG, and adds
the code to implement it.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:44:59 +02:00
Paul Mackerras
93b0f4dc29 KVM: PPC: Book3S HV: Implement timebase offset for guests
This allows guests to have a different timebase origin from the host.
This is needed for migration, where a guest can migrate from one host
to another and the two hosts might have a different timebase origin.
However, the timebase seen by the guest must not go backwards, and
should go forwards only by a small amount corresponding to the time
taken for the migration.

Therefore this provides a new per-vcpu value accessed via the one_reg
interface using the new KVM_REG_PPC_TB_OFFSET identifier.  This value
defaults to 0 and is not modified by KVM.  On entering the guest, this
value is added onto the timebase, and on exiting the guest, it is
subtracted from the timebase.

This is only supported for recent POWER hardware which has the TBU40
(timebase upper 40 bits) register.  Writing to the TBU40 register only
alters the upper 40 bits of the timebase, leaving the lower 24 bits
unchanged.  This provides a way to modify the timebase for guest
migration without disturbing the synchronization of the timebase
registers across CPU cores.  The kernel rounds up the value given
to a multiple of 2^24.

Timebase values stored in KVM structures (struct kvm_vcpu, struct
kvmppc_vcore, etc.) are stored as host timebase values.  The timebase
values in the dispatch trace log need to be guest timebase values,
however, since that is read directly by the guest.  This moves the
setting of vcpu->arch.dec_expires on guest exit to a point after we
have restored the host timebase so that vcpu->arch.dec_expires is a
host timebase value.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:44:59 +02:00
Paul Mackerras
14941789f2 KVM: PPC: Book3S HV: Save/restore SIAR and SDAR along with other PMU registers
Currently we are not saving and restoring the SIAR and SDAR registers in
the PMU (performance monitor unit) on guest entry and exit.  The result
is that performance monitoring tools in the guest could get false
information about where a program was executing and what data it was
accessing at the time of a performance monitor interrupt.  This fixes
it by saving and restoring these registers along with the other PMU
registers on guest entry/exit.

This also provides a way for userspace to access these values for a
vcpu via the one_reg interface.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 14:44:59 +02:00
Benjamin Herrenschmidt
3ad26e5c44 Merge branch 'for-kvm' into next
Topic branch for commits that the KVM tree might want to pull
in separately.

Hand merged a few files due to conflicts with the LE stuff

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11 18:23:53 +11:00
Paul Mackerras
de79f7b9f6 powerpc: Put FP/VSX and VR state into structures
This creates new 'thread_fp_state' and 'thread_vr_state' structures
to store FP/VSX state (including FPSCR) and Altivec/VSX state
(including VSCR), and uses them in the thread_struct.  In the
thread_fp_state, the FPRs and VSRs are represented as u64 rather
than double, since we rarely perform floating-point computations
on the values, and this will enable the structures to be used
in KVM code as well.  Similarly FPSCR is now a u64 rather than
a structure of two 32-bit values.

This takes the offsets out of the macros such as SAVE_32FPRS,
REST_32FPRS, etc.  This enables the same macros to be used for normal
and transactional state, enabling us to delete the transactional
versions of the macros.   This also removes the unused do_load_up_fpu
and do_load_up_altivec, which were in fact buggy since they didn't
create large enough stack frames to account for the fact that
load_up_fpu and load_up_altivec are not designed to be called from C
and assume that their caller's stack frame is an interrupt frame.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11 17:26:49 +11:00
Anton Blanchard
7df697c815 KVM: PPC: Disable KVM on little endian builds
There are a number of KVM issues with little endian builds.
We are working on fixing them, but in the meantime disable
it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11 16:48:55 +11:00
Bharat Bhushan
40fde70d0d kvm: ppc: booke: check range page invalidation progress on page setup
When the MM code is invalidating a range of pages, it calls the KVM
kvm_mmu_notifier_invalidate_range_start() notifier function, which calls
kvm_unmap_hva_range(), which arranges to flush all the TLBs for guest pages.
However, the Linux PTEs for the range being flushed are still valid at
that point.  We are not supposed to establish any new references to pages
in the range until the ...range_end() notifier gets called.
The PPC-specific KVM code doesn't get any explicit notification of that;
instead, we are supposed to use mmu_notifier_retry() to test whether we
are or have been inside a range flush notifier pair while we have been
referencing a page.

This patch calls the mmu_notifier_retry() while mapping the guest
page to ensure we are not referencing a page when in range invalidation.

This call is inside a region locked with kvm->mmu_lock, which is the
same lock that is called by the KVM MMU notifier functions, thus
ensuring that no new notification can proceed while we are in the
locked region.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Alexander Graf <agraf@suse.de>
[Backported to 3.12 - Paolo]
Reviewed-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-10 11:40:08 +02:00
Paul Mackerras
cfc860253a KVM: PPC: Book3S HV: Fix typo in saving DSCR
This fixes a typo in the code that saves the guest DSCR (Data Stream
Control Register) into the kvm_vcpu_arch struct on guest exit.  The
effect of the typo was that the DSCR value was saved in the wrong place,
so changes to the DSCR by the guest didn't persist across guest exit
and entry, and some host kernel memory got corrupted.

Cc: stable@vger.kernel.org [v3.1+]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-10 11:40:07 +02:00
Linus Torvalds
39eda2aba6 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "Here's the powerpc batch for this merge window.  Some of the
  highlights are:

   - A bunch of endian fixes ! We don't have full LE support yet in that
     release but this contains a lot of fixes all over arch/powerpc to
     use the proper accessors, call the firmware with the right endian
     mode, etc...

   - A few updates to our "powernv" platform (non-virtualized, the one
     to run KVM on), among other, support for bridging the P8 LPC bus
     for UARTs, support and some EEH fixes.

   - Some mpc51xx clock API cleanups in preparation for a clock API
     overhaul

   - A pile of cleanups of our old math emulation code, including better
     support for using it to emulate optional FP instructions on
     embedded chips that otherwise have a HW FPU.

   - Some infrastructure in selftest, for powerpc now, but could be
     generalized, initially used by some tests for our perf instruction
     counting code.

   - A pile of fixes for hotplug on pseries (that was seriously
     bitrotting)

   - The usual slew of freescale embedded updates, new boards, 64-bit
     hiberation support, e6500 core PMU support, etc..."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
  powerpc: Correct FSCR bit definitions
  powerpc/xmon: Fix printing of set of CPUs in xmon
  powerpc/pseries: Move lparcfg.c to platforms/pseries
  powerpc/powernv: Return secondary CPUs to firmware on kexec
  powerpc/btext: Fix CONFIG_PPC_EARLY_DEBUG_BOOTX on ppc32
  powerpc: Cleanup handling of the DSCR bit in the FSCR register
  powerpc/pseries: Child nodes are not detached by dlpar_detach_node
  powerpc/pseries: Add mising of_node_put in delete_dt_node
  powerpc/pseries: Make dlpar_configure_connector parent node aware
  powerpc/pseries: Do all node initialization in dlpar_parse_cc_node
  powerpc/pseries: Fix parsing of initial node path in update_dt_node
  powerpc/pseries: Pack update_props_workarea to map correctly to rtas buffer header
  powerpc/pseries: Fix over writing of rtas return code in update_dt_node
  powerpc/pseries: Fix creation of loop in device node property list
  powerpc: Skip emulating & leave interrupts off for kernel program checks
  powerpc: Add more exception trampolines for hypervisor exceptions
  powerpc: Fix location and rename exception trampolines
  powerpc: Add more trap names to xmon
  powerpc/pseries: Add a warning in the case of cross-cpu VPA registration
  powerpc: Update the 00-Index in Documentation/powerpc
  ...
2013-09-06 10:49:42 -07:00
Linus Torvalds
45d9a2220f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "Unfortunately, this merge window it'll have a be a lot of small piles -
  my fault, actually, for not keeping #for-next in anything that would
  resemble a sane shape ;-/

  This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
  cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
  components) + several long-standing patches from various folks.

  There definitely will be a lot more (starting with Miklos'
  check_submount_and_drop() series)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  direct-io: Handle O_(D)SYNC AIO
  direct-io: Implement generic deferred AIO completions
  add formats for dentry/file pathnames
  kvm eventfd: switch to fdget
  powerpc kvm: use fdget
  switch fchmod() to fdget
  switch epoll_ctl() to fdget
  switch copy_module_from_fd() to fdget
  git simplify nilfs check for busy subtree
  ibmasmfs: don't bother passing superblock when not needed
  don't pass superblock to hypfs_{mkdir,create*}
  don't pass superblock to hypfs_diag_create_files
  don't pass superblock to hypfs_vm_create_files()
  oprofile: get rid of pointless forward declarations of struct super_block
  oprofilefs_create_...() do not need superblock argument
  oprofilefs_mkdir() doesn't need superblock argument
  don't bother with passing superblock to oprofile_create_stats_files()
  oprofile: don't bother with passing superblock to ->create_files()
  don't bother passing sb to oprofile_create_files()
  coh901318: don't open-code simple_read_from_buffer()
  ...
2013-09-05 08:50:26 -07:00
Al Viro
70abadedab powerpc kvm: use fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-03 23:04:45 -04:00
Gleb Natapov
a9f6cf965e Merge branch 'kvm-ppc-next' of git://github.com/agraf/linux-2.6 into queue
* 'kvm-ppc-next' of git://github.com/agraf/linux-2.6:
  KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
  KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
  KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
  KVM: PPC: Book3S: Fix compile error in XICS emulation
  KVM: PPC: Book3S PR: return appropriate error when allocation fails
  arch: powerpc: kvm: add signed type cast for comparation
  powerpc/kvm: Copy the pvr value after memset
  KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry
  kvm/ppc/booke: Don't call kvm_guest_enter twice
  kvm/ppc: Call trace_hardirqs_on before entry
  KVM: PPC: Book3S HV: Allow negative offsets to real-mode hcall handlers
  KVM: PPC: Book3S HV: Correct tlbie usage
  powerpc/kvm: Use 256K chunk to track both RMA and hash page table allocation.
  powerpc/kvm: Contiguous memory allocator based RMA allocation
  powerpc/kvm: Contiguous memory allocator based hash page table allocation
  KVM: PPC: Book3S: Ignore DABR register
  mm/cma: Move dma contiguous changes into a seperate config
2013-08-30 15:33:11 +03:00
Alexander Graf
bf550fc93d Merge remote-tracking branch 'origin/next' into kvm-ppc-next
Conflicts:
	mm/Kconfig

CMA DMA split and ZSWAP introduction were conflicting, fix up manually.
2013-08-29 00:41:59 +02:00
Paul Mackerras
7e48c101e0 KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
This reworks kvmppc_mmu_book3s_64_xlate() to make it check the large
page bit in the hashed page table entries (HPTEs) it looks at, and
to simplify and streamline the code.  The checking of the first dword
of each HPTE is now done with a single mask and compare operation,
and all the code dealing with the matching HPTE, if we find one,
is consolidated in one place in the main line of the function flow.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-29 00:05:50 +02:00
Paul Mackerras
8b23de2948 KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
It turns out that if we exit the guest due to a hcall instruction (sc 1),
and the loading of the instruction in the guest exit path fails for any
reason, the call to kvmppc_ld() in kvmppc_get_last_inst() fetches the
instruction after the hcall instruction rather than the hcall itself.
This in turn means that the instruction doesn't get recognized as an
hcall in kvmppc_handle_exit_pr() but gets passed to the guest kernel
as a sc instruction.  That usually results in the guest kernel getting
a return code of 38 (ENOSYS) from an hcall, which often triggers a
BUG_ON() or other failure.

This fixes the problem by adding a new variant of kvmppc_get_last_inst()
called kvmppc_get_last_sc(), which fetches the instruction if necessary
from pc - 4 rather than pc.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-28 16:47:49 +02:00
Paul Mackerras
9d1ffdd8f3 KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
Currently the code assumes that once we load up guest FP/VSX or VMX
state into the CPU, it stays valid in the CPU registers until we
explicitly flush it to the thread_struct.  However, on POWER7,
copy_page() and memcpy() can use VMX.  These functions do flush the
VMX state to the thread_struct before using VMX instructions, but if
this happens while we have guest state in the VMX registers, and we
then re-enter the guest, we don't reload the VMX state from the
thread_struct, leading to guest corruption.  This has been observed
to cause guest processes to segfault.

To fix this, we check before re-entering the guest that all of the
bits corresponding to facilities owned by the guest, as expressed
in vcpu->arch.guest_owned_ext, are set in current->thread.regs->msr.
Any bits that have been cleared correspond to facilities that have
been used by kernel code and thus flushed to the thread_struct, so
for them we reload the state from the thread_struct.

We also need to check current->thread.regs->msr before calling
giveup_fpu() or giveup_altivec(), since if the relevant bit is
clear, the state has already been flushed to the thread_struct and
to flush it again would corrupt it.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-28 16:41:14 +02:00
Paul Mackerras
7bfa9ad55d KVM: PPC: Book3S: Fix compile error in XICS emulation
Commit 8e44ddc3f3 ("powerpc/kvm/book3s: Add support for H_IPOLL and
H_XIRR_X in XICS emulation") added a call to get_tb() but didn't
include the header that defines it, and on some configs this means
book3s_xics.c fails to compile:

arch/powerpc/kvm/book3s_xics.c: In function ‘kvmppc_xics_hcall’:
arch/powerpc/kvm/book3s_xics.c:812:3: error: implicit declaration of function ‘get_tb’ [-Werror=implicit-function-declaration]

Cc: stable@vger.kernel.org [v3.10, v3.11]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-28 16:28:47 +02:00
Thadeu Lima de Souza Cascardo
7c7b406e6b KVM: PPC: Book3S PR: return appropriate error when allocation fails
err was overwritten by a previous function call, and checked to be 0. If
the following page allocation fails, 0 is going to be returned instead
of -ENOMEM.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-28 16:26:33 +02:00
Chen Gang
5d226ae56f arch: powerpc: kvm: add signed type cast for comparation
'rmls' is 'unsigned long', lpcr_rmls() will return negative number when
failure occurs, so it need a type cast for comparing.

'lpid' is 'unsigned long', kvmppc_alloc_lpid() return negative number
when failure occurs, so it need a type cast for comparing.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-28 16:23:35 +02:00
Benjamin Herrenschmidt
3f1f431188 Merge branch 'merge' into next
Merge stuff that already went into Linus via "merge" which
are pre-reqs for subsequent patches
2013-08-27 15:03:30 +10:00
Yann Droneaud
2f84d5ea6f ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flag
KVM uses anon_inode_get() to allocate file descriptors as part
of some of its ioctls. But those ioctls are lacking a flag argument
allowing userspace to choose options for the newly opened file descriptor.

In such case it's advised to use O_CLOEXEC by default so that
userspace is allowed to choose, without race, if the file descriptor
is going to be inherited across exec().

This patch set O_CLOEXEC flag on all file descriptors created
with anon_inode_getfd() to not leak file descriptors across exec().

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-26 13:19:56 +03:00
Aneesh Kumar K.V
87916442bd powerpc/kvm: Copy the pvr value after memset
Otherwise we would clear the pvr value

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-08-22 19:25:41 +02:00
Anton Blanchard
54bb7f4bda powerpc: Make rwlocks endian safe
Our ppc64 spinlocks and rwlocks use a trick where a lock token and
the paca index are placed in the lock with a single store. Since we
are using two u16s they need adjusting for little endian.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14 15:33:40 +10:00
Anton Blanchard
7ffcf8ec26 powerpc: Fix little endian lppaca, slb_shadow and dtl_entry
The lppaca, slb_shadow and dtl_entry hypervisor structures are
big endian, so we have to byte swap them in little endian builds.

LE KVM hosts will also need to be fixed but for now add an #error
to remind us.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14 15:33:35 +10:00
Anton Blanchard
f13c13a005 powerpc: Stop using non-architected shared_proc field in lppaca
Although the shared_proc field in the lppaca works today, it is
not architected. A shared processor partition will always have a non
zero yield_count so use that instead. Create a wrapper so users
don't have to know about the details.

In order for older kernels to continue to work on KVM we need
to set the shared_proc bit. While here, remove the ugly bitfield.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14 11:50:26 +10:00
Thadeu Lima de Souza Cascardo
e0e1361462 powerpc/kvm/book3s_pr: Return appropriate error when allocation fails
err was overwritten by a previous function call, and checked to be 0. If
the following page allocation fails, 0 is going to be returned instead
of -ENOMEM.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-09 18:06:54 +10:00
Chen Gang
2fb10672c8 powerpc/kvm: Add signed type cast for comparation
'rmls' is 'unsigned long', lpcr_rmls() will return negative number when
failure occurs, so it need a type cast for comparing.

'lpid' is 'unsigned long', kvmppc_alloc_lpid() return negative number
when failure occurs, so it need a type cast for comparing.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-09 18:06:51 +10:00
Hongtao Jia
9123c5ed45 powerpc: Move opcode definitions from kvm/emulate.c to asm/ppc-opcode.h
Opcode and xopcode are useful definitions not just for KVM. Move these
definitions to asm/ppc-opcode.h for public use.

Also add the opcodes for LHAUX and LWZUX.

Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
[scottwood@freesacle.com: update commit message and rebase]
Signed-off-by: Scott Wood <scottwood@freescale.com>
2013-07-30 15:50:07 -05:00
Paul Mackerras
c8ae0ace10 KVM: PPC: Book3S PR: Load up SPRG3 register with guest value on guest entry
Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
code, and is used in recent kernels to store the CPU and NUMA node
numbers so that they can be read by VDSO functions.  Thus we need to
load the guest's SPRG3 value into the real SPRG3 register when entering
the guest, and restore the host's value when exiting the guest.  We don't
need to save the guest SPRG3 value when exiting the guest as usermode
code can't modify SPRG3.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-25 15:33:09 +02:00
Takuya Yoshikawa
e59dbe09f8 KVM: Introduce kvm_arch_memslots_updated()
This is called right after the memslots is updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots().  Since
the memslots needs to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.

In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:25 +02:00
Scott Wood
c09ec198eb kvm/ppc/booke: Don't call kvm_guest_enter twice
kvm_guest_enter() was already called by kvmppc_prepare_to_enter().
Don't call it again.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-11 00:58:49 +02:00
Scott Wood
5f1c248f52 kvm/ppc: Call trace_hardirqs_on before entry
Currently this is only being done on 64-bit.  Rather than just move it
out of the 64-bit ifdef, move it to kvm_lazy_ee_enable() so that it is
consistent with lazy ee state, and so that we don't track more host
code as interrupts-enabled than necessary.

Rename kvm_lazy_ee_enable() to kvm_fix_ee_before_entry() to reflect
that this function now has a role on 32-bit as well.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-11 00:51:28 +02:00
Paul Mackerras
4baa1d871c KVM: PPC: Book3S HV: Allow negative offsets to real-mode hcall handlers
The table of offsets to real-mode hcall handlers in book3s_hv_rmhandlers.S
can contain negative values, if some of the handlers end up before the
table in the vmlinux binary.  Thus we need to use a sign-extending load
to read the values in the table rather than a zero-extending load.
Without this, the host crashes when the guest does one of the hcalls
with negative offsets, due to jumping to a bogus address.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-10 13:14:16 +02:00
Paul Mackerras
5448050124 KVM: PPC: Book3S HV: Correct tlbie usage
This corrects the usage of the tlbie (TLB invalidate entry) instruction
in HV KVM.  The tlbie instruction changed between PPC970 and POWER7.
On the PPC970, the bit to select large vs. small page is in the instruction,
not in the RB register value.  This changes the code to use the correct
form on PPC970.

On POWER7 we were calculating the AVAL (Abbreviated Virtual Address, Lower)
field of the RB value incorrectly for 64k pages.  This fixes it.

Since we now have several cases to handle for the tlbie instruction, this
factors out the code to do a sequence of tlbies into a new function,
do_tlbies(), and calls that from the various places where the code was
doing tlbie instructions inline.  It also makes kvmppc_h_bulk_remove()
use the same global_invalidates() function for determining whether to do
local or global TLB invalidations as is used in other places, for
consistency, and also to make sure that kvm->arch.need_tlb_flush gets
updated properly.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-10 13:14:09 +02:00
Aneesh Kumar K.V
990978e993 powerpc/kvm: Use 256K chunk to track both RMA and hash page table allocation.
Both RMA and hash page table request will be a multiple of 256K. We can use
a chunk size of 256K to track the free/used 256K chunk in the bitmap. This
should help to reduce the bitmap size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-08 16:21:13 +02:00
Aneesh Kumar K.V
6c45b81098 powerpc/kvm: Contiguous memory allocator based RMA allocation
Older version of power architecture use Real Mode Offset register and Real Mode Limit
Selector for mapping guest Real Mode Area. The guest RMA should be physically
contigous since we use the range when address translation is not enabled.

This patch switch RMA allocation code to use contigous memory allocator. The patch
also remove the the linear allocator which not used any more

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-08 16:20:20 +02:00
Aneesh Kumar K.V
fa61a4e376 powerpc/kvm: Contiguous memory allocator based hash page table allocation
Powerpc architecture uses a hash based page table mechanism for mapping virtual
addresses to physical address. The architecture require this hash page table to
be physically contiguous. With KVM on Powerpc currently we use early reservation
mechanism for allocating guest hash page table. This implies that we need to
reserve a big memory region to ensure we can create large number of guest
simultaneously with KVM on Power. Another disadvantage is that the reserved memory
is not available to rest of the subsystems and and that implies we limit the total
available memory in the host.

This patch series switch the guest hash page table allocation to use
contiguous memory allocator.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-08 16:19:58 +02:00
Alexander Graf
f35320288c KVM: PPC: Book3S: Ignore DABR register
We don't emulate breakpoints yet, so just ignore reads and writes
to / from DABR.

This fixes booting of more recent Linux guest kernels for me.

Reported-by: Nello Martuscielli <ppc.addon@gmail.com>
Tested-by: Nello Martuscielli <ppc.addon@gmail.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-07-08 16:18:20 +02:00
Linus Torvalds
65b97fb730 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "This is the powerpc changes for the 3.11 merge window.  In addition to
  the usual bug fixes and small updates, the main highlights are:

   - Support for transparent huge pages by Aneesh Kumar for 64-bit
     server processors.  This allows the use of 16M pages as transparent
     huge pages on kernels compiled with a 64K base page size.

   - Base VFIO support for KVM on power by Alexey Kardashevskiy

   - Wiring up of our nvram to the pstore infrastructure, including
     putting compressed oopses in there by Aruna Balakrishnaiah

   - Move, rework and improve our "EEH" (basically PCI error handling
     and recovery) infrastructure.  It is no longer specific to pseries
     but is now usable by the new "powernv" platform as well (no
     hypervisor) by Gavin Shan.

   - I fixed some bugs in our math-emu instruction decoding and made it
     usable to emulate some optional FP instructions on processors with
     hard FP that lack them (such as fsqrt on Freescale embedded
     processors).

   - Support for Power8 "Event Based Branch" facility by Michael
     Ellerman.  This facility allows what is basically "userspace
     interrupts" for performance monitor events.

   - A bunch of Transactional Memory vs.  Signals bug fixes and HW
     breakpoint/watchpoint fixes by Michael Neuling.

  And more ...  I appologize in advance if I've failed to highlight
  something that somebody deemed worth it."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
  pstore: Add hsize argument in write_buf call of pstore_ftrace_call
  powerpc/fsl: add MPIC timer wakeup support
  powerpc/mpic: create mpic subsystem object
  powerpc/mpic: add global timer support
  powerpc/mpic: add irq_set_wake support
  powerpc/85xx: enable coreint for all the 64bit boards
  powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
  powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
  powerpc/mpic: Add get_version API both for internal and external use
  powerpc: Handle both new style and old style reserve maps
  powerpc/hw_brk: Fix off by one error when validating DAWR region end
  powerpc/pseries: Support compression of oops text via pstore
  powerpc/pseries: Re-organise the oops compression code
  pstore: Pass header size in the pstore write callback
  powerpc/powernv: Fix iommu initialization again
  powerpc/pseries: Inform the hypervisor we are using EBB regs
  powerpc/perf: Add power8 EBB support
  powerpc/perf: Core EBB support for 64-bit book3s
  powerpc/perf: Drop MMCRA from thread_struct
  powerpc/perf: Don't enable if we have zero events
  ...
2013-07-04 10:29:23 -07:00
Linus Torvalds
fe489bf450 KVM fixes for 3.11
On the x86 side, there are some optimizations and documentation updates.
 The big ARM/KVM change for 3.11, support for AArch64, will come through
 Catalin Marinas's tree.  s390 and PPC have misc cleanups and bugfixes.
 
 There is a conflict due to "s390/pgtable: fix ipte notify bit" having
 entered 3.10 through Martin Schwidefsky's s390 tree.  This pull request
 has additional changes on top, so this tree's version is the correct one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
 6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
 UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
 ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
 dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
 +QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
 1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
 qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
 3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
 RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
 LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
 xBohzi49L9FDbpOnTYfz
 =1zpG
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "On the x86 side, there are some optimizations and documentation
  updates.  The big ARM/KVM change for 3.11, support for AArch64, will
  come through Catalin Marinas's tree.  s390 and PPC have misc cleanups
  and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
  KVM: PPC: Ignore PIR writes
  KVM: PPC: Book3S PR: Invalidate SLB entries properly
  KVM: PPC: Book3S PR: Allow guest to use 1TB segments
  KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
  KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
  KVM: PPC: Book3S PR: Fix proto-VSID calculations
  KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
  KVM: Fix RTC interrupt coalescing tracking
  kvm: Add a tracepoint write_tsc_offset
  KVM: MMU: Inform users of mmio generation wraparound
  KVM: MMU: document fast invalidate all mmio sptes
  KVM: MMU: document fast invalidate all pages
  KVM: MMU: document fast page fault
  KVM: MMU: document mmio page fault
  KVM: MMU: document write_flooding_count
  KVM: MMU: document clear_spte_count
  KVM: MMU: drop kvm_mmu_zap_mmio_sptes
  KVM: MMU: init kvm generation close to mmio wrap-around value
  KVM: MMU: add tracepoint for check_mmio_spte
  KVM: MMU: fast invalidate all mmio sptes
  ...
2013-07-03 13:21:40 -07:00
Linus Torvalds
ab3d681e9d Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The major changes:

  - Simplify RCU's grace-period and callback processing based on the new
    numbering for callbacks.

  - Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for
    single-CPU low-latency systems.

  - SRCU-related changes and fixes.

  - Miscellaneous fixes, including converting a few remaining printk()
    calls to pr_*().

  - Documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rcu: Shrink TINY_RCU by reworking CPU-stall ifdefs
  rcu: Shrink TINY_RCU by moving exit_rcu()
  rcu: Remove TINY_PREEMPT_RCU tracing documentation
  rcu: Consolidate rcutiny_plugin.h ifdefs
  rcu: Remove rcu_preempt_note_context_switch()
  rcu: Remove the CONFIG_TINY_RCU ifdefs in rcutiny.h
  rcu: Remove check_cpu_stall_preempt()
  rcu: Simplify RCU_TINY RCU callback invocation
  rcu: Remove rcu_preempt_process_callbacks()
  rcu: Remove rcu_preempt_remove_callbacks()
  rcu: Remove rcu_preempt_check_callbacks()
  rcu: Remove show_tiny_preempt_stats()
  rcu: Remove TINY_PREEMPT_RCU
  powerpc,kvm: fix imbalance srcu_read_[un]lock()
  rcu: Remove srcu_read_lock_raw() and srcu_read_unlock_raw().
  rcu: Apply Dave Jones's NOCB Kconfig help feedback
  rcu: Merge adjacent identical ifdefs
  rcu: Drive quiescent-state-forcing delay from HZ
  rcu: Remove "Experimental" flags
  kthread: Add kworker kthreads to OS-jitter documentation
  ...
2013-07-02 16:13:29 -07:00
Benjamin Herrenschmidt
24a72acac1 Linux 3.10
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJR0K2gAAoJEHm+PkMAQRiGWsEH+gMZSN1qRm34hZ82q1Tx7HvL
 Eb/Gsl3Qw/7G2TlTqgjBUs36IdqV9O2cui/aa3/TfXvdvrx+0GlhRkEwQPc+ygcO
 Mvoyoke4tT4+4jVFdCg1J8avREsa28/6oaHs0ZZxuVmJBBLTJH7aXaNsGn6eU1q9
 9+p798MQis6naIiPC63somlZcCIiBhsuWCPWpEfLMn8G1HWAFTM3xXIbNBqe/brS
 bmIOfhomlIZ5dcdaXGvjtP3+KJhkNDwhkPC4tVYu8JqqgSlrE+a+EGyEuuGqKk10
 U+swiqyuD31uBI9ga54u/2FzSqDiAu6YOcMXevjo/m3g9XLdYbYLvN+nvN8alCQ=
 =Ob6Z
 -----END PGP SIGNATURE-----

Merge tag 'v3.10' into next

Merge 3.10 in order to get some of the last minute powerpc
changes, resolve conflicts and add additional fixes on top
of them.
2013-07-01 17:57:25 +10:00
Alexander Graf
a3ff5fbc94 KVM: PPC: Ignore PIR writes
While technically it's legal to write to PIR and have the identifier changed,
we don't implement logic to do so because we simply expose vcpu_id to the guest.

So instead, let's ignore writes to PIR. This ensures that we don't inject faults
into the guest for something the guest is allowed to do. While at it, we cross
our fingers hoping that it also doesn't mind that we broke its PIR read values.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:22 +02:00
Paul Mackerras
681562cd56 KVM: PPC: Book3S PR: Invalidate SLB entries properly
At present, if the guest creates a valid SLB (segment lookaside buffer)
entry with the slbmte instruction, then invalidates it with the slbie
instruction, then reads the entry with the slbmfee/slbmfev instructions,
the result of the slbmfee will have the valid bit set, even though the
entry is not actually considered valid by the host.  This is confusing,
if not worse.  This fixes it by zeroing out the orige and origv fields
of the SLB entry structure when the entry is invalidated.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:22 +02:00
Paul Mackerras
0f296829b5 KVM: PPC: Book3S PR: Allow guest to use 1TB segments
With this, the guest can use 1TB segments as well as 256MB segments.
Since we now have the situation where a single emulated guest segment
could correspond to multiple shadow segments (as the shadow segments
are still 256MB segments), this adds a new kvmppc_mmu_flush_segment()
to scan for all shadow segments that need to be removed.

This restructures the guest HPT (hashed page table) lookup code to
use the correct hashing and matching functions for HPTEs within a
1TB segment.  We use the standard hpt_hash() function instead of
open-coding the hash calculation, and we use HPTE_V_COMPARE() with
an AVPN value that has the B (segment size) field included.  The
calculation of avpn is done a little earlier since it doesn't change
in the loop starting at the do_second label.

The computation in kvmppc_mmu_book3s_64_esid_to_vsid() changes so that
it returns a 256MB VSID even if the guest SLB entry is a 1TB entry.
This is because the users of this function are creating 256MB SLB
entries.  We set a new VSID_1T flag so that entries created from 1T
segments don't collide with entries from 256MB segments.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:22 +02:00
Paul Mackerras
6ed1485f65 KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
The loop in kvmppc_mmu_book3s_64_xlate() that looks up a translation
in the guest hashed page table (HPT) keeps going if it finds an
HPTE that matches but doesn't allow access.  This is incorrect; it
is different from what the hardware does, and there should never be
more than one matching HPTE anyway.  This fixes it to stop when any
matching HPTE is found.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:22 +02:00
Paul Mackerras
bc1bc4e392 KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
On entering a PR KVM guest, we invalidate the whole SLB before loading
up the guest entries.  We do this using an slbia instruction, which
invalidates all entries except entry 0, followed by an slbie to
invalidate entry 0.  However, the slbie turns out to be ineffective
in some circumstances (specifically when the host linear mapping uses
64k pages) because of errors in computing the parameter to the slbie.
The result is that the guest kernel hangs very early in boot because
it takes a DSI the first time it tries to access kernel data using
a linear mapping address in real mode.

Currently we construct bits 36 - 43 (big-endian numbering) of the slbie
parameter by taking bits 56 - 63 of the SLB VSID doubleword.  These bits
for the tlbie are C (class, 1 bit), B (segment size, 2 bits) and 5
reserved bits.  For the SLB VSID doubleword these are C (class, 1 bit),
reserved (1 bit), LP (large page size, 2 bits), and 4 reserved bits.
Thus we are not setting the B field correctly, and when LP = 01 as
it is for 64k pages, we are setting a reserved bit.

Rather than add more instructions to calculate the slbie parameter
correctly, this takes a simpler approach, which is to set entry 0 to
zeroes explicitly.  Normally slbmte should not be used to invalidate
an entry, since it doesn't invalidate the ERATs, but it is OK to use
it to invalidate an entry if it is immediately followed by slbia,
which does invalidate the ERATs.  (This has been confirmed with the
Power architects.)  This approach takes fewer instructions and will
work whatever the contents of entry 0.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:21 +02:00
Paul Mackerras
8ed7b7e9d2 KVM: PPC: Book3S PR: Fix proto-VSID calculations
This makes sure the calculation of the proto-VSIDs used by PR KVM
is done with 64-bit arithmetic.  Since vcpu3s->context_id[] is int,
when we do vcpu3s->context_id[0] << ESID_BITS the shift will be done
with 32-bit instructions, possibly leading to significant bits
getting lost, as the context id can be up to 524283 and ESID_BITS is
18.  To fix this we cast the context id to u64 before shifting.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:21 +02:00
Tiejun Chen
5f17ce8b95 KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
Availablity of the doorbell_exception function is guarded by
CONFIG_PPC_DOORBELL. Use the same define to guard our caller
of it.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
[agraf: improve patch description]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-06-30 03:33:21 +02:00
Aneesh Kumar K.V
db7cb5b924 powerpc/kvm: Handle transparent hugepage in KVM
We can find pte that are splitting while walking page tables. Return
None pte in that case.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21 16:01:55 +10:00
Aneesh Kumar K.V
12bc9f6fc1 powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21 16:01:54 +10:00
Aneesh Kumar K.V
db3d853490 powerpc/mm: handle hugepage size correctly when invalidating hpte entries
If a hash bucket gets full, we "evict" a more/less random entry from it.
When we do that we don't invalidate the TLB (hpte_remove) because we assume
the old translation is still technically "valid". This implies that when
we are invalidating or updating pte, even if HPTE entry is not valid
we should do a tlb invalidate. With hugepages, we need to pass the correct
actual page size value for tlb invalidation.

This change update the patch 0608d69246
"powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle
transparent hugepages correctly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21 16:01:52 +10:00
Ingo Molnar
b1fe9987b7 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

"The major changes for this series are:

 1.      Simplify RCU's grace-period and callback processing based on
         the new numbering for callbacks.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/330.

 2.      Documentation updates.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/348.

 3.      Miscellaneous fixes, including converting a few remaining printk()
         calls to pr_*().  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/324.

 4.      SRCU-related changes and fixes.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/425.

 5.      Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for
         single-CPU low-latency systems.  These were posted to LKML at
         https://lkml.org/lkml/2013/5/20/427."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 13:48:57 +02:00
Scott Wood
f8941fbe20 kvm/ppc/booke: Delay kvmppc_lazy_ee_enable
kwmppc_lazy_ee_enable() should be called as late as possible,
or else we get things like WARN_ON(preemptible()) in enable_kernel_fp()
in configurations where preemptible() works.

Note that book3s_pr already waits until just before __kvmppc_vcpu_run
to call kvmppc_lazy_ee_enable().

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-19 12:15:13 +02:00
Scott Wood
7c11c0ccc7 kvm/ppc/booke64: Fix lazy ee handling in kvmppc_handle_exit()
EE is hard-disabled on entry to kvmppc_handle_exit(), so call
hard_irq_disable() so that PACA_IRQ_HARD_DIS is set, and soft_enabled
is unset.

Without this, we get warnings such as arch/powerpc/kernel/time.c:300,
and sometimes host kernel hangs.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-11 11:11:00 +03:00
Scott Wood
f1e89028f0 kvm/ppc/booke: Hold srcu lock when calling gfn functions
KVM core expects arch code to acquire the srcu lock when calling
gfn_to_memslot and similar functions.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-11 11:10:59 +03:00
Scott Wood
2b6398fcf2 kvm/ppc/booke64: Disable e6500 support
The previous patch made 64-bit booke KVM build again, but Altivec
support is still not complete, and we can't prevent the guest from
turning on Altivec (which can corrupt host state until state
save/restore is implemented).  Disable e6500 on KVM until this is
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-11 11:10:56 +03:00
Lai Jiangshan
505d6421bd powerpc,kvm: fix imbalance srcu_read_[un]lock()
At the point of up_out label in kvmppc_hv_setup_htab_rma(),
srcu read lock is still held.

We have to release it before return.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: kvm@vger.kernel.org
Cc: kvm-ppc@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-06-10 13:45:26 -07:00
Paul Mackerras
8e44ddc3f3 powerpc/kvm/book3s: Add support for H_IPOLL and H_XIRR_X in XICS emulation
This adds the remaining two hypercalls defined by PAPR for manipulating
the XICS interrupt controller, H_IPOLL and H_XIRR_X.  H_IPOLL returns
information about the priority and pending interrupts for a virtual
cpu, without changing any state.  H_XIRR_X is like H_XIRR in that it
reads and acknowledges the highest-priority pending interrupt, but it
also returns the timestamp (timebase register value) from when the
interrupt was first received by the hypervisor.  Currently we just
return the current time, since we don't do any software queueing of
virtual interrupts inside the XICS emulation code.

These hcalls are not currently used by Linux guests, but may be in
future.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01 08:29:27 +10:00
Marc Zyngier
535cf7b3b1 KVM: get rid of $(addprefix ../../../virt/kvm/, ...) in Makefiles
As requested by the KVM maintainers, remove the addprefix used to
refer to the main KVM code from the arch code, and replace it with
a KVM variable that does the same thing.

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Christoffer Dall <cdall@cs.columbia.edu>
Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-19 15:14:00 +03:00
Linus Torvalds
01227a889e Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Gleb Natapov:
 "Highlights of the updates are:

  general:
   - new emulated device API
   - legacy device assignment is now optional
   - irqfd interface is more generic and can be shared between arches

  x86:
   - VMCS shadow support and other nested VMX improvements
   - APIC virtualization and Posted Interrupt hardware support
   - Optimize mmio spte zapping

  ppc:
    - BookE: in-kernel MPIC emulation with irqfd support
    - Book3S: in-kernel XICS emulation (incomplete)
    - Book3S: HV: migration fixes
    - BookE: more debug support preparation
    - BookE: e6500 support

  ARM:
   - reworking of Hyp idmaps

  s390:
   - ioeventfd for virtio-ccw

  And many other bug fixes, cleanups and improvements"

* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  kvm: Add compat_ioctl for device control API
  KVM: x86: Account for failing enable_irq_window for NMI window request
  KVM: PPC: Book3S: Add API for in-kernel XICS emulation
  kvm/ppc/mpic: fix missing unlock in set_base_addr()
  kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
  kvm/ppc/mpic: remove users
  kvm/ppc/mpic: fix mmio region lists when multiple guests used
  kvm/ppc/mpic: remove default routes from documentation
  kvm: KVM_CAP_IOMMU only available with device assignment
  ARM: KVM: iterate over all CPUs for CPU compatibility check
  KVM: ARM: Fix spelling in error message
  ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
  KVM: ARM: Fix API documentation for ONE_REG encoding
  ARM: KVM: promote vfp_host pointer to generic host cpu context
  ARM: KVM: add architecture specific hook for capabilities
  ARM: KVM: perform HYP initilization for hotplugged CPUs
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: rework HYP page table freeing
  ARM: KVM: enforce maximum size for identity mapped code
  ARM: KVM: move to a KVM provided HYP idmap
  ...
2013-05-05 14:47:31 -07:00
Linus Torvalds
5a148af669 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc update from Benjamin Herrenschmidt:
 "The main highlights this time around are:

   - A pile of addition POWER8 bits and nits, such as updated
     performance counter support (Michael Ellerman), new branch history
     buffer support (Anshuman Khandual), base support for the new PCI
     host bridge when not using the hypervisor (Gavin Shan) and other
     random related bits and fixes from various contributors.

   - Some rework of our page table format by Aneesh Kumar which fixes a
     thing or two and paves the way for THP support.  THP itself will
     not make it this time around however.

   - More Freescale updates, including Altivec support on the new e6500
     cores, new PCI controller support, and a pile of new boards support
     and updates.

   - The usual batch of trivial cleanups & fixes"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
  powerpc: Fix build error for book3e
  powerpc: Context switch the new EBB SPRs
  powerpc: Turn on the EBB H/FSCR bits
  powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
  powerpc: Setup BHRB instructions facility in HFSCR for POWER8
  powerpc: Fix interrupt range check on debug exception
  powerpc: Update tlbie/tlbiel as per ISA doc
  powerpc: Print page size info during boot
  powerpc: print both base and actual page size on hash failure
  powerpc: Fix hpte_decode to use the correct decoding for page sizes
  powerpc: Decode the pte-lp-encoding bits correctly.
  powerpc: Use encode avpn where we need only avpn values
  powerpc: Reduce PTE table memory wastage
  powerpc: Move the pte free routines from common header
  powerpc: Reduce the PTE_INDEX_SIZE
  powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
  powerpc: New hugepage directory format
  powerpc: Don't truncate pgd_index wrongly
  powerpc: Don't hard code the size of pte page
  powerpc: Save DAR and DSISR in pt_regs on MCE
  ...
2013-05-02 10:16:16 -07:00
Paul Mackerras
5975a2e095 KVM: PPC: Book3S: Add API for in-kernel XICS emulation
This adds the API for userspace to instantiate an XICS device in a VM
and connect VCPUs to it.  The API consists of a new device type for
the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
which is used to assert and deassert interrupt inputs of the XICS.

The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
Each attribute within this group corresponds to the state of one
interrupt source.  The attribute number is the same as the interrupt
source number.

This does not support irq routing or irqfd yet.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02 15:28:36 +02:00
Wei Yongjun
d133b40f2c kvm/ppc/mpic: fix missing unlock in set_base_addr()
Add the missing unlock before return from function set_base_addr()
when disables the mapping.

Introduced by commit 5df554ad5b
(kvm/ppc/mpic: in-kernel MPIC emulation)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02 15:28:35 +02:00
Scott Wood
ed840ee9c8 kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
These functions do an srcu_dereference without acquiring the srcu lock
themselves.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02 15:28:35 +02:00
Scott Wood
1d6f6b7339 kvm/ppc/mpic: remove users
This is an unused (no pun intended) leftover from when this code did
reference counting.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02 15:28:34 +02:00
Scott Wood
398d87836e kvm/ppc/mpic: fix mmio region lists when multiple guests used
Keeping a linked list of statically defined objects doesn't work
very well when we have multiple guests. :-P

Switch to an array of constant objects.  This fixes a hang when
multiple guests are used.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: remove struct list_head from mem_reg]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02 15:28:33 +02:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Linus Torvalds
5d434fcb25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small
  code cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
  mm: Convert print_symbol to %pSR
  gfs2: Convert print_symbol to %pSR
  m32r: Convert print_symbol to %pSR
  iostats.txt: add easy-to-find description for field 6
  x86 cmpxchg.h: fix wrong comment
  treewide: Fix typo in printk and comments
  doc: devicetree: Fix various typos
  docbook: fix 8250 naming in device-drivers
  pata_pdc2027x: Fix compiler warning
  treewide: Fix typo in printks
  mei: Fix comments in drivers/misc/mei
  treewide: Fix typos in kernel messages
  pm44xx: Fix comment for "CONFIG_CPU_IDLE"
  doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
  mmzone: correct "pags" to "pages" in comment.
  kernel-parameters: remove outdated 'noresidual' parameter
  Remove spurious _H suffixes from ifdef comments
  sound: Remove stray pluses from Kconfig file
  radio-shark: Fix printk "CONFIG_LED_CLASS"
  doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
  ...
2013-04-30 09:36:50 -07:00
Aneesh Kumar K.V
b1022fbd29 powerpc: Decode the pte-lp-encoding bits correctly.
We look at both the segment base page size and actual page size and store
the pte-lp-encodings in an array per base page size.

We also update all relevant functions to take actual page size argument
so that we can use the correct PTE LP encoding in HPTE. This should also
get the basic Multiple Page Size per Segment (MPSS) support. This is needed
to enable THP on ppc64.

[Fixed PR KVM build --BenH]

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30 16:00:14 +10:00
Paul Mackerras
8b78645c93 KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state
This adds the ability for userspace to save and restore the state
of the XICS interrupt presentation controllers (ICPs) via the
KVM_GET/SET_ONE_REG interface.  Since there is one ICP per vcpu, we
simply define a new 64-bit register in the ONE_REG space for the ICP
state.  The state includes the CPU priority setting, the pending IPI
priority, and the priority and source number of any pending external
interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:34 +02:00
Paul Mackerras
d19bd86204 KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls
This adds support for the ibm,int-on and ibm,int-off RTAS calls to the
in-kernel XICS emulation and corrects the handling of the saved
priority by the ibm,set-xive RTAS call.  With this, ibm,int-off sets
the specified interrupt's priority in its saved_priority field and
sets the priority to 0xff (the least favoured value).  ibm,int-on
restores the saved_priority to the priority field, and ibm,set-xive
sets both the priority and the saved_priority to the specified
priority value.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:33 +02:00
Paul Mackerras
4619ac88b7 KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts
This streamlines our handling of external interrupts that come in
while we're in the guest.  First, when waking up a hardware thread
that was napping, we split off the "napping due to H_CEDE" case
earlier, and use the code that handles an external interrupt (0x500)
in the guest to handle that too.  Secondly, the code that handles
those external interrupts now checks if any other thread is exiting
to the host before bouncing an external interrupt to the guest, and
also checks that there is actually an external interrupt pending for
the guest before setting the LPCR MER bit (mediated external request).

This also makes sure that we clear the "ceded" flag when we handle a
wakeup from cede in real mode, and fixes a potential infinite loop
in kvmppc_run_vcpu() which can occur if we ever end up with the ceded
flag set but MSR[EE] off.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:32 +02:00
Benjamin Herrenschmidt
e7d26f285b KVM: PPC: Book3S HV: Add support for real mode ICP in XICS emulation
This adds an implementation of the XICS hypercalls in real mode for HV
KVM, which allows us to avoid exiting the guest MMU context on all
threads for a variety of operations such as fetching a pending
interrupt, EOI of messages, IPIs, etc.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:32 +02:00
Benjamin Herrenschmidt
54695c3088 KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM
Currently, we wake up a CPU by sending a host IPI with
smp_send_reschedule() to thread 0 of that core, which will take all
threads out of the guest, and cause them to re-evaluate their
interrupt status on the way back in.

This adds a mechanism to differentiate real host IPIs from IPIs sent
by KVM for guest threads to poke each other, in order to target the
guest threads precisely when possible and avoid that global switch of
the core to host state.

We then use this new facility in the in-kernel XICS code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:31 +02:00
Benjamin Herrenschmidt
bc5ad3f370 KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Each vcpu gets one ICP (interrupt controller presentation) entity,
used to store the per-vcpu state such as vcpu priority, pending
interrupt state, IPI request, etc.

This does not include any API or any way to connect vcpus to their
ICP state; that will be added in later patches.

This is based on an initial implementation by Michael Ellerman
<michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix typo, add dependency on !KVM_MPIC]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:30 +02:00
Michael Ellerman
8e591cb720 KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
For pseries machine emulation, in order to move the interrupt
controller code to the kernel, we need to intercept some RTAS
calls in the kernel itself.  This adds an infrastructure to allow
in-kernel handlers to be registered for RTAS services by name.
A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
associate token values with those service names.  Then, when the
guest requests an RTAS service with one of those token values, it
will be handled by the relevant in-kernel handler rather than being
passed up to userspace as at present.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix warning]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:29 +02:00
Scott Wood
91194919a6 kvm/ppc/mpic: Eliminate mmio_mapped
We no longer need to keep track of this now that MPIC destruction
always happens either during VM destruction (after MMIO has been
destroyed) or during a failed creation (before the fd has been exposed
to userspace, and thus before the MMIO region could have been
registered).

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:28 +02:00
Scott Wood
07f0a7bdec kvm: destroy emulated devices on VM exit
The hassle of getting refcounting right was greater than the hassle
of keeping a list of devices to destroy on VM exit.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:28 +02:00
Alexander Graf
447a03c02a KVM: PPC: MPIC: Restrict to e500 platforms
The code as is doesn't make any sense on non-e500 platforms. Restrict it
there, so that people don't get wrong ideas on what would actually work.

This patch should get reverted as soon as it's possible to either run e500
guests on non-e500 hosts or the MPIC emulation gains support for non-e500
modes.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:26 +02:00
Alexander Graf
5efdb4be59 KVM: PPC: MPIC: Add support for KVM_IRQ_LINE
Now that all pieces are in place for reusing generic irq infrastructure,
we can copy x86's implementation of KVM_IRQ_LINE irq injection and simply
reuse it for PPC, as it will work there just as well.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:25 +02:00
Alexander Graf
de9ba2f363 KVM: PPC: Support irq routing and irqfd for in-kernel MPIC
Now that all the irq routing and irqfd pieces are generic, we can expose
real irqchip support to all of KVM's internal helpers.

This allows us to use irqfd with the in-kernel MPIC.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:25 +02:00
Scott Wood
eb1e4f43e0 kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub functions for kvmppc_mpic_{dis,}connect_vcpu]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:24 +02:00
Scott Wood
5df554ad5b kvm/ppc/mpic: in-kernel MPIC emulation
Hook the MPIC code up to the KVM interfaces, add locking, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub function for kvmppc_mpic_set_epr, non-booke, 64bit]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:23 +02:00
Scott Wood
f0f5c481a9 kvm/ppc/mpic: adapt to kernel style and environment
Remove braces that Linux style doesn't permit, remove space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

Substitute type names, debug prints, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:22 +02:00
Scott Wood
6dd830a09a kvm/ppc/mpic: remove some obviously unneeded code
Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:22 +02:00
Scott Wood
b823f98f89 kvm/ppc/mpic: import hw/openpic.c from QEMU
This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:20 +02:00
Paul Mackerras
c35635efdc KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
done by the host to the virtual processor areas (VPAs) and dispatch
trace logs (DTLs) registered by the guest.  This is because those
modifications are done either in real mode or in the host kernel
context, and in neither case does the access go through the guest's
HPT, and thus no change (C) bit gets set in the guest's HPT.

However, the changes done by the host do need to be tracked so that
the modified pages get transferred when doing live migration.  In
order to track these modifications, this adds a dirty flag to the
struct representing the VPA/DTL areas, and arranges to set the flag
when the VPA/DTL gets modified by the host.  Then, when we are
collecting the dirty log, we also check the dirty flags for the
VPA and DTL for each vcpu and set the relevant bit in the dirty log
if necessary.  Doing this also means we now need to keep track of
the guest physical address of the VPA/DTL areas.

So as not to lose track of modifications to a VPA/DTL area when it gets
unregistered, or when a new area gets registered in its place, we need
to transfer the dirty state to the rmap chain.  This adds code to
kvmppc_unpin_guest_page() to do that if the area was dirty.  To simplify
that code, we now require that all VPA, DTL and SLB shadow buffer areas
fit within a single host page.  Guests already comply with this
requirement because pHyp requires that these areas not cross a 4k
boundary.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:13 +02:00
Paul Mackerras
a1b4a0f606 KVM: PPC: Book3S HV: Make HPT reading code notice R/C bit changes
At present, the code that determines whether a HPT entry has changed,
and thus needs to be sent to userspace when it is copying the HPT,
doesn't consider a hardware update to the reference and change bits
(R and C) in the HPT entries to constitute a change that needs to
be sent to userspace.  This adds code to check for changes in R and C
when we are scanning the HPT to find changed entries, and adds code
to set the changed flag for the HPTE when we update the R and C bits
in the guest view of the HPTE.

Since we now need to set the HPTE changed flag in book3s_64_mmu_hv.c
as well as book3s_hv_rm_mmu.c, we move the note_hpte_modification()
function into kvm_book3s_64.h.

Current Linux guest kernels don't use the hardware updates of R and C
in the HPT, so this change won't affect them.  Linux (or other) kernels
might in future want to use the R and C bits and have them correctly
transferred across when a guest is migrated, so it is better to correct
this deficiency.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:12 +02:00
Mihai Caraman
d9ce6041b3 KVM: PPC: e500: Add e6500 core to Kconfig description
Add e6500 core to Kconfig description.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:11 +02:00
Mihai Caraman
ea17a971c2 KVM: PPC: e500mc: Enable e6500 cores
Extend processor compatibility names to e6500 cores.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:10 +02:00
Mihai Caraman
5b21501045 KVM: PPC: e500: Remove E.PT and E.HV.LRAT categories from VCPUs
Embedded.Page Table (E.PT) category is not supported yet in e6500 kernel.
Configure TLBnCFG to remove E.PT and E.HV.LRAT categories from VCPUs.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:09 +02:00
Mihai Caraman
9a6061d7fd KVM: PPC: e500: Add support for EPTCFG register
EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
in the presence of MAV 2.0. Emulate it now.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:08 +02:00
Mihai Caraman
307d9008ed KVM: PPC: e500: Add support for TLBnPS registers
Add support for TLBnPS registers available in MMU Architecture Version
(MAV) 2.0.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:07 +02:00
Mihai Caraman
8893a188b1 KVM: PPC: e500: Move vcpu's MMU configuration to dedicated functions
Vcpu's MMU default configuration and geometry update logic was buried in
a chunk of code. Move them to dedicated functions to add more clarity.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:07 +02:00
Mihai Caraman
a85d2aa23e KVM: PPC: e500: Expose MMU registers via ONE_REG
MMU registers were exposed to user-space using sregs interface. Add them
to ONE_REG interface using kvmppc_get_one_reg/kvmppc_set_one_reg delegation
mechanism.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:06 +02:00
Mihai Caraman
35b299e279 KVM: PPC: Book3E: Refactor ONE_REG ioctl implementation
Refactor Book3E ONE_REG ioctl implementation to use kvmppc_get_one_reg/
kvmppc_set_one_reg delegation interface introduced by Book3S. This is
necessary for MMU SPRs which are platform specifics.

Get rid of useless case braces in the process.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:05 +02:00
Bharat Bhushan
9b4f530807 booke: exit to user space if emulator request
This allows the exit to user space if emulator request by returning
EMULATE_EXIT_USER. This will be used in subsequent patches in list

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:04 +02:00
Bharat Bhushan
0f47f9b517 KVM: extend EMULATE_EXIT_USER to support different exit reasons
Currently the instruction emulator code returns EMULATE_EXIT_USER
and common code initializes the "run->exit_reason = .." and
"vcpu->arch.hcall_needed = .." with one fixed reason.
But there can be different reasons when emulator need to exit
to user space. To support that the "run->exit_reason = .."
and "vcpu->arch.hcall_needed = .." initialization is moved a
level up to emulator.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:03 +02:00
Bharat Bhushan
c402a3f457 Rename EMULATE_DO_PAPR to EMULATE_EXIT_USER
Instruction emulation return EMULATE_DO_PAPR when it requires
exit to userspace on book3s. Similar return is required
for booke. EMULATE_DO_PAPR reads out to be confusing so it is
renamed to EMULATE_EXIT_USER.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:03 +02:00
Bharat Bhushan
092d62ee93 KVM: PPC: debug stub interface parameter defined
This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
ioctl support. Follow up patches will use this for setting up
hardware breakpoints, watchpoints and software breakpoints.

Also kvm_arch_vcpu_ioctl_set_guest_debug() is brought one level below.
This is because I am not sure what is required for book3s. So this ioctl
behaviour will not change for book3s.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:02 +02:00
Benjamin Herrenschmidt
234d15def9 Merge remote-tracking branch 'origin/master' into next
Merge upstream to get the audit fixes
2013-04-24 14:43:36 +10:00
Paul Mackerras
3cc33d50f5 powerpc: Fix build errors with UP configs in HV-style KVM
This fixes these errors when building UP with CONFIG_KVM_BOOK3S_64_HV=y:

arch/powerpc/kvm/book3s_hv.c:1855:2: error: implicit declaration of function 'inhibit_secondary_onlining' [-Werror=implicit-function-declaration]
arch/powerpc/kvm/book3s_hv.c:1862:2: error: implicit declaration of function 'uninhibit_secondary_onlining' [-Werror=implicit-function-declaration]
cc1: all warnings being treated as errors

and this error (with CONFIG_KVM_BOOK3S_64=m, or a vmlinux link error
with CONFIG_KVM_BOOK3S_64=y):

ERROR: "smp_send_reschedule" [arch/powerpc/kvm/kvm.ko] undefined!
make[2]: *** [__modpost] Error 1

The fix for the link error is suboptimal; ideally we want a self_ipi()
function from irq.c, connected at least to the MPIC code, to initiate
an IPI to this cpu.  The fix here at least lets the code build, and it
will work, just with interrupts being delayed sometimes.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
2013-04-18 13:03:57 +10:00
Zhang Yanfei
c843be8a54 powerpc: remove cast for kmalloc/kzalloc return value
remove cast for kmalloc/kzalloc return value.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
2013-04-18 13:03:56 +10:00
Scott Wood
be28a27c99 kvm/ppc: don't call complete_mmio_load when it's a store
complete_mmio_load writes back the mmio result into the
destination register.  Doing this on a store results in
register corruption.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-17 15:21:16 +02:00
Stuart Yoder
c32498ee64 KVM: PPC: emulate dcbst
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-17 15:21:15 +02:00
Bharat Bhushan
8c32a2ea65 Added ONE_REG interface for debug instruction
This patch adds the one_reg interface to get the special instruction
to be used for setting software breakpoint from userspace.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-17 15:21:14 +02:00
Scott Wood
4d2be6f7c7 kvm/ppc/e500: eliminate tlb_refs
Commit 523f0e5421 ("KVM: PPC: E500:
Explicitly mark shadow maps invalid") began using E500_TLB_VALID
for guest TLB1 entries, and skipping invalidations if it's not set.

However, when E500_TLB_VALID was set for such entries, it was on a
fake local ref, and so the invalidations never happen.  gtlb_privs
is documented as being only for guest TLB0, though we already violate
that with E500_TLB_BITMAP.

Now that we have MMU notifiers, and thus don't need to actually
retain a reference to the mapped pages, get rid of tlb_refs, and
use gtlb_privs for E500_TLB_VALID in TLB1.

Since we can have more than one host TLB entry for a given tlbe_ref,
be careful not to clear existing flags that are relevant to other
host TLB entries when preparing a new host TLB entry.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-11 15:53:43 +02:00
Scott Wood
66a5fecdcc kvm/ppc/e500: g2h_tlb1_map: clear old bit before setting new bit
It's possible that we're using the same host TLB1 slot to map (a
presumably different portion of) the same guest TLB1 entry.  Clear
the bit in the map before setting it, so that if the esels are the same
the bit will remain set.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-11 15:53:38 +02:00
Scott Wood
6b2ba1a912 kvm/ppc/e500: h2g_tlb1_rmap: esel 0 is valid
Add one to esel values in h2g_tlb1_rmap, so that "no mapping" can be
distinguished from "esel 0".  Note that we're not saved by the fact
that host esel 0 is reserved for non-KVM use, because KVM host esel
numbering is not the raw host numbering (see to_htlb1_esel).

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-11 15:53:34 +02:00
Scott Wood
c5e6cb051c kvm/powerpc/e500mc: fix tlb invalidation on cpu migration
The existing check handles the case where we've migrated to a different
core than we last ran on, but it doesn't handle the case where we're
still on the same cpu we last ran on, but some other vcpu has run on
this cpu in the meantime.

Without this, guest segfaults (and other misbehavior) have been seen in
smp guests.

Cc: stable@vger.kernel.org # 3.8.x
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-11 00:06:39 +02:00
Al Viro
75ef9de126 constify a bunch of struct file_operations instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:16:20 -04:00
Paul Mackerras
4fe27d2add KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external
Currently kvmppc_core_dequeue_external() takes a struct kvm_interrupt *
argument and does nothing with it, in any of its implementations.
This removes it in order to make things easier for forthcoming
in-kernel interrupt controller emulation code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:17 +01:00
Scott Wood
47bf379742 kvm/ppc/e500: eliminate tlb_refs
Commit 523f0e5421 ("KVM: PPC: E500:
Explicitly mark shadow maps invalid") began using E500_TLB_VALID
for guest TLB1 entries, and skipping invalidations if it's not set.

However, when E500_TLB_VALID was set for such entries, it was on a
fake local ref, and so the invalidations never happen.  gtlb_privs
is documented as being only for guest TLB0, though we already violate
that with E500_TLB_BITMAP.

Now that we have MMU notifiers, and thus don't need to actually
retain a reference to the mapped pages, get rid of tlb_refs, and
use gtlb_privs for E500_TLB_VALID in TLB1.

Since we can have more than one host TLB entry for a given tlbe_ref,
be careful not to clear existing flags that are relevant to other
host TLB entries when preparing a new host TLB entry.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:15 +01:00
Scott Wood
36ada4f431 kvm/ppc/e500: g2h_tlb1_map: clear old bit before setting new bit
It's possible that we're using the same host TLB1 slot to map (a
presumably different portion of) the same guest TLB1 entry.  Clear
the bit in the map before setting it, so that if the esels are the same
the bit will remain set.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:13 +01:00
Scott Wood
d6940b6416 kvm/ppc/e500: h2g_tlb1_rmap: esel 0 is valid
Add one to esel values in h2g_tlb1_rmap, so that "no mapping" can be
distinguished from "esel 0".  Note that we're not saved by the fact
that host esel 0 is reserved for non-KVM use, because KVM host esel
numbering is not the raw host numbering (see to_htlb1_esel).

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:11 +01:00
Bharat Bhushan
15b708beee KVM: PPC: booke: Added debug handler
Installed debug handler will be used for guest debug support
and debug facility emulation features (patches for these
features will follow this patch).

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[bharat.bhushan@freescale.com: Substantial changes]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:09 +01:00
Bharat Bhushan
78accda4f8 KVM: PPC: Added one_reg interface for timer registers
If userspace wants to change some specific bits of TSR
(timer status register) then it uses GET/SET_SREGS ioctl interface.
So the steps will be:
      i)   user-space will make get ioctl,
      ii)  change TSR in userspace
      iii) then make set ioctl.
It can happen that TSR gets changed by kernel after step i) and
before step iii).

To avoid this we have added below one_reg ioctls for oring and clearing
specific bits in TSR. This patch adds one registerface for:
     1) setting specific bit in TSR (timer status register)
     2) clearing specific bit in TSR (timer status register)
     3) setting/getting the TCR register. There are cases where we want to only
        change TCR and not TSR. Although we can uses SREGS without
        KVM_SREGS_E_UPDATE_TSR flag but I think one reg is better. I am open
        if someone feels we should use SREGS only here.
     4) getting/setting TSR register

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:06 +01:00
Bharat Bhushan
d26f22c9cd KVM: PPC: move tsr update in a separate function
This is done so that same function can be called from SREGS and
ONE_REG interface (follow up patch).

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-03-22 01:21:05 +01:00
Marcelo Tosatti
2ae33b3896 Merge remote-tracking branch 'upstream/master' into queue
Merge reason:

From: Alexander Graf <agraf@suse.de>

"Just recently this really important patch got pulled into Linus' tree for 3.9:

commit 1674400aae
Author: Anton Blanchard <anton <at> samba.org>
Date:   Tue Mar 12 01:51:51 2013 +0000

Without that commit, I can not boot my G5, thus I can't run automated tests on it against my queue.

Could you please merge kvm/next against linus/master, so that I can base my trees against that?"

* upstream/master: (653 commits)
  PCI: Use ROM images from firmware only if no other ROM source available
  sparc: remove unused "config BITS"
  sparc: delete "if !ULTRA_HAS_POPULATION_COUNT"
  KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)
  KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797)
  KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)
  arm64: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS
  arm64: Do not select GENERIC_HARDIRQS_NO_DEPRECATED
  inet: limit length of fragment queue hash table bucket lists
  qeth: Fix scatter-gather regression
  qeth: Fix invalid router settings handling
  qeth: delay feature trace
  sgy-cts1000: Remove __dev* attributes
  KVM: x86: fix deadlock in clock-in-progress request handling
  KVM: allow host header to be included even for !CONFIG_KVM
  hwmon: (lm75) Fix tcn75 prefix
  hwmon: (lm75.h) Update header inclusion
  MAINTAINERS: Remove Mark M. Hoffman
  xfs: ensure we capture IO errors correctly
  xfs: fix xfs_iomap_eof_prealloc_initial_size type
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-21 11:11:52 -03:00
Zhang Yanfei
6e51c9ff6a powerpc: remove cast for kmalloc/kzalloc return value
remove cast for kmalloc/kzalloc return value.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-03-18 14:16:00 +01:00
Aneesh Kumar K.V
af81d7878c powerpc: Rename USER_ESID_BITS* to ESID_BITS*
Now we use ESID_BITS of kernel address to build proto vsid. So rename
USER_ESIT_BITS to ESID_BITS

Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.8]
2013-03-17 12:45:44 +11:00
Takuya Yoshikawa
8482644aea KVM: set_memory_region: Refactor commit_memory_region()
This patch makes the parameter old a const pointer to the old memory
slot and adds a new parameter named change to know the change being
requested: the former is for removing extra copying and the latter is
for cleaning up the code.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
7b6195a91d KVM: set_memory_region: Refactor prepare_memory_region()
This patch drops the parameter old, a copy of the old memory slot, and
adds a new parameter named change to know the change being requested.

This not only cleans up the code but also removes extra copying of the
memory slot structure.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
462fce4606 KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region()
X86 does not use this any more.  The remaining user, s390's !user_alloc
check, can be simply removed since KVM_SET_MEMORY_REGION ioctl is no
longer supported.

Note: fixed powerpc's indentations with spaces to suppress checkpatch
errors.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Sasha Levin
b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Linus Torvalds
89f883372f Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
 "KVM updates for the 3.9 merge window, including x86 real mode
  emulation fixes, stronger memory slot interface restrictions, mmu_lock
  spinlock hold time reduction, improved handling of large page faults
  on shadow, initial APICv HW acceleration support, s390 channel IO
  based virtio, amongst others"

* tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  Revert "KVM: MMU: lazily drop large spte"
  x86: pvclock kvm: align allocation size to page size
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
  x86 emulator: fix parity calculation for AAD instruction
  KVM: PPC: BookE: Handle alignment interrupts
  booke: Added DBCR4 SPR number
  KVM: PPC: booke: Allow multiple exception types
  KVM: PPC: booke: use vcpu reference from thread_struct
  KVM: Remove user_alloc from struct kvm_memory_slot
  KVM: VMX: disable apicv by default
  KVM: s390: Fix handling of iscs.
  KVM: MMU: cleanup __direct_map
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: lazily drop large spte
  KVM: VMX: cleanup vmx_set_cr0().
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: disable SMEP feature when guest is in non-paging mode
  KVM: Remove duplicate text in api.txt
  Revert "KVM: MMU: split kvm_mmu_free_page"
  ...
2013-02-24 13:07:18 -08:00
Linus Torvalds
9d3cae26ac Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "So from the depth of frozen Minnesota, here's the powerpc pull request
  for 3.9.  It has a few interesting highlights, in addition to the
  usual bunch of bug fixes, minor updates, embedded device tree updates
  and new boards:

   - Hand tuned asm implementation of SHA1 (by Paulus & Michael
     Ellerman)

   - Support for Doorbell interrupts on Power8 (kind of fast
     thread-thread IPIs) by Ian Munsie

   - Long overdue cleanup of the way we handle relocation of our open
     firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard

   - Support for saving/restoring & context switching the PPR (Processor
     Priority Register) on server processors that support it.  This
     allows the kernel to preserve thread priorities established by
     userspace.  By Haren Myneni.

   - DAWR (new watchpoint facility) support on Power8 by Michael Neuling

   - Ability to change the DSCR (Data Stream Control Register) which
     controls cache prefetching on a running process via ptrace by
     Alexey Kardashevskiy

   - Support for context switching the TAR register on Power8 (new
     branch target register meant to be used by some new specific
     userspace perf event interrupt facility which is yet to be enabled)
     by Ian Munsie.

   - Improve preservation of the CFAR register (which captures the
     origin of a branch) on various exception conditions by Paulus.

   - Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
     it belongs by Philippe De Muyter

   - Support for Transactional Memory on Power8 by Michael Neuling
     (based on original work by Matt Evans).  For those curious about
     the feature, the patch contains a pretty good description."

(See commit db8ff90702: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
  powerpc/kexec: Disable hard IRQ before kexec
  powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
  powerpc/85xx: bsc9131 - Correct typo in SDHC device node
  powerpc/e500/qemu-e500: enable coreint
  powerpc/mpic: allow coreint to be determined by MPIC version
  powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
  powerpc/85xx: Board support for ppa8548
  powerpc/fsl: remove extraneous DIU platform functions
  arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
  powerpc: Documentation for transactional memory on powerpc
  powerpc: Add transactional memory to pseries and ppc64 defconfigs
  powerpc: Add config option for transactional memory
  powerpc: Add transactional memory to POWER8 cpu features
  powerpc: Add new transactional memory state to the signal context
  powerpc: Hook in new transactional memory code
  powerpc: Routines for FP/VSX/VMX unavailable during a transaction
  powerpc: Add transactional memory unavaliable execption handler
  powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
  powerpc: Add FP/VSX and VMX register load functions for transactional memory
  powerpc: Add helper functions for transactional memory context switching
  ...
2013-02-23 17:09:55 -08:00
Paul Mackerras
deb26c274d powerpc/kvm/book3s_pr: Fix compilation on 32-bit machines
Commit a413f474a0 ("powerpc: Disable relocation on exceptions whenever
PR KVM is active") added calls to pSeries_disable_reloc_on_exc() and
pSeries_enable_reloc_on_exc() to book3s_pr.c, and added declarations
of those functions to <asm/hvcall.h>, but didn't add an include of
<asm/hvcall.h> to book3s_pr.c.  64-bit kernels seem to get hvcall.h
included via some other path, but 32-bit kernels fail to compile with:

arch/powerpc/kvm/book3s_pr.c: In function ‘kvmppc_core_init_vm’:
arch/powerpc/kvm/book3s_pr.c:1300:4: error: implicit declaration of function ‘pSeries_disable_reloc_on_exc’ [-Werror=implicit-function-declaration]
arch/powerpc/kvm/book3s_pr.c: In function ‘kvmppc_core_destroy_vm’:
arch/powerpc/kvm/book3s_pr.c:1316:4: error: implicit declaration of function ‘pSeries_enable_reloc_on_exc’ [-Werror=implicit-function-declaration]
cc1: all warnings being treated as errors
make[2]: *** [arch/powerpc/kvm/book3s_pr.o] Error 1
make[1]: *** [arch/powerpc/kvm] Error 2
make: *** [sub-make] Error 2

This fixes it by adding an include of hvcall.h.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-15 16:54:36 +11:00
Paul Mackerras
0acb91112a powerpc/kvm/book3s_hv: Preserve guest CFAR register value
The CFAR (Come-From Address Register) is a useful debugging aid that
exists on POWER7 processors.  Currently HV KVM doesn't save or restore
the CFAR register for guest vcpus, making the CFAR of limited use in
guests.

This adds the necessary code to capture the CFAR value saved in the
early exception entry code (it has to be saved before any branch is
executed), save it in the vcpu.arch struct, and restore it on entry
to the guest.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-15 16:54:33 +11:00
Alexander Graf
011da89962 KVM: PPC: BookE: Handle alignment interrupts
When the guest triggers an alignment interrupt, we don't handle it properly
today and instead BUG_ON(). This really shouldn't happen.

Instead, we should just pass the interrupt back into the guest so it can deal
with it.

Reported-by: Gao Guanhua-B22826 <B22826@freescale.com>
Tested-by: Gao Guanhua-B22826 <B22826@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-02-13 12:56:45 +01:00
Bharat Bhushan
1d542d9c2b KVM: PPC: booke: Allow multiple exception types
Current kvmppc_booke_handlers uses the same macro (KVM_HANDLER) and
all handlers are considered to be the same size. This will not be
the case if we want to use different macros for different handlers.

This patch improves the kvmppc_booke_handler so that it can
support different macros for different handlers.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
[bharat.bhushan@freescale.com: Substantial changes]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-02-13 12:56:40 +01:00
Bharat Bhushan
ffe129ecd7 KVM: PPC: booke: use vcpu reference from thread_struct
Like other places, use thread_struct to get vcpu reference.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-02-13 12:56:39 +01:00
Benjamin Herrenschmidt
dfd0436ad0 Merge branch 'merge' into next
Merge "merge" branch to bring in various bug fixes that are
going into 3.8
2013-01-29 11:33:37 +11:00
Greg Kroah-Hartman
422d26b6ec Merge 3.8-rc5 into driver-core-next
This resolves a gpio driver merge issue pointed out in linux-next.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 21:06:30 -08:00
Alexander Graf
b9e3e20893 KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code
The guest TLB handling code should not have any insight into how the host
TLB shadow code works.

kvmppc_e500_tlbil_all() is a function that is used for distinction between
e500v2 and e500mc (E.HV) on how to flush shadow entries. This function really
is private between the e500.c/e500mc.c file and e500_mmu_host.c.

Instead of this one, use the public kvmppc_core_flush_tlb() function to flush
all shadow TLB entries. As a nice side effect, with this we also end up
flushing TLB1 entries which we forgot to do before.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:34 +01:00
Alexander Graf
483ba97c0f KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static
Host shadow TLB flushing is logic that the guest TLB code should have
no insight about. Declare the internal clear_tlb_refs and clear_tlb1_bitmap
functions static to the host TLB handling file.

Instead of these, we can use the already exported kvmppc_core_flush_tlb().
This gives us a common API across the board to say "please flush any
pending host shadow translation".

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:33 +01:00
Alexander Graf
c015c62b13 KVM: PPC: e500: Implement TLB1-in-TLB0 mapping
When a host mapping fault happens in a guest TLB1 entry today, we
map the translated guest entry into the host's TLB1.

This isn't particularly clever when the guest is mapped by normal 4k
pages, since these would be a lot better to put into TLB0 instead.

This patch adds the required logic to map 4k TLB1 shadow maps into
the host's TLB0.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:32 +01:00
Alexander Graf
b71c9e2fb7 KVM: PPC: E500: Split host and guest MMU parts
This patch splits the file e500_tlb.c into e500_mmu.c (guest TLB handling)
and e500_mmu_host.c (host TLB handling).

The main benefit of this split is readability and maintainability. It's
just a lot harder to write dirty code :).

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:31 +01:00
Alexander Graf
9d98b3ff94 KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping
When emulating tlbwe, we want to automatically map the entry that just got
written in our shadow TLB map, because chances are quite high that it's
going to be used very soon.

Today this happens explicitly, duplicating all the logic that is in
kvmppc_mmu_map() already. Just call that one instead.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:31 +01:00
Alexander Graf
2c378fd779 KVM: PPC: E500: Propagate errors when shadow mapping
When shadow mapping a page, mapping this page can fail. In that case we
don't have a shadow map.

Take this case into account, otherwise we might end up writing bogus TLB
entries into the host TLB.

While at it, also move the write_stlbe() calls into the respective TLBn
handlers.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:30 +01:00
Alexander Graf
523f0e5421 KVM: PPC: E500: Explicitly mark shadow maps invalid
When we invalidate shadow TLB maps on the host, we don't mark them
as not valid. But we should.

Fix this by removing the E500_TLB_VALID from their flags when
invalidating.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:30 +01:00
Alexander Graf
9445ef0181 KVM: PPC: E500: Move write_stlbe higher
Later patches want to call the function and it doesn't have
dependencies on anything below write_host_tlbe.

Move it higher up in the file.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-24 19:23:29 +01:00
Kees Cook
07ff8b5358 arch/powerpc/kvm: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Alexander Graf <agraf@suse.de>
CC: Avi Kivity <avi@redhat.com>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:43:12 -08:00
Alexander Graf
d3286144c9 KVM: PPC: Emulate dcbf
Guests can trigger MMIO exits using dcbf. Since we don't emulate cache
incoherent MMIO, just do nothing and move on.

Reported-by: Ben Collins <ben.c@servergy.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Tested-by: Ben Collins <ben.c@servergy.com>
CC: stable@vger.kernel.org
2013-01-18 00:40:49 +01:00
Alexander Graf
324b3e6316 KVM: PPC: BookE: Add EPR ONE_REG sync
We need to be able to read and write the contents of the EPR register
from user space.

This patch implements that logic through the ONE_REG API and declares
its (never implemented) SREGS counterpart as deprecated.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:42:33 +01:00
Alexander Graf
1c81063655 KVM: PPC: BookE: Implement EPR exit
The External Proxy Facility in FSL BookE chips allows the interrupt
controller to automatically acknowledge an interrupt as soon as a
core gets its pending external interrupt delivered.

Today, user space implements the interrupt controller, so we need to
check on it during such a cycle.

This patch implements logic for user space to enable EPR exiting,
disable EPR exiting and EPR exiting itself, so that user space can
acknowledge an interrupt when an external interrupt has successfully
been delivered into the guest vcpu.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:42:31 +01:00
Alexander Graf
37ecb257f6 KVM: PPC: BookE: Emulate mfspr on EPR
The EPR register is potentially valid for PR KVM as well, so we need
to emulate accesses to it. It's only defined for reading, so only
handle the mfspr case.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:42:30 +01:00
Alexander Graf
b8c649a99d KVM: PPC: BookE: Allow irq deliveries to inject requests
When injecting an interrupt into guest context, we usually don't need
to check for requests anymore. At least not until today.

With the introduction of EPR, we will have to create a request when the
guest has successfully accepted an external interrupt though.

So we need to prepare the interrupt delivery to abort guest entry
gracefully. Otherwise we'd delay the EPR request.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:42:21 +01:00
Mihai Caraman
f2be655004 KVM: PPC: Fix mfspr/mtspr MMUCFG emulation
On mfspr/mtspr emulation path Book3E's MMUCFG SPR with value 1015 clashes
with G4's MSSSR0 SPR. Move MSSSR0 emulation from generic part to Books3S.
MSSSR0 also clashes with Book3S's DABRX SPR. DABRX was not explicitly
handled so Book3S execution flow will behave as before.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:30:11 +01:00
Alexander Graf
50c7bb80b5 KVM: PPC: Book3S: PR: Enable alternative instruction for SC 1
When running on top of pHyp, the hypercall instruction "sc 1" goes
straight into pHyp without trapping in supervisor mode.

So if we want to support PAPR guest in this configuration we need to
add a second way of accessing PAPR hypercalls, preferably with the
exact same semantics except for the instruction.

So let's overlay an officially reserved instruction and emulate PAPR
hypercalls whenever we hit that one.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:15:08 +01:00
Alexander Graf
5a33169ed2 KVM: PPC: Only WARN on invalid emulation
When we hit an emulation result that we didn't expect, that is an error,
but it's nothing that warrants a BUG(), because it can be guest triggered.

So instead, let's only WARN() the user that this happened.

Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-10 13:15:08 +01:00
Ian Munsie
a413f474a0 powerpc: Disable relocation on exceptions whenever PR KVM is active
For PR KVM we allow userspace to map 0xc000000000000000. Because
transitioning from userspace to the guest kernel may use the relocated
exception vectors we have to disable relocation on exceptions whenever
PR KVM is active as we cannot trust that address.

This issue does not apply to HV KVM, since changing from a guest to the
hypervisor will never use the relocated exception vectors.

Currently the hypervisor interface only allows us to toggle relocation
on exceptions on a partition wide scope, so we need to globally disable
relocation on exceptions when the first PR KVM instance is started and
only re-enable them when all PR KVM instances have been destroyed.

It's a bit heavy handed, but until the hypervisor gives us a lightweight
way to toggle relocation on exceptions on a single thread it's only real
option.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-10 17:00:42 +11:00
Andreas Schwab
d591390da9 KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV
Fixes this build breakage:

arch/powerpc/kvm/book3s_hv_ras.c: In function ‘kvmppc_realmode_mc_power7’:
arch/powerpc/kvm/book3s_hv_ras.c:126:23: error: ‘struct paca_struct’ has no member named ‘opal_mc_evt’

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-01-06 14:02:00 +01:00
Alex Williamson
f82a8cfe93 KVM: struct kvm_memory_slot.user_alloc -> bool
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:24:38 -02:00
Alex Williamson
bbacc0c111 KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM.  One is
the user accessible slots and the other is user + private.  Make this
more obvious.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:57 -02:00
Mihai Caraman
352df1deb2 KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
Implement ONE_REG interface for EPCR register adding KVM_REG_PPC_EPCR to
the list of ONE_REG PPC supported registers.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: remove HV dependency, use get/put_user]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:20 +01:00
Mihai Caraman
38f988240c KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
Add EPCR support in booke mtspr/mfspr emulation. EPCR register is defined only
for 64-bit and HV categories, we will expose it at this point only to 64-bit
virtual processors running on 64-bit HV hosts.
Define a reusable setter function for vcpu's EPCR.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: move HV dependency in the code]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:19 +01:00
Mihai Caraman
95e90b43c9 KVM: PPC: bookehv: Add guest computation mode for irq delivery
When delivering guest IRQs, update MSR computation mode according to guest
interrupt computation mode found in EPCR.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: remove HV dependency in the code]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:18 +01:00
Mihai Caraman
e9666ea1b3 KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
Extend MAS2 EPN mask to retain most significant bits on 64-bit hosts.
Use this mask in tlb effective address accessor.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:15 +01:00
Mihai Caraman
9e2fa64693 KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
Mask high 32 bits of MAS2's effective page number in tlbwe emulation for guests
running in 32-bit mode.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:14 +01:00
Mihai Caraman
7cdd7a95c6 KVM: PPC: e500: Add emulation helper for getting instruction ea
Add emulation helper for getting instruction ea and refactor tlb instruction
emulation to use it.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[agraf: keep rt variable around]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:12 +01:00
Mihai Caraman
e51f8f32d6 KVM: PPC: bookehv64: Add support for interrupt handling
Add interrupt handling support for 64-bit bookehv hosts. Unify 32 and 64 bit
implementations using a common stack layout and a common execution flow starting
from kvm_handler_common macro. Update documentation for 64-bit input register
values. This patch only address the bolted TLB miss exception handlers version.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-12-06 01:34:11 +01:00