mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-12 08:22:35 +00:00
ffc901b4d1
446 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
65133033ee |
Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
652521d460 |
perf/headers: Fix spelling s/EACCESS/EACCES/, s/privilidge/privilege/
As per POSIX, the correct spelling of the error code is EACCES: include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Kosina <trivial@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191024122904.12463-1-geert+renesas@glider.be Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
da97e18458 |
perf_event: Add support for LSM and SELinux checks
In current mainline, the degree of access to perf_event_open(2) system call depends on the perf_event_paranoid sysctl. This has a number of limitations: 1. The sysctl is only a single value. Many types of accesses are controlled based on the single value thus making the control very limited and coarse grained. 2. The sysctl is global, so if the sysctl is changed, then that means all processes get access to perf_event_open(2) opening the door to security issues. This patch adds LSM and SELinux access checking which will be used in Android to access perf_event_open(2) for the purposes of attaching BPF programs to tracepoints, perf profiling and other operations from userspace. These operations are intended for production systems. 5 new LSM hooks are added: 1. perf_event_open: This controls access during the perf_event_open(2) syscall itself. The hook is called from all the places that the perf_event_paranoid sysctl is checked to keep it consistent with the systctl. The hook gets passed a 'type' argument which controls CPU, kernel and tracepoint accesses (in this context, CPU, kernel and tracepoint have the same semantics as the perf_event_paranoid sysctl). Additionally, I added an 'open' type which is similar to perf_event_paranoid sysctl == 3 patch carried in Android and several other distros but was rejected in mainline [1] in 2016. 2. perf_event_alloc: This allocates a new security object for the event which stores the current SID within the event. It will be useful when the perf event's FD is passed through IPC to another process which may try to read the FD. Appropriate security checks will limit access. 3. perf_event_free: Called when the event is closed. 4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event. 5. perf_event_write: Called from the ioctl(2) syscalls for the event. [1] https://lwn.net/Articles/696240/ Since Peter had suggest LSM hooks in 2016 [1], I am adding his Suggested-by tag below. To use this patch, we set the perf_event_paranoid sysctl to -1 and then apply selinux checking as appropriate (default deny everything, and then add policy rules to give access to domains that need it). In the future we can remove the perf_event_paranoid sysctl altogether. Suggested-by: Peter Zijlstra <peterz@infradead.org> Co-developed-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: James Morris <jmorris@namei.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: rostedt@goodmis.org Cc: Yonghong Song <yhs@fb.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: jeffv@google.com Cc: Jiri Olsa <jolsa@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: primiano@google.com Cc: Song Liu <songliubraving@fb.com> Cc: rsavitski@google.com Cc: Namhyung Kim <namhyung@kernel.org> Cc: Matthew Garrett <matthewgarrett@google.com> Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org |
||
|
|
ab43762ef0 |
perf: Allow normal events to output AUX data
In some cases, ordinary (non-AUX) events can generate data for AUX events. For example, PEBS events can come out as records in the Intel PT stream instead of their usual DS records, if configured to do so. One requirement for such events is to consistently schedule together, to ensure that the data from the "AUX output" events isn't lost while their corresponding AUX event is not scheduled. We use grouping to provide this guarantee: an "AUX output" event can be added to a group where an AUX event is a group leader, and provided that the former supports writing to the latter. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: kan.liang@linux.intel.com Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com |
||
|
|
8a58ddae23 |
perf/core: Fix exclusive events' grouping
So far, we tried to disallow grouping exclusive events for the fear of
complications they would cause with moving between contexts. Specifically,
moving a software group to a hardware context would violate the exclusivity
rules if both groups contain matching exclusive events.
This attempt was, however, unsuccessful: the check that we have in the
perf_event_open() syscall is both wrong (looks at wrong PMU) and
insufficient (group leader may still be exclusive), as can be illustrated
by running:
$ perf record -e '{intel_pt//,cycles}' uname
$ perf record -e '{cycles,intel_pt//}' uname
ultimately successfully.
Furthermore, we are completely free to trigger the exclusivity violation
by:
perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
even though the helpful perf record will not allow that, the ABI will.
The warning later in the perf_event_open() path will also not trigger, because
it's also wrong.
Fix all this by validating the original group before moving, getting rid
of broken safeguards and placing a useful one to perf_install_in_context().
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mathieu.poirier@linaro.org
Cc: will.deacon@arm.com
Fixes:
|
||
|
|
552a031ba1 |
Linux 5.2
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0idTweHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGZesIAJDKicw2Voyx8K8m 3pXSK+71RuO/d3Y9M51mdfTMKRP4PHR9/4wVZ9wHPwC4dV6wxgsmIYCF69a1Wety LD1MpDCP1DK5wVfPNKVX2xmj7ua6iutPtSsJHzdzM2TlscgsrFKjmUccqJ5JLwL5 c34nqwXWnzzRyI5Ga9cQSlwzAXq0vDHXyML3AnCosSsLX0lKFrHlK1zttdOPNkfj dXRN62g3q+9kVQozzhDXb8atZZ7IkBk8Q0lujpNXW83Ci1VjaVNv3SB8GZTXIlLj U15VdyuwfJDfpBgFBN6/unzVaAB6FFrEKy0jT1aeTyKarMKDKgOnJjn10aKjDNno /bXsKKc= =TVqV -----END PGP SIGNATURE----- Merge tag 'v5.2' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
fd7d55172d |
perf/cgroups: Don't rotate events for cgroups unnecessarily
Currently perf_rotate_context assumes that if the context's nr_events != nr_active a rotation is necessary for perf event multiplexing. With cgroups, nr_events is the total count of events for all cgroups and nr_active will not include events in a cgroup other than the current task's. This makes rotation appear necessary for cgroups when it is not. Add a perf_event_context flag that is set when rotation is necessary. Clear the flag during sched_out and set it when a flexible sched_in fails due to resources. Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e321d02db8 |
perf/x86: Disable extended registers for non-supported PMUs
The perf fuzzer caused Skylake machine to crash:
[ 9680.085831] Call Trace:
[ 9680.088301] <IRQ>
[ 9680.090363] perf_output_sample_regs+0x43/0xa0
[ 9680.094928] perf_output_sample+0x3aa/0x7a0
[ 9680.099181] perf_event_output_forward+0x53/0x80
[ 9680.103917] __perf_event_overflow+0x52/0xf0
[ 9680.108266] ? perf_trace_run_bpf_submit+0xc0/0xc0
[ 9680.113108] perf_swevent_hrtimer+0xe2/0x150
[ 9680.117475] ? check_preempt_wakeup+0x181/0x230
[ 9680.122091] ? check_preempt_curr+0x62/0x90
[ 9680.126361] ? ttwu_do_wakeup+0x19/0x140
[ 9680.130355] ? try_to_wake_up+0x54/0x460
[ 9680.134366] ? reweight_entity+0x15b/0x1a0
[ 9680.138559] ? __queue_work+0x103/0x3f0
[ 9680.142472] ? update_dl_rq_load_avg+0x1cd/0x270
[ 9680.147194] ? timerqueue_del+0x1e/0x40
[ 9680.151092] ? __remove_hrtimer+0x35/0x70
[ 9680.155191] __hrtimer_run_queues+0x100/0x280
[ 9680.159658] hrtimer_interrupt+0x100/0x220
[ 9680.163835] smp_apic_timer_interrupt+0x6a/0x140
[ 9680.168555] apic_timer_interrupt+0xf/0x20
[ 9680.172756] </IRQ>
The XMM registers can only be collected by PEBS hardware events on the
platforms with PEBS baseline support, e.g. Icelake, not software/probe
events.
Add capabilities flag PERF_PMU_CAP_EXTENDED_REGS to indicate the PMU
which support extended registers. For X86, the extended registers are
XMM registers.
Add has_extended_regs() to check if extended registers are applied.
The generic code define the mask of extended registers as 0 if arch
headers haven't overridden it.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
|
|
f3a3a8257e |
perf/core: Add attr_groups_update into struct pmu
Adding attr_update attribute group into pmu, to allow
having multiple attribute groups for same group name.
This will allow us to update "events" or "format"
directories with attributes that depend on various
HW conditions.
For example having group_format_extra group that updates
"format" directory only if pmu version is 2 and higher:
static umode_t
exra_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
return x86_pmu.version >= 2 ? attr->mode : 0;
}
static struct attribute_group group_format_extra = {
.name = "format",
.is_visible = exra_is_visible,
};
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190512155518.21468-3-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
0ef0fd3515 |
* ARM: support for SVE and Pointer Authentication in guests, PMU improvements
* POWER: support for direct access to the POWER9 XIVE interrupt controller, memory and performance optimizations. * x86: support for accessing memory not backed by struct page, fixes and refactoring * Generic: dirty page tracking improvements -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7 eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I= =D0+p -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "ARM: - support for SVE and Pointer Authentication in guests - PMU improvements POWER: - support for direct access to the POWER9 XIVE interrupt controller - memory and performance optimizations x86: - support for accessing memory not backed by struct page - fixes and refactoring Generic: - dirty page tracking improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits) kvm: fix compilation on aarch64 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" kvm: x86: Fix L1TF mitigation for shadow MMU KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing" KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete tests: kvm: Add tests for KVM_SET_NESTED_STATE KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID tests: kvm: Add tests to .gitignore KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one KVM: Fix the bitmap range to copy during clear dirty KVM: arm64: Fix ptrauth ID register masking logic KVM: x86: use direct accessors for RIP and RSP KVM: VMX: Use accessors for GPRs outside of dedicated caching logic KVM: x86: Omit caching logic for always-available GPRs kvm, x86: Properly check whether a pfn is an MMIO or not ... |
||
|
|
90489a72fb |
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"The main kernel changes were:
- add support for Intel's "adaptive PEBS v4" - which embedds LBS data
in PEBS records and can thus batch up and reduce the IRQ (NMI) rate
significantly - reducing overhead and making call-graph profiling
less intrusive.
- add Intel CPU core and uncore support updates for Tremont, Icelake,
- extend the x86 PMU constraints scheduler with 'constraint ranges'
to better support Icelake hw constraints,
- make x86 call-chain support work better with CONFIG_FRAME_POINTER=y
- misc other changes
Tooling changes:
- updates to the main tools: 'perf record', 'perf trace', 'perf
stat'
- updated Intel and S/390 vendor events
- libtraceevent updates
- misc other updates and fixes"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER
watchdog: Fix typo in comment
perf/x86/intel: Add Tremont core PMU support
perf/x86/intel/uncore: Add Intel Icelake uncore support
perf/x86/msr: Add Icelake support
perf/x86/intel/rapl: Add Icelake support
perf/x86/intel/cstate: Add Icelake support
perf/x86/intel: Add Icelake support
perf/x86: Support constraint ranges
perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them
perf/x86/intel: Support adaptive PEBS v4
perf/x86/intel/ds: Extract code of event update in short period
perf/x86/intel: Extract memory code PEBS parser for reuse
perf/x86: Support outputting XMM registers
perf/x86/intel: Force resched when TFA sysctl is modified
perf/core: Add perf_pmu_resched() as global function
perf/headers: Fix stale comment for struct perf_addr_filter
perf/core: Make perf_swevent_init_cpu() static
perf/x86: Add sanity checks to x86_schedule_events()
perf/x86: Optimize x86_schedule_events()
...
|
||
|
|
72e830f684 |
perf/x86/intel/pt: Remove software double buffering PMU capability
Now that all AUX allocations are high-order by default, the software double buffering PMU capability doesn't make sense any more, get rid of it. In case some PMUs choose to opt out, we can re-introduce it. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: adrian.hunter@intel.com Link: http://lkml.kernel.org/r/20190503085536.24119-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
8479e04e7d |
KVM: x86: Inject PMI for KVM guest
Inject a PMI for KVM guest when Intel PT working in Host-Guest mode and Guest ToPA entry memory buffer was completely filled. Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
d15d356887 |
perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER
Currently perf callchain doesn't work well with ORC unwinder
when sampling from trace point. We'll get useless in kernel callchain
like this:
perf 6429 [000] 22.498450: kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
5651468729c1 [unknown] (/usr/bin/perf)
5651467ee82a main+0x69a (/usr/bin/perf)
7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])
The root cause is that, for trace point events, it doesn't provide a
real snapshot of the hardware registers. Instead perf tries to get
required caller's registers and compose a fake register snapshot
which suppose to contain enough information for start a unwinding.
However without CONFIG_FRAME_POINTER, if failed to get caller's BP as the
frame pointer, so current frame pointer is returned instead. We get
a invalid register combination which confuse the unwinder, and end the
stacktrace early.
So in such case just don't try dump BP, and let the unwinder start
directly when the register is not a real snapshot. Use SP
as the skip mark, unwinder will skip all the frames until it meet
the frame of the trace point caller.
Tested with frame pointer unwinder and ORC unwinder, this makes perf
callchain get the full kernel space stacktrace again like this:
perf 6503 [000] 1567.570191: kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
55a22960d9c1 [unknown] (/usr/bin/perf)
55a22958982a main+0x69a (/usr/bin/perf)
7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
5541f689495641d7 [unknown] ([unknown])
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
c68d224e5e |
perf/core: Add perf_pmu_resched() as global function
This patch add perf_pmu_resched() a global function that can be called to force rescheduling of events for a given PMU. The function locks both cpuctx and task_ctx internally. This will be used by a subsequent patch. Signed-off-by: Stephane Eranian <eranian@google.com> [ Simplified the calling convention. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Cc: nelson.dsouza@intel.com Cc: tonyj@suse.com Link: https://lkml.kernel.org/r/20190408173252.37932-2-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1279e41d53 |
perf/headers: Fix stale comment for struct perf_addr_filter
The @inode field has been removed after:
|
||
|
|
c978b9460f |
perf/core improvements and fixes:
perf annotate:
Wei Li:
- Fix getting source line failure
perf script:
Andi Kleen:
- Handle missing fields with -F +...
perf data:
Jiri Olsa:
- Prep work to support per-cpu files in a directory.
Intel PT:
Adrian Hunter:
- Improve thread_stack__no_call_return()
- Hide x86 retpolines in thread stacks.
- exported SQL viewer refactorings, new 'top calls' report..
Alexander Shishkin:
- Copy parent's address filter offsets on clone
- Fix address filters for vmas with non-zero offset. Applies to
ARM's CoreSight as well.
python scripts:
Tony Jones:
- Python3 support for several 'perf script' python scripts.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXHRYNwAKCRCyPKLppCJ+
J8XmAQDKY7gb3GhkX+4aE8cGffFYB2YV5mD9Bbu4AM9tuFFBJwD+KAq87FMCy7m7
h7xyWk3UILpz6y235AVdfOmgcNDkpAQ=
=SJCG
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-5.1-20190225' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
perf annotate:
Wei Li:
- Fix getting source line failure
perf script:
Andi Kleen:
- Handle missing fields with -F +...
perf data:
Jiri Olsa:
- Prep work to support per-cpu files in a directory.
Intel PT:
Adrian Hunter:
- Improve thread_stack__no_call_return()
- Hide x86 retpolines in thread stacks.
- exported SQL viewer refactorings, new 'top calls' report..
Alexander Shishkin:
- Copy parent's address filter offsets on clone
- Fix address filters for vmas with non-zero offset. Applies to
ARM's CoreSight as well.
python scripts:
Tony Jones:
- Python3 support for several 'perf script' python scripts.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
9ed8f1a6e7 |
Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
c60f83b813 |
perf, pt, coresight: Fix address filters for vmas with non-zero offset
Currently, the address range calculation for file-based filters works as
long as the vma that maps the matching part of the object file starts
from offset zero into the file (vm_pgoff==0). Otherwise, the resulting
filter range would be off by vm_pgoff pages. Another related problem is
that in case of a partially matching vma, that is, a vma that matches
part of a filter region, the filter range size wouldn't be adjusted.
Fix the arithmetics around address filter range calculations, taking
into account vma offset, so that the entire calculation is done before
the filter configuration is passed to the PMU drivers instead of having
those drivers do the final bit of arithmetics.
Based on the patch by Adrian Hunter <adrian.hunter.intel.com>.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes:
|
||
|
|
81ec3f3c4c |
perf/x86: Add check_period PMU callback
Vince (and later on Ravi) reported crashes in the BTS code during fuzzing with the following backtrace: general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:perf_prepare_sample+0x8f/0x510 ... Call Trace: <IRQ> ? intel_pmu_drain_bts_buffer+0x194/0x230 intel_pmu_drain_bts_buffer+0x160/0x230 ? tick_nohz_irq_exit+0x31/0x40 ? smp_call_function_single_interrupt+0x48/0xe0 ? call_function_single_interrupt+0xf/0x20 ? call_function_single_interrupt+0xa/0x20 ? x86_schedule_events+0x1a0/0x2f0 ? x86_pmu_commit_txn+0xb4/0x100 ? find_busiest_group+0x47/0x5d0 ? perf_event_set_state.part.42+0x12/0x50 ? perf_mux_hrtimer_restart+0x40/0xb0 intel_pmu_disable_event+0xae/0x100 ? intel_pmu_disable_event+0xae/0x100 x86_pmu_stop+0x7a/0xb0 x86_pmu_del+0x57/0x120 event_sched_out.isra.101+0x83/0x180 group_sched_out.part.103+0x57/0xe0 ctx_sched_out+0x188/0x240 ctx_resched+0xa8/0xd0 __perf_event_enable+0x193/0x1e0 event_function+0x8e/0xc0 remote_function+0x41/0x50 flush_smp_call_function_queue+0x68/0x100 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x3e/0xe0 call_function_single_interrupt+0xf/0x20 </IRQ> The reason is that while event init code does several checks for BTS events and prevents several unwanted config bits for BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows to create BTS event without those checks being done. Following sequence will cause the crash: If we create an 'almost' BTS event with precise_ip and callchains, and it into a BTS event it will crash the perf_prepare_sample() function because precise_ip events are expected to come in with callchain data initialized, but that's not the case for intel_pmu_drain_bts_buffer() caller. Adding a check_period callback to be called before the period is changed via PERF_EVENT_IOC_PERIOD. It will deny the change if the event would become BTS. Plus adding also the limit_period check as well. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
840018668c |
perf/aux: Make perf_event accessible to setup_aux()
When pmu::setup_aux() is called the coresight PMU needs to know which sink to use for the session by looking up the information in the event's attr::config2 field. As such simply replace the cpu information by the complete perf_event structure and change all affected customers. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
8c94abbbe1 |
perf: Convert perf_event_context.refcount to refcount_t
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable perf_event_context.refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the perf_event_context.refcount it might make a difference in following places: - get_ctx(), perf_event_ctx_lock_nested(), perf_lock_task_context() and __perf_event_ctx_lock_double(): increment in refcount_inc_not_zero() only guarantees control dependency on success vs. fully ordered atomic counterpart - put_ctx(): decrement in refcount_dec_and_test() provides RELEASE ordering and ACQUIRE ordering + control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/1548678448-24458-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
6ee52e2a3f |
perf, bpf: Introduce PERF_RECORD_BPF_EVENT
For better performance analysis of BPF programs, this patch introduces
PERF_RECORD_BPF_EVENT, a new perf_event_type that exposes BPF program
load/unload information to user space.
Each BPF program may contain up to BPF_MAX_SUBPROGS (256) sub programs.
The following example shows kernel symbols for a BPF program with 7 sub
programs:
ffffffffa0257cf9 t bpf_prog_b07ccb89267cf242_F
ffffffffa02592e1 t bpf_prog_2dcecc18072623fc_F
ffffffffa025b0e9 t bpf_prog_bb7a405ebaec5d5c_F
ffffffffa025dd2c t bpf_prog_a7540d4a39ec1fc7_F
ffffffffa025fcca t bpf_prog_05762d4ade0e3737_F
ffffffffa026108f t bpf_prog_db4bd11e35df90d4_F
ffffffffa0263f00 t bpf_prog_89d64e4abf0f0126_F
ffffffffa0257cf9 t bpf_prog_ae31629322c4b018__dummy_tracepoi
When a bpf program is loaded, PERF_RECORD_KSYMBOL is generated for each
of these sub programs. Therefore, PERF_RECORD_BPF_EVENT is not needed
for simple profiling.
For annotation, user space need to listen to PERF_RECORD_BPF_EVENT and
gather more information about these (sub) programs via sys_bpf.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradeaed.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20190117161521.1341602-4-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
76193a9452 |
perf, bpf: Introduce PERF_RECORD_KSYMBOL
For better performance analysis of dynamically JITed and loaded kernel
functions, such as BPF programs, this patch introduces
PERF_RECORD_KSYMBOL, a new perf_event_type that exposes kernel symbol
register/unregister information to user space.
The following data structure is used for PERF_RECORD_KSYMBOL.
/*
* struct {
* struct perf_event_header header;
* u64 addr;
* u32 len;
* u16 ksym_type;
* u16 flags;
* char name[];
* struct sample_id sample_id;
* };
*/
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@fb.com
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20190117161521.1341602-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
5620196951 |
perf: Make perf_event_output() propagate the output() return
For the original mode of operation it isn't needed, since we report back errors via PERF_RECORD_LOST records in the ring buffer, but for use in bpf_perf_event_output() it is convenient to return the errors, basically -ENOSPC. Currently bpf_perf_event_output() returns an error indication, the last thing it does, which is to push it to the ring buffer is that can fail and if so, this failure won't be reported back to its users, fix it. Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/r/20190118150938.GN5823@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
cf5c6c211b |
perf: Remove duplicated workqueue.h include from perf_event.h
It is already included a little bit higher up in that file. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20190117072504.14428-1-yuehaibing@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
cc6795aeff |
perf/core: Add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the events attribute flags. This approach is error prone and often inconsistent. Let's instead allow PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. Signed-off-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-4-git-send-email-andrew.murray@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
486efe9f8e |
perf/core: Add function to test for event exclusion flags
Add a function that tests if any of the perf event exclusion flags are set on a given event. Signed-off-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: robin.murphy@arm.com Cc: suzuki.poulose@arm.com Link: https://lkml.kernel.org/r/1547128414-50693-3-git-send-email-andrew.murray@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
43b9e4febc |
perf/core: Declare the __percpu attribute on non-deref types
Sparse reports the current declaration of two perf percpu variables
with this warning:
warning: incorrect type in initializer (different address spaces)
expected void const [noderef] <asn:3>*__vpp_verify
got struct perf_cpu_context *<noident>
While it's normally perfectly fine to place GCC attributes anywhere
in the definition, this particular attribute is for a checking
compiler's such as Sparse's benefit, which doesn't want __percpu
on pointers.
So reorder the attribute to come after the structure type, not after
the pointer type.
[ mingo: Rewrote the changelog. ]
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1543310012-7967-1-git-send-email-mojha@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
93081caaae |
Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
6cbc304f2f |
perf/x86/intel: Fix unwind errors from PEBS entries (mk-II)
Vince reported the perf_fuzzer giving various unwinder warnings and
Josh reported:
> Deja vu. Most of these are related to perf PEBS, similar to the
> following issue:
>
>
|
||
|
|
788faab70d |
perf, tools: Use correct articles in comments
Some of the comments in the perf events code use articles incorrectly, using 'a' for words beginning with a vowel sound, where 'an' should be used. Signed-off-by: Tobias Tefke <tobias.tefke@tutanota.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: jolsa@redhat.com Cc: namhyung@kernel.org Link: http://lkml.kernel.org/r/20180709105715.22938-1-tobias.tefke@tutanota.com [ Fix a few more perf related 'a event' typo fixes from all around the kernel and tooling tree. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1c8c5a9d38 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
|
||
|
|
9511bce9fe |
perf/core: Fix bad use of igrab()
As Miklos reported and suggested:
"This pattern repeats two times in trace_uprobe.c and in
kernel/events/core.c as well:
ret = kern_path(filename, LOOKUP_FOLLOW, &path);
if (ret)
goto fail_address_parse;
inode = igrab(d_inode(path.dentry));
path_put(&path);
And it's wrong. You can only hold a reference to the inode if you
have an active ref to the superblock as well (which is normally
through path.mnt) or holding s_umount.
This way unmounting the containing filesystem while the tracepoint is
active will give you the "VFS: Busy inodes after unmount..." message
and a crash when the inode is finally put.
Solution: store path instead of inode."
This patch fixes the issue in kernel/event/core.c.
Reviewed-and-tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes:
|
||
|
|
a1150c2022 |
perf/core: Fix group scheduling with mixed hw and sw events
When hw and sw events are mixed in the same group, they are all attached
to the hw perf_event_context. This sometimes requires moving group of
perf_event to a different context.
We found a bug in how the kernel handles this, for example if we do:
perf stat -e '{faults,ref-cycles,faults}' -I 1000
1.005591180 1,297 faults
1.005591180 457,476,576 ref-cycles
1.005591180 <not supported> faults
First, sw event "faults" is attached to the sw context, and becomes the
group leader. Then, hw event "ref-cycles" is attached, so both events
are moved to the hw context. Last, another sw "faults" tries to attach,
but it fails because of mismatch between the new target ctx (from sw
pmu) and the group_leader's ctx (hw context, same as ref-cycles).
The broken condition is:
group_leader is sw event;
group_leader is on hw context;
add a sw event to the group.
Fix this scenario by checking group_leader's context (instead of just
event type). If group_leader is on hw context, use the ->pmu of this
context to look up context for the new event.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes:
|
||
|
|
f8d959a5b1 |
perf/core: add perf_get_event() to return perf_event given a struct file
A new extern function, perf_get_event(), is added to return a perf event given a struct file. This function will be used in later patches. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
6ed70cf342 |
perf/x86/pt, coresight: Clean up address filter structure
This is a cosmetic patch that deals with the address filter structure's ambiguous fields 'filter' and 'range'. The former stands to mean that the filter's *action* should be to filter the traces to its address range if it's set or stop tracing if it's unset. This is confusing and hard on the eyes, so this patch replaces it with 'action' enum. The 'range' field is completely redundant (meaning that the filter is an address range as opposed to a single address trigger), as we can use zero size to mean the same thing. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180329120648.11902-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
edb39592a5 |
perf: Fix sibling iteration
Mark noticed that the change to sibling_list changed some iteration
semantics; because previously we used group_list as list entry,
sibling events would always have an empty sibling_list.
But because we now use sibling_list for both list head and list entry,
siblings will report as having siblings.
Fix this with a custom for_each_sibling_event() iterator.
Fixes:
|
||
|
|
6668128a9e |
perf/core: Optimize ctx_sched_out()
When an event group contains more events than can be scheduled on the hardware, iterating the full event group for ctx_sched_out is a waste of time. Keep track of the events that got programmed on the hardware, such that we can iterate this smaller list in order to schedule them out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
8343aae661 |
perf/core: Remove perf_event::group_entry
Now that all the grouping is done with RB trees, we no longer need group_entry and can replace the whole thing with sibling_list. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
8e1a2031e4 |
perf/cor: Use RB trees for pinned/flexible groups
Change event groups into RB trees sorted by CPU and then by a 64bit index, so that multiplexing hrtimer interrupt handler would be able skipping to the current CPU's list and ignore groups allocated for the other CPUs. New API for manipulating event groups in the trees is implemented as well as adoption on the API in the current implementation. pinned_group_sched_in() and flexible_group_sched_in() API are introduced to consolidate code enabling the whole group from pinned and flexible groups appropriately. Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
c895f6f703 |
bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type
Commit |
||
|
|
2dcd9c71c1 |
Tracing updates for 4.15:
- Now allow module init functions to be traced
- Clean up some unused or not used by config events (saves space)
- Clean up of trace histogram code
- Add support for preempt and interrupt enabled/disable events
- Other various clean ups
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAloPGgkUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUmfaAwAjge5FWBCBQeby8tVuw4RGAorRgl5
IFuijFSygcKRMhQFP6B+haHsezeCbNaBBtIncXhoJGDC5XuhUhr9foYf1SChEmYp
tCOK2o71FgZ8yG539IYCVjG9cJZxPLM0OI7RQ8hcMETAr+eiXPXxHrmrm9kdBtYM
ZAQERvqI5yu2HWIb87KBc38H0rgYrOJKZt9Rx20as/aqAME7hFvYErFlcnxdmHo+
LmovJOQBCTicNJ4TXJc418JaUWi9cm/A3uhW3o5aLMoRAxCc/8FD+dq2rg4qlHDH
tOtK6pwIPHfqRZ3nMLXXWhaa+w+swsxBOnegkvgP2xCyibKjFgh9kzcpaj41w3x1
0FCfvS7flx9ob//fAB8kxLvJyY5p3Qp3xdvj0+gp2qa3Ga5lSqcMzS419TLY1Yfa
Jpi2oAagDqP94m0EjAGTkhZMOrsFIDr49g3h7nqz3T3Z54luyXniDoYoO11d+dUF
vCUiIJz/PsQIE3NVViZiaRtcLVXneLHISmnz
=h3F2
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from
- allow module init functions to be traced
- clean up some unused or not used by config events (saves space)
- clean up of trace histogram code
- add support for preempt and interrupt enabled/disable events
- other various clean ups
* tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
tracing, thermal: Hide cpu cooling trace events when not in use
tracing, thermal: Hide devfreq trace events when not in use
ftrace: Kill FTRACE_OPS_FL_PER_CPU
perf/ftrace: Small cleanup
perf/ftrace: Fix function trace events
perf/ftrace: Revert ("perf/ftrace: Fix double traces of perf on ftrace:function")
tracing, dma-buf: Remove unused trace event dma_fence_annotate_wait_on
tracing, memcg, vmscan: Hide trace events when not in use
tracing/xen: Hide events that are not used when X86_PAE is not defined
tracing: mark trace_test_buffer as __maybe_unused
printk: Remove superfluous memory barriers from printk_safe
ftrace: Clear hashes of stale ips of init memory
tracing: Add support for preempt and irq enable/disable events
tracing: Prepare to add preempt and irq trace events
ftrace/kallsyms: Have /proc/kallsyms show saved mod init functions
ftrace: Add freeing algorithm to free ftrace_mod_maps
ftrace: Save module init functions kallsyms symbols for tracing
ftrace: Allow module init functions to be traced
ftrace: Add a ftrace_free_mem() function for modules to use
tracing: Reimplement log2
...
|
||
|
|
0d3d73aac2 |
perf/core: Rewrite event timekeeping
The current even timekeeping, which computes enabled and running times, uses 3 distinct timestamps to reflect the various event states: OFF (stopped), INACTIVE (enabled) and ACTIVE (running). Furthermore, the update rules are such that even INACTIVE events need their timestamps updated. This is undesirable because we'd like to not touch INACTIVE events if at all possible, this makes event scheduling (much) more expensive than needed. Rewrite the timekeeping to directly use event->state, this greatly simplifies the code and results in only having to update things when we change state, or an up-to-date value is requested (read). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
8ca2bd41c7 |
perf/core: Rename 'enum perf_event_active_state'
Its a weird name, active is one of the states, it should not be part of the name, also, its too long. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
7d9285e82d |
perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"
eBPF programs would like access to the (perf) event enabled and running times along with the event value, such that they can deal with event multiplexing (among other things). This patch extends the interface; a future eBPF patch will utilize the new functionality. [ Note, there's a same-content commit with a poor changelog and a meaningless title in the networking tree as well - but we need this change for subsequent perf work, so apply it here as well, with a proper changelog. Hopefully Git will be able to sort out this somewhat messy workflow, if there are no other, conflicting changes to these files. ] Signed-off-by: Yonghong Song <yhs@fb.com> [ Rewrote the changelog. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <ast@fb.com> Cc: <daniel@iogearbox.net> Cc: <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David S. Miller <davem@davemloft.net> Link: http://lkml.kernel.org/r/20171005161923.332790-2-yhs@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
8fd0fbbe88 |
perf/ftrace: Revert ("perf/ftrace: Fix double traces of perf on ftrace:function")
Revert commit:
|
||
|
|
f57091767a |
Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache quality monitoring update from Thomas Gleixner: "This update provides a complete rewrite of the Cache Quality Monitoring (CQM) facility. The existing CQM support was duct taped into perf with a lot of issues and the attempts to fix those turned out to be incomplete and horrible. After lengthy discussions it was decided to integrate the CQM support into the Resource Director Technology (RDT) facility, which is the obvious choise as in hardware CQM is part of RDT. This allowed to add Memory Bandwidth Monitoring support on top. As a result the mechanisms for allocating cache/memory bandwidth and the corresponding monitoring mechanisms are integrated into a single management facility with a consistent user interface" * 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) x86/intel_rdt: Turn off most RDT features on Skylake x86/intel_rdt: Add command line options for resource director technology x86/intel_rdt: Move special case code for Haswell to a quirk function x86/intel_rdt: Remove redundant ternary operator on return x86/intel_rdt/cqm: Improve limbo list processing x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug x86/intel_rdt: Modify the intel_pqr_state for better performance x86/intel_rdt/cqm: Clear the default RMID during hotcpu x86/intel_rdt: Show bitmask of shareable resource with other executing units x86/intel_rdt/mbm: Handle counter overflow x86/intel_rdt/mbm: Add mbm counter initialization x86/intel_rdt/mbm: Basic counting of MBM events (total and local) x86/intel_rdt/cqm: Add CPU hotplug support x86/intel_rdt/cqm: Add sched_in support x86/intel_rdt: Introduce rdt_enable_key for scheduling x86/intel_rdt/cqm: Add mount,umount support x86/intel_rdt/cqm: Add rmdir support x86/intel_rdt: Separate the ctrl bits from rmdir x86/intel_rdt/cqm: Add mon_data x86/intel_rdt: Prepare for RDT monitor data support ... |
||
|
|
fc7ce9c74c |
perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR
For understanding how the workload maps to memory channels and hardware behavior, it's very important to collect address maps with physical addresses. For example, 3D XPoint access can only be found by filtering the physical address. Add a new sample type for physical address. perf already has a facility to collect data virtual address. This patch introduces a function to convert the virtual address to physical address. The function is quite generic and can be extended to any architecture as long as a virtual address is provided. - For kernel direct mapping addresses, virt_to_phys is used to convert the virtual addresses to physical address. - For user virtual addresses, __get_user_pages_fast is used to walk the pages tables for user physical address. - This does not work for vmalloc addresses right now. These are not resolved, but code to do that could be added. The new sample type requires collecting the virtual address. The virtual address will not be output unless SAMPLE_ADDR is applied. For security, the physical address can only be exposed to root or privileged user. Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: mpe@ellerman.id.au Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
8d4e6c4caa |
perf/core, pt, bts: Get rid of itrace_started
I just noticed that hw.itrace_started and hw.config are aliased to the
same location. Now, the PT driver happens to use both, which works out
fine by sheer luck:
- STORE(hw.itrace_start) is ordered before STORE(hw.config), in the
program order, although there are no compiler barriers to ensure that,
- to the perf_log_itrace_start() hw.itrace_start looks set at the same
time as when it is intended to be set because both stores happen in the
same path,
- hw.config is never reset to zero in the PT driver.
Now, the use of hw.config by the PT driver makes more sense (it being a
HW PMU) than messing around with itrace_started, which is an awkward API
to begin with.
This patch replaces hw.itrace_started with an attach_state bit and an
API call for the PMU drivers to use to communicate the condition.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
75e8387685 |
perf/ftrace: Fix double traces of perf on ftrace:function
When running perf on the ftrace:function tracepoint, there is a bug
which can be reproduced by:
perf record -e ftrace:function -a sleep 20 &
perf record -e ftrace:function ls
perf script
ls 10304 [005] 171.853235: ftrace:function:
perf_output_begin
ls 10304 [005] 171.853237: ftrace:function:
perf_output_begin
ls 10304 [005] 171.853239: ftrace:function:
task_tgid_nr_ns
ls 10304 [005] 171.853240: ftrace:function:
task_tgid_nr_ns
ls 10304 [005] 171.853242: ftrace:function:
__task_pid_nr_ns
ls 10304 [005] 171.853244: ftrace:function:
__task_pid_nr_ns
We can see that all the function traces are doubled.
The problem is caused by the inconsistency of the register
function perf_ftrace_event_register() with the probe function
perf_ftrace_function_call(). The former registers one probe
for every perf_event. And the latter handles all perf_events
on the current cpu. So when two perf_events on the current cpu,
the traces of them will be doubled.
So this patch adds an extra parameter "event" for perf_tp_event,
only send sample data to this event when it's not NULL.
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: huawei.libin@huawei.com
Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
bfe334924c |
perf/x86: Fix RDPMC vs. mm_struct tracking
Vince reported the following rdpmc() testcase failure:
> Failing test case:
>
> fd=perf_event_open();
> addr=mmap(fd);
> exec() // without closing or unmapping the event
> fd=perf_event_open();
> addr=mmap(fd);
> rdpmc() // GPFs due to rdpmc being disabled
The problem is of course that exec() plays tricks with what is
current->mm, only destroying the old mappings after having
installed the new mm.
Fix this confusion by passing along vma->vm_mm instead of relying on
current->mm.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
c39a0e2c88 |
x86/perf/cqm: Wipe out perf based cqm
'perf cqm' never worked due to the incompatibility between perf
infrastructure and cqm hardware support. The hardware uses RMIDs to
track the llc occupancy of tasks and these RMIDs are per package. This
makes monitoring a hierarchy like cgroup along with monitoring of tasks
separately difficult and several patches sent to lkml to fix them were
NACKed. Further more, the following issues in the current perf cqm make
it almost unusable:
1. No support to monitor the same group of tasks for which we do
allocation using resctrl.
2. It gives random and inaccurate data (mostly 0s) once we run out
of RMIDs due to issues in Recycling.
3. Recycling results in inaccuracy of data because we cannot
guarantee that the RMID was stolen from a task when it was not
pulling data into cache or even when it pulled the least data. Also
for monitoring llc_occupancy, if we stop using an RMID_x and then
start using an RMID_y after we reclaim an RMID from an other event,
we miss accounting all the occupancy that was tagged to RMID_x at a
later perf_count.
2. Recycling code makes the monitoring code complex including
scheduling because the event can lose RMID any time. Since MBM
counters count bandwidth for a period of time by taking snap shot of
total bytes at two different times, recycling complicates the way we
count MBM in a hierarchy. Also we need a spin lock while we do the
processing to account for MBM counter overflow. We also currently
use a spin lock in scheduling to prevent the RMID from being taken
away.
4. Lack of support when we run different kind of event like task,
system-wide and cgroup events together. Data mostly prints 0s. This
is also because we can have only one RMID tied to a cpu as defined
by the cqm hardware but a perf can at the same time tie multiple
events during one sched_in.
5. No support of monitoring a group of tasks. There is partial support
for cgroup but it does not work once there is a hierarchy of cgroups
or if we want to monitor a task in a cgroup and the cgroup itself.
6. No support for monitoring tasks for the lifetime without perf
overhead.
7. It reported the aggregate cache occupancy or memory bandwidth over
all sockets. But most cloud and VMM based use cases want to know the
individual per-socket usage.
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
|
||
|
|
5518b69b76 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
"Reasonably busy this cycle, but perhaps not as busy as in the 4.12
merge window:
1) Several optimizations for UDP processing under high load from
Paolo Abeni.
2) Support pacing internally in TCP when using the sch_fq packet
scheduler for this is not practical. From Eric Dumazet.
3) Support mutliple filter chains per qdisc, from Jiri Pirko.
4) Move to 1ms TCP timestamp clock, from Eric Dumazet.
5) Add batch dequeueing to vhost_net, from Jason Wang.
6) Flesh out more completely SCTP checksum offload support, from
Davide Caratti.
7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
Neira Ayuso, and Matthias Schiffer.
8) Add devlink support to nfp driver, from Simon Horman.
9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
Prabhu.
10) Add stack depth tracking to BPF verifier and use this information
in the various eBPF JITs. From Alexei Starovoitov.
11) Support XDP on qed device VFs, from Yuval Mintz.
12) Introduce BPF PROG ID for better introspection of installed BPF
programs. From Martin KaFai Lau.
13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.
14) For loads, allow narrower accesses in bpf verifier checking, from
Yonghong Song.
15) Support MIPS in the BPF selftests and samples infrastructure, the
MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
Daney.
16) Support kernel based TLS, from Dave Watson and others.
17) Remove completely DST garbage collection, from Wei Wang.
18) Allow installing TCP MD5 rules using prefixes, from Ivan
Delalande.
19) Add XDP support to Intel i40e driver, from Björn Töpel
20) Add support for TC flower offload in nfp driver, from Simon
Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
Kicinski, and Bert van Leeuwen.
21) IPSEC offloading support in mlx5, from Ilan Tayari.
22) Add HW PTP support to macb driver, from Rafal Ozieblo.
23) Networking refcount_t conversions, From Elena Reshetova.
24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
for tuning the TCP sockopt settings of a group of applications,
currently via CGROUPs"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
cxgb4: Support for get_ts_info ethtool method
cxgb4: Add PTP Hardware Clock (PHC) support
cxgb4: time stamping interface for PTP
nfp: default to chained metadata prepend format
nfp: remove legacy MAC address lookup
nfp: improve order of interfaces in breakout mode
net: macb: remove extraneous return when MACB_EXT_DESC is defined
bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
bpf: fix return in load_bpf_file
mpls: fix rtm policy in mpls_getroute
net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
...
|
||
|
|
f91840a32d |
perf, bpf: Add BPF support to all perf_event types
Allow BPF_PROG_TYPE_PERF_EVENT program types to attach to all perf_event types, including HW_CACHE, RAW, and dynamic pmu events. Only tracepoint/kprobe events are treated differently which require BPF_PROG_TYPE_TRACEPOINT/BPF_PROG_TYPE_KPROBE program types accordingly. Also add support for reading all event counters using bpf_perf_event_read() helper. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
a63fbed776 |
perf/tracing/cpuhotplug: Fix locking order
perf, tracing, kprobes and jump_labels have a gazillion of ways to create dependency lock chains. Some of those involve nested invocations of get_online_cpus(). The conversion of the hotplug locking to a percpu rwsem requires to avoid such nested calls. sys_perf_event_open() protects most of the syscall logic against cpu hotplug. This causes nested calls and lock inversions versus ftrace and kprobes in various interesting ways. It's impossible to move the hotplug locking to the outer end of all call chains in the involved facilities, so the hotplug protection in sys_perf_event_open() needs to be solved differently. Introduce 'pmus_mutex' which protects a perf private online cpumask. This mutex is taken when the mask is updated in the cpu hotplug callbacks and can be taken in sys_perf_event_open() to protect the swhash setup/teardown code and when the final judgement about a valid event has to be made. [ tglx: Produced changelog and fixed the swhash interaction ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: http://lkml.kernel.org/r/20170524081548.930941109@linutronix.de |
||
|
|
cf25f904ef |
x86/events/amd/iommu: Add IOMMU-specific hw_perf_event struct
Current AMD IOMMU perf PMU inappropriately uses the hardware struct inside the union in struct hw_perf_event, extra_reg in particular. Instead, introduce an AMD IOMMU-specific struct with required parameters to be programmed into the IOMMU performance counter control register. Update the pasid field from 16 to 20 bits while at it. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> [ Fixup macros, shorten get_next_avail_iommu_bnk_cntr() local vars, massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: iommu@lists.linux-foundation.org Link: http://lkml.kernel.org/r/1487926102-13073-10-git-send-email-Suravee.Suthikulpanit@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
f4c0b0aa58 |
perf/core: Keep AUX flags in the output handle
In preparation for adding more flags to perf AUX records, introduce a separate API for setting the flags for a session, rather than appending more bool arguments to perf_aux_output_end. This allows to set each flag at the time a corresponding condition is detected, instead of tracking it in each driver's private state. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170220133352.17995-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e422267322 |
perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
With the advert of container technologies like docker, that depend on namespaces for isolation, there is a need for tracing support for namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for recording namespaces related info. By recording info for every namespace, it is left to userspace to take a call on the definition of a container and trace containers by updating perf tool accordingly. Each namespace has a combination of device and inode numbers. Though every namespace has the same device number currently, that may change in future to avoid the need for a namespace of namespaces. Considering such possibility, record both device and inode numbers separately for each namespace. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
6ce77bfd6c |
perf/core: Allow kernel filters on CPU events
While supporting file-based address filters for CPU events requires some extra context switch handling, kernel address filters are easy, since the kernel mapping is preserved across address spaces. It is also useful as it permits tracing scheduling paths of the kernel. This patch allows setting up kernel filters for CPU events. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Will Deacon <will.deacon@arm.com> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170126094057.13805-4-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1fd7e41699 |
perf/core: Remove perf_cpu_context::unique_pmu
cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs with shared pmus in order to avoid visiting the same cpuctx more than once in a for_each_pmu loop. cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it. The change above, together with the previous patch in this series, removed the remaining uses of cpuctx->unique_pmu, so we remove it altogether. Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: Vince Weaver <vince@deater.net> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
058fe1c044 |
perf/core: Make cgroup switch visit only cpuctxs with cgroup events
This patch follows from a conversation in CQM/CMT's last series about speeding up the context switch for cgroup events: https://patchwork.kernel.org/patch/9478617/ This is a low-hanging fruit optimization. It replaces the iteration over the "pmus" list in cgroup switch by an iteration over a new list that contains only cpuctxs with at least one cgroup event. This is necessary because the number of PMUs have increased over the years e.g modern x86 server systems have well above 50 PMUs. The iteration over the full PMU list is unneccessary and can be costly in heavy cache contention scenarios. Below are some instrumentation measurements with 10, 50 and 90 percentiles of the total cost of context switch before and after this optimization for a simple array read/write microbenchark. Contention Level Nr events Before (us) After (us) Median L2 L3 types (10%, 50%, 90%) (10%, 50%, 90% Speedup -------------------------------------------------------------------------- Low Low 1 (1.72, 2.42, 5.85) (1.35, 1.64, 5.46) 29% High Low 1 (2.08, 4.56, 19.8) (1720, 2.20, 13.7) 51% High High 1 (2.86, 10.4, 12.7) (2.54, 4.32, 12.1) 58% Low Low 2 (1.98, 3.20, 6.89) (1.68, 2.41, 8.89) 24% High Low 2 (2.48, 5.28, 22.4) (2150, 3.69, 14.6) 30% High High 2 (3.32, 8.09, 13.9) (2.80, 5.15, 13.7) 36% where: 1 event type = cycles 2 event types = cycles,intel_cqm/llc_occupancy/ Contention L2 Low: workset < L2 cache size. High: " >> L2 " " . Contention L3 Low: workset of task on all sockets < L3 cache size. High: " " " " " " >> L3 " " . Median Speedup is (50%ile Before - 50%ile After) / 50%ile Before Unsurprisingly, the benefits of this optimization decrease with the number of cpuctxs with a cgroup events, yet, is never detrimental. Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: Vince Weaver <vince@deater.net> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
475113d937 |
perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:
taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:
NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
...
RIP: 0010:[<ffffffff81159232>] [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
...
Call Trace:
? trace_hardirqs_on_caller+0xf5/0x1b0
? perf_cgroup_attach+0x70/0x70
perf_install_in_context+0x199/0x1b0
? ctx_resched+0x90/0x90
SYSC_perf_event_open+0x641/0xf90
SyS_perf_event_open+0x9/0x10
do_syscall_64+0x6c/0x1f0
entry_SYSCALL64_slow_path+0x25/0x25
Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.
We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
5aab90ce1e |
perf/powerpc: Don't call perf_event_disable() from atomic context
The trinity syscall fuzzer triggered following WARN() on powerpc: WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278 ... NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0 LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 Call Trace: [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable) [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 Followed by a lockdep warning: =============================== [ INFO: suspicious RCU usage. ] 4.8.0-rc5+ #7 Tainted: G W ------------------------------- ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ls/2998: #0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0 #1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0 stack backtrace: CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7 Call Trace: [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable) [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180 [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0 [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0 [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380 [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60 [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0 [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0 [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0 [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0 [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100 [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48 While it looks like the first WARN() is probably valid, the other one is triggered by disabling event via perf_event_disable() from atomic context. The event is disabled here in case we were not able to emulate the instruction that hit the breakpoint. By disabling the event we unschedule the event and make sure it's not scheduled back. But we can't call perf_event_disable() from atomic context, instead we need to use the event's pending_disable irq_work method to disable it. Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
687ee0ad4e |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
co. at Google. https://lwn.net/Articles/701165/
2) Do TCP Small Queues for retransmits, from Eric Dumazet.
3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
Starovoitov.
4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.
5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.
6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.
7) Support ndo_poll_controller in mlx5, from Calvin Owens.
8) Move VRF processing to an output hook and allow l3mdev to be
loopback, from David Ahern.
9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.
10) Congestion control in RXRPC, from David Howells.
11) Support geneve RX offload in ixgbe, from Emil Tantilov.
12) When hitting pressure for new incoming TCP data SKBs, perform a
partial rathern than a full purge of the OFO queue (which could be
huge). From Eric Dumazet.
13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.
14) Support RX network flow classification to igb, from Gangfeng Huang.
15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.
16) New skbmod packet action, from Jamal Hadi Salim.
17) Remove some inefficiencies in snmp proc output, from Jia He.
18) Add FIB notifications to properly propagate route changes to
hardware which is doing forwarding offloading. From Jiri Pirko.
19) New dsa driver for qca8xxx chips, from John Crispin.
20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
Żenczykowski.
21) Add L3 mode to ipvlan, from Mahesh Bandewar.
22) Support 802.1ad in mlx4, from Moshe Shemesh.
23) Support hardware LRO in mediatek driver, from Nelson Chang.
24) Add TC offloading to mlx5, from Or Gerlitz.
25) Convert various drivers to ethtool ksettings interfaces, from
Philippe Reynes.
26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.
27) NAPI support for ath10k, from Rajkumar Manoharan.
28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.
29) UDP replicast support in TIPC, from Richard Alpe.
30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.
31) Support BQL in thunderx driver, from Sunil Goutham.
32) TSO support in alx driver, from Tobias Regnery.
33) Add stream parser engine and use it in kcm.
34) Support async DHCP replies in ipconfig module, from Uwe
Kleine-König.
35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
mlxsw: switchx2: Fix misuse of hard_header_len
mlxsw: spectrum: Fix misuse of hard_header_len
net/faraday: Stop NCSI device on shutdown
net/ncsi: Introduce ncsi_stop_dev()
net/ncsi: Rework the channel monitoring
net/ncsi: Allow to extend NCSI request properties
net/ncsi: Rework request index allocation
net/ncsi: Don't probe on the reserved channel ID (0x1f)
net/ncsi: Introduce NCSI_RESERVED_CHANNEL
net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
net: Add netdev all_adj_list refcnt propagation to fix panic
net: phy: Add Edge-rate driver for Microsemi PHYs.
vmxnet3: Wake queue from reset work
i40e: avoid NULL pointer dereference and recursive errors on early PCI error
qed: Add RoCE ll2 & GSI support
qed: Add support for memory registeration verbs
qed: Add support for QP verbs
qed: PD,PKEY and CQ verb support
qed: Add support for RoCE hw init
qede: Add qedr framework
...
|
||
|
|
aa6a5f3cb2 |
perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events via overflow_handler mechanism. When program is attached the overflow_handlers become stacked. The program acts as a filter. Returning zero from the program means that the normal perf_event_output handler will not be called and sampling event won't be stored in the ring buffer. The overflow_handler_context==NULL is an additional safety check to make sure programs are not attached to hw breakpoints and watchdog in case other checks (that prevent that now anyway) get accidentally relaxed in the future. The program refcnt is incremented in case perf_events are inhereted when target task is forked. Similar to kprobe and tracepoint programs there is no ioctl to detach the program or swap already attached program. The user space expected to close(perf_event_fd) like it does right now for kprobe+bpf. That restriction simplifies the code quite a bit. The invocation of overflow_handler in __perf_event_overflow() is now done via READ_ONCE, since that pointer can be replaced when the program is attached while perf_event itself could have been active already. There is no need to do similar treatment for event->prog, since it's assigned only once before it's accessed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
0515e5999a |
bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)
The program visible context meta structure is
struct bpf_perf_event_data {
struct pt_regs regs;
__u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
... ctx->sample_period ...
... ctx->regs.ip ...
}
The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
||
|
|
d6a2f9035b |
perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG
Introduce the flag PMU_EV_CAP_READ_ACTIVE_PKG, useful for uncore events, that allows a PMU to signal the generic perf code that an event is readable in the current CPU if the event is active in a CPU in the same package as the current CPU. This is an optimization that avoids a unnecessary IPI for the common case where uncore events are run and read in the same package but in different CPUs. As an example, the IPI removal speeds up perf_read() in my Haswell system as follows: - For event UNC_C_LLC_LOOKUP: From 260 us to 31 us. - For event RAPL's power/energy-cores/: From to 255 us to 27 us. For the optimization to work, all events in the group must have it (similarly to PERF_EV_CAP_SOFTWARE). Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1471467307-61171-4-git-send-email-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
4ff6a8debf |
perf/core: Generalize event->group_flags
Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a group's leader to indicate that is_software_event(event) is true for all events in a group. This is the only usage of event->group_flags. This pattern of setting a group level flags when all events in the group share a property is useful for the flag introduced in the next patch and for future CQM/CMT flags. So this patches generalizes group_flags to work as an aggregate of event level flags. PERF_GROUP_SOFTWARE denotes an inmutable event's property. All other flags that I intend to add are also determinable at event initialization. To better convey the above, this patch renames event's group_flags to group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE. Individual event flags are stored in the new event->event_caps. Since the cap flags do not change after event initialization, there is no need to serialize event_caps. This new field is used when events are added to a context, similarly to how PERF_GROUP_SOFTWARE and is_software_event() worked. Lastly, for consistency, updates is_software_event() to rely in event_cap instead of the context index. Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e48c178814 |
perf/core: Optimize perf_pmu_sched_task()
For perf record -b, which requires the pmu::sched_task callback the
current code is rather expensive:
7.68% sched-pipe [kernel.vmlinux] [k] perf_pmu_sched_task
5.95% sched-pipe [kernel.vmlinux] [k] __switch_to
5.20% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
3.95% sched-pipe perf [.] worker_thread
The problem is that it will iterate all registered PMUs, most of which
will not have anything to do. Avoid this by keeping an explicit list
of PMUs that have requested the callback.
The perf_sched_cb_{inc,dec}() functions already takes the required pmu
argument, and now that these functions are no longer called from NMI
context we can use them to manage a list.
With this patch applied the function doesn't show up in the top 4
anymore (it dropped to 18th place).
6.67% sched-pipe [kernel.vmlinux] [k] __switch_to
6.18% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
3.92% sched-pipe [kernel.vmlinux] [k] switch_mm_irqs_off
3.71% sched-pipe perf [.] worker_thread
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
db4a835601 |
perf/core: Set cgroup in CPU contexts for new cgroup events
There's a perf stat bug easy to observer on a machine with only one cgroup:
$ perf stat -e cycles -I 1000 -C 0 -G /
# time counts unit events
1.000161699 <not counted> cycles /
2.000355591 <not counted> cycles /
3.000565154 <not counted> cycles /
4.000951350 <not counted> cycles /
We'd expect some output there.
The underlying problem is that there is an optimization in
perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
if the old and new cgroups in a task switch are the same.
This optimization interacts with the current code in two ways
that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
cgroup event matches the current task. These are:
1. On creation of the first cgroup event in a CPU: In current code,
cpuctx->cpu is only set in perf_cgroup_sched_in, but due to the
aforesaid optimization, perf_cgroup_sched_in will run until the next
cgroup switches in that CPU. This may happen late or never happen,
depending on system's number of cgroups, CPU load, etc.
2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in
because cpuctx->cgrp == NULL until a cgroup switch occurs and
perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
mirroring what list_del_event does when removing a cgroup event from CPU
context, as introduced in:
commit
|
||
|
|
a6408f6cb6 |
Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
"This is the next part of the hotplug rework.
- Convert all notifiers with a priority assigned
- Convert all CPU_STARTING/DYING notifiers
The final removal of the STARTING/DYING infrastructure will happen
when the merge window closes.
Another 700 hundred line of unpenetrable maze gone :)"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
timers/core: Correct callback order during CPU hot plug
leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
powerpc/numa: Convert to hotplug state machine
arm/perf: Fix hotplug state machine conversion
irqchip/armada: Avoid unused function warnings
ARC/time: Convert to hotplug state machine
clocksource/atlas7: Convert to hotplug state machine
clocksource/armada-370-xp: Convert to hotplug state machine
clocksource/exynos_mct: Convert to hotplug state machine
clocksource/arm_global_timer: Convert to hotplug state machine
rcu: Convert rcutree to hotplug state machine
KVM/arm/arm64/vgic-new: Convert to hotplug state machine
smp/cfd: Convert core to hotplug state machine
x86/x2apic: Convert to CPU hotplug state machine
profile: Convert to hotplug state machine
timers/core: Convert to hotplug state machine
hrtimer: Convert to hotplug state machine
x86/tboot: Convert to hotplug state machine
arm64/armv8 deprecated: Convert to hotplug state machine
hwtracing/coresight-etm4x: Convert to hotplug state machine
...
|
||
|
|
468fc7ed55 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...
|
||
|
|
aa7145c16d |
bpf, events: fix offset in skb copy handler
This patch fixes the __output_custom() routine we currently use with bpf_skb_copy(). I missed that when len is larger than the size of the current handle, we can issue multiple invocations of copy_func, and __output_custom() advances destination but also source buffer by the written amount of bytes. When we have __output_custom(), this is actually wrong since in that case the source buffer points to a non-linear object, in our case an skb, which the copy_func helper is supposed to walk. Therefore, since this is non-linear we thus need to pass the offset into the helper, so that copy_func can use it for extracting the data from the source object. Therefore, adjust the callback signatures properly and pass offset into the skb_header_pointer() invoked from bpf_skb_copy() callback. The __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things: i) to pass in whether we should advance source buffer or not; this is a compile-time constant condition, ii) to pass in the offset for __output_custom(), which we do with help of __VA_ARGS__, so everything can stay inlined as is currently. Both changes allow for adapting the __output_* fast-path helpers w/o extra overhead. Fixes: |
||
|
|
7e3f977edd |
perf, events: add non-linear data support for raw records
This patch adds support for non-linear data on raw records. It extends raw records to have one or multiple fragments that will be written linearly into the ring slot, where each fragment can optionally have a custom callback handler to walk and extract complex, possibly non-linear data. If a callback handler is provided for a fragment, then the new __output_custom() will be used instead of __output_copy() for the perf_output_sample() part. perf_prepare_sample() does all the size calculation only once, so perf_output_sample() doesn't need to redo the same work anymore, meaning real_size and padding will be cached in the raw record. The raw record becomes 32 bytes in size without holes; to not increase it further and to avoid doing unnecessary recalculations in fast-path, we can reuse next pointer of the last fragment, idea here is borrowed from ZERO_OR_NULL_PTR(), which should keep the perf_output_sample() path for PERF_SAMPLE_RAW minimal. This facility is needed for BPF's event output helper as a first user that will, in a follow-up, add an additional perf_raw_frag to its perf_raw_record in order to be able to more efficiently dump skb context after a linear head meta data related to it. skbs can be non-linear and thus need a custom output function to dump buffers. Currently, the skb data needs to be copied twice; with the help of __output_custom() this work only needs to be done once. Future users could be things like XDP/BPF programs that work on different context though and would thus also have a different callback function. The few users of raw records are adapted to initialize their frag data from the raw record itself, no change in behavior for them. The code is based upon a PoC diff provided by Peter Zijlstra [1]. [1] http://thread.gmane.org/gmane.linux.network/421294 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
89ab9cb169 |
perf/core: Remove perf CPU notifier code
All users converted to state machine callbacks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160713153335.115333381@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
00e16c3d68 |
perf/core: Convert to hotplug state machine
Actually a nice symmetric startup/teardown pair which fits properly into the state machine concept. In the long run we should be able to invoke the startup callback for the boot CPU via the state machine and get rid of the init function which invokes it on the boot CPU. Note: This comes actually before the perf hardware callbacks. In the notifier model the hardware callbacks have a higher priority than the core callback. But that's solely for CPU offline so that hardware migration of events happens before the core is notified about the outgoing CPU. With the symetric state array model we have the following ordering: UP: core -> hardware DOWN: hardware -> core Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Reviewed-by: Sebastian Siewior <bigeasy@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160713153333.587514098@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
616d1c1b98 |
Merge branch 'linus' into perf/core, to refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
fc07e9f983 |
perf/x86: Support sysfs files depending on SMT status
Add a way to show different sysfs events attributes depending on HyperThreading is on or off. This is difficult to determine early at boot, so we just do it dynamically when the sysfs attribute is read. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/1463703002-19686-3-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
f2fb6bef92 |
perf/core: Optimize side-band event delivery
The perf_event_aux() function iterates all PMUs and all events in their respective per-CPU contexts to find the events to deliver side-band records to. For example, the brk test case in lkp triggers many mmap() operations, which, if we're also running perf, results in many perf_event_aux() invocations. If we enable uncore PMU support (even when uncore events are not used), dozens of uncore PMUs will be iterated, which can significantly decrease brk_test's throughput. For example, the brk throughput: without uncore PMUs: 2647573 ops_per_sec with uncore PMUs: 1768444 ops_per_sec ... a 33% reduction. To get at the per-CPU events that need side-band records, this patch puts these events on a per-CPU list, this avoids iterating the PMUs and any events that do not need side-band records. Per task events are unchanged to avoid extra overhead on the context switch paths. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Huang, Ying <ying.huang@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1458757477-3781-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
97c79a38cd |
perf core: Per event callchain limit
Additionally to being able to control the system wide maximum depth via /proc/sys/kernel/perf_event_max_stack, now we are able to ask for different depths per event, using perf_event_attr.sample_max_stack for that. This uses an u16 hole at the end of perf_event_attr, that, when perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if sample_max_stack is zero, means use perf_event_max_stack, otherwise it'll be bounds checked under callchain_mutex. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
bdc6b758e4 |
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar: "Mostly tooling and PMU driver fixes, but also a number of late updates such as the reworking of the call-chain size limiting logic to make call-graph recording more robust, plus tooling side changes for the new 'backwards ring-buffer' extension to the perf ring-buffer" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) perf record: Read from backward ring buffer perf record: Rename variable to make code clear perf record: Prevent reading invalid data in record__mmap_read perf evlist: Add API to pause/resume perf trace: Use the ptr->name beautifier as default for "filename" args perf trace: Use the fd->name beautifier as default for "fd" args perf report: Add srcline_from/to branch sort keys perf evsel: Record fd into perf_mmap perf evsel: Add overwrite attribute and check write_backward perf tools: Set buildid dir under symfs when --symfs is provided perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced perf annotate: Sort list of recognised instructions perf annotate: Fix identification of ARM blt and bls instructions perf tools: Fix usage of max_stack sysctl perf callchain: Stop validating callchains by the max_stack sysctl perf trace: Fix exit_group() formatting perf top: Use machine->kptr_restrict_warned perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1 perf machine: Do not bail out if not managing to read ref reloc symbol perf/x86/intel/p4: Trival indentation fix, remove space ... |
||
|
|
a7fd20d1c4 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
|
||
|
|
c85b033496 |
perf core: Separate accounting of contexts and real addresses in a stack trace
The perf_sample->ip_callchain->nr value includes all the entries in the
ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
while what the user expects is that what is in the kernel.perf_event_max_stack
sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
honoured in terms of IP addresses in the stack trace.
So allocate a bunch of extra entries for contexts, and do the accounting
via perf_callchain_entry_ctx struct members.
A new sysctl, kernel.perf_event_max_contexts_per_stack is also
introduced for investigating possible bugs in the callchain
implementation by some arch.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
3e4de4ec4c |
perf core: Add perf_callchain_store_context() helper
We need have different helpers to account how many contexts we have in the sample and for real addresses, so do it now as a prep patch, to ease review. Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-q964tnyuqrxw5gld18vizs3c@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
3b1fff0803 |
perf core: Add a 'nr' field to perf_event_callchain_context
We will use it to count how many addresses are in the entry->ip[] array,
excluding PERF_CONTEXT_{KERNEL,USER,etc} entries, so that we can really
return the number of entries specified by the user via the relevant
sysctl, kernel.perf_event_max_contexts, or via the per event
perf_event_attr.sample_max_stack knob.
This way we keep the perf_sample->ip_callchain->nr meaning, that is the
number of entries, be it real addresses or PERF_CONTEXT_ entries, while
honouring the max_stack knobs, i.e. the end result will be max_stack
entries if we have at least that many entries in a given stack trace.
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-s8teto51tdqvlfhefndtat9r@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
cfbcf46845 |
perf core: Pass max stack as a perf_callchain_entry context
This makes perf_callchain_{user,kernel}() receive the max stack
as context for the perf_callchain_entry, instead of accessing
the global sysctl_perf_event_max_stack.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
5101ef20f0 |
perf/arm: Special-case hetereogeneous CPUs
Commit: |
||
|
|
375637bc52 |
perf/core: Introduce address range filtering
Many instruction tracing PMUs out there support address range-based filtering, which would, for example, generate trace data only for a given range of instruction addresses, which is useful for tracing individual functions, modules or libraries. Other PMUs may also utilize this functionality to allow filtering to or filtering out code at certain address ranges. This patch introduces the interface for userspace to specify these filters and for the PMU drivers to apply these filters to hardware configuration. The user interface is an ASCII string that is passed via an ioctl() and specifies (in the form of an ASCII string) address ranges within certain object files or within kernel. There is no special treatment for kernel modules yet, but it might be a worthy pursuit. The PMU driver interface basically adds two extra callbacks to the PMU driver structure, one of which validates the filter configuration proposed by the user against what the hardware is actually capable of doing and the other one translates hardware-independent filter configuration into something that can be programmed into the hardware. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1461771888-10409-6-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
c5dfd78eb7 |
perf core: Allow setting up max frame stack depth via sysctl
The default remains 127, which is good for most cases, and not even hit most of the time, but then for some cases, as reported by Brendan, 1024+ deep frames are appearing on the radar for things like groovy, ruby. And in some workloads putting a _lower_ cap on this may make sense. One that is per event still needs to be put in place tho. The new file is: # cat /proc/sys/kernel/perf_event_max_stack 127 Chaging it: # echo 256 > /proc/sys/kernel/perf_event_max_stack # cat /proc/sys/kernel/perf_event_max_stack 256 But as soon as there is some event using callchains we get: # echo 512 > /proc/sys/kernel/perf_event_max_stack -bash: echo: write error: Device or resource busy # Because we only allocate the callchain percpu data structures when there is a user, which allows for changing the max easily, its just a matter of having no callchain users at that point. Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Ahern <dsahern@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Milian Wolff <milian.wolff@kdab.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Wang Nan <wangnan0@huawei.com> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> |
||
|
|
9ecda41acb |
perf/core: Add ::write_backward attribute to perf event
This patch introduces 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. After set, the corresponding
ring buffer is written from end to beginning. This feature is design to
support reading from overwritable ring buffer.
Ring buffer can be created by mapping a perf event fd. Kernel puts event
records into ring buffer, user tooling like perf fetch them from
address returned by mmap(). To prevent racing between kernel and tooling,
they communicate to each other through 'head' and 'tail' pointers.
Kernel maintains 'head' pointer, points it to the next free area (tail
of the last record). Tooling maintains 'tail' pointer, points it to the
tail of last consumed record (record has already been fetched). Kernel
determines the available space in a ring buffer using these two
pointers to avoid overwrite unfetched records.
By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Different from normal ring buffer, tooling is unable to maintain 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffers, kernel overwrite old records unconditionally, works like flight
recorder. This feature would be useful if reading from overwritable ring
buffer were as easy as reading from normal ring buffer. However,
there's an obscure problem.
The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of last record, and a
long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
would have pointed to position (X), so kernel knows there's no more
space in the ring buffer. However, for an overwritable ring buffer,
kernel ignore the 'tail' pointer.
(X) head
. |
. V
+------+-------+----------+------+---+
|A....A|B.....B|C........C|D....D| |
+------+-------+----------+------+---+
Record 'A' is overwritten by event 'E':
head
|
V
+--+---+-------+----------+------+---+
|.E|..A|B.....B|C........C|D....D|E..|
+--+---+-------+----------+------+---+
Now tooling decides to read from this ring buffer. However, none of these
two natural positions, 'head' and the start of this ring buffer, are
pointing to the head of a record. Even the full ring buffer can be
accessed by tooling, it is unable to find a position to start decoding.
The first attempt tries to solve this problem AFAIK can be found from
[1]. It makes kernel to maintain 'tail' pointer: updates it when ring
buffer is half full. However, this approach introduces overhead to
fast path. Test result shows a 1% overhead [2]. In addition, this method
utilizes no more tham 50% records.
Another attempt can be found from [3], which allows putting the size of
an event at the end of each record. This approach allows tooling to find
records in a backward manner from 'head' pointer by reading size of a
record from its tail. However, because of alignment requirement, it
needs 8 bytes to record the size of a record, which is a huge waste. Its
performance is also not good, because more data need to be written.
This approach also introduces some extra branch instructions to fast
path.
'write_backward' is a better solution to this problem.
Following figure demonstrates the state of the overwritable ring buffer
when 'write_backward' is set before overwriting:
head
|
V
+---+------+----------+-------+------+
| |D....D|C........C|B.....B|A....A|
+---+------+----------+-------+------+
and after overwriting:
head
|
V
+---+------+----------+-------+---+--+
|..E|D....D|C........C|B.....B|A..|E.|
+---+------+----------+-------+---+--+
In each situation, 'head' points to the beginning of the newest record.
From this record, tooling can iterate over the full ring buffer and fetch
records one by one.
The only limitation that needs to be considered is back-to-back reading.
Due to the non-deterministic of user programs, it is impossible to ensure
the ring buffer keeps stable during reading. Consider an extreme situation:
tooling is scheduled out after reading record 'D', then a burst of events
come, eat up the whole ring buffer (one or multiple rounds). When the
tooling process comes back, reading after 'D' is incorrect now.
To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).
By carefully verifying 'header' pointer, reader can avoid pausing the
ring-buffer. For example:
/* A union of all possible events */
union perf_event event;
p = head = perf_mmap__read_head();
while (true) {
/* copy header of next event */
fetch(&event.header, p, sizeof(event.header));
/* read 'head' pointer */
head = perf_mmap__read_head();
/* check overwritten: is the header good? */
if (!verify(sizeof(event.header), p, head))
break;
/* copy the whole event */
fetch(&event, p, event.header.size);
/* read 'head' pointer again */
head = perf_mmap__read_head();
/* is the whole event good? */
if (!verify(event.header.size, p, head))
break;
p += event.header.size;
}
However, the overhead is high because:
a) In-place decoding is not safe.
Copying-verifying-decoding is required.
b) Fetching 'head' pointer requires additional synchronization.
(From Alexei Starovoitov:
Even when this trick works, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and buffer should be paused, events read and unpaused until
next trigger comes.)
This patch utilizes event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
fast path, original perf_event_output() becomes __perf_event_output()
and marked '__always_inline'. In theory, there's no extra overhead
introduced to fast path.
Performance testing:
Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.
Testing environment:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Kernel : v4.5.0
MEAN STDVAR
BASE 800214.950 2853.083
PRE1 2253846.700 9997.014
PRE2 2257495.540 8516.293
POST 2250896.100 8933.921
Where 'BASE' is pure performance without capturing. 'PRE1' is test
result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
patch. 'POST' is test result after this patch. See [4] for the detailed
experimental setup.
Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path.
[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
[4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: <acme@kernel.org>
Cc: <pi3orama@163.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
[ Fixed the changelog some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
889fac6d67 |
Linux 4.6-rc3
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXCva8AAoJEHm+PkMAQRiGXBoIAIkrjxdbuT2nS9A3tHwkiFXa 6/Th1UjbNaoLuZ+MckQHayAD9NcWY9lVjOUmFsSiSWMCQK/rTWDl8x5ITputrY2V VuhrJCwI7huEtu6GpRaJaUgwtdOjhIHz1Ue2MCdNIbKX3l+LjVyyJ9Vo8rruvZcR fC7kiivH04fYX58oQ+SHymCg54ny3qJEPT8i4+g26686m11hvZLI3UAs2PAn6ut+ atCjxdQ4yLN3DWsbjuA7wYGWhTgFloxL4TIoisuOUc3FXnSi/ivIbXZvu4lUfisz LA2JBhfII3AEMBWG9xfGbXPijJTT4q7yNlTD0oYcnMtAt/Roh2F04asqB1LetEY= =bri6 -----END PGP SIGNATURE----- Merge tag 'v4.6-rc3' into perf/core, to refresh the tree Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1e1dcd93b4 |
perf: split perf_trace_buf_prepare into alloc and update parts
split allows to move expensive update of 'struct trace_entry' to later phase. Repurpose unused 1st argument of perf_tp_event() to indicate event type. While splitting use temp variable 'rctx' instead of '*rctx' to avoid unnecessary loads done by the compiler due to -fno-strict-aliasing Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
ec5e099d6e |
perf: optimize perf_fetch_caller_regs
avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints. It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu init logic and subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs, so we can safely drop memset from all of the above cases and move it into perf_ftrace_function_call that calls it with stack allocated pt_regs. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
1879445dfa |
perf/core: Set event's default ::overflow_handler()
Set a default event->overflow_handler in perf_event_alloc() so don't need to check event->overflow_handler in __perf_event_overflow(). Following commits can give a different default overflow_handler. Initial idea comes from Peter: http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net Since the default value of event->overflow_handler is not NULL, existing 'if (!overflow_handler)' checks need to be changed. is_default_overflow_handler() is introduced for this. No extra performance overhead is introduced into the hot path because in the original code we still need to read this handler from memory. A conditional branch is avoided so actually we remove some instructions. Signed-off-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <pi3orama@163.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
3fa2fe2ce0 |
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar: "This tree contains various perf fixes on the kernel side, plus three hw/event-enablement late additions: - Intel Memory Bandwidth Monitoring events and handling - the AMD Accumulated Power Mechanism reporting facility - more IOMMU events ... and a final round of perf tooling updates/fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) perf llvm: Use strerror_r instead of the thread unsafe strerror one perf llvm: Use realpath to canonicalize paths perf tools: Unexport some methods unused outside strbuf.c perf probe: No need to use formatting strbuf method perf help: Use asprintf instead of adhoc equivalents perf tools: Remove unused perf_pathdup, xstrdup functions perf tools: Do not include stringify.h from the kernel sources tools include: Copy linux/stringify.h from the kernel tools lib traceevent: Remove redundant CPU output perf tools: Remove needless 'extern' from function prototypes perf tools: Simplify die() mechanism perf tools: Remove unused DIE_IF macro perf script: Remove lots of unused arguments perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve perf machine: Rename perf_event__preprocess_sample to machine__resolve perf tools: Add cpumode to struct perf_sample perf tests: Forward the perf_sample in the dwarf unwind test perf tools: Remove misplaced __maybe_unused perf list: Fix documentation of :ppp perf bench numa: Fix assertion for nodes bitfield ... |
||
|
|
c7ab62bfbe |
perf/x86/amd/power: Add AMD accumulated power reporting mechanism
Introduce an AMD accumlated power reporting mechanism for the Family
15h, Model 60h processor that can be used to calculate the average
power consumed by a processor during a measurement interval. The
feature support is indicated by CPUID Fn8000_0007_EDX[12].
This feature will be implemented both in hwmon and perf. The current
design provides one event to report per package/processor power
consumption by counting each compute unit power value.
Here the gory details of how the computation is done:
* Tsample: compute unit power accumulator sample period
* Tref: the PTSC counter period (PTSC: performance timestamp counter)
* N: the ratio of compute unit power accumulator sample period to the
PTSC period
* Jmax: max compute unit accumulated power which is indicated by
MSR_C001007b[MaxCpuSwPwrAcc]
* Jx/Jy: compute unit accumulated power which is indicated by
MSR_C001007a[CpuSwPwrAcc]
* Tx/Ty: the value of performance timestamp counter which is indicated
by CU_PTSC MSR_C0010280[PTSC]
* PwrCPUave: CPU average power
i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
ii. Read the full range of the cumulative energy value from the new
MSR MaxCpuSwPwrAcc.
Jmax = value returned.
iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.
iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.
v. Calculate the average power consumption for a compute unit over
time period (y-x). Unit of result is uWatt:
if (Jy < Jx) // Rollover has occurred
Jdelta = (Jy + Jmax) - Jx
else
Jdelta = Jy - Jx
PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
Simple example:
root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
CHK include/config/kernel.release
CHK include/generated/uapi/linux/version.h
CHK include/generated/utsrelease.h
CHK include/generated/timeconst.h
CHK include/generated/bounds.h
CHK include/generated/asm-offsets.h
CALL scripts/checksyscalls.sh
CHK include/generated/compile.h
SKIPPED include/generated/compile.h
Building modules, stage 2.
Kernel: arch/x86/boot/bzImage is ready (#40)
MODPOST 4225 modules
Performance counter stats for 'system wide':
183.44 mWatts power/power-pkg/
341.837270111 seconds time elapsed
root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
Performance counter stats for 'system wide':
0.18 mWatts power/power-pkg/
10.012551815 seconds time elapsed
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: jacob.w.shin@gmail.com
Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
[ Fixed the modular build. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
a223c1c7ab |
perf/x86/cqm: Fix CQM handling of grouping events into a cache_group
Currently CQM (cache quality of service monitoring) is grouping all events belonging to same PID to use one RMID. However its not counting all of these different events. Hence we end up with a count of zero for all events other than the group leader. The patch tries to address the issue by keeping a flag in the perf_event.hw which has other CQM related fields. The field is updated at event creation and during grouping. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> [peterz: Changed hw_perf_event::is_group_event to an int] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fenghua.yu@intel.com Cc: h.peter.anvin@intel.com Cc: ravi.v.shankar@intel.com Cc: vikas.shivappa@intel.com Link: http://lkml.kernel.org/r/1457652732-4499-2-git-send-email-vikas.shivappa@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
1200b6809d |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
"Highlights:
1) Support more Realtek wireless chips, from Jes Sorenson.
2) New BPF types for per-cpu hash and arrap maps, from Alexei
Starovoitov.
3) Make several TCP sysctls per-namespace, from Nikolay Borisov.
4) Allow the use of SO_REUSEPORT in order to do per-thread processing
of incoming TCP/UDP connections. The muxing can be done using a
BPF program which hashes the incoming packet. From Craig Gallek.
5) Add a multiplexer for TCP streams, to provide a messaged based
interface. BPF programs can be used to determine the message
boundaries. From Tom Herbert.
6) Add 802.1AE MACSEC support, from Sabrina Dubroca.
7) Avoid factorial complexity when taking down an inetdev interface
with lots of configured addresses. We were doing things like
traversing the entire address less for each address removed, and
flushing the entire netfilter conntrack table for every address as
well.
8) Add and use SKB bulk free infrastructure, from Jesper Brouer.
9) Allow offloading u32 classifiers to hardware, and implement for
ixgbe, from John Fastabend.
10) Allow configuring IRQ coalescing parameters on a per-queue basis,
from Kan Liang.
11) Extend ethtool so that larger link mode masks can be supported.
From David Decotigny.
12) Introduce devlink, which can be used to configure port link types
(ethernet vs Infiniband, etc.), port splitting, and switch device
level attributes as a whole. From Jiri Pirko.
13) Hardware offload support for flower classifiers, from Amir Vadai.
14) Add "Local Checksum Offload". Basically, for a tunneled packet
the checksum of the outer header is 'constant' (because with the
checksum field filled into the inner protocol header, the payload
of the outer frame checksums to 'zero'), and we can take advantage
of that in various ways. From Edward Cree"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
bonding: fix bond_get_stats()
net: bcmgenet: fix dma api length mismatch
net/mlx4_core: Fix backward compatibility on VFs
phy: mdio-thunder: Fix some Kconfig typos
lan78xx: add ndo_get_stats64
lan78xx: handle statistics counter rollover
RDS: TCP: Remove unused constant
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
net: smc911x: convert pxa dma to dmaengine
team: remove duplicate set of flag IFF_MULTICAST
bonding: remove duplicate set of flag IFF_MULTICAST
net: fix a comment typo
ethernet: micrel: fix some error codes
ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
bpf, dst: add and use dst_tclassid helper
bpf: make skb->tc_classid also readable
net: mvneta: bm: clarify dependencies
cls_bpf: reset class and reuse major in da
ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
ldmvsw: Add ldmvsw.c driver code
...
|
||
|
|
e23604edac |
Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Ingo Molnar: "NOHZ enhancements, by Frederic Weisbecker, which reorganizes/refactors the NOHZ 'can the tick be stopped?' infrastructure and related code to be data driven, and harmonizes the naming and handling of all the various properties" [ This makes the ugly "fetch_or()" macro that the scheduler used internally a new generic helper, and does a bad job at it. I'm pulling it, but I've asked Ingo and Frederic to get this fixed up ] * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched-clock: Migrate to use new tick dependency mask model posix-cpu-timers: Migrate to use new tick dependency mask model sched: Migrate sched to use new tick dependency mask model sched: Account rr tasks perf: Migrate perf to use new tick dependency mask model nohz: Use enum code for tick stop failure tracing message nohz: New tick dependency mask nohz: Implement wide kick on top of irq work atomic: Export fetch_or() |