mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
synced 2025-08-29 02:59:13 +00:00
loongarch-next
6776 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
e34c38057a |
[ Merge note: this pull request depends on you having merged
two locking commits in the locking tree, part of the locking-core-2025-03-22 pull request. ] x86 CPU features support: - Generate the <asm/cpufeaturemasks.h> header based on build config (H. Peter Anvin, Xin Li) - x86 CPUID parsing updates and fixes (Ahmed S. Darwish) - Introduce the 'setcpuid=' boot parameter (Brendan Jackman) - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan Jackman) - Utilize CPU-type for CPU matching (Pawan Gupta) - Warn about unmet CPU feature dependencies (Sohil Mehta) - Prepare for new Intel Family numbers (Sohil Mehta) Percpu code: - Standardize & reorganize the x86 percpu layout and related cleanups (Brian Gerst) - Convert the stackprotector canary to a regular percpu variable (Brian Gerst) - Add a percpu subsection for cache hot data (Brian Gerst) - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak) - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak) MM: - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction (Rik van Riel) - Rework ROX cache to avoid writable copy (Mike Rapoport) - PAT: restore large ROX pages after fragmentation (Kirill A. Shutemov, Mike Rapoport) - Make memremap(MEMREMAP_WB) map memory as encrypted by default (Kirill A. Shutemov) - Robustify page table initialization (Kirill A. Shutemov) - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn) - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW (Matthew Wilcox) KASLR: - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir Singh) CPU bugs: - Implement FineIBT-BHI mitigation (Peter Zijlstra) - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta) - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta) - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta) System calls: - Break up entry/common.c (Brian Gerst) - Move sysctls into arch/x86 (Joel Granados) Intel LAM support updates: (Maciej Wieczor-Retman) - selftests/lam: Move cpu_has_la57() to use cpuinfo flag - selftests/lam: Skip test if LAM is disabled - selftests/lam: Test get_user() LAM pointer handling AMD SMN access updates: - Add SMN offsets to exclusive region access (Mario Limonciello) - Add support for debugfs access to SMN registers (Mario Limonciello) - Have HSMP use SMN through AMD_NODE (Yazen Ghannam) Power management updates: (Patryk Wlazlyn) - Allow calling mwait_play_dead with an arbitrary hint - ACPI/processor_idle: Add FFH state handling - intel_idle: Provide the default enter_dead() handler - Eliminate mwait_play_dead_cpuid_hint() Bootup: Build system: - Raise the minimum GCC version to 8.1 (Brian Gerst) - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor) Kconfig: (Arnd Bergmann) - Add cmpxchg8b support back to Geode CPUs - Drop 32-bit "bigsmp" machine support - Rework CONFIG_GENERIC_CPU compiler flags - Drop configuration options for early 64-bit CPUs - Remove CONFIG_HIGHMEM64G support - Drop CONFIG_SWIOTLB for PAE - Drop support for CONFIG_HIGHPTE - Document CONFIG_X86_INTEL_MID as 64-bit-only - Remove old STA2x11 support - Only allow CONFIG_EISA for 32-bit Headers: - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers (Thomas Huth) Assembly code & machine code patching: - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf) - x86/alternatives: Simplify callthunk patching (Peter Zijlstra) - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf) - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf) - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra) - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> (Uros Bizjak) - Use named operands in inline asm (Uros Bizjak) - Improve performance by using asm_inline() for atomic locking instructions (Uros Bizjak) Earlyprintk: - Harden early_serial (Peter Zijlstra) NMI handler: - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus() (Waiman Long) Miscellaneous fixes and cleanups: - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner, Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly Kuznetsov, Xin Li, liuye. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfenkQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1g1FRAAi6OFTSn/5aeLMI0IMNBxJ6ddQiFc3imd 7+C/vU5nul4CyDs8mKyj/+f/DDrbkG9lKz3VG631Yl237lXHjD8XWcVMeC/1z/q0 3zInDIloE9/nBHRPkF6F7fARBLBZ0LFgaBsGrCo7mwpGybiQdqGcqcxllvTbtXaw OHta4q6ok+lBDNlfc0v6H4cRnzhmmlKu6Ng0j6UI3V7uFhi3vtxas32ltDQtzorq 2+jbV6/+kbrrv+xPC+jlzOFhTEKRupNPQXmvyQteoQg6G3kqAKMDvBthGXd1rHuX Qa+BoDIifE/2NiVeRwNrhoqYH/pHCzUzDREW5IW8+ca+4XNKuzAC6EuC8CeCzyK1 q8ZjZjooQW4zEeVFeJYllHONzJYfxfSH5CLsnbcuhq99yfGlrQhF1qL72/Omn1w/ DfPJM8Zt5zyKvLqUg3Md+fkVCO2wyDNhB61QPzRgHF+yD+rvuDpoqvUWir+w7cSn fwEDVZGXlFx6dumtSrqRaTd1nvFt80s8yP2ll09DMvGQ8D/yruS7hndGAmmJVCSW NAfd8pSjq5v2+ux2UR92/Cc3VF3SjaUqHBOp/Nq9rESya18ZVa3cJpHhVYYtPIVf THW0h07RIkGVKs1uq+5ekLCr/8uAZg58UPIqmhTuW0ttymRHCNfohR45FQZzy+0M tJj1oc2TIZw= =Dcb3 -----END PGP SIGNATURE----- Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: "x86 CPU features support: - Generate the <asm/cpufeaturemasks.h> header based on build config (H. Peter Anvin, Xin Li) - x86 CPUID parsing updates and fixes (Ahmed S. Darwish) - Introduce the 'setcpuid=' boot parameter (Brendan Jackman) - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan Jackman) - Utilize CPU-type for CPU matching (Pawan Gupta) - Warn about unmet CPU feature dependencies (Sohil Mehta) - Prepare for new Intel Family numbers (Sohil Mehta) Percpu code: - Standardize & reorganize the x86 percpu layout and related cleanups (Brian Gerst) - Convert the stackprotector canary to a regular percpu variable (Brian Gerst) - Add a percpu subsection for cache hot data (Brian Gerst) - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak) - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak) MM: - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction (Rik van Riel) - Rework ROX cache to avoid writable copy (Mike Rapoport) - PAT: restore large ROX pages after fragmentation (Kirill A. Shutemov, Mike Rapoport) - Make memremap(MEMREMAP_WB) map memory as encrypted by default (Kirill A. Shutemov) - Robustify page table initialization (Kirill A. Shutemov) - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn) - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW (Matthew Wilcox) KASLR: - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir Singh) CPU bugs: - Implement FineIBT-BHI mitigation (Peter Zijlstra) - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta) - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta) - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta) System calls: - Break up entry/common.c (Brian Gerst) - Move sysctls into arch/x86 (Joel Granados) Intel LAM support updates: (Maciej Wieczor-Retman) - selftests/lam: Move cpu_has_la57() to use cpuinfo flag - selftests/lam: Skip test if LAM is disabled - selftests/lam: Test get_user() LAM pointer handling AMD SMN access updates: - Add SMN offsets to exclusive region access (Mario Limonciello) - Add support for debugfs access to SMN registers (Mario Limonciello) - Have HSMP use SMN through AMD_NODE (Yazen Ghannam) Power management updates: (Patryk Wlazlyn) - Allow calling mwait_play_dead with an arbitrary hint - ACPI/processor_idle: Add FFH state handling - intel_idle: Provide the default enter_dead() handler - Eliminate mwait_play_dead_cpuid_hint() Build system: - Raise the minimum GCC version to 8.1 (Brian Gerst) - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor) Kconfig: (Arnd Bergmann) - Add cmpxchg8b support back to Geode CPUs - Drop 32-bit "bigsmp" machine support - Rework CONFIG_GENERIC_CPU compiler flags - Drop configuration options for early 64-bit CPUs - Remove CONFIG_HIGHMEM64G support - Drop CONFIG_SWIOTLB for PAE - Drop support for CONFIG_HIGHPTE - Document CONFIG_X86_INTEL_MID as 64-bit-only - Remove old STA2x11 support - Only allow CONFIG_EISA for 32-bit Headers: - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers (Thomas Huth) Assembly code & machine code patching: - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf) - x86/alternatives: Simplify callthunk patching (Peter Zijlstra) - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf) - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf) - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra) - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> (Uros Bizjak) - Use named operands in inline asm (Uros Bizjak) - Improve performance by using asm_inline() for atomic locking instructions (Uros Bizjak) Earlyprintk: - Harden early_serial (Peter Zijlstra) NMI handler: - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus() (Waiman Long) Miscellaneous fixes and cleanups: - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner, Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly Kuznetsov, Xin Li, liuye" * tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits) zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault x86/asm: Make asm export of __ref_stack_chk_guard unconditional x86/mm: Only do broadcast flush from reclaim if pages were unmapped perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones perf/x86/intel, x86/cpu: Simplify Intel PMU initialization x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions x86/asm: Use asm_inline() instead of asm() in clwb() x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h> x86/hweight: Use asm_inline() instead of asm() x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm() x86/hweight: Use named operands in inline asm() x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP x86/head/64: Avoid Clang < 17 stack protector in startup code x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro x86/cpu/intel: Limit the non-architectural constant_tsc model checks x86/mm/pat: Replace Intel x86_model checks with VFM ones x86/cpu/intel: Fix fast string initialization for extended Families ... |
||
![]() |
32b22538be |
Scheduler updates for v6.15:
[ Merge note, these two commits are identical: - |
||
![]() |
3ba7dfb8da |
RCU pull request for v6.15
This pull request contains the following branches: docs.2025.02.04a: - Add broken-timing possibility to stallwarn.rst. - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr(). - Document self-propagating callbacks. - Point call_srcu() to call_rcu() for detailed memory ordering. - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header. - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text. - Remove references to old grace-period-wait primitives. srcu.2025.02.05a: - Introduce srcu_read_{un,}lock_fast(), which is similar to srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the cost of calling synchronize_rcu() in synchronize_srcu(). Moreover, by returning the percpu offset of the counter at srcu_read_lock_fast() time, srcu_read_unlock_fast() can save extra pointer dereferencing, which makes it faster than srcu_read_{un,}lock_lite(). srcu_read_{un,}lock_fast() are intended to replace rcu_read_{un,}lock_trace() if possible. torture.2025.02.05a: - Add get_torture_init_jiffies() to return the start time of the test. - Add a test_boost_holdoff module parameter to allow delaying boosting tests when building rcutorture as built-in. - Add grace period sequence number logging at the beginning and end of failure/close-call results. - Switch to hexadecimal for the expedited grace period sequence number in the rcu_exp_grace_period trace point. - Make cur_ops->format_gp_seqs take buffer length. - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool. - Complain when invalid SRCU reader_flavor is specified. - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU uses atomics even when percpu ops are NMI safe, and use the Kconfig for SRCU lockdep testing. misc.2025.03.04a: - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing. - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes(). - Fix get_state_synchronize_rcu_full() GP-start detection. - Move RCU Tasks self-tests to core_initcall(). - Print segment lengths in show_rcu_nocb_gp_state(). - Make RCU watch ct_kernel_exit_state() warning. - Flush console log from kernel_power_off(). - rcutorture: Allow a negative value for nfakewriters. - rcu: Update TREE05.boot to test normal synchronize_rcu(). - rcu: Use _full() API to debug synchronize_rcu(). lazypreempt.2025.03.04a: Make RCU handle PREEMPT_LAZY better: - Fix header guard for rcu_all_qs(). - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY. - Update __cond_resched comment about RCU quiescent states. - Handle unstable rdp in rcu_read_unlock_strict(). - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y. - osnoise: Provide quiescent states. - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y combination. - Limit PREEMPT_RCU configurations. - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEj5IosQTPz8XU1wRHSXnow7UH+rgFAmfeBLQACgkQSXnow7UH +rh11Qf/Rt6IZJ/YT/V9Sd+8hMx4O0BMh779pr9cD6mbAG+FDk2Yeva1m8vIdFOb qId6oc8K/ef2JfFjSn0oHMzQP2D3XUyiJWPNbBDHv/D8Os8GZgjzu8dkxVkSbdbY OxtvIflbcqFN1JDJfGKZnTEW0/YxGqfnS9b6R7iyyA7SOGQ/WubGOE5qNCqPufc9 zJiP+qTUFYQzCIiPlEJul39o9KboPogbt3QAAQjWmi3utd77ehJnm/15FvAjyau4 uhC2cnGfMY535rQaiaQeBQ/IHIowKripCq0JQFvcUNdyArZM3HOI2x79+2II6ft7 mjHskNODOIJHfW2o1RzQ0yRYAywFIg== =J+mH -----END PGP SIGNATURE----- Merge tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Boqun Feng: "Documentation: - Add broken-timing possibility to stallwarn.rst - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr() - Document self-propagating callbacks - Point call_srcu() to call_rcu() for detailed memory ordering - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text - Remove references to old grace-period-wait primitives srcu: - Introduce srcu_read_{un,}lock_fast(), which is similar to srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the cost of calling synchronize_rcu() in synchronize_srcu() Moreover, by returning the percpu offset of the counter at srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid extra pointer dereferencing, which makes it faster than srcu_read_{un,}lock_lite() srcu_read_{un,}lock_fast() are intended to replace rcu_read_{un,}lock_trace() if possible RCU torture: - Add get_torture_init_jiffies() to return the start time of the test - Add a test_boost_holdoff module parameter to allow delaying boosting tests when building rcutorture as built-in - Add grace period sequence number logging at the beginning and end of failure/close-call results - Switch to hexadecimal for the expedited grace period sequence number in the rcu_exp_grace_period trace point - Make cur_ops->format_gp_seqs take buffer length - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool - Complain when invalid SRCU reader_flavor is specified - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU uses atomics even when percpu ops are NMI safe, and use the Kconfig for SRCU lockdep testing Misc: - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes() - Fix get_state_synchronize_rcu_full() GP-start detection - Move RCU Tasks self-tests to core_initcall() - Print segment lengths in show_rcu_nocb_gp_state() - Make RCU watch ct_kernel_exit_state() warning - Flush console log from kernel_power_off() - rcutorture: Allow a negative value for nfakewriters - rcu: Update TREE05.boot to test normal synchronize_rcu() - rcu: Use _full() API to debug synchronize_rcu() Make RCU handle PREEMPT_LAZY better: - Fix header guard for rcu_all_qs() - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY - Update __cond_resched comment about RCU quiescent states - Handle unstable rdp in rcu_read_unlock_strict() - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y - osnoise: Provide quiescent states - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y combination - Limit PREEMPT_RCU configurations - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y" * tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits) rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y rcu: limit PREEMPT_RCU configurations rcutorture: Update ->extendables check for lazy preemption rcutorture: Update rcutorture_one_extend_check() for lazy preemption osnoise: provide quiescent states rcu: Use _full() API to debug synchronize_rcu() rcu: Update TREE05.boot to test normal synchronize_rcu() rcutorture: Allow a negative value for nfakewriters Flush console log from kernel_power_off() context_tracking: Make RCU watch ct_kernel_exit_state() warning rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state() rcu-tasks: Move RCU Tasks self-tests to core_initcall() rcu: Fix get_state_synchronize_rcu_full() GP-start detection torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe() srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing rcutorture: Complain when invalid SRCU reader_flavor is specified rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool rcutorture: Make cur_ops->format_gp_seqs take buffer length rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output ... |
||
![]() |
fbe6c09b7e |
rv: Add scpd, snep and sncid per-cpu monitors
Add 3 per-cpu monitors as part of the sched model: * scpd: schedule called with preemption disabled Monitor to ensure schedule is called with preemption disabled * snep: schedule does not enable preempt Monitor to ensure schedule does not enable preempt * sncid: schedule not called with interrupt disabled Monitor to ensure schedule is not called with interrupt disabled To: Ingo Molnar <mingo@redhat.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-6-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
93bac9cf35 |
rv: Add snroc per-task monitor
Add a per-task monitor as part of the sched model: * snroc: set non runnable on its own context Monitor to ensure set_state happens only in the respective task's context To: Ingo Molnar <mingo@redhat.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-5-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
9fd420abc4 |
rv: Add sco and tss per-cpu monitors
Add 2 per-cpu monitors as part of the sched model: * sco: scheduling context operations Monitor to ensure sched_set_state happens only in thread context * tss: task switch while scheduling Monitor to ensure sched_switch happens only in scheduling context To: Ingo Molnar <mingo@redhat.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-4-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
cb85c660fc |
rv: Add option for nested monitors and include sched
Monitors describing complex systems, such as the scheduler, can easily grow to the point where they are just hard to understand because of the many possible state transitions. Often it is possible to break such descriptions into smaller monitors, sharing some or all events. Enabling those smaller monitors concurrently is, in fact, testing the system as if we had one single larger monitor. Splitting models into multiple specification is not only easier to understand, but gives some more clues when we see errors. Add the possibility to create container monitors, whose only purpose is to host other nested monitors. Enabling a container monitor enables all nested ones, but it's still possible to enable nested monitors independently. Add the sched monitor as first container, for now empty. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/20250305140406.350227-3-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
8eb1518642 |
tracing: Do not use PERF enums when perf is not defined
An update was made to up the module ref count when a synthetic event is
registered for both trace and perf events. But if perf is not configured
in, the perf enums used will cause the kernel to fail to build.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Link: https://lore.kernel.org/20250323152151.528b5ced@batman.local.home
Fixes:
|
||
![]() |
391dda1bd7 |
tracing: Use hashtable.h for event_hash
Convert the event_hash array in trace_output.c to use the generic hashtable implementation from hashtable.h instead of the manually implemented hash table. This simplifies the code and makes it more maintainable by using the standard hashtable API defined in hashtable.h. Rename EVENT_HASHSIZE to EVENT_HASH_BITS to properly reflect its new meaning as the number of bits for the hashtable size. Link: https://lore.kernel.org/20250323132800.3010783-1-sashal@kernel.org Link: https://lore.kernel.org/20250319190545.3058319-1-sashal@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
21581dd4e7 |
tracing: Ensure module defining synth event cannot be unloaded while tracing
Currently, using synth_event_delete() will fail if the event is being
used (tracing in progress), but that is normally done in the module exit
function. At that stage, failing is problematic as returning a non-zero
status means the module will become locked (impossible to unload or
reload again).
Instead, ensure the module exit function does not get called in the
first place by increasing the module refcnt when the event is enabled.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes:
|
||
![]() |
0c588ac0ca |
tracing: fix return value in __ftrace_event_enable_disable for TRACE_REG_UNREGISTER
When __ftrace_event_enable_disable invokes the class callback to unregister the event, the return value is not reported up to the caller, hence leading to event unregister failures being silently ignored. This patch assigns the ret variable to the invocation of the event unregister callback, so that its return value is stored and reported to the caller, and it raises a warning in case of error. Link: https://lore.kernel.org/20250321170821.101403-1-gpaoloni@redhat.com Signed-off-by: Gabriele Paoloni <gpaoloni@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
7e6b3fcc9c |
tracing/osnoise: Fix possible recursive locking for cpus_read_lock()
Lockdep reports this deadlock log:
osnoise: could not start sampling thread
============================================
WARNING: possible recursive locking detected
--------------------------------------------
CPU0
----
lock(cpu_hotplug_lock);
lock(cpu_hotplug_lock);
Call Trace:
<TASK>
print_deadlock_bug+0x282/0x3c0
__lock_acquire+0x1610/0x29a0
lock_acquire+0xcb/0x2d0
cpus_read_lock+0x49/0x120
stop_per_cpu_kthreads+0x7/0x60
start_kthread+0x103/0x120
osnoise_hotplug_workfn+0x5e/0x90
process_one_work+0x44f/0xb30
worker_thread+0x33e/0x5e0
kthread+0x206/0x3b0
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
This is the deadlock scenario:
osnoise_hotplug_workfn()
guard(cpus_read_lock)(); // first lock call
start_kthread(cpu)
if (IS_ERR(kthread)) {
stop_per_cpu_kthreads(); {
cpus_read_lock(); // second lock call. Cause the AA deadlock
}
}
It is not necessary to call stop_per_cpu_kthreads() which stops osnoise
kthread for every other CPUs in the system if a failure occurs during
hotplug of a certain CPU.
For start_per_cpu_kthreads(), if the start_kthread() call fails,
this function calls stop_per_cpu_kthreads() to handle the error.
Therefore, similarly, there is no need to call stop_per_cpu_kthreads()
again within start_kthread().
So just remove stop_per_cpu_kthreads() from start_kthread to solve this issue.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250321095249.2739397-1-ranxiaokai627@163.com
Fixes:
|
||
![]() |
81c7a515b0 |
tracing: Align synth event print fmt
The vast majority of ftrace event print fmt consist of a space-separated field=value pair. Synthetic event currently use a comma-separated field=value pair, which sticks out from events created via more classical means. Align the format of synth events so they look just like any other event, for better consistency and less headache when doing crude text-based data processing. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250319215028.1680278-1-douglas.raillard@arm.com Signed-off-by: Douglas Raillard <douglas.raillard@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
7f81f27b10 |
tracing: Fix use-after-free in print_graph_function_flags during tracer switching
Kairui reported a UAF issue in print_graph_function_flags() during
ftrace stress testing [1]. This issue can be reproduced if puting a
'mdelay(10)' after 'mutex_unlock(&trace_types_lock)' in s_start(),
and executing the following script:
$ echo function_graph > current_tracer
$ cat trace > /dev/null &
$ sleep 5 # Ensure the 'cat' reaches the 'mdelay(10)' point
$ echo timerlat > current_tracer
The root cause lies in the two calls to print_graph_function_flags
within print_trace_line during each s_show():
* One through 'iter->trace->print_line()';
* Another through 'event->funcs->trace()', which is hidden in
print_trace_fmt() before print_trace_line returns.
Tracer switching only updates the former, while the latter continues
to use the print_line function of the old tracer, which in the script
above is print_graph_function_flags.
Moreover, when switching from the 'function_graph' tracer to the
'timerlat' tracer, s_start only calls graph_trace_close of the
'function_graph' tracer to free 'iter->private', but does not set
it to NULL. This provides an opportunity for 'event->funcs->trace()'
to use an invalid 'iter->private'.
To fix this issue, set 'iter->private' to NULL immediately after
freeing it in graph_trace_close(), ensuring that an invalid pointer
is not passed to other tracers. Additionally, clean up the unnecessary
'iter->private = NULL' during each 'cat trace' when using wakeup and
irqsoff tracers.
[1] https://lore.kernel.org/all/20231112150030.84609-1-ryncsn@gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Link: https://lore.kernel.org/20250320122137.23635-1-wutengda@huaweicloud.com
Fixes:
|
||
![]() |
502d2e71a8 |
tracing: Constify struct event_trigger_ops
'event_trigger_ops mwifiex_if_ops' are not modified in these drivers. Constifying these structures moves some data to a read-only section, so increase overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 31368 9024 6200 46592 b600 kernel/trace/trace_events_trigger.o After: ===== text data bss dec hex filename 31752 8608 6200 46560 b5e0 kernel/trace/trace_events_trigger.o Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/66e8f990e649678e4be37d4d1a19158ca0dea2f4.1741521295.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
89771319e0 |
Linux 6.14-rc7
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfXVtUeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN/sH/i5423Gt/z51gDjA s4v5Z7GaBJ9zOGBahn2RWFe72zytTqKrEJmMnGfguirs0atD1DtQj4WAP7iFKP+e WyO663X6HF7i5y37ja0Yd4PZc31hwtqzKH8LjBf8f8tTy8UsEVqumdi5A4sS9KTM qm4kTyyVEY9D/s7oRY8ywjDlRJtO6nT0aKMp4kAqNEbrNUYbilT/a0hgXcgSmPyB uIjmjL2fZfutxGI5LgfbaSHCa1ElmhvTvivOMpaAmZSGCRVHCKGgT0CTNnHyn/7C dB145JkRO4ZOUqirCdO4PE/23id3ajq9fcixJGBzAv7c45y+B3JZ1r2kAfKalE8/ qrOKLys= =8r7a -----END PGP SIGNATURE----- Merge tag 'v6.14-rc7' into x86/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
ae0a457f5d |
bpf: Make perf_event_read_output accessible in all program types.
The perf_event_read_event_output helper is currently only available to tracing protrams, but is useful for other BPF programs like sched_ext schedulers. When the helper is available, provide its bpf_func_proto directly from the bpf base_proto. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
47c7efa4f0 |
Probes fixes for v6.14-rc6:
- tprobe-events: Fix to clean up tprobe correctly when module unload tprobe (Tracepoint probe) event does not set TRACEPOINT_STUB to its 'tpoint' pointer when unloading module, thus it is shown as a normal 'fprobe' instead of 'tprobe' and never comes back. - tprobe-events: Fix leakage of module refcount When a tprobe's target module is loaded, it gets the module's refcount in the module notifier but forgot to put it after registering the probe on it. Fix to get the refcount only when registering tprobe. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmfYFpobHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bi0kH/3wSyMa7wP7pSPQ1Hq8i ZLeKa7pSS3X39IE6urvpYrTUBRw75tqSSozyF6g8jM3Tj8ISpF9rlZJ6ct775mTR 9Ld9wPNgnj0pCZU29VGf9TPm3JCm5V0h0bhE9qZp8wZRiedbwg9e6uZ92LBLAmW+ MisEmLsKrYcOUF1bwhLwoNRzlwidhHKCBAF6vUqVC62kktCe08XKKmiusslVfEZz KrU0XM1MPnncWd2Cp3qw9ywcC40ES61lBdhMjJcGM3yKqKqlJSQxIBa5rOl0TlJA D58bEpPhoyGT3sWS7UlqVjdkHy9NtpFx3nvcvPGCrDMLl9qwj4aSrwUBjK693RS7 Dyo= =D3X0 -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Clean up tprobe correctly when module unload Tracepoint probes do not set TRACEPOINT_STUB on the 'tpoint' pointer when unloading a module, thus they show as a normal 'fprobe' instead of 'tprobe' and never come back - Fix leakage of tprobe module refcount When a tprobe's target module is loaded, it gets the module's refcount in the module notifier but forgot to put it after registering the probe on it. Fix it by getting the refcount only when registering tprobe. * tag 'probes-fixes-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: tprobe-events: Fix leakage of module refcount tracing: tprobe-events: Fix to clean up tprobe correctly when module unload |
||
![]() |
3bffa47a02 |
tracing: Use preempt_model_str()
Use preempt_model_str() instead of manually conducting the preemption model. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20250314160810.2373416-10-bigeasy@linutronix.de |
||
![]() |
ad87a8d0c4 |
Fix ref count of trace_array in error path of histogram file open
Tracing instances have a ref count to keep them around while files within their directories are open. This prevents them from being deleted while they are used. The histogram code had some files that needed to take the ref count and that was added, but the error paths did not decrement the ref counts. This caused the instances from ever being removed if a histogram file failed to open due to some error. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ9cE8hQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qtm8AQDrdfwo63g/K0xpM0Uej9Ex5pBkHEq/ WwCI2hUbPya0wAEAtEcfvgt7P0GuO0PNOYY3uzhAokfBiN04elatbLEetQ4= =i35g -----END PGP SIGNATURE----- Merge tag 'trace-v6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: "Fix ref count of trace_array in error path of histogram file open Tracing instances have a ref count to keep them around while files within their directories are open. This prevents them from being deleted while they are used. The histogram code had some files that needed to take the ref count and that was added, but the error paths did not decrement the ref counts. This caused the instances from ever being removed if a histogram file failed to open due to some error" * tag 'trace-v6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Correct the refcount if the hist/hist_debug file fails to open |
||
![]() |
ac91052f0a |
tracing: tprobe-events: Fix leakage of module refcount
When enabling the tracepoint at loading module, the target module
refcount is incremented by find_tracepoint_in_module(). But it is
unnecessary because the module is not unloaded while processing
module loading callbacks.
Moreover, the refcount is not decremented in that function.
To be clear the module refcount handling, move the try_module_get()
callsite to trace_fprobe_create_internal(), where it is actually
required.
Link: https://lore.kernel.org/all/174182761071.83274.18334217580449925882.stgit@devnote2/
Fixes:
|
||
![]() |
0a8bb688aa |
tracing: tprobe-events: Fix to clean up tprobe correctly when module unload
When unloading module, the tprobe events are not correctly cleaned
up. Thus it becomes `fprobe-event` and never be enabled again even
if loading the same module again.
For example;
# cd /sys/kernel/tracing
# modprobe trace_events_sample
# echo 't:my_tprobe foo_bar' >> dynamic_events
# cat dynamic_events
t:tracepoints/my_tprobe foo_bar
# rmmod trace_events_sample
# cat dynamic_events
f:tracepoints/my_tprobe foo_bar
As you can see, the second time my_tprobe starts with 'f' instead
of 't'.
This unregisters the fprobe and tracepoint callback when module is
unloaded but marks the fprobe-event is tprobe-event.
Link: https://lore.kernel.org/all/174158724946.189309.15826571379395619524.stgit@mhiramat.tok.corp.google.com/
Fixes:
|
||
![]() |
0b4ffbe488 |
tracing: Correct the refcount if the hist/hist_debug file fails to open
The function event_{hist,hist_debug}_open() maintains the refcount of
'file->tr' and 'file' through tracing_open_file_tr(). However, it does
not roll back these counts on subsequent failure paths, resulting in a
refcount leak.
A very obvious case is that if the hist/hist_debug file belongs to a
specific instance, the refcount leak will prevent the deletion of that
instance, as it relies on the condition 'tr->ref == 1' within
__remove_instance().
Fix this by calling tracing_release_file_tr() on all failure paths in
event_{hist,hist_debug}_open() to correct the refcount.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Link: https://lore.kernel.org/20250314065335.1202817-1-wutengda@huaweicloud.com
Fixes:
|
||
![]() |
8c6eb7ca86 |
bpf: Use RCU in all users of __module_text_address().
__module_address() can be invoked within a RCU section, there is no requirement to have preemption disabled. Replace the preempt_disable() section around __module_address() with RCU. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matt Bobrowski <mattbobrowski@google.com> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: bpf@vger.kernel.org Cc: linux-trace-kernel@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250129084751.tH6iidUO@linutronix.de Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> |
||
![]() |
febaa65c94 |
module: Use RCU in find_module_all().
The modules list and module::kallsyms can be accessed under RCU assumption. Remove module_assert_mutex_or_preempt() from find_module_all() so it can be used under RCU protection without warnings. Update its callers to use RCU protection instead of preempt_disable(). Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-trace-kernel@vger.kernel.org Cc: live-patching@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20250108090457.512198-7-bigeasy@linutronix.de Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> |
||
![]() |
5ba8f4a39e |
function_graph: Remove the unused variable func
Variable func is not effectively used, so delete it.
kernel/trace/trace_functions_graph.c:925:16: warning: variable ‘func’ set but not used.
This happened because the variable "func" which came from "call->func" was
replaced by "ret_func" coming from "graph_ret->func" but "func" wasn't
removed after the replacement.
Link: https://lore.kernel.org/20250307021412.119107-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19250
Fixes:
|
||
![]() |
06889030f5 |
tracing/user_events: Slightly simplify user_seq_show()
2 seq_puts() calls can be merged. It saves a few lines of code and a few cycles, should it matter. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/845caa94b74cea8d72c158bf1994fe250beee28c.1739979791.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
effd1059c4 |
tracing/user_events: Don't use %pK through printk
Restricted pointers ("%pK") are not meant to be used through printk(). It can unintentionally expose security sensitive, raw pointer values. Use regular pointer formatting instead. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250217-restricted-pointers-trace-v1-1-bbe9ea279848@linutronix.de Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
3ca4d7af35 |
ring-buffer: Fix typo in comment about header page pointer
Fix typo in comment about header page pointer in function rb_get_reader_page. Link: https://lore.kernel.org/20250118012352.3430519-1-zhouzhouyi@gmail.com Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
9fd858cc5a |
osnoise: provide quiescent states
To reduce RCU noise for nohz_full configurations, osnoise depends on cond_resched() providing quiescent states for PREEMPT_RCU=n configurations. For PREEMPT_RCU=y configurations -- where cond_resched() is a stub -- we do this by directly calling rcu_momentary_eqs(). With (PREEMPT_LAZY=y, PREEMPT_DYNAMIC=n), however, we have a configuration with (PREEMPTION=y, PREEMPT_RCU=n) where neither of the above can help. Handle that by providing an explicit quiescent state here for all configurations. As mentioned above this is not needed for non-stubbed cond_resched(), but, providing a quiescent state here just pulls in one that a future cond_resched() would provide, so doesn't cause any extra work for this configuration. Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> |
||
![]() |
41a4d2d3e3 |
rv: Add license identifiers to monitor files
Some monitor files like the main header and the Kconfig are missing the license identifier. Add it to those and make sure the automatic generation script includes the line in newly created monitors. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/20250218123121.253551-3-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
76fe0337c2 |
ftrace: Add arguments to function tracer
Wire up the code to print function arguments in the function tracer. This functionality can be enabled/disabled during runtime with options/func-args. ping-689 [004] b.... 77.170220: dummy_xmit(skb = 0x82904800, dev = 0x882d0000) <-dev_hard_start_xmit Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185823.154996172@goodmis.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
c7a60a733c |
ftrace: Have funcgraph-args take affect during tracing
Currently, when function_graph is started, it looks at the option funcgraph-args, and if it is set, it will enable tracing of the arguments. But if tracing is already running, and the user enables funcgraph-args, it will have no effect. Instead, it should enable argument tracing when it is enabled, even if it means disabling the function graph tracing for a short time in order to do the transition. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185822.978998710@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
ff5c9c576e |
ftrace: Add support for function argument to graph tracer
Wire up the code to print function arguments in the function graph tracer. This functionality can be enabled/disabled during runtime with options/funcgraph-args. Example usage: 6) | dummy_xmit [dummy](skb = 0x8887c100, dev = 0x872ca000) { 6) | consume_skb(skb = 0x8887c100) { 6) | skb_release_head_state(skb = 0x8887c100) { 6) 0.178 us | sock_wfree(skb = 0x8887c100) 6) 0.627 us | } Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185822.810321199@goodmis.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
533c20b062 |
ftrace: Add print_function_args()
Add a function to decode argument types with the help of BTF. Will be used to display arguments in the function and function graph tracer. It can only handle simply arguments and up to FTRACE_REGS_MAX_ARGS number of arguments. When it hits a max, it will print ", ...": page_to_skb(vi=0xffff8d53842dc980, rq=0xffff8d53843a0800, page=0xfffffc2e04337c00, offset=6160, len=64, truesize=1536, ...) And if it hits an argument that is not recognized, it will print the raw value and the type of argument it is: make_vfsuid(idmap=0xffffffff87f99db8, fs_userns=0xffffffff87e543c0, kuid=0x0 (STRUCT)) __pti_set_user_pgtbl(pgdp=0xffff8d5384ab47f8, pgd=0x110e74067 (STRUCT)) Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185822.639418500@goodmis.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
0c667775fe |
ftrace: Have ftrace_free_filter() WARN and exit if ops is active
The ftrace_free_filter() is used to reset the ops filters. But it must be done if the ops is not currently active (tracing). If it is, it will mess up the ftrace accounting of what functions are attached and what is not. WARN and exit the ftrace_free_filter() if the ops is active when it is called. Currently, it doesn't seem if anything does this, but it may in the future. Link: https://lore.kernel.org/all/20250219095330.2e9f171c@gandalf.local.home/ Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250219135040.3a9fbe00@gandalf.local.home Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
97d6a9c4b3 |
fgraph: Correct typo in ftrace_return_to_handler comment
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250218122052.58348-1-haiyuewa@163.com Signed-off-by: Haiyue Wang <haiyuewa@163.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
1fff9f8730 |
Linux 6.14-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfEtgQeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGUI8H/0TJm0ia8TOseHDT bV0alw4qlNdq7Thi7M5HCsInzyhKImdHDBd/cY3HI2QRZS927WoPHUExDvEd6vUE sETrrBrl/iGJ5nLvXPKkCxXC43ZD3nCI/chBPWwspH2DA/Nih8LAmMkESGVFC7fa TUOKqY1FsAWMYUJ64hP4s9Dwi5XECps7/Di46ypgtr7sVA15jpfF3ePi2mfR73Bm hSfF7E5Xa3E22IBE2NPxvO4fHiYJWbNJk8Vv2ewfHroE/zKsJ/zCMk9mOtFII0P/ TODkBSImFqUx+cDZVc0bJAy8rwA2lNXo3LU28N0Ca4EAoqyyIIdTPQ2wYA82Z2Y2 hE/BeN0= =a9Md -----END PGP SIGNATURE----- Merge tag 'v6.14-rc5' into x86/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
26edad06d5 |
Probes fixes for v6.14-rc4:
- probe-events: Some issues are fixed. . probe-events: Remove unused MAX_ARG_BUF_LEN macro. MAX_ARG_BUF_LEN is not used so remove it. . fprobe-events: Log error for exceeding the number of entry args. Since the max number of entry args is limited, it should be checked and rejected when the parser detects it. . tprobe-events: Reject invalid tracepoint name User can specify an invalid tracepoint name e.g. including '/', then the new event is not defined correctly in the eventfs. . tprobe-events: Fix a memory leak when tprobe defined with $retval There is a memory leak if tprobe is defined with $retval. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmfFKkcbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b/F4H/10qmUSsec9+IbQseg0E MSRxAhJQ+xOcLfGsWhblW2zirkw9o4PghZYwBodkastu4Wgq2M5ASKd6KqUY2o7D CX+tCoXf80SDLEVd2go5m72Ml40rrGDEgLvS5YcEa4Iqr5nPZrvCJ7rl2tlqupQH W2ttOTkX9H28phAFDCsdl5ZJUCJRxlFc6fYG0yZYHsFdRub9J2LPiMTMwIlu56YS 8HH3NxS+wxlKK2I4VfD8mFsOnrNh7MFDLOOwNMlKWvm2wSPbPmVho+eXLAc5xyTO d+vUpkp4Dp9WWCLuNdO/sqY0IKngO2sM++WbtL/YPP8YijqsrImep4PCR8/fvlN6 Urs= =dyZm -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe events fixes from Masami Hiramatsu: - probe-events: Remove unused MAX_ARG_BUF_LEN macro - it is not used - fprobe-events: Log error for exceeding the number of entry args. Since the max number of entry args is limited, it should be checked and rejected when the parser detects it. - tprobe-events: Reject invalid tracepoint name If a user specifies an invalid tracepoint name (e.g. including '/') then the new event is not defined correctly in the eventfs. - tprobe-events: Fix a memory leak when tprobe defined with $retval There is a memory leak if tprobe is defined with $retval. * tag 'probes-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro tracing: fprobe-events: Log error for exceeding the number of entry args tracing: tprobe-events: Reject invalid tracepoint name tracing: tprobe-events: Fix a memory leak when tprobe with $retval |
||
![]() |
fd5ba38390 |
tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro
Commit |
||
![]() |
a1a7eb89ca |
ftrace: Avoid potential division by zero in function_stat_show()
Check whether denominator expression x * (x - 1) * 1000 mod {2^32, 2^64}
produce zero and skip stddev computation in that case.
For now don't care about rec->counter * rec->counter overflow because
rec->time * rec->time overflow will likely happen earlier.
Cc: stable@vger.kernel.org
Cc: Wen Yang <wenyang@linux.alibaba.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250206090156.1561783-1-kniv@yandex-team.ru
Fixes:
|
||
![]() |
6f86bdeab6 |
tracing: Fix bad hist from corrupting named_triggers list
The following commands causes a crash:
~# cd /sys/kernel/tracing/events/rcu/rcu_callback
~# echo 'hist:name=bad:keys=common_pid:onmax(bogus).save(common_pid)' > trigger
bash: echo: write error: Invalid argument
~# echo 'hist:name=bad:keys=common_pid' > trigger
Because the following occurs:
event_trigger_write() {
trigger_process_regex() {
event_hist_trigger_parse() {
data = event_trigger_alloc(..);
event_trigger_register(.., data) {
cmd_ops->reg(.., data, ..) [hist_register_trigger()] {
data->ops->init() [event_hist_trigger_init()] {
save_named_trigger(name, data) {
list_add(&data->named_list, &named_triggers);
}
}
}
}
ret = create_actions(); (return -EINVAL)
if (ret)
goto out_unreg;
[..]
ret = hist_trigger_enable(data, ...) {
list_add_tail_rcu(&data->list, &file->triggers); <<<---- SKIPPED!!! (this is important!)
[..]
out_unreg:
event_hist_unregister(.., data) {
cmd_ops->unreg(.., data, ..) [hist_unregister_trigger()] {
list_for_each_entry(iter, &file->triggers, list) {
if (!hist_trigger_match(data, iter, named_data, false)) <- never matches
continue;
[..]
test = iter;
}
if (test && test->ops->free) <<<-- test is NULL
test->ops->free(test) [event_hist_trigger_free()] {
[..]
if (data->name)
del_named_trigger(data) {
list_del(&data->named_list); <<<<-- NEVER gets removed!
}
}
}
}
[..]
kfree(data); <<<-- frees item but it is still on list
The next time a hist with name is registered, it causes an u-a-f bug and
the kernel can crash.
Move the code around such that if event_trigger_register() succeeds, the
next thing called is hist_trigger_enable() which adds it to the list.
A bunch of actions is called if get_named_trigger_data() returns false.
But that doesn't need to be called after event_trigger_register(), so it
can be moved up, allowing event_trigger_register() to be called just
before hist_trigger_enable() keeping them together and allowing the
file->triggers to be properly populated.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250227163944.1c37f85f@gandalf.local.home
Fixes:
|
||
![]() |
a065bbf776 |
trace/osnoise: Add trace events for samples
Add trace events that fire at osnoise and timerlat sample generation, in addition to the already existing noise and threshold events. This allows processing the samples directly in the kernel, either with ftrace triggers or with BPF. Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com Signed-off-by: Tomas Glozar <tglozar@redhat.com> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
db5e228611 |
tracing: fprobe-events: Log error for exceeding the number of entry args
Add error message when the number of entry argument exceeds the
maximum size of entry data.
This is currently checked when registering fprobe, but in this case
no error message is shown in the error_log file.
Link: https://lore.kernel.org/all/174055074269.4079315.17809232650360988538.stgit@mhiramat.tok.corp.google.com/
Fixes:
|
||
![]() |
d0453655b6 |
tracing: tprobe-events: Reject invalid tracepoint name
Commit |
||
![]() |
ac965d7d88 |
tracing: tprobe-events: Fix a memory leak when tprobe with $retval
Fix a memory leak when a tprobe is defined with $retval. This
combination is not allowed, but the parse_symbol_and_return() does
not free the *symbol which should not be used if it returns the error.
Thus, it leaks the *symbol memory in that error path.
Link: https://lore.kernel.org/all/174055072650.4079315.3063014346697447838.stgit@mhiramat.tok.corp.google.com/
Fixes:
|
||
![]() |
9ec84f79c5 |
perf: Remove unnecessary parameter of security check
It seems that the attr parameter was never been used in security
checks since it was first introduced by:
commit
|
||
![]() |
4580f4e0eb |
bpf: Fix deadlock between rcu_tasks_trace and event_mutex.
Fix the following deadlock: CPU A _free_event() perf_kprobe_destroy() mutex_lock(&event_mutex) perf_trace_event_unreg() synchronize_rcu_tasks_trace() There are several paths where _free_event() grabs event_mutex and calls sync_rcu_tasks_trace. Above is one such case. CPU B bpf_prog_test_run_syscall() rcu_read_lock_trace() bpf_prog_run_pin_on_cpu() bpf_prog_load() bpf_tracing_func_proto() trace_set_clr_event() mutex_lock(&event_mutex) Delegate trace_set_clr_event() to workqueue to avoid such lock dependency. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250224221637.4780-1-alexei.starovoitov@gmail.com |
||
![]() |
937fbf111a |
tracing: Add traceoff_after_boot option
Sometimes tracing is used to debug issues during the boot process. Since the trace buffer has a limited amount of storage, it may be prudent to disable tracing after the boot is finished, otherwise the critical information may be overwritten. With this option, the main tracing buffer will be turned off at the end of the boot process. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/20250208103017.48a7ec83@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
da0f622b34 |
ftrace: Check against is_kernel_text() instead of kaslr_offset()
As kaslr_offset() is architecture dependent and also may not be defined by
all architectures, when zeroing out unused weak functions, do not check
against kaslr_offset(), but instead check if the address is within the
kernel text sections. If KASLR added a shift to the zeroed out function,
it would still not be located in the kernel text. This is a more robust
way to test if the text is valid or not.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Arnd Bergmann" <arnd@arndb.de>
Link: https://lore.kernel.org/20250225182054.471759017@goodmis.org
Fixes:
|
||
![]() |
6eeca746fa |
ftrace: Test mcount_loc addr before calling ftrace_call_addr()
The addresses in the mcount_loc can be zeroed and then moved by KASLR
making them invalid addresses. ftrace_call_addr() for ARM 64 expects a
valid address to kernel text. If the addr read from the mcount_loc section
is invalid, it must not call ftrace_call_addr(). Move the addr check
before calling ftrace_call_addr() in ftrace_process_locs().
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/20250225182054.290128736@goodmis.org
Fixes:
|
||
![]() |
2fa6a01345 |
tracing: Fix memory leak when reading set_event file
kmemleak reports the following memory leak after reading set_event file:
# cat /sys/kernel/tracing/set_event
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xff110001234449e0 (size 16):
comm "cat", pid 13645, jiffies 4294981880
hex dump (first 16 bytes):
01 00 00 00 00 00 00 00 a8 71 e7 84 ff ff ff ff .........q......
backtrace (crc c43abbc):
__kmalloc_cache_noprof+0x3ca/0x4b0
s_start+0x72/0x2d0
seq_read_iter+0x265/0x1080
seq_read+0x2c9/0x420
vfs_read+0x166/0xc30
ksys_read+0xf4/0x1d0
do_syscall_64+0x79/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The issue can be reproduced regardless of whether set_event is empty or
not. Here is an example about the valid content of set_event.
# cat /sys/kernel/tracing/set_event
sched:sched_process_fork
sched:sched_switch
sched:sched_wakeup
*:*:mod:trace_events_sample
The root cause is that s_next() returns NULL when nothing is found.
This results in s_stop() attempting to free a NULL pointer because its
parameter is NULL.
Fix the issue by freeing the memory appropriately when s_next() fails
to find anything.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250220031528.7373-1-ahuang12@lenovo.com
Fixes:
|
||
![]() |
57b76bedc5 |
ftrace: Correct preemption accounting for function tracing.
The function tracer should record the preemption level at the point when
the function is invoked. If the tracing subsystem decrement the
preemption counter it needs to correct this before feeding the data into
the trace buffer. This was broken in the commit cited below while
shifting the preempt-disabled section.
Use tracing_gen_ctx_dec() which properly subtracts one from the
preemption counter on a preemptible kernel.
Cc: stable@vger.kernel.org
Cc: Wander Lairson Costa <wander@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de
Fixes:
|
||
![]() |
ca26554a14 |
fprobe: Fix accounting of when to unregister from function graph
When adding a new fprobe, it will update the function hash to the
functions the fprobe is attached to and register with function graph to
have it call the registered functions. The fprobe_graph_active variable
keeps track of the number of fprobes that are using function graph.
If two fprobes attach to the same function, it increments the
fprobe_graph_active for each of them. But when they are removed, the first
fprobe to be removed will see that the function it is attached to is also
used by another fprobe and it will not remove that function from
function_graph. The logic will skip decrementing the fprobe_graph_active
variable.
This causes the fprobe_graph_active variable to not go to zero when all
fprobes are removed, and in doing so it does not unregister from
function graph. As the fgraph ops hash will now be empty, and an empty
filter hash means all functions are enabled, this triggers function graph
to add a callback to the fprobe infrastructure for every function!
# echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
# echo "f:myevent2 kernel_clone%return" >> /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
kernel_clone (1) tramp: 0xffffffffc0024000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
# > /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
run_init_process (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
try_to_run_init_process (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
x86_pmu_show_pmu_cap (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
cleanup_rapl_pmus (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_free_pcibus_map (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_types_exit (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
uncore_pci_exit.part.0 (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
kvm_shutdown (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
vmx_dump_msrs (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170
[..]
# cat /sys/kernel/tracing/enabled_functions | wc -l
54702
If a fprobe is being removed and all its functions are also traced by
other fprobes, still decrement the fprobe_graph_active counter.
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.565129766@goodmis.org
Fixes:
|
||
![]() |
ded9140622 |
fprobe: Always unregister fgraph function from ops
When the last fprobe is removed, it calls unregister_ftrace_graph() to
remove the graph_ops from function graph. The issue is when it does so, it
calls return before removing the function from its graph ops via
ftrace_set_filter_ips(). This leaves the last function lingering in the
fprobe's fgraph ops and if a probe is added it also enables that last
function (even though the callback will just drop it, it does add unneeded
overhead to make that call).
# echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
kernel_clone (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
# echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
kernel_clone (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
# > /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
# echo "f:myevent3 kmem_cache_free" >> /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
kmem_cache_free (1) tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
schedule_timeout (1) tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
The above enabled a fprobe on kernel_clone, and then on schedule_timeout.
The content of the enabled_functions shows the functions that have a
callback attached to them. The fprobe attached to those functions
properly. Then the fprobes were cleared, and enabled_functions was empty
after that. But after adding a fprobe on kmem_cache_free, the
enabled_functions shows that the schedule_timeout was attached again. This
is because it was still left in the fprobe ops that is used to tell
function graph what functions it wants callbacks from.
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org
Fixes:
|
||
![]() |
8eb4b09e0b |
ftrace: Do not add duplicate entries in subops manager ops
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org
Fixes:
|
||
![]() |
38b1406194 |
ftrace: Fix accounting of adding subops to a manager ops
Function graph uses a subops and manager ops mechanism to attach to
ftrace. The manager ops connects to ftrace and the functions it connects
to is defined by a list of subops that it manages.
The function hash that defines what the above ops attaches to limits the
functions to attach if the hash has any content. If the hash is empty, it
means to trace all functions.
The creation of the manager ops hash is done by iterating over all the
subops hashes. If any of the subops hashes is empty, it means that the
manager ops hash must trace all functions as well.
The issue is in the creation of the manager ops. When a second subops is
attached, a new hash is created by starting it as NULL and adding the
subops one at a time. But the NULL ops is mistaken as an empty hash, and
once an empty hash is found, it stops the loop of subops and just enables
all functions.
# echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
kernel_clone (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
# echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events
# cat /sys/kernel/tracing/enabled_functions
trace_initcall_start_cb (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
run_init_process (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
try_to_run_init_process (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
x86_pmu_show_pmu_cap (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
cleanup_rapl_pmus (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_free_pcibus_map (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_types_exit (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
uncore_pci_exit.part.0 (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
kvm_shutdown (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_dump_msrs (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
vmx_cleanup_l1d_flush (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60
[..]
Fix this by initializing the new hash to NULL and if the hash is NULL do
not treat it as an empty hash but instead allocate by copying the content
of the first sub ops. Then on subsequent iterations, the new hash will not
be NULL, but the content of the previous subops. If that first subops
attached to all functions, then new hash may assume that the manager ops
also needs to attach to all functions.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org
Fixes:
|
||
![]() |
b4a8b5bba7 |
bpf: Use preempt_count() directly in bpf_send_signal_common()
bpf_send_signal_common() uses preemptible() to check whether or not the
current context is preemptible. If it is preemptible, it will use
irq_work to send the signal asynchronously instead of trying to hold a
spin-lock, because spin-lock is sleepable under PREEMPT_RT.
However, preemptible() depends on CONFIG_PREEMPT_COUNT. When
CONFIG_PREEMPT_COUNT is turned off (e.g., CONFIG_PREEMPT_VOLUNTARY=y),
!preemptible() will be evaluated as 1 and bpf_send_signal_common() will
use irq_work unconditionally.
Fix it by unfolding "!preemptible()" and using "preempt_count() != 0 ||
irqs_disabled()" instead.
Fixes:
|
||
![]() |
264143c4e5 |
ftrace: Have ftrace pages output reflect freed pages
The amount of memory that ftrace uses to save the descriptors to manage the functions it can trace is shown at output. But if there are a lot of functions that are skipped because they were weak or the architecture added holes into the tables, then the extra pages that were allocated are freed. But these freed pages are not reflected in the numbers shown, and they can even be inconsistent with what is reported: ftrace: allocating 57482 entries in 225 pages ftrace: allocated 224 pages with 3 groups The above shows the number of original entries that are in the mcount_loc section and the pages needed to save them (225), but the second output reflects the number of pages that were actually used. The two should be consistent as: ftrace: allocating 56739 entries in 224 pages ftrace: allocated 224 pages with 3 groups The above also shows the accurate number of entires that were actually stored and does not include the entries that were removed. Cc: bpf <bpf@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Martin Kelly <martin.kelly@crowdstrike.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250218200023.221100846@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4a3efc6baf |
ftrace: Update the mcount_loc check of skipped entries
Now that weak functions turn into skipped entries, update the check to make sure the amount that was allocated would fit both the entries that were allocated as well as those that were skipped. Cc: bpf <bpf@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Martin Kelly <martin.kelly@crowdstrike.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250218200023.055162048@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
ef378c3b82 |
scripts/sorttable: Zero out weak functions in mcount_loc table
When a function is annotated as "weak" and is overridden, the code is not
removed. If it is traced, the fentry/mcount location in the weak function
will be referenced by the "__mcount_loc" section. This will then be added
to the available_filter_functions list. Since only the address of the
functions are listed, to find the name to show, a search of kallsyms is
used.
Since kallsyms will return the function by simply finding the function
that the address is after but before the next function, an address of a
weak function will show up as the function before it. This is because
kallsyms does not save names of weak functions. This has caused issues in
the past, as now the traced weak function will be listed in
available_filter_functions with the name of the function before it.
At best, this will cause the previous function's name to be listed twice.
At worse, if the previous function was marked notrace, it will now show up
as a function that can be traced. Note that it only shows up that it can
be traced but will not be if enabled, which causes confusion.
https://lore.kernel.org/all/20220412094923.0abe90955e5db486b7bca279@kernel.org/
The commit
|
||
![]() |
e8f925c320 |
Linux 6.14-rc3
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmeyYIQeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNy0H/jWdgjddRaEHQ1RB e18Oi6MJcTQikHbCHKGZGlyxR4dYxdAONuMmWwgt+266K8qUJSZcNXePwqGEWjx2 qkJ9Tu0Agr8KkfVDtGHGXyd4tuZRpx9Fco6+jKkKiMjjtif7nrUajUGGwRsqGoib YYzrhbjNZDl17/J58O1E4YZs3w7Lu26PwDR58RZMsSG0pygAfU2fogKcYmi1pTYV w86icn0LlO8b5Y7fsrY56rLrawnI1RGlxfylUTHzo4QkoIUGvQLB8c6XPMYsVf9R lvkphu+/fGVnSw577WlVy8DTBso+Pj2nWw4jUTiEAy9hYY6zMxrqrX3XowAwbxj1 m6zP+F8= =ieVA -----END PGP SIGNATURE----- Merge tag 'v6.14-rc3' into x86/core, to pick up fixes Pick up upstream x86 fixes before applying new patches. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
19fec9c443 |
tracing/osnoise: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/all/ff8e6e11df5f928b2b97619ac847b4fa045376a1.1738746821.git.namcao@linutronix.de |
||
![]() |
97937834ae |
ring-buffer: Update pages_touched to reflect persistent buffer content
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in accounting of
"dirty_pages" and "buffer_percent" to allow the user to wait for the
buffer to be filled to a certain amount before it reads the buffer in
blocking mode.
The persistent buffer never updated this value so it was set to zero, and
this accounting would take it as it had no content. This would cause user
space to wait for content even though there's enough content in the ring
buffer that satisfies the buffer_percent.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home
Fixes:
|
||
![]() |
129fe71881 |
tracing: Do not allow mmap() of persistent ring buffer
When trying to mmap a trace instance buffer that is attached to
reserve_mem, it would crash:
BUG: unable to handle page fault for address: ffffe97bd00025c8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 2862f3067 P4D 2862f3067 PUD 0
Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:validate_page_before_insert+0x5/0xb0
Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89
RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246
RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29
RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08
RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000
R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000
FS: 00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x1f
? __die+0x2e/0x40
? page_fault_oops+0x157/0x2b0
? search_module_extables+0x53/0x80
? validate_page_before_insert+0x5/0xb0
? kernelmode_fixup_or_oops.isra.0+0x5f/0x70
? __bad_area_nosemaphore+0x16e/0x1b0
? bad_area_nosemaphore+0x16/0x20
? do_kern_addr_fault+0x77/0x90
? exc_page_fault+0x22b/0x230
? asm_exc_page_fault+0x2b/0x30
? validate_page_before_insert+0x5/0xb0
? vm_insert_pages+0x151/0x400
__rb_map_vma+0x21f/0x3f0
ring_buffer_map+0x21b/0x2f0
tracing_buffers_mmap+0x70/0xd0
__mmap_region+0x6f0/0xbd0
mmap_region+0x7f/0x130
do_mmap+0x475/0x610
vm_mmap_pgoff+0xf2/0x1d0
ksys_mmap_pgoff+0x166/0x200
__x64_sys_mmap+0x37/0x50
x64_sys_call+0x1670/0x1d70
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The reason was that the code that maps the ring buffer pages to user space
has:
page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
And uses that in:
vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
But virt_to_page() does not work with vmap()'d memory which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from the
reserve_mem option.
If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home
Fixes:
|
||
![]() |
f5b95f1fa2 |
ring-buffer: Validate the persistent meta data subbuf array
The meta data for a mapped ring buffer contains an array of indexes of all
the subbuffers. The first entry is the reader page, and the rest of the
entries lay out the order of the subbuffers in how the ring buffer link
list is to be created.
The validator currently makes sure that all the entries are within the
range of 0 and nr_subbufs. But it does not check if there are any
duplicates.
While working on the ring buffer, I corrupted this array, where I added
duplicates. The validator did not catch it and created the ring buffer
link list on top of it. Luckily, the corruption was only that the reader
page was also in the writer path and only presented corrupted data but did
not crash the kernel. But if there were duplicates in the writer side,
then it could corrupt the ring buffer link list and cause a crash.
Create a bitmask array with the size of the number of subbuffers. Then
clear it. When walking through the subbuf array checking to see if the
entries are within the range, test if its bit is already set in the
subbuf_mask. If it is, then there is duplicates and fail the validation.
If not, set the corresponding bit and continue.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home
Fixes:
|
||
![]() |
60b8f71114 |
tracing: Have the error of __tracing_resize_ring_buffer() passed to user
Currently if __tracing_resize_ring_buffer() returns an error, the
tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory
issue that caused the function to fail. If the ring buffer is memory
mapped, then the resizing of the ring buffer will be disabled. But if the
user tries to resize the buffer, it will get an -ENOMEM returned, which is
confusing because there is plenty of memory. The actual error returned was
-EBUSY, which would make much more sense to the user.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home
Fixes:
|
||
![]() |
9ba0e1755a |
ring-buffer: Unlock resize on mmap error
Memory mapping the tracing ring buffer will disable resizing the buffer.
But if there's an error in the memory mapping like an invalid parameter,
the function exits out without re-enabling the resizing of the ring
buffer, preventing the ring buffer from being resized after that.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home
Fixes:
|
||
![]() |
72e213a7cc |
x86/ibt: Clean up is_endbr()
Pretty much every caller of is_endbr() actually wants to test something at an address and ends up doing get_kernel_nofault(). Fold the lot into a more convenient helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250207122546.181367417@infradead.org |
||
![]() |
c8c9b1d2d5 |
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.
There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.
For example:
# cd /sys/kernel/tracing
# echo set_track_prepare stack_trace_save > set_graph_notrace
# echo function_graph > current_tracer
# cat trace
[..]
0) | __slab_free() {
0) | free_to_partial_list() {
0) | arch_stack_walk() {
0) | __unwind_start() {
0) 0.501 us | get_stack_info();
Where a non filter trace looks like:
# echo > set_graph_notrace
# cat trace
0) | free_to_partial_list() {
0) | set_track_prepare() {
0) | stack_trace_save() {
0) | arch_stack_walk() {
0) | __unwind_start() {
Where the filter should look like:
# cat trace
0) | free_to_partial_list() {
0) | _raw_spin_lock_irqsave() {
0) 0.350 us | preempt_count_add();
0) 0.351 us | do_raw_spin_lock();
0) 2.440 us | }
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes:
|
||
![]() |
1751f872cc |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit
|
||
![]() |
40648d246f |
rv: tools/rtla: Updates for 6.14
- Add a test suite to test the tool Add a small test suite that can be used to test rtla's basic features to at least have something to test when applying changes. - Automate manual steps in monitor creation While creating a new monitor in RV, besides generating code from dot2k, there are a few manual steps which can be tedious and error prone, like adding the tracepoints, makefile lines and kconfig, or selecting events that start the monitor in the initial state. Updates were made to try and automate as much as possible among those steps to make creating a new RV monitor much quicker. It is still requires to select proper tracepoints, this step is harder to automate in a general way and, in several cases, would still need user intervention. - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag Have both rtla-timerlat-hist and rtla-timerlat-top set OSNOISE_WORKLOAD to the proper value ("on" when running with -k, "off" when running with -u) every time the option is available instead of setting it only when running with -u. This prevents rtla timerlat -k from giving no results when NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally exited earlier run of rtla timerlat -u. - Stop rtla timerlat on signal properly when overloaded There is an issue where if rtla is run on machines with a high number of CPUs (100+), timerlat can generate more samples than rtla is able to process via tracefs_iterate_raw_events. This is especially common when the interval is set to 100us (rteval and cyclictest default) as opposed to the rtla default of 1000us, but also happens with the rtla default. Currently, this leads to rtla hanging and having to be terminated with SIGTERM. SIGINT setting stop_tracing is not enough, since more and more events are coming and tracefs_iterate_raw_events never exits. To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no more events are generated when rtla is supposed to exit. Also on receiving SIGINT/SIGALRM twice, abort iteration immediately with tracefs_iterate_stop, making rtla exit right away instead of waiting for all events to be processed. - Account for missed events Due to tracefs buffer overflow, it can happen that rtla misses events, making the tracing results inaccurate. Count both the number of missed events and the total number of processed events, and display missed events as well as their percentage. The numbers are displayed for both osnoise and timerlat, even though for the earlier, missed events are generally not expected. For hist, the number is displayed at the end of the run; for top, it is displayed on each printing of the top table. - Changes to make osnoise more robust There was a dependency in the code that the first field of the osnoise_tool structure was the trace field. If that that ever changed, then the code work break. Change the code to encapsulate this dependency where the code that uses the structure does not have this dependency. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UQ4BQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qktFAQD2px6MyoOVTssB5Iw3aTWGUfTFoDEc bfng5JsBxlVJkQEA+2UUvP8FJlLTOQvVEwJiscX7CCJxl5bYkV6GWuGRxQU= =h//9 -----END PGP SIGNATURE----- Merge tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull rv and tools/rtla updates from Steven Rostedt: - Add a test suite to test the tool Add a small test suite that can be used to test rtla's basic features to at least have something to test when applying changes. - Automate manual steps in monitor creation While creating a new monitor in RV, besides generating code from dot2k, there are a few manual steps which can be tedious and error prone, like adding the tracepoints, makefile lines and kconfig, or selecting events that start the monitor in the initial state. Updates were made to try and automate as much as possible among those steps to make creating a new RV monitor much quicker. It is still requires to select proper tracepoints, this step is harder to automate in a general way and, in several cases, would still need user intervention. - Have rtla timerlat hist and top set OSNOISE_WORKLOAD flag Have both rtla-timerlat-hist and rtla-timerlat-top set OSNOISE_WORKLOAD to the proper value ("on" when running with -k, "off" when running with -u) every time the option is available instead of setting it only when running with -u. This prevents rtla timerlat -k from giving no results when NO_OSNOISE_WORKLOAD is set, either manually or by an abnormally exited earlier run of rtla timerlat -u. - Stop rtla timerlat on signal properly when overloaded There is an issue where if rtla is run on machines with a high number of CPUs (100+), timerlat can generate more samples than rtla is able to process via tracefs_iterate_raw_events. This is especially common when the interval is set to 100us (rteval and cyclictest default) as opposed to the rtla default of 1000us, but also happens with the rtla default. Currently, this leads to rtla hanging and having to be terminated with SIGTERM. SIGINT setting stop_tracing is not enough, since more and more events are coming and tracefs_iterate_raw_events never exits. To fix this: Stop the timerlat tracer on SIGINT/SIGALRM to ensure no more events are generated when rtla is supposed to exit. Also on receiving SIGINT/SIGALRM twice, abort iteration immediately with tracefs_iterate_stop, making rtla exit right away instead of waiting for all events to be processed. - Account for missed events Due to tracefs buffer overflow, it can happen that rtla misses events, making the tracing results inaccurate. Count both the number of missed events and the total number of processed events, and display missed events as well as their percentage. The numbers are displayed for both osnoise and timerlat, even though for the earlier, missed events are generally not expected. For hist, the number is displayed at the end of the run; for top, it is displayed on each printing of the top table. - Changes to make osnoise more robust There was a dependency in the code that the first field of the osnoise_tool structure was the trace field. If that that ever changed, then the code work break. Change the code to encapsulate this dependency where the code that uses the structure does not have this dependency. * tag 'trace-tools-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits) rtla: Report missed event count rtla: Add function to report missed events rtla: Count all processed events rtla: Count missed trace events tools/rtla: Add osnoise_trace_is_off() rtla/timerlat_top: Set OSNOISE_WORKLOAD for kernel threads rtla/timerlat_hist: Set OSNOISE_WORKLOAD for kernel threads rtla/osnoise: Distinguish missing workload option rtla/timerlat_top: Abort event processing on second signal rtla/timerlat_hist: Abort event processing on second signal rtla/timerlat_top: Stop timerlat tracer on signal rtla/timerlat_hist: Stop timerlat tracer on signal rtla: Add trace_instance_stop tools/rtla: Add basic test suite verification/dot2k: Implement event type detection verification/dot2k: Auto patch current kernel source verification/dot2k: Simplify manual steps in monitor creation rv: Simplify manual steps in monitor creation verification/dot2k: Add support for name and description options verification/dot2k: More robust template variables ... |
||
![]() |
90ab2117f4 |
Runtime verifier and osnoise fixes for 6.14:
- Reset idle tasks on reset for runtime verifier When the runtime verifier is reset, it resets the task's data that is being monitored. But it only iterates for_each_process() which does not include the idle tasks. As the idle tasks can be monitored, they need to be reset as well. - Fix the enabling and disabling of tracepoints in osnoise If timerlat is enabled and the WORKLOAD flag is not set, then the osnoise tracer will enable the migrate task tracepoint to monitor it for its own workload. The test to enable the tracepoint is done against user space modifiable parameters. On disabling of the tracer, those same parameters are used to determine if the tracepoint should be disabled. The problem is if user space were to modify the parameters after it enables the tracer then it may not disable the tracepoint. Instead, a static variable is used to keep track if the tracepoint was enabled or not. Then when the tracer shuts down, it will use this variable to decide to disable the tracepoint or not, instead of looking at the user space parameters. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5UJ1BQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qo46AQDjr1yVTyUmzvxe1bMLDoDOq1xeRMRe 4f8CdpOJRxbi0QEAwnl5Ey9Rvcuy8GFpt/USESr4VYWAN10fvsPx7pkphAs= =RYyn -----END PGP SIGNATURE----- Merge tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier and osnoise fixes from Steven Rostedt: - Reset idle tasks on reset for runtime verifier When the runtime verifier is reset, it resets the task's data that is being monitored. But it only iterates for_each_process() which does not include the idle tasks. As the idle tasks can be monitored, they need to be reset as well. - Fix the enabling and disabling of tracepoints in osnoise If timerlat is enabled and the WORKLOAD flag is not set, then the osnoise tracer will enable the migrate task tracepoint to monitor it for its own workload. The test to enable the tracepoint is done against user space modifiable parameters. On disabling of the tracer, those same parameters are used to determine if the tracepoint should be disabled. The problem is if user space were to modify the parameters after it enables the tracer then it may not disable the tracepoint. Instead, a static variable is used to keep track if the tracepoint was enabled or not. Then when the tracer shuts down, it will use this variable to decide to disable the tracepoint or not, instead of looking at the user space parameters. * tag 'trace-rv-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/osnoise: Fix resetting of tracepoints rv: Reset per-task monitors also for idle tasks |
||
![]() |
e3ff424592 |
tracing/osnoise: Fix resetting of tracepoints
If a timerlat tracer is started with the osnoise option OSNOISE_WORKLOAD
disabled, but then that option is enabled and timerlat is removed, the
tracepoints that were enabled on timerlat registration do not get
disabled. If the option is disabled again and timelat is started, then it
triggers a warning in the tracepoint code due to registering the
tracepoint again without ever disabling it.
Do not use the same user space defined options to know to disable the
tracepoints when timerlat is removed. Instead, set a global flag when it
is enabled and use that flag to know to disable the events.
~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
~# echo timerlat > /sys/kernel/tracing/current_tracer
~# echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
~# echo nop > /sys/kernel/tracing/current_tracer
~# echo NO_OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
~# echo timerlat > /sys/kernel/tracing/current_tracer
Triggers:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 1337 at kernel/tracepoint.c:294 tracepoint_add_func+0x3b6/0x3f0
Modules linked in:
CPU: 6 UID: 0 PID: 1337 Comm: rtla Not tainted 6.13.0-rc4-test-00018-ga867c441128e-dirty #73
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:tracepoint_add_func+0x3b6/0x3f0
Code: 48 8b 53 28 48 8b 73 20 4c 89 04 24 e8 23 59 11 00 4c 8b 04 24 e9 36 fe ff ff 0f 0b b8 ea ff ff ff 45 84 e4 0f 84 68 fe ff ff <0f> 0b e9 61 fe ff ff 48 8b 7b 18 48 85 ff 0f 84 4f ff ff ff 49 8b
RSP: 0018:ffffb9b003a87ca0 EFLAGS: 00010202
RAX: 00000000ffffffef RBX: ffffffff92f30860 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9bf59e91ccd0 RDI: ffffffff913b6410
RBP: 000000000000000a R08: 00000000000005c7 R09: 0000000000000002
R10: ffffb9b003a87ce0 R11: 0000000000000002 R12: 0000000000000001
R13: ffffb9b003a87ce0 R14: ffffffffffffffef R15: 0000000000000008
FS: 00007fce81209240(0000) GS:ffff9bf6fdd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e99b728000 CR3: 00000001277c0002 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __warn.cold+0xb7/0x14d
? tracepoint_add_func+0x3b6/0x3f0
? report_bug+0xea/0x170
? handle_bug+0x58/0x90
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? __pfx_trace_sched_migrate_callback+0x10/0x10
? tracepoint_add_func+0x3b6/0x3f0
? __pfx_trace_sched_migrate_callback+0x10/0x10
? __pfx_trace_sched_migrate_callback+0x10/0x10
tracepoint_probe_register+0x78/0xb0
? __pfx_trace_sched_migrate_callback+0x10/0x10
osnoise_workload_start+0x2b5/0x370
timerlat_tracer_init+0x76/0x1b0
tracing_set_tracer+0x244/0x400
tracing_set_trace_write+0xa0/0xe0
vfs_write+0xfc/0x570
? do_sys_openat2+0x9c/0xe0
ksys_write+0x72/0xf0
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Link: https://lore.kernel.org/20250123204159.4450c88e@gandalf.local.home
Fixes:
|
||
![]() |
606489dbfa |
Fix atomic64 operations on some architectures for the tracing ring buffer:
- Have emulating atomic64 use arch_spin_locks instead of raw_spin_locks The tracing ring buffer events have a small timestamp that holds the delta between itself and the event before it. But this can be tricky to update when interrupts come in. It originally just set the deltas to zero for events that interrupted the adding of another event which made all the events in the interrupt have the same timestamp as the event it interrupted. This was not suitable for many tools, so it was eventually fixed. But that fix required adding an atomic64 cmpxchg on the timestamp in cases where an event was added while another event was in the process of being added. Originally, for 32 bit architectures, the manipulation of the 64 bit timestamp was done by a structure that held multiple 32bit words to hold parts of the timestamp and a counter. But as updates to the ring buffer were done, maintaining this became too complex and was replaced by the atomic64 generic operations which are now used by both 64bit and 32bit architectures. Shortly after that, it was reported that riscv32 and other 32 bit architectures that just used the generic atomic64 were locking up. This was because the generic atomic64 operations defined in lib/atomic64.c uses a raw_spin_lock() to emulate an atomic64 operation. The problem here was that raw_spin_lock() can also be traced by the function tracer (which is commonly used for debugging raw spin locks). Since the function tracer uses the tracing ring buffer, which now is being traced internally, this was triggering a recursion and setting off a warning that the spin locks were recusing. There's no reason for the code that emulates atomic64 operations to be using raw_spin_locks which have a lot of debugging infrastructure attached to them (depending on the config options). Instead it should be using the arch_spin_lock() which does not have any infrastructure attached to them and is used by low level infrastructure like RCU locks, lockdep and of course tracing. Using arch_spin_lock()s fixes this issue. - Do not trace in NMI if the architecture uses emulated atomic64 operations Another issue with using the emulated atomic64 operations that uses spin locks to emulate the atomic64 operations is that they cannot be used in NMI context. As an NMI can trigger while holding the atomic64 spin locks it can try to take the same lock and cause a deadlock. Have the ring buffer fail recording events if in NMI context and the architecture uses the emulated atomic64 operations. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5Jr7RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qg7cAPoD/H4BRsFa3UUDnxofTlBuj4A7neJd rk9ddD9HXH8KywEAhBn1Oujiw81Ayjx7E6s4ednAQX4rldTXBXDyFNuuGgU= =b13F -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull trace fing buffer fix from Steven Rostedt: "Fix atomic64 operations on some architectures for the tracing ring buffer: - Have emulating atomic64 use arch_spin_locks instead of raw_spin_locks The tracing ring buffer events have a small timestamp that holds the delta between itself and the event before it. But this can be tricky to update when interrupts come in. It originally just set the deltas to zero for events that interrupted the adding of another event which made all the events in the interrupt have the same timestamp as the event it interrupted. This was not suitable for many tools, so it was eventually fixed. But that fix required adding an atomic64 cmpxchg on the timestamp in cases where an event was added while another event was in the process of being added. Originally, for 32 bit architectures, the manipulation of the 64 bit timestamp was done by a structure that held multiple 32bit words to hold parts of the timestamp and a counter. But as updates to the ring buffer were done, maintaining this became too complex and was replaced by the atomic64 generic operations which are now used by both 64bit and 32bit architectures. Shortly after that, it was reported that riscv32 and other 32 bit architectures that just used the generic atomic64 were locking up. This was because the generic atomic64 operations defined in lib/atomic64.c uses a raw_spin_lock() to emulate an atomic64 operation. The problem here was that raw_spin_lock() can also be traced by the function tracer (which is commonly used for debugging raw spin locks). Since the function tracer uses the tracing ring buffer, which now is being traced internally, this was triggering a recursion and setting off a warning that the spin locks were recusing. There's no reason for the code that emulates atomic64 operations to be using raw_spin_locks which have a lot of debugging infrastructure attached to them (depending on the config options). Instead it should be using the arch_spin_lock() which does not have any infrastructure attached to them and is used by low level infrastructure like RCU locks, lockdep and of course tracing. Using arch_spin_lock()s fixes this issue. - Do not trace in NMI if the architecture uses emulated atomic64 operations Another issue with using the emulated atomic64 operations that uses spin locks to emulate the atomic64 operations is that they cannot be used in NMI context. As an NMI can trigger while holding the atomic64 spin locks it can try to take the same lock and cause a deadlock. Have the ring buffer fail recording events if in NMI context and the architecture uses the emulated atomic64 operations" * tag 'trace-ringbuffer-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: atomic64: Use arch_spin_locks instead of raw_spin_locks ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg() |
||
![]() |
7c1badb2a9 |
Remove calltime and rettime from fgraph infrastructure
The calltime and rettime were used by the function graph tracer to calculate the timings of functions where it traced their entry and exit. The calltime and rettime were stored in the generic structures that were used for the mechanisms to add an entry and exit callback. Now that function graph infrastructure is used by other subsystems than just the tracer, the calltime and rettime are not needed for them. Remove the calltime and rettime from the generic fgraph infrastructure and have the callers that require them handle them. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5Jk2BQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qr3GAQCtzSX6PYTliZKm/Wa5zhebopwhGKEb Mv0VZejaxYu0hgEA5O57J676FAOaQ7BLzoMojKFhR8RZ6LadKR5r7A0Qzgc= =Bb/r -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull fgraph updates from Steven Rostedt: "Remove calltime and rettime from fgraph infrastructure The calltime and rettime were used by the function graph tracer to calculate the timings of functions where it traced their entry and exit. The calltime and rettime were stored in the generic structures that were used for the mechanisms to add an entry and exit callback. Now that function graph infrastructure is used by other subsystems than just the tracer, the calltime and rettime are not needed for them. Remove the calltime and rettime from the generic fgraph infrastructure and have the callers that require them handle them" * tag 'ftrace-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fgraph: Remove calltime and rettime from generic operations |
||
![]() |
e8744fbc83 |
tracing updates for v6.14:
- Cleanup with guard() and free() helpers There were several places in the code that had a lot of "goto out" in the error paths to either unlock a lock or free some memory that was allocated. But this is error prone. Convert the code over to use the guard() and free() helpers that let the compiler unlock locks or free memory when the function exits. - Update the Rust tracepoint code to use the C code too There was some duplication of the tracepoint code for Rust that did the same logic as the C code. Add a helper that makes it possible for both algorithms to use the same logic in one place. - Add poll to trace event hist files It is useful to know when an event is triggered, or even with some filtering. Since hist files of events get updated when active and the event is triggered, allow applications to poll the hist file and wake up when an event is triggered. This will let the application know that the event it is waiting for happened. - Add :mod: command to enable events for current or future modules The function tracer already has a way to enable functions to be traced in modules by writing ":mod:<module>" into set_ftrace_filter. That will enable either all the functions for the module if it is loaded, or if it is not, it will cache that command, and when the module is loaded that matches <module>, its functions will be enabled. This also allows init functions to be traced. But currently events do not have that feature. Add the command where if ':mod:<module>' is written into set_event, then either all the modules events are enabled if it is loaded, or cache it so that the module's events are enabled when it is loaded. This also works from the kernel command line, where "trace_event=:mod:<module>", when the module is loaded at boot up, its events will be enabled then. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ5EbMxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qkZsAP9Amgx9frSbR1pn1t0I3wVnQx7khgOu s/b8Ro+vjTx1/QD/RN2AA7f+HK4F27w3Aqfrs0nKXAPtXWsJ9Epp8raG5w8= =Pg+4 -----END PGP SIGNATURE----- Merge tag 'trace-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Cleanup with guard() and free() helpers There were several places in the code that had a lot of "goto out" in the error paths to either unlock a lock or free some memory that was allocated. But this is error prone. Convert the code over to use the guard() and free() helpers that let the compiler unlock locks or free memory when the function exits. - Update the Rust tracepoint code to use the C code too There was some duplication of the tracepoint code for Rust that did the same logic as the C code. Add a helper that makes it possible for both algorithms to use the same logic in one place. - Add poll to trace event hist files It is useful to know when an event is triggered, or even with some filtering. Since hist files of events get updated when active and the event is triggered, allow applications to poll the hist file and wake up when an event is triggered. This will let the application know that the event it is waiting for happened. - Add :mod: command to enable events for current or future modules The function tracer already has a way to enable functions to be traced in modules by writing ":mod:<module>" into set_ftrace_filter. That will enable either all the functions for the module if it is loaded, or if it is not, it will cache that command, and when the module is loaded that matches <module>, its functions will be enabled. This also allows init functions to be traced. But currently events do not have that feature. Add the command where if ':mod:<module>' is written into set_event, then either all the modules events are enabled if it is loaded, or cache it so that the module's events are enabled when it is loaded. This also works from the kernel command line, where "trace_event=:mod:<module>", when the module is loaded at boot up, its events will be enabled then. * tag 'trace-v6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits) tracing: Fix output of set_event for some cached module events tracing: Fix allocation of printing set_event file content tracing: Rename update_cache() to update_mod_cache() tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES selftests/ftrace: Add test that tests event :mod: commands tracing: Cache ":mod:" events for modules not loaded yet tracing: Add :mod: command to enabled module events selftests/tracing: Add hist poll() support test tracing/hist: Support POLLPRI event for poll on histogram tracing/hist: Add poll(POLLIN) support on hist file tracing: Fix using ret variable in tracing_set_tracer() tracepoint: Reduce duplication of __DO_TRACE_CALL tracing/string: Create and use __free(argv_free) in trace_dynevent.c tracing: Switch trace_stat.c code over to use guard() tracing: Switch trace_stack.c code over to use guard() tracing: Switch trace_osnoise.c code over to use guard() and __free() tracing: Switch trace_events_synth.c code over to use guard() tracing: Switch trace_events_filter.c code over to use guard() tracing: Switch trace_events_trigger.c code over to use guard() tracing: Switch trace_events_hist.c code over to use guard() ... |
||
![]() |
544521d621 |
Probes updates for v6.14:
- kprobes: Cleanups using guard() and __free(): Use cleanup.h macros to cleanup code and remove all gotos from kprobes code. This work includes below changes. . kprobes: Adopt guard() and scoped_guard() . jump_label: Define guard() for jump_label_lock . kprobes: Use guard() for external locks . kprobes: Use guard for rcu_read_lock . kprobes: Remove unneeded goto . kprobes: Remove remaining gotos - tracing/probes: Also cleanups tracing/*probe events code with guard() and __free(). These patches are just to simplify the parser codes. This work includes below changes. . tracing/kprobe: Adopt guard() and scoped_guard() . tracing/uprobe: Adopt guard() and scoped_guard() . tracing/eprobe: Adopt guard() and scoped_guard() . tracing: Use __free() in trace_probe for cleanup . tracing: Use __free() for kprobe events to cleanup . tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos - kprobes: Reduce preempt disable scope in check_kprobe_access_safe() This reduces preempt disable time only when getting the module refcount in check_kprobe_access_safe(). Previously it disabled preempt needlessly for other checks including jump_label_text_reserved(), but it took long time because of the linear search. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmeNy8kbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b/UoIAMYX8pYtLMZjokKSlsnN Y3rMZecFJNKfXeA62YazpESO9n+BYyqj/Zj25Ws4YzdfnoT4kGtMuoUCS7d5U3lo RZO/a7i3TgrzQdWbRwBz7l75h+FnCAsrSoLGoH/sQxXUP+p/2C6GJG3O3A/0qMaS TPp7/nsR78P7MSf6Yr/ODZj0UWMSX6a6QNrzwloiMUjY6aPxz7K1on9cdzMF2QPE fgtJk+JbCn9nLOJi/SVlO+wu4EfdKrfT7sqz9taNX/eQYmx2O9fE3R92Qvhxh79r MrqcrMfcwfWtX+yD1kb9mf+jMNYy8Dhcf595B0TGSzCZq64ir9UukntR6YD3raAO Wm0= =qmrY -----END PGP SIGNATURE----- Merge tag 'probes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - kprobes: Cleanups using guard() and __free(): Use cleanup.h macros to cleanup code and remove all gotos from kprobes code. - tracing/probes: Also cleanups tracing/*probe events code with guard() and __free(). These patches are just to simplify the parser codes. - kprobes: Reduce preempt disable scope in check_kprobe_access_safe() This reduces preempt disable time to only when getting the module refcount in check_kprobe_access_safe(). Previously it disabled preempt needlessly for other checks including jump_label_text_reserved(), which took a long time because of the linear search. * tag 'probes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos tracing: Use __free() for kprobe events to cleanup tracing: Use __free() in trace_probe for cleanup kprobes: Remove remaining gotos kprobes: Remove unneeded goto kprobes: Use guard for rcu_read_lock kprobes: Use guard() for external locks jump_label: Define guard() for jump_label_lock tracing/eprobe: Adopt guard() and scoped_guard() tracing/uprobe: Adopt guard() and scoped_guard() tracing/kprobe: Adopt guard() and scoped_guard() kprobes: Adopt guard() and scoped_guard() kprobes: Reduce preempt disable scope in check_kprobe_access_safe() |
||
![]() |
d0d106a2bd |
bpf-next-6.14
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmeOu1YACgkQ6rmadz2v bTrrHxAAn6eqEsluWnDlzhI0OGsPjvgS00sf+MOeqiXYeS2eJ8yJuKifp38+nIQZ lIplsWU2ReUY20eizPqLPnQ7TXZGvLgp08E8yHUoZ0siWanqr9iDRfbZCCNrDMNm lMqeR1SLapMws2R/UX9JbvPn2ajIJ6Lb4wxenTfdlW6q+0hAGM6Dt0k/jBod+quq /oo+xwG3L0q4APBovJfiAFN2z6IYN03b+zLiOrpIJtMACGewEXnl3m4mkL8ZM/FV nZGPIxIUPXCpKTGEkNqxfkrnHN2wZQ4ZSKEJ6lhEEp4jrgCVITaGZ/E7jlx6fZoj bbd4YMonIPo9Nhim8p1dt8yYBhKKiE5IXIq0GqlMv5+MvAN8ylrlydpsouW1fu66 hZ1W1BxbxmrgyF0Bwo9JPOMhBHwMrmD6iH9LgiMpZf0ASeF+q9cJpoSOU5j5E9XB LpLIRf5jYTd4wZjhDmrQREReLo+Bng9DlCBu+jjh2+YTz6l6Qed+ETpENcd7lL5i IHZVbgD2RVPNJoUfdrd763HfYfDTk+50MF5FIMEyfKHz11if0E/LhBMzto22hm6b 2f8ruj/8yvg8s2dxEP3ySQgcnynlwEnGxLenUVv7uEOYKeWri1rq+fvTK5ne1OLK oHnTlkViwQb74c0r8cFW+nkyfUYTfhhBAql14rl/fMjGDO2KZ10= =f2CA -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "A smaller than usual release cycle. The main changes are: - Prepare selftest to run with GCC-BPF backend (Ihor Solodrai) In addition to LLVM-BPF runs the BPF CI now runs GCC-BPF in compile only mode. Half of the tests are failing, since support for btf_decl_tag is still WIP, but this is a great milestone. - Convert various samples/bpf to selftests/bpf/test_progs format (Alexis Lothoré and Bastien Curutchet) - Teach verifier to recognize that array lookup with constant in-range index will always succeed (Daniel Xu) - Cleanup migrate disable scope in BPF maps (Hou Tao) - Fix bpf_timer destroy path in PREEMPT_RT (Hou Tao) - Always use bpf_mem_alloc in bpf_local_storage in PREEMPT_RT (Martin KaFai Lau) - Refactor verifier lock support (Kumar Kartikeya Dwivedi) This is a prerequisite for upcoming resilient spin lock. - Remove excessive 'may_goto +0' instructions in the verifier that LLVM leaves when unrolls the loops (Yonghong Song) - Remove unhelpful bpf_probe_write_user() warning message (Marco Elver) - Add fd_array_cnt attribute for prog_load command (Anton Protopopov) This is a prerequisite for upcoming support for static_branch" * tag 'bpf-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (125 commits) selftests/bpf: Add some tests related to 'may_goto 0' insns bpf: Remove 'may_goto 0' instruction in opt_remove_nops() bpf: Allow 'may_goto 0' instruction in verifier selftests/bpf: Add test case for the freeing of bpf_timer bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT bpf: Free element after unlock in __htab_map_lookup_and_delete_elem() bpf: Bail out early in __htab_map_lookup_and_delete_elem() bpf: Free special fields after unlock in htab_lru_map_delete_node() tools: Sync if_xdp.h uapi tooling header libbpf: Work around kernel inconsistently stripping '.llvm.' suffix bpf: selftests: verifier: Add nullness elision tests bpf: verifier: Support eliding map lookup nullness bpf: verifier: Refactor helper access type tracking bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write bpf: verifier: Add missing newline on verbose() call selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED libbpf: Fix return zero when elf_begin failed selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test veristat: Load struct_ops programs only once ... |
||
![]() |
66611c0475 |
fgraph: Remove calltime and rettime from generic operations
The function graph infrastructure is now generic so that kretprobes, fprobes and BPF can use it. But there is still some leftover logic that only the function graph tracer itself uses. This is the calculation of the calltime and return time of the functions. The calculation of the calltime has been moved into the function graph tracer and those users that need it so that it doesn't cause overhead to the other users. But the return function timestamp was still called. Instead of just moving the taking of the timestamp into the function graph trace remove the calltime and rettime completely from the ftrace_graph_ret structure. Instead, move it into the function graph return entry event structure and this also moves all the calltime and rettime logic out of the generic fgraph.c code and into the tracing code that uses it. This has been reported to decrease the overhead by ~27%. Link: https://lore.kernel.org/all/Z3aSuql3fnXMVMoM@krava/ Link: https://lore.kernel.org/all/173665959558.1629214.16724136597211810729.stgit@devnote2/ Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250121194436.15bdf71a@gandalf.local.home Reported-by: Jiri Olsa <olsajiri@gmail.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
2e04247f7c |
ftrace updates for v6.14:
- Have fprobes built on top of function graph infrastructure The fprobe logic is an optimized kprobe that uses ftrace to attach to functions when a probe is needed at the start or end of the function. The fprobe and kretprobe logic implements a similar method as the function graph tracer to trace the end of the function. That is to hijack the return address and jump to a trampoline to do the trace when the function exits. To do this, a shadow stack needs to be created to store the original return address. Fprobes and function graph do this slightly differently. Fprobes (and kretprobes) has slots per callsite that are reserved to save the return address. This is fine when just a few points are traced. But users of fprobes, such as BPF programs, are starting to add many more locations, and this method does not scale. The function graph tracer was created to trace all functions in the kernel. In order to do this, when function graph tracing is started, every task gets its own shadow stack to hold the return address that is going to be traced. The function graph tracer has been updated to allow multiple users to use its infrastructure. Now have fprobes be one of those users. This will also allow for the fprobe and kretprobe methods to trace the return address to become obsolete. With new technologies like CFI that need to know about these methods of hijacking the return address, going toward a solution that has only one method of doing this will make the kernel less complex. - Cleanup with guard() and free() helpers There were several places in the code that had a lot of "goto out" in the error paths to either unlock a lock or free some memory that was allocated. But this is error prone. Convert the code over to use the guard() and free() helpers that let the compiler unlock locks or free memory when the function exits. - Remove disabling of interrupts in the function graph tracer When function graph tracer was first introduced, it could race with interrupts and NMIs. To prevent that race, it would disable interrupts and not trace NMIs. But the code has changed to allow NMIs and also interrupts. This change was done a long time ago, but the disabling of interrupts was never removed. Remove the disabling of interrupts in the function graph tracer is it is not needed. This greatly improves its performance. - Allow the :mod: command to enable tracing module functions on the kernel command line. The function tracer already has a way to enable functions to be traced in modules by writing ":mod:<module>" into set_ftrace_filter. That will enable either all the functions for the module if it is loaded, or if it is not, it will cache that command, and when the module is loaded that matches <module>, its functions will be enabled. This also allows init functions to be traced. But currently events do not have that feature. Because enabling function tracing can be done very early at boot up (before scheduling is enabled), the commands that can be done when function tracing is started is limited. Having the ":mod:" command to trace module functions as they are loaded is very useful. Update the kernel command line function filtering to allow it. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42E2RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqXSAPwOMxuhye8tb1GYG62QD9+w7e6nOmlC 2GCPj4detnEM2QD/ciivkhespVKhHpZHRewAuSnJgHPSM45NQ3EVESzjWQ4= =snbx -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace updates from Steven Rostedt: - Have fprobes built on top of function graph infrastructure The fprobe logic is an optimized kprobe that uses ftrace to attach to functions when a probe is needed at the start or end of the function. The fprobe and kretprobe logic implements a similar method as the function graph tracer to trace the end of the function. That is to hijack the return address and jump to a trampoline to do the trace when the function exits. To do this, a shadow stack needs to be created to store the original return address. Fprobes and function graph do this slightly differently. Fprobes (and kretprobes) has slots per callsite that are reserved to save the return address. This is fine when just a few points are traced. But users of fprobes, such as BPF programs, are starting to add many more locations, and this method does not scale. The function graph tracer was created to trace all functions in the kernel. In order to do this, when function graph tracing is started, every task gets its own shadow stack to hold the return address that is going to be traced. The function graph tracer has been updated to allow multiple users to use its infrastructure. Now have fprobes be one of those users. This will also allow for the fprobe and kretprobe methods to trace the return address to become obsolete. With new technologies like CFI that need to know about these methods of hijacking the return address, going toward a solution that has only one method of doing this will make the kernel less complex. - Cleanup with guard() and free() helpers There were several places in the code that had a lot of "goto out" in the error paths to either unlock a lock or free some memory that was allocated. But this is error prone. Convert the code over to use the guard() and free() helpers that let the compiler unlock locks or free memory when the function exits. - Remove disabling of interrupts in the function graph tracer When function graph tracer was first introduced, it could race with interrupts and NMIs. To prevent that race, it would disable interrupts and not trace NMIs. But the code has changed to allow NMIs and also interrupts. This change was done a long time ago, but the disabling of interrupts was never removed. Remove the disabling of interrupts in the function graph tracer is it is not needed. This greatly improves its performance. - Allow the :mod: command to enable tracing module functions on the kernel command line. The function tracer already has a way to enable functions to be traced in modules by writing ":mod:<module>" into set_ftrace_filter. That will enable either all the functions for the module if it is loaded, or if it is not, it will cache that command, and when the module is loaded that matches <module>, its functions will be enabled. This also allows init functions to be traced. But currently events do not have that feature. Because enabling function tracing can be done very early at boot up (before scheduling is enabled), the commands that can be done when function tracing is started is limited. Having the ":mod:" command to trace module functions as they are loaded is very useful. Update the kernel command line function filtering to allow it. * tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits) ftrace: Implement :mod: cache filtering on kernel command line tracing: Adopt __free() and guard() for trace_fprobe.c bpf: Use ftrace_get_symaddr() for kprobe_multi probes ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr Documentation: probes: Update fprobe on function-graph tracer selftests/ftrace: Add a test case for repeating register/unregister fprobe selftests: ftrace: Remove obsolate maxactive syntax check tracing/fprobe: Remove nr_maxactive from fprobe fprobe: Add fprobe_header encoding feature fprobe: Rewrite fprobe on function-graph tracer s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS tracing: Add ftrace_fill_perf_regs() for perf event tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs fprobe: Use ftrace_regs in fprobe exit handler fprobe: Use ftrace_regs in fprobe entry handler fgraph: Pass ftrace_regs to retfunc fgraph: Replace fgraph_ret_regs with ftrace_regs ... |
||
![]() |
0074adea39 |
ring-buffer changes for v6.14
- Clean up the __rb_map_vma() logic The logic of __rb_map_vma() has a error check with WARN_ON() that makes sure that the index does not go past the end of the array of buffers. The test in the loop pretty much guarantees that it will never happen, but since the relation of the variables used is a little complex, the WARN_ON() check was added. It was noticed that the array was dereferenced before this check and if the logic does break and for some reason the logic goes past the array, there will be an out of bounds access here. Move the access to after the WARN_ON(). - Consolidate how the ring buffer is determined to be empty Currently there's two ways that are used to determine if the ring buffer is empty. One relies on the status of the commit and reader pages and what was read, and the other is on what was written vs what was read. By using the number of entries (written) method, it can be used for reading events that are out of the kernel's control (what pKVM will use). Move to this method to make it easier to implement a pKVM ring buffer that the kernel can read. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42XuRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qn3GAQCOQ94vr88FSXb/azC9281iDGYC/KbJ 7J4dGv2rXHpoVAEAtXRXSXpG0mTIJ6TtgVKgMrIFAuT/AVo4EIUr2q/CsgA= =2G7c -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull trace ring-buffer updates from Steven Rostedt: - Clean up the __rb_map_vma() logic The logic of __rb_map_vma() has a error check with WARN_ON() that makes sure that the index does not go past the end of the array of buffers. The test in the loop pretty much guarantees that it will never happen, but since the relation of the variables used is a little complex, the WARN_ON() check was added. It was noticed that the array was dereferenced before this check and if the logic does break and for some reason the logic goes past the array, there will be an out of bounds access here. Move the access to after the WARN_ON(). - Consolidate how the ring buffer is determined to be empty Currently there's two ways that are used to determine if the ring buffer is empty. One relies on the status of the commit and reader pages and what was read, and the other is on what was written vs what was read. By using the number of entries (written) method, it can be used for reading events that are out of the kernel's control (what pKVM will use). Move to this method to make it easier to implement a pKVM ring buffer that the kernel can read. * tag 'trace-ringbuffer-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Make reading page consistent with the code logic ring-buffer: Check for empty ring-buffer with rb_num_of_entries() |
||
![]() |
8f21943e10 |
tracing: Fix output of set_event for some cached module events
The following works fine:
~# echo ':mod:trace_events_sample' > /sys/kernel/tracing/set_event
~# cat /sys/kernel/tracing/set_event
*:*:mod:trace_events_sample
~#
But if a name is given without a ':' where it can match an event name or
system name, the output of the cached events does not include a new line:
~# echo 'foo_bar:mod:trace_events_sample' > /sys/kernel/tracing/set_event
~# cat /sys/kernel/tracing/set_event
foo_bar:mod:trace_events_sample~#
Add the '\n' to that as well.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250121151336.6c491844@gandalf.local.home
Fixes:
|
||
![]() |
f95ee54294 |
tracing: Fix allocation of printing set_event file content
The adding of cached events for modules not loaded yet required a
descriptor to separate the iteration of events with the iteration of
cached events for a module. But the allocation used the size of the
pointer and not the size of the contents to allocate its data and caused a
slab-out-of-bounds.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250121151236.47fcf433@gandalf.local.home
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/Z4_OHKESRSiJcr-b@lappy/
Fixes:
|
||
![]() |
cd2375a356 |
ring-buffer: Do not allow events in NMI with generic atomic64 cmpxchg()
Some architectures can not safely do atomic64 operations in NMI context.
Since the ring buffer relies on atomic64 operations to do its time
keeping, if an event is requested in NMI context, reject it for these
architectures.
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Link: https://lore.kernel.org/20250120235721.407068250@goodmis.org
Fixes:
|
||
![]() |
6c4aa896eb |
Performance events changes for v6.14:
- Seqlock optimizations that arose in a perf context and were merged into the perf tree: - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan) - mm: Convert mm_lock_seq to a proper seqcount ((Suren Baghdasaryan) - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan) - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra) - Core perf enhancements: - Reduce 'struct page' footprint of perf by mapping pages in advance (Lorenzo Stoakes) - Save raw sample data conditionally based on sample type (Yabin Cui) - Reduce sampling overhead by checking sample_type in perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui) - Export perf_exclude_event() (Namhyung Kim) - Uprobes scalability enhancements: (Andrii Nakryiko) - Simplify find_active_uprobe_rcu() VMA checks - Add speculative lockless VMA-to-inode-to-uprobe resolution - Simplify session consumer tracking - Decouple return_instance list traversal and freeing - Ensure return_instance is detached from the list before freeing - Reuse return_instances between multiple uretprobes within task - Guard against kmemdup() failing in dup_return_instance() - AMD core PMU driver enhancements: - Relax privilege filter restriction on AMD IBS (Namhyung Kim) - AMD RAPL energy counters support: (Dhananjay Ugwekar) - Introduce topology_logical_core_id() (K Prateek Nayak) - Remove the unused get_rapl_pmu_cpumask() function - Remove the cpu_to_rapl_pmu() function - Rename rapl_pmu variables - Make rapl_model struct global - Add arguments to the init and cleanup functions - Modify the generic variable names to *_pkg* - Remove the global variable rapl_msrs - Move the cntr_mask to rapl_pmus struct - Add core energy counter support for AMD CPUs - Intel core PMU driver enhancements: - Support RDPMC 'metrics clear mode' feature (Kan Liang) - Clarify adaptive PEBS processing (Kan Liang) - Factor out functions for PEBS records processing (Kan Liang) - Simplify the PEBS records processing for adaptive PEBS (Kan Liang) - Intel uncore driver enhancements: (Kan Liang) - Convert buggy pmu->func_id use to pmu->registered - Support more units on Granite Rapids Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOJdQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1i2yQ/+MXl7yfJOgdbwjBpgGGzH4burEO7ppak+ ktzz+YjpNgjODe/xMAJGjjblouuYArCnRolc1UPvPm6M7jSY76wi42Y6c4dRtFoB 2ReSrRqnreLOcrRS9nsTjvWRHfJHqJDVSd9TfHX6ILfzbaizCZOGYk558ZxAKRqu Lw7FOvLEe/Y3tg4z8dDg083jsasalKySP9wIPc0BkSqQTOfusd3KXju/Fux/9wkn hZcUgF4ds+0bH7xtO1/G9ILqGyeq97X1McIR9bAjln5Mxykclen4hSjRaWWHHo9O mzBKmd/blIATisfuuW+QLDQow3M1k3688cz7e9QOeWHHd/dJiMb9RLV90jdND/T/ uLINC5vNemzyWEfnNiYQ31LjhG3SeuDiKWzRp36MbQcCh6EBdRXWLBgtmxq1L/3o ZCaCdtFu5+6epycdyOVZEpWDnjdx4GmLXMZi5WJfZ7fZ/IFjNkjk4OdzI1iRQ+i3 Sbi75ep59ayTUhm5AB7gCJsP3R7EsZsiPHUenQdA2n9Sj6xE+IuhlS/QDQ9g5mdY Ijs0jHeVCGmhYoOD1xWnCZSzlnkEVU3zwfypAK+MC7pgtFMwDy5/Bu1USGxXXDy+ aKsrJRSgHbtZ1gwoHstqkV+DeCTfElCLYkvigzI5Nmyib5Zp4vkwy2ZLWQjaNjm7 mqRI7PugUkU= =c8XB -----END PGP SIGNATURE----- Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Seqlock optimizations that arose in a perf context and were merged into the perf tree: - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan) - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan) - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan) - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra) Core perf enhancements: - Reduce 'struct page' footprint of perf by mapping pages in advance (Lorenzo Stoakes) - Save raw sample data conditionally based on sample type (Yabin Cui) - Reduce sampling overhead by checking sample_type in perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui) - Export perf_exclude_event() (Namhyung Kim) Uprobes scalability enhancements: (Andrii Nakryiko) - Simplify find_active_uprobe_rcu() VMA checks - Add speculative lockless VMA-to-inode-to-uprobe resolution - Simplify session consumer tracking - Decouple return_instance list traversal and freeing - Ensure return_instance is detached from the list before freeing - Reuse return_instances between multiple uretprobes within task - Guard against kmemdup() failing in dup_return_instance() AMD core PMU driver enhancements: - Relax privilege filter restriction on AMD IBS (Namhyung Kim) AMD RAPL energy counters support: (Dhananjay Ugwekar) - Introduce topology_logical_core_id() (K Prateek Nayak) - Remove the unused get_rapl_pmu_cpumask() function - Remove the cpu_to_rapl_pmu() function - Rename rapl_pmu variables - Make rapl_model struct global - Add arguments to the init and cleanup functions - Modify the generic variable names to *_pkg* - Remove the global variable rapl_msrs - Move the cntr_mask to rapl_pmus struct - Add core energy counter support for AMD CPUs Intel core PMU driver enhancements: - Support RDPMC 'metrics clear mode' feature (Kan Liang) - Clarify adaptive PEBS processing (Kan Liang) - Factor out functions for PEBS records processing (Kan Liang) - Simplify the PEBS records processing for adaptive PEBS (Kan Liang) Intel uncore driver enhancements: (Kan Liang) - Convert buggy pmu->func_id use to pmu->registered - Support more units on Granite Rapids" * tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) perf: map pages in advance perf/x86/intel/uncore: Support more units on Granite Rapids perf/x86/intel/uncore: Clean up func_id perf/x86/intel: Support RDPMC metrics clear mode uprobes: Guard against kmemdup() failing in dup_return_instance() perf/x86: Relax privilege filter restriction on AMD IBS perf/core: Export perf_exclude_event() uprobes: Reuse return_instances between multiple uretprobes within task uprobes: Ensure return_instance is detached from the list before freeing uprobes: Decouple return_instance list traversal and freeing uprobes: Simplify session consumer tracking uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution uprobes: simplify find_active_uprobe_rcu() VMA checks mm: introduce mmap_lock_speculate_{try_begin|retry} mm: convert mm_lock_seq to a proper seqcount mm/gup: Use raw_seqcount_try_begin() seqlock: add raw_seqcount_try_begin perf/x86/rapl: Add core energy counter support for AMD CPUs perf/x86/rapl: Move the cntr_mask to rapl_pmus struct perf/x86/rapl: Remove the global variable rapl_msrs ... |
||
![]() |
1cbfb828e0 |
for-6.14/block-20250118
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/ CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA h3Rtbh+Pgg== =EXYs -----END PGP SIGNATURE----- Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ... |
||
![]() |
22412b72ca |
tracing: Rename update_cache() to update_mod_cache()
The static function in trace_events.c called update_cache() is too generic
and conflicts with the function defined in arch/openrisc/include/asm/pgtable.h
Rename it to update_mod_cache() to make it less generic.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250120172756.4ecfb43f@batman.local.home
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501210550.Ufrj5CRn-lkp@intel.com/
Fixes:
|
||
![]() |
1a89a6924b |
kernel-6.14-rc1.pid
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pR0wAKCRCRxhvAZXjc ojb2AQD5QfpTEX/ju1TkenTvoNl+JfnIjaVSY40Lm9DWYzmCMAEAuRvf5WRIV713 00/RVOrUvsLobzhmnk0yw53EQ5A+pA0= =2NDA -----END PGP SIGNATURE----- Merge tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pid_max namespacing update from Christian Brauner: "The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max by default. Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change. The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling to make selecting specific processes without typing really large pid numbers available. In any case, there are workloads that have expections about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concreate example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t. That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it. Of course, giving containers the ability to restrict the number of processes in their respective pid namespace indepent of the global limit through pid_max is something desirable in itself and comes in handy in general. Independent of motivating use-cases the existence of pid namespaces makes this also a good semantical extension and there have been prior proposals pushing in a similar direction. The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here. What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100 that no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchial this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this. All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface" * tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tests/pid_namespace: add pid_max tests pid: allow pid_max to be set per pid namespace |
||
![]() |
a925df6f50 |
tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES
A typo was introduced when adding the ":mod:" command that did
a "#if CONFIG_MODULES" instead of a "#ifdef CONFIG_MODULES".
Fix it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250120125745.4ac90ca6@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501190121.E2CIJuUj-lkp@intel.com/
Fixes:
|
||
![]() |
31f505dc70 |
ftrace: Implement :mod: cache filtering on kernel command line
Module functions can be set to set_ftrace_filter before the module is loaded. # echo :mod:snd_hda_intel > set_ftrace_filter This will enable all the functions for the module snd_hda_intel. If that module is not loaded, it is "cached" in the trace array for when the module is loaded, its functions will be traced. But this is not implemented in the kernel command line. That's because the kernel command line filtering is added very early in boot up as it is needed to be done before boot time function tracing can start, which is also available very early in boot up. The code used by the "set_ftrace_filter" file can not be used that early as it depends on some other initialization to occur first. But some of the functions can. Implement the ":mod:" feature of "set_ftrace_filter" in the kernel command line parsing. Now function tracing on just a single module that is loaded at boot up can be done. Adding: ftrace=function ftrace_filter=:mod:sna_hda_intel To the kernel command line will only enable the sna_hda_intel module functions when the module is loaded, and it will start tracing. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250116175832.34e39779@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
8275637215 |
tracing: Adopt __free() and guard() for trace_fprobe.c
Adopt __free() and guard() for trace_fprobe.c to remove gotos. Link: https://lore.kernel.org/173708043449.319651.12242878905778792182.stgit@mhiramat.roam.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
b355247df1 |
tracing: Cache ":mod:" events for modules not loaded yet
When the :mod: command is written into /sys/kernel/tracing/set_event (or that file within an instance), if the module specified after the ":mod:" is not yet loaded, it will store that string internally. When the module is loaded, it will enable the events as if the module was loaded when the string was written into the set_event file. This can also be useful to enable events that are in the init section of the module, as the events are enabled before the init section is executed. This also works on the kernel command line: trace_event=:mod:<module> Will enable the events for <module> when it is loaded. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250116143533.514730995@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4c86bc531e |
tracing: Add :mod: command to enabled module events
Add a :mod: command to enable only events from a given module from the set_events file. echo '*:mod:<module>' > set_events Or echo ':mod:<module>' > set_events Will enable all events for that module. Specific events can also be enabled via: echo '<event>:mod:<module>' > set_events Or echo '<system>:<event>:mod:<module>' > set_events Or echo '*:<event>:mod:<module>' > set_events The ":mod:" keyword is consistent with the function tracing filter to enable functions from a given module. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250116143533.214496360@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
87c544108b |
bpf: Send signals asynchronously if !preemptible
BPF programs can execute in all kinds of contexts and when a program
running in a non-preemptible context uses the bpf_send_signal() kfunc,
it will cause issues because this kfunc can sleep.
Change `irqs_disabled()` to `!preemptible()`.
Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/
Fixes:
|
||
![]() |
24e0e61040 |
tracing: Print lazy preemption model
Print lazy preemption model in ftrace header when latency-format=1. # cat /sys/kernel/debug/sched/preempt none voluntary full (lazy) Without patch: latency: 0 us, #232946/232946, CPU#40 | (M:unknown VP:0, KP:0, SP:0 HP:0 #P:80) ^^^^^^^ With Patch: latency: 0 us, #1897938/25566788, CPU#16 | (M:lazy VP:0, KP:0, SP:0 HP:0 #P:80) ^^^^ Now that lazy preemption is part of the kernel, make sure the tracing infrastructure reflects that. Link: https://lore.kernel.org/20250103093647.575919-1-sshegde@linux.ibm.com Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
a485ea9e3e |
tracing: Fix irqsoff and wakeup latency tracers when using function graph
The function graph tracer has become generic so that kretprobes and BPF
can use it along with function graph tracing itself. Some of the
infrastructure was specific for function graph tracing such as recording
the calltime and return time of the functions. Calling the clock code on a
high volume function does add overhead. The calculation of the calltime
was removed from the generic code and placed into the function graph
tracer itself so that the other users did not incur this overhead as they
did not need that timestamp.
The calltime field was still kept in the generic return entry structure
and the function graph return entry callback filled it as that structure
was passed to other code.
But this broke both irqsoff and wakeup latency tracer as they still
depended on the trace structure containing the calltime when the option
display-graph is set as it used some of those same functions that the
function graph tracer used. But now the calltime was not set and was just
zero. This caused the calculation of the function time to be the absolute
value of the return timestamp and not the length of the function.
# cd /sys/kernel/tracing
# echo 1 > options/display-graph
# echo irqsoff > current_tracer
The tracers went from:
# REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
0 us | 4) <idle>-0 | d..1. | 0.000 us | irqentry_enter();
3 us | 4) <idle>-0 | d..2. | | irq_enter_rcu() {
4 us | 4) <idle>-0 | d..2. | 0.431 us | preempt_count_add();
5 us | 4) <idle>-0 | d.h2. | | tick_irq_enter() {
5 us | 4) <idle>-0 | d.h2. | 0.433 us | tick_check_oneshot_broadcast_this_cpu();
6 us | 4) <idle>-0 | d.h2. | 2.426 us | ktime_get();
9 us | 4) <idle>-0 | d.h2. | | tick_nohz_stop_idle() {
10 us | 4) <idle>-0 | d.h2. | 0.398 us | nr_iowait_cpu();
11 us | 4) <idle>-0 | d.h1. | 1.903 us | }
11 us | 4) <idle>-0 | d.h2. | | tick_do_update_jiffies64() {
12 us | 4) <idle>-0 | d.h2. | | _raw_spin_lock() {
12 us | 4) <idle>-0 | d.h2. | 0.360 us | preempt_count_add();
13 us | 4) <idle>-0 | d.h3. | 0.354 us | do_raw_spin_lock();
14 us | 4) <idle>-0 | d.h2. | 2.207 us | }
15 us | 4) <idle>-0 | d.h3. | 0.428 us | calc_global_load();
16 us | 4) <idle>-0 | d.h3. | | _raw_spin_unlock() {
16 us | 4) <idle>-0 | d.h3. | 0.380 us | do_raw_spin_unlock();
17 us | 4) <idle>-0 | d.h3. | 0.334 us | preempt_count_sub();
18 us | 4) <idle>-0 | d.h1. | 1.768 us | }
18 us | 4) <idle>-0 | d.h2. | | update_wall_time() {
[..]
To:
# REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
0 us | 5) <idle>-0 | d.s2. | 0.000 us | _raw_spin_lock_irqsave();
0 us | 5) <idle>-0 | d.s3. | 312159583 us | preempt_count_add();
2 us | 5) <idle>-0 | d.s4. | 312159585 us | do_raw_spin_lock();
3 us | 5) <idle>-0 | d.s4. | | _raw_spin_unlock() {
3 us | 5) <idle>-0 | d.s4. | 312159586 us | do_raw_spin_unlock();
4 us | 5) <idle>-0 | d.s4. | 312159587 us | preempt_count_sub();
4 us | 5) <idle>-0 | d.s2. | 312159587 us | }
5 us | 5) <idle>-0 | d.s3. | | _raw_spin_lock() {
5 us | 5) <idle>-0 | d.s3. | 312159588 us | preempt_count_add();
6 us | 5) <idle>-0 | d.s4. | 312159589 us | do_raw_spin_lock();
7 us | 5) <idle>-0 | d.s3. | 312159590 us | }
8 us | 5) <idle>-0 | d.s4. | 312159591 us | calc_wheel_index();
9 us | 5) <idle>-0 | d.s4. | | enqueue_timer() {
9 us | 5) <idle>-0 | d.s4. | | wake_up_nohz_cpu() {
11 us | 5) <idle>-0 | d.s4. | | native_smp_send_reschedule() {
11 us | 5) <idle>-0 | d.s4. | 312171987 us | default_send_IPI_single_phys();
12408 us | 5) <idle>-0 | d.s3. | 312171990 us | }
12408 us | 5) <idle>-0 | d.s3. | 312171991 us | }
12409 us | 5) <idle>-0 | d.s3. | 312171991 us | }
Where the calculation of the time for each function was the return time
minus zero and not the time of when the function returned.
Have these tracers also save the calltime in the fgraph data section and
retrieve it again on the return to get the correct timings again.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250113183124.61767419@gandalf.local.home
Fixes:
|
||
![]() |
6e31b759b0 |
ring-buffer: Make reading page consistent with the code logic
In the loop of __rb_map_vma(), the 's' variable is calculated from the same logic that nr_pages is and they both come from nr_subbufs. But the relationship is not obvious and there's a WARN_ON_ONCE() around the 's' variable to make sure it never becomes equal to nr_subbufs within the loop. If that happens, then the code is buggy and needs to be fixed. The 'page' variable is calculated from cpu_buffer->subbuf_ids[s] which is an array of 'nr_subbufs' entries. If the code becomes buggy and 's' becomes equal to or greater than 'nr_subbufs' then this will be an out of bounds hit before the WARN_ON() is triggered and the code exiting safely. Make the 'page' initialization consistent with the code logic and assign it after the out of bounds check. Link: https://lore.kernel.org/20250110162612.13983-1-aha310510@gmail.com Signed-off-by: Jeongjun Park <aha310510@gmail.com> [ sdr: rewrote change log ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
0568c6ebf0 |
ring-buffer: Check for empty ring-buffer with rb_num_of_entries()
Currently there are two ways of identifying an empty ring-buffer. One relying on the current status of the commit / reader page (rb_per_cpu_empty()) and the other on the write and read counters (rb_num_of_entries() used in rb_get_reader_page()). with rb_num_of_entries(). This intends to ease later introduction of ring-buffer writers which are out of the kernel control and with whom, the only information available is through the meta-page counters. Link: https://lore.kernel.org/20250108114536.627715-2-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4f7caaa2f9 |
bpf: Use ftrace_get_symaddr() for kprobe_multi probes
Add ftrace_get_entry_ip() which is only for ftrace based probes, and use it for kprobe multi probes because they are based on fprobe which uses ftrace instead of kprobes. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173566081414.878879.10631096557346094362.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
927054606d |
tracing/kprobes: Simplify __trace_kprobe_create() by removing gotos
Simplify __trace_kprobe_create() by removing gotos. Link: https://lore.kernel.org/all/173643301102.1514810.6149004416601259466.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
7dcc352078 |
tracing: Use __free() for kprobe events to cleanup
Use __free() in trace_kprobe.c to cleanup code. Link: https://lore.kernel.org/all/173643299989.1514810.2924926552980462072.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4af0532a0f |
tracing: Use __free() in trace_probe for cleanup
Use __free() in trace_probe to cleanup some gotos. Link: https://lore.kernel.org/all/173643298860.1514810.7267350121047606213.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4e83017e4c |
tracing/eprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in eprobe events for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289890996.73724.17421347964110362029.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
![]() |
f8821732dc |
tracing/uprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in uprobe events for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289889911.73724.12457932738419630525.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
![]() |
2cba0070cd |
tracing/kprobe: Adopt guard() and scoped_guard()
Use guard() or scoped_guard() in kprobe events for critical sections rather than discrete lock/unlock pairs. Link: https://lore.kernel.org/all/173289888883.73724.6586200652276577583.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
![]() |
30c8fd31c5 |
tracing/kprobes: Fix to free objects when failed to copy a symbol
In __trace_kprobe_create(), if something fails it must goto error block
to free objects. But when strdup() a symbol, it returns without that.
Fix it to goto the error block to free objects correctly.
Link: https://lore.kernel.org/all/173643297743.1514810.2408159540454241947.stgit@devnote2/
Fixes:
|
||
![]() |
2ebadb60cb |
bpf: Return error for missed kprobe multi bpf program execution
When kprobe multi bpf program can't be executed due to recursion check, we currently return 0 (success) to fprobe layer where it's ignored for standard kprobe multi probes. For kprobe session the success return value will make fprobe layer to install return probe and try to execute it as well. But the return session probe should not get executed, because the entry part did not run. FWIW the return probe bpf program most likely won't get executed, because its recursion check will likely fail as well, but we don't need to run it in the first place.. also we can make this clear and obvious. It also affects missed counts for kprobe session program execution, which are now doubled (extra count for not executed return probe). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250106175048.1443905-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
ca3c4f646a |
bpf: Move out synchronize_rcu_tasks_trace from mutex CS
Commit
|
||
![]() |
66fc6f521a |
tracing/hist: Support POLLPRI event for poll on histogram
Since POLLIN will not be flushed until the hist file is read, the user needs to repeatedly read() and poll() on the hist file for monitoring the event continuously. But the read() is somewhat redundant when the user is only monitoring for event updates. Add POLLPRI poll event on the hist file so the event returns when a histogram is updated after open(), poll() or read(). Thus it is possible to wait for the next event without having to issue a read(). Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527248770.464571.2536902137325258133.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
1bd13edbbe |
tracing/hist: Add poll(POLLIN) support on hist file
Add poll syscall support on the `hist` file. The Waiter will be waken up when the histogram is updated with POLLIN. Currently, there is no way to wait for a specific event in userspace. So user needs to peek the `trace` periodicaly, or wait on `trace_pipe`. But it is not a good idea to peek at the `trace` for an event that randomly happens. And `trace_pipe` is not coming back until a page is filled with events. This allows a user to wait for a specific event on the `hist` file. User can set a histogram trigger on the event which they want to monitor and poll() on its `hist` file. Since this poll() returns POLLIN, the next poll() will return soon unless a read() happens on that hist file. NOTE: To read the hist file again, you must set the file offset to 0, but just for monitoring the event, you may not need to read the histogram. Cc: Shuah Khan <shuah@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173527247756.464571.14236296701625509931.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
22bec11a56 |
tracing: Fix using ret variable in tracing_set_tracer()
When the function tracing_set_tracer() switched over to using the guard()
infrastructure, it did not need to save the 'ret' variable and would just
return the value when an error arised, instead of setting ret and jumping
to an out label.
When CONFIG_TRACER_SNAPSHOT is enabled, it had code that expected the
"ret" variable to be initialized to zero and had set 'ret' while holding
an arch_spin_lock() (not used by guard), and then upon releasing the lock
it would check 'ret' and exit if set. But because ret was only set when an
error occurred while holding the locks, 'ret' would be used uninitialized
if there was no error. The code in the CONFIG_TRACER_SNAPSHOT block should
be self contain. Make sure 'ret' is also set when no error occurred.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250106111143.2f90ff65@gandalf.local.home
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412271654.nJVBuwmF-lkp@intel.com/
Fixes:
|
||
![]() |
e30dd219c7 |
Fixes for ftrace in v6.13:
- Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ3cR4RQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qjxfAQCPhNztdmGmEYmuBtONPHwejidWnuJ6 Rl2mQxEbp40OUgD+JvSWofhRsvtXWlymqZ9j+dKMegLqMeq834hB0LK4NAg= =+KqV -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. * tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix function profiler's filtering functionality fgraph: Add READ_ONCE() when accessing fgraph_array[] |
||
![]() |
789a8cff8d |
ftrace: Fix function profiler's filtering functionality
Commit |
||
![]() |
d654740337 |
fgraph: Add READ_ONCE() when accessing fgraph_array[]
In __ftrace_return_to_handler(), a loop iterates over the fgraph_array[] elements, which are fgraph_ops. The loop checks if an element is a fgraph_stub to prevent using a fgraph_stub afterward. However, if the compiler reloads fgraph_array[] after this check, it might race with an update to fgraph_array[] that introduces a fgraph_stub. This could result in the stub being processed, but the stub contains a null "func_hash" field, leading to a NULL pointer dereference. To ensure that the gops compared against the fgraph_stub matches the gops processed later, add a READ_ONCE(). A similar patch appears in commit |
||
![]() |
afc6717628 |
tracing: Have process_string() also allow arrays
In order to catch a common bug where a TRACE_EVENT() TP_fast_assign()
assigns an address of an allocated string to the ring buffer and then
references it in TP_printk(), which can be executed hours later when the
string is free, the function test_event_printk() runs on all events as
they are registered to make sure there's no unwanted dereferencing.
It calls process_string() to handle cases in TP_printk() format that has
"%s". It returns whether or not the string is safe. But it can have some
false positives.
For instance, xe_bo_move() has:
TP_printk("move_lacks_source:%s, migrate object %p [size %zu] from %s to %s device_id:%s",
__entry->move_lacks_source ? "yes" : "no", __entry->bo, __entry->size,
xe_mem_type_to_name[__entry->old_placement],
xe_mem_type_to_name[__entry->new_placement], __get_str(device_id))
Where the "%s" references into xe_mem_type_to_name[]. This is an array of
pointers that should be safe for the event to access. Instead of flagging
this as a bad reference, if a reference points to an array, where the
record field is the index, consider it safe.
Link: https://lore.kernel.org/all/9dee19b6185d325d0e6fa5f7cbba81d007d99166.camel@sapience.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241231000646.324fb5f7@gandalf.local.home
Fixes:
|
||
![]() |
de6f45c2dd |
verification/dot2k: Auto patch current kernel source
dot2k suggests a list of changes to the kernel tree while adding a monitor: edit tracepoints header, Makefile, Kconfig and moving the monitor folder. Those changes can be easily run automatically. Add a flag to dot2k to alter the kernel source. The kernel source directory can be either assumed from the PWD, or from the running kernel, if installed. This feature works best if the kernel tree is a git repository, so that its easier to make sure there are no unintended changes. The main RV files (e.g. Makefile) have now a comment placeholder that can be useful for manual editing (e.g. to know where to add new monitors) and it is used by the script to append the required lines. We also slightly adapt the file handling functions in dot2k: __open_file is now called __read_file and also closes the file before returning the content; __create_file is now a more general __write_file, we no longer return on FileExistsError (not thrown while opening), a new __create_file simply calls __write_file specifying the monitor folder in the path. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20241227144752.362911-8-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
bc3d482dcc |
rv: Simplify manual steps in monitor creation
While creating a new monitor in RV, besides generating code from dot2k, there are a few manual steps which can be tedious and error prone, like adding the tracepoints, makefile lines and kconfig. This patch restructures the existing monitors to keep some files in the monitor's folder itself, which can be automatically generated by future versions of dot2k. Monitors have now their own Kconfig and tracepoint snippets. For simplicity, the main tracepoint definition, is moved to the RV directory, it defines only the tracepoint classes and includes the monitor-specific tracepoints, which reside in the monitor directory. Tracepoints and Kconfig no longer need to be copied and adapted from existing ones but only need to be included in the main files. The Makefile remains untouched since there's little advantage in having a separated Makefile for each monitor with a single line and including it in the main RV Makefile. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20241227144752.362911-6-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
411a678d30 |
Probes fixes for v6.13-rc4:
- tracing/kprobes: Change the priority of the module callback of kprobe events so that it is called after the jump label list on the module is updated. This ensures the kprobe can check whether it is not on the jump label address correctly. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmduAMgbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bJ6YH/2QBkWNTe3qjxdPsTxJ2 MyL2PO8tMwZbNSyYZ1yGnbguWUUKVkuiheS/qWhLNpuVEyb6Q9/Zuifh5rFqDbf0 Ug3YvsP7gQurmqDm1NGlnMic3zlmZaYDtXCKB+kiA3HO3iP92zesTJlasiok3aSd sQphxUzmG41BQUDN5/LktGjVb5juf3Xq6i6bdCd6wunUbGWCEE+XmFrg1oVnutES GTckUGswUBGbgkcVPc07UfKZpNzZdyZlmbVfOISCdYIAddUKftATN7SaUrM29oqC /lkUcxeXSVXBIUkbA1p50nfjYzTWNeXG92WrvMrRZjNivyMf/nUJnxrlHsv5h2Dy gtI= =d3Zj -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: "Change the priority of the module callback of kprobe events so that it is called after the jump label list on the module is updated. This ensures the kprobe can check whether it is not on the jump label address correctly" * tag 'probes-fixes-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/kprobe: Make trace_kprobe's module callback called after jump_label update |
||
![]() |
a2224559cb |
tracing/fprobe: Remove nr_maxactive from fprobe
Remove depercated fprobe::nr_maxactive. This involves fprobe events to rejects the maxactive number. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173519007257.391279.946804046982289337.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
b5fa903b7f |
fprobe: Add fprobe_header encoding feature
Fprobe store its data structure address and size on the fgraph return stack by __fprobe_header. But most 64bit architecture can combine those to one unsigned long value because 4 MSB in the kernel address are the same. With this encoding, fprobe can consume less space on ret_stack. This introduces asm/fprobe.h to define arch dependent encode/decode macros. Note that since fprobe depends on CONFIG_HAVE_FUNCTION_GRAPH_FREGS, currently only arm64, loongarch, riscv, s390 and x86 are supported. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173519005783.391279.5307910947400277525.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4346ba1604 |
fprobe: Rewrite fprobe on function-graph tracer
Rewrite fprobe implementation on function-graph tracer. Major API changes are: - 'nr_maxactive' field is deprecated. - This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So currently works only on x86_64. - Currently the entry size is limited in 15 * sizeof(long). - If there is too many fprobe exit handler set on the same function, it will fail to probe. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/173519003970.391279.14406792285453830996.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
a762e9267d |
ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC
Add CONFIG_HAVE_FTRACE_GRAPH_FUNC kconfig in addition to ftrace_graph_func macro check. This is for the other feature (e.g. FPROBE) which requires to access ftrace_regs from fgraph_ops::entryfunc() can avoid compiling if the fgraph can not pass the valid ftrace_regs. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173519001472.391279.1174901685282588467.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
8e2759da93 |
bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is converted from ftrace_regs by ftrace_partial_regs(), thus some registers may always returns 0. But it should be enough for function entry (access arguments) and exit (access return value). Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173519000417.391279.14011193569589886419.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Florent Revest <revest@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
0566cefe73 |
tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
Allow fprobe events to be enabled with CONFIG_DYNAMIC_FTRACE_WITH_ARGS. With this change, fprobe events mostly use ftrace_regs instead of pt_regs. Note that if the arch doesn't enable HAVE_FTRACE_REGS_HAVING_PT_REGS, fprobe events will not be able to be used from perf. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173518999352.391279.13332699755290175168.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
762abbc0d0 |
fprobe: Use ftrace_regs in fprobe exit handler
Change the fprobe exit handler to use ftrace_regs structure instead of pt_regs. This also introduce HAVE_FTRACE_REGS_HAVING_PT_REGS which means the ftrace_regs is including the pt_regs so that ftrace_regs can provide pt_regs without memory allocation. Fprobe introduces a new dependency with that. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: bpf <bpf@vger.kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Song Liu <song@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Cc: Matt Bobrowski <mattbobrowski@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Hao Luo <haoluo@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/173518995092.391279.6765116450352977627.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
46bc082388 |
fprobe: Use ftrace_regs in fprobe entry handler
This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe on arm64. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173518994037.391279.2786805566359674586.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Florent Revest <revest@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
2ca8c112c9 |
fgraph: Pass ftrace_regs to retfunc
Pass ftrace_regs to the fgraph_ops::retfunc(). If ftrace_regs is not available, it passes a NULL instead. User callback function can access some registers (including return address) via this ftrace_regs. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/173518992972.391279.14055405490327765506.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
a3ed4157b7 |
fgraph: Replace fgraph_ret_regs with ftrace_regs
Use ftrace_regs instead of fgraph_ret_regs for tracing return value on function_graph tracer because of simplifying the callback interface. The CONFIG_HAVE_FUNCTION_GRAPH_RETVAL is also replaced by CONFIG_HAVE_FUNCTION_GRAPH_FREGS. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173518991508.391279.16635322774382197642.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
41705c4262 |
fgraph: Pass ftrace_regs to entryfunc
Pass ftrace_regs to the fgraph_ops::entryfunc(). If ftrace_regs is not available, it passes a NULL instead. User callback function can access some registers (including return address) via this ftrace_regs. Note that the ftrace_regs can be NULL when the arch does NOT define: HAVE_DYNAMIC_FTRACE_WITH_ARGS or HAVE_DYNAMIC_FTRACE_WITH_REGS. More specifically, if HAVE_DYNAMIC_FTRACE_WITH_REGS is defined but not the HAVE_DYNAMIC_FTRACE_WITH_ARGS, and the ftrace ops used to register the function callback does not set FTRACE_OPS_FL_SAVE_REGS. In this case, ftrace_regs can be NULL in user callback. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173518990044.391279.17406984900626078579.stgit@devnote2 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
9e49ca756d |
tracing/string: Create and use __free(argv_free) in trace_dynevent.c
The function dyn_event_release() uses argv_split() which must be freed via argv_free(). It contains several error paths that do a goto out to call argv_free() for cleanup. This makes the code complex and error prone. Create a new __free() directive __free(argv_free) that will call argv_free() for data allocated with argv_split(), and use it in the dyn_event_release() function. Cc: Kees Cook <kees@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Shevchenko <andy@kernel.org> Cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/20241220103313.4a74ec8e@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
08b7673171 |
tracing: Switch trace_stat.c code over to use guard()
There are a couple functions in trace_stat.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.870318466@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
6c05353e4f |
tracing: Switch trace_stack.c code over to use guard()
The function stack_trace_sysctl() uses a goto on the error path to jump to the mutex_unlock() code. Replace the logic to use guard() and let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.684913592@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
930d2b32c0 |
tracing: Switch trace_osnoise.c code over to use guard() and __free()
The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has various gotos to handle unlocking them. Switch them over to guard() and let the compiler worry about it. The osnoise_cpus_read() has a temporary mask_str allocated and there's some gotos to make sure it gets freed on error paths. Switch that over to __free() to let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
a2e27e1bb1 |
tracing: Switch trace_events_synth.c code over to use guard()
There are a couple functions in trace_events_synth.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.371082515@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
076796f74e |
tracing: Switch trace_events_filter.c code over to use guard()
There are a couple functions in trace_events_filter.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201346.200737679@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
63c7264168 |
tracing: Switch trace_events_trigger.c code over to use guard()
There are a few functions in trace_events_trigger.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Also use __free() for free a temporary buffer in event_trigger_regex_write(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241220110621.639d3bc8@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
2b36a97aee |
tracing: Switch trace_events_hist.c code over to use guard()
There are a couple functions in trace_events_hist.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.694601480@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
59980d9b0b |
tracing: Switch trace_events.c code over to use guard()
There are several functions in trace_events.c that have "goto out;" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Some locations did some simple arithmetic after releasing the lock. As this causes no real overhead for holding a mutex while processing the file position (*ppos += cnt;) let the lock be held over this logic too. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.522546095@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4b8d63e5b6 |
tracing: Simplify event_enable_func() goto_reg logic
Currently there's an "out_reg:" label that gets jumped to if there's no parameters to process. Instead, make it a proper "if (param) { }" block as there's not much to do for the parameter processing, and remove the "out_reg:" label. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.354746196@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
c949dfb974 |
tracing: Simplify event_enable_func() goto out_free logic
The event_enable_func() function allocates the data descriptor early in the function just to assign its data->count value via: kstrtoul(number, 0, &data->count); This makes the code more complex as there are several error paths before the data descriptor is actually used. This means there needs to be a goto out_free; to clean it up. Use a local variable "count" to do the update and move the data allocation just before it is used. This removes the "out_free" label as the data can be freed on the failure path of where it is used. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.190820140@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
cad1d5bd2c |
tracing: Have event_enable_write() just return error on error
The event_enable_write() function is inconsistent in how it returns errors. Sometimes it updates the ppos parameter and sometimes it doesn't. Simplify the code to just return an error or the count if there isn't an error. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201345.025284170@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
d1e27ee9c6 |
tracing: Return -EINVAL if a boot tracer tries to enable the mmiotracer at boot
The mmiotracer is not set to be enabled at boot up from the kernel command line. If the boot command line tries to enable that tracer, it will fail to be enabled. The return code is currently zero when that happens so the caller just thinks it was enabled. Return -EINVAL in this case. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241219201344.854254394@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
d33b10c0c7 |
tracing: Switch trace.c code over to use guard()
There are several functions in trace.c that have "goto out;" or equivalent on error in order to release locks or free values that were allocated. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex or freeing on error over to using the guard(mutex)() and __free() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. There's one place that should probably return an error but instead return 0. This does not change the return as the only changes are to do the conversion without changing the logic. Fixing that location will have to come later. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20241224221413.7b8c68c3@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
98feccbf32 |
tracing: Prevent bad count for tracing_cpumask_write
If a large count is provided, it will trigger a warning in bitmap_parse_user.
Also check zero for it.
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
d576aec24d |
fgraph: Get ftrace recursion lock in function_graph_enter
Get the ftrace recursion lock in the generic function_graph_enter() instead of each architecture code. This changes all function_graph tracer callbacks running in non-preemptive state. On x86 and powerpc, this is by default, but on the other architecutres, this will be new. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Florent Revest <revest@chromium.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: bpf <bpf@vger.kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Alan Maguire <alan.maguire@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/173379653720.973433.18438622234884980494.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
1d95fd9d6b |
ftrace: Switch ftrace.c code over to use guard()
There are a few functions in ftrace.c that have "goto out" or equivalent on error in order to release locks that were taken. This can be error prone or just simply make the code more complex. Switch every location that ends with unlocking a mutex on error over to using the guard(mutex)() infrastructure to let the compiler worry about releasing locks. This makes the code easier to read and understand. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.718001540@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
77e53cb2fc |
ftrace: Remove unneeded goto jumps
There are some goto jumps to exit a program to just return a value. The code after the label doesn't free anything nor does it do any unlocks. It simply returns the variable that was set before the jump. Remove these unneeded goto jumps. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.544855549@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
ac8c3b02fc |
ftrace: Do not disable interrupts in profiler
The function profiler disables interrupts before processing. This was there since the profiler was introduced back in 2009 when there were recursion issues to deal with. The function tracer is much more robust today and has its own internal recursion protection. There's no reason to disable interrupts in the function profiler. Instead, just disable preemption and use the guard() infrastructure while at it. Before this change: ~# echo 1 > /sys/kernel/tracing/function_profile_enabled ~# perf stat -r 10 ./hackbench 10 Time: 3.099 Time: 2.556 Time: 2.500 Time: 2.705 Time: 2.985 Time: 2.959 Time: 2.859 Time: 2.621 Time: 2.742 Time: 2.631 Performance counter stats for '/work/c/hackbench 10' (10 runs): 23,156.77 msec task-clock # 6.951 CPUs utilized ( +- 2.36% ) 18,306 context-switches # 790.525 /sec ( +- 5.95% ) 495 cpu-migrations # 21.376 /sec ( +- 8.61% ) 11,522 page-faults # 497.565 /sec ( +- 1.80% ) 47,967,124,606 cycles # 2.071 GHz ( +- 0.41% ) 80,009,078,371 instructions # 1.67 insn per cycle ( +- 0.34% ) 16,389,249,798 branches # 707.752 M/sec ( +- 0.36% ) 139,943,109 branch-misses # 0.85% of all branches ( +- 0.61% ) 3.332 +- 0.101 seconds time elapsed ( +- 3.04% ) After this change: ~# echo 1 > /sys/kernel/tracing/function_profile_enabled ~# perf stat -r 10 ./hackbench 10 Time: 1.869 Time: 1.428 Time: 1.575 Time: 1.569 Time: 1.685 Time: 1.511 Time: 1.611 Time: 1.672 Time: 1.724 Time: 1.715 Performance counter stats for '/work/c/hackbench 10' (10 runs): 13,578.21 msec task-clock # 6.931 CPUs utilized ( +- 2.23% ) 12,736 context-switches # 937.973 /sec ( +- 3.86% ) 341 cpu-migrations # 25.114 /sec ( +- 5.27% ) 11,378 page-faults # 837.960 /sec ( +- 1.74% ) 27,638,039,036 cycles # 2.035 GHz ( +- 0.27% ) 45,107,762,498 instructions # 1.63 insn per cycle ( +- 0.23% ) 8,623,868,018 branches # 635.125 M/sec ( +- 0.27% ) 125,738,443 branch-misses # 1.46% of all branches ( +- 0.32% ) 1.9590 +- 0.0484 seconds time elapsed ( +- 2.47% ) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20241223184941.373853944@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
7d137e604a |
fgraph: Remove unnecessary disabling of interrupts and recursion
The function graph tracer disables interrupts as well as prevents
recursion via NMIs when recording the graph tracer code. There's no reason
to do this today. That disabling goes back to 2008 when the function graph
tracer was first introduced and recursion protection wasn't part of the
code.
Today, there's no reason to disable interrupts or prevent the code from
recursing as the infrastructure can easily handle it.
Before this change:
~# echo function_graph > /sys/kernel/tracing/current_tracer
~# perf stat -r 10 ./hackbench 10
Time: 4.240
Time: 4.236
Time: 4.106
Time: 4.014
Time: 4.314
Time: 3.830
Time: 4.063
Time: 4.323
Time: 3.763
Time: 3.727
Performance counter stats for '/work/c/hackbench 10' (10 runs):
33,937.20 msec task-clock # 7.008 CPUs utilized ( +- 1.85% )
18,220 context-switches # 536.874 /sec ( +- 6.41% )
624 cpu-migrations # 18.387 /sec ( +- 9.07% )
11,319 page-faults # 333.528 /sec ( +- 1.97% )
76,657,643,617 cycles # 2.259 GHz ( +- 0.40% )
141,403,302,768 instructions # 1.84 insn per cycle ( +- 0.37% )
25,518,463,888 branches # 751.932 M/sec ( +- 0.35% )
156,151,050 branch-misses # 0.61% of all branches ( +- 0.63% )
4.8423 +- 0.0892 seconds time elapsed ( +- 1.84% )
After this change:
~# echo function_graph > /sys/kernel/tracing/current_tracer
~# perf stat -r 10 ./hackbench 10
Time: 3.340
Time: 3.192
Time: 3.129
Time: 2.579
Time: 2.589
Time: 2.798
Time: 2.791
Time: 2.955
Time: 3.044
Time: 3.065
Performance counter stats for './hackbench 10' (10 runs):
24,416.30 msec task-clock # 6.996 CPUs utilized ( +- 2.74% )
16,764 context-switches # 686.590 /sec ( +- 5.85% )
469 cpu-migrations # 19.208 /sec ( +- 6.14% )
11,519 page-faults # 471.775 /sec ( +- 1.92% )
53,895,628,450 cycles # 2.207 GHz ( +- 0.52% )
105,552,664,638 instructions # 1.96 insn per cycle ( +- 0.47% )
17,808,672,667 branches # 729.376 M/sec ( +- 0.48% )
133,075,435 branch-misses # 0.75% of all branches ( +- 0.59% )
3.490 +- 0.112 seconds time elapsed ( +- 3.22% )
Also removed unneeded "unlikely()" around the retaddr code.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20241223184941.204074053@goodmis.org
Fixes:
|
||
![]() |
ccb9868ab7 |
blktrace: remove redundant return at end of function
A recent change added return 0 before an existing return statement at the end of function blk_trace_setup. The final return is now redundant, so remove it. Fixes: 64d124798244 ("blktrace: move copy_[to|from]_user() out of ->debugfs_lock") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20241204150450.399005-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
b769a2f409 |
blktrace: move copy_[to|from]_user() out of ->debugfs_lock
Move copy_[to|from]_user() out of ->debugfs_lock and cut the dependency between mm->mmap_lock and q->debugfs_lock, then we avoids lots of lockdep false positive warning. Obviously ->debug_lock isn't needed for copy_[to|from]_user(). The only behavior change is to call blk_trace_remove() in case of setup failure handling by re-grabbing ->debugfs_lock, and this way is just fine since we do cover concurrent setup() & remove(). Reported-by: syzbot+91585b36b538053343e4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-block/67450fd4.050a0220.1286eb.0007.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/6742e584.050a0220.1cc393.0038.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/6742a600.050a0220.1cc393.002e.GAE@google.com/ Closes: https://lore.kernel.org/linux-block/67420102.050a0220.1cc393.0019.GAE@google.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241128125029.4152292-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
fd9b0244f5 |
blktrace: don't centralize grabbing q->debugfs_mutex in blk_trace_ioctl
Call each handler directly and the handler do grab q->debugfs_mutex, prepare for killing dependency between ->debug_mutex and ->mmap_lock. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241128125029.4152292-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
d685d55dfc |
tracing/kprobe: Make trace_kprobe's module callback called after jump_label update
Make sure the trace_kprobe's module notifer callback function is called
after jump_label's callback is called. Since the trace_kprobe's callback
eventually checks jump_label address during registering new kprobe on
the loading module, jump_label must be updated before this registration
happens.
Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/
Fixes:
|
||
![]() |
5b83bcdea5 |
ring-buffer fixes for v6.13:
- Fix possible overflow of mmapped ring buffer with bad offset If the mmap() to the ring buffer passes in a start address that is passed the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds - Do not use TP_printk() to boot mapped ring buffers As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2QuXRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qncvAQDf2s2WWsy4pYp2mpRtBXvAPf6tpBdi J9eceJQbwJVJHAEApQjEFfbUxLh2WgPU1Cn++PwDA+NLiru70+S0vtDLWwE= =OI+v -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fixes from Steven Rostedt: - Fix possible overflow of mmapped ring buffer with bad offset If the mmap() to the ring buffer passes in a start address that is passed the end of the mmapped file, it is not caught and a slab-out-of-bounds is triggered. Add a check to make sure the start address is within the bounds - Do not use TP_printk() to boot mapped ring buffers As a boot mapped ring buffer's data may have pointers that map to the previous boot's memory map, it is unsafe to allow the TP_printk() to be used to read the boot mapped buffer's events. If a TP_printk() points to a static string from within the kernel it will not match the current kernel mapping if KASLR is active, and it can fault. Have it simply print out the raw fields. * tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers ring-buffer: Fix overflow in __rb_map_vma |
||
![]() |
8cd63406d0 |
trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
The TP_printk() of a TRACE_EVENT() is a generic printf format that any
developer can create for their event. It may include pointers to strings
and such. A boot mapped buffer may contain data from a previous kernel
where the strings addresses are different.
One solution is to copy the event content and update the pointers by the
recorded delta, but a simpler solution (for now) is to just use the
print_fields() function to print these events. The print_fields() function
just iterates the fields and prints them according to what type they are,
and ignores the TP_printk() format from the event itself.
To understand the difference, when printing via TP_printk() the output
looks like this:
4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache
But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields)
the same event output looks like this:
4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0)
4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home
Fixes:
|
||
![]() |
c58a812c8e |
ring-buffer: Fix overflow in __rb_map_vma
An overflow occurred when performing the following calculation:
nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
Add a check before the calculation to avoid this problem.
syzbot reported this as a slab-out-of-bounds in __rb_map_vma:
BUG: KASAN: slab-out-of-bounds in __rb_map_vma+0x9ab/0xae0 kernel/trace/ring_buffer.c:7058
Read of size 8 at addr ffff8880767dd2b8 by task syz-executor187/5836
CPU: 0 UID: 0 PID: 5836 Comm: syz-executor187 Not tainted 6.13.0-rc2-syzkaller-00159-gf932fb9b4074 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xc3/0x620 mm/kasan/report.c:489
kasan_report+0xd9/0x110 mm/kasan/report.c:602
__rb_map_vma+0x9ab/0xae0 kernel/trace/ring_buffer.c:7058
ring_buffer_map+0x56e/0x9b0 kernel/trace/ring_buffer.c:7138
tracing_buffers_mmap+0xa6/0x120 kernel/trace/trace.c:8482
call_mmap include/linux/fs.h:2183 [inline]
mmap_file mm/internal.h:124 [inline]
__mmap_new_file_vma mm/vma.c:2291 [inline]
__mmap_new_vma mm/vma.c:2355 [inline]
__mmap_region+0x1786/0x2670 mm/vma.c:2456
mmap_region+0x127/0x320 mm/mmap.c:1348
do_mmap+0xc00/0xfc0 mm/mmap.c:496
vm_mmap_pgoff+0x1ba/0x360 mm/util.c:580
ksys_mmap_pgoff+0x32c/0x5c0 mm/mmap.c:542
__do_sys_mmap arch/x86/kernel/sys_x86_64.c:89 [inline]
__se_sys_mmap arch/x86/kernel/sys_x86_64.c:82 [inline]
__x64_sys_mmap+0x125/0x190 arch/x86/kernel/sys_x86_64.c:82
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The reproducer for this bug is:
------------------------8<-------------------------
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <asm/types.h>
#include <sys/mman.h>
int main(int argc, char **argv)
{
int page_size = getpagesize();
int fd;
void *meta;
system("echo 1 > /sys/kernel/tracing/buffer_size_kb");
fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);
meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, page_size * 5);
}
------------------------>8-------------------------
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
c061cf420d |
tracing fixes for v6.13:
- Replace trace_check_vprintf() with test_event_printk() and ignore_event() The function test_event_printk() checks on boot up if the trace event printf() formats dereference any pointers, and if they do, it then looks at the arguments to make sure that the pointers they dereference will exist in the event on the ring buffer. If they do not, it issues a WARN_ON() as it is a likely bug. But this isn't the case for the strings that can be dereferenced with "%s", as some trace events (notably RCU and some IPI events) save a pointer to a static string in the ring buffer. As the string it points to lives as long as the kernel is running, it is not a bug to reference it, as it is guaranteed to be there when the event is read. But it is also possible (and a common bug) to point to some allocated string that could be freed before the trace event is read and the dereference is to bad memory. This case requires a run time check. The previous way to handle this was with trace_check_vprintf() that would process the printf format piece by piece and send what it didn't care about to vsnprintf() to handle arguments that were not strings. This kept it from having to reimplement vsnprintf(). But it relied on va_list implementation and for architectures that copied the va_list and did not pass it by reference, it wasn't even possible to do this check and it would be skipped. As 64bit x86 passed va_list by reference, most events were tested and this kept out bugs where strings would have been dereferenced after being freed. Instead of relying on the implementation of va_list, extend the boot up test_event_printk() function to validate all the "%s" strings that can be validated at boot, and for the few events that point to strings outside the ring buffer, flag both the event and the field that is dereferenced as "needs_test". Then before the event is printed, a call to ignore_event() is made, and if the event has the flag set, it iterates all its fields and for every field that is to be tested, it will read the pointer directly from the event in the ring buffer and make sure that it is valid. If the pointer is not valid, it will print a WARN_ON(), print out to the trace that the event has unsafe memory and ignore the print format. With this new update, the trace_check_vprintf() can be safely removed and now all events can be verified regardless of architecture. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ2IqiRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qlfgAP9hJFl6zhA5GGRo905G9JWFHkbNNjgp WfQ0oMU2Eo1q+AEAmb5d3wWfWJAa+AxiiDNeZ28En/+ZbmjhSe6fPpR4egU= =LRKi -----END PGP SIGNATURE----- Merge tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Replace trace_check_vprintf() with test_event_printk() and ignore_event() The function test_event_printk() checks on boot up if the trace event printf() formats dereference any pointers, and if they do, it then looks at the arguments to make sure that the pointers they dereference will exist in the event on the ring buffer. If they do not, it issues a WARN_ON() as it is a likely bug. But this isn't the case for the strings that can be dereferenced with "%s", as some trace events (notably RCU and some IPI events) save a pointer to a static string in the ring buffer. As the string it points to lives as long as the kernel is running, it is not a bug to reference it, as it is guaranteed to be there when the event is read. But it is also possible (and a common bug) to point to some allocated string that could be freed before the trace event is read and the dereference is to bad memory. This case requires a run time check. The previous way to handle this was with trace_check_vprintf() that would process the printf format piece by piece and send what it didn't care about to vsnprintf() to handle arguments that were not strings. This kept it from having to reimplement vsnprintf(). But it relied on va_list implementation and for architectures that copied the va_list and did not pass it by reference, it wasn't even possible to do this check and it would be skipped. As 64bit x86 passed va_list by reference, most events were tested and this kept out bugs where strings would have been dereferenced after being freed. Instead of relying on the implementation of va_list, extend the boot up test_event_printk() function to validate all the "%s" strings that can be validated at boot, and for the few events that point to strings outside the ring buffer, flag both the event and the field that is dereferenced as "needs_test". Then before the event is printed, a call to ignore_event() is made, and if the event has the flag set, it iterates all its fields and for every field that is to be tested, it will read the pointer directly from the event in the ring buffer and make sure that it is valid. If the pointer is not valid, it will print a WARN_ON(), print out to the trace that the event has unsafe memory and ignore the print format. With this new update, the trace_check_vprintf() can be safely removed and now all events can be verified regardless of architecture" * tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Check "%s" dereference via the field and not the TP_printk format tracing: Add "%s" check in test_event_printk() tracing: Add missing helper functions in event pointer dereference check tracing: Fix test_event_printk() to process entire print argument |
||
![]() |
afd2627f72 |
tracing: Check "%s" dereference via the field and not the TP_printk format
The TP_printk() portion of a trace event is executed at the time a event
is read from the trace. This can happen seconds, minutes, hours, days,
months, years possibly later since the event was recorded. If the print
format contains a dereference to a string via "%s", and that string was
allocated, there's a chance that string could be freed before it is read
by the trace file.
To protect against such bugs, there are two functions that verify the
event. The first one is test_event_printk(), which is called when the
event is created. It reads the TP_printk() format as well as its arguments
to make sure nothing may be dereferencing a pointer that was not copied
into the ring buffer along with the event. If it is, it will trigger a
WARN_ON().
For strings that use "%s", it is not so easy. The string may not reside in
the ring buffer but may still be valid. Strings that are static and part
of the kernel proper which will not be freed for the life of the running
system, are safe to dereference. But to know if it is a pointer to a
static string or to something on the heap can not be determined until the
event is triggered.
This brings us to the second function that tests for the bad dereferencing
of strings, trace_check_vprintf(). It would walk through the printf format
looking for "%s", and when it finds it, it would validate that the pointer
is safe to read. If not, it would produces a WARN_ON() as well and write
into the ring buffer "[UNSAFE-MEMORY]".
The problem with this is how it used va_list to have vsnprintf() handle
all the cases that it didn't need to check. Instead of re-implementing
vsnprintf(), it would make a copy of the format up to the %s part, and
call vsnprintf() with the current va_list ap variable, where the ap would
then be ready to point at the string in question.
For architectures that passed va_list by reference this was possible. For
architectures that passed it by copy it was not. A test_can_verify()
function was used to differentiate between the two, and if it wasn't
possible, it would disable it.
Even for architectures where this was feasible, it was a stretch to rely
on such a method that is undocumented, and could cause issues later on
with new optimizations of the compiler.
Instead, the first function test_event_printk() was updated to look at
"%s" as well. If the "%s" argument is a pointer outside the event in the
ring buffer, it would find the field type of the event that is the problem
and mark the structure with a new flag called "needs_test". The event
itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that
this event has a field that needs to be verified before the event can be
printed using the printf format.
When the event fields are created from the field type structure, the
fields would copy the field type's "needs_test" value.
Finally, before being printed, a new function ignore_event() is called
which will check if the event has the TEST_STR flag set (if not, it
returns false). If the flag is set, it then iterates through the events
fields looking for the ones that have the "needs_test" flag set.
Then it uses the offset field from the field structure to find the pointer
in the ring buffer event. It runs the tests to make sure that pointer is
safe to print and if not, it triggers the WARN_ON() and also adds to the
trace output that the event in question has an unsafe memory access.
The ignore_event() makes the trace_check_vprintf() obsolete so it is
removed.
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org
Fixes:
|
||
![]() |
65a25d9f7a |
tracing: Add "%s" check in test_event_printk()
The test_event_printk() code makes sure that when a trace event is
registered, any dereferenced pointers in from the event's TP_printk() are
pointing to content in the ring buffer. But currently it does not handle
"%s", as there's cases where the string pointer saved in the ring buffer
points to a static string in the kernel that will never be freed. As that
is a valid case, the pointer needs to be checked at runtime.
Currently the runtime check is done via trace_check_vprintf(), but to not
have to replicate everything in vsnprintf() it does some logic with the
va_list that may not be reliable across architectures. In order to get rid
of that logic, more work in the test_event_printk() needs to be done. Some
of the strings can be validated at this time when it is obvious the string
is valid because the string will be saved in the ring buffer content.
Do all the validation of strings in the ring buffer at boot in
test_event_printk(), and make sure that the field of the strings that
point into the kernel are accessible. This will allow adding checks at
runtime that will validate the fields themselves and not rely on paring
the TP_printk() format at runtime.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.685917008@goodmis.org
Fixes:
|
||
![]() |
917110481f |
tracing: Add missing helper functions in event pointer dereference check
The process_pointer() helper function looks to see if various trace event
macros are used. These macros are for storing data in the event. This
makes it safe to dereference as the dereference will then point into the
event on the ring buffer where the content of the data stays with the
event itself.
A few helper functions were missing. Those were:
__get_rel_dynamic_array()
__get_dynamic_array_len()
__get_rel_dynamic_array_len()
__get_rel_sockaddr()
Also add a helper function find_print_string() to not need to use a middle
man variable to test if the string exists.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.521836792@goodmis.org
Fixes:
|
||
![]() |
a6629626c5 |
tracing: Fix test_event_printk() to process entire print argument
The test_event_printk() analyzes print formats of trace events looking for
cases where it may dereference a pointer that is not in the ring buffer
which can possibly be a bug when the trace event is read from the ring
buffer and the content of that pointer no longer exists.
The function needs to accurately go from one print format argument to the
next. It handles quotes and parenthesis that may be included in an
argument. When it finds the start of the next argument, it uses a simple
"c = strstr(fmt + i, ',')" to find the end of that argument!
In order to include "%s" dereferencing, it needs to process the entire
content of the print format argument and not just the content of the first
',' it finds. As there may be content like:
({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char
*access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux"
}; union kvm_mmu_page_role role; role.word = REC->role;
trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe
%sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level,
role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "",
access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? ""
: "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ?
"unsync" : "sync", 0); saved_ptr; })
Which is an example of a full argument of an existing event. As the code
already handles finding the next print format argument, process the
argument at the end of it and not the start of it. This way it has both
the start of the argument as well as the end of it.
Add a helper function "process_pointer()" that will do the processing during
the loop as well as at the end. It also makes the code cleaner and easier
to read.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.362271189@goodmis.org
Fixes:
|
||
![]() |
166438a432 |
ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set
When function tracing and function graph tracing are both enabled (in
different instances) the "parent" of some of the function tracing events
is "return_to_handler" which is the trampoline used by function graph
tracing. To fix this, ftrace_get_true_parent_ip() was introduced that
returns the "true" parent ip instead of the trampoline.
To do this, the ftrace_regs_get_stack_pointer() is used, which uses
kernel_stack_pointer(). The problem is that microblaze does not implement
kerenl_stack_pointer() so when function graph tracing is enabled, the
build fails. But microblaze also does not enabled HAVE_DYNAMIC_FTRACE_WITH_ARGS.
That option has to be enabled by the architecture to reliably get the
values from the fregs parameter passed in. When that config is not set,
the architecture can also pass in NULL, which is not tested for in that
function and could cause the kernel to crash.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Jeff Xie <jeff.xie@linux.dev>
Link: https://lore.kernel.org/20241216164633.6df18e87@gandalf.local.home
Fixes:
|
||
![]() |
cc252bb592 |
fgraph: Still initialize idle shadow stacks when starting
A bug was discovered where the idle shadow stacks were not initialized
for offline CPUs when starting function graph tracer, and when they came
online they were not traced due to the missing shadow stack. To fix
this, the idle task shadow stack initialization was moved to using the
CPU hotplug callbacks. But it removed the initialization when the
function graph was enabled. The problem here is that the hotplug
callbacks are called when the CPUs come online, but the idle shadow
stack initialization only happens if function graph is currently
active. This caused the online CPUs to not get their shadow stack
initialized.
The idle shadow stack initialization still needs to be done when the
function graph is registered, as they will not be allocated if function
graph is not registered.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home
Fixes:
|
||
![]() |
06103dccbb |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: Auto-merging include/linux/bpf.h Auto-merging include/linux/bpf_verifier.h Auto-merging kernel/bpf/btf.c Auto-merging kernel/bpf/verifier.c Auto-merging kernel/trace/bpf_trace.c Auto-merging tools/testing/selftests/bpf/progs/test_tp_btf_nullable.c Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
35f301dd45 |
BPF fixes:
- Fix a bug in the BPF verifier to track changes to packet data property for global functions (Eduard Zingerman) - Fix a theoretical BPF prog_array use-after-free in RCU handling of __uprobe_perf_func (Jann Horn) - Fix BPF tracing to have an explicit list of tracepoints and their arguments which need to be annotated as PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi) - Fix a logic bug in the bpf_remove_insns code where a potential error would have been wrongly propagated (Anton Protopopov) - Avoid deadlock scenarios caused by nested kprobe and fentry BPF programs (Priya Bala Govindasamy) - Fix a bug in BPF verifier which was missing a size check for BTF-based context access (Kumar Kartikeya Dwivedi) - Fix a crash found by syzbot through an invalid BPF prog_array access in perf_event_detach_bpf_prog (Jiri Olsa) - Fix several BPF sockmap bugs including a race causing a refcount imbalance upon element replace (Michal Luczaj) - Fix a use-after-free from mismatching BPF program/attachment RCU flavors (Jann Horn) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYKADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZ13rdhUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISINfqAD7B2vX6EgTFrgy7QDepQnZsmu2qjdW fFUzPatFXXp2S3MA/16vOEoHJ4rRhBkcUK/vw3gyY5j5bYZNUTTaam5l4BcM =gkfb -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Daniel Borkmann: - Fix a bug in the BPF verifier to track changes to packet data property for global functions (Eduard Zingerman) - Fix a theoretical BPF prog_array use-after-free in RCU handling of __uprobe_perf_func (Jann Horn) - Fix BPF tracing to have an explicit list of tracepoints and their arguments which need to be annotated as PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi) - Fix a logic bug in the bpf_remove_insns code where a potential error would have been wrongly propagated (Anton Protopopov) - Avoid deadlock scenarios caused by nested kprobe and fentry BPF programs (Priya Bala Govindasamy) - Fix a bug in BPF verifier which was missing a size check for BTF-based context access (Kumar Kartikeya Dwivedi) - Fix a crash found by syzbot through an invalid BPF prog_array access in perf_event_detach_bpf_prog (Jiri Olsa) - Fix several BPF sockmap bugs including a race causing a refcount imbalance upon element replace (Michal Luczaj) - Fix a use-after-free from mismatching BPF program/attachment RCU flavors (Jann Horn) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits) bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs selftests/bpf: Add tests for raw_tp NULL args bpf: Augment raw_tp arguments with PTR_MAYBE_NULL bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL" selftests/bpf: Add test for narrow ctx load for pointer args bpf: Check size for BTF-based ctx access of pointer members selftests/bpf: extend changes_pkt_data with cases w/o subprograms bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs bpf: Fix theoretical prog_array UAF in __uprobe_perf_func() bpf: fix potential error return selftests/bpf: validate that tail call invalidates packet pointers bpf: consider that tail calls invalidate packet pointers selftests/bpf: freplace tests for tracking of changes_packet_data bpf: check changes_pkt_data property for extension programs selftests/bpf: test for changing packet data from global functions bpf: track changes_pkt_data property for global functions bpf: refactor bpf_helper_changes_pkt_data to use helper number bpf: add find_containing_subprog() utility function bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors ... |
||
![]() |
1594c49394 |
Probes fixes for v6.13-rc1:
- eprobes: Fix to release eprobe when failed to add dyn_event. This unregisters event call and release eprobe when it fails to add a dynamic event. Found in cleaning up. -----BEGIN PGP SIGNATURE----- iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmdYT3sbHG1hc2FtaS5o aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8b5X8IALRigb6oDLzrq8yavSPy xn1QlnRtRFdLz+PQ3kFCzU3TOT9oxdFhBkYAXS32vDItPqzM7Upj0oZceqhmd5kz aXSdkL+PFmbHuLzyPuBksyX4gKga06rQBHJ2SIPxnRPZcXBBRStqyWRDpNjwIxrW K8p6k0Agrtd4tL7QtBdukda0uJqKSjN3gOzRAu40KMBjBJZ3kMTsoc+GWGIoIMHb PIDaXTZT0DlZ9ZxiEA/gPcjMugNjDVhkbq2ChPU+asvlRs0YUANT4CF0HcntJvDO W0xIWivfYIKWFLdAn6fhXicPkqU9DQ7FjppyRKC6y4bwuCYJlSeLsPmSWNI2IEBX bFA= =LLWX -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull eprobes fix from Masami Hiramatsu: - release eprobe when failing to add dyn_event. This unregisters event call and release eprobe when it fails to add a dynamic event. Found in cleaning up. * tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/eprobe: Fix to release eprobe when failed to add dyn_event |
||
![]() |
7d0d673627 |
bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
Currently, the pointer stored in call->prog_array is loaded in
__uprobe_perf_func(), with no RCU annotation and no immediately visible
RCU protection, so it looks as if the loaded pointer can immediately be
dangling.
Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical
section, but this is too late. It then uses rcu_dereference_check(), but
this use of rcu_dereference_check() does not actually dereference anything.
Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller
provide rcu_read_lock_trace() protection and then load call->prog_array
with rcu_dereference_check().
This issue seems to be theoretical: I don't know of any way to reach this
code without having handle_swbp() further up the stack, which is already
holding a rcu_read_lock_trace() lock, so where we take
rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe()
doesn't actually have any effect.
Fixes:
|
||
![]() |
978c4486cc |
bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
Syzbot reported [1] crash that happens for following tracing scenario:
- create tracepoint perf event with attr.inherit=1, attach it to the
process and set bpf program to it
- attached process forks -> chid creates inherited event
the new child event shares the parent's bpf program and tp_event
(hence prog_array) which is global for tracepoint
- exit both process and its child -> release both events
- first perf_event_detach_bpf_prog call will release tp_event->prog_array
and second perf_event_detach_bpf_prog will crash, because
tp_event->prog_array is NULL
The fix makes sure the perf_event_detach_bpf_prog checks prog_array
is valid before it tries to remove the bpf program from it.
[1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc95e896a7ef9ff93b0b8ad
Fixes:
|
||
![]() |
ef1b808e3b |
bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
Uprobes always use bpf_prog_run_array_uprobe() under tasks-trace-RCU
protection. But it is possible to attach a non-sleepable BPF program to a
uprobe, and non-sleepable BPF programs are freed via normal RCU (see
__bpf_prog_put_noref()). This leads to UAF of the bpf_prog because a normal
RCU grace period does not imply a tasks-trace-RCU grace period.
Fix it by explicitly waiting for a tasks-trace-RCU grace period after
removing the attachment of a bpf_prog to a perf_event.
Fixes:
|
||
![]() |
442bc81bd3 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. Trivial conflict: tools/testing/selftests/bpf/prog_tests/verifier.c Adjacent changes in: Auto-merging kernel/bpf/verifier.c Auto-merging samples/bpf/Makefile Auto-merging tools/testing/selftests/bpf/.gitignore Auto-merging tools/testing/selftests/bpf/Makefile Auto-merging tools/testing/selftests/bpf/prog_tests/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
494b332064 |
tracing/eprobe: Fix to release eprobe when failed to add dyn_event
Fix eprobe event to unregister event call and release eprobe when it fails
to add dynamic event correctly.
Link: https://lore.kernel.org/all/173289886698.73724.1959899350183686006.stgit@devnote2/
Fixes:
|
||
![]() |
dc1b157b82 |
tracing: Fix archs that still call tracepoints without RCU watching
Tracepoints require having RCU "watching" as it uses RCU to do updates to
the tracepoints. There are some cases that would call a tracepoint when
RCU was not "watching". This was usually in the idle path where RCU has
"shutdown". For the few locations that had tracepoints without RCU
watching, there was an trace_*_rcuidle() variant that could be used. This
used SRCU for protection.
There are tracepoints that trace when interrupts and preemption are
enabled and disabled. In some architectures, these tracepoints are called
in a path where RCU is not watching. When x86 and arm64 removed these
locations, it was incorrectly assumed that it would be safe to remove the
trace_*_rcuidle() variant and also remove the SRCU logic, as it made the
code more complex and harder to implement new tracepoint features (like
faultable tracepoints and tracepoints in rust).
Instead of bringing back the trace_*_rcuidle(), as it will not be trivial
to do as new code has already been added depending on its removal, add a
workaround to the one file that still requires it (trace_preemptirq.c). If
the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if
the code is in the idle path, and if so, call ct_irq_enter/exit() which
will enable RCU around the tracepoint.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes:
|
||
![]() |
e63fbd5f68 |
tracing: Fix cmp_entries_dup() to respect sort() comparison rules
The cmp_entries_dup() function used as the comparator for sort()
violated the symmetry and transitivity properties required by the
sorting algorithm. Specifically, it returned 1 whenever memcmp() was
non-zero, which broke the following expectations:
* Symmetry: If x < y, then y > x.
* Transitivity: If x < y and y < z, then x < z.
These violations could lead to incorrect sorting and failure to
correctly identify duplicate elements.
Fix the issue by directly returning the result of memcmp(), which
adheres to the required comparison properties.
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
3bfb49d73f |
bpf: Refactor bpf_tracing_func_proto() and remove bpf_get_probe_write_proto()
With bpf_get_probe_write_proto() no longer printing a message, we can avoid it being a special case with its own permission check. Refactor bpf_tracing_func_proto() similar to bpf_base_func_proto() to have a section conditional on bpf_token_capable(CAP_SYS_ADMIN), where the proto for bpf_probe_write_user() is returned. Finally, remove the unnecessary bpf_get_probe_write_proto(). This simplifies the code, and adding additional CAP_SYS_ADMIN-only helpers in future avoids duplicating the same CAP_SYS_ADMIN check. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Marco Elver <elver@google.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20241129090040.2690691-2-elver@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
b28573ebfa |
bpf: Remove bpf_probe_write_user() warning message
The warning message for bpf_probe_write_user() was introduced in |
||
![]() |
bcfd5f644c |
Linux 6.13-rc1
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmdM4ygeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGURgIAIpjH8kH2NS3bdqK 65MBoKZ8qstZcQyo7H68sCkMyaspvDyePznmkDrWym/FyIOVg4FQ/sXes9xxLACu 2zy9WG+bAmZvpQ/xCqJZK9WklbXwvRXW5c5i+SB1kFTMhhdLqCpwxRnaQyIVMnmO dIAtJxDr1eYpOCEmibEbVfYyj9SUhBcvk4qznV5yeW50zOYzv0OJU9BwAuxkShxV NXqMpXoy1Ye5GJ2KB8u/VEccVpywR0c6bHlvaTnPZxOBxrZF/FbVQ6PzEO+j4/aX 3TWgSa5jrVwRksnll8YqIkNSWR10u3kOLgDax/S0G8opktTFIB/EiQ84AVN0Tjme PrwJSWs= =tjyG -----END PGP SIGNATURE----- Merge tag 'v6.13-rc1' into perf/core, to refresh the branch Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
7863dcc72d
|
pid: allow pid_max to be set per pid namespace
The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd dicussions Linus proposed to bump pid_max by default (cf. [1]). Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change. The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling to make selecting specific processes without typing really large pid numbers available. In any case, there are workloads that have expections about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concreate example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t. That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it. Of course, giving containers the ability to restrict the number of processes in their respective pid namespace indepent of the global limit through pid_max is something desirable in itself and comes in handy in general. Independent of motivating use-cases the existence of pid namespaces makes this also a good semantical extension and there have been prior proposals pushing in a similar direction. The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here. What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100 that no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchial this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this. All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface. [1]: https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com - rebased from 5.14-rc1 - a few fixes (missing ns_free_inum on error path, missing initialization, etc) - permission check changes in pid_table_root_permissions - unsigned int pid_max -> int pid_max (keep pid_max type as it was) - add READ_ONCE in alloc_pid() as suggested by Christian - rebased from 6.7 and take into account: * sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) * sysctl: treewide: constify ctl_table_header::ctl_table_arg * pidfd: add pidfs * tracing: Move saved_cmdline code into trace_sched_switch.c Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20241122132459.135120-2-aleksandr.mikhalitsyn@canonical.com Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
![]() |
0172afefbf |
tracing: Record task flag NEED_RESCHED_LAZY.
The scheduler added NEED_RESCHED_LAZY scheduling. Record this state as part of trace flags and expose it in the need_resched field. Record and expose NEED_RESCHED_LAZY. [bigeasy: Commit description, documentation bits.] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241122202849.7DfYpJR0@linutronix.de Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
06afb0f361 |
tracing updates for v6.13:
- Addition of faultable tracepoints There's a tracepoint attached to both a system call entry and exit. This location is known to allow page faults. The tracepoints are called under an rcu_read_lock() which does not allow faults that can sleep. This limits the ability of tracepoint handlers to page fault in user space system call parameters. Now these tracepoints have been made "faultable", allowing the callbacks to fault in user space parameters and record them. Note, only the infrastructure has been implemented. The consumers (perf, ftrace, BPF) now need to have their code modified to allow faults. - Fix up of BPF code for the tracepoint faultable logic - Update tracepoints to use the new static branch API - Remove trace_*_rcuidle() variants and the SRCU protection they used - Remove unused TRACE_EVENT_FL_FILTERED logic - Replace strncpy() with strscpy() and memcpy() - Use replace per_cpu_ptr(smp_processor_id()) with this_cpu_ptr() - Fix perf events to not duplicate samples when tracing is enabled - Replace atomic64_add_return(1, counter) with atomic64_inc_return(counter) - Make stack trace buffer 4K instead of PAGE_SIZE - Remove TRACE_FLAG_IRQS_NOSUPPORT flag as it was never used - Get the true return address for function tracer when function graph tracer is also running. When function_graph trace is running along with function tracer, the parent function of the function tracer sometimes is "return_to_handler", which is the function graph trampoline to record the exit of the function. Use existing logic that calls into the fgraph infrastructure to find the real return address. - Remove (un)regfunc pointers out of tracepoint structure - Added last minute bug fix for setting pending modules in stack function filter. echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter Would cause a kernel NULL dereference. - Minor clean ups -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZz6dehQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qlQsAP9aB0XGUV3UykvjZuKK84VDZ26a2hZH X2JDYsNA4luuPAEAz/BG2rnslfMZ04WTMAl8h1eh10lxcuHG0wQMHVBXIwI= =lzb5 -----END PGP SIGNATURE----- Merge tag 'trace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Addition of faultable tracepoints There's a tracepoint attached to both a system call entry and exit. This location is known to allow page faults. The tracepoints are called under an rcu_read_lock() which does not allow faults that can sleep. This limits the ability of tracepoint handlers to page fault in user space system call parameters. Now these tracepoints have been made "faultable", allowing the callbacks to fault in user space parameters and record them. Note, only the infrastructure has been implemented. The consumers (perf, ftrace, BPF) now need to have their code modified to allow faults. - Fix up of BPF code for the tracepoint faultable logic - Update tracepoints to use the new static branch API - Remove trace_*_rcuidle() variants and the SRCU protection they used - Remove unused TRACE_EVENT_FL_FILTERED logic - Replace strncpy() with strscpy() and memcpy() - Use replace per_cpu_ptr(smp_processor_id()) with this_cpu_ptr() - Fix perf events to not duplicate samples when tracing is enabled - Replace atomic64_add_return(1, counter) with atomic64_inc_return(counter) - Make stack trace buffer 4K instead of PAGE_SIZE - Remove TRACE_FLAG_IRQS_NOSUPPORT flag as it was never used - Get the true return address for function tracer when function graph tracer is also running. When function_graph trace is running along with function tracer, the parent function of the function tracer sometimes is "return_to_handler", which is the function graph trampoline to record the exit of the function. Use existing logic that calls into the fgraph infrastructure to find the real return address. - Remove (un)regfunc pointers out of tracepoint structure - Added last minute bug fix for setting pending modules in stack function filter. echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter Would cause a kernel NULL dereference. - Minor clean ups * tag 'trace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (31 commits) ftrace: Fix regression with module command in stack_trace_filter tracing: Fix function name for trampoline ftrace: Get the true parent ip for function tracer tracing: Remove redundant check on field->field in histograms bpf: ensure RCU Tasks Trace GP for sleepable raw tracepoint BPF links bpf: decouple BPF link/attach hook and BPF program sleepable semantics bpf: put bpf_link's program when link is safe to be deallocated tracing: Replace strncpy() with strscpy() when copying comm tracing: Add might_fault() check in __DECLARE_TRACE_SYSCALL tracing: Fix syscall tracepoint use-after-free tracing: Introduce tracepoint_is_faultable() tracing: Introduce tracepoint extended structure tracing: Remove TRACE_FLAG_IRQS_NOSUPPORT tracing: Replace multiple deprecated strncpy with memcpy tracing: Make percpu stack trace buffer invariant to PAGE_SIZE tracing: Use atomic64_inc_return() in trace_clock_counter() trace/trace_event_perf: remove duplicate samples on the first tracepoint event tracing/bpf: Add might_fault check to syscall probes tracing/perf: Add might_fault check to syscall probes tracing/ftrace: Add might_fault check to syscall probes ... |
||
![]() |
4b01712311 |
tracing/tools: Updates for 6.13
- Add ':' to getopt option 'trace-buffer-size' in timerlat_hist for consistency - Remove unused sched_getattr define - Rename sched_setattr() helper to syscall_sched_setattr() to avoid conflicts - Update counters to long from int to avoid overflow - Add libcpupower dependency detection - Add --deepest-idle-state to timerlat to limit deep idle sleeps - Other minor clean ups and documentation changes -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZz5O/hQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qkLlAQDAJ0MASrdbJRDrLrfmKX6sja582MLe 3MvevdSkOeXRdQEA0tzm46KOb5/aYNotzpntQVkTjuZiPBHSgn1JzASiaAI= =OZ1w -----END PGP SIGNATURE----- Merge tag 'trace-tools-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing tools updates from Steven Rostedt: - Add ':' to getopt option 'trace-buffer-size' in timerlat_hist for consistency - Remove unused sched_getattr define - Rename sched_setattr() helper to syscall_sched_setattr() to avoid conflicts - Update counters to long from int to avoid overflow - Add libcpupower dependency detection - Add --deepest-idle-state to timerlat to limit deep idle sleeps - Other minor clean ups and documentation changes * tag 'trace-tools-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: verification/dot2: Improve dot parser robustness tools/rtla: Improve exception handling in timerlat_load.py tools/rtla: Enhance argument parsing in timerlat_load.py tools/rtla: Improve code readability in timerlat_load.py rtla/timerlat: Do not set params->user_workload with -U rtla: Documentation: Mention --deepest-idle-state rtla/timerlat: Add --deepest-idle-state for hist rtla/timerlat: Add --deepest-idle-state for top rtla/utils: Add idle state disabling via libcpupower rtla: Add optional dependency on libcpupower tools/build: Add libcpupower dependency detection rtla/timerlat: Make timerlat_hist_cpu->*_count unsigned long long rtla/timerlat: Make timerlat_top_cpu->*_count unsigned long long tools/rtla: fix collision with glibc sched_attr/sched_set_attr tools/rtla: drop __NR_sched_getattr rtla: Fix consistency in getopt_long for timerlat_hist rv: Fix a typo tools/rv: Correct the grammatical errors in the comments tools/rv: Correct the grammatical errors in the comments rtla: use the definition for stdout fd when calling isatty() |
||
![]() |
f1db825805 |
trace ring-buffer updates for v6.13
- Limit time interrupts are disabled in rb_check_pages() The rb_check_pages() is called after the ring buffer size is updated to make sure that the ring buffer has not been corrupted. Commit |
||
![]() |
6e95ef0258 |
bpf-next-bpf-next-6.13
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmc7hIQACgkQ6rmadz2v bTrcRA/+MsUOzJPnjokonHwk8X4KQM21gOua/sUcGArLVGF/JoW5/b1W8UBQ0y5+ +okYaRNGpwF0/2S8M5FAYpM7VSPLl1U7Rihr55I63D9kbAo0pDQwpn4afQFuZhaC l7MzkhBHS7XXx5/70APOzy3kz1GDYvz39jiWuAAhRqVejFO+fa4pDz4W+Ht7jYTQ jJOLn4vJna9fSfVf/U/bbdz5lL0lncIiEnRIEbF7EszbF2CA7sa+/KFENGM7ChEo UlxK2Xz5fpzgT6htZRjMr6jmupfg7gzdT4moOysQQcjkllvv6/4MD0s/GLShtG9H SmpaptpYCEGXLuApGzkSddwiT6iUMTqQr7zs6LPp0gPh+4Z0sSPNoBtBp2v0aVDl w0zhVhMfoF66rMG+IZY684CsMGg5h8UsOS46KLjSU0fW2HpGM7+zZLpXOaGkU3OH UV0womPT/C2kS2fpOn9F91O8qMjOZ4EXd+zuRtIRv9CeuVIpCT9R13lEYn+wfr6d aUci8wybha1UOAvkRiXiqWOPS+0Z/arrSbCSDMQF6DevLpQl0noVbTVssWXcRdUE 9Ve6J0yS29WxNWFtuuw4xP5NcG1AnRXVGh215TuVBX7xK9X/hnDDhfalltsjXfnd m1f64FxU2SGp2D7X8BX/6Aeyo6mITE6I3SNMUrcvk1Zid36zhy8= =TXGS -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Add BPF uprobe session support (Jiri Olsa) - Optimize uprobe performance (Andrii Nakryiko) - Add bpf_fastcall support to helpers and kfuncs (Eduard Zingerman) - Avoid calling free_htab_elem() under hash map bucket lock (Hou Tao) - Prevent tailcall infinite loop caused by freplace (Leon Hwang) - Mark raw_tracepoint arguments as nullable (Kumar Kartikeya Dwivedi) - Introduce uptr support in the task local storage map (Martin KaFai Lau) - Stringify errno log messages in libbpf (Mykyta Yatsenko) - Add kmem_cache BPF iterator for perf's lock profiling (Namhyung Kim) - Support BPF objects of either endianness in libbpf (Tony Ambardar) - Add ksym to struct_ops trampoline to fix stack trace (Xu Kuohai) - Introduce private stack for eligible BPF programs (Yonghong Song) - Migrate samples/bpf tests to selftests/bpf test_progs (Daniel T. Lee) - Migrate test_sock to selftests/bpf test_progs (Jordan Rife) * tag 'bpf-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (152 commits) libbpf: Change hash_combine parameters from long to unsigned long selftests/bpf: Fix build error with llvm 19 libbpf: Fix memory leak in bpf_program__attach_uprobe_multi bpf: use common instruction history across all states bpf: Add necessary migrate_disable to range_tree. bpf: Do not alloc arena on unsupported arches selftests/bpf: Set test path for token/obj_priv_implicit_token_envvar selftests/bpf: Add a test for arena range tree algorithm bpf: Introduce range_tree data structure and use it in bpf arena samples/bpf: Remove unused variable in xdp2skb_meta_kern.c samples/bpf: Remove unused variables in tc_l2_redirect_kern.c bpftool: Cast variable `var` to long long bpf, x86: Propagate tailcall info only for subprogs bpf: Add kernel symbol for struct_ops trampoline bpf: Use function pointers count as struct_ops links count bpf: Remove unused member rcu from bpf_struct_ops_map selftests/bpf: Add struct_ops prog private stack tests bpf: Support private stack for struct_ops progs selftests/bpf: Add tracing prog private stack tests bpf, x86: Support private stack in jit ... |
||
![]() |
f89a687aae |
kgdb patches for 6.13
A relatively modest collection of changes: * Adopt kstrtoint() and kstrtol() instead of the simple_strtoXX family for better error checking of user input. * Align the print behavour when breakpoints are enabled and disabled by adopting the current behaviour of breakpoint disable for both. * Remove some of the (rather odd and user hostile) hex fallbacks and require kdb users to prefix with 0x instead. * Tidy up (and fix) control code handling in kdb's keyboard code. This makes the control code handling at the keyboard behave the same way as it does via the UART. * Switch my own entry in MAINTAINERS to my @kernel.org address. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmc7bV4ACgkQfOMlXTn3 iKE9Mw/9G80KzejHGaSbzA17ELmxvCeQYQtnpbOiySpvzmIQWkOT7RBhqvqSD/+b 8tCT1aE/QHgkYRSIGTtCVILMSrJ1v2yJR5yuNOXAQgpwVCKq13hq4t7OFBpd+f2K kiY+UCpOOLb7okhjwT5I8hwI1wiHw9VOfcVq2BbBrcQPSoPfAI3iQ8PXUZHu4uq9 EB2OZskFxnIRtCJWXzEayXwzpD0mI9j0Ab+TEm32X3RU+BF0kGLfRvTKYl9jWkBc jsW4BKGOa+dfO5tu8zhVGxk5pssNeomaBNwRLD2EqtlmQJOkiGEk7qsR8z8aeETx uGbmfa4glrZj1V66bOeq9i+qqoAB9VY4TWw2/KSGOaQYsKHcK58EmSzq5nM0Abex rJbOBslsTYBMxz0z5qW8GyD20WtjgMSGtCmAu7OmlDJJdcksYsy6CY+gkfUsVS87 ZA4U0y8zvpyjMt2EKMS5o0/511bwzFtWtqEmiEBqfkX/NUJanaEBTt943NbnJEgu i8J+62B69G2X6gXjRZdncGC+MTWH/o93wmZk5u7bgdO0Wqk9t/EArILp4P9Ieco9 TpblPvcqEjfzBwkQKGMX5zhiR1YHzQn4sC4SmFUjczwuEjnmN0jEPMappG7bxI1c MEX5mPVQdRHO0N4jN/a7qC5PONbi8gKtnhfmCPbTGPwLF87DOEc= =rlg/ -----END PGP SIGNATURE----- Merge tag 'kgdb-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb updates from Daniel Thompson: "A relatively modest collection of changes: - Adopt kstrtoint() and kstrtol() instead of the simple_strtoXX family for better error checking of user input. - Align the print behavour when breakpoints are enabled and disabled by adopting the current behaviour of breakpoint disable for both. - Remove some of the (rather odd and user hostile) hex fallbacks and require kdb users to prefix with 0x instead. - Tidy up (and fix) control code handling in kdb's keyboard code. This makes the control code handling at the keyboard behave the same way as it does via the UART. - Switch my own entry in MAINTAINERS to my @kernel.org address" * tag 'kgdb-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: fix ctrl+e/a/f/b/d/p/n broken in keyboard mode MAINTAINERS: Use Daniel Thompson's korg address for kgdb work kdb: Fix breakpoint enable to be silent if already enabled kdb: Remove fallback interpretation of arbitrary numbers as hex trace: kdb: Replace simple_strtoul with kstrtoul in kdb_ftdump kdb: Replace the use of simple_strto with safer kstrto in kdb_main |
||
![]() |
aad3a0d084 |
ftrace updates for v6.13:
- Merged tag ftrace-v6.12-rc4 There was a fix to locking in register_ftrace_graph() for shadow stacks that was sent upstream. But this code was also being rewritten, and the locking fix was needed. Merging this fix was required to continue the work. - Restructure the function graph shadow stack to prepare it for use with kretprobes With the goal of merging the shadow stack logic of function graph and kretprobes, some more restructuring of the function shadow stack is required. Move out function graph specific fields from the fgraph infrastructure and store it on the new stack variables that can pass data from the entry callback to the exit callback. Hopefully, with this change, the merge of kretprobes to use fgraph shadow stacks will be ready by the next merge window. - Make shadow stack 4k instead of using PAGE_SIZE. Some architectures have very large PAGE_SIZE values which make its use for shadow stacks waste a lot of memory. - Give shadow stacks its own kmem cache. When function graph is started, every task on the system gets a shadow stack. In the future, shadow stacks may not be 4K in size. Have it have its own kmem cache so that whatever size it becomes will still be efficient in allocations. - Initialize profiler graph ops as it will be needed for new updates to fgraph - Convert to use guard(mutex) for several ftrace and fgraph functions - Add more comments and documentation - Show function return address in function graph tracer Add an option to show the caller of a function at each entry of the function graph tracer, similar to what the function tracer does. - Abstract out ftrace_regs from being used directly like pt_regs ftrace_regs was created to store a partial pt_regs. It holds only the registers and stack information to get to the function arguments and return values. On several archs, it is simply a wrapper around pt_regs. But some users would access ftrace_regs directly to get the pt_regs which will not work on all archs. Make ftrace_regs an abstract structure that requires all access to its fields be through accessor functions. - Show how long it takes to do function code modifications When code modification for function hooks happen, it always had the time recorded in how long it took to do the conversion. But this value was never exported. Recently the code was touched due to new ROX modification handling that caused a large slow down in doing the modifications and had a significant impact on boot times. Expose the timings in the dyn_ftrace_total_info file. This file was created a while ago to show information about memory usage and such to implement dynamic function tracing. It's also an appropriate file to store the timings of this modification as well. This will make it easier to see the impact of changes to code modification on boot up timings. - Other clean ups and small fixes -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZztrUxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qnnNAQD6w4q9VQ7oOE2qKLqtnj87h4c1GqKn SPkpEfC3n/ATEAD/fnYjT/eOSlHiGHuD/aTA+U/bETrT99bozGM/4mFKEgY= =6nCa -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace updates from Steven Rostedt: - Restructure the function graph shadow stack to prepare it for use with kretprobes With the goal of merging the shadow stack logic of function graph and kretprobes, some more restructuring of the function shadow stack is required. Move out function graph specific fields from the fgraph infrastructure and store it on the new stack variables that can pass data from the entry callback to the exit callback. Hopefully, with this change, the merge of kretprobes to use fgraph shadow stacks will be ready by the next merge window. - Make shadow stack 4k instead of using PAGE_SIZE. Some architectures have very large PAGE_SIZE values which make its use for shadow stacks waste a lot of memory. - Give shadow stacks its own kmem cache. When function graph is started, every task on the system gets a shadow stack. In the future, shadow stacks may not be 4K in size. Have it have its own kmem cache so that whatever size it becomes will still be efficient in allocations. - Initialize profiler graph ops as it will be needed for new updates to fgraph - Convert to use guard(mutex) for several ftrace and fgraph functions - Add more comments and documentation - Show function return address in function graph tracer Add an option to show the caller of a function at each entry of the function graph tracer, similar to what the function tracer does. - Abstract out ftrace_regs from being used directly like pt_regs ftrace_regs was created to store a partial pt_regs. It holds only the registers and stack information to get to the function arguments and return values. On several archs, it is simply a wrapper around pt_regs. But some users would access ftrace_regs directly to get the pt_regs which will not work on all archs. Make ftrace_regs an abstract structure that requires all access to its fields be through accessor functions. - Show how long it takes to do function code modifications When code modification for function hooks happen, it always had the time recorded in how long it took to do the conversion. But this value was never exported. Recently the code was touched due to new ROX modification handling that caused a large slow down in doing the modifications and had a significant impact on boot times. Expose the timings in the dyn_ftrace_total_info file. This file was created a while ago to show information about memory usage and such to implement dynamic function tracing. It's also an appropriate file to store the timings of this modification as well. This will make it easier to see the impact of changes to code modification on boot up timings. - Other clean ups and small fixes * tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits) ftrace: Show timings of how long nop patching took ftrace: Use guard to take ftrace_lock in ftrace_graph_set_hash() ftrace: Use guard to take the ftrace_lock in release_probe() ftrace: Use guard to lock ftrace_lock in cache_mod() ftrace: Use guard for match_records() fgraph: Use guard(mutex)(&ftrace_lock) for unregister_ftrace_graph() fgraph: Give ret_stack its own kmem cache fgraph: Separate size of ret_stack from PAGE_SIZE ftrace: Rename ftrace_regs_return_value to ftrace_regs_get_return_value selftests/ftrace: Fix check of return value in fgraph-retval.tc test ftrace: Use arch_ftrace_regs() for ftrace_regs_*() macros ftrace: Consolidate ftrace_regs accessor functions for archs using pt_regs ftrace: Make ftrace_regs abstract from direct use fgragh: No need to invoke the function call_filter_check_discard() fgraph: Simplify return address printing in function graph tracer function_graph: Remove unnecessary initialization in ftrace_graph_ret_addr() function_graph: Support recording and printing the function return address ftrace: Have calltime be saved in the fgraph storage ftrace: Use a running sleeptime instead of saving on shadow stack fgraph: Use fgraph data to store subtime for profiler ... |
||
![]() |
45af52e7d3 |
ftrace: Fix regression with module command in stack_trace_filter
When executing the following command: # echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter The current mod command causes a null pointer dereference. While commit |
||
![]() |
f41dac3efb |
Performance events changes for v6.13:
- Uprobes: - Add BPF session support (Jiri Olsa) - Switch to RCU Tasks Trace flavor for better performance (Andrii Nakryiko) - Massively increase uretprobe SMP scalability by SRCU-protecting the uretprobe lifetime (Andrii Nakryiko) - Kill xol_area->slot_count (Oleg Nesterov) - Core facilities: - Implement targeted high-frequency profiling by adding the ability for an event to "pause" or "resume" AUX area tracing (Adrian Hunter) - VM profiling/sampling: - Correct perf sampling with guest VMs (Colton Lewis) - New hardware support: - x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi) - Misc fixes and enhancements: - x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter) - x86/amd: Warn only on new bits set (Breno Leitao) - x86/amd/uncore: Avoid a false positive warning about snprintf truncation in amd_uncore_umc_ctx_init (Jean Delvare) - uprobes: Re-order struct uprobe_task to save some space (Christophe JAILLET) - x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang) - x86/rapl: Clean up cpumask and hotplug (Kan Liang) - uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg Nesterov) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7eKERHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1i57A/+KQ6TrIoICVTE+BPlDfUw8NU+N3DagVb0 dzoyDxlDRsnsYzeXZipPn+3IitX1w+DrGxBNIojSoiFVCLnHIKgo4uHbj7cVrR7J fBTVSnoJ94SGAk5ySebvLwMLce/YhXBeHK2lx6W/pI6acNcxzDfIabjjETeqltUo g7hmT9lo10pzZEZyuUfYX9khlWBxda1dKHc9pMIq7baeLe4iz/fCGlJ0K4d4M4z3 NPZw239Np6iHUwu3Lcs4gNKe4rcDe7Bt47hpedemHe0Y+7c4s2HaPxbXWxvDtE76 mlsg93i28f8SYxeV83pREn0EOCptXcljhiek+US+GR7NSbltMnV+uUiDfPKIE9+Y vYP/DYF9hx73FsOucEFrHxYYcePorn3pne5/khBYWdQU6TnlrBYWpoLQsjgCKTTR 4JhCFlBZ5cDpc6ihtpwCwVTQ4Q/H7vM1XOlDwx0hPhcIPPHDreaQD/wxo61jBdXf PY0EPAxh3BcQxfPYuDS+XiYjQ8qO8MtXMKz5bZyHBZlbHwccV6T4ExjsLKxFk5As 6BG8pkBWLg7drXAgVdleIY0ux+34w/Zzv7gemdlQxvWLlZrVvpjiG93oU3PTpZeq A2UD9eAOuXVD6+HsF/dmn88sFmcLWbrMskFWujkvhEUmCvSGAnz3YSS/mLEawBiT 2xI8xykNWSY= =ItOT -----END PGP SIGNATURE----- Merge tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Uprobes: - Add BPF session support (Jiri Olsa) - Switch to RCU Tasks Trace flavor for better performance (Andrii Nakryiko) - Massively increase uretprobe SMP scalability by SRCU-protecting the uretprobe lifetime (Andrii Nakryiko) - Kill xol_area->slot_count (Oleg Nesterov) Core facilities: - Implement targeted high-frequency profiling by adding the ability for an event to "pause" or "resume" AUX area tracing (Adrian Hunter) VM profiling/sampling: - Correct perf sampling with guest VMs (Colton Lewis) New hardware support: - x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi) Misc fixes and enhancements: - x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter) - x86/amd: Warn only on new bits set (Breno Leitao) - x86/amd/uncore: Avoid a false positive warning about snprintf truncation in amd_uncore_umc_ctx_init (Jean Delvare) - uprobes: Re-order struct uprobe_task to save some space (Christophe JAILLET) - x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang) - x86/rapl: Clean up cpumask and hotplug (Kan Liang) - uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg Nesterov)" * tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) perf/core: Correct perf sampling with guest VMs perf/x86: Refactor misc flag assignments perf/powerpc: Use perf_arch_instruction_pointer() perf/core: Hoist perf_instruction_pointer() and perf_misc_flags() perf/arm: Drop unused functions uprobes: Re-order struct uprobe_task to save some space perf/x86/amd/uncore: Avoid a false positive warning about snprintf truncation in amd_uncore_umc_ctx_init perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling perf/x86/intel/pt: Add support for pause / resume perf/core: Add aux_pause, aux_resume, aux_start_paused perf/x86/intel/pt: Fix buffer full but size is 0 case uprobes: SRCU-protect uretprobe lifetime (with timeout) uprobes: allow put_uprobe() from non-sleepable softirq context perf/x86/rapl: Clean up cpumask and hotplug perf/x86/rapl: Move the pmu allocation out of CPU hotplug uprobe: Add support for session consumer uprobe: Add data pointer to consumer handlers perf/x86/amd: Warn only on new bits set uprobes: fold xol_take_insn_slot() into xol_get_insn_slot() uprobes: kill xol_area->slot_count ... |
||
![]() |
b9c44b9147 |
perf/core: Save raw sample data conditionally based on sample type
Currently, space for raw sample data is always allocated within sample
records for both BPF output and tracepoint events. This leads to unused
space in sample records when raw sample data is not requested.
This patch enforces checking sample type of an event in
perf_sample_save_raw_data(). So raw sample data will only be saved if
explicitly requested, reducing overhead when it is not needed.
Fixes:
|
||
![]() |
6ce5a6f0a0 |
tracing: Fix function name for trampoline
The issue that unrelated function name is shown on stack trace like following even though it should be trampoline code address is caused by the creation of trampoline code in the area where .init.text section of module was freed after module is loaded. bash-1344 [002] ..... 43.644608: <stack trace> => (MODULE INIT FUNCTION) => vfs_write => ksys_write => do_syscall_64 => entry_SYSCALL_64_after_hwframe To resolve this, when function address of stack trace entry is in trampoline, output without looking up symbol name. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241021071454.34610-2-tatsuya.s2862@gmail.com Signed-off-by: Tatsuya S <tatsuya.s2862@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
60b1f578b5 |
ftrace: Get the true parent ip for function tracer
When using both function tracer and function graph simultaneously, it is found that function tracer sometimes captures a fake parent ip (return_to_handler) instead of the true parent ip. This issue is easy to reproduce. Below are my reproduction steps: jeff-labs:~/bin # ./trace-net.sh jeff-labs:~/bin # cat /sys/kernel/debug/tracing/instances/foo/trace | grep return_to_handler trace-net.sh-405 [001] ...2. 31.859501: avc_has_perm+0x4/0x190 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...2. 31.859503: simple_setattr+0x4/0x70 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...2. 31.859503: truncate_pagecache+0x4/0x60 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...2. 31.859505: unmap_mapping_range+0x4/0x140 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...3. 31.859508: _raw_spin_unlock+0x4/0x30 <-return_to_handler+0x0/0x40 [...] The following is my simple trace script: <snip> jeff-labs:~/bin # cat ./trace-net.sh TRACE_PATH="/sys/kernel/tracing" set_events() { echo 1 > $1/events/net/enable echo 1 > $1/events/tcp/enable echo 1 > $1/events/sock/enable echo 1 > $1/events/napi/enable echo 1 > $1/events/fib/enable echo 1 > $1/events/neigh/enable } set_events ${TRACE_PATH} echo 1 > ${TRACE_PATH}/options/sym-offset echo 1 > ${TRACE_PATH}/options/funcgraph-tail echo 1 > ${TRACE_PATH}/options/funcgraph-proc echo 1 > ${TRACE_PATH}/options/funcgraph-abstime echo 'tcp_orphan*' > ${TRACE_PATH}/set_ftrace_notrace echo function_graph > ${TRACE_PATH}/current_tracer INSTANCE_FOO=${TRACE_PATH}/instances/foo if [ ! -e $INSTANCE_FOO ]; then mkdir ${INSTANCE_FOO} fi set_events ${INSTANCE_FOO} echo 1 > ${INSTANCE_FOO}/options/sym-offset echo 'tcp_orphan*' > ${INSTANCE_FOO}/set_ftrace_notrace echo function > ${INSTANCE_FOO}/current_tracer echo 1 > ${TRACE_PATH}/tracing_on echo 1 > ${INSTANCE_FOO}/tracing_on echo > ${TRACE_PATH}/trace echo > ${INSTANCE_FOO}/trace </snip> Link: https://lore.kernel.org/20241008033159.22459-1-jeff.xie@linux.dev Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Jeff Xie <jeff.xie@linux.dev> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
537affea16 |
ring-buffer: Correct a grammatical error in a comment
The word "trace" begins with a consonant sound, so "a" should be used instead of "an". Link: https://lore.kernel.org/20241107095327.6390-1-liujing@cmss.chinamobile.com Signed-off-by: liujing <liujing@cmss.chinamobile.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
b5a24181e4 |
Ring buffer fixes for 6.12:
- Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug" A crash that happened on cpu hotplug was actually caused by the incorrect ref counting that was fixed by commit |
||
![]() |
09663753bb |
tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recording
The events of a memory mapped ring buffer from the previous boot should
not be mixed in with events from the current boot. There's meta data that
is used to handle KASLR so that function names can be shown properly.
Also, since the timestamps of the previous boot have no meaning to the
timestamps of the current boot, having them intermingled in a buffer can
also cause confusion because there could possibly be events in the future.
When a trace is activated the meta data is reset so that the pointers of
are now processed for the new address space. The trace buffers are reset
when tracing starts for the first time. The problem here is that the reset
only happens on online CPUs. If a CPU is offline, it does not get reset.
To demonstrate the issue, a previous boot had tracing enabled in the boot
mapped ring buffer on reboot. On the following boot, tracing has not been
started yet so the function trace from the previous boot is still visible.
# trace-cmd show -B boot_mapped -c 3 | tail
<idle>-0 [003] d.h2. 156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: __rcu_read_unlock <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: stop_this_cpu <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: set_cpu_online <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462397: disable_local_APIC <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462398: clear_local_APIC <-disable_local_APIC
<idle>-0 [003] d.h2. 156.462574: mcheck_cpu_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: mce_intel_feature_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: lmce_supported <-mce_intel_feature_clear
Now, if CPU 3 is taken offline, and tracing is started on the memory
mapped ring buffer, the events from the previous boot in the CPU 3 ring
buffer is not reset. Now those events are using the meta data from the
current boot and produces just hex values.
# echo 0 > /sys/devices/system/cpu/cpu3/online
# trace-cmd start -B boot_mapped -p function
# trace-cmd show -B boot_mapped -c 3 | tail
<idle>-0 [003] d.h2. 156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7
<idle>-0 [003] d.h2. 156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2
<idle>-0 [003] d.h2. 156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f
Reset all CPUs when starting a boot mapped ring buffer for the first time,
and not just the online CPUs.
Fixes:
|
||
![]() |
580bb355bc |
Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
A crash happened when testing cpu hotplug with respect to the memory mapped ring buffers. It was assumed that the hot plug code was adding a per CPU buffer that was already created that caused the crash. The real problem was due to ref counting and was fixed by commit |
||
![]() |
8714381703 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit |
||
![]() |
6371b4bc17 |
tracing: Remove redundant check on field->field in histograms
The check on field->field being true is handled as the first check on the cascaded if statement, so the later checks on field->field are redundant because this clause has already been handled. Since this later check is redundant, just remove it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241107120530.18728-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
99b403d206 |
bpf: Add support for uprobe multi session context
Placing bpf_session_run_ctx layer in between bpf_run_ctx and bpf_uprobe_multi_run_ctx, so the session data can be retrieved from uprobe_multi link. Plus granting session kfuncs access to uprobe session programs. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-5-jolsa@kernel.org |
||
![]() |
d920179b3d |
bpf: Add support for uprobe multi session attach
Adding support to attach BPF program for entry and return probe of the same function. This is common use case which at the moment requires to create two uprobe multi links. Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs kernel to attach single link program to both entry and exit probe. It's possible to control execution of the BPF program on return probe simply by returning zero or non zero from the entry BPF program execution to execute or not the BPF program on return probe respectively. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org |
||
![]() |
f505005bc7 |
bpf: Force uprobe bpf program to always return 0
As suggested by Andrii make uprobe multi bpf programs to always return 0,
so they can't force uprobe removal.
Keeping the int return type for uprobe_prog_run, because it will be used
in following session changes.
Fixes:
|
||
![]() |
5f67329cb2 |
Stable tag for bpf-next's uprobe work.
-----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmcrTRsVHHBldGVyekBp bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6PLoP/jL4pUgW/ZrQFwpZh71BxeDt2Ka/ Eb6AsHe0PcKAMJYaJDfin6FRU87hp3tHIefSGdexvSttWwbnwKl8cVb+Y7gVnytu b2PkMfiOFShKEhu6YAJmxWIOi6MDxonjIMQgjvsVGrZmHiPgGTrh+nnmHYQ+qxFq wCaZXO3E65drtZKbi1HddHDYR+e1mHQU0uC+mLO44sP3lzJVxPnYGKGjaS62Z/Da XF+3tz6jc6jpu08FJy8ltrqLvcHPmTuDkR6f8mG3Hc8Hw0mndY/4yk0bGbbHo7Vx y42Aq4UUgcpvb8OUIicMRLzp3hRjsSTn8UJjsinEaCexdw6ZZiZVU/YR9Mf5ivrJ dlplFJvP8b6psnHrRf5xJ1SUv7+dap075A3/28MEvGErZOINoULAGa/hJIndHfuL NeWaZj0+of2eAX1SDePia87jX1P9xuU6AEw944i2rhI4P1J5I6XYfcaDDICBYitv yREafY/i6wb/Q8GhpjWmSE7p4wUIi5o3CpZsncj7B4Me9JBdHWrcnyUY55Tz05mo zoKnNgYC3d9DAIwXvq7x6tM2Tw183YXul/aHJSr3/rFKuuGQx0XACt6BO+yI35q3 6max4kMyr+kUqr9YYZtb9fuBw3TPhwY/zXG0ydSxNNh7oX+boxh4/bxXljLWXmRQ eHgsXuuF1YgCg1R9 =Wiky -----END PGP SIGNATURE----- Merge tag 'perf-core-for-bpf-next' from tip tree Stable tag for bpf-next's uprobe work. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> |
||
![]() |
0c10cc2435 |
trace: kdb: Replace simple_strtoul with kstrtoul in kdb_ftdump
The function simple_strtoul performs no error checking in scenarios where the input value overflows the intended output variable. This results in this function successfully returning, even when the output does not match the input string (aka the function returns successfully even when the result is wrong). Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(), simple_strtoul(), and simple_strtoull() functions explicitly ignore overflows, which may lead to unexpected results in callers." Hence, the use of those functions is discouraged. This patch replaces all uses of the simple_strtoul with the safer alternatives kstrtoint and kstrtol. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> [nir: style fixes] Signed-off-by: Nir Lichtman <nir@lichtman.org> Link: https://lore.kernel.org/r/20241028192100.GB918454@lichtman.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> |
||
![]() |
242b32d807 |
tracing: Replace strncpy() with strscpy() when copying comm
Replace the depreciated[1] strncpy() calls with strscpy() when copying comm. Link: https://github.com/KSPP/linux/issues/90 [1] Cc: <mhiramat@kernel.org> Cc: <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241031120139.1343025-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
e9f0a36347 |
tracing: Remove TRACE_FLAG_IRQS_NOSUPPORT
It was possible to enable tracing with no IRQ tracing support. The
tracing infrastructure would then record TRACE_FLAG_IRQS_NOSUPPORT as
the only tracing flag and show an 'X' in the output.
The last user of this feature was PPC32 which managed to implement it
during PowerPC merge in 2009. Since then, it was unused and the PPC32
dependency was finally removed in commit
|
||
![]() |
fa17cb4b3b |
tracing: Document tracefs gid mount option
Commit |
||
![]() |
77a1326f64 |
tracing: Replace multiple deprecated strncpy with memcpy
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. String copy operations involving manual pointer offset and length calculations followed by explicit NUL-byte assignments are best changed to either strscpy or memcpy. strscpy is not a drop-in replacement as @len would need a one subtracted from it to avoid truncating the source string. To not sabotage readability of the current code, use memcpy (retaining the manual NUL assignment) as this unambiguously describes the desired behavior. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 [2] Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/20241014-strncpy-kernel-trace-trace_events_filter-c-v2-1-d821e81e371e@google.com Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
2c33155ef6 |
tracing: Make percpu stack trace buffer invariant to PAGE_SIZE
Previously the size of "struct ftrace_stacks" depended upon PAGE_SIZE. For the common 4K page size, on a 64-bit system, sizeof(struct ftrace_stacks) was 32K. But for a 64K page size, sizeof(struct ftrace_stacks) was 512K. But ftrace stack usage requirements should be invariant to page size. So let's redefine FTRACE_KSTACK_ENTRIES so that "struct ftrace_stacks" is always sized at 32K for 64-bit and 16K for 32-bit. As a side effect, it removes the PAGE_SIZE compile-time constant assumption from this code, which is required to reach the goal of boot-time page size selection. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241021141832.3668264-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
36a367b891 |
ftrace: Show timings of how long nop patching took
Since the beginning of ftrace, the code that did the patching had its timings saved on how long it took to complete. But this information was never exposed. It was used for debugging and exposing it was always something that was on the TODO list. Now it's time to expose it. There's even a file that is where it should go! Also include how long patching modules took as a separate value. # cat /sys/kernel/tracing/dyn_ftrace_total_info 57680 pages:231 groups: 9 ftrace boot update time = 14024666 (ns) ftrace module total update time = 126070 (ns) Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241017113105.1edfa943@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
8b0cb3a4c5 |
ftrace: Use guard to take ftrace_lock in ftrace_graph_set_hash()
The ftrace_lock is taken for most of the ftrace_graph_set_hash() function throughout the end. Use guard to take the ftrace_lock to simplify the exit paths. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241028071308.406073025@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
9687bbf219 |
ftrace: Use guard to take the ftrace_lock in release_probe()
The ftrace_lock is held throughout the entire release_probe() function. Use guard to simplify any exit paths. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241028071308.250787901@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
1432afb50d |
ftrace: Use guard to lock ftrace_lock in cache_mod()
The ftrace_lock is held throughout cache_mod(), use guard to simplify the error paths. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241028071308.088458856@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
06294cf045 |
ftrace: Use guard for match_records()
The ftrace_lock is held for most of match_records() until the end of the function. Use guard to make error paths simpler. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241028071307.927146604@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
6348a3fa72 |
fgraph: Use guard(mutex)(&ftrace_lock) for unregister_ftrace_graph()
The ftrace_lock is held throughout unregister_ftrace_graph(), use a guard to simplify the error paths. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241028071307.770550792@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
434098485b |
fgraph: Give ret_stack its own kmem cache
The ret_stack (shadow stack used by function graph infrastructure) is created for every task on the system when function graph is enabled. Give it its own kmem_cache. This will make it easier to see how much memory is being used specifically for function graph shadow stacks. In the future, this size may change and may not be a power of two. Having its own cache can also keep it from fragmenting memory. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/20241026063210.7d4910a7@rorschach.local.home Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
6ea8b69da6 |
fgraph: Separate size of ret_stack from PAGE_SIZE
The ret_stack (shadow stack used by function graph infrastructure) is currently defined as PAGE_SIZE. But some architectures which have 64K PAGE_SIZE, this is way overkill. Also there's an effort to allow the PAGE_SIZE to be defined at boot up. Hard code it for now to 4096. In the future, this size may change and even be dependent on specific architectures. Link: https://lore.kernel.org/all/e5067bb8-0fcd-4739-9bca-0e872037d5a1@arm.com/ Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241019152951.053f9646@rorschach.local.home Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
81ec38ee9d |
Fixes for function graph:
- Fix missing mutex unlock in error path of register_ftrace_graph() A previous fix added a return on an error path and forgot to unlock the mutex. Instead of dealing with error paths, use guard(mutex) as the mutex is just released at the exit of the function anyway. Other functions in this file should be updated with this, but that's a cleanup and not a fix. - Change cpuhp setup name to be consistent with other cpuhp states The same fix that the above patch fixes added a cpuhp_setup_state() call with the name of "fgraph_idle_init". I was informed that it should instead be something like: "fgraph:online". Update that too. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZxydTRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qsE/APoDcsqqaDJvQ0OsMqVaPdHoj2IUkU4M yueb6U/Kyq1m4wEA259W1PZuQlM0Vo0yJM1w2YIAH18UpO09ZroLnbWoUAc= =2sS+ -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.12-rc4' into trace/ftrace/core In order to modify the code that allocates the shadow stacks, merge the changes that fixed the CPU hotplug shadow stack allocations and build on top of that. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
f69a1accfe |
Fixes for function graph:
- Fix missing mutex unlock in error path of register_ftrace_graph() A previous fix added a return on an error path and forgot to unlock the mutex. Instead of dealing with error paths, use guard(mutex) as the mutex is just released at the exit of the function anyway. Other functions in this file should be updated with this, but that's a cleanup and not a fix. - Change cpuhp setup name to be consistent with other cpuhp states The same fix that the above patch fixes added a cpuhp_setup_state() call with the name of "fgraph_idle_init". I was informed that it should instead be something like: "fgraph:online". Update that too. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZxydTRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qsE/APoDcsqqaDJvQ0OsMqVaPdHoj2IUkU4M yueb6U/Kyq1m4wEA259W1PZuQlM0Vo0yJM1w2YIAH18UpO09ZroLnbWoUAc= =2sS+ -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Fix missing mutex unlock in error path of register_ftrace_graph() A previous fix added a return on an error path and forgot to unlock the mutex. Instead of dealing with error paths, use guard(mutex) as the mutex is just released at the exit of the function anyway. Other functions in this file should be updated with this, but that's a cleanup and not a fix. - Change cpuhp setup name to be consistent with other cpuhp states The same fix that the above patch fixes added a cpuhp_setup_state() call with the name of "fgraph_idle_init". I was informed that it should instead be something like: "fgraph:online". Update that too. * tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fgraph: Change the name of cpuhp state to "fgraph:online" fgraph: Fix missing unlock in register_ftrace_graph() |
||
![]() |
a574e7f80e |
fgraph: Change the name of cpuhp state to "fgraph:online"
The cpuhp state name given to cpuhp_setup_state() is "fgraph_idle_init"
which doesn't really conform to the names that are used for cpu hotplug
setups. Instead rename it to "fgraph:online" to be in line with other
states.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241024222944.473d88c5@rorschach.local.home
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
![]() |
bd3734db86 |
fgraph: Fix missing unlock in register_ftrace_graph()
Use guard(mutex)() to acquire and automatically release ftrace_lock,
fixing the issue of not unlocking when calling cpuhp_setup_state()
fails.
Fixes smatch warning:
kernel/trace/fgraph.c:1317 register_ftrace_graph() warn: inconsistent returns '&ftrace_lock'.
Link: https://lore.kernel.org/20241024155917.1019580-1-lihuafei1@huawei.com
Fixes:
|
||
![]() |
bfa7b5c98b |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: include/linux/bpf.h include/uapi/linux/bpf.h kernel/bpf/btf.c kernel/bpf/helpers.c kernel/bpf/syscall.c kernel/bpf/verifier.c kernel/trace/bpf_trace.c mm/slab_common.c tools/include/uapi/linux/bpf.h tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/ Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
ae90f6a617 |
BPF fixes:
- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap link file descriptors (Hou Tao) - Fix BPF arm64 JIT's address emission with tag-based KASAN enabled reserving not enough size (Peter Collingbourne) - Fix BPF verifier do_misc_fixups patching for inlining of the bpf_get_branch_snapshot BPF helper (Andrii Nakryiko) - Fix a BPF verifier bug and reject BPF program write attempts into read-only marked BPF maps (Daniel Borkmann) - Fix perf_event_detach_bpf_prog error handling by removing an invalid check which would skip BPF program release (Jiri Olsa) - Fix memory leak when parsing mount options for the BPF filesystem (Hou Tao) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxrAzxUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISIPcHwD8DnBSPlHX9OezMWCm8mjVx2Fd26W9 /IaiW2tyOPtoSGIA/3hfgfLrxkb3Raoh0miQB2+FRrz9e+y7i8c4Q91mcUgJ =Hvht -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Daniel Borkmann: - Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap link file descriptors (Hou Tao) - Fix BPF arm64 JIT's address emission with tag-based KASAN enabled reserving not enough size (Peter Collingbourne) - Fix BPF verifier do_misc_fixups patching for inlining of the bpf_get_branch_snapshot BPF helper (Andrii Nakryiko) - Fix a BPF verifier bug and reject BPF program write attempts into read-only marked BPF maps (Daniel Borkmann) - Fix perf_event_detach_bpf_prog error handling by removing an invalid check which would skip BPF program release (Jiri Olsa) - Fix memory leak when parsing mount options for the BPF filesystem (Hou Tao) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Check validity of link->type in bpf_link_show_fdinfo() bpf: Add the missing BPF_LINK_TYPE invocation for sockmap bpf: fix do_misc_fixups() for bpf_get_branch_snapshot() bpf,perf: Fix perf_event_detach_bpf_prog error handling selftests/bpf: Add test for passing in uninit mtu_len selftests/bpf: Add test for writes to .rodata bpf: Remove MEM_UNINIT from skb/xdp MTU helpers bpf: Fix overloading of MEM_UNINIT's meaning bpf: Add MEM_WRITE attribute bpf: Preserve param->string when parsing mount options bpf, arm64: Fix address emission with tag-based KASAN enabled |
||
![]() |
0ee288e69d |
bpf,perf: Fix perf_event_detach_bpf_prog error handling
Peter reported that perf_event_detach_bpf_prog might skip to release
the bpf program for -ENOENT error from bpf_prog_array_copy.
This can't happen because bpf program is stored in perf event and is
detached and released only when perf event is freed.
Let's drop the -ENOENT check and make sure the bpf program is released
in any case.
Fixes:
|
||
![]() |
da09a9e0c3 |
uprobe: Add data pointer to consumer handlers
Adding data pointer to both entry and exit consumer handlers and all its users. The functionality itself is coming in following change. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241018202252.693462-2-jolsa@kernel.org |
||
![]() |
0b6e2e22cb |
tracing: Consider the NULL character when validating the event length
strlen() returns a string length excluding the null byte. If the string
length equals to the maximum buffer length, the buffer will have no
space for the NULL terminating character.
This commit checks this condition and returns failure for it.
Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/
Fixes:
|
||
![]() |
73f3508047 |
tracing/probes: Fix MAX_TRACE_ARGS limit handling
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.
This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:__set_print_fmt+0x134/0x330
Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.
Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/
Fixes:
|
||
![]() |
6fad274f06 |
bpf: Add MEM_WRITE attribute
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this function, but the latter
merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also
|
||
![]() |
6280cf718d |
bpf: Implement bpf_send_signal_task() kfunc
Implement bpf_send_signal_task kfunc that is similar to bpf_send_signal_thread and bpf_send_signal helpers but can be used to send signals to other threads and processes. It also supports sending a cookie with the signal similar to sigqueue(). If the receiving process establishes a handler for the signal using the SA_SIGINFO flag to sigaction(), then it can obtain this cookie via the si_value field of the siginfo_t structure passed as the second argument to the handler. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016084136.10305-2-puranjay@kernel.org |
||
![]() |
373b9338c9 |
uprobe: avoid out-of-bounds memory access of fetching args
Uprobe needs to fetch args into a percpu buffer, and then copy to ring
buffer to avoid non-atomic context problem.
Sometimes user-space strings, arrays can be very large, but the size of
percpu buffer is only page size. And store_trace_args() won't check
whether these data exceeds a single page or not, caused out-of-bounds
memory access.
It could be reproduced by following steps:
1. build kernel with CONFIG_KASAN enabled
2. save follow program as test.c
```
\#include <stdio.h>
\#include <stdlib.h>
\#include <string.h>
// If string length large than MAX_STRING_SIZE, the fetch_store_strlen()
// will return 0, cause __get_data_size() return shorter size, and
// store_trace_args() will not trigger out-of-bounds access.
// So make string length less than 4096.
\#define STRLEN 4093
void generate_string(char *str, int n)
{
int i;
for (i = 0; i < n; ++i)
{
char c = i % 26 + 'a';
str[i] = c;
}
str[n-1] = '\0';
}
void print_string(char *str)
{
printf("%s\n", str);
}
int main()
{
char tmp[STRLEN];
generate_string(tmp, STRLEN);
print_string(tmp);
return 0;
}
```
3. compile program
`gcc -o test test.c`
4. get the offset of `print_string()`
```
objdump -t test | grep -w print_string
0000000000401199 g F .text 000000000000001b print_string
```
5. configure uprobe with offset 0x1199
```
off=0x1199
cd /sys/kernel/debug/tracing/
echo "p /root/test:${off} arg1=+0(%di):ustring arg2=\$comm arg3=+0(%di):ustring"
> uprobe_events
echo 1 > events/uprobes/enable
echo 1 > tracing_on
```
6. run `test`, and kasan will report error.
==================================================================
BUG: KASAN: use-after-free in strncpy_from_user+0x1d6/0x1f0
Write of size 8 at addr ffff88812311c004 by task test/499CPU: 0 UID: 0 PID: 499 Comm: test Not tainted 6.12.0-rc3+ #18
Hardware name: Red Hat KVM, BIOS 1.16.0-4.al8 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x55/0x70
print_address_description.constprop.0+0x27/0x310
kasan_report+0x10f/0x120
? strncpy_from_user+0x1d6/0x1f0
strncpy_from_user+0x1d6/0x1f0
? rmqueue.constprop.0+0x70d/0x2ad0
process_fetch_insn+0xb26/0x1470
? __pfx_process_fetch_insn+0x10/0x10
? _raw_spin_lock+0x85/0xe0
? __pfx__raw_spin_lock+0x10/0x10
? __pte_offset_map+0x1f/0x2d0
? unwind_next_frame+0xc5f/0x1f80
? arch_stack_walk+0x68/0xf0
? is_bpf_text_address+0x23/0x30
? kernel_text_address.part.0+0xbb/0xd0
? __kernel_text_address+0x66/0xb0
? unwind_get_return_address+0x5e/0xa0
? __pfx_stack_trace_consume_entry+0x10/0x10
? arch_stack_walk+0xa2/0xf0
? _raw_spin_lock_irqsave+0x8b/0xf0
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? depot_alloc_stack+0x4c/0x1f0
? _raw_spin_unlock_irqrestore+0xe/0x30
? stack_depot_save_flags+0x35d/0x4f0
? kasan_save_stack+0x34/0x50
? kasan_save_stack+0x24/0x50
? mutex_lock+0x91/0xe0
? __pfx_mutex_lock+0x10/0x10
prepare_uprobe_buffer.part.0+0x2cd/0x500
uprobe_dispatcher+0x2c3/0x6a0
? __pfx_uprobe_dispatcher+0x10/0x10
? __kasan_slab_alloc+0x4d/0x90
handler_chain+0xdd/0x3e0
handle_swbp+0x26e/0x3d0
? __pfx_handle_swbp+0x10/0x10
? uprobe_pre_sstep_notifier+0x151/0x1b0
irqentry_exit_to_user_mode+0xe2/0x1b0
asm_exc_int3+0x39/0x40
RIP: 0033:0x401199
Code: 01 c2 0f b6 45 fb 88 02 83 45 fc 01 8b 45 fc 3b 45 e4 7c b7 8b 45 e4 48 98 48 8d 50 ff 48 8b 45 e8 48 01 d0 ce
RSP: 002b:00007ffdf00576a8 EFLAGS: 00000206
RAX: 00007ffdf00576b0 RBX: 0000000000000000 RCX: 0000000000000ff2
RDX: 0000000000000ffc RSI: 0000000000000ffd RDI: 00007ffdf00576b0
RBP: 00007ffdf00586b0 R08: 00007feb2f9c0d20 R09: 00007feb2f9c0d20
R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000401040
R13: 00007ffdf0058780 R14: 0000000000000000 R15: 0000000000000000
</TASK>
This commit enforces the buffer's maxlen less than a page-size to avoid
store_trace_args() out-of-memory access.
Link: https://lore.kernel.org/all/20241015060148.1108331-1-mqaio@linux.alibaba.com/
Fixes:
|
||
![]() |
2b4d25010d |
- Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their state, i.e., whether they're runnable or blocked - Handle delayed dequeued tasks and their migration wrt PSI properly - Fix the situation where a delayed dequeue task gets enqueued into a new class, which should not happen - Fix a case where memory allocation would happen while the runqueue lock is held, which is a no-no - Do not over-schedule when tasks with shorter slices preempt the currently running task - Make sure delayed to deque entities are properly handled before unthrottling - Other smaller cleanups and improvements -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmcU3tMACgkQEsHwGGHe VUqpuhAAqqyi2NNgrIOlEWh/Ej4NQZL7KleF84cSpKCIBK2somYX5ksgMcUgn82i bIuDVErQu/a4lhNAf5zn7TO3yuPA1Q5xd/453qBlWM9ApkH0S69Mp9f0yocVu8F0 t3XsgXm+/R8A4TYbiv8cB+r1Xt8E5NUP6RkNIKCHbPLAG94gqYF8UZEJ9sAl9ZXw qEWc9afpnp+4LQ9PlzePuaM976LWUPB49OoFZMnFmY1VkvFuVjkjXSVzJX6l4qB7 Omo/+TXOOBSHXVVflNx/68Q16irFHAnqwPPrLCBQWBLIPz3iRiZjV9ptD9tUZkRM M+klL7w0jRG+8wa9fTwuqybmBNIBt4Az1/WUw9Lc3ryEWRsCKzkGT8au3lv5FpQY CTwIIBSMmUcqQSG40R0HHS3nDR4UBFFD0PAww+8cJQZc0IPd2rT9/hfqYdt3sq2Z vV9rmTFOcDlApeDdCGcfC7zJhdgVuBgDVjdTsE5SNRUduBUsBYOeLDnT+0Qi0ArJ txVINGxQDm6jz512f4CAB/xzUcYpU4o639Z1Jkd6a8QbO1NBZGX1ioOcvPEMhmFF f/qFyM8ctR5Kj6LJCZiDcstqtAZviW1d2uMp48gk2QeSvkCyIUQqrWshItd02iBG TZdSYRvSYtYSIz7WYtE/CABUDmrJGjuLtb+jOrR93//TsWwwVdE= =1D7H -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduling fixes from Borislav Petkov: - Add PREEMPT_RT maintainers - Fix another aspect of delayed dequeued tasks wrt determining their state, i.e., whether they're runnable or blocked - Handle delayed dequeued tasks and their migration wrt PSI properly - Fix the situation where a delayed dequeue task gets enqueued into a new class, which should not happen - Fix a case where memory allocation would happen while the runqueue lock is held, which is a no-no - Do not over-schedule when tasks with shorter slices preempt the currently running task - Make sure delayed to deque entities are properly handled before unthrottling - Other smaller cleanups and improvements * tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Add an entry for PREEMPT_RT. sched/fair: Fix external p->on_rq users sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug sched/core: Dequeue PSI signals for blocked tasks that are delayed sched: Fix delayed_dequeue vs switched_from_fair() sched/core: Disable page allocation in task_tick_mm_cid() sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl() sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running sched: Fix sched_delayed vs cfs_bandwidth |
||
![]() |
06526daaff |
ftrace: A couple of fixes to function graph infrastructure
- Fix allocation of idle shadow stack allocation during hotplug If function graph tracing is started when a CPU is offline, if it were come online during the trace then the idle task that represents the CPU will not get a shadow stack allocated for it. This means all function graph hooks that happen while that idle task is running (including in interrupt mode) will have all its events dropped. Switch over to the CPU hotplug mechanism that will have any newly brought on line CPU get a callback that can allocate the shadow stack for its idle task. - Fix allocation size of the ret_stack_list array When function graph tracing converted over to allowing more than one user at a time, it had to convert its shadow stack from an array of ret_stack structures to an array of unsigned longs. The shadow stacks are allocated in batches of 32 at a time and assigned to every running task. The batch is held by the ret_stack_list array. But when the conversion happened, instead of allocating an array of 32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE). This ret_stack_list gets passed to a function that iterates over what it believes is its size defined by the FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32). Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise this would have been an array overflow. This still should be fixed and the ret_stack_list should be allocated to the size it is expected to be as someday it may end up being bigger than SHADOW_STACK_SIZE. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZxP8RhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qmW3AP4qCOvU/g9g6u32gIZmS1oUWqe3q+Rq 9OKCk0JP6GGc8AD/cF816lbs5vpDiZFdbBvaz5gLHqhfAt35NVU8T5tbJA4= =Lh3A -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: "A couple of fixes to function graph infrastructure: - Fix allocation of idle shadow stack allocation during hotplug If function graph tracing is started when a CPU is offline, if it were come online during the trace then the idle task that represents the CPU will not get a shadow stack allocated for it. This means all function graph hooks that happen while that idle task is running (including in interrupt mode) will have all its events dropped. Switch over to the CPU hotplug mechanism that will have any newly brought on line CPU get a callback that can allocate the shadow stack for its idle task. - Fix allocation size of the ret_stack_list array When function graph tracing converted over to allowing more than one user at a time, it had to convert its shadow stack from an array of ret_stack structures to an array of unsigned longs. The shadow stacks are allocated in batches of 32 at a time and assigned to every running task. The batch is held by the ret_stack_list array. But when the conversion happened, instead of allocating an array of 32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE). This ret_stack_list gets passed to a function that iterates over what it believes is its size defined by the FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32). Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise this would have been an array overflow. This still should be fixed and the ret_stack_list should be allocated to the size it is expected to be as someday it may end up being bigger than SHADOW_STACK_SIZE" * tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fgraph: Allocate ret_stack_list with proper size fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks |
||
![]() |
514da6924e |
ring-buffer: Use str_low_high() helper in ring_buffer_producer()
Remove hard-coded strings by using the helper function str_low_high(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241018110709.111707-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
0b60a7fb60 |
ring-buffer: Reorganize kerneldoc parameter names
Reorganize kerneldoc parameter names to match the parameter order in the function header. Problems identified using Coccinelle. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240930112121.95324-22-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
b237e1f7d2 |
ring-buffer: Limit time with disabled interrupts in rb_check_pages()
The function rb_check_pages() validates the integrity of a specified per-CPU tracing ring buffer. It does so by traversing the underlying linked list and checking its next and prev links. To guarantee that the list isn't modified during the check, a caller typically needs to take cpu_buffer->reader_lock. This prevents the check from running concurrently, for example, with a potential reader which can make the list temporarily inconsistent when swapping its old reader page into the buffer. A problem with this approach is that the time when interrupts are disabled is non-deterministic, dependent on the ring buffer size. This particularly affects PREEMPT_RT because the reader_lock is a raw spinlock which doesn't become sleepable on PREEMPT_RT kernels. Modify the check so it still attempts to traverse the entire list, but gives up the reader_lock between checking individual pages. Introduce for this purpose a new variable ring_buffer_per_cpu.cnt which is bumped any time the list is modified. The value is used by rb_check_pages() to detect such a change and restart the check. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241015112810.27203-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
fae4078c28 |
fgraph: Allocate ret_stack_list with proper size
The ret_stack_list is an array of ret_stack shadow stacks for the function
graph usage. When the first function graph is enabled, all tasks in the
system get a shadow stack. The ret_stack_list is a 32 element array of
pointers to these shadow stacks. It allocates the shadow stack in batches
(32 stacks at a time), assigns them to running tasks, and continues until
all tasks are covered.
When the function graph shadow stack changed from an array of
ftrace_ret_stack structures to an array of longs, the allocation of
ret_stack_list went from allocating an array of 32 elements to just a
block defined by SHADOW_STACK_SIZE. Luckily, that's defined as PAGE_SIZE
and is much more than enough to hold 32 pointers. But it is way overkill
for the amount needed to allocate.
Change the allocation of ret_stack_list back to a kcalloc() of
FTRACE_RETSTACK_ALLOC_SIZE pointers.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241018215212.23f13f40@rorschach
Fixes:
|
||
![]() |
2c02f7375e |
fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
The function graph infrastructure allocates a shadow stack for every task
when enabled. This includes the idle tasks. The first time the function
graph is invoked, the shadow stacks are created and never freed until the
task exits. This includes the idle tasks.
Only the idle tasks that were for online CPUs had their shadow stacks
created when function graph tracing started. If function graph tracing is
enabled and a CPU comes online, the idle task representing that CPU will
not have its shadow stack created, and all function graph tracing for that
idle task will be silently dropped.
Instead, use the CPU hotplug mechanism to allocate the idle shadow stacks.
This will include idle tasks for CPUs that come online during tracing.
This issue can be reproduced by:
# cd /sys/kernel/tracing
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > set_ftrace_pid
# echo function_graph > current_tracer
# echo 1 > options/funcgraph-proc
# echo 1 > /sys/devices/system/cpu/cpu1
# grep '<idle>' per_cpu/cpu1/trace | head
Before, nothing would show up.
After:
1) <idle>-0 | 0.811 us | __enqueue_entity();
1) <idle>-0 | 5.626 us | } /* enqueue_entity */
1) <idle>-0 | | dl_server_update_idle_time() {
1) <idle>-0 | | dl_scaled_delta_exec() {
1) <idle>-0 | 0.450 us | arch_scale_cpu_capacity();
1) <idle>-0 | 1.242 us | }
1) <idle>-0 | 1.908 us | }
1) <idle>-0 | | dl_server_start() {
1) <idle>-0 | | enqueue_dl_entity() {
1) <idle>-0 | | task_contending() {
Note, if tracing stops and restarts, the old way would then initialize
the onlined CPUs.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241018214300.6df82178@rorschach
Fixes:
|
||
![]() |
3d5ad2d4ec |
BPF fixes:
- Fix BPF verifier to not affect subreg_def marks in its range propagation, from Eduard Zingerman. - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx, from Dimitar Kanaliev. - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition, from Daniel Borkmann. - Fix a NULL pointer dereference in BPF devmap due to missing rxq information, from Florian Kauer. - Fix a memory leak in bpf_core_apply, from Jiri Olsa. - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs, from Hou Tao. - Fix build ID fetching where memory areas backing the file were created with memfd_secret, from Andrii Nakryiko. - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid, from Jordan Rome. - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks, from Michal Luczaj. - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered, from Andrea Parri. - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall, from Pu Lehui. - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved, from Thomas Weißschuh. - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned, from Toke Høiland-Jørgensen. - Fix a BPF selftest compilation error in cgroup-related tests with musl libc, from Tony Ambardar. - Several fixes to BPF link info dumps to fill missing fields, from Tyrone Wu. - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called, from Simon Sundberg. - Ensure that internal and user-facing bpf_redirect flags don't overlap, also from Toke Høiland-Jørgensen. - Switch to use kvzmalloc to allocate BPF verifier environment, from Rik van Riel. - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT, from Wander Lairson Costa. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxK4OhUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISIOCrwEAib2kC5EEQn5+wKVE/bnZryVX2leT YXdfItDCBU6zCYUA+wTU5hGGn9lcDUcZx72l/KZPDyPw7HdzNJ+6iR1zQqoM =f9kv -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to not affect subreg_def marks in its range propagation (Eduard Zingerman) - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx (Dimitar Kanaliev) - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition (Daniel Borkmann) - Fix a NULL pointer dereference in BPF devmap due to missing rxq information (Florian Kauer) - Fix a memory leak in bpf_core_apply (Jiri Olsa) - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs (Hou Tao) - Fix build ID fetching where memory areas backing the file were created with memfd_secret (Andrii Nakryiko) - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid (Jordan Rome) - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks (Michal Luczaj) - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri) - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall (Pu Lehui) - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved (Thomas Weißschuh) - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned (Toke Høiland-Jørgensen) - Fix a BPF selftest compilation error in cgroup-related tests with musl libc (Tony Ambardar) - Several fixes to BPF link info dumps to fill missing fields (Tyrone Wu) - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called (Simon Sundberg) - Ensure that internal and user-facing bpf_redirect flags don't overlap (Toke Høiland-Jørgensen) - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van Riel) - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT (Wander Lairson Costa) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits) lib/buildid: Handle memfd_secret() files in build_id_parse() selftests/bpf: Add test case for delta propagation bpf: Fix print_reg_state's constant scalar dump bpf: Fix incorrect delta propagation between linked registers bpf: Properly test iter/task tid filtering bpf: Fix iter/task tid filtering riscv, bpf: Make BPF_CMPXCHG fully ordered bpf, vsock: Drop static vsock_bpf_prot initialization vsock: Update msg_count on read_skb() vsock: Update rx_bytes on read_skb() bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock selftests/bpf: Add asserts for netfilter link info bpf: Fix link info netfilter flags to populate defrag flag selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx() selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx() bpf: Fix truncation bug in coerce_reg_to_size_sx() selftests/bpf: Assert link info uprobe_multi count & path_size if unset bpf: Fix unpopulated path_size when uprobe_multi fields unset selftests/bpf: Fix cross-compiling urandom_read selftests/bpf: Add test for kfunc module order ... |
||
![]() |
be602cde65 |
Merge branch 'linus' into sched/urgent, to resolve conflict
Conflicts: kernel/sched/ext.c There's a context conflict between this upstream commit: |
||
![]() |
09661f75e7 |
ring-buffer: Fix reader locking when changing the sub buffer order
The function ring_buffer_subbuf_order_set() updates each
ring_buffer_per_cpu and installs new sub buffers that match the requested
page order. This operation may be invoked concurrently with readers that
rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or
the ring_buffer_per_cpu.pages and reader_page pointers. However, no
exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying
the mentioned data while a reader also operates on them can then result in
incorrect memory access and various crashes.
Fix the problem by taking the reader_lock when updating a specific
ring_buffer_per_cpu in ring_buffer_subbuf_order_set().
Link: https://lore.kernel.org/linux-trace-kernel/20240715145141.5528-1-petr.pavlu@suse.com/
Link: https://lore.kernel.org/linux-trace-kernel/20241010195849.2f77cc3f@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20241011112850.17212b25@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241015112440.26987-1-petr.pavlu@suse.com
Fixes:
|
||
![]() |
2cf9733891 |
ring-buffer: Fix refcount setting of boot mapped buffers
A ring buffer which has its buffered mapped at boot up to fixed memory
should not be freed. Other buffers can be. The ref counting setup was
wrong for both. It made the not mapped buffers ref count have zero, and the
boot mapped buffer a ref count of 1. But an normally allocated buffer
should be 1, where it can be removed.
Keep the ref count of a normal boot buffer with its setup ref count (do
not decrement it), and increment the fixed memory boot mapped buffer's ref
count.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home
Fixes:
|
||
![]() |
cd9626e9eb |
sched/fair: Fix external p->on_rq users
Sean noted that ever since commit |
||
![]() |
ad6b5b6ea9 |
bpf: Fix unpopulated path_size when uprobe_multi fields unset
Previously when retrieving `bpf_link_info.uprobe_multi` with `path` and
`path_size` fields unset, the `path_size` field is not populated
(remains 0). This behavior was inconsistent with how other input/output
string buffer fields work, as the field should be populated in cases
when:
- both buffer and length are set (currently works as expected)
- both buffer and length are unset (not working as expected)
This patch now fills the `path_size` field when `path` and `path_size`
are unset.
Fixes:
|
||
![]() |
7888af4166 |
ftrace: Make ftrace_regs abstract from direct use
ftrace_regs was created to hold registers that store information to save function parameters, return value and stack. Since it is a subset of pt_regs, it should only be used by its accessor functions. But because pt_regs can easily be taken from ftrace_regs (on most archs), it is tempting to use it directly. But when running on other architectures, it may fail to build or worse, build but crash the kernel! Instead, make struct ftrace_regs an empty structure and have the architectures define __arch_ftrace_regs and all the accessor functions will typecast to it to get to the actual fields. This will help avoid usage of ftrace_regs directly. Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/ Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
c73eb02a47 |
fgragh: No need to invoke the function call_filter_check_discard()
The function call_filter_check_discard() has been removed in the commit |
||
![]() |
0a6c61bc9c |
fgraph: Simplify return address printing in function graph tracer
Simplify return address printing in the function graph tracer by removing fgraph_extras. Since this feature is only used by the function graph tracer and the feature flags can directly accessible from the function graph tracer, fgraph_extras can be removed from the fgraph callback. Cc: Donglin Peng <dolinux.peng@gmail.com> Link: https://lore.kernel.org/172857234900.270774.15378354017601069781.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
eb887c4567 |
tracing: Use atomic64_inc_return() in trace_clock_counter()
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241007085651.48544-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
afe5960dc2 |
trace/trace_event_perf: remove duplicate samples on the first tracepoint event
When a tracepoint event is created with attr.freq = 1,
'hwc->period_left' is not initialized correctly. As a result,
in the perf_swevent_overflow() function, when the first time the event occurs,
it calculates the event overflow and the perf_swevent_set_period() returns 3,
this leads to the event are recorded for three duplicate times.
Step to reproduce:
1. Enable the tracepoint event & starting tracing
$ echo 1 > /sys/kernel/tracing/events/module/module_free
$ echo 1 > /sys/kernel/tracing/tracing_on
2. Record with perf
$ perf record -a --strict-freq -F 1 -e "module:module_free"
3. Trigger module_free event.
$ modprobe -i sunrpc
$ modprobe -r sunrpc
Result:
- Trace pipe result:
$ cat trace_pipe
modprobe-174509 [003] ..... 6504.868896: module_free: sunrpc
- perf sample:
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
By setting period_left via perf_swevent_set_period() as other sw_event did,
This problem could be solved.
After patch:
- Trace pipe result:
$ cat trace_pipe
modprobe 1153096 [068] 613468.867774: module:module_free: xfs
- perf sample
modprobe 1153096 [068] 613468.867794: module:module_free: xfs
Link: https://lore.kernel.org/20240913021347.595330-1-yeoreum.yun@arm.com
Fixes:
|
||
![]() |
cdb537ac41 |
tracing/perf: Add might_fault check to syscall probes
Add a might_fault() check to validate that the perf sys_enter/sys_exit probe callbacks are indeed called from a context where page faults can be handled. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-8-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
a3204c740a |
tracing/ftrace: Add might_fault check to syscall probes
Add a might_fault() check to validate that the ftrace sys_enter/sys_exit probe callbacks are indeed called from a context where page faults can be handled. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-7-mathieu.desnoyers@efficios.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
65e7462a16 |
tracing/perf: disable preemption in syscall probe
In preparation for allowing system call enter/exit instrumentation to handle page faults, make sure that perf can handle this change by explicitly disabling preemption within the perf system call tracepoint probes to respect the current expectations within perf ring buffer code. This change does not yet allow perf to take page faults per se within its probe, but allows its existing probes to adapt to the upcoming change. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-4-mathieu.desnoyers@efficios.com Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
13d750c2c0 |
tracing/ftrace: disable preemption in syscall probe
In preparation for allowing system call enter/exit instrumentation to handle page faults, make sure that ftrace can handle this change by explicitly disabling preemption within the ftrace system call tracepoint probes to respect the current expectations within ftrace ring buffer code. This change does not yet allow ftrace to take page faults per se within its probe, but allows its existing probes to adapt to the upcoming change. Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241009010718.2050182-3-mathieu.desnoyers@efficios.com Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
912da2c384 |
ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
The boot mapped ring buffer has its buffer mapped at a fixed location
found at boot up. It is not dynamic. It cannot grow or be expanded when
new CPUs come online.
Do not hook fixed memory mapped ring buffers to the CPU hotplug callback,
otherwise it can cause a crash when it tries to add the buffer to the
memory that is already fully occupied.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home
Fixes:
|
||
![]() |
48bcda6848 |
tracing: Remove definition of trace_*_rcuidle()
The trace_*_rcuidle() variant of a tracepoint was to handle places where a tracepoint was located but RCU was not "watching". All those locations have been removed, and RCU should be watching where all tracepoints are located. We can now remove the trace_*_rcuidle() variant. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Joel Fernandes <joel@joelfernandes.org> Link: https://lore.kernel.org/20241003181629.36209057@gandalf.local.home Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
4a8840af5f |
tracepoints: Use new static branch API
The old static key API is deprecated. Switch to the new one. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/7a08dae3c5eddb14b13864923c1b58ac1f4af83c.1728414936.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
49e4154f4b |
tracing: Remove TRACE_EVENT_FL_FILTERED logic
After commit
|
||
![]() |
2aa746ec02 |
tracing/branch-profiler: Replace deprecated strncpy with strscpy
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Both of these fields want to be NUL-terminated as per their use in printk: F_printk("%u:%s:%s (%u)%s", __entry->line, __entry->func, __entry->file, __entry->correct, __entry->constant ? " CONSTANT" : "") Use strscpy() as it NUL-terminates the destination buffer, so it doesn't have to be done manually. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/20240826-strncpy-kernel-trace-trace_branch-c-v1-1-b2c14f2e9e84@google.com Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
e32540b1e4 |
ftrace: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
Use this_cpu_ptr() instead of open coding the equivalent in various ftrace functions. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/87y14t6ofi.wl-me@linux.beauty Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
474ec3e849 |
function_graph: Remove unnecessary initialization in ftrace_graph_ret_addr()
After the commit
|
||
![]() |
21e92806d3 |
function_graph: Support recording and printing the function return address
When using function_graph tracer to analyze the flow of kernel function execution, it is often necessary to quickly locate the exact line of code where the call occurs. While this may be easy at times, it can be more time-consuming when some functions are inlined or the flow is too long. This feature aims to simplify the process by recording the return address of traced funcions and printing it when outputing trace logs. To enhance human readability, the prefix 'ret=' is used for the kernel return value, while '<-' serves as the prefix for the return address in trace logs to make it look more like the function tracer. A new trace option named 'funcgraph-retaddr' has been introduced, and the existing option 'sym-addr' can be used to control the format of the return address. See below logs with both funcgraph-retval and funcgraph-retaddr enabled. 0) | load_elf_binary() { /* <-bprm_execve+0x249/0x600 */ 0) | load_elf_phdrs() { /* <-load_elf_binary+0x84/0x1730 */ 0) | __kmalloc_noprof() { /* <-load_elf_phdrs+0x4a/0xb0 */ 0) 3.657 us | __cond_resched(); /* <-__kmalloc_noprof+0x28c/0x390 ret=0x0 */ 0) + 24.335 us | } /* __kmalloc_noprof ret=0xffff8882007f3000 */ 0) | kernel_read() { /* <-load_elf_phdrs+0x6c/0xb0 */ 0) | rw_verify_area() { /* <-kernel_read+0x2b/0x50 */ 0) | security_file_permission() { /* <-kernel_read+0x2b/0x50 */ 0) | selinux_file_permission() { /* <-security_file_permission+0x26/0x40 */ 0) | __inode_security_revalidate() { /* <-selinux_file_permission+0x6d/0x140 */ 0) 2.034 us | __cond_resched(); /* <-__inode_security_revalidate+0x5f/0x80 ret=0x0 */ 0) 6.602 us | } /* __inode_security_revalidate ret=0x0 */ 0) 2.214 us | avc_policy_seqno(); /* <-selinux_file_permission+0x107/0x140 ret=0x0 */ 0) + 16.670 us | } /* selinux_file_permission ret=0x0 */ 0) + 20.809 us | } /* security_file_permission ret=0x0 */ 0) + 25.217 us | } /* rw_verify_area ret=0x0 */ 0) | __kernel_read() { /* <-load_elf_phdrs+0x6c/0xb0 */ 0) | ext4_file_read_iter() { /* <-__kernel_read+0x160/0x2e0 */ Then, we can use the faddr2line to locate the source code, for example: $ ./scripts/faddr2line ./vmlinux load_elf_phdrs+0x6c/0xb0 load_elf_phdrs+0x6c/0xb0: elf_read at fs/binfmt_elf.c:471 (inlined by) load_elf_phdrs at fs/binfmt_elf.c:531 Link: https://lore.kernel.org/20240915032912.1118397-1-dolinux.peng@gmail.com Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409150605.HgUmU8ea-lkp@intel.com/ Signed-off-by: Donglin Peng <dolinux.peng@gmail.com> [ Rebased to handle text_delta offsets ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
ac1987f8f5 |
rv: Fix a typo
Fix a typo in comments. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20240911114349.20449-1-algonell@gmail.com Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Kreimer <algonell@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
2a13ca2e8a |
tracing/hwlat: Fix a race during cpuhp processing
The cpuhp online/offline processing race also exists in percpu-mode hwlat
tracer in theory, apply the fix too. That is:
T1 | T2
[CPUHP_ONLINE] | cpu_device_down()
hwlat_hotplug_workfn() |
| cpus_write_lock()
| takedown_cpu(1)
| cpus_write_unlock()
[CPUHP_OFFLINE] |
cpus_read_lock() |
start_kthread(1) |
cpus_read_unlock() |
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
Fixes:
|
||
![]() |
829e0c9f08 |
tracing/timerlat: Fix a race during cpuhp processing
There is another found exception that the "timerlat/1" thread was
scheduled on CPU0, and lead to timer corruption finally:
```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
<TASK>
? __warn+0x7c/0x110
? debug_print_object+0x7d/0xb0
? report_bug+0xf1/0x1d0
? prb_read_valid+0x17/0x20
? handle_bug+0x3f/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? debug_print_object+0x7d/0xb0
? debug_print_object+0x7d/0xb0
? __pfx_timerlat_irq+0x10/0x10
__debug_object_init+0x110/0x150
hrtimer_init+0x1d/0x60
timerlat_main+0xab/0x2d0
? __pfx_timerlat_main+0x10/0x10
kthread+0xb7/0xe0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
```
After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:
T1 | T2
[CPUHP_ONLINE] | cpu_device_down()
osnoise_hotplug_workfn() |
| cpus_write_lock()
| takedown_cpu(1)
| cpus_write_unlock()
[CPUHP_OFFLINE] |
cpus_read_lock() |
start_kthread(1) |
cpus_read_unlock() |
To fix this, skip online processing if the CPU is already offline.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes:
|
||
![]() |
b484a02c9c |
tracing/timerlat: Drop interface_lock in stop_kthread()
stop_kthread() is the offline callback for "trace/osnoise:online", since commit |
||
![]() |
0bb0a5c12e |
tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
osnoise_hotplug_workfn() is the asynchronous online callback for
"trace/osnoise:online". It may be congested when a CPU goes online and
offline repeatedly and is invoked for multiple times after a certain
online.
This will lead to kthread leak and timer corruption. Add a check
in start_kthread() to prevent this situation.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
Fixes:
|
||
![]() |
50a3242d84 |
tracing: Fix trace_check_vprintf() when tp_printk is used
When the tp_printk kernel command line is used, the trace events go
directly to printk(). It is still checked via the trace_check_vprintf()
function to make sure the pointers of the trace event are legit.
The addition of reading buffers from previous boots required adding a
delta between the addresses of the previous boot and the current boot so
that the pointers in the old buffer can still be used. But this required
adding a trace_array pointer to acquire the delta offsets.
The tp_printk code does not provide a trace_array (tr) pointer, so when
the offsets were examined, a NULL pointer dereference happened and the
kernel crashed.
If the trace_array does not exist, just default the delta offsets to zero,
as that also means the trace event is not being read from a previous boot.
Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home
Fixes:
|
||
![]() |
f1f36e22be |
ftrace: Have calltime be saved in the fgraph storage
The calltime field in the shadow stack frame is only used by the function graph tracer and profiler. But now that there's other users of the function graph infrastructure, this adds overhead and wastes space on the shadow stack. Move the calltime to the fgraph data storage, where the function graph and profiler entry functions will save it in its own graph storage and retrieve it in its exit functions. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/20240914214827.096968730@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
3c9880f3ab |
ftrace: Use a running sleeptime instead of saving on shadow stack
The fgraph "sleep-time" option tells the function graph tracer and the profiler whether to include the time a function "sleeps" (is scheduled off the CPU) in its duration for the function. By default it is true, which means the duration of a function is calculated by the timestamp of when the function was entered to the timestamp of when it exits. If the "sleep-time" option is disabled, it needs to remove the time that the task was not running on the CPU during the function. Currently it is done in a sched_switch tracepoint probe where it moves the "calltime" (time of entry of the function) forward by the sleep time calculated. It updates all the calltime in the shadow stack. This is time consuming for those users of the function graph tracer that does not care about the sleep time. Instead, add a "ftrace_sleeptime" to the task_struct that gets the sleep time added each time the task wakes up. Then have the function entry save the current "ftrace_sleeptime" and on function exit, move the calltime forward by the difference of the current "ftrace_sleeptime" from the saved sleeptime. This removes one dependency of "calltime" needed to be on the shadow stack. It also simplifies the code that removes the sleep time of functions. TODO: Only enable the sched_switch tracepoint when this is needed. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/20240914214826.938908568@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
a312a0f783 |
fgraph: Use fgraph data to store subtime for profiler
Instead of having the "subtime" for the function profiler in the infrastructure ftrace_ret_stack structure, have it use the fgraph data reserve and retrieve functions. This will keep the limited shadow stack from wasting 8 bytes for something that is seldom used. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Olsa <olsajiri@gmail.com> Link: https://lore.kernel.org/20240914214826.780323141@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
c54a1a06da |
tracing: Fix function timing profiler to initialize hashtable
Since the new fgraph requires to initialize fgraph_ops.ops.func_hash before
calling register_ftrace_graph(), initialize it with default (tracing all
functions) parameter.
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
cb787f4ac0 |
[tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit
|
||
![]() |
5159938e10 |
Probes updates for v6.12:
- uprobes: make trace_uprobe->nhit counter a per-CPU one This makes uprobe event's hit counter per-CPU for improving scalability on multi-core environment. - kprobes: Remove obsoleted declaration for init_test_probes Remove unused init_test_probes() from header. - Raw tracepoint probe supports raw tracepoint events on modules. The tracepoint events using fprobe were introduced in v6.5, but tracepoints can be compiled in modules. This supports such a case. This includes the following improvements. . tracepoint: add a function for iterating over all tracepoints in all modules. . tracepoint: Add a function for iterating over tracepoints in a module. . tracing/fprobe: Support raw tracepoint events on modules. . tracing/fprobe: Support raw tracepoints on future loaded modules. This allows user to add tracepoint events on modules which is not loaded yet. . selftests/tracing: Add a test for tracepoint events on modules. -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmb0HXgACgkQ2/sHvwUr Pxs7AAf+K89Q7eyqKLP/oG5LGsnmWwhZHP26HTbGKh7mRaxGE+cf3l1O2lCMAgBt 0Y1J0sHkgRSnubmlPrgEMKKLOKVBwnvwBqbqO8Zw8L3GxMegG5YYsl3Y60Q0T6Gq xiL17sHILbb/yefUqnf6C3QHoSjR4aTMEaQSpux1tsCqG/sLeU7V6DZrWdM5t4Fl CvQDuy//UdQUKFTUC5XOc6lRbKr94ktp/VTxdHZLXa5u6p/slq8ISf9EA+Rrsjkp m+FtW8MpfcYt3K+hs0kV58F43XWeRt9F7OlLf+MlyCeRRQor4xvkVlV0iw6VcRG9 sXt6ml6AmyA2JWRzR5qSKYvMAsNVyA== =GYlS -----END PGP SIGNATURE----- Merge tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - uprobes: make trace_uprobe->nhit counter a per-CPU one This makes uprobe event's hit counter per-CPU for improving scalability on multi-core environment - kprobes: Remove obsoleted declaration for init_test_probes Remove unused init_test_probes() from header - Raw tracepoint probe supports raw tracepoint events on modules: - add a function for iterating over all tracepoints in all modules - add a function for iterating over tracepoints in a module - support raw tracepoint events on modules - support raw tracepoints on future loaded modules - add a test for tracepoint events on modules" * tag 'probes-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: sefltests/tracing: Add a test for tracepoint events on modules tracing/fprobe: Support raw tracepoints on future loaded modules tracing/fprobe: Support raw tracepoint events on modules tracepoint: Support iterating tracepoints in a loading module tracepoint: Support iterating over tracepoints on modules kprobes: Remove obsoleted declaration for init_test_probes uprobes: turn trace_uprobe's nhit counter to be per-CPU one |
||
![]() |
57a7e6de9e |
tracing/fprobe: Support raw tracepoints on future loaded modules
Support raw tracepoint events on future loaded (unloaded) modules. This allows user to create raw tracepoint events which can be used from module's __init functions. Note: since the kernel does not have any information about the tracepoints in the unloaded modules, fprobe events can not check whether the tracepoint exists nor extend the BTF based arguments. Link: https://lore.kernel.org/all/172397780593.286558.18360375226968537828.stgit@devnote2/ Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
![]() |
67e9a9ee47 |
tracing/fprobe: Support raw tracepoint events on modules
Support raw tracepoint event on module by fprobe events. Since it only uses for_each_kernel_tracepoint() to find a tracepoint, the tracepoints on modules are not handled. Thus if user specified a tracepoint on a module, it shows an error. This adds new for_each_module_tracepoint() API to tracepoint subsystem, and uses it to find tracepoints on modules. Link: https://lore.kernel.org/all/172397779651.286558.15903703620679186867.stgit@devnote2/ Reported-by: don <zds100@gmail.com> Closes: https://lore.kernel.org/all/20240530215718.aeec973a1d0bf058d39cb1e3@kernel.org/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
![]() |
10cdb82aa7 |
uprobes: turn trace_uprobe's nhit counter to be per-CPU one
trace_uprobe->nhit counter is not incremented atomically, so its value is questionable in when uprobe is hit on multiple CPUs simultaneously. Also, doing this shared counter increment across many CPUs causes heavy cache line bouncing, limiting uprobe/uretprobe performance scaling with number of CPUs. Solve both problems by making this a per-CPU counter. Link: https://lore.kernel.org/all/20240813203409.3985398-1-andrii@kernel.org/ Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
![]() |
af9c191ac2 |
ring-buffer: Updates for v6.12:
- Merged v6.11-rc3 into trace/ring-buffer/core The v6.10 ring buffer pull request was not made due to Mathieu Desnoyers making a comment to the pull request. Mathieu and I resolved it on IRC, but we did not let Linus know that it was resolved. Linus did not do the pull thinking it still had some unresolved issues. The ring buffer work for 6.12 was dependent on both this pull request as well as the reserve_mem kernel command line option that was going upstream through the memory management tree. The ring buffer repo was being used by others so it could not be rebased. In order to continue the work, the v6.11-rc3 branch was pulled in to get access to the reserve_mem work. This has the 6.11 pull request that did not make it into 6.11, which was: tracing/ring-buffer: Have persistent buffer across reboots This allows for the tracing instance ring buffer to stay persistent across reboots. The way this is done is by adding to the kernel command line: trace_instance=boot_map@0x285400000:12M This will reserve 12 megabytes at the address 0x285400000, and then map the tracing instance "boot_map" ring buffer to that memory. This will appear as a normal instance in the tracefs system: /sys/kernel/tracing/instances/boot_map A user could enable tracing in that instance, and on reboot or kernel crash, if the memory is not wiped by the firmware, it will recreate the trace in that instance. For example, if one was debugging a shutdown of a kernel reboot: # cd /sys/kernel/tracing # echo function > instances/boot_map/current_tracer # reboot [..] # cd /sys/kernel/tracing # tail instances/boot_map/trace swapper/0-1 [000] d..1. 164.549800: restore_boot_irq_mode <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549801: native_restore_boot_irq_mode <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549802: disconnect_bsp_APIC <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549811: hpet_disable <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549812: iommu_shutdown_noop <-native_machine_restart swapper/0-1 [000] d..1. 164.549813: native_machine_emergency_restart <-__do_sys_reboot swapper/0-1 [000] d..1. 164.549813: tboot_shutdown <-native_machine_emergency_restart swapper/0-1 [000] d..1. 164.549820: acpi_reboot <-native_machine_emergency_restart swapper/0-1 [000] d..1. 164.549821: acpi_reset <-acpi_reboot swapper/0-1 [000] d..1. 164.549822: acpi_os_write_port <-acpi_reboot On reboot, the buffer is examined to make sure it is valid. The validation check even steps through every event to make sure the meta data of the event is correct. If any test fails, it will simply reset the buffer, and the buffer will be empty on boot. The new changes for 6.12 are: - Allow the tracing persistent boot buffer to use the "reserve_mem" option Instead of having the admin find a physical address to store the persistent buffer, which can be very tedious if they have to administrate several different machines, allow them to use the "reserve_mem" option that will find a location for them. It is not as reliable because of KASLR, as the loading of the kernel in different locations can cause the memory allocated to be inconsistent. Booting with "nokaslr" can make reserve_mem more reliable. - Have function graph tracer handle offsets from a previous boot. The ring buffer output from a previous boot may have different addresses due to kaslr. Have the function graph tracer handle these by using the delta from the previous boot to the new boot address space. - Only reset the saved meta offset when the buffer is started or reset In the persistent memory meta data, it holds the previous address space information, so that it can calculate the delta to have function tracing work. But this gets updated after being read to hold the new address space. But if the buffer isn't used for that boot, on reboot, the delta is now calculated from the previous boot and not the boot that holds the data in the ring buffer. This causes the functions not to be shown. Do not save the address space information of the current kernel until it is being recorded. - Add a magic variable to test the valid meta data Add a magic variable in the meta data that can also be used for validation. The validator of the previous buffer doesn't need this magic data, but it can be used if the meta data is changed by a new kernel, which may have the same format that passes the validator but is used differently. This magic number can also be used as a "versioning" of the meta data. - Align user space mapped ring buffer sub buffers to improve TLB entries Linus mentioned that the mapped ring buffer sub buffers were misaligned between the meta page and the sub-buffers, so that if the sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to use bigger entries. - Add new kernel command line "traceoff" to disable tracing on boot for instances If tracing is enabled for a boot instance, there needs a way to be able to disable it on boot so that new events do not get entered into the ring buffer and be mixed with events from a previous boot, as that can be confusing. - Allow trace_printk() to go to other instances Currently, trace_printk() can only go to the top level instance. When debugging with a persistent buffer, it is really useful to be able to add trace_printk() to go to that buffer, so that you have access to them after a crash. - Do not use "bin_printk()" for traces to a boot instance The bin_printk() saves only a pointer to the printk format in the ring buffer, as the reader of the buffer can still have access to it. But this is not the case if the buffer is from a previous boot. If the trace_printk() is going to a "persistent" buffer, it will use the slower version that writes the printk format into the buffer. - Add command line option to allow trace_printk() to go to an instance Allow the kernel command line to define which instance the trace_printk() goes to, instead of forcing the admin to set it for every boot via the tracefs options. - Start a document that explains how to use tracefs to debug the kernel - Add some more kernel selftests to test user mapped ring buffer -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZu/PxxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qowiAQCx86Nm48aCACjrvGWCFb+jgQZn8QdO MeK15Fcc5C3b5gEAkJkDKqtul7ybI9+vq+3yNzdl7pO7Y7+pCNzz3PfVaQA= =Ce81 -----END PGP SIGNATURE----- Merge tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer updates from Steven Rostedt: - tracing/ring-buffer: persistent buffer across reboots This allows for the tracing instance ring buffer to stay persistent across reboots. The way this is done is by adding to the kernel command line: trace_instance=boot_map@0x285400000:12M This will reserve 12 megabytes at the address 0x285400000, and then map the tracing instance "boot_map" ring buffer to that memory. This will appear as a normal instance in the tracefs system: /sys/kernel/tracing/instances/boot_map A user could enable tracing in that instance, and on reboot or kernel crash, if the memory is not wiped by the firmware, it will recreate the trace in that instance. For example, if one was debugging a shutdown of a kernel reboot: # cd /sys/kernel/tracing # echo function > instances/boot_map/current_tracer # reboot [..] # cd /sys/kernel/tracing # tail instances/boot_map/trace swapper/0-1 [000] d..1. 164.549800: restore_boot_irq_mode <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549801: native_restore_boot_irq_mode <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549802: disconnect_bsp_APIC <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549811: hpet_disable <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549812: iommu_shutdown_noop <-native_machine_restart swapper/0-1 [000] d..1. 164.549813: native_machine_emergency_restart <-__do_sys_reboot swapper/0-1 [000] d..1. 164.549813: tboot_shutdown <-native_machine_emergency_restart swapper/0-1 [000] d..1. 164.549820: acpi_reboot <-native_machine_emergency_restart swapper/0-1 [000] d..1. 164.549821: acpi_reset <-acpi_reboot swapper/0-1 [000] d..1. 164.549822: acpi_os_write_port <-acpi_reboot On reboot, the buffer is examined to make sure it is valid. The validation check even steps through every event to make sure the meta data of the event is correct. If any test fails, it will simply reset the buffer, and the buffer will be empty on boot. - Allow the tracing persistent boot buffer to use the "reserve_mem" option Instead of having the admin find a physical address to store the persistent buffer, which can be very tedious if they have to administrate several different machines, allow them to use the "reserve_mem" option that will find a location for them. It is not as reliable because of KASLR, as the loading of the kernel in different locations can cause the memory allocated to be inconsistent. Booting with "nokaslr" can make reserve_mem more reliable. - Have function graph tracer handle offsets from a previous boot. The ring buffer output from a previous boot may have different addresses due to kaslr. Have the function graph tracer handle these by using the delta from the previous boot to the new boot address space. - Only reset the saved meta offset when the buffer is started or reset In the persistent memory meta data, it holds the previous address space information, so that it can calculate the delta to have function tracing work. But this gets updated after being read to hold the new address space. But if the buffer isn't used for that boot, on reboot, the delta is now calculated from the previous boot and not the boot that holds the data in the ring buffer. This causes the functions not to be shown. Do not save the address space information of the current kernel until it is being recorded. - Add a magic variable to test the valid meta data Add a magic variable in the meta data that can also be used for validation. The validator of the previous buffer doesn't need this magic data, but it can be used if the meta data is changed by a new kernel, which may have the same format that passes the validator but is used differently. This magic number can also be used as a "versioning" of the meta data. - Align user space mapped ring buffer sub buffers to improve TLB entries Linus mentioned that the mapped ring buffer sub buffers were misaligned between the meta page and the sub-buffers, so that if the sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to use bigger entries. - Add new kernel command line "traceoff" to disable tracing on boot for instances If tracing is enabled for a boot instance, there needs a way to be able to disable it on boot so that new events do not get entered into the ring buffer and be mixed with events from a previous boot, as that can be confusing. - Allow trace_printk() to go to other instances Currently, trace_printk() can only go to the top level instance. When debugging with a persistent buffer, it is really useful to be able to add trace_printk() to go to that buffer, so that you have access to them after a crash. - Do not use "bin_printk()" for traces to a boot instance The bin_printk() saves only a pointer to the printk format in the ring buffer, as the reader of the buffer can still have access to it. But this is not the case if the buffer is from a previous boot. If the trace_printk() is going to a "persistent" buffer, it will use the slower version that writes the printk format into the buffer. - Add command line option to allow trace_printk() to go to an instance Allow the kernel command line to define which instance the trace_printk() goes to, instead of forcing the admin to set it for every boot via the tracefs options. - Start a document that explains how to use tracefs to debug the kernel - Add some more kernel selftests to test user mapped ring buffer * tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits) selftests/ring-buffer: Handle meta-page bigger than the system selftests/ring-buffer: Verify the entire meta-page padding tracing/Documentation: Start a document on how to debug with tracing tracing: Add option to set an instance to be the trace_printk destination tracing: Have trace_printk not use binary prints if boot buffer tracing: Allow trace_printk() to go to other instance buffers tracing: Add "traceoff" flag to boot time tracing instances ring-buffer: Align meta-page to sub-buffers for improved TLB usage ring-buffer: Add magic and struct size to boot up meta data ring-buffer: Don't reset persistent ring-buffer meta saved addresses tracing/fgraph: Have fgraph handle previous boot function addresses tracing: Allow boot instances to use reserve_mem boot memory tracing: Fix ifdef of snapshots to not prevent last_boot_info file ring-buffer: Use vma_pages() helper function tracing: Fix NULL vs IS_ERR() check in enable_instances() tracing: Add last boot delta offset for stack traces tracing: Update function tracing output for previous boot buffer tracing: Handle old buffer mappings for event strings and functions tracing/ring-buffer: Add last_boot_info file to boot instance ring-buffer: Save text and data locations in mapped meta data ... |
||
![]() |
440b652328 |
bpf-next-6.12
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbk/nIACgkQ6rmadz2v bTqxuBAAnqW81Rr0nORIxeJMbyo4EiFuYHGk6u5BYP9NPzqHroUPCLVmSP7Hp/Ta CJjsiZeivZsGa6Qlc3BCa4hHNpqP5WE1C/73svSDn7/99EfxdSBtirpMVFUPsUtn DDb5chNpvnxKNS8Mw5Ty8wBrdbXHMlSx+IfaFHpv0Yn6EAcuF4UdoEUq2l3PqhfD Il9Zm127eViPGAP+o+TBZFfW+rRw8d0ngqeRq2GvJ8ibNEDWss+GmBI1Dod7d+fC dUDg96Ipdm1a5Xz7dnH80eXz9JHdpu6qhQrQMKKArnlpJElrKiOf9b17ZcJoPQOR ZnstEnUyVnrWROZxUuKY72+2tx3TuSf+L9uZqFHNx3Ix5FIoS+tFbHf4b8SxtsOb hb2X7SigdGqhQDxUT+IPeO5hsJlIvG1/VYxMXxgc++rh9DjL06hDLUSH1WBSU0fC kFQ7HrcpAlVHtWmGbwwUyVjD+KC/qmZBTAnkcYT4C62WZVytSCnihIuSFAvV1tpZ SSIhVPyQ599UoZIiQYihp0S4qP74FotCtErWSrThneh2Cl8kDsRq//lV1nj/PTV8 CpTvz4VCFDFTgthCfd62fP95EwW5K+aE3NjGTPW/9Hx/0+J/1tT+yqWsrToGaruf TbrqtzQhpclz9UEqA+696cVAXNj9uRU4AoD3YIg72kVnRlkgYd0= =MDwh -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with corresponding support in LLVM. It is similar to existing 'no_caller_saved_registers' attribute in GCC/LLVM with a provision for backward compatibility. It allows compilers generate more efficient BPF code assuming the verifier or JITs will inline or partially inline a helper/kfunc with such attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast, bpf_get_smp_processor_id are the first set of such helpers. - Harden and extend ELF build ID parsing logic. When called from sleepable context the relevants parts of ELF file will be read to find and fetch .note.gnu.build-id information. Also harden the logic to avoid TOCTOU, overflow, out-of-bounds problems. - Improvements and fixes for sched-ext: - Allow passing BPF iterators as kfunc arguments - Make the pointer returned from iter_next method trusted - Fix x86 JIT convergence issue due to growing/shrinking conditional jumps in variable length encoding - BPF_LSM related: - Introduce few VFS kfuncs and consolidate them in fs/bpf_fs_kfuncs.c - Enforce correct range of return values from certain LSM hooks - Disallow attaching to other LSM hooks - Prerequisite work for upcoming Qdisc in BPF: - Allow kptrs in program provided structs - Support for gen_epilogue in verifier_ops - Important fixes: - Fix uprobe multi pid filter check - Fix bpf_strtol and bpf_strtoul helpers - Track equal scalars history on per-instruction level - Fix tailcall hierarchy on x86 and arm64 - Fix signed division overflow to prevent INT_MIN/-1 trap on x86 - Fix get kernel stack in BPF progs attached to tracepoint:syscall - Selftests: - Add uprobe bench/stress tool - Generate file dependencies to drastically improve re-build time - Match JIT-ed and BPF asm with __xlated/__jited keywords - Convert older tests to test_progs framework - Add support for RISC-V - Few fixes when BPF programs are compiled with GCC-BPF backend (support for GCC-BPF in BPF CI is ongoing in parallel) - Add traffic monitor - Enable cross compile and musl libc * tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits) btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh bpf: Call the missed kfree() when there is no special field in btf bpf: Call the missed btf_record_free() when map creation fails selftests/bpf: Add a test case to write mtu result into .rodata selftests/bpf: Add a test case to write strtol result into .rodata selftests/bpf: Rename ARG_PTR_TO_LONG test description selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types bpf: Fix helper writes to read-only maps bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit selftests/bpf: Add tests for sdiv/smod overflow cases bpf: Fix a sdiv overflow issue libbpf: Add bpf_object__token_fd accessor docs/bpf: Add missing BPF program types to docs docs/bpf: Add constant values for linkages bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing ... |
||
![]() |
2004cef11e |
In the v6.12 scheduler development cycle we had 63 commits from 18 contributors:
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks. (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters. (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine. (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbr8qcRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gdbw/+Mj3zWfYP+dtUkfgrR2FClPAJoo1/9Dz0 LYD8XgYHu8rEJ0Aq+VbdkgYGUt9utvzUFPIxvWFDcldQl57KwhF4hp9Ir+PqJyYC NolQ1q8ddo1hnslxnEg6SgHVzQq/4FqMM0nDNUkQETCx6zTyFFeRf+q7o/2c2m5B uI9dSU1Wrx7XrXm2D3kB8+xP+ZRy+qhbFN5Pfuz96mhelfklylgKMfPzgAiCT/7T JTbQhQ2HdcCNgiLoSrWsHBDy2UYpouP4zb4jyd+lDQzhSUJrj3u4Xy4vVmuTKq+y sTgWlgKB+MTuh9UuJ4UYzSnMqg161UlMvtXeH84ABmAqDNGHRPtOKrrlcLtJ3D4x m1SPhNnsvpjOu2pH0XLIS8al3VUesWND5S+rucHRYSq6Nvhivf4MTvRJlicXXurL Mt2APnIlhGJuKBNWnmyZovVdtO0ZUUPlaZWfr3rCS4txAVo+HwWhsm3uhtTycQqN gazsCiuGh6Jds90ZqA/BvdLWG+DY8J0xLlV3ex4pCXuQ/HFrabVWTyThJsULhrZ2 5mTdWIsocPctNMO9/RHMy7vJI7G7ljgHEquWVn5kiGGzXhK6VwVwKAMpfgXGw+YA yVP6/M7a7g2yEzj69gXkcDa8k/kedMVquJ/G/8YhZM7u7sPqsMjpmaGsqsJRfnpT ChngAzap+kA= =TEC6 -----END PGP SIGNATURE----- Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of Clean up usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousefm and Vincent Guittot) * tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/cpufreq: Use NSEC_PER_MSEC for deadline task cpufreq/cppc: Use NSEC_PER_MSEC for deadline task sched/deadline: Clarify nanoseconds in uapi sched/deadline: Convert schedtool example to chrt sched/debug: Fix the runnable tasks output sched: Fix sched_delayed vs sched_core kernel/sched: Fix util_est accounting for DELAY_DEQUEUE kthread: Fix task state in kthread worker if being frozen sched/pelt: Use rq_clock_task() for hw_pressure sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() sched: Add put_prev_task(.next) sched: Rework dl_server sched: Combine the last put_prev_task() and the first set_next_task() sched: Rework pick_next_task() sched: Split up put_prev_task_balance() sched: Clean up DL server vs core sched sched: Fixup set_next_task() implementations sched: Use set_next_task(.first) where required sched/fair: Properly deactivate sched_delayed task upon class change ... |
||
![]() |
9f0c253ddd |
Performance events changes for v6.12:
- Implement per-PMU context rescheduling to significantly improve single-PMU performance, and related cleanups/fixes. (by Peter Zijlstra and Namhyung Kim) - Fix ancient bug resulting in a lot of events being dropped erroneously at higher sampling frequencies. (by Luo Gengkun) - uprobes enhancements: - Implement RCU-protected hot path optimizations for better performance: "For baseline vs SRCU, peak througput increased from 3.7 M/s (million uprobe triggerings per second) up to about 8 M/s. For uretprobes it's a bit more modest with bump from 2.4 M/s to 5 M/s. For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from 8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%), as we have more work to do on uretprobes side. Even single-thread (no contention) performance is slightly better: 3.276 M/s to 3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%) for uretprobes." (by Andrii Nakryiko et al) - Document mmap_lock, don't abuse get_user_pages_remote(). (by Oleg Nesterov) - Cleanups & fixes to prepare for future work: - Remove uprobe_register_refctr() - Simplify error handling for alloc_uprobe() - Make uprobe_register() return struct uprobe * - Fold __uprobe_unregister() into uprobe_unregister() - Shift put_uprobe() from delete_uprobe() to uprobe_unregister() - BPF: Fix use-after-free in bpf_uprobe_multi_link_attach() (by Oleg Nesterov) - New feature & ABI extension: allow events to use PERF_SAMPLE READ with inheritance, enabling sample based profiling of a group of counters over a hierarchy of processes or threads. (by Ben Gainey) - Intel uncore & power events updates: - Add Arrow Lake and Lunar Lake support - Add PERF_EV_CAP_READ_SCOPE - Clean up and enhance cpumask and hotplug support (by Kan Liang) - Add LNL uncore iMC freerunning support - Use D0:F0 as a default device (by Zhenyu Wang) - Intel PT: fix AUX snapshot handling race. (by Adrian Hunter) - Misc fixes and cleanups. (by James Clark, Jiri Olsa, Oleg Nesterov and Peter Zijlstra) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJEBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmbqxEwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iusw/43UAcAZVof6Qs+j6bVAxSabF66fFfE9Wh jc+F4yZ2MGl9x6a1f392+CPcTdVsYp6G2QtRGMipD+trmi/lhDhmRrhxxD1KWIwP zVGSBx9CSFl0UpCXdGiVrGzT5xpIpJ4qqW2XUVr32n8SxTT5X/vM5ySm6KUXsIrD 2/KXwucT9a7grkl3pvy/A/FUHxaF7oAMJjcIPSvLBveQjQSHUrZoCZdHsRGT9rjS HjzxG6gDy97172z5XV1ej3HJOfFlFTQ1RcoxNqdLfiZ6n3hD4hfmtsXWB5zTzRjT xHaCOmWLhEp5v+fK2+RCFiWUbDBsmW/mecZdrjGb3C1RIDWQhLCXXc95XtrobTvk BkW9QEC/XRB+vU6Ssdv3ugN7yRWxih0BsLU5sy4nlzmwoYt9qOy8fgjRvSBKHr5K Mu1RIFu+KXq++sa7+ZJjUMY70PHQCp2m4AHprG/Y98t93CQMhDXzGVpPzWyQuW/V lqYFjd/CAoCIVGF4Jxq7sqOdZ1emDN+P0WSnnFWssJ0ZJFvxN9ZDPH2AaMk4lwo7 NFW6u3+0Vx9P0m/H6xRQj00Iye2JLMqJNCIA8QtjnB7L6upgVvcIPjgcG58fpV1o xfJekOR1A7T2aQUDlX5t9Cu36ZUImDRmwHj2m1p84s5AANlbD7/fOmffR1Hn9uFj wCTqSpi8Hg== =E3s3 -----END PGP SIGNATURE----- Merge tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events updates from Ingo Molnar: - Implement per-PMU context rescheduling to significantly improve single-PMU performance, and related cleanups/fixes (Peter Zijlstra and Namhyung Kim) - Fix ancient bug resulting in a lot of events being dropped erroneously at higher sampling frequencies (Luo Gengkun) - uprobes enhancements: - Implement RCU-protected hot path optimizations for better performance: "For baseline vs SRCU, peak througput increased from 3.7 M/s (million uprobe triggerings per second) up to about 8 M/s. For uretprobes it's a bit more modest with bump from 2.4 M/s to 5 M/s. For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from 8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%), as we have more work to do on uretprobes side. Even single-thread (no contention) performance is slightly better: 3.276 M/s to 3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%) for uretprobes." (Andrii Nakryiko et al) - Document mmap_lock, don't abuse get_user_pages_remote() (Oleg Nesterov) - Cleanups & fixes to prepare for future work: - Remove uprobe_register_refctr() - Simplify error handling for alloc_uprobe() - Make uprobe_register() return struct uprobe * - Fold __uprobe_unregister() into uprobe_unregister() - Shift put_uprobe() from delete_uprobe() to uprobe_unregister() - BPF: Fix use-after-free in bpf_uprobe_multi_link_attach() (Oleg Nesterov) - New feature & ABI extension: allow events to use PERF_SAMPLE READ with inheritance, enabling sample based profiling of a group of counters over a hierarchy of processes or threads (Ben Gainey) - Intel uncore & power events updates: - Add Arrow Lake and Lunar Lake support - Add PERF_EV_CAP_READ_SCOPE - Clean up and enhance cpumask and hotplug support (Kan Liang) - Add LNL uncore iMC freerunning support - Use D0:F0 as a default device (Zhenyu Wang) - Intel PT: fix AUX snapshot handling race (Adrian Hunter) - Misc fixes and cleanups (James Clark, Jiri Olsa, Oleg Nesterov and Peter Zijlstra) * tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) dmaengine: idxd: Clean up cpumask and hotplug for perfmon iommu/vt-d: Clean up cpumask and hotplug for perfmon perf/x86/intel/cstate: Clean up cpumask and hotplug perf: Add PERF_EV_CAP_READ_SCOPE perf: Generic hotplug support for a PMU with a scope uprobes: perform lockless SRCU-protected uprobes_tree lookup rbtree: provide rb_find_rcu() / rb_find_add_rcu() perf/uprobe: split uprobe_unregister() uprobes: travers uprobe's consumer list locklessly under SRCU protection uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks uprobes: protected uprobe lifetime with SRCU uprobes: revamp uprobe refcounting and lifetime management bpf: Fix use-after-free in bpf_uprobe_multi_link_attach() perf/core: Fix small negative period being ignored perf: Really fix event_function_call() locking perf: Optimize __pmu_ctx_sched_out() perf: Add context time freeze perf: Fix event_function_call() locking perf: Extract a few helpers perf: Optimize context reschedule for single PMU cases ... |
||
![]() |
067610ebaa |
RCU pull request for v6.12
This pull request contains the following branches: context_tracking.15.08.24a: Rename context tracking state related symbols and remove references to "dynticks" in various context tracking state variables and related helpers; force context_tracking_enabled_this_cpu() to be inlined to avoid leaving a noinstr section. csd.lock.15.08.24a: Enhance CSD-lock diagnostic reports; add an API to provide an indication of ongoing CSD-lock stall. nocb.09.09.24a: Update and simplify RCU nocb code to handle (de-)offloading of callbacks only for offline CPUs; fix RT throttling hrtimer being armed from offline CPU. rcutorture.14.08.24a: Remove redundant rcu_torture_ops get_gp_completed fields; add SRCU ->same_gp_state and ->get_comp_state functions; add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU polled grace periods; add CFcommon.arch for arch-specific Kconfig options; print number of update types in rcu_torture_write_types(); add rcutree.nohz_full_patience_delay testing to the TREE07 scenario; add a stall_cpu_repeat module parameter to test repeated CPU stalls; add argument to limit number of CPUs a guest OS can use in torture.sh; rcustall.09.09.24a: Abbreviate RCU CPU stall warnings during CSD-lock stalls; Allow dump_cpu_task() to be called without disabling preemption; defer printing stall-warning backtrace when holding rcu_node lock. srcu.12.08.24a: Make SRCU gp seq wrap-around faster; add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and ->reschedule_count which are used in heuristics governing auto-expediting of normal SRCU grace periods and grace-period-state-machine delays; mark idle SRCU-barrier callbacks to help identify stuck SRCU-barrier callback. rcu.tasks.14.08.24a: Remove RCU Tasks Rude asynchronous APIs as they are no longer used; stop testing RCU Tasks Rude asynchronous APIs; fix access to non-existent percpu regions; check processor-ID assumptions during chosen CPU calculation for callback enqueuing; update description of rtp->tasks_gp_seq grace-period sequence number; add rcu_barrier_cb_is_done() to identify whether a given rcu_barrier callback is stuck; mark idle Tasks-RCU-barrier callbacks; add *torture_stats_print() functions to print detailed diagnostics for Tasks-RCU variants; capture start time of rcu_barrier_tasks*() operation to help distinguish a hung barrier operation from a long series of barrier operations. rcu_scaling_tests.15.08.24a: refscale: Add a TINY scenario to support tests of Tiny RCU and Tiny SRCU; Optimize process_durations() operation; rcuscale: Dump stacks of stalled rcu_scale_writer() instances; dump grace-period statistics when rcu_scale_writer() stalls; mark idle RCU-barrier callbacks to identify stuck RCU-barrier callbacks; print detailed grace-period and barrier diagnostics on rcu_scale_writer() hangs for Tasks-RCU variants; warn if async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude; make all writer tasks report upon hang; tolerate repeated GFP_KERNEL failure in rcu_scale_writer(); use special allocator for rcu_scale_writer(); NULL out top-level pointers to heap memory to avoid double-free bugs on modprobe failures; maintain per-task instead of per-CPU callbacks count to avoid any issues with migration of either tasks or callbacks; constify struct ref_scale_ops. fixes.12.08.24a: Use system_unbound_wq for kfree_rcu work to avoid disturbing isolated CPUs. misc.11.08.24a: Warn on unexpected rcu_state.srs_done_tail state; Better define "atomic" for list_replace_rcu() and hlist_replace_rcu() routines; annotate struct kvfree_rcu_bulk_data with __counted_by(). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZt8+8wAKCRAAHS7/6Z0w pTqoAPwPN//tlEoJx2PRs6t0q+nD1YNvnZawPaRmdzgdM8zJogD+PiSN+XhqRr80 jzyvMDU4Aa0wjUNP3XsCoaCxo7L/lQk= =bZ9z -----END PGP SIGNATURE----- Merge tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Neeraj Upadhyay: "Context tracking: - rename context tracking state related symbols and remove references to "dynticks" in various context tracking state variables and related helpers - force context_tracking_enabled_this_cpu() to be inlined to avoid leaving a noinstr section CSD lock: - enhance CSD-lock diagnostic reports - add an API to provide an indication of ongoing CSD-lock stall nocb: - update and simplify RCU nocb code to handle (de-)offloading of callbacks only for offline CPUs - fix RT throttling hrtimer being armed from offline CPU rcutorture: - remove redundant rcu_torture_ops get_gp_completed fields - add SRCU ->same_gp_state and ->get_comp_state functions - add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU polled grace periods - add CFcommon.arch for arch-specific Kconfig options - print number of update types in rcu_torture_write_types() - add rcutree.nohz_full_patience_delay testing to the TREE07 scenario - add a stall_cpu_repeat module parameter to test repeated CPU stalls - add argument to limit number of CPUs a guest OS can use in torture.sh rcustall: - abbreviate RCU CPU stall warnings during CSD-lock stalls - Allow dump_cpu_task() to be called without disabling preemption - defer printing stall-warning backtrace when holding rcu_node lock srcu: - make SRCU gp seq wrap-around faster - add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and ->reschedule_count which are used in heuristics governing auto-expediting of normal SRCU grace periods and grace-period-state-machine delays - mark idle SRCU-barrier callbacks to help identify stuck SRCU-barrier callback rcu tasks: - remove RCU Tasks Rude asynchronous APIs as they are no longer used - stop testing RCU Tasks Rude asynchronous APIs - fix access to non-existent percpu regions - check processor-ID assumptions during chosen CPU calculation for callback enqueuing - update description of rtp->tasks_gp_seq grace-period sequence number - add rcu_barrier_cb_is_done() to identify whether a given rcu_barrier callback is stuck - mark idle Tasks-RCU-barrier callbacks - add *torture_stats_print() functions to print detailed diagnostics for Tasks-RCU variants - capture start time of rcu_barrier_tasks*() operation to help distinguish a hung barrier operation from a long series of barrier operations refscale: - add a TINY scenario to support tests of Tiny RCU and Tiny SRCU - optimize process_durations() operation rcuscale: - dump stacks of stalled rcu_scale_writer() instances and grace-period statistics when rcu_scale_writer() stalls - mark idle RCU-barrier callbacks to identify stuck RCU-barrier callbacks - print detailed grace-period and barrier diagnostics on rcu_scale_writer() hangs for Tasks-RCU variants - warn if async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude - make all writer tasks report upon hang - tolerate repeated GFP_KERNEL failure in rcu_scale_writer() - use special allocator for rcu_scale_writer() - NULL out top-level pointers to heap memory to avoid double-free bugs on modprobe failures - maintain per-task instead of per-CPU callbacks count to avoid any issues with migration of either tasks or callbacks - constify struct ref_scale_ops Fixes: - use system_unbound_wq for kfree_rcu work to avoid disturbing isolated CPUs Misc: - warn on unexpected rcu_state.srs_done_tail state - better define "atomic" for list_replace_rcu() and hlist_replace_rcu() routines - annotate struct kvfree_rcu_bulk_data with __counted_by()" * tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits) rcu: Defer printing stall-warning backtrace when holding rcu_node lock rcu/nocb: Remove superfluous memory barrier after bypass enqueue rcu/nocb: Conditionally wake up rcuo if not already waiting on GP rcu/nocb: Fix RT throttling hrtimer armed from offline CPU rcu/nocb: Simplify (de-)offloading state machine context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}() rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs() rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck() rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save() rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap rcu: Rename struct rcu_data .dynticks_snap into .watching_snap rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs() rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since() rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs() rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online() context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu() context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*() refscale: Constify struct ref_scale_ops ... |
||
![]() |
32556ce93b |
bpf: Fix helper writes to read-only maps
Lonial found an issue that despite user- and BPF-side frozen BPF map
(like in case of .rodata), it was still possible to write into it from
a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT}
as arguments.
In check_func_arg() when the argument is as mentioned, the meta->raw_mode
is never set. Later, check_helper_mem_access(), under the case of
PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the
subsequent call to check_map_access_type() and given the BPF map is
read-only it succeeds.
The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT
when results are written into them as opposed to read out of them. The
latter indicates that it's okay to pass a pointer to uninitialized memory
as the memory is written to anyway.
However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM
just with additional alignment requirement. So it is better to just get
rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the
fixed size memory types. For this, add MEM_ALIGNED to additionally ensure
alignment given these helpers write directly into the args via *<ptr> = val.
The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>).
MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated
argument types, since in !MEM_FIXED_SIZE cases the verifier does not know
the buffer size a priori and therefore cannot blindly write *<ptr> = val.
Fixes:
|
||
![]() |
376bd59e2a |
bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
Salvatore Benedetto reported an issue that when doing syscall tracepoint tracing the kernel stack is empty. For example, using the following command line bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }' the output for both commands is === Kernel Stack === Further analysis shows that pt_regs used for bpf syscall tracepoint tracing is from the one constructed during user->kernel transition. The call stack looks like perf_syscall_enter+0x88/0x7c0 trace_sys_enter+0x41/0x80 syscall_trace_enter+0x100/0x160 do_syscall_64+0x38/0xf0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The ip address stored in pt_regs is from user space hence no kernel stack is printed. To fix the issue, kernel address from pt_regs is required. In kernel repo, there are already a few cases like this. For example, in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) instances are used to supply ip address or use ip address to construct call stack. Instead of allocate fake_regs in the stack which may consume a lot of bytes, the function perf_trace_buf_alloc() in perf_syscall_{enter, exit}() is leveraged to create fake_regs, which will be passed to perf_call_bpf_{enter,exit}(). For the above bpftrace script, I got the following output with this patch: for tracepoint:syscalls:sys_enter_read === Kernel Stack syscall_trace_enter+407 syscall_trace_enter+407 do_syscall_64+74 entry_SYSCALL_64_after_hwframe+75 === and for tracepoint:syscalls:sys_exit_read === Kernel Stack syscall_exit_work+185 syscall_exit_work+185 syscall_exit_to_user_mode+305 do_syscall_64+118 entry_SYSCALL_64_after_hwframe+75 === Reported-by: Salvatore Benedetto <salvabenedetto@meta.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev |
||
![]() |
d4dd9775ec |
bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpers
Add sleepable implementations of bpf_get_stack() and bpf_get_task_stack() helpers and allow them to be used from sleepable BPF program (e.g., sleepable uprobes). Note, the stack trace IPs capturing itself is not sleepable (that would need to be a separate project), only build ID fetching is sleepable and thus more reliable, as it will wait for data to be paged in, if necessary. For that we make use of sleepable build_id_parse() implementation. Now that build ID related internals in kernel/bpf/stackmap.c can be used both in sleepable and non-sleepable contexts, we need to add additional rcu_read_lock()/rcu_read_unlock() protection around fetching perf_callchain_entry, but with the refactoring in previous commit it's now pretty straightforward. We make sure to do rcu_read_unlock (in sleepable mode only) right before stack_map_get_build_id_offset() call which can sleep. By that time we don't have any more use of perf_callchain_entry. Note, bpf_get_task_stack() will fail for user mode if task != current. And for kernel mode build ID are irrelevant. So in that sense adding sleepable bpf_get_task_stack() implementation is a no-op. It feel right to wire this up for symmetry and completeness, but I'm open to just dropping it until we support `user && crosstask` condition. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
4e378158e5 |
tracing: Drop unused helper function to fix the build
A helper function defined but not used. This, in particular,
prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:
kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
2229 | static inline int run_tracer_selftest(struct tracer *type)
| ^~~~~~~~~~~~~~~~~~~
Fix this by dropping unused functions.
See also commit
|
||
![]() |
af17814334 |
tracing/osnoise: Fix build when timerlat is not enabled
To fix some critical section races, the interface_lock was added to a few
locations. One of those locations was above where the interface_lock was
declared, so the declaration was moved up before that usage.
Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER
ifdef block. As the interface_lock is used outside that config, this broke
the build when CONFIG_OSNOISE_TRACER was enabled but
CONFIG_TIMERLAT_TRACER was not.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Helena Anna" <helena.anna.dubel@intel.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home
Fixes:
|
||
![]() |
900f362e20 |
bpf: Fix uprobe multi pid filter check
Uprobe multi link does its own process (thread leader) filtering before running the bpf program by comparing task's vm pointers. But as Oleg pointed out there can be processes sharing the vm (CLONE_VM), so we can't just compare task->vm pointers, but instead we need to use same_thread_group call. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/bpf/20240905115124.1503998-2-jolsa@kernel.org |
||
![]() |
5bfbcd1ee5 |
tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
The timerlat interface will get and put the task that is part of the
"kthread" field of the osn_var to keep it around until all references are
released. But here's a race in the "stop_kthread()" code that will call
put_task_struct() on the kthread if it is not a kernel thread. This can
race with the releasing of the references to that task struct and the
put_task_struct() can be called twice when it should have been called just
once.
Take the interface_lock() in stop_kthread() to synchronize this change.
But to do so, the function stop_per_cpu_kthreads() needs to change the
loop from for_each_online_cpu() to for_each_possible_cpu() and remove the
cpu_read_lock(), as the interface_lock can not be taken while the cpu
locks are held. The only side effect of this change is that it may do some
extra work, as the per_cpu variables of the offline CPUs would not be set
anyway, and would simply be skipped in the loop.
Remove unneeded "return;" in stop_kthread().
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home
Fixes:
|
||
![]() |
e6a53481da |
tracing/timerlat: Only clear timer if a kthread exists
The timerlat tracer can use user space threads to check for osnoise and
timer latency. If the program using this is killed via a SIGTERM, the
threads are shutdown one at a time and another tracing instance can start
up resetting the threads before they are fully closed. That causes the
hrtimer assigned to the kthread to be shutdown and freed twice when the
dying thread finally closes the file descriptors, causing a use-after-free
bug.
Only cancel the hrtimer if the associated thread is still around. Also add
the interface_lock around the resetting of the tlat_var->kthread.
Note, this is just a quick fix that can be backported to stable. A real
fix is to have a better synchronization between the shutdown of old
threads and the starting of new ones.
Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240905085330.45985730@gandalf.local.home
Fixes:
|
||
![]() |
177e1cc2f4 |
tracing/osnoise: Use a cpumask to know what threads are kthreads
The start_kthread() and stop_thread() code was not always called with the
interface_lock held. This means that the kthread variable could be
unexpectedly changed causing the kthread_stop() to be called on it when it
should not have been, leading to:
while true; do
rtla timerlat top -u -q & PID=$!;
sleep 5;
kill -INT $PID;
sleep 0.001;
kill -TERM $PID;
wait $PID;
done
Causing the following OOPS:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
CPU: 5 UID: 0 PID: 885 Comm: timerlatu/5 Not tainted 6.11.0-rc4-test-00002-gbc754cc76d1b-dirty #125 a533010b71dab205ad2f507188ce8c82203b0254
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:hrtimer_active+0x58/0x300
Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
RSP: 0018:ffff88811d97f940 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88823c6b5b28 RCX: ffffed10478d6b6b
RDX: dffffc0000000000 RSI: ffffed10478d6b6c RDI: ffff88823c6b5b28
RBP: 0000000000000000 R08: ffff88823c6b5b58 R09: ffff88823c6b5b60
R10: ffff88811d97f957 R11: 0000000000000010 R12: 00000000000a801d
R13: ffff88810d8b35d8 R14: 0000000000000010 R15: ffff88823c6b5b28
FS: 0000000000000000(0000) GS:ffff88823c680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561858ad7258 CR3: 000000007729e001 CR4: 0000000000170ef0
Call Trace:
<TASK>
? die_addr+0x40/0xa0
? exc_general_protection+0x154/0x230
? asm_exc_general_protection+0x26/0x30
? hrtimer_active+0x58/0x300
? __pfx_mutex_lock+0x10/0x10
? __pfx_locks_remove_file+0x10/0x10
hrtimer_cancel+0x15/0x40
timerlat_fd_release+0x8e/0x1f0
? security_file_release+0x43/0x80
__fput+0x372/0xb10
task_work_run+0x11e/0x1f0
? _raw_spin_lock+0x85/0xe0
? __pfx_task_work_run+0x10/0x10
? poison_slab_object+0x109/0x170
? do_exit+0x7a0/0x24b0
do_exit+0x7bd/0x24b0
? __pfx_migrate_enable+0x10/0x10
? __pfx_do_exit+0x10/0x10
? __pfx_read_tsc+0x10/0x10
? ktime_get+0x64/0x140
? _raw_spin_lock_irq+0x86/0xe0
do_group_exit+0xb0/0x220
get_signal+0x17ba/0x1b50
? vfs_read+0x179/0xa40
? timerlat_fd_read+0x30b/0x9d0
? __pfx_get_signal+0x10/0x10
? __pfx_timerlat_fd_read+0x10/0x10
arch_do_signal_or_restart+0x8c/0x570
? __pfx_arch_do_signal_or_restart+0x10/0x10
? vfs_read+0x179/0xa40
? ksys_read+0xfe/0x1d0
? __pfx_ksys_read+0x10/0x10
syscall_exit_to_user_mode+0xbc/0x130
do_syscall_64+0x74/0x110
? __pfx___rseq_handle_notify_resume+0x10/0x10
? __pfx_ksys_read+0x10/0x10
? fpregs_restore_userregs+0xdb/0x1e0
? fpregs_restore_userregs+0xdb/0x1e0
? syscall_exit_to_user_mode+0x116/0x130
? do_syscall_64+0x74/0x110
? do_syscall_64+0x74/0x110
? do_syscall_64+0x74/0x110
entry_SYSCALL_64_after_hwframe+0x71/0x79
RIP: 0033:0x7ff0070eca9c
Code: Unable to access opcode bytes at 0x7ff0070eca72.
RSP: 002b:00007ff006dff8c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007ff0070eca9c
RDX: 0000000000000400 RSI: 00007ff006dff9a0 RDI: 0000000000000003
RBP: 00007ff006dffde0 R08: 0000000000000000 R09: 00007ff000000ba0
R10: 00007ff007004b08 R11: 0000000000000246 R12: 0000000000000003
R13: 00007ff006dff9a0 R14: 0000000000000007 R15: 0000000000000008
</TASK>
Modules linked in: snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core
---[ end trace 0000000000000000 ]---
This is because it would mistakenly call kthread_stop() on a user space
thread making it "exit" before it actually exits.
Since kthreads are created based on global behavior, use a cpumask to know
when kthreads are running and that they need to be shutdown before
proceeding to do new work.
Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/
This was debugged by using the persistent ring buffer:
Link: https://lore.kernel.org/all/20240823013902.135036960@goodmis.org/
Note, locking was originally used to fix this, but that proved to cause too
many deadlocks to work around:
https://lore.kernel.org/linux-trace-kernel/20240823102816.5e55753b@gandalf.local.home/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240904103428.08efdf4c@gandalf.local.home
Fixes:
|
||
![]() |
04b01625da |
perf/uprobe: split uprobe_unregister()
With uprobe_unregister() having grown a synchronize_srcu(), it becomes fairly slow to call. Esp. since both users of this API call it in a loop. Peel off the sync_srcu() and do it once, after the loop. We also need to add uprobe_unregister_sync() into uprobe_register()'s error handling path, as we need to be careful about returning to the caller before we have a guarantee that partially attached consumer won't be called anymore. This is an unlikely slow path and this should be totally fine to be slow in the case of a failed attach. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240903174603.3554182-6-andrii@kernel.org |
||
![]() |
59da880afe |
uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
It serves no purpose beyond adding unnecessray argument passed to the filter callback. Just get rid of it, no one is actually using it. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240903174603.3554182-4-andrii@kernel.org |
||
![]() |
5fe6e308ab |
bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()
If bpf_link_prime() fails, bpf_uprobe_multi_link_attach() goes to the
error_free label and frees the array of bpf_uprobe's without calling
bpf_uprobe_unregister().
This leaks bpf_uprobe->uprobe and worse, this frees bpf_uprobe->consumer
without removing it from the uprobe->consumers list.
Fixes:
|
||
![]() |
49aa8a1f4d |
tracing: Avoid possible softlockup in tracing_iter_reset()
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).
Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.
Cc: stable@vger.kernel.org
Fixes:
|
||
![]() |
95c13662b6 |
Merge branch 'perf/urgent' into perf/core, to pick up fixes
This also refreshes the -rc1 based branch to -rc5. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
ef2bd81d0c |
tracing: Add option to set an instance to be the trace_printk destination
Add a option "trace_printk_dest" that will make the tracing instance the location that trace_printk() will go to. This is useful if the trace_printk or one of the top level tracers is too noisy and there's a need to separate the two. Then an instance can be created, the trace_printk can be set to go there instead, where it will not be lost in the noise of the top level tracer. Note, only one instance can be the destination of trace_printk at a time. If an instance sets this flag, the instance that had it set will have it cleared. There is always one instance that has this set. By default, that is the top instance. This flag cannot be cleared from the top instance. Doing so will result in an -EINVAL. The only way this flag can be cleared from the top instance is by another instance setting it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.545459018@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
9b7bdf6f6e |
tracing: Have trace_printk not use binary prints if boot buffer
If the persistent boot mapped ring buffer is used for trace_printk(), force it to not use the binary versions. trace_printk() by default uses bin_printf() that only saves the pointer to the format and not the format itself inside the ring buffer. But for a persistent buffer that is read after reboot, the pointers to the format strings may not be the same, or worse, not even exist! Instead, just force the more robust, but slower, version that does the formatting before saving into the ring buffer. The boot mapped buffer can now be used for trace_printk and friends! Using the trace_printk() and the persistent buffer was used to debug the issue with the osnoise tracer: Link: https://lore.kernel.org/all/20240822103443.6a6ae051@gandalf.local.home/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.386925800@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
ddb8ea9e5a |
tracing: Allow trace_printk() to go to other instance buffers
Currently, trace_printk() just goes to the top level ring buffer. But there may be times that it should go to one of the instances created by the kernel command line. Add a new trace_instance flag: traceprintk (also can use "printk" or "trace_printk" as people tend to forget the actual flag name). trace_instance=foo^traceprintk Will assign the trace_printk to this buffer at boot up. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.226694946@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
b6fc31b687 |
tracing: Add "traceoff" flag to boot time tracing instances
Add a "flags" delimiter (^) to the "trace_instance" kernel command line parameter, and add the "traceoff" flag. The format is: trace_instance=<name>[^<flag1>[^<flag2>]][@<memory>][,<events>] The code allows for more than one flag to be added, but currently only "traceoff" is done so. The motivation for this change came from debugging with the persistent ring buffer and having trace_printk() writing to it. The trace_printk calls are always enabled, and the boot after the crash was having the unwanted trace_printks from the current boot inject into the ring buffer with the trace_printks of the crash kernel, making the output very confusing. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.053229958@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
eb2dcde9f9 |
ring-buffer: Align meta-page to sub-buffers for improved TLB usage
Previously, the mapped ring-buffer layout caused misalignment between the meta-page and sub-buffers when the sub-buffer size was not a multiple of PAGE_SIZE. This prevented hardware with larger TLB entries from utilizing them effectively. Add a padding with the zero-page between the meta-page and sub-buffers. Also update the ring-buffer map_test to verify that padding. Link: https://lore.kernel.org/20240628104611.1443542-1-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
d0f2d6e951 |
ring-buffer: Add magic and struct size to boot up meta data
Add a magic number as well as save the struct size of the ring_buffer_meta structure in the meta data to also use as validation. Updating the magic number could be used to force a invalidation between kernel versions, and saving the structure size is also a good method to make sure the content is what is expected. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20240815115032.0c197b32@rorschach.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
bca704f62d |
ring-buffer: Don't reset persistent ring-buffer meta saved addresses
The text and data address is saved in the meta data so that it can be used to know the delta of the text and data addresses of the last boot compared to the text and data addresses of the current boot. The delta is used to convert function pointer entries in the ring buffer to something that can be used by kallsyms (note this only works for built-in functions). But the saved addresses get reset on boot up. If the buffer is not used and there's another reboot, then the saved text and data addresses will be of the last boot and not that of the boot that created the content in the ring buffer. To get an idea of the issue: # trace-cmd start -B boot_mapped -p function # reboot # trace-cmd show -B boot_mapped | tail <...>-1 [000] d..1. 461.983243: native_apic_msr_write <-native_kick_ap <...>-1 [000] d..1. 461.983244: __pfx_native_apic_msr_eoi <-native_kick_ap <...>-1 [000] d..1. 461.983244: reserve_irq_vector_locked <-native_kick_ap <...>-1 [000] d..1. 461.983262: branch_emulate_op <-native_kick_ap <...>-1 [000] d..1. 461.983262: __ia32_sys_ia32_pread64 <-native_kick_ap <...>-1 [000] d..1. 461.983263: native_kick_ap <-__smpboot_create_thread <...>-1 [000] d..1. 461.983263: store_cache_disable <-native_kick_ap <...>-1 [000] d..1. 461.983279: acpi_power_off_prepare <-native_kick_ap <...>-1 [000] d..1. 461.983280: __pfx_acpi_ns_delete_node <-acpi_suspend_enter <...>-1 [000] d..1. 461.983280: __pfx_acpi_os_release_lock <-acpi_suspend_enter # reboot # trace-cmd show -B boot_mapped |tail <...>-1 [000] d..1. 461.983243: 0xffffffffa9669220 <-0xffffffffa965f3db <...>-1 [000] d..1. 461.983244: 0xffffffffa96690f0 <-0xffffffffa965f3db <...>-1 [000] d..1. 461.983244: 0xffffffffa9663fa0 <-0xffffffffa965f3db <...>-1 [000] d..1. 461.983262: 0xffffffffa9672e80 <-0xffffffffa965f3e0 <...>-1 [000] d..1. 461.983262: 0xffffffffa962b940 <-0xffffffffa965f3ec <...>-1 [000] d..1. 461.983263: 0xffffffffa965f540 <-0xffffffffa96e1362 <...>-1 [000] d..1. 461.983263: 0xffffffffa963c940 <-0xffffffffa965f55b <...>-1 [000] d..1. 461.983279: 0xffffffffa9ee30c0 <-0xffffffffa965f59b <...>-1 [000] d..1. 461.983280: 0xffffffffa9f16c10 <-0xffffffffa9ee3157 <...>-1 [000] d..1. 461.983280: 0xffffffffa9ee02e0 <-0xffffffffa9ee3157 By not updating the saved text and data addresses in the meta data at every boot up and only updating them when the buffer is reset, it allows multiple boots to see the same data. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20240815113629.0dc90af8@rorschach.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> |
||
![]() |
50c374c6d1 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR including important fixes (from bpf-next point of view): commit |
||
![]() |
bc754cc76d |
tracing: Fix memory leak in fgraph storage selftest
With ftrace boot-time selftest, kmemleak reported some memory leaks in
the new test case for function graph storage for multiple tracers.
unreferenced object 0xffff888005060080 (size 32):
comm "swapper/0", pid 1, jiffies 4294676440
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 20 10 06 05 80 88 ff ff ........ .......
54 0c 1e 81 ff ff ff ff 00 00 00 00 00 00 00 00 T...............
backtrace (crc 7c93416c):
[<000000000238ee6f>] __kmalloc_cache_noprof+0x11f/0x2a0
[<0000000033d2b6c5>] enter_record+0xe8/0x150
[<0000000054c38424>] match_records+0x1cd/0x230
[<00000000c775b63d>] ftrace_set_hash+0xff/0x380
[<000000007bf7208c>] ftrace_set_filter+0x70/0x90
[<00000000a5c08dda>] test_graph_storage_multi+0x2e/0xf0
[<000000006ba028ca>] trace_selftest_startup_function_graph+0x1e8/0x260
[<00000000a715d3eb>] run_tracer_selftest+0x111/0x190
[<00000000395cbf90>] register_tracer+0xdf/0x1f0
[<0000000093e67f7b>] do_one_initcall+0x141/0x3b0
[<00000000c591b682>] do_initcall_level+0x82/0xa0
[<000000004e4c6600>] do_initcalls+0x43/0x70
[<0000000034f3c4e4>] kernel_init_freeable+0x170/0x1f0
[<00000000c7a5dab2>] kernel_init+0x1a/0x1a0
[<00000000ea105947>] ret_from_fork+0x3a/0x50
[<00000000a1932e84>] ret_from_fork_asm+0x1a/0x30
...
This means filter hash allocated for the fixtures are not correctly
released after the test.
Free those hash lists after tests are done and split the loop for
initialize fixture and register fixture for rollback.
Fixes:
|