The following can crash the kernel:
# cd /sys/kernel/tracing
# echo 'p:sched schedule' > kprobe_events
# exec 5>>events/kprobes/sched/enable
# > kprobe_events
# exec 5>&-
The above commands:
1. Change directory to the tracefs directory
2. Create a kprobe event (doesn't matter what one)
3. Open bash file descriptor 5 on the enable file of the kprobe event
4. Delete the kprobe event (removes the files too)
5. Close the bash file descriptor 5
The above causes a crash!
BUG: kernel NULL pointer dereference, address: 0000000000000028
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-00008-g2c6b6b1029d4-dirty #186
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:tracing_release_file_tr+0xc/0x50
What happens here is that the kprobe event creates a trace_event_file
"file" descriptor that represents the file in tracefs to the event. It
maintains state of the event (is it enabled for the given instance?).
Opening the "enable" file gets a reference to the event "file" descriptor
via the open file descriptor. When the kprobe event is deleted, the file is
also deleted from the tracefs system which also frees the event "file"
descriptor.
But as the tracefs file is still opened by user space, it will not be
totally removed until the final dput() is called on it. But this is not
true with the event "file" descriptor that is already freed. If the user
does a write to or simply closes the file descriptor it will reference the
event "file" descriptor that was just freed, causing a use-after-free bug.
To solve this, add a ref count to the event "file" descriptor as well as a
new flag called "FREED". The "file" will not be freed until the last
reference is released. But the FREE flag will be set when the event is
removed to prevent any more modifications to that event from happening,
even if there's still a reference to the event "file" descriptor.
Link: https://lore.kernel.org/linux-trace-kernel/20231031000031.1e705592@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231031122453.7a48b923@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: f5ca233e2e ("tracing: Increase trace array ref count on enable and filter files")
Reported-by: Beau Belgrave <beaub@linux.microsoft.com>
Tested-by: Beau Belgrave <beaub@linux.microsoft.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Solve two ergonomic issues with struct seq_buf;
1) Too much boilerplate is required to initialize:
struct seq_buf s;
char buf[32];
seq_buf_init(s, buf, sizeof(buf));
Instead, we can build this directly on the stack. Provide
DECLARE_SEQ_BUF() macro to do this:
DECLARE_SEQ_BUF(s, 32);
2) %NUL termination is fragile and requires 2 steps to get a valid
C String (and is a layering violation exposing the "internals" of
seq_buf):
seq_buf_terminate(s);
do_something(s->buffer);
Instead, we can just return s->buffer directly after terminating it in
the refactored seq_buf_terminate(), now known as seq_buf_str():
do_something(seq_buf_str(s));
Link: https://lore.kernel.org/linux-trace-kernel/20231027155634.make.260-kees@kernel.org
Link: https://lore.kernel.org/linux-trace-kernel/20231026194033.it.702-kees@kernel.org/
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yun Zhou <yun.zhou@windriver.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
To make seq_buf more lightweight as a string buf, move the readpos member
from seq_buf to its container, trace_seq. That puts the responsibility
of maintaining the readpos entirely in the tracing code. If some future
users want to package up the readpos with a seq_buf, we can define a
new struct then.
Link: https://lore.kernel.org/linux-trace-kernel/20231020033545.2587554-2-willy@infradead.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Instead of having a descriptor for every file represented in the eventfs
directory, only have the directory itself represented. Change the API to
send in a list of entries that represent all the files in the directory
(but not other directories). The entry list contains a name and a callback
function that will be used to create the files when they are accessed.
struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
const struct eventfs_entry *entries,
int size, void *data);
is used for the top level eventfs directory, and returns an eventfs_inode
that will be used by:
struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
const struct eventfs_entry *entries,
int size, void *data);
where both of the above take an array of struct eventfs_entry entries for
every file that is in the directory.
The entries are defined by:
typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);
struct eventfs_entry {
const char *name;
eventfs_callback callback;
};
Where the name is the name of the file and the callback gets called when
the file is being created. The callback passes in the name (in case the
same callback is used for multiple files), a pointer to the mode, data and
fops. The data will be pointing to the data that was passed in
eventfs_create_dir() or eventfs_create_events_dir() but may be overridden
to point to something else, as it will be used to point to the
inode->i_private that is created. The information passed back from the
callback is used to create the dentry/inode.
If the callback fills the data and the file should be created, it must
return a positive number. On zero or negative, the file is ignored.
This logic may also be used as a prototype to convert entire pseudo file
systems into just-in-time allocation.
The "show_events_dentry" file has been updated to show the directories,
and any files they have.
With just the eventfs_file allocations:
Before after deltas for meminfo (in kB):
MemFree: -14360
MemAvailable: -14260
Buffers: 40
Cached: 24
Active: 44
Inactive: 48
Inactive(anon): 28
Active(file): 44
Inactive(file): 20
Dirty: -4
AnonPages: 28
Mapped: 4
KReclaimable: 132
Slab: 1604
SReclaimable: 132
SUnreclaim: 1472
Committed_AS: 12
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
ext4_inode_cache 27 [* 1184 = 31968 ]
extent_status 102 [* 40 = 4080 ]
tracefs_inode_cache 144 [* 656 = 94464 ]
buffer_head 39 [* 104 = 4056 ]
shmem_inode_cache 49 [* 800 = 39200 ]
filp -53 [* 256 = -13568 ]
dentry 251 [* 192 = 48192 ]
lsm_file_cache 277 [* 32 = 8864 ]
vm_area_struct -14 [* 184 = -2576 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k 35 [* 1024 = 35840 ]
kmalloc-256 49 [* 256 = 12544 ]
kmalloc-192 -28 [* 192 = -5376 ]
kmalloc-128 -30 [* 128 = -3840 ]
kmalloc-96 10581 [* 96 = 1015776 ]
kmalloc-64 3056 [* 64 = 195584 ]
kmalloc-32 1291 [* 32 = 41312 ]
kmalloc-16 2310 [* 16 = 36960 ]
kmalloc-8 9216 [* 8 = 73728 ]
Free memory dropped by 14,360 kB
Available memory dropped by 14,260 kB
Total slab additions in size: 1,771,032 bytes
With this change:
Before after deltas for meminfo (in kB):
MemFree: -12084
MemAvailable: -11976
Buffers: 32
Cached: 32
Active: 72
Inactive: 168
Inactive(anon): 176
Active(file): 72
Inactive(file): -8
Dirty: 24
AnonPages: 196
Mapped: 8
KReclaimable: 148
Slab: 836
SReclaimable: 148
SUnreclaim: 688
Committed_AS: 324
Before after deltas for slabinfo:
<slab>: <objects> [ * <size> = <total>]
tracefs_inode_cache 144 [* 656 = 94464 ]
shmem_inode_cache -23 [* 800 = -18400 ]
filp -92 [* 256 = -23552 ]
dentry 179 [* 192 = 34368 ]
lsm_file_cache -3 [* 32 = -96 ]
vm_area_struct -13 [* 184 = -2392 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k -49 [* 1024 = -50176 ]
kmalloc-256 -27 [* 256 = -6912 ]
kmalloc-128 1864 [* 128 = 238592 ]
kmalloc-64 4685 [* 64 = 299840 ]
kmalloc-32 -72 [* 32 = -2304 ]
kmalloc-16 256 [* 16 = 4096 ]
total = 721352
Free memory dropped by 12,084 kB
Available memory dropped by 11,976 kB
Total slab additions in size: 721,352 bytes
That's over 2 MB in savings per instance for free and available memory,
and over 1 MB in savings per instance of slab memory.
Link: https://lore.kernel.org/linux-trace-kernel/20231003184059.4924468e@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231004165007.43d79161@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ring buffer of global_trace is set to the minimum size in
order to save memory on boot up and then it will be expand when
some trace feature enabled.
However currently operations under an instance can also cause
global_trace ring buffer being expanded, and the expanded memory
would be wasted if global_trace then not being used.
See following case, we enable 'sched_switch' event in instance 'A', then
ring buffer of global_trace is unexpectedly expanded to be 1410KB, also
the '(expanded: 1408)' from 'buffer_size_kb' of instance is confusing.
# cd /sys/kernel/tracing
# mkdir instances/A
# cat buffer_size_kb
7 (expanded: 1408)
# cat instances/A/buffer_size_kb
1410 (expanded: 1408)
# echo sched:sched_switch > instances/A/set_event
# cat buffer_size_kb
1410
# cat instances/A/buffer_size_kb
1410
To fix it, we can:
- Make 'ring_buffer_expanded' as a member of 'struct trace_array';
- Make 'ring_buffer_expanded' of instance is defaultly true,
global_trace is defaultly false;
- In order not to expose 'global_trace' outside of file
'kernel/trace/trace.c', introduce trace_set_ring_buffer_expanded()
to set 'ring_buffer_expanded' as 'true';
- Pass the expected trace_array to tracing_update_buffers().
Link: https://lore.kernel.org/linux-trace-kernel/20230906091837.3998020-1-zhengyejian1@huawei.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Add missing LOCKDOWN checks for eventfs callers
When LOCKDOWN is active for tracing, it causes inconsistent state
when some functions succeed and others fail.
- Use dput() to free the top level eventfs descriptor
There was a race between accesses and freeing it.
- Fix a long standing bug that eventfs exposed due to changing timings
by dynamically creating files. That is, If a event file is opened
for an instance, there's nothing preventing the instance from being
removed which will make accessing the files cause use-after-free bugs.
- Fix a ring buffer race that happens when iterating over the ring
buffer while writers are active. Check to make sure not to read
the event meta data if it's beyond the end of the ring buffer sub buffer.
- Fix the print trigger that disappeared because the test to create it
was looking for the event dir field being filled, but now it has the
"ef" field filled for the eventfs structure.
- Remove the unused "dir" field from the event structure.
- Fix the order of the trace_dynamic_info as it had it backwards for the
offset and len fields for which one was for which endianess.
- Fix NULL pointer dereference with eventfs_remove_rec()
If an allocation fails in one of the eventfs_add_*() functions,
the caller of it in event_subsystem_dir() or event_create_dir()
assigns the result to the structure. But it's assigning the ERR_PTR
and not NULL. This was passed to eventfs_remove_rec() which expects
either a good pointer or a NULL, not ERR_PTR. The fix is to not
assign the ERR_PTR to the structure, but to keep it NULL on error.
- Fix list_for_each_rcu() to use list_for_each_srcu() in
dcache_dir_open_wrapper(). One iteration of the code used RCU
but because it had to call sleepable code, it had to be changed
to use SRCU, but one of the iterations was missed.
- Fix synthetic event print function to use "as_u64" instead of
passing in a pointer to the union. To fix big/little endian issues,
the u64 that represented several types was turned into a union to
define the types properly.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQCvoBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtgrAP9MiYiCMU+90oJ+61DFchbs3y7BNidP
s3lLRDUMJ935NQD/SSAm54PqWb+YXMpD7m9+3781l6xqwfabBMXNaEl+FwA=
=tlZu
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Add missing LOCKDOWN checks for eventfs callers
When LOCKDOWN is active for tracing, it causes inconsistent state
when some functions succeed and others fail.
- Use dput() to free the top level eventfs descriptor
There was a race between accesses and freeing it.
- Fix a long standing bug that eventfs exposed due to changing timings
by dynamically creating files. That is, If a event file is opened for
an instance, there's nothing preventing the instance from being
removed which will make accessing the files cause use-after-free
bugs.
- Fix a ring buffer race that happens when iterating over the ring
buffer while writers are active. Check to make sure not to read the
event meta data if it's beyond the end of the ring buffer sub buffer.
- Fix the print trigger that disappeared because the test to create it
was looking for the event dir field being filled, but now it has the
"ef" field filled for the eventfs structure.
- Remove the unused "dir" field from the event structure.
- Fix the order of the trace_dynamic_info as it had it backwards for
the offset and len fields for which one was for which endianess.
- Fix NULL pointer dereference with eventfs_remove_rec()
If an allocation fails in one of the eventfs_add_*() functions, the
caller of it in event_subsystem_dir() or event_create_dir() assigns
the result to the structure. But it's assigning the ERR_PTR and not
NULL. This was passed to eventfs_remove_rec() which expects either a
good pointer or a NULL, not ERR_PTR. The fix is to not assign the
ERR_PTR to the structure, but to keep it NULL on error.
- Fix list_for_each_rcu() to use list_for_each_srcu() in
dcache_dir_open_wrapper(). One iteration of the code used RCU but
because it had to call sleepable code, it had to be changed to use
SRCU, but one of the iterations was missed.
- Fix synthetic event print function to use "as_u64" instead of passing
in a pointer to the union. To fix big/little endian issues, the u64
that represented several types was turned into a union to define the
types properly.
* tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()
tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper()
tracing/synthetic: Print out u64 values properly
tracing/synthetic: Fix order of struct trace_dynamic_info
selftests/ftrace: Fix dependencies for some of the synthetic event tests
tracing: Remove unused trace_event_file dir field
tracing: Use the new eventfs descriptor for print trigger
ring-buffer: Do not attempt to read past "commit"
tracefs/eventfs: Free top level files on removal
ring-buffer: Avoid softlockup in ring_buffer_resize()
tracing: Have event inject files inc the trace array ref count
tracing: Have option files inc the trace array ref count
tracing: Have current_trace inc the trace array ref count
tracing: Have tracing_max_latency inc the trace array ref count
tracing: Increase trace array ref count on enable and filter files
tracefs/eventfs: Use dput to free the toplevel events directory
tracefs/eventfs: Add missing lockdown checks
tracefs: Add missing lockdown check to tracefs_create_dir()
The check to create the print event "trigger" was using the obsolete "dir"
value of the trace_event_file to determine if it should create the trigger
or not. But that value will now be NULL because it uses the event file
descriptor.
Change it to test the "ef" field of the trace_event_file structure so that
the trace_marker "trigger" file appears again.
Link: https://lkml.kernel.org/r/20230908022001.371815239@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ajay Kaher <akaher@vmware.com>
Fixes: 27152bceea ("eventfs: Move tracing/events to eventfs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The option files update the options for a given trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.
Up the ref count of the trace_array when an option file is opened.
Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The current_trace updates the trace array tracer. For an instance, if the
file is opened and the instance is deleted, reading or writing to the file
will cause a use after free.
Up the ref count of the trace array when current_trace is opened.
Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_max_latency file points to the trace_array max_latency field.
For an instance, if the file is opened and the instance is deleted,
reading or writing to the file will cause a use after free.
Up the ref count of the trace_array when tracing_max_latency is opened.
Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the trace event enable and filter files are opened, increment the
trace array ref counter, otherwise they can be accessed when the trace
array is being deleted. The ref counter keeps the trace array from being
deleted while those files are opened.
Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- kprobes: use struct_size() for variable size kretprobe_instance
data structure.
- eprobe: Simplify trace_eprobe list iteration.
- probe events: Data structure field access support on BTF argument.
. Update BTF argument support on the functions in the kernel loadable
modules (only loaded modules are supported).
. Move generic BTF access function (search function prototype and get
function parameters) to a separated file.
. Add a function to search a member of data structure in BTF.
. Support accessing BTF data structure member from probe args by
C-like arrow('->') and dot('.') operators. e.g.
't sched_switch next=next->pid vruntime=next->se.vruntime'
. Support accessing BTF data structure member from $retval. e.g.
'f getname_flags%return +0($retval->name):string'
. Add string type checking if BTF type info is available.
This will reject if user specify ":string" type for non "char
pointer" type.
. Automatically assume the fprobe event as a function return event
if $retval is used.
- selftests/ftrace: Add BTF data field access test cases.
- Documentation: Update fprobe event example with BTF data field.
-----BEGIN PGP SIGNATURE-----
iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmTycQkbHG1hc2FtaS5o
aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bqS8H/jeR1JhOzIXOvTw7XCFm
MrSY/SKi8tQfV6lau2UmoYdbYvYjpqL34XLOQPNf2/lrcL2M9aNYXk9fbhlW8enx
vkMyKQ0E5anixkF4vsTbEl9DaprxbpsPVACmZ/7VjQk2JuXIdyaNk8hno9LgIcEq
udztb0o2HmDFqAXfRi0LvlSTAIwvXZ+usmEvYpaq1g2WwrCe7NHEYl42vMpj+h4H
9l4t5rA9JyPPX4yQUjtKGW5eRVTwDTm/Gn6DRzYfYzkkiBZv27qfovzBOt672LgG
hyot+u7XeKvZx3jjnF7+mRWoH/m0dqyhyi/nPhpIE09VhgwclrbGAcDuR1x6sp01
PHY=
=hBDN
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
- kprobes: use struct_size() for variable size kretprobe_instance data
structure.
- eprobe: Simplify trace_eprobe list iteration.
- probe events: Data structure field access support on BTF argument.
- Update BTF argument support on the functions in the kernel
loadable modules (only loaded modules are supported).
- Move generic BTF access function (search function prototype and
get function parameters) to a separated file.
- Add a function to search a member of data structure in BTF.
- Support accessing BTF data structure member from probe args by
C-like arrow('->') and dot('.') operators. e.g.
't sched_switch next=next->pid vruntime=next->se.vruntime'
- Support accessing BTF data structure member from $retval. e.g.
'f getname_flags%return +0($retval->name):string'
- Add string type checking if BTF type info is available. This will
reject if user specify ":string" type for non "char pointer"
type.
- Automatically assume the fprobe event as a function return event
if $retval is used.
- selftests/ftrace: Add BTF data field access test cases.
- Documentation: Update fprobe event example with BTF data field.
* tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
Documentation: tracing: Update fprobe event example with BTF field
selftests/ftrace: Add BTF fields access testcases
tracing/fprobe-event: Assume fprobe is a return event by $retval
tracing/probes: Add string type check with BTF
tracing/probes: Support BTF field access from $retval
tracing/probes: Support BTF based data structure field access
tracing/probes: Add a function to search a member of a struct/union
tracing/probes: Move finding func-proto API and getting func-param API to trace_btf
tracing/probes: Support BTF argument on module functions
tracing/eprobe: Iterate trace_eprobe directly
kernel: kprobes: Use struct_size()
The pipe cpumask used to serialize opens between the main and percpu
trace pipes is not zeroed or initialized. This can result in
spurious -EBUSY returns if underlying memory is not fully zeroed.
This has been observed by immediate failure to read the main
trace_pipe file on an otherwise newly booted and idle system:
# cat /sys/kernel/debug/tracing/trace_pipe
cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy
Zero the allocation of pipe_cpumask to avoid the problem.
Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com
Cc: stable@vger.kernel.org
Fixes: c2489bb7e6 ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes")
Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Warning happened in rb_end_commit() at code:
if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing)))
WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142
rb_commit+0x402/0x4a0
Call Trace:
ring_buffer_unlock_commit+0x42/0x250
trace_buffer_unlock_commit_regs+0x3b/0x250
trace_event_buffer_commit+0xe5/0x440
trace_event_buffer_reserve+0x11c/0x150
trace_event_raw_event_sched_switch+0x23c/0x2c0
__traceiter_sched_switch+0x59/0x80
__schedule+0x72b/0x1580
schedule+0x92/0x120
worker_thread+0xa0/0x6f0
It is because the race between writing event into cpu buffer and swapping
cpu buffer through file per_cpu/cpu0/snapshot:
Write on CPU 0 Swap buffer by per_cpu/cpu0/snapshot on CPU 1
-------- --------
tracing_snapshot_write()
[...]
ring_buffer_lock_reserve()
cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a';
[...]
rb_reserve_next_event()
[...]
ring_buffer_swap_cpu()
if (local_read(&cpu_buffer_a->committing))
goto out_dec;
if (local_read(&cpu_buffer_b->committing))
goto out_dec;
buffer_a->buffers[cpu] = cpu_buffer_b;
buffer_b->buffers[cpu] = cpu_buffer_a;
// 2. cpu_buffer has swapped here.
rb_start_commit(cpu_buffer);
if (unlikely(READ_ONCE(cpu_buffer->buffer)
!= buffer)) { // 3. This check passed due to 'cpu_buffer->buffer'
[...] // has not changed here.
return NULL;
}
cpu_buffer_b->buffer = buffer_a;
cpu_buffer_a->buffer = buffer_b;
[...]
// 4. Reserve event from 'cpu_buffer_a'.
ring_buffer_unlock_commit()
[...]
cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!!
rb_commit(cpu_buffer)
rb_end_commit() // 6. WARN for the wrong 'committing' state !!!
Based on above analysis, we can easily reproduce by following testcase:
``` bash
#!/bin/bash
dmesg -n 7
sysctl -w kernel.panic_on_warn=1
TR=/sys/kernel/tracing
echo 7 > ${TR}/buffer_size_kb
echo "sched:sched_switch" > ${TR}/set_event
while [ true ]; do
echo 1 > ${TR}/per_cpu/cpu0/snapshot
done &
while [ true ]; do
echo 1 > ${TR}/per_cpu/cpu0/snapshot
done &
while [ true ]; do
echo 1 > ${TR}/per_cpu/cpu0/snapshot
done &
```
To fix it, IIUC, we can use smp_call_function_single() to do the swap on
the target cpu where the buffer is located, so that above race would be
avoided.
Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Fixes: f1affcaaa8 ("tracing: Add snapshot in the per_cpu trace directories")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
User visible changes:
- Added a way to easier filter with cpumasks:
# echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter
- Show actual size of ring buffer after modifying the ring buffer size via
buffer_size_kb. Currently it just returns what was written, but the actual
size rounds up to the sub buffer size. Show that real size instead.
Major changes:
- Added "eventfs". This is the code that handles the inodes and dentries of
tracefs/events directory. As there are thousands of events, and each event
has several inodes and dentries that currently exist even when tracing is
never used, they take up precious memory. Instead, eventfs will allocate
the inodes and dentries in a JIT way (similar to what procfs does). There
is now metadata that handles the events and subdirectories, and will create
the inodes and dentries when they are used.
Note, I also have patches that remove the subdirectory meta data, but will
wait till the next merge window before applying them. It's a little more
complex, and I want to make sure the dynamic code works properly before
adding more complexity, making it easier to revert if need be.
Minor changes:
- Optimization to user event list traversal.
- Remove intermediate permission of tracefs files (note the intermediate
permission removes all access to the files so it is not a security concern,
but just a clean up.)
- Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic.
- Other minor clean ups.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEXtmkj8VMCiLR0IBM68Js21pW3nMFAmTwtAsUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ68Js21pW3nNOXRAAsslQT6alY4OeplC4x47+V6+6NiIA
oDtOmWAqf7TsH9bukzRFD36rUly42O20RJDx9z0Q3iRc3vGxEawId8z6P0HmBwRb
VSl5BryWvL5Wc5w94xS8EeCuC1MRfhVDyfbtVFmWigzfvd/f+hp71ViMPHUvrRJX
KhzzNSBc4ir5E1lzfwa7meYTXzDwrQlZbYfdf5aH94IWAkqDj85PUZDJ7UmLZhXG
CIglSpNFXZ0j19Wo/U6KZlHR1XfunBKungCzJ5Dbznc9YLWZTQXOIZF4YPKfPIJL
ulRG9chwXY0nQWhG3xM1UHZLsAMSWw5i13a4ZN4d8FCNOgv8ttcJnfDk7ZYUS0Oz
RmY1dGcSRKAZTUTjm8ZBtmyiUCc9kZAIk0fyEfIHtoDYXmhnvni3wuTnbRSdXaSi
q4YkxPaLfX8Fn3QloCqqddt8iONu7BnbpZOhUCl2AtBib52gnTTF7+rQ6/0D3rjo
SSuvEHhnjJhzk+3jM2odxjmTAztNT+yu6FbKXZUKPt1Kj9YHv1J9cEQw9/Etw+GV
8jQBe979D8hFJmDOJOT/O/TdPqE9mQoMNBt6Y8QnE4nbJWM+i/MBrThFpUSQhRCr
0Ya/HgR2QyRH7RmZW5o2H9mNtN+V9c7RxZW8erYzRbUs0YofK2OpGi9SrPzxWCke
w6j0VVZHaxdPguM=
=/s+e
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"User visible changes:
- Added a way to easier filter with cpumasks:
# echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter
- Show actual size of ring buffer after modifying the ring buffer
size via buffer_size_kb.
Currently it just returns what was written, but the actual size
rounds up to the sub buffer size. Show that real size instead.
Major changes:
- Added "eventfs". This is the code that handles the inodes and
dentries of tracefs/events directory. As there are thousands of
events, and each event has several inodes and dentries that
currently exist even when tracing is never used, they take up
precious memory. Instead, eventfs will allocate the inodes and
dentries in a JIT way (similar to what procfs does). There is now
metadata that handles the events and subdirectories, and will
create the inodes and dentries when they are used.
Note, I also have patches that remove the subdirectory meta data,
but will wait till the next merge window before applying them. It's
a little more complex, and I want to make sure the dynamic code
works properly before adding more complexity, making it easier to
revert if need be.
Minor changes:
- Optimization to user event list traversal
- Remove intermediate permission of tracefs files (note the
intermediate permission removes all access to the files so it is
not a security concern, but just a clean up)
- Add the complex fix to FORTIFY_SOURCE to the kernel stack event
logic
- Other minor cleanups"
* tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
tracefs: Remove kerneldoc from struct eventfs_file
tracefs: Avoid changing i_mode to a temp value
tracing/user_events: Optimize safe list traversals
ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon()
tracing: Remove unused function declarations
tracing/filters: Document cpumask filtering
tracing/filters: Further optimise scalar vs cpumask comparison
tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
tracing/filters: Enable filtering the CPU common field by a cpumask
tracing/filters: Enable filtering a scalar field by a cpumask
tracing/filters: Enable filtering a cpumask field by another cpumask
tracing/filters: Dynamically allocate filter_pred.regex
test: ftrace: Fix kprobe test for eventfs
eventfs: Move tracing/events to eventfs
eventfs: Implement removal of meta data from eventfs
eventfs: Implement functions to create files and dirs when accessed
eventfs: Implement eventfs lookup, read, open functions
eventfs: Implement eventfs file add functions
...
There is race issue when concurrently splice_read main trace_pipe and
per_cpu trace_pipes which will result in data read out being different
from what actually writen.
As suggested by Steven:
> I believe we should add a ref count to trace_pipe and the per_cpu
> trace_pipes, where if they are opened, nothing else can read it.
>
> Opening trace_pipe locks all per_cpu ref counts, if any of them are
> open, then the trace_pipe open will fail (and releases any ref counts
> it had taken).
>
> Opening a per_cpu trace_pipe will up the ref count for just that
> CPU buffer. This will allow multiple tasks to read different per_cpu
> trace_pipe files, but will prevent the main trace_pipe file from
> being opened.
But because we only need to know whether per_cpu trace_pipe is open or
not, using a cpumask instead of using ref count may be easier.
After this patch, users will find that:
- Main trace_pipe can be opened by only one user, and if it is
opened, all per_cpu trace_pipes cannot be opened;
- Per_cpu trace_pipes can be opened by multiple users, but each per_cpu
trace_pipe can only be opened by one user. And if one of them is
opened, main trace_pipe cannot be opened.
Link: https://lore.kernel.org/linux-trace-kernel/20230818022645.1948314-1-zhengyejian1@huawei.com
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Kmemleak report a leak in graph_trace_open():
unreferenced object 0xffff0040b95f4a00 (size 128):
comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
hex dump (first 32 bytes):
e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
backtrace:
[<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
[<000000007df90faa>] graph_trace_open+0xb0/0x344
[<00000000737524cd>] __tracing_open+0x450/0xb10
[<0000000098043327>] tracing_open+0x1a0/0x2a0
[<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
[<000000004015bcd6>] vfs_open+0x98/0xd0
[<000000002b5f60c9>] do_open+0x520/0x8d0
[<00000000376c7820>] path_openat+0x1c0/0x3e0
[<00000000336a54b5>] do_filp_open+0x14c/0x324
[<000000002802df13>] do_sys_openat2+0x2c4/0x530
[<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
[<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
[<00000000313647bf>] do_el0_svc+0xac/0xec
[<000000002ef1c651>] el0_svc+0x20/0x30
[<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
[<000000000c309c35>] el0_sync+0x160/0x180
The root cause is descripted as follows:
__tracing_open() { // 1. File 'trace' is being opened;
...
*iter->trace = *tr->current_trace; // 2. Tracer 'function_graph' is
// currently set;
...
iter->trace->open(iter); // 3. Call graph_trace_open() here,
// and memory are allocated in it;
...
}
s_start() { // 4. The opened file is being read;
...
*iter->trace = *tr->current_trace; // 5. If tracer is switched to
// 'nop' or others, then memory
// in step 3 are leaked!!!
...
}
To fix it, in s_start(), close tracer before switching then reopen the
new tracer after switching. And some tracers like 'wakeup' may not update
'iter->private' in some cases when reopen, then it should be cleared
to avoid being mistakenly closed again.
Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Trace ring buffer can no longer record anything after executing
following commands at the shell prompt:
# cd /sys/kernel/tracing
# cat tracing_cpumask
fff
# echo 0 > tracing_cpumask
# echo 1 > snapshot
# echo fff > tracing_cpumask
# echo 1 > tracing_on
# echo "hello world" > trace_marker
-bash: echo: write error: Bad file descriptor
The root cause is that:
1. After `echo 0 > tracing_cpumask`, 'record_disabled' of cpu buffers
in 'tr->array_buffer.buffer' became 1 (see tracing_set_cpumask());
2. After `echo 1 > snapshot`, 'tr->array_buffer.buffer' is swapped
with 'tr->max_buffer.buffer', then the 'record_disabled' became 0
(see update_max_tr());
3. After `echo fff > tracing_cpumask`, the 'record_disabled' become -1;
Then array_buffer and max_buffer are both unavailable due to value of
'record_disabled' is not 0.
To fix it, enable or disable both array_buffer and max_buffer at the same
time in tracing_set_cpumask().
Link: https://lkml.kernel.org/r/20230805033816.3284594-2-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <vnagarnaik@google.com>
Cc: <shuah@kernel.org>
Fixes: 71babb2705 ("tracing: change CPU ring buffer state from tracing_cpumask")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently we can resize trace ringbuffer by writing a value into file
'buffer_size_kb', then by reading the file, we get the value that is
usually what we wrote. However, this value may be not actual size of
trace ring buffer because of the round up when doing resize in kernel,
and the actual size would be more useful.
Link: https://lore.kernel.org/linux-trace-kernel/20230705002705.576633-1-zhengyejian1@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As the trace iterator is created and used by various interfaces, the clean
up of it needs to be consistent. Create a free_trace_iter_content() helper
function that frees the content of the iterator and use that to clean it
up in all places that it is used.
Link: https://lkml.kernel.org/r/20230715141348.341887497@goodmis.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The iterator allocated a descriptor to copy the current_trace. This was done
with the assumption that the function pointers might change. But this was a
false assuption, as it does not change. There's no reason to make a copy of the
current_trace and just use the pointer it points to. This removes needing to
manage freeing the descriptor. Worse yet, there's locations that the iterator
is used but does make a copy and just uses the pointer. This could cause the
actual pointer to the trace descriptor to be freed and not the allocated copy.
This is more of a clean up than a fix.
Link: https://lkml.kernel.org/r/20230715141348.135792275@goodmis.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: d7350c3f45 ("tracing/core: make the read callbacks reentrants")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
For backward compatibility, older tooling expects to see the kernel_stack
event with a "caller" field that is a fixed size array of 8 addresses. The
code now supports more than 8 with an added "size" field that states the
real number of entries. But the "caller" field still just looks like a
fixed size to user space.
Since the tracing macros that create the user space format files also
creates the structures that those files represent, the kernel_stack event
structure had its "caller" field a fixed size of 8, but in reality, when
it is allocated on the ring buffer, it can hold more if the stack trace is
bigger that 8 functions. The copying of these entries was simply done with
a memcpy():
size = nr_entries * sizeof(unsigned long);
memcpy(entry->caller, fstack->calls, size);
The FORTIFY_SOURCE logic noticed at runtime that when the nr_entries was
larger than 8, that the memcpy() was writing more than what the structure
stated it can hold and it complained about it. This is because the
FORTIFY_SOURCE code is unaware that the amount allocated is actually
enough to hold the size. It does not expect that a fixed size field will
hold more than the fixed size.
This was originally solved by hiding the caller assignment with some
pointer arithmetic.
ptr = ring_buffer_data();
entry = ptr;
ptr += offsetof(typeof(*entry), caller);
memcpy(ptr, fstack->calls, size);
But it is considered bad form to hide from kernel hardening. Instead, make
it work nicely with FORTIFY_SOURCE by adding a new __stack_array() macro
that is specific for this one special use case. The macro will take 4
arguments: type, item, len, field (whereas the __array() macro takes just
the first three). This macro will act just like the __array() macro when
creating the code to deal with the format file that is exposed to user
space. But for the kernel, it will turn the caller field into:
type item[] __counted_by(field);
or for this instance:
unsigned long caller[] __counted_by(size);
Now the kernel code can expose the assignment of the caller to the
FORTIFY_SOURCE and everyone is happy!
Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20230713092605.2ddb9788@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
The stack_trace event is an event created by the tracing subsystem to
store stack traces. It originally just contained a hard coded array of 8
words to hold the stack, and a "size" to know how many entries are there.
This is exported to user space as:
name: kernel_stack
ID: 4
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int size; offset:8; size:4; signed:1;
field:unsigned long caller[8]; offset:16; size:64; signed:0;
print fmt: "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n",i
(void *)REC->caller[0], (void *)REC->caller[1], (void *)REC->caller[2],
(void *)REC->caller[3], (void *)REC->caller[4], (void *)REC->caller[5],
(void *)REC->caller[6], (void *)REC->caller[7]
Where the user space tracers could parse the stack. The library was
updated for this specific event to only look at the size, and not the
array. But some older users still look at the array (note, the older code
still checks to make sure the array fits inside the event that it read.
That is, if only 4 words were saved, the parser would not read the fifth
word because it will see that it was outside of the event size).
This event was changed a while ago to be more dynamic, and would save a
full stack even if it was greater than 8 words. It does this by simply
allocating more ring buffer to hold the extra words. Then it copies in the
stack via:
memcpy(&entry->caller, fstack->calls, size);
As the entry is struct stack_entry, that is created by a macro to both
create the structure and export this to user space, it still had the caller
field of entry defined as: unsigned long caller[8].
When the stack is greater than 8, the FORTIFY_SOURCE code notices that the
amount being copied is greater than the source array and complains about
it. It has no idea that the source is pointing to the ring buffer with the
required allocation.
To hide this from the FORTIFY_SOURCE logic, pointer arithmetic is used:
ptr = ring_buffer_event_data(event);
entry = ptr;
ptr += offsetof(typeof(*entry), caller);
memcpy(ptr, fstack->calls, size);
Link: https://lore.kernel.org/all/20230612160748.4082850-1-svens@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230712105235.5fc441aa@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Fix bad git merge of #endif in arm64 code
A merge of the arm64 tree caused #endif to go into the wrong place
- Fix crash on lseek of write access to tracefs/error_log
Opening error_log as write only, and then doing an lseek() causes
a kernel panic, because the lseek() handle expects a "seq_file"
to exist (which is not done on write only opens). Use tracing_lseek()
that tests for this instead of calling the default seq lseek handler.
- Check for negative instead of -E2BIG for error on strscpy() returns
Instead of testing for -E2BIG from strscpy(), to be more robust,
check for less than zero, which will make sure it catches any error
that strscpy() may someday return.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZKbQUhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qpWaAP9zQ1eLQSfMt0dHH01OBSJvc2mMd4QJ
VZtWZ+xTSvk+4gD/axDzDS7Qisfrrli+1oQSPwVik2SXiz0SPJqJ25m9zw4=
=xMlg
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix bad git merge of #endif in arm64 code
A merge of the arm64 tree caused #endif to go into the wrong place
- Fix crash on lseek of write access to tracefs/error_log
Opening error_log as write only, and then doing an lseek() causes a
kernel panic, because the lseek() handle expects a "seq_file" to
exist (which is not done on write only opens). Use tracing_lseek()
that tests for this instead of calling the default seq lseek handler.
- Check for negative instead of -E2BIG for error on strscpy() returns
Instead of testing for -E2BIG from strscpy(), to be more robust,
check for less than zero, which will make sure it catches any error
that strscpy() may someday return.
* tag 'trace-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/boot: Test strscpy() against less than zero for error
arm64: ftrace: fix build error with CONFIG_FUNCTION_GRAPH_TRACER=n
tracing: Fix null pointer dereference in tracing_err_log_open()
Fix an issue in function 'tracing_err_log_open'.
The function doesn't call 'seq_open' if the file is opened only with
write permissions, which results in 'file->private_data' being left as null.
If we then use 'lseek' on that opened file, 'seq_lseek' dereferences
'file->private_data' in 'mutex_lock(&m->lock)', resulting in a kernel panic.
Writing to this node requires root privileges, therefore this bug
has very little security impact.
Tracefs node: /sys/kernel/tracing/error_log
Example Kernel panic:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038
Call trace:
mutex_lock+0x30/0x110
seq_lseek+0x34/0xb8
__arm64_sys_lseek+0x6c/0xb8
invoke_syscall+0x58/0x13c
el0_svc_common+0xc4/0x10c
do_el0_svc+0x24/0x98
el0_svc+0x24/0x88
el0t_64_sync_handler+0x84/0xe4
el0t_64_sync+0x1b4/0x1b8
Code: d503201f aa0803e0 aa1f03e1 aa0103e9 (c8e97d02)
---[ end trace 561d1b49c12cf8a5 ]---
Kernel panic - not syncing: Oops: Fatal exception
Link: https://lore.kernel.org/linux-trace-kernel/20230703155237eucms1p4dfb6a19caa14c79eb6c823d127b39024@eucms1p4
Link: https://lore.kernel.org/linux-trace-kernel/20230704102706eucms1p30d7ecdcc287f46ad67679fc8491b2e0f@eucms1p3
Cc: stable@vger.kernel.org
Fixes: 8a062902be ("tracing: Add tracing error log")
Signed-off-by: Mateusz Stachyra <m.stachyra@samsung.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- fprobe: Pass return address to the fprobe entry/exit callbacks so that
the callbacks don't need to analyze pt_regs/stack to find the function
return address.
- kprobe events: cleanup usage of TPARG_FL_FENTRY and TPARG_FL_RETURN
flags so that those are not set at once.
- fprobe events:
. Add a new fprobe events for tracing arbitrary function entry and
exit as a trace event.
. Add a new tracepoint events for tracing raw tracepoint as a trace
event. This allows user to trace non user-exposed tracepoints.
. Move eprobe's event parser code into probe event common file.
. Introduce BTF (BPF type format) support to kernel probe (kprobe,
fprobe and tracepoint probe) events so that user can specify traced
function arguments by name. This also applies the type of argument
when fetching the argument.
. Introduce '$arg*' wildcard support if BTF is available. This expands
the '$arg*' meta argument to all function argument automatically.
. Check the return value types by BTF. If the function returns 'void',
'$retval' is rejected.
. Add some selftest script for fprobe events, tracepoint events and
BTF support.
. Update documentation about the fprobe events.
. Some fixes for above features, document and selftests.
- selftests for ftrace (except for new fprobe events):
. Add a test case for multiple consecutive probes in a function which
checks if ftrace based kprobe, optimized kprobe and normal kprobe
can be defined in the same target function.
. Add a test case for optimized probe, which checks whether kprobe
can be optimized or not.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmSa+9MACgkQ2/sHvwUr
PxsmOAgAmUOIWtvH5py7AZpIRhCj8B18F6KnT7w2hByCsRxf7SaCqMhpBCk9VnYv
9fJFBHpvYRJEmpHoH3o2ET5AGfKVNac9z96AGI2qJ4ECWITd6I5+WfTdZ5ueVn2d
f6DQ10mHXDHSMFbuqfYWSHtkeivJpWpUNHhwzPb4doNOe06bZNfVuSgnksFg1at5
kq16HbvGnhPzdO4YHmvqwjmRHr5/nCI1KDE9xIBcqNtWFbiRigC11zaZEUkLX+vT
F63ShyfCK718AiwDfnjXpGkXAiVOZuAIR8RELaSqQ92YHCFKq5k9K4++WllPR5f9
AxjVultFDiCd4oSPgYpQkjuZdFq9NA==
=IhmY
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
- fprobe: Pass return address to the fprobe entry/exit callbacks so
that the callbacks don't need to analyze pt_regs/stack to find the
function return address.
- kprobe events: cleanup usage of TPARG_FL_FENTRY and TPARG_FL_RETURN
flags so that those are not set at once.
- fprobe events:
- Add a new fprobe events for tracing arbitrary function entry and
exit as a trace event.
- Add a new tracepoint events for tracing raw tracepoint as a
trace event. This allows user to trace non user-exposed
tracepoints.
- Move eprobe's event parser code into probe event common file.
- Introduce BTF (BPF type format) support to kernel probe (kprobe,
fprobe and tracepoint probe) events so that user can specify
traced function arguments by name. This also applies the type of
argument when fetching the argument.
- Introduce '$arg*' wildcard support if BTF is available. This
expands the '$arg*' meta argument to all function argument
automatically.
- Check the return value types by BTF. If the function returns
'void', '$retval' is rejected.
- Add some selftest script for fprobe events, tracepoint events
and BTF support.
- Update documentation about the fprobe events.
- Some fixes for above features, document and selftests.
- selftests for ftrace (in addition to the new fprobe events):
- Add a test case for multiple consecutive probes in a function
which checks if ftrace based kprobe, optimized kprobe and normal
kprobe can be defined in the same target function.
- Add a test case for optimized probe, which checks whether kprobe
can be optimized or not.
* tag 'probes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Fix tracepoint event with $arg* to fetch correct argument
Documentation: Fix typo of reference file name
tracing/probes: Fix to return NULL and keep using current argc
selftests/ftrace: Add new test case which checks for optimized probes
selftests/ftrace: Add new test case which adds multiple consecutive probes in a function
Documentation: tracing/probes: Add fprobe event tracing document
selftests/ftrace: Add BTF arguments test cases
selftests/ftrace: Add tracepoint probe test case
tracing/probes: Add BTF retval type support
tracing/probes: Add $arg* meta argument for all function args
tracing/probes: Support function parameters if BTF is available
tracing/probes: Move event parameter fetching code to common parser
tracing/probes: Add tracepoint support on fprobe_events
selftests/ftrace: Add fprobe related testcases
tracing/probes: Add fprobe events for tracing function entry and exit.
tracing/probes: Avoid setting TPARG_FL_FENTRY and TPARG_FL_RETURN
fprobe: Pass return address to the handlers
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
cwANDmkL6QQK7yfeeg==
=s0j1
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"There are three areas of note:
A bunch of strlcpy()->strscpy() conversions ended up living in my tree
since they were either Acked by maintainers for me to carry, or got
ignored for multiple weeks (and were trivial changes).
The compiler option '-fstrict-flex-arrays=3' has been enabled
globally, and has been in -next for the entire devel cycle. This
changes compiler diagnostics (though mainly just -Warray-bounds which
is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
coverage. In other words, there are no new restrictions, just
potentially new warnings. Any new FORTIFY warnings we've seen have
been fixed (usually in their respective subsystem trees). For more
details, see commit df8fc4e934.
The under-development compiler attribute __counted_by has been added
so that we can start annotating flexible array members with their
associated structure member that tracks the count of flexible array
elements at run-time. It is possible (likely?) that the exact syntax
of the attribute will change before it is finalized, but GCC and Clang
are working together to sort it out. Any changes can be made to the
macro while we continue to add annotations.
As an example of that last case, I have a treewide commit waiting with
such annotations found via Coccinelle:
https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b
Also see commit dd06e72e68 for more details.
Summary:
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex
arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members"
* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
netfilter: ipset: Replace strlcpy with strscpy
uml: Replace strlcpy with strscpy
um: Use HOST_DIR for mrproper
kallsyms: Replace all non-returning strlcpy with strscpy
sh: Replace all non-returning strlcpy with strscpy
of/flattree: Replace all non-returning strlcpy with strscpy
sparc64: Replace all non-returning strlcpy with strscpy
Hexagon: Replace all non-returning strlcpy with strscpy
kobject: Use return value of strreplace()
lib/string_helpers: Change returned value of the strreplace()
jbd2: Avoid printing outside the boundary of the buffer
checkpatch: Check for 0-length and 1-element arrays
riscv/purgatory: Do not use fortified string functions
s390/purgatory: Do not use fortified string functions
x86/purgatory: Do not use fortified string functions
acpi: Replace struct acpi_table_slit 1-element array with flex-array
clocksource: Replace all non-returning strlcpy with strscpy
string: use __builtin_memcpy() in strlcpy/strlcat
staging: most: Replace all non-returning strlcpy with strscpy
drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
M3VxWD3GZQ==
=KhW4
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
Support function or tracepoint parameters by name if BTF support is enabled
and the event is for function entry (this feature can be used with kprobe-
events, fprobe-events and tracepoint probe events.)
Note that the BTF variable syntax does not require a prefix. If it starts
with an alphabetic character or an underscore ('_') without a prefix like
'$' and '%', it is considered as a BTF variable.
If you specify only the BTF variable name, the argument name will also
be the same name instead of 'arg*'.
# echo 'p vfs_read count pos' >> dynamic_events
# echo 'f vfs_write count pos' >> dynamic_events
# echo 't sched_overutilized_tp rd overutilized' >> dynamic_events
# cat dynamic_events
p:kprobes/p_vfs_read_0 vfs_read count=count pos=pos
f:fprobes/vfs_write__entry vfs_write count=count pos=pos
t:tracepoints/sched_overutilized_tp sched_overutilized_tp rd=rd overutilized=overutilized
Link: https://lore.kernel.org/all/168507474014.913472.16963996883278039183.stgit@mhiramat.roam.corp.google.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Allow fprobe_events to trace raw tracepoints so that user can trace
tracepoints which don't have traceevent wrappers. This new event is
always available if the fprobe_events is enabled (thus no kconfig),
because the fprobe_events depends on the trace-event and traceporint.
e.g.
# echo 't sched_overutilized_tp' >> dynamic_events
# echo 't 9p_client_req' >> dynamic_events
# cat dynamic_events
t:tracepoints/sched_overutilized_tp sched_overutilized_tp
t:tracepoints/_9p_client_req 9p_client_req
The event name is based on the tracepoint name, but if it is started
with digit character, an underscore '_' will be added.
NOTE: to avoid further confusion, this renames TPARG_FL_TPOINT to
TPARG_FL_TEVENT because this flag is used for eprobe (trace-event probe).
And reuse TPARG_FL_TPOINT for this raw tracepoint probe.
Link: https://lore.kernel.org/all/168507471874.913472.17214624519622959593.stgit@mhiramat.roam.corp.google.com/
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305020453.afTJ3VVp-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add fprobe events for tracing function entry and exit instead of kprobe
events. With this change, we can continue to trace function entry/exit
even if the CONFIG_KPROBES_ON_FTRACE is not available. Since
CONFIG_KPROBES_ON_FTRACE requires the CONFIG_DYNAMIC_FTRACE_WITH_REGS,
it is not available if the architecture only supports
CONFIG_DYNAMIC_FTRACE_WITH_ARGS. And that means kprobe events can not
probe function entry/exit effectively on such architecture.
But this can be solved if the dynamic events supports fprobe events.
The fprobe event is a new dynamic events which is only for the function
(symbol) entry and exit. This event accepts non register fetch arguments
so that user can trace the function arguments and return values.
The fprobe events syntax is here;
f[:[GRP/][EVENT]] FUNCTION [FETCHARGS]
f[MAXACTIVE][:[GRP/][EVENT]] FUNCTION%return [FETCHARGS]
E.g.
# echo 'f vfs_read $arg1' >> dynamic_events
# echo 'f vfs_read%return $retval' >> dynamic_events
# cat dynamic_events
f:fprobes/vfs_read__entry vfs_read arg1=$arg1
f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
# echo 1 > events/fprobes/enable
# head -n 20 trace | tail
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
sh-142 [005] ...1. 448.386420: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.386436: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
sh-142 [005] ...1. 448.386451: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.386458: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
sh-142 [005] ...1. 448.386469: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.386476: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
sh-142 [005] ...1. 448.602073: vfs_read__entry: (vfs_read+0x4/0x340) arg1=0xffff888007f7c540
sh-142 [005] ..... 448.602089: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=0x1
Link: https://lore.kernel.org/all/168507469754.913472.6112857614708350210.stgit@mhiramat.roam.corp.google.com/
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/all/202302011530.7vm4O8Ro-lkp@intel.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The tracing_selftest_running and tracing_selftest_disabled variables were
to keep trace_printk() and other writes from affecting the tracing
selftests, as the tracing selftests would examine the ring buffer to see
if it contained what it expected or not. trace_printk() and friends could
add to the ring buffer and cause the selftests to fail (and then disable
the tracer that was being tested). To keep that from happening, these
variables were added and would keep trace_printk() and friends from
writing to the ring buffer while the tests were going on.
But this was only the top level ring buffer (owned by the global_trace
instance). There is no reason to prevent writing into ring buffers of
other instances via the trace_array_printk() and friends. For the
functions that could be used by other instances, check if the global_trace
is the tracer instance that is being written to before deciding to not
allow the write.
Link: https://lkml.kernel.org/r/20230528051742.1325503-5-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There's no reason to test the condition variables tracing_selftest_running
or tracing_selftest_delete when tracing selftests are not enabled. Make
them define 0s when not the selftests are not configured in.
Link: https://lkml.kernel.org/r/20230528051742.1325503-4-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
As there are more and more internal selftests being added to the Linux
kernel (KSAN, lockdep, etc) the selftests are taking longer to run when
these are enabled. Add a cond_resched() to the calling of
do_run_tracer_selftest() to force a schedule if NEED_RESCHED is set,
otherwise the soft lockup watchdog may trigger on boot up.
Link: https://lkml.kernel.org/r/20230528051742.1325503-3-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The variables tracing_selftest_running and tracing_selftest_disabled are
only used for when CONFIG_FTRACE_STARTUP_TEST is enabled. Make them only
visible within the selftest code. The setting of those variables are in
the register_tracer() call, and set in a location where they do not need
to be. Create a wrapper around run_tracer_selftest() called
do_run_tracer_selftest() which sets those variables, and have
register_tracer() call that instead.
Having those variables only set within the CONFIG_FTRACE_STARTUP_TEST
scope gets rid of them (and also the ability to remove testing against
them) when the startup tests are not enabled (most cases).
Link: https://lkml.kernel.org/r/20230528051742.1325503-2-rostedt@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement with strlcpy is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230516143956.1367827-1-azeemshaikh38@gmail.com
The histogram and synthetic events can use a pseudo event called
"stacktrace" that will create a stacktrace at the time of the event and
use it just like it was a normal field. We have other pseudo events such
as "common_cpu" and "common_timestamp". To stay consistent with that,
convert "stacktrace" to "common_stacktrace". As this was used in older
kernels, to keep backward compatibility, this will act just like
"common_cpu" did with "cpu". That is, "cpu" will be the same as
"common_cpu" unless the event has a "cpu" field. In which case, the
event's field is used. The same is true with "stacktrace".
Also update the documentation to reflect this change.
Link: https://lore.kernel.org/linux-trace-kernel/20230523230913.6860e28d@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Make buffer_percent read/write. The buffer_percent file is how users can
state how long to block on the tracing buffer depending on how much
is in the buffer. When it hits the "buffer_percent" it will wake the
task waiting on the buffer. For some reason it was set to read-only.
This was not noticed because testing was done as root without SELinux,
but with SELinux it will prevent even root to write to it without having
CAP_DAC_OVERRIDE.
- The "touched_functions" was added this merge window, but one of the
reasons for adding it was not implemented. That was to show what functions
were not only touched, but had either a direct trampoline attached to
it, or a kprobe or live kernel patching that can "hijack" the function
to run a different function. The point is to know if there's functions
in the kernel that may not be behaving as the kernel code shows. This can
be used for debugging. TODO: Add this information to kernel oops too.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZFUcrxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgOoAP0U2R6+jvA2ehQFb0UTCH9wEu2uEELA
g2CkdPNdn6wJjAD+O1+v5nVkqSpsArjHOhv5OGYrgh+VSXK3Z8EpQ9vUVgg=
=nfoh
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull more tracing updates from Steven Rostedt:
- Make buffer_percent read/write.
The buffer_percent file is how users can state how long to block on
the tracing buffer depending on how much is in the buffer. When it
hits the "buffer_percent" it will wake the task waiting on the
buffer. For some reason it was set to read-only.
This was not noticed because testing was done as root without
SELinux, but with SELinux it will prevent even root to write to it
without having CAP_DAC_OVERRIDE.
- The "touched_functions" was added this merge window, but one of the
reasons for adding it was not implemented.
That was to show what functions were not only touched, but had either
a direct trampoline attached to it, or a kprobe or live kernel
patching that can "hijack" the function to run a different function.
The point is to know if there's functions in the kernel that may not
be behaving as the kernel code shows. This can be used for debugging.
TODO: Add this information to kernel oops too.
* tag 'trace-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Add MODIFIED flag to show if IPMODIFY or direct was attached
tracing: Fix permissions for the buffer_percent file
This file defines both read and write operations, yet it is being
created as read-only. This means that it can't be written to without the
CAP_DAC_OVERRIDE capability. Fix the permissions to allow root to write
to it without the need to override DAC perms.
Link: https://lore.kernel.org/linux-trace-kernel/20230503140114.3280002-1-omosnace@redhat.com
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 03329f9939 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- User events are finally ready!
After lots of collaboration between various parties, we finally locked
down on a stable interface for user events that can also work with user
space only tracing. This is implemented by telling the kernel (or user
space library, but that part is user space only and not part of this
patch set), where the variable is that the application uses to know if
something is listening to the trace. There's also an interface to tell
the kernel about these events, which will show up in the
/sys/kernel/tracing/events/user_events/ directory, where it can be
enabled. When it's enabled, the kernel will update the variable, to tell
the application to start writing to the kernel.
See https://lwn.net/Articles/927595/
- Cleaned up the direct trampolines code to simplify arm64 addition of
direct trampolines. Direct trampolines use the ftrace interface but
instead of jumping to the ftrace trampoline, applications (mostly BPF)
can register their own trampoline for performance reasons.
- Some updates to the fprobe infrastructure. fprobes are more efficient than
kprobes, as it does not need to save all the registers that kprobes on
ftrace do. More work needs to be done before the fprobes will be exposed
as dynamic events.
- More updates to references to the obsolete path of
/sys/kernel/debug/tracing for the new /sys/kernel/tracing path.
- Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer line
by line instead of all at once. There's users in production kernels that
have a large data dump that originally used printk() directly, but the
data dump was larger than what printk() allowed as a single print.
Using seq_buf() to do the printing fixes that.
- Add /sys/kernel/tracing/touched_functions that shows all functions that
was every traced by ftrace or a direct trampoline. This is used for
debugging issues where a traced function could have caused a crash by
a bpf program or live patching.
- Add a "fields" option that is similar to "raw" but outputs the fields of
the events. It's easier to read by humans.
- Some minor fixes and clean ups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZEr36xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quZHAQCzuqnn2S8DsPd3Sy1vKIYaj0uajW5D
Kz1oUJH4F0H7kgEA8XwXkdtfKpOXWc/ZH4LWfL7Orx2wJZJQMV9dVqEPDAE=
=w0Z1
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- User events are finally ready!
After lots of collaboration between various parties, we finally
locked down on a stable interface for user events that can also work
with user space only tracing.
This is implemented by telling the kernel (or user space library, but
that part is user space only and not part of this patch set), where
the variable is that the application uses to know if something is
listening to the trace.
There's also an interface to tell the kernel about these events,
which will show up in the /sys/kernel/tracing/events/user_events/
directory, where it can be enabled.
When it's enabled, the kernel will update the variable, to tell the
application to start writing to the kernel.
See https://lwn.net/Articles/927595/
- Cleaned up the direct trampolines code to simplify arm64 addition of
direct trampolines.
Direct trampolines use the ftrace interface but instead of jumping to
the ftrace trampoline, applications (mostly BPF) can register their
own trampoline for performance reasons.
- Some updates to the fprobe infrastructure. fprobes are more efficient
than kprobes, as it does not need to save all the registers that
kprobes on ftrace do. More work needs to be done before the fprobes
will be exposed as dynamic events.
- More updates to references to the obsolete path of
/sys/kernel/debug/tracing for the new /sys/kernel/tracing path.
- Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
line by line instead of all at once.
There are users in production kernels that have a large data dump
that originally used printk() directly, but the data dump was larger
than what printk() allowed as a single print.
Using seq_buf() to do the printing fixes that.
- Add /sys/kernel/tracing/touched_functions that shows all functions
that was every traced by ftrace or a direct trampoline. This is used
for debugging issues where a traced function could have caused a
crash by a bpf program or live patching.
- Add a "fields" option that is similar to "raw" but outputs the fields
of the events. It's easier to read by humans.
- Some minor fixes and clean ups.
* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
ring-buffer: Sync IRQ works before buffer destruction
tracing: Add missing spaces in trace_print_hex_seq()
ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
recordmcount: Fix memory leaks in the uwrite function
tracing/user_events: Limit max fault-in attempts
tracing/user_events: Prevent same address and bit per process
tracing/user_events: Ensure bit is cleared on unregister
tracing/user_events: Ensure write index cannot be negative
seq_buf: Add seq_buf_do_printk() helper
tracing: Fix print_fields() for __dyn_loc/__rel_loc
tracing/user_events: Set event filter_type from type
ring-buffer: Clearly check null ptr returned by rb_set_head_page()
tracing: Unbreak user events
tracing/user_events: Use print_format_fields() for trace output
tracing/user_events: Align structs with tabs for readability
tracing/user_events: Limit global user_event count
tracing/user_events: Charge event allocs to cgroups
tracing/user_events: Update documentation for ABI
tracing/user_events: Use write ABI in example
tracing/user_events: Add ABI self-test
...
The kernel command line ftrace_boot_snapshot by itself is supposed to
trigger a snapshot at the end of boot up of the main top level trace
buffer. A ftrace_boot_snapshot=foo will do the same for an instance called
foo that was created by trace_instance=foo,...
The logic was broken where if ftrace_boot_snapshot was by itself, it would
trigger a snapshot for all instances that had tracing enabled, regardless
if it asked for a snapshot or not.
When a snapshot is requested for a buffer, the buffer's
tr->allocated_snapshot is set to true. Use that to know if a trace buffer
wants a snapshot at boot up or not.
Since the top level buffer is part of the ftrace_trace_arrays list,
there's no reason to treat it differently than the other buffers. Just
iterate the list if ftrace_boot_snapshot was specified.
Link: https://lkml.kernel.org/r/20230405022341.895334039@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes: 9c1c251d67 ("tracing: Allow boot instances to have snapshot buffers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If a trace instance has a failure with its snapshot code, the error
message is to be written to that instance's buffer. But currently, the
message is written to the top level buffer. Worse yet, it may also disable
the top level buffer and not the instance that had the issue.
Link: https://lkml.kernel.org/r/20230405022341.688730321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes: 2824f50332 ("tracing: Make the snapshot trigger work with instances")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Fix setting affinity of hwlat threads in containers
Using sched_set_affinity() has unwanted side effects when being
called within a container. Use set_cpus_allowed_ptr() instead.
- Fix per cpu thread management of the hwlat tracer
* Do not start per_cpu threads if one is already running for the CPU.
* When starting per_cpu threads, do not clear the kthread variable
as it may already be set to running per cpu threads
- Fix return value for test_gen_kprobe_cmd()
On error the return value was overwritten by being set to
the result of the call from kprobe_event_delete(), which would
likely succeed, and thus have the function return success.
- Fix splice() reads from the trace file that was broken by
36e2c7421f ("fs: don't allow splice read/write without explicit ops")
- Remove obsolete and confusing comment in ring_buffer.c
The original design of the ring buffer used struct page flags
for tricks to optimize, which was shortly removed due to them
being tricks. But a comment for those tricks remained.
- Set local functions and variables to static
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZBdIkBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qti2AP49s1GM8teFgDF/CO3oa45BoYq1lMJO
1Z+x+mS/bdUAZgEAw+3KGo8oZDuvWu/nr04JPeoy0GL1/JnbQ6JNCCjzhQc=
=Yk66
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix setting affinity of hwlat threads in containers
Using sched_set_affinity() has unwanted side effects when being
called within a container. Use set_cpus_allowed_ptr() instead
- Fix per cpu thread management of the hwlat tracer:
- Do not start per_cpu threads if one is already running for the CPU
- When starting per_cpu threads, do not clear the kthread variable
as it may already be set to running per cpu threads
- Fix return value for test_gen_kprobe_cmd()
On error the return value was overwritten by being set to the result
of the call from kprobe_event_delete(), which would likely succeed,
and thus have the function return success
- Fix splice() reads from the trace file that was broken by commit
36e2c7421f ("fs: don't allow splice read/write without explicit
ops")
- Remove obsolete and confusing comment in ring_buffer.c
The original design of the ring buffer used struct page flags for
tricks to optimize, which was shortly removed due to them being
tricks. But a comment for those tricks remained
- Set local functions and variables to static
* tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
ring-buffer: remove obsolete comment for free_buffer_page()
tracing: Make splice_read available again
ftrace: Set direct_ops storage-class-specifier to static
trace/hwlat: Do not start per-cpu thread if it is already running
trace/hwlat: Do not wipe the contents of per-cpu thread data
tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static
tracing: Fix wrong return in kprobe_event_gen_test.c
Since the commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") is applied to the kernel, splice() and
sendfile() calls on the trace file (/sys/kernel/debug/tracing
/trace) return EINVAL.
This patch restores these system calls by initializing splice_read
in file_operations of the trace file. This patch only enables such
functionalities for the read case.
Link: https://lore.kernel.org/linux-trace-kernel/20230314013707.28814-1-sfoon.kim@samsung.com
Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Sung-hun Kim <sfoon.kim@samsung.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Skip negative return code check for snprintf in eprobe.
- Add recursive call test cases for kprobe unit test
- Add 'char' type to probe events to show it as the character instead of value.
- Update kselftest kprobe-event testcase to ignore '__pfx_' symbols.
- Fix kselftest to check filter on eprobe event correctly.
- Add filter on eprobe to the README file in tracefs.
- Fix optprobes to check whether there is 'under unoptimizing' optprobe when optimizing another kprobe correctly.
- Fix optprobe to check whether there is 'under unoptimizing' optprobe when fetching the original instruction correctly.
- Fix optprobe to free 'forcibly unoptimized' optprobe correctly.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmP0JdYACgkQ2/sHvwUr
Pxt6sQf/TD9Kwqx3XG1tnLPev6yt2nuggUippHwWUFHlJtMyUaLV8aKFqByyEe+j
tCQvrFIIJq242xg0Jac/MAf2exlWG9jsmVZPmvC1YzepOAbjXu2eBkIS7LsbeHjF
JJypNnEceffWCpNoD6nlvR0xWXenqRbZJwdsGqo3u+fXnzTurEMY2GU2xOyv39tv
S1uNLPANJxdMb/2iUsUE3hMbe82dqr8zPcApqWFtTBB6QPHI3B2SjuQHpQxwbTPl
bzAl0yQkLSQXprVzT7xJ4xLnzbl1ljgJBci5aX8BFF+VD9oYkypdfYVczBH5VsP9
E3eT9T9lRf4Q99EqxNy5uw7NqQXGQg==
=CMPb
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull kprobes updates from Masami Hiramatsu:
- Skip negative return code check for snprintf in eprobe
- Add recursive call test cases for kprobe unit test
- Add 'char' type to probe events to show it as the character instead
of value
- Update kselftest kprobe-event testcase to ignore '__pfx_' symbols
- Fix kselftest to check filter on eprobe event correctly
- Add filter on eprobe to the README file in tracefs
- Fix optprobes to check whether there is 'under unoptimizing' optprobe
when optimizing another kprobe correctly
- Fix optprobe to check whether there is 'under unoptimizing' optprobe
when fetching the original instruction correctly
- Fix optprobe to free 'forcibly unoptimized' optprobe correctly
* tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: no need to check for negative ret value for snprintf
test_kprobes: Add recursed kprobe test case
tracing/probe: add a char type to show the character value of traced arguments
selftests/ftrace: Fix probepoint testcase to ignore __pfx_* symbols
selftests/ftrace: Fix eprobe syntax test case to check filter support
tracing/eprobe: Fix to add filter on eprobe description in README file
x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range
x86/kprobes: Fix __recover_optprobed_insn check optimizing logic
kprobes: Fix to handle forcibly unoptimized kprobes on freeing_list
- Add function names as a way to filter function addresses
- Add sample module to test ftrace ops and dynamic trampolines
- Allow stack traces to be passed from beginning event to end event for
synthetic events. This will allow seeing the stack trace of when a task is
scheduled out and recorded when it gets scheduled back in.
- Add trace event helper __get_buf() to use as a temporary buffer when printing
out trace event output.
- Add kernel command line to create trace instances on boot up.
- Add enabling of events to instances created at boot up.
- Add trace_array_puts() to write into instances.
- Allow boot instances to take a snapshot at the end of boot up.
- Allow live patch modules to include trace events
- Minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY/PaaBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qh5iAPoD0LKZzD33rhO5Ec4hoexE0DkqycP3
dvmOMbCBL8GkxwEA+d2gLz/EquSFm166hc4D79Sn3geCqvkwmy8vQWVjIQc=
=M82D
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Add function names as a way to filter function addresses
- Add sample module to test ftrace ops and dynamic trampolines
- Allow stack traces to be passed from beginning event to end event for
synthetic events. This will allow seeing the stack trace of when a
task is scheduled out and recorded when it gets scheduled back in.
- Add trace event helper __get_buf() to use as a temporary buffer when
printing out trace event output.
- Add kernel command line to create trace instances on boot up.
- Add enabling of events to instances created at boot up.
- Add trace_array_puts() to write into instances.
- Allow boot instances to take a snapshot at the end of boot up.
- Allow live patch modules to include trace events
- Minor fixes and clean ups
* tag 'trace-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (31 commits)
tracing: Remove unnecessary NULL assignment
tracepoint: Allow livepatch module add trace event
tracing: Always use canonical ftrace path
tracing/histogram: Fix stacktrace histogram Documententation
tracing/histogram: Fix stacktrace key
tracing/histogram: Fix a few problems with stacktrace variable printing
tracing: Add BUILD_BUG() to make sure stacktrace fits in strings
tracing/histogram: Don't use strlen to find length of stacktrace variables
tracing: Allow boot instances to have snapshot buffers
tracing: Add trace_array_puts() to write into instance
tracing: Add enabling of events to boot instances
tracing: Add creation of instances at boot command line
tracing: Fix trace_event_raw_event_synth() if else statement
samples: ftrace: Make some global variables static
ftrace: sample: avoid open-coded 64-bit division
samples: ftrace: Include the nospec-branch.h only for x86
tracing: Acquire buffer from temparary trace sequence
tracing/histogram: Wrap remaining shell snippets in code blocks
tracing/osnoise: No need for schedule_hrtimeout range
bpf/tracing: Use stage6 of tracing to not duplicate macros
...
- Improve the scalability of the CFS bandwidth unthrottling logic
with large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with
the generic scheduler code. Add __cpuidle methods as noinstr to
objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
to query previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period,
to improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- ... Misc other cleanups, fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
skkQXoONNaM=
=l1nN
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Improve the scalability of the CFS bandwidth unthrottling logic with
large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with the
generic scheduler code. Add __cpuidle methods as noinstr to objtool's
noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period, to
improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- Misc other cleanups, fixes
* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
sched/rt: pick_next_rt_entity(): check list_entry
sched/deadline: Add more reschedule cases to prio_changed_dl()
sched/fair: sanitize vruntime of entity being placed
sched/fair: Remove capacity inversion detection
sched/fair: unlink misfit task from cpu overutilized
objtool: mem*() are not uaccess safe
cpuidle: Fix poll_idle() noinstr annotation
sched/clock: Make local_clock() noinstr
sched/clock/x86: Mark sched_clock() noinstr
x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
x86/atomics: Always inline arch_atomic64*()
cpuidle: tracing, preempt: Squash _rcuidle tracing
cpuidle: tracing: Warn about !rcu_is_watching()
cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
cpuidle: drivers: firmware: psci: Dont instrument suspend code
KVM: selftests: Fix build of rseq test
exit: Detect and fix irq disabled state in oops
cpuidle, arm64: Fix the ARM64 cpuidle logic
cpuidle: mvebu: Fix duplicate flags assignment
sched/fair: Limit sched slice duration
...
There are scenes that we want to show the character value of traced
arguments other than a decimal or hexadecimal or string value for debug
convinience. I add a new type named 'char' to do it and a new test case
file named 'kprobe_args_char.tc' to do selftest for char type.
For example:
The to be traced function is 'void demo_func(char type, char *name);', we
can add a kprobe event as follows to show argument values as we want:
echo 'p:myprobe demo_func $arg1:char +0($arg2):char[5]' > kprobe_events
we will get the following trace log:
... myprobe: (demo_func+0x0/0x29) arg1='A' arg2={'b','p','f','1',''}
Link: https://lore.kernel.org/all/20221219110613.367098-1-dolinux.peng@gmail.com/
Signed-off-by: Donglin Peng <dolinux.peng@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The canonical location for the tracefs filesystem is at /sys/kernel/tracing.
But, from Documentation/trace/ftrace.rst:
Before 4.1, all ftrace tracing control files were within the debugfs
file system, which is typically located at /sys/kernel/debug/tracing.
For backward compatibility, when mounting the debugfs file system,
the tracefs file system will be automatically mounted at:
/sys/kernel/debug/tracing
Many comments and Kconfig help messages in the tracing code still refer
to this older debugfs path, so let's update them to avoid confusion.
Link: https://lore.kernel.org/linux-trace-kernel/20230215223350.2658616-2-zwisler@google.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add to ftrace_boot_snapshot, "=<instance>" name, where the instance will
get a snapshot buffer, and will take a snapshot at the end of boot (which
will save the boot traces).
Link: https://lkml.kernel.org/r/20230207173026.792774721@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add a generic trace_array_puts() that can be used to "trace_puts()" into
an allocated trace_array instance. This is just another variant of
trace_array_printk().
Link: https://lkml.kernel.org/r/20230207173026.584717290@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add the format of:
trace_instance=foo,sched:sched_switch,irq_handler_entry,initcall
That will create the "foo" instance and enable the sched_switch event
(here were the "sched" system is explicitly specified), the
irq_handler_entry event, and all events under the system initcall.
Link: https://lkml.kernel.org/r/20230207173026.386114535@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add kernel command line to add tracing instances. This only creates
instances at boot but still does not enable any events to them. Later
changes will extend this command line to add enabling of events, filters,
and triggers. As well as possibly redirecting trace_printk()!
Link: https://lkml.kernel.org/r/20230207173026.186210158@goodmis.org
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
poll() and select() on per_cpu trace_pipe and trace_pipe_raw do not work
since kernel 6.1-rc6. This issue is seen after the commit
42fb0a1e84 ("tracing/ring-buffer: Have
polling block on watermark").
This issue is firstly detected and reported, when testing the CXL error
events in the rasdaemon and also erified using the test application for poll()
and select().
This issue occurs for the per_cpu case, when calling the ring_buffer_poll_wait(),
in kernel/trace/ring_buffer.c, with the buffer_percent > 0 and then wait until the
percentage of pages are available. The default value set for the buffer_percent is 50
in the kernel/trace/trace.c.
As a fix, allow userspace application could set buffer_percent as 0 through
the buffer_percent_fops, so that the task will wake up as soon as data is added
to any of the specific cpu buffer.
Link: https://lore.kernel.org/linux-trace-kernel/20230202182309.742-2-shiju.jose@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mchehab@kernel.org>
Cc: <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 42fb0a1e84 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
SDc/Y/c=
=zpaD
-----END PGP SIGNATURE-----
Merge tag 'v6.2-rc6' into sched/core, to pick up fixes
Pick up fixes before merging another batch of cpuidle updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the selftests to include a test of passing a stacktrace between the
events of a synthetic event.
Link: https://lkml.kernel.org/r/20230117152236.475439286@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Ching-lin Yu <chinglinyu@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently trace_printk() can be used as soon as early_trace_init() is
called from start_kernel(). But if a crash happens, and
"ftrace_dump_on_oops" is set on the kernel command line, all you get will
be:
[ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6
[ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6
[ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6
This is because the trace_printk() event (type 6) hasn't been registered
yet. That gets done via an early_initcall(), which may be early, but not
early enough.
Instead of registering the trace_printk() event (and other ftrace events,
which are not trace events) via an early_initcall(), have them registered at
the same time that trace_printk() can be used. This way, if there is a
crash before early_initcall(), then the trace_printk()s will actually be
useful.
Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: e725c731e3 ("tracing: Split tracing initialization into two for early initialization")
Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ARCH_WANTS_NO_INSTR (a superset of CONFIG_GENERIC_ENTRY) disallows any
and all tracing when RCU isn't enabled.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195541.416110581@infradead.org
- New "symstr" type for dynamic events that writes the name of the
function+offset into the ring buffer and not just the address
- Prevent kernel symbol processing on addresses in user space probes
(uprobes).
- And minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY5yAHxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoWoAP9ZLmqgIqlH3Zcms31SR250kLXxsxT3
JHe82hiuI1I3fAD/Z93QLHw9wngLqIMx/wXsdFjTNOGGWdxfclSWI2qI6Q0=
=KaJg
-----END PGP SIGNATURE-----
Merge tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull trace probes updates from Steven Rostedt:
- New "symstr" type for dynamic events that writes the name of the
function+offset into the ring buffer and not just the address
- Prevent kernel symbol processing on addresses in user space probes
(uprobes).
- And minor fixes and clean ups
* tag 'trace-probes-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Reject symbol/symstr type for uprobe
tracing/probes: Add symstr type for dynamic events
kprobes: kretprobe events missing on 2-core KVM guest
kprobes: Fix check for probe enabled in kill_kprobe()
test_kprobes: Fix implicit declaration error of test_kprobes
tracing: Fix race where eprobes can be called before the event
Add 'symstr' type for storing the kernel symbol as a string data
instead of the symbol address. This allows us to filter the
events by wildcard symbol name.
e.g.
# echo 'e:wqfunc workqueue.workqueue_execute_start symname=$function:symstr' >> dynamic_events
# cat events/eprobes/wqfunc/format
name: wqfunc
ID: 2110
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:__data_loc char[] symname; offset:8; size:4; signed:1;
print fmt: " symname=\"%s\"", __get_str(symname)
Note that there is already 'symbol' type which just change the
print format (so it still stores the symbol address in the tracing
ring buffer.) On the other hand, 'symstr' type stores the actual
"symbol+offset/size" data as a string.
Link: https://lore.kernel.org/all/166679930847.1528100.4124308529180235965.stgit@devnote3/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Currently the tracing dump_on_oops feature is implemented through
separate notifiers, one for die/oops and the other for panic;
given they have the same functionality, let's unify them.
Also improve the function comment and change the priority of the
notifier to make it execute earlier, avoiding showing useless trace
data (like the callback names for the other notifiers); finally,
we also removed an unnecessary header inclusion.
Link: https://lkml.kernel.org/r/20220819221731.480795-7-gpiccoli@igalia.com
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
print_trace_line may overflow seq_file buffer. If the event is not
consumed, the while loop keeps peeking this event, causing a infinite loop.
Link: https://lkml.kernel.org/r/20221129113009.182425-1-yangjihong1@huawei.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 088b1e427d ("ftrace: pipe fixes")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Both CONFIG_OSNOISE_TRACER and CONFIG_HWLAT_TRACER partially enables the
CONFIG_TRACER_MAX_TRACE code, but that is complicated and has
introduced a bug; It declares tracing_max_lat_fops data structure outside
of #ifdefs, but since it is defined only when CONFIG_TRACER_MAX_TRACE=y
or CONFIG_HWLAT_TRACER=y, if only CONFIG_OSNOISE_TRACER=y, that
declaration comes to a definition(!).
To fix this issue, and do not repeat the similar problem, makes
CONFIG_OSNOISE_TRACER and CONFIG_HWLAT_TRACER enables the
CONFIG_TRACER_MAX_TRACE always. It has there benefits;
- Fix the tracing_max_lat_fops bug
- Simplify the #ifdefs
- CONFIG_TRACER_MAX_TRACE code is fully enabled, or not.
Link: https://lore.kernel.org/linux-trace-kernel/167033628155.4111793.12185405690820208159.stgit@devnote3
Fixes: 424b650f35 ("tracing: Fix missing osnoise tracer on max_latency")
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: stable@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/all/166992525941.1716618.13740663757583361463.stgit@warthog.procyon.org.uk/ (original thread and v1)
Link: https://lore.kernel.org/all/202212052253.VuhZ2ulJ-lkp@intel.com/T/#u (v1 error report)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After change in commit 4239174570 ("tracing: Make tracepoint_printk a
static_key"), this symbol is not used outside of the file, so mark it
static.
Link: https://lkml.kernel.org/r/20221122091456.72055-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
After commit a389d86f7f ("ring-buffer: Have nested events still record
running time stamp"), the "event" parameter is no longer used in either
ring_buffer_unlock_commit() or rb_commit(). Best to remove it.
Link: https://lkml.kernel.org/r/1666274811-24138-1-git-send-email-chensong_2000@189.cn
Signed-off-by: Song Chen <chensong_2000@189.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the tracing_reset_all_online_cpus() requires the
trace_types_lock held. But only one caller of this function actually has
that lock held before calling it, and the other just takes the lock so
that it can call it. More users of this function is needed where the lock
is not held.
Add a tracing_reset_all_online_cpus_unlocked() function for the one use
case that calls it without being held, and also add a lockdep_assert to
make sure it is held when called.
Then have tracing_reset_all_online_cpus() take the lock internally, such
that callers do not need to worry about taking it.
Link: https://lkml.kernel.org/r/20221123192741.658273220@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Entries in list 'tr->err_log' will be reused after entry number
exceed TRACING_LOG_ERRS_MAX.
The cmd string of the to be reused entry will be freed first then
allocated a new one. If the allocation failed, then the entry will
still be in list 'tr->err_log' but its 'cmd' field is set to be NULL,
later access of 'cmd' is risky.
Currently above problem can cause the loss of 'cmd' information of first
entry in 'tr->err_log'. When execute `cat /sys/kernel/tracing/error_log`,
reproduce logs like:
[ 37.495100] trace_kprobe: error: Maxactive is not for kprobe(null) ^
[ 38.412517] trace_kprobe: error: Maxactive is not for kprobe
Command: p4:myprobe2 do_sys_openat2
^
Link: https://lore.kernel.org/linux-trace-kernel/20221114104632.3547266-1-zhengyejian1@huawei.com
Fixes: 1581a884b7 ("tracing: Remove size restriction on tracing_log_err cmd strings")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.
That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.
Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.
Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Primiano Tucci <primiano@google.com>
Cc: stable@vger.kernel.org
Fixes: 1e0d6714ac ("ring-buffer: Do not wake up a splice waiter when page is not full")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ftrace_boot_snapshot and alloc_snapshot cmdline options allocate the
snapshot buffer at boot up for use later. The ftrace_boot_snapshot in
particular requires the snapshot to be allocated because it will take a
snapshot at the end of boot up allowing to see the traces that happened
during boot so that it's not lost when user space takes over.
When a tracer is registered (started) there's a path that checks if it
requires the snapshot buffer or not, and if it does not and it was
allocated it will do a synchronization and free the snapshot buffer.
This is only required if the previous tracer was using it for "max
latency" snapshots, as it needs to make sure all max snapshots are
complete before freeing. But this is only needed if the previous tracer
was using the snapshot buffer for latency (like irqoff tracer and
friends). But it does not make sense to free it, if the previous tracer
was not using it, and the snapshot was allocated by the cmdline
parameters. This basically takes away the point of allocating it in the
first place!
Note, the allocated snapshot worked fine for just trace events, but fails
when a tracer is enabled on the cmdline.
Further investigation, this goes back even further and it does not require
a tracer on the cmdline to fail. Simply enable snapshots and then enable a
tracer, and it will remove the snapshot.
Link: https://lkml.kernel.org/r/20221005113757.041df7fe@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 45ad21ca55 ("tracing: Have trace_array keep track if snapshot buffer is allocated")
Reported-by: Ross Zwisler <zwisler@kernel.org>
Tested-by: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When tracing is disabled, there's no reason that waiters should stay
waiting, wake them up, otherwise tasks get stuck when they should be
flushing the buffers.
Cc: stable@vger.kernel.org
Fixes: e30f53aad2 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If a process is waiting on the ring buffer for data, there currently isn't
a clean way to force it to wake up. Add an ioctl call that will force any
tasks that are waiting on the trace_pipe_raw file to wake up.
Link: https://lkml.kernel.org/r/20220929095029.117f913f@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: e30f53aad2 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the file that represents the ring buffer is closed, there may be
waiters waiting on more input from the ring buffer. Call
ring_buffer_wake_waiters() to wake up any waiters when the file is
closed.
Link: https://lkml.kernel.org/r/20220927231825.182416969@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: e30f53aad2 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
It was found that some tracing functions in kernel/trace/trace.c acquire
an arch_spinlock_t with preemption and irqs enabled. An example is the
tracing_saved_cmdlines_size_read() function which intermittently causes
a "BUG: using smp_processor_id() in preemptible" warning when the LTP
read_all_proc test is run.
That can be problematic in case preemption happens after acquiring the
lock. Add the necessary preemption or interrupt disabling code in the
appropriate places before acquiring an arch_spinlock_t.
The convention here is to disable preemption for trace_cmdline_lock and
interupt for max_lock.
Link: https://lkml.kernel.org/r/20220922145622.1744826-1-longman@redhat.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org
Fixes: a35873a099 ("tracing: Add conditional snapshot")
Fixes: 939c7a4f04 ("tracing: Introduce saved_cmdlines_size file")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Runtime verification infrastructure
This is the biggest change for this pull request. It introduces the
runtime verification that is necessary for running Linux on safety
critical systems. It allows for deterministic automata models to be
inserted into the kernel that will attach to tracepoints, where the
information on these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will then
activate a given reactor, that could just inform the user or even panic
the kernel (for which safety critical systems will detect and can recover
from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to be
confused with "work in progress"), and Wakeup While Not Running (WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace several
vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is left
off.
- The rest is various cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYu0yzRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qj4HAP4tQtV55rjj4DQ5XIXmtI3/64PmyRSJ
+y4DEXi1UvEUCQD/QAuQfWoT/7gh35ltkfeS4t3ockzy14rrkP5drZigiQA=
=kEtM
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Runtime verification infrastructure
This is the biggest change here. It introduces the runtime
verification that is necessary for running Linux on safety critical
systems.
It allows for deterministic automata models to be inserted into the
kernel that will attach to tracepoints, where the information on
these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will
then activate a given reactor, that could just inform the user or
even panic the kernel (for which safety critical systems will detect
and can recover from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to
be confused with "work in progress"), and Wakeup While Not Running
(WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace
several vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is
left off.
- The rest is various cleanups and fixes.
* tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
rv: Unlock on error path in rv_unregister_reactor()
tracing: Use alignof__(struct {type b;}) instead of offsetof()
tracing/eprobe: Show syntax error logs in error_log file
scripts/tracing: Fix typo 'the the' in comment
tracepoints: It is CONFIG_TRACEPOINTS not CONFIG_TRACEPOINT
tracing: Use free_trace_buffer() in allocate_trace_buffers()
tracing: Use a struct alignof to determine trace event field alignment
rv/reactor: Add the panic reactor
rv/reactor: Add the printk reactor
rv/monitor: Add the wwnr monitor
rv/monitor: Add the wip monitor
rv/monitor: Add the wip monitor skeleton created by dot2k
Documentation/rv: Add deterministic automata instrumentation documentation
Documentation/rv: Add deterministic automata monitor synthesis documentation
tools/rv: Add dot2k
Documentation/rv: Add deterministic automaton documentation
tools/rv: Add dot2c
Documentation/rv: Add a basic documentation
rv/include: Add instrumentation helper functions
rv/include: Add deterministic automata monitor definition via C macros
...
This pull request contains the following branches:
doc.2022.06.21a: Documentation updates.
fixes.2022.07.19a: Miscellaneous fixes.
nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to
be offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS
and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel
boot parameter prevents offloaded callbacks from interfering
with real-time workloads and with energy-efficiency mechanisms.
poll.2022.07.21a: Polled grace-period updates, perhaps most notably
making these APIs account for both normal and expedited grace
periods.
rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing
the CPU overhead of RCU tasks trace grace periods by more than
a factor of two on a system with 15,000 tasks. The reduction
is expected to increase with the number of tasks, so it seems
reasonable to hypothesize that a system with 150,000 tasks might
see a 20-fold reduction in CPU overhead.
torture.2022.06.21a: Torture-test updates.
ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into
context tracking, thus reducing the overhead of transitioning to
kernel mode from either idle or nohz_full userspace execution
for kernels that track context independently of RCU. This is
expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m
g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq
k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt
0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL
kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5
7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0
Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc
JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL
PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc
egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y
ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r
vX60+QNxvUBLwA==
=vUNm
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes
- Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS and
Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
parameter prevents offloaded callbacks from interfering with
real-time workloads and with energy-efficiency mechanisms
- Polled grace-period updates, perhaps most notably making these APIs
account for both normal and expedited grace periods
- Tasks RCU updates, perhaps most notably reducing the CPU overhead of
RCU tasks trace grace periods by more than a factor of two on a
system with 15,000 tasks.
The reduction is expected to increase with the number of tasks, so it
seems reasonable to hypothesize that a system with 150,000 tasks
might see a 20-fold reduction in CPU overhead
- Torture-test updates
- Updates that merge RCU's dyntick-idle tracking into context tracking,
thus reducing the overhead of transitioning to kernel mode from
either idle or nohz_full userspace execution for kernels that track
context independently of RCU.
This is expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y
* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
rcu: Diagnose extended sync_rcu_do_polled_gp() loops
rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
rcutorture: Test polled expedited grace-period primitives
rcu: Add polled expedited grace-period primitives
rcutorture: Verify that polled GP API sees synchronous grace periods
rcu: Make Tiny RCU grace periods visible to polled APIs
rcu: Make polled grace-period API account for expedited grace periods
rcu: Switch polled grace-period APIs to ->gp_seq_polled
rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
rcu/nocb: Add option to opt rcuo kthreads out of RT priority
rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
rcu/nocb: Add an option to offload all CPUs on boot
rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
rcu/nocb: Add/del rdp to iterate from rcuog itself
rcu/tree: Add comment to describe GP-done condition in fqs loop
rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
rcu/kvfree: Remove useless monitor_todo flag
rcu: Cleanup RCU urgency state for offline CPU
...
In allocate_trace_buffers(), if allocating tr->max_buffer
fails, we can directly call free_trace_buffer to free
tr->array_buffer.
Link: https://lkml.kernel.org/r/65f0702d-07f6-08de-2a07-4c50af56a67b@huawei.com
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
RV is a lightweight (yet rigorous) method that complements classical
exhaustive verification techniques (such as model checking and
theorem proving) with a more practical approach to complex systems.
RV works by analyzing the trace of the system's actual execution,
comparing it against a formal specification of the system behavior.
RV can give precise information on the runtime behavior of the
monitored system while enabling the reaction for unexpected
events, avoiding, for example, the propagation of a failure on
safety-critical systems.
The development of this interface roots in the development of the
paper:
De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.
And:
De Oliveira, Daniel Bristot. Automata-based formal analysis
and verification of the real-time Linux kernel. PhD Thesis, 2020.
The RV interface resembles the tracing/ interface on purpose. The current
path for the RV interface is /sys/kernel/tracing/rv/.
It presents these files:
"available_monitors"
- List the available monitors, one per line.
For example:
# cat available_monitors
wip
wwnr
"enabled_monitors"
- Lists the enabled monitors, one per line;
- Writing to it enables a given monitor;
- Writing a monitor name with a '!' prefix disables it;
- Truncating the file disables all enabled monitors.
For example:
# cat enabled_monitors
# echo wip > enabled_monitors
# echo wwnr >> enabled_monitors
# cat enabled_monitors
wip
wwnr
# echo '!wip' >> enabled_monitors
# cat enabled_monitors
wwnr
# echo > enabled_monitors
# cat enabled_monitors
#
Note that more than one monitor can be enabled concurrently.
"monitoring_on"
- It is an on/off general switcher for monitoring. Note
that it does not disable enabled monitors or detach events,
but stop the per-entity monitors of monitoring the events
received from the system. It resembles the "tracing_on" switcher.
"monitors/"
Each monitor will have its one directory inside "monitors/". There
the monitor specific files will be presented.
The "monitors/" directory resembles the "events" directory on
tracefs.
For example:
# cd monitors/wip/
# ls
desc enable
# cat desc
wakeup in preemptive per-cpu testing monitor.
# cat enable
0
For further information, see the comments in the header of
kernel/trace/rv/rv.c from this patch.
Link: https://lkml.kernel.org/r/a4bfe038f50cb047bfb343ad0e12b0e646ab308b.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently when creating a specific group of trace events,
take kprobe event as example, the user must use the following format:
p:GRP/EVENT [MOD:]KSYM[+OFFS]|KADDR [FETCHARGS],
which means user must enter EVENT name, one example is:
echo 'p:usb_gadget/config_usb_cfg_link config_usb_cfg_link $arg1' >> kprobe_events
It is not simple if there are too many entries because the event name is
the same as symbol name.
This change allows user to specify no EVENT name, format changed as:
p:GRP/ [MOD:]KSYM[+OFFS]|KADDR [FETCHARGS]
It will generate event name automatically and one example is:
echo 'p:usb_gadget/ config_usb_cfg_link $arg1' >> kprobe_events.
Link: https://lore.kernel.org/all/1656296348-16111-4-git-send-email-quic_linyyuan@quicinc.com/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Linyu Yuan <quic_linyyuan@quicinc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If you drop into kdb and type "ftdump" you'll get a sleeping while
atomic warning from memory allocation in trace_find_next_entry().
This appears to have been caused by commit ff895103a8 ("tracing:
Save off entry when peeking at next entry"), which added the
allocation in that path. The problematic commit was already fixed by
commit 8e99cf91b9 ("tracing: Do not allocate buffer in
trace_find_next_entry() in atomic") but that fix missed the kdb case.
The fix here is easy: just move the assignment of the static buffer to
the place where it should have been to begin with:
trace_init_global_iter(). That function is called in two places, once
is right before the assignment of the static buffer added by the
previous fix and once is in kdb.
Note that it appears that there's a second static buffer that we need
to assign that was added in commit efbbdaa22b ("tracing: Show real
address for trace event arguments"), so we'll move that too.
Link: https://lkml.kernel.org/r/20220708170919.1.I75844e5038d9425add2ad853a608cb44bb39df40@changeid
Fixes: ff895103a8 ("tracing: Save off entry when peeking at next entry")
Fixes: efbbdaa22b ("tracing: Show real address for trace event arguments")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The RCU dynticks counter is going to be merged into the context tracking
subsystem. Prepare with moving the NMI extended quiescent states
entrypoints to context tracking. For now those are dumb redirection to
existing RCU calls.
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
The RCU dynticks counter is going to be merged into the context tracking
subsystem. Prepare with moving the IRQ extended quiescent states
entrypoints to context tracking. For now those are dumb redirection to
existing RCU calls.
[ paulmck: Apply Stephen Rothwell feedback from -next. ]
[ paulmck: Apply Nathan Chancellor feedback. ]
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Two conditional compilation directives "#ifdef CONFIG_TRACER_MAX_TRACE"
are used consecutively, and no other code in between. Simplify conditional
the compilation code and only use one "#ifdef CONFIG_TRACER_MAX_TRACE".
Link: https://lkml.kernel.org/r/20220602140613.545069-1-sunliming@kylinos.cn
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- The majority of the changes are for fixes and clean ups.
Noticeable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to having it
embedded in initram). This is useful for embedded boards without initram
disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use more than
59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset>
instead of using the name of the function before it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYpOgXRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qjkKAQDbpemxvpFyJlZqT8KgEIXubu+ag2/q
p0XDHaPS0zF9OQEAjTxg6GMEbnFYl6fzxZtOoEbiaQ7ppfdhRI8t6sSMVA8=
=+nDD
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The majority of the changes are for fixes and clean ups.
Notable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to
having it embedded in initram). This is useful for embedded boards
without initram disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use
more than 59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset> instead of using the
name of the function before it"
* tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (52 commits)
ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function
tracing: Fix comments for event_trigger_separate_filter()
x86/traceponit: Fix comment about irq vector tracepoints
x86,tracing: Remove unused headers
ftrace: Clean up hash direct_functions on register failures
tracing: Fix comments of create_filter()
tracing: Disable kcov on trace_preemptirq.c
tracing: Initialize integer variable to prevent garbage return value
ftrace: Fix typo in comment
ftrace: Remove return value of ftrace_arch_modify_*()
tracing: Cleanup code by removing init "char *name"
tracing: Change "char *" string form to "char []"
tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQ
tracing/timerlat: Print stacktrace in the IRQ handler if needed
tracing/timerlat: Notify IRQ new max latency only if stop tracing is set
kprobes: Fix build errors with CONFIG_KRETPROBES=n
tracing: Fix return value of trace_pid_write()
tracing: Fix potential double free in create_var_ref()
tracing: Use strim() to remove whitespace instead of doing it manually
ftrace: Deal with error return code of the ftrace_process_locs() function
...
The "char []" string form declares a single variable. It is better
than "char *" which creates two variables in the final assembly.
Link: https://lkml.kernel.org/r/20220512143230.28796-1-liqiong@nfschina.com
Signed-off-by: liqiong <liqiong@nfschina.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Setting set_event_pid with trailing whitespace lead to endless write
system calls like below.
$ strace echo "123 " > /sys/kernel/debug/tracing/set_event_pid
execve("/usr/bin/echo", ["echo", "123 "], ...) = 0
...
write(1, "123 \n", 5) = 4
write(1, "\n", 1) = 0
write(1, "\n", 1) = 0
write(1, "\n", 1) = 0
write(1, "\n", 1) = 0
write(1, "\n", 1) = 0
....
This is because, the result of trace_get_user's are not returned when it
read at least one pid. To fix it, update read variable even if
parser->idx == 0.
The result of applied patch is below.
$ strace echo "123 " > /sys/kernel/debug/tracing/set_event_pid
execve("/usr/bin/echo", ["echo", "123 "], ...) = 0
...
write(1, "123 \n", 5) = 5
close(1) = 0
Link: https://lkml.kernel.org/r/20220503050546.288911-1-vvghjk1234@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Baik Song An <bsahn@etri.re.kr>
Cc: Hong Yeon Kim <kimhy@etri.re.kr>
Cc: Taeung Song <taeung@reallinux.co.kr>
Cc: linuxgeek@linuxgeek.io
Cc: stable@vger.kernel.org
Fixes: 4909010788 ("tracing: Add set_event_pid directory for future use")
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracing_set_trace_write() function just removes the trailing whitespace
from the user supplied tracer name, but the leading whitespace should also
be removed.
In addition, if the user supplied tracer name contains only a few
whitespace characters, the first one will not be removed using the current
method, which results it a single whitespace character left in the buf.
To fix all of these issues, we use strim() to correctly remove both the
leading and trailing whitespace.
Link: https://lkml.kernel.org/r/20220121095623.1826679-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A fast/NMI safe accessor for CLOCK_TAI has been introduced.
Use it for adding the additional trace clock "tai".
Link: https://lkml.kernel.org/r/20220414091805.89667-3-kurt@linutronix.de
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Move trace_eval_init() to subsys_initcall to make it start
earlier.
And to avoid tracer_init_tracefs being blocked by
trace_event_sem which trace_eval_init() hold [1],
queue tracer_init_tracefs() to eval_map_wq to let
the two works being executed sequentially.
It can speed up the initialization of kernel as result
of making tracer_init_tracefs asynchronous.
On my arm64 platform, it reduce ~20ms of 125ms which total
time do_initcalls spend.
Link: https://lkml.kernel.org/r/20220426122407.17042-3-mark-pk.tsai@mediatek.com
[1]: https://lore.kernel.org/r/68d7b3327052757d0cd6359a6c9015a85b437232.camel@pengutronix.de
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
To prepare for support asynchronous tracer_init_tracefs initcall,
avoid calling create_trace_option_files before __update_tracer_options.
Otherwise, create_trace_option_files will show warning because
some tracers in trace_types list are already in tr->topts.
For example, hwlat_tracer call register_tracer in late_initcall,
and global_trace.dir is already created in tracing_init_dentry,
hwlat_tracer will be put into tr->topts.
Then if the __update_tracer_options is executed after hwlat_tracer
registered, create_trace_option_files find that hwlat_tracer is
already in tr->topts.
Link: https://lkml.kernel.org/r/20220426122407.17042-2-mark-pk.tsai@mediatek.com
Link: https://lore.kernel.org/lkml/20220322133339.GA32582@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When setting bootparams="trace_event=initcall:initcall_start tp_printk=1" in the
cmdline, the output_printk() was called, and the spin_lock_irqsave() was called in the
atomic and irq disable interrupt context suitation. On the PREEMPT_RT kernel,
these locks are replaced with sleepable rt-spinlock, so the stack calltrace will
be triggered.
Fix it by raw_spin_lock_irqsave when PREEMPT_RT and "trace_event=initcall:initcall_start
tp_printk=1" enabled.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 2, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
[<ffffffff8992303e>] try_to_wake_up+0x7e/0xba0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.1-rt17+ #19 34c5812404187a875f32bee7977f7367f9679ea7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x60/0x8c
dump_stack+0x10/0x12
__might_resched.cold+0x11d/0x155
rt_spin_lock+0x40/0x70
trace_event_buffer_commit+0x2fa/0x4c0
? map_vsyscall+0x93/0x93
trace_event_raw_event_initcall_start+0xbe/0x110
? perf_trace_initcall_finish+0x210/0x210
? probe_sched_wakeup+0x34/0x40
? ttwu_do_wakeup+0xda/0x310
? trace_hardirqs_on+0x35/0x170
? map_vsyscall+0x93/0x93
do_one_initcall+0x217/0x3c0
? trace_event_raw_event_initcall_level+0x170/0x170
? push_cpu_stop+0x400/0x400
? cblist_init_generic+0x241/0x290
kernel_init_freeable+0x1ac/0x347
? _raw_spin_unlock_irq+0x65/0x80
? rest_init+0xf0/0xf0
kernel_init+0x1e/0x150
ret_from_fork+0x22/0x30
</TASK>
Link: https://lkml.kernel.org/r/20220419013910.894370-1-jun.miao@intel.com
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Per PREEMPT_DYNAMIC, checking CONFIG_PREEMPT doesn't tell you the actual
preemption model of the live kernel. Use the newly-introduced accessors
instead.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20211112185203.280040-5-valentin.schneider@arm.com
- The run time string verifier that checks all trace event formats
as they are read from the tracing file to make sure that the %s
pointers are not reading something that no longer exists, failed
to account for %*.s where the length given is zero, and the string
is NULL. It incorrectly flagged it as a null pointer dereference and
gave a WARN_ON().
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYjzyxBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk+WAP4+ICutAiIGyPmHEqtjLGyDPC25nbB3
vg+qWWkWEOIi5gD+PpGsGSE7HYFdWJi1BCfshNOm8I92TyoE2nJkXh3LeA8=
=mZr5
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace event string verifier fix from Steven Rostedt:
"The run-time string verifier checks all trace event formats as
they are read from the tracing file to make sure that the %s pointers
are not reading something that no longer exists.
However, it failed to account for the valid case of '%*.s' where the
length given is zero, and the string is NULL. It incorrectly flagged
it as a null pointer dereference and gave a WARN_ON()"
* tag 'trace-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have trace event string test handle zero length strings
- New user_events interface. User space can register an event with the kernel
describing the format of the event. Then it will receive a byte in a page
mapping that it can check against. A privileged task can then enable that
event like any other event, which will change the mapped byte to true,
telling the user space application to start writing the event to the
tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When set,
the tracing buffer will be saved in the snapshot buffer at boot up when
the kernel hands things over to user space. This will keep the traces that
happened at boot up available even if user space boot up has tracing as
well.
- Have TRACE_EVENT_ENUM() also update trace event field type descriptions.
Thus if a static array defines its size with an enum, the user space trace
event parsers can still know how to parse that array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This will
make one tracepoint be able to trace different content and not be stuck at
only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events for
mapping).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYjiO3RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhQzAQDtek5p80p/zkMGymm14wSH6qq0NdgN
Kv7fTBwEewUa0gD/UCOVLw4Oj+JtHQhCa3sCGZopmRv0BT1+4UQANqosKQY=
=Au08
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- New user_events interface. User space can register an event with the
kernel describing the format of the event. Then it will receive a
byte in a page mapping that it can check against. A privileged task
can then enable that event like any other event, which will change
the mapped byte to true, telling the user space application to start
writing the event to the tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When
set, the tracing buffer will be saved in the snapshot buffer at boot
up when the kernel hands things over to user space. This will keep
the traces that happened at boot up available even if user space boot
up has tracing as well.
- Have TRACE_EVENT_ENUM() also update trace event field type
descriptions. Thus if a static array defines its size with an enum,
the user space trace event parsers can still know how to parse that
array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This
will make one tracepoint be able to trace different content and not
be stuck at only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events
for mapping).
* tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
tracing: Have type enum modifications copy the strings
user_events: Add trace event call as root for low permission cases
tracing/user_events: Use alloc_pages instead of kzalloc() for register pages
tracing: Add snapshot at end of kernel boot up
tracing: Have TRACE_DEFINE_ENUM affect trace event types as well
tracing: Fix strncpy warning in trace_events_synth.c
user_events: Prevent dyn_event delete racing with ioctl add/delete
tracing: Add TRACE_CUSTOM_EVENT() macro
tracing: Move the defines to create TRACE_EVENTS into their own files
tracing: Add sample code for custom trace events
tracing: Allow custom events to be added to the tracefs directory
tracing: Fix last_cmd_set() string management in histogram code
user_events: Fix potential uninitialized pointer while parsing field
tracing: Fix allocation of last_cmd in last_cmd_set()
user_events: Add documentation file
user_events: Add sample code for typical usage
user_events: Add self-test for validator boundaries
user_events: Add self-test for perf_event integration
user_events: Add self-test for dynamic_events integration
user_events: Add self-test for ftrace integration
...
If a trace event has in its TP_printk():
"%*.s", len, len ? __get_str(string) : NULL
It is perfectly valid if len is zero and passing in the NULL.
Unfortunately, the runtime string check at time of reading the trace sees
the NULL and flags it as a bad string and produces a WARN_ON().
Handle this case by passing into the test function if the format has an
asterisk (star) and if so, if the length is zero, then mark it as safe.
Link: https://lore.kernel.org/all/YjsWzuw5FbWPrdqq@bfoster/
Cc: stable@vger.kernel.org
Reported-by: Brian Foster <bfoster@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Fixes: 9a6944fee6 ("tracing: Add a verifier to check string pointers for trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add ftrace_boot_snapshot kernel parameter that will take a snapshot at the
end of boot up just before switching over to user space (it happens during
the kernel freeing of init memory).
This is useful when there's interesting data that can be collected from
kernel start up, but gets overridden by user space start up code. With
this option, the ring buffer content from the boot up traces gets saved in
the snapshot at the end of boot up. This trace can be read from:
/sys/kernel/tracing/snapshot
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
__setup() handlers should generally return 1 to indicate that the
boot options have been handled.
Using invalid option values causes the entire kernel boot option
string to be reported as Unknown and added to init's environment
strings, polluting it.
Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6
kprobe_event=p,syscall_any,$arg1 trace_options=quiet
trace_clock=jiffies", will be passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc6
kprobe_event=p,syscall_any,$arg1
trace_options=quiet
trace_clock=jiffies
Return 1 from the __setup() handlers so that init's environment is not
polluted with kernel boot options.
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org
Cc: stable@vger.kernel.org
Fixes: 7bcfaf54f5 ("tracing: Add trace_options kernel command line parameter")
Fixes: e1e232ca6b ("tracing: Add trace_clock=<clock> kernel parameter")
Fixes: 970988e19e ("tracing/kprobe: Add kprobe_event= boot parameter")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, tracing_log_err.cmd strings are restricted to a length of
MAX_FILTER_STR_VAL (256), which is too short for some commands already
seen in the wild (with cmd strings longer than that showing up
truncated).
Remove the restriction so that no command string is ever truncated.
Link: https://lkml.kernel.org/r/ca965f23256b350ebd94b3dc1a319f28e8267f5f.1643319703.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The kernel parameter "tp_printk_stop_on_boot" starts with "tp_printk" which is
the same as another kernel parameter "tp_printk". If "tp_printk" setup is
called before the "tp_printk_stop_on_boot", it will override the latter
and keep it from being set.
This is similar to other kernel parameter issues, such as:
Commit 745a600cf1 ("um: console: Ignore console= option")
or init/do_mounts.c:45 (setup function of "ro" kernel param)
Fix it by checking for a "_" right after the "tp_printk" and if that
exists do not process the parameter.
Link: https://lkml.kernel.org/r/20220208195421.969326-1-jsyoo5b@gmail.com
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
[ Fixed up change log and added space after if condition ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools directory.
- Can safely filter on user space pointers with: field.ustring ~ "match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads to safely
write to it. Note, this could possibly break existing user space, but we will
not know until we hear about it, and then can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time instead of
at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but the offset
to the dynamic data is based off of the location of the descriptor and not
the beginning of the event. Needed for user defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder other
event callbacks that are registered. Needed for user defined events.
And other small fixes and clean ups.
-
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYeGvcxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrZtAP9ICjJxX54MTErElhhUL/NFLV7wqhJi
OIAgmp6jGVRqPAD+JxQtBnGH+3XMd71ioQkTfQ1rp+jBz2ERBj2DmELUAg0=
=zmda
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools
directory.
- Can safely filter on user space pointers with: field.ustring ~
"match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads
to safely write to it. Note, this could possibly break existing
user space, but we will not know until we hear about it, and then
can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time
instead of at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but
the offset to the dynamic data is based off of the location of the
descriptor and not the beginning of the event. Needed for user
defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder
other event callbacks that are registered. Needed for user defined
events.
And other small fixes and cleanups"
* tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
tracing: Add ustring operation to filtering string pointers
rtla: Add rtla timerlat hist documentation
rtla: Add rtla timerlat top documentation
rtla: Add rtla timerlat documentation
rtla: Add rtla osnoise hist documentation
rtla: Add rtla osnoise top documentation
rtla: Add rtla osnoise man page
rtla: Add Documentation
rtla/timerlat: Add timerlat hist mode
rtla: Add timerlat tool and timelart top mode
rtla/osnoise: Add the hist mode
rtla/osnoise: Add osnoise top mode
rtla: Add osnoise tool
rtla: Helper functions for rtla
rtla: Real-Time Linux Analysis tool
tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() fails
tracing: Remove duplicate warnings when calling trace_create_file()
tracing/kprobes: 'nmissed' not showed correctly for kretprobe
tracing: Add test for user space strings when filtering on string pointers
tracing: Have syscall trace events use trace_event_buffer_lock_reserve()
...
Disabling only bottom halves via local_bh_disable() disables also
preemption but this remains invisible to tracing. On a CONFIG_PREEMPT
kernel one might wonder why there is no scheduling happening despite the
N flag in the trace. The reason might be the a rcu_read_lock_bh()
section.
Add a 'b' to the tracing output if in task context with disabled bottom
halves.
Link: https://lkml.kernel.org/r/YbcbtdtC/bjCKo57@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new osnoise tracer, we are seeing the below splat:
Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0)
BUG: Unable to handle kernel data access on read at 0xc7d880000
Faulting instruction address: 0xc0000000002ffa10
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
...
NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0
LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0
Call Trace:
[c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable)
[c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90
[c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290
[c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710
[c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130
[c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270
[c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180
[c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278
osnoise tracer on ppc64le is triggering osnoise_taint() for negative
duration in get_int_safe_duration() called from
trace_sched_switch_callback()->thread_exit().
The problem though is that the check for a valid trace_percpu_buffer is
incorrect in get_trace_buf(). The check is being done after calculating
the pointer for the current cpu, rather than on the main percpu pointer.
Fix the check to be against trace_percpu_buffer.
Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing marker files are write-only streams with no meaningful
concept of file position. Using stream_open() to mark them as
stream-link indicates this and has the added advantage that a single
file descriptor can now be used from multiple threads without contention
thanks to clearing FMODE_ATOMIC_POS.
Note that this has the potential to break existing userspace by since
both lseek(2) and pwrite(2) will now return ESPIPE when previously lseek
would have updated the stored offset and pwrite would have appended to
the trace. A survey of libtracefs and several other projects found to
use trace_marker(_raw) [1][2][3] suggests that everyone limits
themselves to calling write(2) and close(2) on these file descriptors so
there is a good chance this will go unnoticed and the benefits of
reduced overhead and lock contention seem worth the risk.
[1] https://github.com/google/perfetto
[2] https://github.com/intel/media-driver/
[3] https://w1.fi/cgit/hostap/
Link: https://lkml.kernel.org/r/20211207142558.347029-1-john@metanate.com
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In case trace_event_buffer_lock_reserve() is called with preemption
enabled, the algorithm that defines the usage of the per cpu filter buffer
may fail if the task schedules to another CPU after determining which
buffer it will use.
Disable preemption when using the filter buffer. And because that same
buffer must be used throughout the call, keep preemption disabled until
the filter buffer is released.
This will also keep the semantics between the use case of when the filter
buffer is used, and when the ring buffer itself is used, as that case also
disables preemption until the ring buffer is released.
Link: https://lkml.kernel.org/r/20211130024318.880190623@goodmis.org
[ Fixed warning of assignment in if statement
Reported-by: kernel test robot <lkp@intel.com> ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The value read by this_cpu_read() is used later and its use is expected to
stay on the same CPU as being read. But this_cpu_read() does not warn if
it is called without preemption disabled, where as __this_cpu_read() will
check if preemption is disabled on CONFIG_DEBUG_PREEMPT
Currently all callers have preemption disabled, but there may be new
callers in the future that may not.
Link: https://lkml.kernel.org/r/20211130024318.698165354@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If trace_seq becomes full, trace_seq_vprintf() no longer consumes
arguments from va_list, making va_list out of sync with format
processing by trace_check_vprintf().
This causes va_arg() in trace_check_vprintf() to return wrong
positional argument, which results into a WARN_ON_ONCE() hit.
ftrace_stress_test from LTP triggers this situation.
Fix it by explicitly avoiding further use if va_list at the point
when it's consistency can no longer be guaranteed.
Link: https://lkml.kernel.org/r/20211118145516.13219-1-nikita.yushchenko@virtuozzo.com
Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() to avoid confusing memset() about writing beyond
the target struct member.
Link: https://lkml.kernel.org/r/20211118202217.1285588-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- osnoise and timerlat updates that will work with the RTLA tool (Real-Time
Linux Analysis). Specifically it disconnects the work load (threads
that look for latency) from the tracing instances attached to them,
allowing for more than one instance to retrieve data from the work load.
- Optimization on division in the trace histogram trigger code to use shift
and multiply when possible. Also added documentation.
- Fix prototype to my_direct_func in direct ftrace trampoline sample code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYKWXxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqJEAP9czpSZ/nFvDjxdGHZAcKKXCFWbGcK5
IF2cHDDwxXjZ/gD+NnpRhR1JPfA55fO52DUJPn2cOU5xOsP6DmJxu6mwDg0=
=AKVv
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- osnoise and timerlat updates that will work with the RTLA tool
(Real-Time Linux Analysis).
Specifically it disconnects the work load (threads that look for
latency) from the tracing instances attached to them, allowing for
more than one instance to retrieve data from the work load.
- Optimization on division in the trace histogram trigger code to use
shift and multiply when possible. Also added documentation.
- Fix prototype to my_direct_func in direct ftrace trampoline sample
code.
* tag 'trace-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/samples: Add missing prototype for my_direct_func
tracing/selftests: Add tests for hist trigger expression parsing
tracing/histogram: Document hist trigger variables
tracing/histogram: Update division by 0 documentation
tracing/histogram: Optimize division by constants
tracing/osnoise: Remove PREEMPT_RT ifdefs from inside functions
tracing/osnoise: Remove STACKTRACE ifdefs from inside functions
tracing/osnoise: Allow multiple instances of the same tracer
tracing/osnoise: Remove TIMERLAT ifdefs from inside functions
tracing/osnoise: Support a list of trace_array *tr
tracing/osnoise: Use start/stop_per_cpu_kthreads() on osnoise_cpus_write()
tracing/osnoise: Split workload start from the tracer start
tracing/osnoise: Improve comments about barrier need for NMI callbacks
tracing/osnoise: Do not follow tracing_cpumask
- kprobes: Restructured stack unwinder to show properly on x86 when a stack
dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only denying
others). There's been pressure to allow non root to tracefs in a
controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function tracer
instead of having its own trampoline (this change will happen on an arch
by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform calculations
against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent warnings
from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over if
branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYBdxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qp1sAQD2oYFwaG3sx872gj/myBcHIBSKdiki
Hry5csd8zYDBpgD+Poylopt5JIbeDuoYw/BedgEXmscZ8Qr7VzjAXdnv/Q4=
=Loz8
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
When building the files in the tracefs file system, do not by default set
any permissions for OTH (other). This will make it easier for admins who
want to define a group for accessing tracefs and not having to first
disable all the permission bits for "other" in the file system.
As tracing can leak sensitive information, it should never by default
allowing all users access. An admin can still set the permission bits for
others to have access, which may be useful for creating a honeypot and
seeing who takes advantage of it and roots the machine.
Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The compiler warns when the data are actually unused:
kernel/trace/trace.c:1712:13: error: ‘trace_create_maxlat_file’ defined but not used [-Werror=unused-function]
1712 | static void trace_create_maxlat_file(struct trace_array *tr,
| ^~~~~~~~~~~~~~~~~~~~~~~~
[Why]
CONFIG_HWLAT_TRACER=n, CONFIG_TRACER_MAX_TRACE=n, CONFIG_OSNOISE_TRACER=y
gcc report warns.
[How]
Now trace_create_maxlat_file will only take effect when
CONFIG_HWLAT_TRACER=y or CONFIG_TRACER_MAX_TRACE=y. In fact, after
adding osnoise trace, it also needs to take effect.
Link: https://lore.kernel.org/all/c1d9e328-ad7c-920b-6c24-9e1598a6421c@infradead.org/
Link: https://lkml.kernel.org/r/20210922025122.3268022-1-liu.yun@linux.dev
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of having the logic that does trace_pid_list open coded, wrap it in
abstract functions. This will allow a rewrite of the logic that implements
the trace_pid_list without affecting the users.
Note, this causes a change in behavior. Every time a pid is written into
the set_*_pid file, it creates a new list and uses RCU to update it. If
pid_max is lowered, but there was a pid currently in the list that was
higher than pid_max, those pids will now be removed on updating the list.
The old behavior kept that from happening.
The rewrite of the pid_list logic will no longer depend on pid_max,
and will return the old behavior.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Add migrate-disable counter to tracing header
- Fix error handling in event probes
- Fix missed unlock in osnoise in error path
- Fix merge issue with tools/bootconfig
- Clean up bootconfig data when init memory is removed
- Fix bootconfig to loop only on subkeys
- Have kernel command lines override bootconfig options
- Increase field counts for synthetic events
- Have histograms dynamic allocate event elements to save space
- Fixes in testing and documentation
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYToFZBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtg5AP44U3Dn1m1lQo3y1DJ9kUP3HsAsDofS
Cv7ZM9tLV2p4MQEA9KJc3/B/5BZEK1kso3uLeLT+WxJOC4YStXY19WwmjAI=
=Wuo+
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- Add migrate-disable counter to tracing header
- Fix error handling in event probes
- Fix missed unlock in osnoise in error path
- Fix merge issue with tools/bootconfig
- Clean up bootconfig data when init memory is removed
- Fix bootconfig to loop only on subkeys
- Have kernel command lines override bootconfig options
- Increase field counts for synthetic events
- Have histograms dynamic allocate event elements to save space
- Fixes in testing and documentation
* tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/boot: Fix to loop on only subkeys
selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events
tracing: Dynamically allocate the per-elt hist_elt_data array
tracing: synth events: increase max fields count
tools/bootconfig: Show whole test command for each test case
bootconfig: Fix missing return check of xbc_node_compose_key function
tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh
docs: bootconfig: Add how to use bootconfig for kernel parameters
init/bootconfig: Reorder init parameter from bootconfig and cmdline
init: bootconfig: Remove all bootconfig data when the init memory is removed
tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads()
tracing: Fix some alloc_event_probe() error handling bugs
tracing: Add migrate-disabled counter to tracing output.
- Simplifying the Kconfig use of FTRACE and TRACE_IRQFLAGS_SUPPORT
- bootconfig now can start histograms
- bootconfig supports group/all enabling
- histograms now can put values in linear size buckets
- execnames can be passed to synthetic events
- Introduction of "event probes" that attach to other events and
can retrieve data from pointers of fields, or record fields
as different types (a pointer to a string as a string instead
of just a hex number)
- Various fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYTJDixQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnPLAP9XviWrZD27uFj6LU/Vp2umbq8la1aC
oW8o9itUGpLoHQD+OtsMpQXsWrxoNw/JD1OWCH4J0YN+TnZAUUG2E9e0twA=
=OZXG
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- simplify the Kconfig use of FTRACE and TRACE_IRQFLAGS_SUPPORT
- bootconfig can now start histograms
- bootconfig supports group/all enabling
- histograms now can put values in linear size buckets
- execnames can be passed to synthetic events
- introduce "event probes" that attach to other events and can retrieve
data from pointers of fields, or record fields as different types (a
pointer to a string as a string instead of just a hex number)
- various fixes and clean ups
* tag 'trace-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (35 commits)
tracing/doc: Fix table format in histogram code
selftests/ftrace: Add selftest for testing duplicate eprobes and kprobes
selftests/ftrace: Add selftest for testing eprobe events on synthetic events
selftests/ftrace: Add test case to test adding and removing of event probe
selftests/ftrace: Fix requirement check of README file
selftests/ftrace: Add clear_dynamic_events() to test cases
tracing: Add a probe that attaches to trace events
tracing/probes: Reject events which have the same name of existing one
tracing/probes: Have process_fetch_insn() take a void * instead of pt_regs
tracing/probe: Change traceprobe_set_print_fmt() to take a type
tracing/probes: Use struct_size() instead of defining custom macros
tracing/probes: Allow for dot delimiter as well as slash for system names
tracing/probe: Have traceprobe_parse_probe_arg() take a const arg
tracing: Have dynamic events have a ref counter
tracing: Add DYNAMIC flag for dynamic events
tracing: Replace deprecated CPU-hotplug functions.
MAINTAINERS: Add an entry for os noise/latency
tracepoint: Fix kerneldoc comments
bootconfig/tracing/ktest: Update ktest example for boot-time tracing
tools/bootconfig: Use per-group/all enable option in ftrace2bconf script
...
migrate_disable() forbids task migration to another CPU. It is available
since v5.11 and has already users such as highmem or BPF. It is useful
to observe this task state in tracing which already has other states
like the preemption counter.
Instead of adding the migrate disable counter as a new entry to struct
trace_entry, which would extend the whole struct by four bytes, it is
squashed into the preempt-disable counter. The lower four bits represent
the preemption counter, the upper four bits represent the migrate
disable counter. Both counter shouldn't exceed 15 but if they do, there
is a safety net which caps the value at 15.
Add the migrate-disable counter to the trace entry so it shows up in the
trace. Due to the users mentioned above, it is already possible to
observe it:
| bash-1108 [000] ...21 73.950578: rss_stat: mm_id=2213312838 curr=0 type=MM_ANONPAGES size=8192B
| bash-1108 [000] d..31 73.951222: irq_disable: caller=flush_tlb_mm_range+0x115/0x130 parent=ptep_clear_flush+0x42/0x50
| bash-1108 [000] d..31 73.951222: tlb_flush: pages:1 reason:local mm shootdown (3)
The last value is the migrate-disable counter.
Things that popped up:
- trace_print_lat_context() does not print the migrate counter. Not sure
if it should. It is used in "verbose" mode and uses 8 digits and I'm
not sure ther is something processing the value.
- trace_define_common_fields() now defines a different variable. This
probably breaks things. No ide what to do in order to preserve the old
behaviour. Since this is used as a filter it should be split somehow
to be able to match both nibbles here.
Link: https://lkml.kernel.org/r/20210810132625.ylssabmsrkygokuv@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: patch description.]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ SDR: Removed change to common_preempt_count field name ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmEt+hwACgkQUqAMR0iA
lPLppBAAiyrUNVmqqtdww+IJajEs1uD/4FqPsysHRwroHBFymJeQG1XCwUpDZ7jj
6gXT0chxyjQE18gT/W9nf+PSmA9XvIVA1WSR+WCECTNW3YoZXqtgwiHfgnitXYku
HlmoZLthYeuoXWw2wn+hVLfTRh6VcPHYEaC21jXrs6B1pOXHbvjJ5eTLHlX9oCfL
UKSK+jFTHAJcn/GskRzviBe0Hpe8fqnkRol2XX13ltxqtQ73MjaGNu7imEH6/Pa7
/MHXWtuWJtOvuYz17aztQP4Qwh1xy+kakMy3aHucdlxRBTP4PTzzTuQI3L/RYi6l
+ttD7OHdRwqFAauBLY3bq3uJjYb5v/64ofd8DNnT2CJvtznY8wrPbTdFoSdPcL2Q
69/opRWHcUwbU/Gt4WLtyQf3Mk0vepgMbbVg1B5SSy55atRZaXMrA2QJ/JeawZTB
KK6D/mE7ccze/YFzsySunCUVKCm0veoNxEAcakCCZKXSbsvd1MYcIRC0e+2cv6e5
2NEH7gL4dD+5tqu5nzvIuKDn3NrDQpbi28iUBoFbkxRgcVyvHJ9AGSa62wtb5h3D
OgkqQMdVKBbjYNeUodPlQPzmXZDasytavyd0/BC/KENOcBvU/8gW++2UZTfsh/1A
dLjgwFBdyJncQcCS9Abn20/EKntbIMEX8NLa97XWkA3fuzMKtak=
=yEVq
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Optionally, provide an index of possible printk messages via
<debugfs>/printk/index/. It can be used when monitoring important
kernel messages on a farm of various hosts. The monitor has to be
updated when some messages has changed or are not longer available by
a newly deployed kernel.
- Add printk.console_no_auto_verbose boot parameter. It allows to
generate crash dump even with slow consoles in a reasonable time
frame.
- Remove printk_safe buffers. The messages are always stored directly
to the main logbuffer, even in NMI or recursive context. Also it
allows to serialize syslog operations by a mutex instead of a spin
lock.
- Misc clean up and build fixes.
* tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/index: Fix -Wunused-function warning
lib/nmi_backtrace: Serialize even messages about idle CPUs
printk: Add printk.console_no_auto_verbose boot parameter
printk: Remove console_silent()
lib/test_scanf: Handle n_bits == 0 in random tests
printk: syslog: close window between wait and read
printk: convert @syslog_lock to mutex
printk: remove NMI tracking
printk: remove safe buffers
printk: track/limit recursion
lib/nmi_backtrace: explicitly serialize banner and regs
printk: Move the printk() kerneldoc comment to its new home
printk/index: Fix warning about missing prototypes
MIPS/asm/printk: Fix build failure caused by printk
printk: index: Add indexing support to dev_printk
printk: Userspace format indexing support
printk: Rework parse_prefix into printk_parse_prefix
printk: Straighten out log_flags into printk_info_flags
string_helpers: Escape double quotes in escape_special
printk/console: Check consistent sequence number when handling race in console_unlock()
A new dynamic event is introduced: event probe. The event is attached
to an existing tracepoint and uses its fields as arguments. The user
can specify custom format string of the new event, select what tracepoint
arguments will be printed and how to print them.
An event probe is created by writing configuration string in
'dynamic_events' ftrace file:
e[:[SNAME/]ENAME] SYSTEM/EVENT [FETCHARGS] - Set an event probe
-:SNAME/ENAME - Delete an event probe
Where:
SNAME - System name, if omitted 'eprobes' is used.
ENAME - Name of the new event in SNAME, if omitted the SYSTEM_EVENT is used.
SYSTEM - Name of the system, where the tracepoint is defined, mandatory.
EVENT - Name of the tracepoint event in SYSTEM, mandatory.
FETCHARGS - Arguments:
<name>=$<field>[:TYPE] - Fetch given filed of the tracepoint and print
it as given TYPE with given name. Supported
types are:
(u8/u16/u32/u64/s8/s16/s32/s64), basic type
(x8/x16/x32/x64), hexadecimal types
"string", "ustring" and bitfield.
Example, attach an event probe on openat system call and print name of the
file that will be opened:
echo "e:esys/eopen syscalls/sys_enter_openat file=\$filename:string" >> dynamic_events
A new dynamic event is created in events/esys/eopen/ directory. It
can be deleted with:
echo "-:esys/eopen" >> dynamic_events
Filters, triggers and histograms can be attached to the new event, it can
be matched in synthetic events. There is one limitation - an event probe
can not be attached to kprobe, uprobe or another event probe.
Link: https://lkml.kernel.org/r/20210812145805.2292326-1-tz.stoyanov@gmail.com
Link: https://lkml.kernel.org/r/20210819152825.142428383@goodmis.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Co-developed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As dynamic events are not created by modules, if something is attached to
one, calling "try_module_get()" on its "mod" field, is not going to keep
the dynamic event from going away.
Since dynamic events do not need the "mod" pointer of the event structure,
make a union out of it in order to save memory (there's one structure for
each of the thousand+ events in the kernel), and have any event with the
DYNAMIC flag set to use a ref counter instead.
Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035027.174869074@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Update both the tracefs README file as well as the histogram.rst to
include an explanation of what the buckets modifier is and how to use it.
Include an example with the wakeup_latency example for both log2 and the
buckets modifiers as there was no existing log2 example.
Link: https://lkml.kernel.org/r/20210707213922.167218794@goodmis.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The event filters are not applied on all of the output, which results in
the flood of printk when using tp_printk. Unfolding
event_trigger_unlock_commit_regs() into trace_event_buffer_commit(), so
the filters can be applied on every output.
Link: https://lkml.kernel.org/r/20210814034538.8428-1-kernelfans@gmail.com
Cc: stable@vger.kernel.org
Fixes: 0daa230296 ("tracing: Add tp_printk cmdline to have tracepoints go to printk()")
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The event_trace_add_tracer() can fail. In this case, it leads to a crash
in start_creating with below call stack. Handle the error scenario
properly in trace_array_create_dir.
Call trace:
down_write+0x7c/0x204
start_creating.25017+0x6c/0x194
tracefs_create_file+0xc4/0x2b4
init_tracer_tracefs+0x5c/0x940
trace_array_create_dir+0x58/0xb4
trace_array_create+0x1bc/0x2b8
trace_array_get_by_name+0xdc/0x18c
Link: https://lkml.kernel.org/r/1627651386-21315-1-git-send-email-kamaagra@codeaurora.org
Cc: stable@vger.kernel.org
Fixes: 4114fbfd02 ("tracing: Enable creating new instance early boot")
Signed-off-by: Kamal Agrawal <kamaagra@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.
There are several parts of the kernel that are manually calling into
the printk NMI context tracking in order to cause general printk
deferred printing:
arch/arm/kernel/smp.c
arch/powerpc/kexec/crash.c
kernel/trace/trace.c
For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new
function pair printk_deferred_enter/exit that explicitly achieves the
same objective.
For ftrace, remove the printk context manipulation completely. It was
added in commit 03fc7f9c99 ("printk/nmi: Prevent deadlock when
accessing the main log buffer in NMI"). The purpose was to enforce
storing messages directly into the ring buffer even in NMI context.
It really should have only modified the behavior in NMI context.
There is no need for a special behavior any longer. All messages are
always stored directly now. The console deferring is handled
transparently in vprintk().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
[pmladek@suse.com: Remove special handling in ftrace.c completely.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de
Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.
The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.
For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:
># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
Gives a misleading and wrong result.
Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.
Now we can even do:
># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
># cat events/workqueue/workqueue_queue_work/hist
# event histogram
#
# trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
#
{ common_cpu: 0, cpu: 2 } hitcount: 1
{ common_cpu: 0, cpu: 4 } hitcount: 1
{ common_cpu: 7, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 1 } hitcount: 1
{ common_cpu: 0, cpu: 6 } hitcount: 2
{ common_cpu: 0, cpu: 5 } hitcount: 2
{ common_cpu: 1, cpu: 1 } hitcount: 4
{ common_cpu: 6, cpu: 6 } hitcount: 4
{ common_cpu: 5, cpu: 5 } hitcount: 14
{ common_cpu: 4, cpu: 4 } hitcount: 26
{ common_cpu: 0, cpu: 0 } hitcount: 39
{ common_cpu: 2, cpu: 2 } hitcount: 184
Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.
I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".
Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 8b7622bf94 ("tracing: Add cpu field for hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts, softirqs
and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail what
sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event.
This has been hardcoded as "1" since 2015, no tooling should be looking
at it now. If one exists, we can revert this commit, fix that tool and
try to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes trace
events to write to console. When user space starts, this can easily live
lock the system. Having a boot option to stop just after boot up is
useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that match
the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path from
user space that can do this. All other paths are considered a bug.
- Small clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYN8YPhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhxLAP9Mo5hHv7Hg6W7Ddv77rThm+qclsMR/
yW0P+eJpMm4+xAD8Cq03oE1DimPK+9WZBKU5rSqAkqG6CjgDRw6NlIszzQQ=
=WEPR
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts,
softirqs and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail
what sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event. This has
been hardcoded as "1" since 2015, no tooling should be looking at it
now. If one exists, we can revert this commit, fix that tool and try
to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes
trace events to write to console. When user space starts, this can
easily live lock the system. Having a boot option to stop just after
boot up is useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that
match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path
from user space that can do this. All other paths are considered a
bug.
- Small clean ups and fixes
* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
treewide: Add missing semicolons to __assign_str uses
tracing: Change variable type as bool for clean-up
trace/timerlat: Fix indentation on timerlat_main()
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
Documentation: Fix a typo on trace/osnoise-tracer
trace/osnoise: Fix return value on osnoise_init_hotplug_support
trace/osnoise: Make interval u64 on osnoise_main
trace/osnoise: Fix 'no previous prototype' warnings
tracing: Have osnoise_main() add a quiescent state for task rcu
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
seq_buf: Fix overflow in seq_buf_putmem_hex()
trace/osnoise: Support hotplug operations
trace/hwlat: Support hotplug operations
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
trace: Add timerlat tracer
trace: Add osnoise tracer
...
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that
on systems where pid_max is configured higher than PID_MAX_DEFAULT the
ftrace record-tgid option doesn't work so well. Any tasks with PIDs
higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and
don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its
highest possible 1<<22 value by default on 64 bit systems this renders
the record-tgids option of little use.
Increase the size of tgid_map to the configured pid_max instead,
allowing it to cover the full range of PIDs up to the maximum value of
PID_MAX_LIMIT if the system is configured that way.
On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the
size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in
memory overhead sounds significant 64 bit systems are presumably best
placed to accommodate it, and since tgid_map is only allocated when the
record-tgid option is actually used presumably the user would rather it
spends sufficient memory to actually record the tgids they expect.
The size of tgid_map could also increase for CONFIG_BASE_SMALL=y
configurations, but these seem unlikely to be systems upon which people
are both configuring a large pid_max and running ftrace with record-tgid
anyway.
Of note is that we only allocate tgid_map once, the first time that the
record-tgid option is enabled. Therefore its size is only set once, to
the value of pid_max at the time the record-tgid option is first
enabled. If a user increases pid_max after that point, the saved_tgids
file will not contain entries for any tasks with pids beyond the earlier
value of pid_max.
Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Burton <paulburton@google.com>
[ Fixed comment coding style ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.
Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Burton <paulburton@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With the coming addition of the osnoise tracer, the configs needed to
include the latency_fsnotify() has become more complex, and to keep the
declaration in the header file the same as in the C file, just have the
logic needed to define it in one place, and that defines LATENCY_FS_NOTIFY
which will be used in the C code.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The hwlat detector and (in preparation for) the osnoise/timerlat tracers
have a set of u64 parameters that the user can read/write via tracefs.
For instance, we have hwlat_detector's window and width.
To reduce the code duplication, hwlat's window and width share the same
read function. However, they do not share the write functions because
they do different parameter checks. For instance, the width needs to
be smaller than the window, while the window needs to be larger
than the window. The same pattern repeats on osnoise/timerlat, and
a large portion of the code was devoted to the write function.
Despite having different checks, the write functions have the same
structure:
read a user-space buffer
take the lock that protects the value
check for minimum and maximum acceptable values
save the value
release the lock
return success or error
To reduce the code duplication also in the write functions, this patch
provides a generic read and write implementation for u64 values that
need to be within some minimum and/or maximum parameters, while
(potentially) being protected by a lock.
To use this interface, the structure trace_min_max_param needs to be
filled:
struct trace_min_max_param {
struct mutex *lock;
u64 *val;
u64 *min;
u64 *max;
};
The desired value is stored on the variable pointed by *val. If *min
points to a minimum acceptable value, it will be checked during the
write operation. Likewise, if *max points to a maximum allowable value,
it will be checked during the write operation. Finally, if *lock points
to a mutex, it will be taken at the beginning of the operation and
released at the end.
The definition of a trace_min_max_param needs to passed as the
(private) *data for tracefs_create_file(), and the trace_min_max_fops
(added by this patch) as the *fops file_operations.
Link: https://lkml.kernel.org/r/3e35760a7c8b5c55f16ae5ad5fc54a0e71cbe647.1624372313.git.bristot@redhat.com
Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A while ago, when the "trace" file was opened, tracing was stopped, and
code was added to stop recording the comms to saved_cmdlines, for mapping
of the pids to the task name.
Code has been added that only records the comm if a trace event occurred,
and there's no reason to not trace it if the trace file is opened.
Cc: stable@vger.kernel.org
Fixes: 7ffbd48d5c ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The saved_cmdlines is used to map pids to the task name, such that the
output of the tracing does not just show pids, but also gives a human
readable name for the task.
If the name is not mapped, the output looks like this:
<...>-1316 [005] ...2 132.044039: ...
Instead of this:
gnome-shell-1316 [005] ...2 132.044039: ...
The names are updated when tracing is running, but are skipped if tracing
is stopped. Unfortunately, this stops the recording of the names if the
top level tracer is stopped, and not if there's other tracers active.
The recording of a name only happens when a new event is written into a
ring buffer, so there is no need to test if tracing is on or not. If
tracing is off, then no event is written and no need to test if tracing is
off or not.
Remove the check, as it hides the names of tasks for events in the
instance buffers.
Cc: stable@vger.kernel.org
Fixes: 7ffbd48d5c ("tracing: Cache comms only after an event occurred")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The kernel parameter for ftrace_dump_on_oops can take a single assignment.
That is, it can be:
ftrace_dump_on_oops or ftrace_dump_on_oops=orig_cpu
But the content in the sysctl file is a number.
0 for disabled
1 for ftrace_dump_on_oops (all CPUs)
2 for ftrace_dump_on_oops (orig CPU)
Allow the kernel command line to take a number as well to match the sysctl
numbers.
That is:
ftrace_dump_on_oops=1 is the same as ftrace_dump_on_oops
and
ftrace_dump_on_oops=2 is the same as ftrace_dump_on_oops=orig_cpu
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a kernel command line option that disables printing of events to
console at late_initcall_sync(). This is useful when needing to see
specific events written to console on boot up, but not wanting it when
user space starts, as user space may make the console so noisy that the
system becomes inoperable.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When filtering is enabled, the event is copied into a temp buffer instead
of being written into the ring buffer directly, because the discarding of
events from the ring buffer is very expensive, and doing the extra copy is
much faster than having to discard most of the time.
As that logic is subtle, add comments to explain in more detail to what is
going on and how it works.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When filtering trace events, a temp buffer is used because the extra copy
from the temp buffer into the ring buffer is still faster than the direct
write into the ring buffer followed by a discard if the filter does not
match.
But the data that can be stored in the temp buffer is a PAGE_SIZE minus the
ring buffer event header. The calculation of that header size is complex,
but using the helper macro "struct_size()" can simplify it.
Link: https://lore.kernel.org/stable/CAHk-=whKbJkuVmzb0hD3N6q7veprUrSpiBHRxVY=AffWZPtxmg@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The variable ret is being initialized with a value that is never read,
it is being updated later on. The assignment is redundant and can be
removed.
Link: https://lkml.kernel.org/r/20210513115517.58178-1-colin.king@canonical.com
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a trace event uses the %*.s notation, the trace_check_vprintf() will
fail and will warn about a bad processing of strings, because it does not
take into account the length field when processing the star (*) part.
Have it handle this case as well.
Link: https://lore.kernel.org/linux-nfs/238C0E2D-C2A4-4578-ADD2-C565B3B99842@oracle.com/
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Fixes: 9a6944fee6 ("tracing: Add a verifier to check string pointers for trace events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
New feature:
The "func-no-repeats" option in tracefs/options directory. When set
the function tracer will detect if the current function being traced
is the same as the previous one, and instead of recording it, it will
keep track of the number of times that the function is repeated in a row.
And when another function is recorded, it will write a new event that
shows the function that repeated, the number of times it repeated and
the time stamp of when the last repeated function occurred.
Enhancements:
In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no longer
needs to waste ring buffer space.
New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
No longer limit the PIDs of tasks that are recorded for "saved_cmdlines"
to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger
range. This caused the mapping of PIDs to the task names to be dropped
for all tasks with a PID greater than 32768.
Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
Typos, prototype fixes, and removing of duplicate or unused code.
Better management of ftrace_page allocations.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYI/1vBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiL0AP9EemIC5TDh2oihqLRNeUjdTu0ryEoM
HRFqxozSF985twD/bfkt86KQC8rLHwxTbxQZ863bmdaC6cMGFhWiF+H/MAs=
=psYt
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New feature:
- A new "func-no-repeats" option in tracefs/options directory.
When set the function tracer will detect if the current function
being traced is the same as the previous one, and instead of
recording it, it will keep track of the number of times that the
function is repeated in a row. And when another function is
recorded, it will write a new event that shows the function that
repeated, the number of times it repeated and the time stamp of
when the last repeated function occurred.
Enhancements:
- In order to implement the above "func-no-repeats" option, the ring
buffer timestamp can now give the accurate timestamp of the event
as it is being recorded, instead of having to record an absolute
timestamp for all events. This helps the histogram code which no
longer needs to waste ring buffer space.
- New validation logic to make sure all trace events that access
dereferenced pointers do so in a safe way, and will warn otherwise.
Fixes:
- No longer limit the PIDs of tasks that are recorded for
"saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
for a much larger range. This caused the mapping of PIDs to the
task names to be dropped for all tasks with a PID greater than
32768.
- Change trace_clock_global() to never block. This caused a deadlock.
Clean ups:
- Typos, prototype fixes, and removing of duplicate or unused code.
- Better management of ftrace_page allocations"
* tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
tracing: Restructure trace_clock_global() to never block
tracing: Map all PIDs to command lines
ftrace: Reuse the output of the function tracer for func_repeats
tracing: Add "func_no_repeats" option for function tracing
tracing: Unify the logic for function tracing options
tracing: Add method for recording "func_repeats" events
tracing: Add "last_func_repeats" to struct trace_array
tracing: Define new ftrace event "func_repeats"
tracing: Define static void trace_print_time()
ftrace: Simplify the calculation of page number for ftrace_page->records some more
ftrace: Store the order of pages allocated in ftrace_page
tracing: Remove unused argument from "ring_buffer_time_stamp()
tracing: Remove duplicate struct declaration in trace_events.h
tracing: Update create_system_filter() kernel-doc comment
tracing: A minor cleanup for create_system_filter()
kernel: trace: Mundane typo fixes in the file trace_events_filter.c
tracing: Fix various typos in comments
scripts/recordmcount.pl: Make vim and emacs indent the same
scripts/recordmcount.pl: Make indent spacing consistent
tracing: Add a verifier to check string pointers for trace events
...
The default max PID is set by PID_MAX_DEFAULT, and the tracing
infrastructure uses this number to map PIDs to the comm names of the
tasks, such output of the trace can show names from the recorded PIDs in
the ring buffer. This mapping is also exported to user space via the
"saved_cmdlines" file in the tracefs directory.
But currently the mapping expects the PIDs to be less than
PID_MAX_DEFAULT, which is the default maximum and not the real maximum.
Recently, systemd will increases the maximum value of a PID on the system,
and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, its
comm is not recorded. This leads to the entire trace to have "<...>" as
the comm name, which is pretty useless.
Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead
of just mapping the index to the comm, map a mask of the PID
(PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the
map_cmdline_to_pid array (that already exists).
This bug goes back to the beginning of ftrace, but hasn't been an issue
until user space started increasing the maximum value of PIDs.
Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: bc0c38d139 ("ftrace: latency tracer infrastructure")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Core changes:
- Provide IRQF_NO_AUTOEN as a flag for request*_irq() so drivers can be
cleaned up which either use a seperate mechanism to prevent auto-enable
at request time or have a racy mechanism which disables the interrupt
right after request.
- Get rid of the last usage of irq_create_identity_mapping() and remove
the interface.
- An overhaul of tasklet_disable(). Most usage sites of tasklet_disable()
are in task context and usually in cleanup, teardown code pathes.
tasklet_disable() spinwaits for a tasklet which is currently executed.
That's not only a problem for PREEMPT_RT where this can lead to a live
lock when the disabling task preempts the softirq thread. It's also
problematic in context of virtualization when the vCPU which runs the
tasklet is scheduled out and the disabling code has to spin wait until
it's scheduled back in. Though there are a few code pathes which invoke
tasklet_disable() from non-sleepable context. For these a new disable
variant which still spinwaits is provided which allows to switch
tasklet_disable() to a sleep wait mechanism. For the atomic use cases
this does not solve the live lock issue on PREEMPT_RT. That is mitigated
by blocking on the RT specific softirq lock.
- The PREEMPT_RT specific implementation of softirq processing and
local_bh_disable/enable().
On RT enabled kernels soft interrupt processing happens always in task
context and all interrupt handlers, which are not explicitly marked to
be invoked in hard interrupt context are forced into task context as
well. This allows to protect against softirq processing with a per
CPU lock, which in turn allows to make BH disabled regions preemptible.
Most of the softirq handling code is still shared. The RT/non-RT
specific differences are addressed with a set of inline functions which
provide the context specific functionality. The local_bh_disable() /
local_bh_enable() mechanism are obviously seperate.
- The usual set of small improvements and cleanups
Driver changes:
- New drivers for Nuvoton WPCM450 and DT 79rc3243x interrupt controllers
- Extended functionality for MStar, STM32 and SC7280 irq chips
- Enhanced robustness for ARM GICv3/4.1 drivers
- The usual set of cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGh5wTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZ+/EACWBpQ/2ZHizEw1bzjaDzJrR8U228xu
wNi7nSP92Y07nJ3cCX7a6TJ53mqd0n3RT+DprlsOuqSN0D7Ktr/x44V/aZtm0d3N
GkFOlpeGCRnHusLaUTwk7a8289LuoQ7OhSxIB409n1I4nLI96ZK41D1tYonMYl6E
nxDiGADASfjaciBWbjwJO/mlwmiW/VRpSTxswx0wzakFfbIx9iKyKv1bCJQZ5JK+
lHmf0jxpDIs1EVK/ElJ9Ky6TMBlEmZyiX7n6rujtwJ1W+Jc/uL/y8pLJvGwooVmI
yHTYsLMqzviCbAMhJiB3h1qs3GbCGlM78prgJTnOd0+xEUOCcopCRQlsTXVBq8Nb
OS+HNkYmYXRfiSH6lINJsIok8Xis28bAw/qWz2Ho+8wLq0TI8crK38roD1fPndee
FNJRhsPPOBkscpIldJ0Cr0X5lclkJFiAhAxORPHoseKvQSm7gBMB7H99xeGRffTn
yB3XqeTJMvPNmAHNN4Brv6ey3OjwnEWBgwcnIM2LtbIlRtlmxTYuR+82OPOgEvzk
fSrjFFJqu0LEMLEOXS4pYN824PawjV//UAy4IaG8AodmUUCSGHgw1gTVa4sIf72t
tXY54HqWfRWRpujhVRgsZETqBUtZkL6yvpoe8f6H7P91W5tAfv3oj4ch9RkhUo+Z
b0/u9T0+Fpbg+w==
=id4G
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"The usual updates from the irq departement:
Core changes:
- Provide IRQF_NO_AUTOEN as a flag for request*_irq() so drivers can
be cleaned up which either use a seperate mechanism to prevent
auto-enable at request time or have a racy mechanism which disables
the interrupt right after request.
- Get rid of the last usage of irq_create_identity_mapping() and
remove the interface.
- An overhaul of tasklet_disable().
Most usage sites of tasklet_disable() are in task context and
usually in cleanup, teardown code pathes. tasklet_disable()
spinwaits for a tasklet which is currently executed. That's not
only a problem for PREEMPT_RT where this can lead to a live lock
when the disabling task preempts the softirq thread. It's also
problematic in context of virtualization when the vCPU which runs
the tasklet is scheduled out and the disabling code has to spin
wait until it's scheduled back in.
There are a few code pathes which invoke tasklet_disable() from
non-sleepable context. For these a new disable variant which still
spinwaits is provided which allows to switch tasklet_disable() to a
sleep wait mechanism. For the atomic use cases this does not solve
the live lock issue on PREEMPT_RT. That is mitigated by blocking on
the RT specific softirq lock.
- The PREEMPT_RT specific implementation of softirq processing and
local_bh_disable/enable().
On RT enabled kernels soft interrupt processing happens always in
task context and all interrupt handlers, which are not explicitly
marked to be invoked in hard interrupt context are forced into task
context as well. This allows to protect against softirq processing
with a per CPU lock, which in turn allows to make BH disabled
regions preemptible.
Most of the softirq handling code is still shared. The RT/non-RT
specific differences are addressed with a set of inline functions
which provide the context specific functionality. The
local_bh_disable() / local_bh_enable() mechanism are obviously
seperate.
- The usual set of small improvements and cleanups
Driver changes:
- New drivers for Nuvoton WPCM450 and DT 79rc3243x interrupt
controllers
- Extended functionality for MStar, STM32 and SC7280 irq chips
- Enhanced robustness for ARM GICv3/4.1 drivers
- The usual set of cleanups and improvements all over the place"
* tag 'irq-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
irqchip/xilinx: Expose Kconfig option for Zynq/ZynqMP
irqchip/gic-v3: Do not enable irqs when handling spurious interrups
dt-bindings: interrupt-controller: Add IDT 79RC3243x Interrupt Controller
irqchip: Add support for IDT 79rc3243x interrupt controller
irqdomain: Drop references to recusive irqdomain setup
irqdomain: Get rid of irq_create_strict_mappings()
irqchip/jcore-aic: Kill use of irq_create_strict_mappings()
ARM: PXA: Kill use of irq_create_strict_mappings()
irqchip/gic-v4.1: Disable vSGI upon (GIC CPUIF < v4.1) detection
irqchip/tb10x: Use 'fallthrough' to eliminate a warning
genirq: Reduce irqdebug cacheline bouncing
kernel: Initialize cpumask before parsing
irqchip/wpcm450: Drop COMPILE_TEST
irqchip/irq-mst: Support polarity configuration
irqchip: Add driver for WPCM450 interrupt controller
dt-bindings: interrupt-controller: Add nuvoton, wpcm450-aic
dt-bindings: qcom,pdc: Add compatible for sc7280
irqchip/stm32: Add usart instances exti direct event support
irqchip/gic-v3: Fix OF_BAD_ADDR error handling
irqchip/sifive-plic: Mark two global variables __ro_after_init
...
Pointers in events that are printed are unhashed if the flags allow it,
and the logic to do so is called before processing the event output from
the raw ring buffer. In most cases, this is done when a user reads one of
the trace files.
But if tp_printk is added on the kernel command line, this logic is done
for trace events when they are triggered, and their output goes out via
printk. The unhash logic (and even the validation of the output) did not
support the tp_printk output, and would crash.
Link: https://lore.kernel.org/linux-tegra/9835d9f1-8d3a-3440-c53f-516c2606ad07@nvidia.com/
Fixes: efbbdaa22b ("tracing: Show real address for trace event arguments")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch only provides the implementation of the method.
Later we will used it in a combination with a new option for
function tracing.
Link: https://lkml.kernel.org/r/20210415181854.147448-5-y.karadz@gmail.com
Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The field is used to keep track of the consecutive (on the same CPU) calls
of a single function. This information is needed in order to consolidate
the function tracing record in the cases when a single function is called
number of times.
Link: https://lkml.kernel.org/r/20210415181854.147448-4-y.karadz@gmail.com
Signed-off-by: Yordan Karadzhov (VMware) <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.
[ 148.133411][ T5509] =====================================================
[ 148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
[ 148.137819][ T5509]
[ 148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
[ 148.140768][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.142298][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.143823][ T5509] =====================================================
Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
Needed to merge trace/ftrace/urgent to get:
Commit 59300b36f8 ("ftrace: Check if pages were allocated before calling free_pages()")
To clean up the code that is affected by it as well.
Commit cbc3b92ce0 fixed an issue to modify the macros of the stack trace
event so that user space could parse it properly. Originally the stack
trace format to user space showed that the called stack was a dynamic
array. But it is not actually a dynamic array, in the way that other
dynamic event arrays worked, and this broke user space parsing for it. The
update was to make the array look to have 8 entries in it. Helper
functions were added to make it parse it correctly, as the stack was
dynamic, but was determined by the size of the event stored.
Although this fixed user space on how it read the event, it changed the
internal structure used for the stack trace event. It changed the array
size from [0] to [8] (added 8 entries). This increased the size of the
stack trace event by 8 words. The size reserved on the ring buffer was the
size of the stack trace event plus the number of stack entries found in
the stack trace. That commit caused the amount to be 8 more than what was
needed because it did not expect the caller field to have any size. This
produced 8 entries of garbage (and reading random data) from the stack
trace event:
<idle>-0 [002] d... 1976396.837549: <stack trace>
=> trace_event_raw_event_sched_switch
=> __traceiter_sched_switch
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> secondary_startup_64_no_verify
=> 0xc8c5e150ffff93de
=> 0xffff93de
=> 0
=> 0
=> 0xc8c5e17800000000
=> 0x1f30affff93de
=> 0x00000004
=> 0x200000000
Instead, subtract the size of the caller field from the size of the event
to make sure that only the amount needed to store the stack trace is
reserved.
Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/
Cc: stable@vger.kernel.org
Fixes: cbc3b92ce0 ("tracing: Set kernel_stack's caller size properly")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It is a common mistake for someone writing a trace event to save a pointer
to a string in the TP_fast_assign() and then display that string pointer
in the TP_printk() with %s. The problem is that those two events may happen
a long time apart, where the source of the string may no longer exist.
The proper way to handle displaying any string that is not guaranteed to be
in the kernel core rodata section, is to copy it into the ring buffer via
the __string(), __assign_str() and __get_str() helper macros.
Add a check at run time while displaying the TP_printk() of events to make
sure that every %s referenced is safe to dereference, and if it is not,
trigger a warning and only show the address of the pointer, and the
dereferenced string if it can be safely retrieved with a
strncpy_from_kernel_nofault() call.
In order to not have to copy the parsing of vsnprintf() formats, or even
exporting its code, the verifier relies on vsnprintf() being able to
modify the va_list that is passed to it, and it remains modified after it
is called. This is the case for some architectures like x86_64, but other
architectures like x86_32 pass the va_list to vsnprintf() as a value not a
reference, and the verifier can not use it to parse the non string
arguments. Thus, at boot up, it is checked if vsnprintf() modifies the
passed in va_list or not, and a static branch will disable the verifier if
it's not compatible.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a tracing_event_time_stamp() API that checks if the event passed in is
not on the ring buffer but a pointer to the per CPU trace_buffered_event
which does not have its time stamp set yet.
If it is a pointer to the trace_buffered_event, then just return the
current time stamp that the ring buffer would produce.
Otherwise, return the time stamp from the event.
Link: https://lkml.kernel.org/r/20210316164114.131996180@goodmis.org
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the trace histograms relies on it using absolute time stamps to
trigger the tracing to not use the temp buffer if filters are set. That's
because the histograms need the full timestamp that is saved in the ring
buffer. That is no longer the case, as the ring_buffer_event_time_stamp()
can now return the time stamp for all events without all triggering a full
absolute time stamp.
Now that the absolute time stamp is an unrelated dependency to not using
the filters. There's nothing about having absolute timestamps to keep from
using the filter buffer. Instead, change the interface to explicitly state
to disable filter buffering that the histogram logic can use.
Link: https://lkml.kernel.org/r/20210316164113.847886563@goodmis.org
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ring_buffer_event_time_stamp() is going to be updated to extract the
time stamp for the event without needing it to be set to have absolute
values for all events. But to do so, it needs the buffer that the event is
on as the buffer saves information for the event before it is committed to
the buffer.
If the trace buffer is disabled, a temporary buffer is used, and there's
no access to this buffer from the current histogram triggers, even though
it is passed to the trace event code.
Pass the buffer that the event is on all the way down to the histogram
triggers.
Link: https://lkml.kernel.org/r/20210316164113.542448131@goodmis.org
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If tracing is disabled for some reason (traceoff_on_warning, command line,
etc), the ftrace selftests are guaranteed to fail, as their results are
defined by trace data in the ring buffers. If the ring buffers are turned
off, the tests will fail, due to lack of data.
Because tracing being disabled is for a specific reason (warning, user
decided to, etc), it does not make sense to enable tracing to run the self
tests, as the test output may corrupt the reason for the tracing to be
disabled.
Instead, simply skip the self tests and report that they are being skipped
due to tracing being disabled.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Update to the way irqs and preemption is tracked via the trace event PC field
- Fix handling of unregistering event failing due to allocate memory.
This is only triggered by failure injection, as it is pretty much guaranteed
to have less than a page allocation succeed.
- Do not show the useless "filter" or "enable" files for the "ftrace" trace
system, as they have no effect on doing anything.
- Add a warning if kprobes are registered more than once.
- Synthetic events now have their fields parsed by semicolons.
Old formats without semicolons will still work, but new features will
require them.
- New option to allow trace events to show %p without hashing in trace file.
The trace file can only be read by root, and reading the raw event buffer
did not have any pointers hashed, so this does not expose anything new.
- New directory in tools called tools/tracing, where a new tool that reads
sequential latency reports from the ftrace latency tracers.
- Other minor fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYDL2wBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qti6AP0RUcSU5U1onx8DcwPQLC5Xr3CPqJkm
RvKeJDdgFP+sVgEAiMTFsy2UMc0gmlHZMFd5nZLSiJCu1I2hHmhS5yKbHgY=
=fD9+
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Update to the way irqs and preemption is tracked via the trace event
PC field
- Fix handling of unregistering event failing due to allocate memory.
This is only triggered by failure injection, as it is pretty much
guaranteed to have less than a page allocation succeed.
- Do not show the useless "filter" or "enable" files for the "ftrace"
trace system, as they have no effect on doing anything.
- Add a warning if kprobes are registered more than once.
- Synthetic events now have their fields parsed by semicolons. Old
formats without semicolons will still work, but new features will
require them.
- New option to allow trace events to show %p without hashing in trace
file. The trace file can only be read by root, and reading the raw
event buffer did not have any pointers hashed, so this does not
expose anything new.
- New directory in tools called tools/tracing, where a new tool that
reads sequential latency reports from the ftrace latency tracers.
- Other minor fixes and cleanups.
* tag 'trace-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
kprobes: Fix to delay the kprobes jump optimization
tracing/tools: Add the latency-collector to tools directory
tracing: Make hash-ptr option default
tracing: Add ptr-hash option to show the hashed pointer value
tracing: Update the stage 3 of trace event macro comment
tracing: Show real address for trace event arguments
selftests/ftrace: Add '!event' synthetic event syntax check
selftests/ftrace: Update synthetic event syntax errors
tracing: Add a backward-compatibility check for synthetic event creation
tracing: Update synth command errors
tracing: Rework synthetic event command parsing
tracing/dynevent: Delegate parsing to create function
kprobes: Warn if the kprobe is reregistered
ftrace: Remove unused ftrace_force_update()
tracepoints: Code clean up
tracepoints: Do not punish non static call users
tracepoints: Remove unnecessary "data_args" macro parameter
tracing: Do not create "enable" or "filter" files for ftrace event subsystem
kernel: trace: preemptirq_delay_test: add cpu affinity
tracepoint: Do not fail unregistering a probe due to memory failure
...
Since the original behavior of the trace events is to hash the %p pointers,
make that the default, and have developers have to enable the option in
order to have them unhashed.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add tracefs/options/hash-ptr option to show hashed pointer
value by %p in event printk format string.
For the security reason, normal printk will show the hashed
pointer value (encrypted by random number) with %p to printk
buffer to hide the real address. But the tracefs/trace always
shows real address for debug. To bridge those outputs, add an
option to switch the output format. Ftrace users can use it
to find the hashed value corresponding to the real address
in trace log.
Link: https://lkml.kernel.org/r/160277372504.29307.14909828808982012211.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To help debugging kernel, show real address for trace event arguments
in tracefs/trace{,pipe} instead of hashed pointer value.
Since ftrace human-readable format uses vsprintf(), all %p are
translated to hash values instead of pointer address.
However, when debugging the kernel, raw address value gives a
hint when comparing with the memory mapping in the kernel.
(Those are sometimes used with crash log, which is not hashed too)
So converting %p with %px when calling trace_seq_printf().
Moreover, this is not improving the security because the tracefs
can be used only by root user and the raw address values are readable
from tracefs/percpu/cpu*/trace_pipe_raw file.
Link: https://lkml.kernel.org/r/160277370703.29307.5134475491761971203.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When filters are used by trace events, a page is allocated on each CPU and
used to copy the trace event fields to this page before writing to the ring
buffer. The reason to use the filter and not write directly into the ring
buffer is because a filter may discard the event and there's more overhead
on discarding from the ring buffer than the extra copy.
The problem here is that there is no check against the size being allocated
when using this page. If an event asks for more than a page size while being
filtered, it will get only a page, leading to the caller writing more that
what was allocated.
Check the length of the request, and if it is more than PAGE_SIZE minus the
header default back to allocating from the ring buffer directly. The ring
buffer may reject the event if its too big anyway, but it wont overflow.
Link: https://lore.kernel.org/ath10k/1612839593-2308-1-git-send-email-wgong@codeaurora.org/
Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Reported-by: Wen Gong <wgong@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Delegate command parsing to each create function so that the
command syntax can be customized.
This requires changes to the kprobe/uprobe/synthetic event handling,
which are also included here.
Link: https://lkml.kernel.org/r/e488726f49cbdbc01568618f8680584306c4c79f.1612208610.git.zanussi@kernel.org
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ zanussi@kernel.org: added synthetic event modifications ]
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add description for trace_array_put() parameter.
kernel/trace/trace.c:464: warning: Function parameter or member 'this_tr' not described in 'trace_array_put'
Link: https://lkml.kernel.org/r/20210112111202.23508-1-huobean@gmail.com
Signed-off-by: Bean Huo <beanhuo@micron.com>
[ Merged as one of the original fixes was already fixed by someone else ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
PREEMPT_RT does not report "serving softirq" because the tracing core
looks at the preemption counter while PREEMPT_RT does not update it
while processing softirqs in order to remain preemptible. The
information is stored somewhere else.
The in_serving_softirq() macro and the SOFTIRQ_OFFSET define are still
working but not on the preempt-counter.
Use in_serving_softirq() macro which works on PREEMPT_RT. On !PREEMPT_RT
the compiler (gcc-10 / clang-11) is smart enough to optimize the
in_serving_softirq() related read of the preemption counter away.
The only difference I noticed by using in_serving_softirq() on
!PREEMPT_RT is that gcc-10 implemented tracing_gen_ctx_flags() as
reading FLAG, jmp _tracing_gen_ctx_flags(). Without in_serving_softirq()
it inlined _tracing_gen_ctx_flags() into tracing_gen_ctx_flags().
Link: https://lkml.kernel.org/r/20210125194511.3924915-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The state of the interrupts (irqflags) and the preemption counter are
both passed down to tracing_generic_entry_update(). Only one bit of
irqflags is actually required: The on/off state. The complete 32bit
of the preemption counter isn't needed. Just whether of the upper bits
(softirq, hardirq and NMI) are set and the preemption depth is needed.
The irqflags and the preemption counter could be evaluated early and the
information stored in an integer `trace_ctx'.
tracing_generic_entry_update() would use the upper bits as the
TRACE_FLAG_* and the lower 8bit as the disabled-preemption depth
(considering that one must be substracted from the counter in one
special cases).
The actual preemption value is not used except for the tracing record.
The `irqflags' variable is mostly used only for the tracing record. An
exception here is for instance wakeup_tracer_call() or
probe_wakeup_sched_switch() which explicilty disable interrupts and use
that `irqflags' to save (and restore) the IRQ state and to record the
state.
Struct trace_event_buffer has also the `pc' and flags' members which can
be replaced with `trace_ctx' since their actual value is not used
outside of trace recording.
This will reduce tracing_generic_entry_update() to simply assign values
to struct trace_entry. The evaluation of the TRACE_FLAG_* bits is moved
to _tracing_gen_ctx_flags() which replaces preempt_count() and
local_save_flags() invocations.
As an example, ftrace_syscall_enter() may invoke:
- trace_buffer_lock_reserve() -> … -> tracing_generic_entry_update()
- event_trigger_unlock_commit()
-> ftrace_trace_stack() -> … -> tracing_generic_entry_update()
-> ftrace_trace_userstack() -> … -> tracing_generic_entry_update()
In this case the TRACE_FLAG_* bits were evaluated three times. By using
the `trace_ctx' they are evaluated once and assigned three times.
A build with all tracers enabled on x86-64 with and without the patch:
text data bss dec hex filename
21970669 17084168 7639260 46694097 2c87ed1 vmlinux.old
21970293 17084168 7639260 46693721 2c87d59 vmlinux.new
text shrank by 379 bytes, data remained constant.
Link: https://lkml.kernel.org/r/20210125194511.3924915-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The major update to this release is that there's a new arch config option called:
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS. Currently, only x86_64 enables it.
All the ftrace callbacks now take a struct ftrace_regs instead of a struct
pt_regs. If the architecture has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then
the ftrace_regs will have enough information to read the arguments of the
function being traced, as well as access to the stack pointer. This way, if
a user (like live kernel patching) only cares about the arguments, then it
can avoid using the heavier weight "regs" callback, that puts in enough
information in the struct ftrace_regs to simulate a breakpoint exception
(needed for kprobes).
New config option that audits the timestamps of the ftrace ring buffer at
most every event recorded. The "check_buffer()" calls will conflict with
mainline, because I purposely added the check without including the fix that
it caught, which is in mainline. Running a kernel built from the commit of
the added check will trigger it.
Ftrace recursion protection has been cleaned up to move the protection to
the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace to do
it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that lists all
the places that triggered the recursion protection of the function tracer.
This will show where things need to be fixed as recursion slows down the
function tracer.
The eval enum mapping updates done at boot up are now offloaded to a work
queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX9uq8xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtrwAQCHevqWMjKc1Q76bnCgwB0AbFKB6vqy
5b6g/co5+ihv8wD/eJPWlZMAt97zTVW7bdp5qj/GTiCDbAsODMZ597LsxA0=
=rZEz
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The major update to this release is that there's a new arch config
option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.
Currently, only x86_64 enables it. All the ftrace callbacks now take a
struct ftrace_regs instead of a struct pt_regs. If the architecture
has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
have enough information to read the arguments of the function being
traced, as well as access to the stack pointer.
This way, if a user (like live kernel patching) only cares about the
arguments, then it can avoid using the heavier weight "regs" callback,
that puts in enough information in the struct ftrace_regs to simulate
a breakpoint exception (needed for kprobes).
A new config option that audits the timestamps of the ftrace ring
buffer at most every event recorded.
Ftrace recursion protection has been cleaned up to move the protection
to the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace
to do it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that
lists all the places that triggered the recursion protection of the
function tracer. This will show where things need to be fixed as
recursion slows down the function tracer.
The eval enum mapping updates done at boot up are now offloaded to a
work queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes"
* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Offload eval map updates to a work queue
Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
ring-buffer: Add rb_check_bpage in __rb_allocate_pages
ring-buffer: Fix two typos in comments
tracing: Drop unneeded assignment in ring_buffer_resize()
tracing: Disable ftrace selftests when any tracer is running
seq_buf: Avoid type mismatch for seq_buf_init
ring-buffer: Fix a typo in function description
ring-buffer: Remove obsolete rb_event_is_commit()
ring-buffer: Add test to validate the time stamp deltas
ftrace/documentation: Fix RST C code blocks
tracing: Clean up after filter logic rewriting
tracing: Remove the useless value assignment in test_create_synth_event()
livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
MAINTAINERS: assign ./fs/tracefs to TRACING
tracing: Fix some typos in comments
ftrace: Remove unused varible 'ret'
ring-buffer: Add recording of ring buffer recursion into recursed_functions
...
In order for tracepoints to export their enums to user space, the use of the
TRACE_DEFINE_ENUM() macro is used. On boot up, the strings shown in the
tracefs "print fmt" lines are processed, and all the enums registered by
TRACE_DEFINE_ENUM are replaced with the interger value. This way, userspace
tools that read the raw binary data, knows how to evaluate the raw events.
This is currently done in an initcall, but it has been noticed that slow
embedded boards that have tracing may take a few seconds to process them
all, and a few seconds slow down on an embedded device is detrimental to the
system.
Instead, offload the work to a work queue and make sure that its finished by
destroying the work queue (which flushes all work) in a late initcall. This
will allow the system to continue to boot and run the updates in the
background, and this speeds up the boot time. Note, the strings being
updated are only used by user space, so finishing the process before the
system is fully booted will prevent any race issues.
Link: https://lore.kernel.org/r/68d7b3327052757d0cd6359a6c9015a85b437232.camel@pengutronix.de
Reported-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the instances were able to use their own options, the userstacktrace
option was left hardcoded for the top level. This made the instance
userstacktrace option bascially into a nop, and will confuse users that set
it, but nothing happens (I was confused when it happened to me!)
Cc: stable@vger.kernel.org
Fixes: 16270145ce ("tracing: Add trace options for core options to instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With 5.9 kernel on ARM64, I found ftrace_dump output was broken but
it had no problem with normal output "cat /sys/kernel/debug/tracing/trace".
With investigation, it seems coping the data into temporal buffer seems to
break the align binary printf expects if the static buffer is not aligned
with 4-byte. IIUC, get_arg in bstr_printf expects that args has already
right align to be decoded and seq_buf_bprintf says ``the arguments are saved
in a 32bit word array that is defined by the format string constraints``.
So if we don't keep the align under copy to temporal buffer, the output
will be broken by shifting some bytes.
This patch fixes it.
Link: https://lkml.kernel.org/r/20201125225654.1618966-1-minchan@kernel.org
Cc: <stable@vger.kernel.org>
Fixes: 8e99cf91b9 ("tracing: Do not allocate buffer in trace_find_next_entry() in atomic")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The array size is FTRACE_KSTACK_NESTING, so the index FTRACE_KSTACK_NESTING
is illegal too. And fix two typos by the way.
Link: https://lkml.kernel.org/r/20201031085714.2147-1-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The nesting count of trace_printk allows for 4 levels of nesting. The
nesting counter starts at zero and is incremented before being used to
retrieve the current context's buffer. But the index to the buffer uses the
nesting counter after it was incremented, and not its original number,
which in needs to do.
Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org
Fixes: 3d9622c12c ("tracing: Add barrier to trace_printk() buffer nesting modification")
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Add support for "bool" type in synthetic events
- Add per instance tracing for bootconfig
- Support perf-style return probe ("SYMBOL%return") in kprobes and uprobes
- Allow for kprobes to be enabled earlier in boot up
- Added tracepoint helper function to allow testing if tracepoints are
enabled in headers
- Synthetic events can now have dynamic strings (variable length)
- Various fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX4iMDRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrMPAP0UAfOeQcYxBAw9y8oX7oJnBBylLFTR
CICOVEhBYC/xIQD/edVPEUt77ozM/Bplwv8BiO4QxFjgZFqtpZI8mskIfAo=
=sbny
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Updates for tracing and bootconfig:
- Add support for "bool" type in synthetic events
- Add per instance tracing for bootconfig
- Support perf-style return probe ("SYMBOL%return") in kprobes and
uprobes
- Allow for kprobes to be enabled earlier in boot up
- Added tracepoint helper function to allow testing if tracepoints
are enabled in headers
- Synthetic events can now have dynamic strings (variable length)
- Various fixes and cleanups"
* tag 'trace-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (58 commits)
tracing: support "bool" type in synthetic trace events
selftests/ftrace: Add test case for synthetic event syntax errors
tracing: Handle synthetic event array field type checking correctly
selftests/ftrace: Change synthetic event name for inter-event-combined test
tracing: Add synthetic event error logging
tracing: Check that the synthetic event and field names are legal
tracing: Move is_good_name() from trace_probe.h to trace.h
tracing: Don't show dynamic string internals in synthetic event description
tracing: Fix some typos in comments
tracing/boot: Add ftrace.instance.*.alloc_snapshot option
tracing: Fix race in trace_open and buffer resize call
tracing: Check return value of __create_val_fields() before using its result
tracing: Fix synthetic print fmt check for use of __get_str()
tracing: Remove a pointless assignment
ftrace: ftrace_global_list is renamed to ftrace_ops_list
ftrace: Format variable declarations of ftrace_allocate_records
ftrace: Simplify the calculation of page number for ftrace_page->records
ftrace: Simplify the dyn_ftrace->flags macro
ftrace: Simplify the hash calculation
ftrace: Use fls() to get the bits for dup_hash()
...
Here is the big set of char, misc, and other assorted driver subsystem
patches for 5.10-rc1.
There's a lot of different things in here, all over the drivers/
directory. Some summaries:
- soundwire driver updates
- habanalabs driver updates
- extcon driver updates
- nitro_enclaves new driver
- fsl-mc driver and core updates
- mhi core and bus updates
- nvmem driver updates
- eeprom driver updates
- binder driver updates and fixes
- vbox minor bugfixes
- fsi driver updates
- w1 driver updates
- coresight driver updates
- interconnect driver updates
- misc driver updates
- other minor driver updates
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX4g8YQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yngKgCeNpArCP/9vQJRK9upnDm8ZLunSCUAn1wUT/2A
/bTQ42c/WRQ+LU828GSM
=6sO2
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of char, misc, and other assorted driver subsystem
patches for 5.10-rc1.
There's a lot of different things in here, all over the drivers/
directory. Some summaries:
- soundwire driver updates
- habanalabs driver updates
- extcon driver updates
- nitro_enclaves new driver
- fsl-mc driver and core updates
- mhi core and bus updates
- nvmem driver updates
- eeprom driver updates
- binder driver updates and fixes
- vbox minor bugfixes
- fsi driver updates
- w1 driver updates
- coresight driver updates
- interconnect driver updates
- misc driver updates
- other minor driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
binder: fix UAF when releasing todo list
docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
misc: Kconfig: fix a HISI_HIKEY_USB dependency
LSM: Fix type of id parameter in kernel_post_load_data prototype
misc: Kconfig: add a new dependency for HISI_HIKEY_USB
firmware_loader: fix a kernel-doc markup
w1: w1_therm: make w1_poll_completion static
binder: simplify the return expression of binder_mmap
test_firmware: Test partial read support
firmware: Add request_partial_firmware_into_buf()
firmware: Store opt_flags in fw_priv
fs/kernel_file_read: Add "offset" arg for partial reads
IMA: Add support for file reads without contents
LSM: Add "contents" flag to kernel_read_file hook
module: Call security_kernel_post_load_data()
firmware_loader: Use security_post_load_data()
LSM: Introduce kernel_post_load_data() hook
fs/kernel_read_file: Add file_size output argument
fs/kernel_read_file: Switch buffer size arg to size_t
fs/kernel_read_file: Remove redundant size argument
...
applied to indirect function calls. Remove a data load (indirection) by
modifying the text.
They give the flexibility of function pointers, but with better
performance. (This is especially important for cases where
retpolines would otherwise be used, as retpolines can be pretty
slow.)
API overview:
DECLARE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL_NULL(name, typename);
static_call(name)(args...);
static_call_cond(name)(args...);
static_call_update(name, func);
x86 is supported via text patching, otherwise basic indirect calls are used,
with function pointers.
There's a second variant using inline code patching, inspired by jump-labels,
implemented on x86 as well.
The new APIs are utilized in the x86 perf code, a heavy user of function pointers,
where static calls speed up the PMU handler by 4.2% (!).
The generic implementation is not really excercised on other architectures,
outside of the trivial test_static_call_init() self-test.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EfAQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iEAw//divHeVCJnHhV+YBbuI9ROUsERkzu8VhK
O1DEmW68Fvj7pszT8NZsMjtkt97ZtxDRK7aCJiiup0eItG9qCJ8lpCLb84ZbizHV
HhCbhBLrpxSvTrWlQnkgP1OkPAbtoryIjVlZzWhjye2MY8UEbVnZWyviBolbAAxH
Fk1Yi56fIMu19GO+9Ohzy9E2VDnVEH1iMx5YWoLD2H88Qbq/yEMP+U2tIj8hIVKT
Y/jdogihNXRIau6QB+YPfDPisdty+RHxfU7zct4Rv8cFF5ylglZB5fD34C3sUQF2
WqsaYz7zjUj9f02F8pw8hIaAT7InzArPhlNVITxf2oMfmdrNqBptnSCddZqCJLvv
oDGew21k50Zcbqkv9amclpxXH5tTpRvJeqit2pz/85GMeqBRuhzHUAkCpht5YA73
qJsHWS3z+qIxKi0tDbhDJswuwa51q5sgdUUwo1uCr3wT3DGDlqNhCAZBzX14dcty
0shDSbv13TCwqAcb7asPzEoPwE15cwa+x+viGEIL901pyZKyQYjs/abDU26It3BW
roWRkuVJZ9/QMdZJs1v7kaXw1L8YiKIDkBgke+xbfrDwEvvjudQkl2LUL66DB11j
RJU3GyxKClvdY06SSRh/H13fqZLNKh1JZ0nPEWSTJECDFN9zcDjrDrod/7PFOcpY
NAlawLoGG+s=
=JvpF
-----END PGP SIGNATURE-----
Merge tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull static call support from Ingo Molnar:
"This introduces static_call(), which is the idea of static_branch()
applied to indirect function calls. Remove a data load (indirection)
by modifying the text.
They give the flexibility of function pointers, but with better
performance. (This is especially important for cases where retpolines
would otherwise be used, as retpolines can be pretty slow.)
API overview:
DECLARE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL_NULL(name, typename);
static_call(name)(args...);
static_call_cond(name)(args...);
static_call_update(name, func);
x86 is supported via text patching, otherwise basic indirect calls are
used, with function pointers.
There's a second variant using inline code patching, inspired by
jump-labels, implemented on x86 as well.
The new APIs are utilized in the x86 perf code, a heavy user of
function pointers, where static calls speed up the PMU handler by
4.2% (!).
The generic implementation is not really excercised on other
architectures, outside of the trivial test_static_call_init()
self-test"
* tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
static_call: Fix return type of static_call_init
tracepoint: Fix out of sync data passing by static caller
tracepoint: Fix overly long tracepoint names
x86/perf, static_call: Optimize x86_pmu methods
tracepoint: Optimize using static_call()
static_call: Allow early init
static_call: Add some validation
static_call: Handle tail-calls
static_call: Add static_call_cond()
x86/alternatives: Teach text_poke_bp() to emulate RET
static_call: Add simple self-test for static calls
x86/static_call: Add inline static call implementation for x86-64
x86/static_call: Add out-of-line static call implementation
static_call: Avoid kprobes on inline static_call()s
static_call: Add inline static call infrastructure
static_call: Add basic static call infrastructure
compiler.h: Make __ADDRESSABLE() symbol truly unique
jump_label,module: Fix module lifetime for __jump_label_mod_text_reserved()
module: Properly propagate MODULE_STATE_COMING failure
module: Fix up module_notifier return values
...
The variable 'len' has been assigned a value but is not used after that.
So, remove the assignement.
Link: https://lkml.kernel.org/r/20200930184303.22896-1-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>