Commit Graph

26253 Commits

Author SHA1 Message Date
Arnd Bergmann
df27067e60 pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()
__getnstimeofday() is a rather odd interface, with a number of quirks:

- The caller may come from NMI context, but the implementation is not NMI safe,
  one way to get there from NMI is

      NMI handler:
        something bad
          panic()
            kmsg_dump()
              pstore_dump()
                 pstore_record_init()
                   __getnstimeofday()

- The calling conventions are different from any other timekeeping functions,
  to deal with returning an error code during suspended timekeeping.

Address the above issues by using a completely different method to get the
time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
when timekeeping is suspended: it returns the time at which it got
suspended. As Thomas Gleixner explained, this is safe, as
ktime_get_real_fast_ns() does not call into the clocksource driver that
might be suspended.

The result can easily be transformed into a timespec structure. Since
ktime_get_real_fast_ns() was not exported to modules, add the export.

The pstore behavior for the suspended case changes slightly, as it now
stores the timestamp at which timekeeping was suspended instead of storing
a zero timestamp.

This change is not addressing y2038-safety, that's subject to a more
complex follow up patch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Colin Cross <ccross@android.com>
Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
2017-11-12 15:05:52 +01:00
Thomas Gleixner
d00a08cf9e irq/work: Use llist_for_each_entry_safe
The llist_for_each_entry() loop in irq_work_run_list() is unsafe because
once the works PENDING bit is cleared it can be requeued on another CPU.

Use llist_for_each_entry_safe() instead.

Fixes: 16c0890dc6 ("irq/work: Don't reinvent the wheel but use existing llist API")
Reported-by:Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petri Latvala <petri.latvala@intel.com>
Link: http://lkml.kernel.org/r/151027307351.14762.4611888896020658384@mail.alporthouse.com
2017-11-12 13:15:14 +01:00
David S. Miller
f3edacbd69 bpf: Revert bpf_overrid_function() helper changes.
NACK'd by x86 maintainer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 18:24:55 +09:00
Josef Bacik
dd0bb688ea bpf: add a bpf_override_function helper
Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with it's kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_funciton
helper.  This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 12:18:05 +09:00
Shaohua Li
e10237cc76 kthread: zero the kthread data structure
kthread() could bail out early before we initialize blkcg_css (if the
kthread is killed very early. Please see xchg() statement in kthread()),
which confuses free_kthread_struct. Instead of moving the blkcg_css
initialization early, we simply zero the whole 'self' data structure,
which doesn't sound much overhead.

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 05e3db95eb ("kthread: add a mechanism to store cgroup info")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Jens Axboe
a6da0024ff blktrace: fix unlocked registration of tracepoints
We need to ensure that tracepoints are registered and unregistered
with the users of them. The existing atomic count isn't enough for
that. Add a lock around the tracepoints, so we serialize access
to them.

This fixes cases where we have multiple users setting up and
tearing down tracepoints, like this:

CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted
4.14.0-rc5-next-20171018+ #36
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:16 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:52
  panic+0x1e4/0x41c kernel/panic.c:183
  __warn+0x1c4/0x1e0 kernel/panic.c:546
  report_bug+0x211/0x2d0 lib/bug.c:183
  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
  do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
  do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline]
RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283
RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293
RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07
RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818
RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004
R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0
R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0
  tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304
  register_trace_block_rq_insert include/trace/events/block.h:191 [inline]
  blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043
  do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542
  blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564
  sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089
  vfs_ioctl fs/ioctl.c:45 [inline]
  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
  SYSC_ioctl fs/ioctl.c:700 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
  entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x444339
RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339
RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009
RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff
R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000

since we can now run these in parallel. Ensure that the exported helpers
for doing this are grabbing the queue trace mutex.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Jens Axboe
1f2cac107c blktrace: fix unlocked access to init/start-stop/teardown
sg.c calls into the blktrace functions without holding the proper queue
mutex for doing setup, start/stop, or teardown.

Add internal unlocked variants, and export the ones that do the proper
locking.

Fixes: 6da127ad09 ("blktrace: Add blktrace ioctls to SCSI generic devices")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Richard Guy Briggs
42d5e37654 audit: filter PATH records keyed on filesystem magic
Tracefs or debugfs were causing hundreds to thousands of PATH records to
be associated with the init_module and finit_module SYSCALL records on a
few modules when the following rule was in place for startup:
	-a always,exit -F arch=x86_64 -S init_module -F key=mod-load

Provide a method to ignore these large number of PATH records from
overwhelming the logs if they are not of interest.  Introduce a new
filter list "AUDIT_FILTER_FS", with a new field type AUDIT_FSTYPE,
which keys off the filesystem 4-octet hexadecimal magic identifier to
filter specific filesystem PATH records.

An example rule would look like:
	-a never,filesystem -F fstype=0x74726163 -F key=ignore_tracefs
	-a never,filesystem -F fstype=0x64626720 -F key=ignore_debugfs

Arguably the better way to address this issue is to disable tracefs and
debugfs on boot from production systems.

See: https://github.com/linux-audit/audit-kernel/issues/16
See: https://github.com/linux-audit/audit-userspace/issues/8
Test case: https://github.com/linux-audit/audit-testsuite/issues/42

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: fixed the whitespace damage in kernel/auditsc.c]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:08:56 -05:00
Casey Schaufler
f7b53637c0 Audit: remove unused audit_log_secctx function
The function audit_log_secctx() is unused in the upstream kernel.
All it does is wrap another function that doesn't need wrapping.
It claims to give you the SELinux context, but that is not true if
you are using a different security module.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:08:47 -05:00
Steve Grubb
33e8a90780 audit: Allow auditd to set pid to 0 to end auditing
The API to end auditing has historically been for auditd to set the
pid to 0. This patch restores that functionality.

See: https://github.com/linux-audit/audit-kernel/issues/69

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:08:36 -05:00
Paul Moore
5d842a5b77 audit: use audit_set_enabled() in audit_enable()
Use audit_set_enabled() to enable auditing during early boot.  This
obviously won't emit an audit change record, but it will work anyway
and should help prevent in future problems by consolidating the
enable/disable code in one function.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:07:57 -05:00
Paul Moore
b3b4fdf6a8 audit: convert audit_ever_enabled to a boolean
We were treating it as a boolean, let's make it a boolean to help
avoid future mistakes.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:07:54 -05:00
Paul Moore
80ab4df627 audit: don't use simple_strtol() anymore
The simple_strtol() function is deprecated, use kstrtol() instead.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:07:50 -05:00
Paul Moore
be4104abf2 audit: initialize the audit subsystem as early as possible
We can't initialize the audit subsystem until after the network layer
is initialized (core_initcall), but do it soon after.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:07:46 -05:00
Paul Moore
173743dd99 audit: ensure that 'audit=1' actually enables audit for PID 1
Prior to this patch we enabled audit in audit_init(), which is too
late for PID 1 as the standard initcalls are run after the PID 1 task
is forked.  This means that we never allocate an audit_context (see
audit_alloc()) for PID 1 and therefore miss a lot of audit events
generated by PID 1.

This patch enables audit as early as possible to help ensure that when
PID 1 is forked it can allocate an audit_context if required.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10 16:07:39 -05:00
Ingo Molnar
91a6a6cfee Merge branch 'linus' into x86/asm, to resolve conflict
Conflicts:
	arch/x86/mm/mem_encrypt.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-10 08:06:47 +01:00
David S. Miller
4dc6758d78 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Simple cases of overlapping changes in the packet scheduler.

Must easier to resolve this time.

Which probably means that I screwed it up somehow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 10:00:18 +09:00
Linus Torvalds
3fefc31843 Final power management fixes for v4.14
- Prevent the schedutil cpufreq governor from using the
    utilization of a wrong CPU in some cases which started to
    happen after one of the recent changes in it (Chris Redpath).
 
  - Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
    interface as that causes serious issue (related to NVMe) to
    appear on one of these machines, even though the other Dells
    XPS13 9360 in somewhat different HW configurations behave
    correctly (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJaBFMsAAoJEILEb/54YlRxb8sP/jhqgCEiaUvSpFYBLklx1fqK
 QDbbYw3Os2JlAcMFElGjZEoV0+WBbadTHQxIxB353mpcAe1moI/SoDAJEgA9HWib
 tjJIt1lak3nLF967y0pOgAi9uvFCG6Rw+ZyVSrYBnT+eG6fw3h6MdHaOEFeCHJY4
 qUgj+Bm0OtisW9mRFk5zdGam+6MxaE1z9mhjRAebrAjOI+46Oj/LQl7q32S2vWTp
 GJm6lSYOmSruo5vdU7T3dOSkBrkymf4E//1XkegFvcweEX/aITuCAil0Rl4X2PV0
 NSeeCp66skftvlZ/vf3QOem1XLgPdVZ9dr7WRzOZt3CDctXuDaA3uKfRJf3oIMsi
 rnrsmWnfsT4sNaqV6zHYTkMtQfSsxFfjOde/HC7vAtC0oM/EECsonBtIlMUBmYlR
 7fi+wtrE6ErsMgkThOvplG+cAnvWCK8yhmFBAKbcwg9UHwooF0HxXPqVwHE6M8pw
 eFALkJyagx1PTGKXNHOUTHfZcYGlhhJxH6SEhs/pJIxL8rVgey8mJfA9y8eGmEXP
 fKPnEb6yI7/9ZpzFFu3DLiHMifgc6o9F0mRqw3apeQFGtg46frYw51meP5Faq0EZ
 BAb0AXBF4qNB+XvORG0ABJkPqqDaZVPVfdo1SIA06IBr2mpVT1RAetMU0qSuWy+x
 STI04myU/9hHmCSdqzpg
 =F9fg
 -----END PGP SIGNATURE-----

Merge tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull final power management fixes from Rafael Wysocki:
 "These fix a regression in the schedutil cpufreq governor introduced by
  a recent change and blacklist Dell XPS13 9360 from using the Low Power
  S0 Idle _DSM interface which triggers serious problems on one of these
  machines.

  Specifics:

   - Prevent the schedutil cpufreq governor from using the utilization
     of a wrong CPU in some cases which started to happen after one of
     the recent changes in it (Chris Redpath).

   - Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
     interface as that causes serious issue (related to NVMe) to appear
     on one of these machines, even though the other Dells XPS13 9360 in
     somewhat different HW configurations behave correctly (Rafael
     Wysocki)"

* tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
  cpufreq: schedutil: Examine the correct CPU when we update util
2017-11-09 11:16:28 -08:00
Patrick Bellasi
765cc3a4b2 sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds
When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that
all SCHED_FEAT are turned into compile time constants being propagated
to support compiler optimizations.

Specifically, we expect that code blocks like this:

   if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
	/* FEATURE CODE */
   }

are turned into dead-code in case FEATURE_NAME defaults to FALSE, and thus
being removed by the compiler from the finale image.

For this mechanism to properly work it's required for the compiler to
have full access, from each translation unit, to whatever is the value
defined by the sched_feat macro. This macro is defined as:

   #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

and thus, the compiler can optimize that code only if the value of
sysctl_sched_features is visible within each translation unit.

Since:

   029632fbb ("sched: Make separate sched*.c translation units")

the scheduler code has been split into separate translation units
however the definition of sysctl_sched_features is part of
kernel/sched/core.c while, for all the other scheduler modules, it is
visible only via kernel/sched/sched.h as an:

   extern const_debug unsigned int sysctl_sched_features

Unfortunately, an extern reference does not allow the compiler to apply
constants propagation. Thus, on !CONFIG_SCHED_DEBUG kernel we still end up
with code to load a memory reference and (eventually) doing an unconditional
jump of a chunk of code.

This mechanism is unavoidable when sched_features can be turned on and off at
run-time. However, this is not the case for "production" kernels compiled with
!CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
which cannot be changed at run-time and thus memory loads and jumps can be
avoided altogether.

This patch fixes the case of !CONFIG_SCHED_DEBUG kernel by declaring a local version
of the sysctl_sched_features constant for each translation unit. This will
ultimately allow the compiler to perform constants propagation and dead-code
pruning.

Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
the patch, by running 30 iterations of:

   perf bench sched messaging --pipe --thread --group 4 --loop 50000

on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
powersave governor to rule out variations due to frequency scaling.

Statistics on the reported completion time:

                   count     mean       std     min       99%     max
  v4.14-rc8         30.0  15.7831  0.176032  15.442  16.01226  16.014
  v4.14-rc8+patch   30.0  15.5033  0.189681  15.232  15.93938  15.962

... show a 1.8% speedup on average completion time and 0.5% speedup in the
99 percentile.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-09 07:35:08 +01:00
Rafael J. Wysocki
e029b9bf12 Merge branch 'pm-cpufreq-sched'
* pm-cpufreq-sched:
  cpufreq: schedutil: Examine the correct CPU when we update util
2017-11-09 00:07:56 +01:00
Viresh Kumar
07458f6a51 cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.

Consider this case for example:

 - policy->cur is 1.2 GHz (Max)
 - New request comes for 780 MHz and we store that in cached_raw_freq.
 - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
 - We then see the CPU wasn't idle recently and choose to keep the next
   freq as 1.2 GHz.
 - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
   1.2 GHz.
 - Now if the utilization doesn't change in then next request, then the
   next target frequency will still be 780 MHz and it will match with
   cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.

Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08 23:59:33 +01:00
Rajat Jain
95b982b451 PM / s2idle: Clear the events_check_enabled flag
Problem: This flag does not get cleared currently in the suspend or
resume path in the following cases:

 * In case some driver's suspend routine returns an error.
 * Successful s2idle case
 * etc?

Why is this a problem: What happens is that the next suspend attempt
could fail even though the user did not enable the flag by writing to
/sys/power/wakeup_count. This is 1 use case how the issue can be seen
(but similar use case with driver suspend failure can be thought of):

 1. Read /sys/power/wakeup_count
 2. echo count > /sys/power/wakeup_count
 3. echo freeze > /sys/power/wakeup_count
 4. Let the system suspend, and wakeup the system using some wake source
    that calls pm_wakeup_event() e.g. power button or something.
 5. Note that the combined wakeup count would be incremented due
    to the pm_wakeup_event() in the resume path.
 6. After resuming the events_check_enabled flag is still set.

At this point if the user attempts to freeze again (without writing to
/sys/power/wakeup_count), the suspend would fail even though there has
been no wake event since the past resume.

Address that by clearing the flag just before a resume is completed,
so that it is always cleared for the corner cases mentioned above.

Signed-off-by: Rajat Jain <rajatja@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-08 23:52:02 +01:00
Linus Torvalds
c0f3ea1589 stop using '%pK' for /proc/kallsyms pointer values
Not only is it annoying to have one single flag for all pointers, as if
that was a global choice and all kernel pointers are the same, but %pK
can't get the 'access' vs 'open' time check right anyway.

So make the /proc/kallsyms pointer value code use logic specific to that
particular file.  We do continue to honor kptr_restrict, but the default
(which is unrestricted) is changed to instead take expected users into
account, and restrict access by default.

Right now the only actual expected user is kernel profiling, which has a
separate sysctl flag for kernel profile access.  There may be others.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-08 12:51:04 -08:00
Bruno E. O. Meneguele
fda784e50a module: export module signature enforcement status
A static variable sig_enforce is used as status var to indicate the real
value of CONFIG_MODULE_SIG_FORCE, once this one is set the var will hold
true, but if the CONFIG is not set the status var will hold whatever
value is present in the module.sig_enforce kernel cmdline param: true
when =1 and false when =0 or not present.

Considering this cmdline param take place over the CONFIG value when
it's not set, other places in the kernel could misbehave since they
would have only the CONFIG_MODULE_SIG_FORCE value to rely on. Exporting
this status var allows the kernel to rely in the effective value of
module signature enforcement, being it from CONFIG value or cmdline
param.

Signed-off-by: Bruno E. O. Meneguele <brdeoliv@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2017-11-08 15:16:36 -05:00
Frederic Weisbecker
b04db8e19f rcu: Use lockdep to assert IRQs are disabled/enabled
Lockdep now has an integrated IRQs disabled/enabled sanity check. Just
use it instead of the ad-hoc RCU version.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-15-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:55 +01:00
Frederic Weisbecker
a69682200d timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-13-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:54 +01:00
Frederic Weisbecker
2c11dba00a sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-12-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:53 +01:00
Frederic Weisbecker
3c7169a3bf irq_work: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:52 +01:00
Frederic Weisbecker
a934d4d15f irq/timings: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-10-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:52 +01:00
Frederic Weisbecker
164446455a perf/core: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-9-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:51 +01:00
Frederic Weisbecker
83efcbd028 smp/core: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-7-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:50 +01:00
Frederic Weisbecker
53bef3fd47 timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:49 +01:00
Frederic Weisbecker
ebf3adbad0 timers/nohz: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:49 +01:00
Frederic Weisbecker
8e8eb73075 workqueue: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1509980490-4285-4-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:48 +01:00
Frederic Weisbecker
f71b74bca6 irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:48 +01:00
Ingo Molnar
8a103df440 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 10:17:15 +01:00
Waiman Long
11752adb68 locking/pvqspinlock: Implement hybrid PV queued/unfair locks
Currently, all the lock waiters entering the slowpath will do one
lock stealing attempt to acquire the lock. That helps performance,
especially in VMs with over-committed vCPUs. However, the current
pvqspinlocks still don't perform as good as unfair locks in many cases.
On the other hands, unfair locks do have the problem of lock starvation
that pvqspinlocks don't have.

This patch combines the best attributes of an unfair lock and a
pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
mode. A lock waiter goes into the unfair mode when there are waiters
in the wait queue but the pending bit isn't set. Otherwise, it will
go into the queued mode waiting in the queue for its turn.

On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
(make -j<n>) was done in a VM with unpinned vCPUs 3 times with the
best time selected and <n> is the number of vCPUs available. The build
times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
with various number of vCPUs are as follows:

  vCPUs    pvqlock     hybrid pvqlock    unfair lock
  -----    -------     --------------    -----------
    30      342.1s         329.1s          329.1s
    36      314.1s         305.3s          307.3s
    45      345.0s         302.1s          306.6s
    54      365.4s         308.6s          307.8s
    72      358.9s         293.6s          303.9s
   108      343.0s         285.9s          304.2s

The hybrid pvqspinlock performs better or comparable to the unfair
lock.

By turning on QUEUED_LOCK_STAT, the table below showed the number
of lock acquisitions in unfair mode and queue mode after a kernel
build with various number of vCPUs.

  vCPUs    queued mode  unfair mode
  -----    -----------  -----------
    30      9,130,518      294,954
    36     10,856,614      386,809
    45      8,467,264   11,475,373
    54      6,409,987   19,670,855
    72      4,782,063   25,712,180

It can be seen that as the VM became more and more over-committed,
the ratio of locks acquired in unfair mode increases. This is all
done automatically to get the best overall performance as possible.

Using a kernel locking microbenchmark with number of locking
threads equals to the number of vCPUs available on the same machine,
the minimum, average and maximum (min/avg/max) numbers of locking
operations done per thread in a 5-second testing interval are shown
below:

  vCPUs         hybrid pvqlock             unfair lock
  -----         --------------             -----------
    36     822,135/881,063/950,363    75,570/313,496/  690,465
    54     542,435/581,664/625,937    35,460/204,280/  457,172
    72     397,500/428,177/499,299    17,933/150,679/  708,001
   108     257,898/288,150/340,871     3,085/181,176/1,257,109

It can be seen that the hybrid pvqspinlocks are more fair and
performant than the unfair locks in this test.

The table below shows the kernel build times on a smaller 2-socket
16-core 32-thread E5-2620 v4 system.

  vCPUs    pvqlock     hybrid pvqlock    unfair lock
  -----    -------     --------------    -----------
    16      436.8s         433.4s          435.6s
    36      366.2s         364.8s          364.5s
    48      423.6s         376.3s          370.2s
    64      433.1s         376.6s          376.8s

Again, the performance of the hybrid pvqspinlock was comparable to
that of the unfair lock.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1510089486-3466-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 10:10:04 +01:00
Tom Lendacky
0e4c12b45a x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages
In order for memory pages to be properly mapped when SEV is active, it's
necessary to use the PAGE_KERNEL protection attribute as the base
protection.  This ensures that memory mapping of, e.g. ACPI tables,
receives the proper mapping attributes.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: kvm@vger.kernel.org
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171020143059.3291-11-brijesh.singh@amd.com
2017-11-07 15:35:58 +01:00
Tom Lendacky
1d2e733b13 resource: Provide resource struct in resource walk callback
In preperation for a new function that will need additional resource
information during the resource walk, update the resource walk callback to
pass the resource structure.  Since the current callback start and end
arguments are pulled from the resource structure, the callback functions
can obtain them from the resource structure directly.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com
2017-11-07 15:35:57 +01:00
Tom Lendacky
4ac2aed837 resource: Consolidate resource walking code
The walk_iomem_res_desc(), walk_system_ram_res() and walk_system_ram_range()
functions each have much of the same code.

Create a new function that consolidates the common code from these
functions in one place to reduce the amount of duplicated code.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20171020143059.3291-9-brijesh.singh@amd.com
2017-11-07 15:35:57 +01:00
Cheng Jian
f791dd2589 locking/rwlocks: Fix comments
- fix the list of locking API headers in kernel/locking/spinlock.c
- fix an #endif comment

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: huawei.libin@huawei.com
Cc: xiexiuqi@huawei.com
Link: http://lkml.kernel.org/r/1509706788-152547-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 12:22:21 +01:00
Zhou Chengming
e846d13958 kprobes, x86/alternatives: Use text_mutex to protect smp_alt_modules
We use alternatives_text_reserved() to check if the address is in
the fixed pieces of alternative reserved, but the problem is that
we don't hold the smp_alt mutex when call this function. So the list
traversal may encounter a deleted list_head if another path is doing
alternatives_smp_module_del().

One solution is that we can hold smp_alt mutex before call this
function, but the difficult point is that the callers of this
functions, arch_prepare_kprobe() and arch_prepare_optimized_kprobe(),
are called inside the text_mutex. So we must hold smp_alt mutex
before we go into these arch dependent code. But we can't now,
the smp_alt mutex is the arch dependent part, only x86 has it.
Maybe we can export another arch dependent callback to solve this.

But there is a simpler way to handle this problem. We can reuse the
text_mutex to protect smp_alt_modules instead of using another mutex.
And all the arch dependent checks of kprobes are inside the text_mutex,
so it's safe now.

Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Fixes: 2cfa197 "ftrace/alternatives: Introducing *_text_reserved functions"
Link: http://lkml.kernel.org/r/1509585501-79466-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 12:20:09 +01:00
Ingo Molnar
141d3b1daa Merge branch 'linus' into x86/apic, to resolve conflicts
Conflicts:
	arch/x86/include/asm/x2apic.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:51:10 +01:00
Ingo Molnar
8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Ingo Molnar
15bcdc9477 Merge branch 'linus' into perf/core, to fix conflicts
Conflicts:
	tools/perf/arch/arm/annotate/instructions.c
	tools/perf/arch/arm64/annotate/instructions.c
	tools/perf/arch/powerpc/annotate/instructions.c
	tools/perf/arch/s390/annotate/instructions.c
	tools/perf/arch/x86/tests/intel-cqm.c
	tools/perf/ui/tui/progress.c
	tools/perf/util/zlib.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:30:18 +01:00
Linus Torvalds
e4880bc5df Merge branch 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "Another fix for a really old bug.

  It only affects drain_workqueue() which isn't used often and even then
  triggers only during a pretty small race window, so it isn't too
  surprising that it stayed hidden for so long.

  The fix is straight-forward and low-risk. Kudos to Li Bin for
  reporting and fixing the bug"

* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix NULL pointer dereference
2017-11-06 12:26:49 -08:00
Roman Gushchin
5f2e673405 cgroup: export list of cgroups v2 features using sysfs
The active development of cgroups v2 sometimes leads to a creation
of interfaces, which are not turned on by default (to provide
backward compatibility). It's handy to know from userspace, which
cgroup v2 features are supported without calculating it based
on the kernel version. So, let's export the list of such features
using /sys/kernel/cgroup/features pseudo-file.

The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.

Example:
  $ cat /sys/kernel/cgroup/features
  nsdelegate

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-06 12:01:57 -08:00
Roman Gushchin
01ee6cfb14 cgroup: export list of delegatable control files using sysfs
Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and likely to be extended in the future).

To guarantee correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via /sys/kernel/cgroup/delegate pseudo-file.

Format is siple: each control file name is printed on a new line.
Example:
  $ cat /sys/kernel/cgroup/delegate
  cgroup.procs
  cgroup.subtree_control

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-06 12:01:54 -08:00
Wang Long
9a19b46386 workqueue: Fix comment for unbound workqueue's attrbutes
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-06 07:04:47 -08:00
Linus Torvalds
9d9cc4aa00 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes:

   - synchronize kernel and tooling headers

   - cgroup support fix

   - two tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/headers: Synchronize kernel ABI headers
  perf/cgroup: Fix perf cgroup hierarchy support
  perf tools: Unwind properly location after REJECT
  perf symbols: Fix memory corruption because of zero length symbols
2017-11-05 11:44:39 -08:00
Roman Gushchin
ebc614f687 bpf, cgroup: implement eBPF-based device controller for cgroup v2
Cgroup v2 lacks the device controller, provided by cgroup v1.
This patch adds a new eBPF program type, which in combination
of previously added ability to attach multiple eBPF programs
to a cgroup, will provide a similar functionality, but with some
additional flexibility.

This patch introduces a BPF_PROG_TYPE_CGROUP_DEVICE program type.
A program takes major and minor device numbers, device type
(block/character) and access type (mknod/read/write) as parameters
and returns an integer which defines if the operation should be
allowed or terminated with -EPERM.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:26:51 +09:00
Jakub Kicinski
b37a530613 bpf: remove old offload/analyzer
Thanks to the ability to load a program for a specific device,
running verifier twice is no longer needed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:20 +09:00
Jakub Kicinski
6c8dfe21c4 cls_bpf: allow attaching programs loaded for specific device
If TC program is loaded with skip_sw flag, we should allow
the device-specific programs to be accepted.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski
248f346ffe xdp: allow attaching programs loaded for specific device
Pass the netdev pointer to bpf_prog_get_type().  This way
BPF code can decide whether the device matches what the
code was loaded/translated for.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski
bd601b6ada bpf: report offload info to user space
Extend struct bpf_prog_info to contain information about program
being bound to a device.  Since the netdev may get destroyed while
program still exists we need a flag to indicate the program is
loaded for a device, even if the device is gone.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Jakub Kicinski
ab3f0063c4 bpf: offload: add infrastructure for loading programs for a specific netdev
The fact that we don't know which device the program is going
to be used on is quite limiting in current eBPF infrastructure.
We have to reverse or limit the changes which kernel makes to
the loaded bytecode if we want it to be offloaded to a networking
device.  We also have to invent new APIs for debugging and
troubleshooting support.

Make it possible to load programs for a specific netdev.  This
helps us to bring the debug information closer to the core
eBPF infrastructure (e.g. we will be able to reuse the verifer
log in device JIT).  It allows device JITs to perform translation
on the original bytecode.

__bpf_prog_get() when called to get a reference for an attachment
point will now refuse to give it if program has a device assigned.
Following patches will add a version of that function which passes
the expected netdev in. @type argument in __bpf_prog_get() is
renamed to attach_type to make it clearer that it's only set on
attachment.

All calls to ndo_bpf are protected by rtnl, only verifier callbacks
are not.  We need a wait queue to make sure netdev doesn't get
destroyed while verifier is still running and calling its driver.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Chris Redpath
d62d813c0d cpufreq: schedutil: Examine the correct CPU when we update util
After commit 674e75411f (sched: cpufreq: Allow remote cpufreq
callbacks) we stopped to always read the utilization for the CPU we
are running the governor on, and instead we read it for the CPU
which we've been told has updated utilization.  This is stored in
sugov_cpu->cpu.

The value is set in sugov_register() but we clear it in sugov_start()
which leads to always looking at the utilization of CPU0 instead of
the correct one.

Fix this by consolidating the initialization code into sugov_start().

Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-04 17:44:28 +01:00
Ingo Molnar
649e441f49 Merge branch 'linus' into core/urgent, to pick up dependent commits
We want to fix an objtool build warning that got introduced in the latest upstream kernel.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-04 08:53:04 +01:00
David S. Miller
2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Dave Martin
2d2123bc7c arm64/sve: Add prctl controls for userspace vector length management
This patch adds two arm64-specific prctls, to permit userspace to
control its vector length:

 * PR_SVE_SET_VL: set the thread's SVE vector length and vector
   length inheritance mode.

 * PR_SVE_GET_VL: get the same information.

Although these prctls resemble instruction set features in the SVE
architecture, they provide additional control: the vector length
inheritance mode is Linux-specific and nothing to do with the
architecture, and the architecture does not permit EL0 to set its
own vector length directly.  Both can be used in portable tools
without requiring the use of SVE instructions.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: Fixed up prctl constants to avoid clash with PDEATHSIG]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-11-03 15:24:19 +00:00
Herbert Xu
ab387f0af2 Linux 4.14-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZ0WQ6AAoJEHm+PkMAQRiGuloH/3sF4qfBhPuJo8OTf0uCtQ18
 4Ux9zZbm81df/Jjz0exAp1Jqk+TvdIS3OXPWcKilvbUBP16hQcsxFTnI/5QF+YcN
 87aNr+OCMJzOBK4suN1yhzO46NYHeIizdB0PTZVL1Zsto69Tt31D8VJmgH6oBxAw
 Isb/nAkOr31dZ9PI5UEExTIanUt6EywVb0UswA+2rNl3h1UkeasQCpMpK2n6HBhU
 kVD7sxEd/CN0MmfhB0HrySSam/BeSpOtzoU9bemOwrU2uu9+5+2rqMe7Gsdj4nX6
 3Kk+7FQNktlrhxCZIFN/+CdusOUuDd8r/75d7DnsRK5YvSb0sZzJkfD3Nba68Ms=
 =7J2+
 -----END PGP SIGNATURE-----

Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux

Merge 4.14-rc3 in order to pick up the new timer_setup function.
2017-11-03 22:10:18 +08:00
Tejun Heo
edbfd9112f Revert "workqueue: respect isolated cpus when queueing an unbound work"
This reverts commit b5149873a0.

It conflicts with the following isolcpus change from the sched branch.

 edb9382175 ("sched/isolation: Move isolcpus= handling to the housekeeping code")

Let's revert for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-11-03 07:02:15 -07:00
Ingo Molnar
294cbd05e3 Merge branch 'linus' into perf/urgent, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-03 12:30:12 +01:00
Craig Gallek
8c01c4f896 bpf: fix verifier NULL pointer dereference
do_check() can fail early without allocating env->cur_state under
memory pressure.  Syzkaller found the stack below on the linux-next
tree because of this.

  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 1 PID: 27062 Comm: syz-executor5 Not tainted 4.14.0-rc7+ #106
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  task: ffff8801c2c74700 task.stack: ffff8801c3e28000
  RIP: 0010:free_verifier_state kernel/bpf/verifier.c:347 [inline]
  RIP: 0010:bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533
  RSP: 0018:ffff8801c3e2f5c8 EFLAGS: 00010202
  RAX: dffffc0000000000 RBX: 00000000fffffff4 RCX: 0000000000000000
  RDX: 0000000000000070 RSI: ffffffff817d5aa9 RDI: 0000000000000380
  RBP: ffff8801c3e2f668 R08: 0000000000000000 R09: 1ffff100387c5d9f
  R10: 00000000218c4e80 R11: ffffffff85b34380 R12: ffff8801c4dc6a28
  R13: 0000000000000000 R14: ffff8801c4dc6a00 R15: ffff8801c4dc6a20
  FS:  00007f311079b700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004d4a24 CR3: 00000001cbcd0000 CR4: 00000000001406e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   bpf_prog_load+0xcbb/0x18e0 kernel/bpf/syscall.c:1166
   SYSC_bpf kernel/bpf/syscall.c:1690 [inline]
   SyS_bpf+0xae9/0x4620 kernel/bpf/syscall.c:1652
   entry_SYSCALL_64_fastpath+0x1f/0xbe
  RIP: 0033:0x452869
  RSP: 002b:00007f311079abe8 EFLAGS: 00000212 ORIG_RAX: 0000000000000141
  RAX: ffffffffffffffda RBX: 0000000000758020 RCX: 0000000000452869
  RDX: 0000000000000030 RSI: 0000000020168000 RDI: 0000000000000005
  RBP: 00007f311079aa20 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000212 R12: 00000000004b7550
  R13: 00007f311079ab58 R14: 00000000004b7560 R15: 0000000000000000
  Code: df 48 c1 ea 03 80 3c 02 00 0f 85 e6 0b 00 00 4d 8b 6e 20 48 b8 00 00 00 00 00 fc ff df 49 8d bd 80 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 b6 0b 00 00 49 8b bd 80 03 00 00 e8 d6 0c 26
  RIP: free_verifier_state kernel/bpf/verifier.c:347 [inline] RSP: ffff8801c3e2f5c8
  RIP: bpf_check+0xcf4/0x19c0 kernel/bpf/verifier.c:4533 RSP: ffff8801c3e2f5c8
  ---[ end trace c8d37f339dc64004 ]---

Fixes: 638f5b90d4 ("bpf: reduce verifier memory consumption")
Fixes: 1969db47f8 ("bpf: fix verifier memory leaks")
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 15:49:15 +09:00
Arnd Bergmann
eba0c929d1 bpf: fix out-of-bounds access warning in bpf_check
The bpf_verifer_ops array is generated dynamically and may be
empty depending on configuration, which then causes an out
of bounds access:

kernel/bpf/verifier.c: In function 'bpf_check':
kernel/bpf/verifier.c:4320:29: error: array subscript is above array bounds [-Werror=array-bounds]

This adds a check to the start of the function as a workaround.
I would assume that the function is never called in that configuration,
so the warning is probably harmless.

Fixes: 00176a34d9 ("bpf: remove the verifier ops from program structure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 14:20:22 +09:00
Arnd Bergmann
7cce782ef3 bpf: fix link error without CONFIG_NET
I ran into this link error with the latest net-next plus linux-next
trees when networking is disabled:

kernel/bpf/verifier.o:(.rodata+0x2958): undefined reference to `tc_cls_act_analyzer_ops'
kernel/bpf/verifier.o:(.rodata+0x2970): undefined reference to `xdp_analyzer_ops'

It seems that the code was written to deal with varying contents of
the arrray, but the actual #ifdef was missing. Both tc_cls_act_analyzer_ops
and xdp_analyzer_ops are defined in the core networking code, so adding
a check for CONFIG_NET seems appropriate here, and I've verified this with
many randconfig builds

Fixes: 4f9218aaf8 ("bpf: move knowledge about post-translation offsets out of verifier")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 14:20:22 +09:00
Kees Cook
fd30b717b8 rcu: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-02 15:44:09 -07:00
Thomas Gleixner
722c908f84 irqchip updates for 4.15, take #2
- A number of MIPS GIC updates and cleanups
 - One GICv4 update
 - Another firmware workaround for GICv2
 - Support for Mason8 GPIOs
 - Tiny documentation fix
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAln7UeAVHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDtLkQAII8Ij6h2nWN8MaSOcrOsar58PU9
 gfa8HidjH7QRht5xcpaAYXO0+VO7etP37JpTlzcJfpaAU8tFTKAMJKbWaHRe2KKV
 YbfRYWhnyizmb2Sn5+JDKlVVPAgR9tZN9UQyDceERb4yZmGOc9nHBObJIPAYtvnr
 zKLPo5ldVs7bCQan+uGXMyncVGgJ70rYCNPhHqzyNbAUgkI3XA85Wk9NgBXQCr/2
 8SiE6hs+QdkELy545Pi0oPxvqY5q4f2QmgOjWhQtJgbAQvWpizMi2Qz4zH8Gs89t
 Iln6zaw8315vfPSiyXckbIlMoUUzejLuxtvwvdgdrjE5vogNkj+Z874R3yNbLZ2H
 CW+DZz9gv+e2uzIiqBBpw1DatWBQQxoXvUVW1fyEOqwZPpr7JvAKwJ9Nwhc0kUdG
 H2AjO1KL/WqiqCpEDAYF+zY3zLUSAaoNjMWX8+nlaUstgQO4hSyPyU3RUJXmpHf8
 DHIVelZlhurKjQP6SfH1QKs+Wqv2QAZb8Otb8Pt/IMw73mXmRMbMPB7uOc2A7TqG
 uceAQuOyVBG+QF2dGiXZ5RqTNtL1gWns+S+abJCtRkD+JV83teZw3I4sMnVpeqtO
 rdavJVejrePck8Tgd1qOCGfXUQrBhCG1xJFkt/Y9tWbduU0R4V1y5J7LcGBLMYSy
 cV5WD3fejfjayTlT
 =Aozt
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull the second batch of irqchip updates for 4.15 from marc Zyngier:

 - A number of MIPS GIC updates and cleanups
 - One GICv4 update
 - Another firmware workaround for GICv2
 - Support for Mason8 GPIOs
 - Tiny documentation fix
2017-11-02 19:18:08 +01:00
Linus Torvalds
ead751507d License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
 makes it harder for compliance tools to determine the correct license.
 
 By default all files without license information are under the default
 license of the kernel, which is GPL version 2.
 
 Update the files which contain no license information with the 'GPL-2.0'
 SPDX license identifier.  The SPDX identifier is a legally binding
 shorthand, which can be used instead of the full boiler plate text.
 
 This patch is based on work done by Thomas Gleixner and Kate Stewart and
 Philippe Ombredanne.
 
 How this work was done:
 
 Patches were generated and checked against linux-4.14-rc6 for a subset of
 the use cases:
  - file had no licensing information it it.
  - file was a */uapi/* one with no licensing information in it,
  - file was a */uapi/* one with existing licensing information,
 
 Further patches will be generated in subsequent months to fix up cases
 where non-standard license headers were used, and references to license
 had to be inferred by heuristics based on keywords.
 
 The analysis to determine which SPDX License Identifier to be applied to
 a file was done in a spreadsheet of side by side results from of the
 output of two independent scanners (ScanCode & Windriver) producing SPDX
 tag:value files created by Philippe Ombredanne.  Philippe prepared the
 base worksheet, and did an initial spot review of a few 1000 files.
 
 The 4.13 kernel was the starting point of the analysis with 60,537 files
 assessed.  Kate Stewart did a file by file comparison of the scanner
 results in the spreadsheet to determine which SPDX license identifier(s)
 to be applied to the file. She confirmed any determination that was not
 immediately clear with lawyers working with the Linux Foundation.
 
 Criteria used to select files for SPDX license identifier tagging was:
  - Files considered eligible had to be source code files.
  - Make and config files were included as candidates if they contained >5
    lines of source
  - File already had some variant of a license header in it (even if <5
    lines).
 
 All documentation files were explicitly excluded.
 
 The following heuristics were used to determine which SPDX license
 identifiers to apply.
 
  - when both scanners couldn't find any license traces, file was
    considered to have no license information in it, and the top level
    COPYING file license applied.
 
    For non */uapi/* files that summary was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0                                              11139
 
    and resulted in the first patch in this series.
 
    If that file was a */uapi/* path one, it was "GPL-2.0 WITH
    Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|-------
    GPL-2.0 WITH Linux-syscall-note                        930
 
    and resulted in the second patch in this series.
 
  - if a file had some form of licensing information in it, and was one
    of the */uapi/* ones, it was denoted with the Linux-syscall-note if
    any GPL family license was found in the file or had no licensing in
    it (per prior point).  Results summary:
 
    SPDX license identifier                            # files
    ---------------------------------------------------|------
    GPL-2.0 WITH Linux-syscall-note                       270
    GPL-2.0+ WITH Linux-syscall-note                      169
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
    ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
    LGPL-2.1+ WITH Linux-syscall-note                      15
    GPL-1.0+ WITH Linux-syscall-note                       14
    ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
    LGPL-2.0+ WITH Linux-syscall-note                       4
    LGPL-2.1 WITH Linux-syscall-note                        3
    ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
    ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
 
    and that resulted in the third patch in this series.
 
  - when the two scanners agreed on the detected license(s), that became
    the concluded license(s).
 
  - when there was disagreement between the two scanners (one detected a
    license but the other didn't, or they both detected different
    licenses) a manual inspection of the file occurred.
 
  - In most cases a manual inspection of the information in the file
    resulted in a clear resolution of the license that should apply (and
    which scanner probably needed to revisit its heuristics).
 
  - When it was not immediately clear, the license identifier was
    confirmed with lawyers working with the Linux Foundation.
 
  - If there was any question as to the appropriate license identifier,
    the file was flagged for further research and to be revisited later
    in time.
 
 In total, over 70 hours of logged manual review was done on the
 spreadsheet to determine the SPDX license identifiers to apply to the
 source files by Kate, Philippe, Thomas and, in some cases, confirmation
 by lawyers working with the Linux Foundation.
 
 Kate also obtained a third independent scan of the 4.13 code base from
 FOSSology, and compared selected files where the other two scanners
 disagreed against that SPDX file, to see if there was new insights.  The
 Windriver scanner is based on an older version of FOSSology in part, so
 they are related.
 
 Thomas did random spot checks in about 500 files from the spreadsheets
 for the uapi headers and agreed with SPDX license identifier in the
 files he inspected. For the non-uapi files Thomas did random spot checks
 in about 15000 files.
 
 In initial set of patches against 4.14-rc6, 3 files were found to have
 copy/paste license identifier errors, and have been fixed to reflect the
 correct identifier.
 
 Additionally Philippe spent 10 hours this week doing a detailed manual
 inspection and review of the 12,461 patched files from the initial patch
 version early this week with:
  - a full scancode scan run, collecting the matched texts, detected
    license ids and scores
  - reviewing anything where there was a license detected (about 500+
    files) to ensure that the applied SPDX license was correct
  - reviewing anything where there was no detection but the patch license
    was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
    SPDX license was correct
 
 This produced a worksheet with 20 files needing minor correction.  This
 worksheet was then exported into 3 different .csv files for the
 different types of files to be modified.
 
 These .csv files were then reviewed by Greg.  Thomas wrote a script to
 parse the csv files and add the proper SPDX tag to the file, in the
 format that the file expected.  This script was further refined by Greg
 based on the output to detect more types of files automatically and to
 distinguish between header and source .c files (which need different
 comment types.)  Finally Greg ran the script using the .csv files to
 generate the patches.
 
 Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
 Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
 Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
 6dVh26uchcEQLN/XqUDt
 =x306
 -----END PGP SIGNATURE-----

Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information it it.

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information,

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from of
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few 1000 files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging was:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
     was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there was new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types.) Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Marc Zyngier
05f3647359 Linux 4.14-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZ0WQ6AAoJEHm+PkMAQRiGuloH/3sF4qfBhPuJo8OTf0uCtQ18
 4Ux9zZbm81df/Jjz0exAp1Jqk+TvdIS3OXPWcKilvbUBP16hQcsxFTnI/5QF+YcN
 87aNr+OCMJzOBK4suN1yhzO46NYHeIizdB0PTZVL1Zsto69Tt31D8VJmgH6oBxAw
 Isb/nAkOr31dZ9PI5UEExTIanUt6EywVb0UswA+2rNl3h1UkeasQCpMpK2n6HBhU
 kVD7sxEd/CN0MmfhB0HrySSam/BeSpOtzoU9bemOwrU2uu9+5+2rqMe7Gsdj4nX6
 3Kk+7FQNktlrhxCZIFN/+CdusOUuDd8r/75d7DnsRK5YvSb0sZzJkfD3Nba68Ms=
 =7J2+
 -----END PGP SIGNATURE-----

Merge tag 'v4.14-rc3' into irq/irqchip-4.15

Required merge to get mainline irqchip updates.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2017-11-02 15:54:58 +00:00
Jiri Slaby
e78c38f6bd futex: futex_wake_op, do not fail on invalid op
In commit 30d6e0a419 ("futex: Remove duplicated code and fix undefined
behaviour"), I let FUTEX_WAKE_OP to fail on invalid op.  Namely when op
should be considered as shift and the shift is out of range (< 0 or > 31).

But strace's test suite does this madness:

  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);

When I pick the first 0xa0caffee, it decodes as:

  0x80000000 & 0xa0caffee: oparg is shift
  0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
  0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
  0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
  0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18

That means the op tries to do this:

  (futex |= (1 << (-849))) == -18

which is completely bogus. The new check of op in the code is:

        if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                if (oparg < 0 || oparg > 31)
                        return -EINVAL;
                oparg = 1 << oparg;
        }

which results obviously in the "Invalid argument" errno:

  FAIL: futex
  ===========

  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
  futex.test: failed test: ../futex failed with code 1

So let us soften the failure to print only a (ratelimited) message, crop
the value and continue as if it were right.  When userspace keeps up, we
can switch this to return -EINVAL again.

[v2] Do not return 0 immediatelly, proceed with the cropped value.

Fixes: 30d6e0a419 ("futex: Remove duplicated code and fix undefined behaviour")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-02 07:41:50 -07:00
Randy Dunlap
6082a6e444 kernel/time/Kconfig: Fix typo in comment
Fix typo in Kconfig comment text.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Jiri Kosina <trivial@kernel.org>
Link: https://lkml.kernel.org/r/0e586dd4-2b27-864e-c252-bc72df52fd01@infradead.org
2017-11-02 12:50:34 +01:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Daniel Borkmann
5beca081be bpf: also improve pattern matches for meta access
Follow-up to 0fd4759c55 ("bpf: fix pattern matches for direct
packet access") to cover also the remaining data_meta/data matches
in the verifier. The matches are also refactored a bit to simplify
handling of all the cases.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 17:01:38 +09:00
Daniel Borkmann
b06723da82 bpf: minor cleanups after merge
Two minor cleanups after Dave's recent merge in f8ddadc4db
("Merge git://git.kernel.org...") of net into net-next in
order to get the code in line with what was done originally
in the net tree: i) use max() instead of max_t() since both
ranges are u16, ii) don't split the direct access test cases
in the middle with bpf_exit test cases from 390ee7e29f
("bpf: enforce return code for cgroup-bpf programs").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 17:01:38 +09:00
Jesper Dangaard Brouer
03c4cc385f bpf: cpumap micro-optimization in cpu_map_enqueue
Discovered that the compiler laid-out asm code in suboptimal way
when studying perf report during benchmarking of cpumap. Help
the compiler by the marking unlikely code paths.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 16:13:14 +09:00
David S. Miller
ed29668d1a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Smooth Cong Wang's bug fix into 'net-next'.  Basically put
the bulk of the tcf_block_put() logic from 'net' into
tcf_block_put_ext(), but after the offload unbind.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-02 15:23:39 +09:00
Linus Torvalds
3a99df9a3d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull signal bugfix from Eric Biederman:
 "When making the generic support for SIGEMT conditional on the presence
  of SIGEMT I made a typo that causes it to fail to activate. It was
  noticed comparatively quickly but the bug report just made it to me
  today"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Fix name of SIGEMT in #if defined() check
2017-11-01 16:04:27 -07:00
Andrew Clayton
c3aff086ea signal: Fix name of SIGEMT in #if defined() check
Commit cc731525f2 ("signal: Remove kernel interal si_code magic")
added a check for SIGMET and NSIGEMT being defined. That SIGMET should
in fact be SIGEMT, with SIGEMT being defined in
arch/{alpha,mips,sparc}/include/uapi/asm/signal.h

This was actually pointed out by BenHutchings in a lwn.net comment
here https://lwn.net/Comments/734608/

Fixes: cc731525f2 ("signal: Remove kernel interal si_code magic")
Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-11-01 17:04:57 -05:00
Don Zickus
42f930da7f watchdog/hardlockup/perf: Use atomics to track in-use cpu counter
Guenter reported:
  There is still a problem. When running 
    echo 6 > /proc/sys/kernel/watchdog_thresh
    echo 5 > /proc/sys/kernel/watchdog_thresh
  repeatedly, the message
 
   NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
 
  stops after a while (after ~10-30 iterations, with fluctuations).
  Maybe watchdog_cpus needs to be atomic ?

That's correct as this again is affected by the asynchronous nature of the
smpboot thread unpark mechanism.

CPU 0				CPU1			CPU2
write(watchdog_thresh, 6)	
  stop()
    park()
  update()
  start()
    unpark()
				thread->unpark()
				  cnt++;
write(watchdog_thresh, 5)				thread->unpark()
  stop()
    park()			thread->park()
				   cnt--;		  cnt++;
  update()
  start()
    unpark()

That's not a functional problem, it just affects the informational message.

Convert watchdog_cpus to atomic_t to prevent the problem

Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20171101181126.j727fqjmdthjz4xk@redhat.com
2017-11-01 21:18:40 +01:00
Thomas Gleixner
9c388a5ed1 watchdog/harclockup/perf: Revert a33d44843d ("watchdog/hardlockup/perf: Simplify deferred event destroy")
Guenter reported a crash in the watchdog/perf code, which is caused by
cleanup() and enable() running concurrently. The reason for this is:

The watchdog functions are serialized via the watchdog_mutex and cpu
hotplug locking, but the enable of the perf based watchdog happens in
context of the unpark callback of the smpboot thread. But that unpark
function is not synchronous inside the locking. The unparking of the thread
just wakes it up and leaves so there is no guarantee when the thread is
executing.

If it starts running _before_ the cleanup happened then it will create a
event and overwrite the dead event pointer. The new event is then cleaned
up because the event is marked dead.

    lock(watchdog_mutex);
    lockup_detector_reconfigure();
        cpus_read_lock();
	stop();
	   park()
	update();
	start();
	   unpark()
	cpus_read_unlock();		thread runs()
					  overwrite dead event ptr
	cleanup();
	  free new event, which is active inside perf....
    unlock(watchdog_mutex);

The park side is safe as that actually waits for the thread to reach
parked state.

Commit a33d44843d removed the protection against this kind of scenario
under the stupid assumption that the hotplug serialization and the
watchdog_mutex cover everything. 

Bring it back.

Reverts: a33d44843d ("watchdog/hardlockup/perf: Simplify deferred event destroy")
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Feels-stupid Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710312145190.1942@nanos
2017-11-01 21:18:39 +01:00
Prasad Sodagudi
39c82caff8 clockevents: Update clockevents device next_event on stop
clockevent_device::next_event holds the next timer event of a clock event
device. The value is updated in clockevents_program_event(), i.e. when the
hardware timer is armed for the next expiry.

When there are no software timers armed on a CPU, the corresponding per CPU
clockevent device is brought into ONESHOT_STOPPED state, but
clockevent_device::next_event is not updated, because
clockevents_program_event() is not called.

So the content of clockevent_device::next_event is stale, which is not an
issue when real hardware is used. But the hrtimer broadcast device relies
on that information and the stale value causes spurious wakeups.

Update clockevent_device::next_event to KTIME_MAX when it has been brought
into ONESHOT_STOPPED state to avoid spurious wakeups. This reflects the
proper expiry time of the stopped timer: infinity.

[ tglx: Massaged changelog ]

Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/1509043042-32486-1-git-send-email-psodagud@codeaurora.org
2017-11-01 18:20:17 +01:00
Alexei Starovoitov
1969db47f8 bpf: fix verifier memory leaks
fix verifier memory leaks

Fixes: 638f5b90d4 ("bpf: reduce verifier memory consumption")
Signed-off-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 22:07:31 +09:00
Rakib Mullick
10d94ff4d5 irq/core: Fix boot crash when the irqaffinity= boot parameter is passed on CPUMASK_OFFSTACK=y kernels(v1)
When the irqaffinity= kernel parameter is passed in a CPUMASK_OFFSTACK=y
kernel, it fails to boot, because zalloc_cpumask_var() cannot be used before
initializing the slab allocator to allocate a cpumask.

So, use alloc_bootmem_cpumask_var() instead.

Also do some cleanups while at it: in init_irq_default_affinity() remove
an #ifdef via using cpumask_available().

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171026045800.27087-1-rakib.mullick@gmail.com
Link: http://lkml.kernel.org/r/20171101041451.12581-1-rakib.mullick@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-01 09:56:39 +01:00
Peter Zijlstra
153fbd1226 futex: Fix more put_pi_state() vs. exit_pi_state_list() races
Dmitry (through syzbot) reported being able to trigger the WARN in
get_pi_state() and a use-after-free on:

	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);

Both are due to this race:

  exit_pi_state_list()				put_pi_state()

  lock(&curr->pi_lock)
  while() {
	pi_state = list_first_entry(head);
	hb = hash_futex(&pi_state->key);
	unlock(&curr->pi_lock);

						dec_and_test(&pi_state->refcount);

	lock(&hb->lock)
	lock(&pi_state->pi_mutex.wait_lock)	// uaf if pi_state free'd
	lock(&curr->pi_lock);

	....

	unlock(&curr->pi_lock);
	get_pi_state();				// WARN; refcount==0

The problem is we take the reference count too late, and don't allow it
being 0. Fix it by using inc_not_zero() and simply retrying the loop
when we fail to get a refcount. In that case put_pi_state() should
remove the entry from the list.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Gratian Crisan <gratian.crisan@ni.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: dvhart@infradead.org
Cc: syzbot <bot+2af19c9e1ffe4d4ee1d16c56ae7580feaee75765@syzkaller.appspotmail.com>
Cc: syzkaller-bugs@googlegroups.com
Cc: <stable@vger.kernel.org>
Fixes: c74aef2d06 ("futex: Fix pi_state->owner serialization")
Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-01 09:05:00 +01:00
Yonghong Song
07c41a295c bpf: avoid rcu_dereference inside bpf_event_mutex lock region
During perf event attaching/detaching bpf programs,
the tp_event->prog_array change is protected by the
bpf_event_mutex lock in both attaching and deteching
functions. Although tp_event->prog_array is a rcu
pointer, rcu_derefrence is not needed to access it
since mutex lock will guarantee ordering.

Verified through "make C=2" that sparse
locking check still happy with the new change.

Also change the label name in perf_event_{attach,detach}_bpf_prog
from "out" to "unlock" to reflect the code action after the label.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 12:35:48 +09:00
John Fastabend
04686ef299 bpf: remove SK_REDIRECT from UAPI
Now that SK_REDIRECT is no longer a valid return code. Remove it
from the UAPI completely. Then do a namespace remapping internal
to sockmap so SK_REDIRECT is no longer externally visible.

Patchs primary change is to do a namechange from SK_REDIRECT to
__SK_REDIRECT

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 11:43:50 +09:00
Alexei Starovoitov
638f5b90d4 bpf: reduce verifier memory consumption
the verifier got progressively smarter over time and size of its internal
state grew as well. Time to reduce the memory consumption.

Before:
sizeof(struct bpf_verifier_state) = 6520
After:
sizeof(struct bpf_verifier_state) = 896

It's done by observing that majority of BPF programs use little to
no stack whereas verifier kept all of 512 stack slots ready always.
Instead dynamically reallocate struct verifier state when stack
access is detected.
Runtime difference before vs after is within a noise.
The number of processed instructions stays the same.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 11:41:18 +09:00
Eric W. Biederman
3fda0e737e userns: Simplify insert_extent
Consolidate the code to write to the new mapping at the end of the
function to remove the duplication.  Move the increase in the number
of mappings into insert_extent, keeping the logic together.

Just a small increase in readability and maintainability.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-10-31 17:23:13 -05:00
Eric W. Biederman
ece6613397 userns: Make map_id_down a wrapper for map_id_range_down
There is no good reason for this code duplication, the number of cache
line accesses not the number of instructions are the bottleneck in
this code.

Therefore simplify maintenance by removing unnecessary code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-10-31 17:23:13 -05:00
Eric W. Biederman
d5e7b3c5f5 userns: Don't read extents twice in m_start
This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
the map is being written does not do strange things.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-10-31 17:23:12 -05:00
Eric W. Biederman
3edf652fa1 userns: Simplify the user and group mapping functions
Consolidate reading the number of extents and computing the return
value in the map_id_down, map_id_range_down and map_id_range.

This removal of one read of extents makes one smp_rmb unnecessary
and makes the code safe it is executed during the map write.  Reading
the number of extents twice and depending on the result being the same
is not safe, as it could be 0 the first time and > 5 the second time,
which would lead to misinterpreting the union fields.

The consolidation of the return value just removes a duplicate
caluculation which should make it easier to understand and maintain
the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-10-31 17:23:12 -05:00
Eric W. Biederman
11a8b9270e userns: Don't special case a count of 0
We can always use a count of 1 so there is no reason to have
a special case of a count of 0.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-10-31 17:23:11 -05:00
Christian Brauner
6397fac491 userns: bump idmap limits to 340
There are quite some use cases where users run into the current limit for
{g,u}id mappings. Consider a user requesting us to map everything but 999, and
1001 for a given range of 1000000000 with a sub{g,u}id layout of:

some-user:100000:1000000000
some-user:999:1
some-user:1000:1
some-user:1001:1
some-user:1002:1

This translates to:

MAPPING-TYPE | CONTAINER |    HOST |     RANGE |
-------------|-----------|---------|-----------|
         uid |       999 |     999 |         1 |
         uid |      1001 |    1001 |         1 |
         uid |         0 | 1000000 |       999 |
         uid |      1000 | 1001000 |         1 |
         uid |      1002 | 1001002 | 999998998 |
------------------------------------------------
         gid |       999 |     999 |         1 |
         gid |      1001 |    1001 |         1 |
         gid |         0 | 1000000 |       999 |
         gid |      1000 | 1001000 |         1 |
         gid |      1002 | 1001002 | 999998998 |

which is already the current limit.

As discussed at LPC simply bumping the number of limits is not going to work
since this would mean that struct uid_gid_map won't fit into a single cache-line
anymore thereby regressing performance for the base-cases. The same problem
seems to arise when using a single pointer. So the idea is to use

struct uid_gid_extent {
	u32 first;
	u32 lower_first;
	u32 count;
};

struct uid_gid_map { /* 64 bytes -- 1 cache line */
	u32 nr_extents;
	union {
		struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
		struct {
			struct uid_gid_extent *forward;
			struct uid_gid_extent *reverse;
		};
	};
};

For the base cases we will only use the struct uid_gid_extent extent member. If
we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
kmalloc() which means we can have a maximum of 340 mappings
(340 * size(struct uid_gid_extent) = 4080). For the latter case we use two
pointers "forward" and "reverse". The forward pointer points to an array sorted
by "first" and the reverse pointer points to an array sorted by "lower_first".
We can then perform binary search on those arrays.

Performance Testing:
When Eric introduced the extent-based struct uid_gid_map approach he measured
the performanc impact of his idmap changes:

> My benchmark consisted of going to single user mode where nothing else was
> running. On an ext4 filesystem opening 1,000,000 files and looping through all
> of the files 1000 times and calling fstat on the individuals files. This was
> to ensure I was benchmarking stat times where the inodes were in the kernels
> cache, but the inode values were not in the processors cache. My results:

> v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

I used an identical approach on my laptop. Here's a thorough description of what
I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
booted into single user mode and used an ext4 filesystem to open/create
1,000,000 files. Then I looped through all of the files calling fstat() on each
of them 1000 times and calculated the mean fstat() time for a single file. (The
test program can be found below.)

Here are the results. For fun, I compared the first version of my patch which
scaled linearly with the new version of the patch:

|   # MAPPINGS |   PATCH-V1 | PATCH-NEW |
|--------------|------------|-----------|
|   0 mappings |     158 ns |   158 ns  |
|   1 mappings |     164 ns |   157 ns  |
|   2 mappings |     170 ns |   158 ns  |
|   3 mappings |     175 ns |   161 ns  |
|   5 mappings |     187 ns |   165 ns  |
|  10 mappings |     218 ns |   199 ns  |
|  50 mappings |     528 ns |   218 ns  |
| 100 mappings |     980 ns |   229 ns  |
| 200 mappings |    1880 ns |   239 ns  |
| 300 mappings |    2760 ns |   240 ns  |
| 340 mappings | not tested |   248 ns  |

Here's the test program I used. I asked Eric what he did and this is a more
"advanced" implementation of the idea. It's pretty straight-forward:

 #define __GNU_SOURCE
 #define __STDC_FORMAT_MACROS
 #include <errno.h>
 #include <dirent.h>
 #include <fcntl.h>
 #include <inttypes.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <sys/types.h>

 int main(int argc, char *argv[])
 {
 	int ret;
 	size_t i, k;
 	int fd[1000000];
 	int times[1000];
 	char pathname[4096];
 	struct stat st;
 	struct timeval t1, t2;
 	uint64_t time_in_mcs;
 	uint64_t sum = 0;

 	if (argc != 2) {
 		fprintf(stderr, "Please specify a directory where to create "
 				"the test files\n");
 		exit(EXIT_FAILURE);
 	}

 	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
 		sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
 		fd[i]= open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
 		if (fd[i] < 0) {
 			ssize_t j;
 			for (j = i; j >= 0; j--)
 				close(fd[j]);
 			exit(EXIT_FAILURE);
 		}
 	}

 	for (k = 0; k < 1000; k++) {
 		ret = gettimeofday(&t1, NULL);
 		if (ret < 0)
 			goto close_all;

 		for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
 			ret = fstat(fd[i], &st);
 			if (ret < 0)
 				goto close_all;
 		}

 		ret = gettimeofday(&t2, NULL);
 		if (ret < 0)
 			goto close_all;

 		time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
 			      (1000000 * t1.tv_sec + t1.tv_usec);
 		printf("Total time in micro seconds:       %" PRIu64 "\n",
 		       time_in_mcs);
 		printf("Total time in nanoseconds:         %" PRIu64 "\n",
 		       time_in_mcs * 1000);
 		printf("Time per file in nanoseconds:      %" PRIu64 "\n",
 		       (time_in_mcs * 1000) / 1000000);
 		times[k] = (time_in_mcs * 1000) / 1000000;
 	}

 close_all:
 	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
 		close(fd[i]);

 	if (ret < 0)
 		exit(EXIT_FAILURE);

 	for (k = 0; k < 1000; k++) {
 		sum += times[k];
 	}

 	printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);

 	exit(EXIT_SUCCESS);;
 }

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
CC: Serge Hallyn <serge@hallyn.com>
CC: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-10-31 17:23:04 -05:00
Christian Brauner
aa4bf44dc8 userns: use union in {g,u}idmap struct
- Add a struct containing two pointer to extents and wrap both the static extent
  array and the struct into a union. This is done in preparation for bumping the
  {g,u}idmap limits for user namespaces.
- Add brackets around anonymous union when using designated initializers to
  initialize members in order to please gcc <= 4.4.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-10-31 17:22:58 -05:00
Thomas Gleixner
fb56d689fb Merge branch 'fortglx/4.15/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull timekeeping updates from John Stultz:

 - More y2038 work from Arnd Bergmann

 - A new mechanism to allow RTC drivers to specify the resolution of the
   RTC so the suspend/resume code can make informed decisions whether to
   inject the suspended time or not in case of fast suspend/resume cycles.
2017-10-31 23:17:28 +01:00
Elena Reshetova
ab97f87325 fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable fsnotify_mark.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-31 17:54:56 +01:00
Byungchul Park
16c0890dc6 irq/work: Don't reinvent the wheel but use existing llist API
Use the proper llist APIs instead of open-coded variants of them.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1509414414-14987-1-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-31 10:24:39 +01:00
Arnd Bergmann
abc8f96e3e time: Move time_t conversion helpers to time32.h
On 64-bit architectures, the timespec64 based helpers in linux/time.h
are defined as macros pointing to their timespec based counterparts.
This made sense when they were first introduced, but as we are migrating
away from timespec in general, it's much less intuitive now.

This changes the macros to work in the exact opposite way: we always
provide the timespec64 based helpers and define the old interfaces as
macros for them. Now we can move those macros into linux/time32.h, which
already contains the respective helpers for 32-bit architectures.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-10-30 15:17:19 -07:00
Arnd Bergmann
85bf19e7df time: Remove unused functions
The (slow but) ongoing work on conversion from timespec to timespec64
has led some timespec based helper functions to become unused.

No new code should use them, so we can remove the functions entirely.
I'm planning to obsolete additional interfaces next and remove
more of these.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-10-30 15:14:18 -07:00