mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-29 22:30:49 +00:00
The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
which protects most operations such as slab shrink, registration and
unregistration of shrinkers, etc. This can easily cause problems in the
following cases.
1) When the memory pressure is high and there are many filesystems
mounted or unmounted at the same time, slab shrink will be affected
(down_read_trylock() failed).
Such as the real workload mentioned by Kirill Tkhai:
```
One of the real workloads from my experience is start
of an overcommitted node containing many starting
containers after node crash (or many resuming containers
after reboot for kernel update). In these cases memory
pressure is huge, and the node goes round in long reclaim.
```
2) If a shrinker is blocked (such as the case mentioned
in [1]) and a writer comes in (such as mount a fs),
then this writer will be blocked and cause all
subsequent shrinker-related operations to be blocked.
Even if there is no competitor when shrinking slab, there may still be a
problem. The down_read_trylock() may become a perf hotspot with frequent
calls to shrink_slab(). Because of the poor multicore scalability of
atomic operations, this can lead to a significant drop in IPC
(instructions per cycle).
We used to implement the lockless slab shrink with SRCU [2], but then
kernel test robot reported -88.8% regression in
stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
This commit uses the refcount+RCU method [5] proposed by Dave Chinner
to re-implement the lockless global slab shrink. The memcg slab shrink is
handled in the subsequent patch.
For now, all shrinker instances are converted to dynamically allocated and
will be freed by call_rcu(). So we can use rcu_read_{lock,unlock}() to
ensure that the shrinker instance is valid.
And the shrinker instance will not be run again after unregistration. So
the structure that records the pointer of shrinker instance can be safely
freed without waiting for the RCU read-side critical section.
In this way, while we implement the lockless slab shrink, we don't need to
be blocked in unregister_shrinker().
The following are the test results:
stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
1) Before applying this patchset:
setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 473062 60.00 8.00 279.13 7884.12 1647.59
for a 60.01s run time:
1440.34s available CPU time
7.99s user time ( 0.55%)
279.13s system time ( 19.38%)
287.12s total time ( 19.93%)
load average: 7.12 2.99 1.15
successful run completed in 60.01s (1 min, 0.01 secs)
2) After applying this patchset:
setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
(secs) (secs) (secs) (real time) (usr+sys time)
ramfs 477165 60.00 8.13 281.34 7952.55 1648.40
for a 60.01s run time:
1440.33s available CPU time
8.12s user time ( 0.56%)
281.34s system time ( 19.53%)
289.46s total time ( 20.10%)
load average: 6.98 3.03 1.19
successful run completed in 60.01s (1 min, 0.01 secs)
We can see that the ops/s has hardly changed.
[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[5]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
Link: https://lkml.kernel.org/r/20230911094444.68966-43-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
160 lines
4.9 KiB
C
160 lines
4.9 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
|
|
#ifndef _LINUX_SHRINKER_H
|
|
#define _LINUX_SHRINKER_H
|
|
|
|
#include <linux/atomic.h>
|
|
#include <linux/types.h>
|
|
#include <linux/refcount.h>
|
|
#include <linux/completion.h>
|
|
|
|
#define SHRINKER_UNIT_BITS BITS_PER_LONG
|
|
|
|
/*
|
|
* Bitmap and deferred work of shrinker::id corresponding to memcg-aware
|
|
* shrinkers, which have elements charged to the memcg.
|
|
*/
|
|
struct shrinker_info_unit {
|
|
atomic_long_t nr_deferred[SHRINKER_UNIT_BITS];
|
|
DECLARE_BITMAP(map, SHRINKER_UNIT_BITS);
|
|
};
|
|
|
|
struct shrinker_info {
|
|
struct rcu_head rcu;
|
|
int map_nr_max;
|
|
struct shrinker_info_unit *unit[];
|
|
};
|
|
|
|
/*
|
|
* This struct is used to pass information from page reclaim to the shrinkers.
|
|
* We consolidate the values for easier extension later.
|
|
*
|
|
* The 'gfpmask' refers to the allocation we are currently trying to
|
|
* fulfil.
|
|
*/
|
|
struct shrink_control {
|
|
gfp_t gfp_mask;
|
|
|
|
/* current node being shrunk (for NUMA aware shrinkers) */
|
|
int nid;
|
|
|
|
/*
|
|
* How many objects scan_objects should scan and try to reclaim.
|
|
* This is reset before every call, so it is safe for callees
|
|
* to modify.
|
|
*/
|
|
unsigned long nr_to_scan;
|
|
|
|
/*
|
|
* How many objects did scan_objects process?
|
|
* This defaults to nr_to_scan before every call, but the callee
|
|
* should track its actual progress.
|
|
*/
|
|
unsigned long nr_scanned;
|
|
|
|
/* current memcg being shrunk (for memcg aware shrinkers) */
|
|
struct mem_cgroup *memcg;
|
|
};
|
|
|
|
#define SHRINK_STOP (~0UL)
|
|
#define SHRINK_EMPTY (~0UL - 1)
|
|
/*
|
|
* A callback you can register to apply pressure to ageable caches.
|
|
*
|
|
* @count_objects should return the number of freeable items in the cache. If
|
|
* there are no objects to free, it should return SHRINK_EMPTY, while 0 is
|
|
* returned in cases of the number of freeable items cannot be determined
|
|
* or shrinker should skip this cache for this time (e.g., their number
|
|
* is below shrinkable limit). No deadlock checks should be done during the
|
|
* count callback - the shrinker relies on aggregating scan counts that couldn't
|
|
* be executed due to potential deadlocks to be run at a later call when the
|
|
* deadlock condition is no longer pending.
|
|
*
|
|
* @scan_objects will only be called if @count_objects returned a non-zero
|
|
* value for the number of freeable objects. The callout should scan the cache
|
|
* and attempt to free items from the cache. It should then return the number
|
|
* of objects freed during the scan, or SHRINK_STOP if progress cannot be made
|
|
* due to potential deadlocks. If SHRINK_STOP is returned, then no further
|
|
* attempts to call the @scan_objects will be made from the current reclaim
|
|
* context.
|
|
*
|
|
* @flags determine the shrinker abilities, like numa awareness
|
|
*/
|
|
struct shrinker {
|
|
unsigned long (*count_objects)(struct shrinker *,
|
|
struct shrink_control *sc);
|
|
unsigned long (*scan_objects)(struct shrinker *,
|
|
struct shrink_control *sc);
|
|
|
|
long batch; /* reclaim batch size, 0 = default */
|
|
int seeks; /* seeks to recreate an obj */
|
|
unsigned flags;
|
|
|
|
/*
|
|
* The reference count of this shrinker. Registered shrinker have an
|
|
* initial refcount of 1, then the lookup operations are now allowed
|
|
* to use it via shrinker_try_get(). Later in the unregistration step,
|
|
* the initial refcount will be discarded, and will free the shrinker
|
|
* asynchronously via RCU after its refcount reaches 0.
|
|
*/
|
|
refcount_t refcount;
|
|
struct completion done; /* use to wait for refcount to reach 0 */
|
|
struct rcu_head rcu;
|
|
|
|
void *private_data;
|
|
|
|
/* These are for internal use */
|
|
struct list_head list;
|
|
#ifdef CONFIG_MEMCG
|
|
/* ID in shrinker_idr */
|
|
int id;
|
|
#endif
|
|
#ifdef CONFIG_SHRINKER_DEBUG
|
|
int debugfs_id;
|
|
const char *name;
|
|
struct dentry *debugfs_entry;
|
|
#endif
|
|
/* objs pending delete, per node */
|
|
atomic_long_t *nr_deferred;
|
|
};
|
|
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
|
|
|
|
/* Internal flags */
|
|
#define SHRINKER_REGISTERED BIT(0)
|
|
#define SHRINKER_ALLOCATED BIT(1)
|
|
|
|
/* Flags for users to use */
|
|
#define SHRINKER_NUMA_AWARE BIT(2)
|
|
#define SHRINKER_MEMCG_AWARE BIT(3)
|
|
/*
|
|
* It just makes sense when the shrinker is also MEMCG_AWARE for now,
|
|
* non-MEMCG_AWARE shrinker should not have this flag set.
|
|
*/
|
|
#define SHRINKER_NONSLAB BIT(4)
|
|
|
|
struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
|
|
void shrinker_register(struct shrinker *shrinker);
|
|
void shrinker_free(struct shrinker *shrinker);
|
|
|
|
static inline bool shrinker_try_get(struct shrinker *shrinker)
|
|
{
|
|
return refcount_inc_not_zero(&shrinker->refcount);
|
|
}
|
|
|
|
static inline void shrinker_put(struct shrinker *shrinker)
|
|
{
|
|
if (refcount_dec_and_test(&shrinker->refcount))
|
|
complete(&shrinker->done);
|
|
}
|
|
|
|
#ifdef CONFIG_SHRINKER_DEBUG
|
|
extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
|
|
const char *fmt, ...);
|
|
#else /* CONFIG_SHRINKER_DEBUG */
|
|
static inline __printf(2, 3)
|
|
int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
|
|
{
|
|
return 0;
|
|
}
|
|
#endif /* CONFIG_SHRINKER_DEBUG */
|
|
#endif /* _LINUX_SHRINKER_H */
|