Commit Graph

2544 Commits

Author SHA1 Message Date
Liang Jie
2f0805d7c0 ceph: streamline request head structures in MDS client
The existence of the ceph_mds_request_head_old structure in the MDS
client code is no longer required due to improvements in handling
different MDS request header versions. This patch removes the now
redundant ceph_mds_request_head_old structure and replaces its usage
with the flexible and extensible ceph_mds_request_head structure.

Changes include:
- Modification of find_legacy_request_head to directly cast the
  pointer to ceph_mds_request_head_legacy without going through the
  old structure.
- Update sizeof calculations in create_request_message to use
  offsetofend for consistency and future-proofing, rather than
  referencing the old structure.
- Use of the structured ceph_mds_request_head directly instead of the
  old one.

Additionally, this consolidation normalizes the handling of
request_head_version v1 to align with versions v2 and v3, leading to
a more consistent and maintainable codebase.

These changes simplify the codebase and reduce potential confusion
stemming from the existence of an obsolete structure.
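
The offsetofend() sizing mentioned above expresses the length of the legacy
header as "offset of the last legacy field plus its size", which stays correct
no matter how many fields newer header versions append. Below is a minimal
standalone C sketch with a locally defined offsetofend() (same meaning as the
kernel macro); the struct and its fields are hypothetical, not the real
ceph_mds_request_head layout:

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Locally defined, same meaning as the kernel's offsetofend(). */
    #define offsetofend(TYPE, MEMBER) \
        (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

    /* Hypothetical versioned request head: v1 ends at "args";
     * later versions append fields after it. */
    struct demo_request_head {
        uint16_t version;
        uint64_t oldest_client_tid;
        uint32_t op;
        uint32_t args;          /* last field present in v1 */
        uint32_t ext_num_retry; /* added by a later version */
    };

    int main(void)
    {
        /* Size of the legacy (v1) portion, independent of whatever
         * later versions have tacked on after "args". */
        size_t legacy_len = offsetofend(struct demo_request_head, args);

        printf("legacy portion: %zu bytes, full struct: %zu bytes\n",
               legacy_len, sizeof(struct demo_request_head));
        return 0;
    }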

Signed-off-by: Liang Jie <liangjie@lixiang.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-27 16:07:42 +01:00
Linus Torvalds
c159dfbdd4 Mainly individually changelogged singleton patches. The patch series in
this pull are:
 
 - "lib min_heap: Improve min_heap safety, testing, and documentation"
   from Kuan-Wei Chiu provides various tightenings to the min_heap library
   code.
 
 - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
   cleanup and Rust preparation in the xarray library code.
 
 - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
   pathnames in some code comments.
 
 - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
   new secs_to_jiffies() in various places where that is appropriate.
 
 - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
   switches two filesystems to the new mount API.
 
 - "Convert ocfs2 to use folios" from Matthew Wilcox does that.
 
 - "Remove get_task_comm() and print task comm directly" from Yafang Shao
   removes now-unneeded calls to get_task_comm() in various places.
 
 - "squashfs: reduce memory usage and update docs" from Phillip Lougher
   implements some memory savings in squashfs and performs some
   maintainability work.
 
 - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
   tightens the sort code's behaviour and adds some maintenance work.
 
 - "nilfs2: protect busy buffer heads from being force-cleared" from
   Ryusuke Konishi fixes an issue in nilfs2 when the fs is presented with a
   corrupted image.
 
 - "nilfs2: fix kernel-doc comments for function return values" from
   Ryusuke Konishi fixes some nilfs kerneldoc.
 
 - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
   addresses some nilfs BUG_ONs which syzbot was able to trigger.
 
 - "minmax.h: Cleanups and minor optimisations" from David Laight
   does some maintenance work on the min/max library code.
 
 - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
   on the xarray library code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
 jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
 mJnaZ1/ibpEpearrChCViApQtcyEGQI=
 =ti4E
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly individually changelogged singleton patches. The patch series
  in this pull are:

   - "lib min_heap: Improve min_heap safety, testing, and documentation"
     from Kuan-Wei Chiu provides various tightenings to the min_heap
     library code

   - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
     some cleanup and Rust preparation in the xarray library code

   - "Update reference to include/asm-<arch>" from Geert Uytterhoeven
     fixes pathnames in some code comments

   - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
     the new secs_to_jiffies() in various places where that is
     appropriate

   - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
     switches two filesystems to the new mount API

   - "Convert ocfs2 to use folios" from Matthew Wilcox does that

   - "Remove get_task_comm() and print task comm directly" from Yafang
     Shao removes now-unneeded calls to get_task_comm() in various
     places

   - "squashfs: reduce memory usage and update docs" from Phillip
     Lougher implements some memory savings in squashfs and performs
     some maintainability work

   - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
     tightens the sort code's behaviour and adds some maintenance work

   - "nilfs2: protect busy buffer heads from being force-cleared" from
      Ryusuke Konishi fixes an issue in nilfs2 when the fs is presented
     with a corrupted image

   - "nilfs2: fix kernel-doc comments for function return values" from
     Ryusuke Konishi fixes some nilfs kerneldoc

   - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
     addresses some nilfs BUG_ONs which syzbot was able to trigger

   - "minmax.h: Cleanups and minor optimisations" from David Laight does
     some maintenance work on the min/max library code

   - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
     work on the xarray library code"

* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
  ocfs2: use str_yes_no() and str_no_yes() helper functions
  include/linux/lz4.h: add some missing macros
  Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
  Xarray: remove repeat check in xas_squash_marks()
  Xarray: distinguish large entries correctly in xas_split_alloc()
  Xarray: move forward index correctly in xas_pause()
  Xarray: do not return sibling entries from xas_find_marked()
  ipc/util.c: complete the kernel-doc function descriptions
  gcov: clang: use correct function param names
  latencytop: use correct kernel-doc format for func params
  minmax.h: remove some #defines that are only expanded once
  minmax.h: simplify the variants of clamp()
  minmax.h: move all the clamp() definitions after the min/max() ones
  minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
  minmax.h: reduce the #define expansion of min(), max() and clamp()
  minmax.h: update some comments
  minmax.h: add whitespace around operators and after commas
  nilfs2: do not update mtime of renamed directory that is not moved
  nilfs2: handle errors that nilfs_prepare_chunk() may return
  CREDITS: fix spelling mistake
  ...
2025-01-26 17:50:53 -08:00
Linus Torvalds
f96a974170 lsm/stable-6.14 PR 20250121
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmeQFBoUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPvcA//XCdwMz0bGtWKv58nuyP8vkQx08n6
 //olz/O8te3uWK5O3kRiarzFLwH8qsHQ6A7GYalwwix34hatR4ndJE0Y/guVRWa1
 +aBmJxJ7Jm/q3fvpAEfqiSgreuE6kBoztlDOWEq+hUQGu4qfnQGm2EnvbvfFrAmN
 VheOfIQSU2KCL/Scc3FGnF6uru4WrqN0JJ9RbvrEpfdQgmcyTGLnQsZLljutWSIq
 kDWkteIr7cj3O9J45zpxZsTftvYSgVn/y1iKeXbHI4DBA1eheK12vsHB9AADKI1J
 GwHxOrnLpZtv+ICUKqcfFTmWTl+NmfJJurAT5KXKdBjL3xM5MoJlBvK1A5qE9CMo
 LaHVG/TZR2MmBaoM3EN+gvWhDgWlvT02Q/0cYaafTlVLMez3HtfctxN6OnCvTXTB
 Y8dqYClhhlBm/mHQwYfMoeKw4MftUpzEqBd1Nj7Qe8dbP0f/62Ca3K2B3D6Rf8QV
 pj3ryMlSWYV9mdTerruLNQexTGoN7l66jPwzdWpTbFeL3WmNtfCako8OZGbXgPIu
 Iahm3P+jnSVx8ZQro2c9zwdKXI5xiI335pCBbDZ8aX+JAsfj0OofHsFx5Q5diber
 M7tAEhxDqRisbpz7Ei+/LOAEGg2Z619XKg8ks4z6Y4P5PF7zEgeWTkZJk2iLbxXe
 6LLOjmF7LLw+G4M=
 =fgyr
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:

 - Improved handling of LSM "secctx" strings through lsm_context struct

   The LSM secctx string interface is from an older time when only one
   LSM was supported; migrate over to the lsm_context struct to better
   support the different LSMs we now have and make it easier to support
   new LSMs in the future.

   These changes explain the Rust, VFS, and networking changes in the
   diffstat.

 - Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are
   enabled

   Small tweak to be a bit smarter about when we build the LSM's common
   audit helpers.

 - Check for absurdly large policies from userspace in SafeSetID

   SafeSetID policy rules are fairly small, basically just "UID:UID", so
   it is easy to impose a limit of KMALLOC_MAX_SIZE on policy writes,
   which helps quiet a number of syzbot-related issues. While work is
   being done to address the syzbot issues through other mechanisms,
   this is a trivial and relatively safe fix that we can do now (a
   minimal sketch of such a size check follows this list).

 - Various minor improvements and cleanups

   A collection of improvements to the kernel selftests, constification
   of some function parameters, removing redundant assignments, and
   local variable renames to improve readability.
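
As noted in the SafeSetID item above, here is a minimal standalone sketch of
that kind of size check: refuse oversized policy writes before buffering them.
The limit value, function names and parser below are hypothetical stand-ins,
not the actual SafeSetID code.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stand-in for the kernel's KMALLOC_MAX_SIZE (arch-dependent). */
    #define DEMO_POLICY_MAX_SIZE (4u << 20)

    /* Placeholder for the real "UID:UID" rule parser. */
    static int parse_policy(const char *buf, size_t len)
    {
        (void)buf;
        (void)len;
        return 0;
    }

    /* Bound the size of a policy blob before copying and parsing it. */
    static int policy_write(const char *ubuf, size_t len)
    {
        char *kbuf;
        int ret;

        if (len == 0 || len > DEMO_POLICY_MAX_SIZE)
            return -EINVAL;        /* refuse absurdly large writes up front */

        kbuf = malloc(len);
        if (!kbuf)
            return -ENOMEM;
        memcpy(kbuf, ubuf, len);   /* copy_from_user() in the kernel */

        ret = parse_policy(kbuf, len);
        free(kbuf);
        return ret;
    }

    int main(void)
    {
        printf("%d\n", policy_write("1000:1001\n", 10));
        return 0;
    }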

* tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lockdown: initialize local array before use to quiet static analysis
  safesetid: check size of policy writes
  net: corrections for security_secid_to_secctx returns
  lsm: rename variable to avoid shadowing
  lsm: constify function parameters
  security: remove redundant assignment to return variable
  lsm: Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are set
  selftests: refactor the lsm `flags_overset_lsm_set_self_attr` test
  binder: initialize lsm_context structure
  rust: replace lsm context+len with lsm_context
  lsm: secctx provider check on release
  lsm: lsm_context in security_dentry_init_security
  lsm: use lsm_context in security_inode_getsecctx
  lsm: replace context+len with lsm_context
  lsm: ensure the correct LSM context releaser
2025-01-21 20:03:04 -08:00
Linus Torvalds
ca56a74a31 vfs-6.14-rc1.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRKQAKCRCRxhvAZXjc
 ov2dAQCULWjTBWdF8Ro2bfNeXzWvUUnSPjoLJ9B4xlrOB9c2MAEAiwkKHkzAxUco
 hCvaRJc3H2ze2wrgbIABPKB2noQVVwk=
 =4ojv
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs netfs updates from Christian Brauner:
 "This contains read performance improvements and support for monolithic
  single-blob objects that have to be read/written as such (e.g. AFS
  directory contents). The implementation of the two parts is interwoven
  as each makes the other possible.

   - Read performance improvements

      The read performance improvements are intended to address some
      loss of performance detected in cifs and, to a lesser extent, in
      afs.

     The problem is that we queue too many work items during the
     collection of read results: each individual subrequest is collected
     by its own work item, and then they have to interact with each
     other when a series of subrequests don't exactly align with the
     pattern of folios that are being read by the overall request.

     Whilst the processing of the pages covered by individual
     subrequests as they complete potentially allows folios to be woken
     in parallel and with minimum delay, it can shuffle wakeups for
     sequential reads out of order - and that is the most common I/O
     pattern.

     The final assessment and cleanup of an operation is then held up
     until the last I/O completes - and for a synchronous sequential
     operation, this means the bouncing around of work items just adds
     latency.

     Two changes have been made to make this work:

     (1) All collection is now done in a single "work item" that works
         progressively through the subrequests as they complete (and
         also dispatches retries as necessary).

      (2) For readahead and AIO, this work item can be done on a workqueue
          and run in parallel with the ultimate consumer of the data;
         for synchronous direct or unbuffered reads, the collection is
         run in the application thread and not offloaded.

     Functions such as smb2_readv_callback() then just tell netfslib
     that the subrequest has terminated; netfslib does a minimal bit of
     processing on the spot - stat counting and tracing mostly - and
     then queues/wakes up the worker. This simplifies the logic as the
     collector just walks sequentially through the subrequests as they
     complete and walks through the folios, if buffered, unlocking them
     as it goes. It also keeps to a minimum the amount of latency
     injected into the filesystem's low-level I/O handling

     The way netfs supports filesystems using the deprecated
     PG_private_2 flag is changed: folios are flagged and added to a
     write request as they complete and that takes care of scheduling
     the writes to the cache. The originating read request can then just
     unlock the pages whatever happens.

   - Single-blob object support

     Single-blob objects are files for which the content of the file
     must be read from or written to the server in a single operation
     because reading them in parts may yield inconsistent results. AFS
     directories are an example of this as there exists the possibility
     that the contents are generated on the fly and would differ between
     reads or might change due to third party interference.

     Such objects will be written to and retrieved from the cache if one
     is present, though we allow/may need to propose multiple
     subrequests to do so. The important part is that read from/write to
     the *server* is monolithic.

     Single blob reading is, for the moment, fully synchronous and does
     result collection in the application thread and, also for the
     moment, the API is supplied the buffer in the form of a folio_queue
     chain rather than using the pagecache.

   - Related afs changes

     This series makes a number of changes to the kafs filesystem,
     primarily in the area of directory handling:

      - AFS's FetchData RPC reply processing is made partially
        asynchronous which allows the netfs_io_request's outstanding
        operation counter to be removed as part of reducing the
        collection to a single work item.

      - Directory and symlink reading are plumbed through netfslib using
        the single-blob object API and are now cacheable with fscache.
        This also allows the afs_read struct to be eliminated and
        netfs_io_subrequest to be used directly instead.

      - Directory and symlink content are now stored in a folio_queue
        buffer rather than in the pagecache. This means we don't require
        the RCU read lock and xarray iteration to access it, and folios
        won't randomly disappear under us because the VM wants them
        back.

      - The vnode operation lock is changed from a mutex struct to a
        private lock implementation. The problem is that the lock now
        needs to be dropped in a separate thread and mutexes don't
        permit that.

      - When a new directory or symlink is created, we now initialise it
        locally and mark it valid rather than downloading it (we know
        what it's likely to look like).

      - We now use the in-directory hashtable to reduce the number of
        entries we need to scan when doing a lookup. The edit routines
        have to maintain the hash chains.

      - Cancellation (e.g. by signal) of an async call after the
        rxrpc_call has been set up is now offloaded to the worker thread
        as there will be a notification from rxrpc upon completion. This
        avoids a double cleanup.

   - A "rolling buffer" implementation is created to abstract out the
     two separate folio_queue chaining implementations I had (one for
     read and one for write).

   - Functions are provided to create/extend a buffer in a folio_queue
     chain and tear it down again.

     This is used to handle AFS directories, but could also be used to
     create bounce buffers for content crypto and transport crypto.

   - The was_async argument is dropped from netfs_read_subreq_terminated()

     Instead we wake the read collection work item by either queuing it
     or waking up the app thread.

   - We don't need to use BH-excluding locks when communicating between
     the issuing thread and the collection thread as neither of them now
     run in BH context.

   - Also included are a number of new tracepoints; a split of the
     netfslib write collection code to put retrying into its own file
     (it gets more complicated with content encryption).

    - There are also some minor AFS fixes included, such as fixing the
      AFS directory format struct layout, reducing some directory
      over-invalidation and making afs_mkdir() translate EEXIST to
      ENOTEMPTY (which is not available on all systems the servers
     support).

   - Finally, there's a patch to try and detect entry into the folio
     unlock function with no folio_queue structs in the buffer (which
     isn't allowed in the cases that can get there).

     This is a debugging patch, but should be minimal overhead"

* tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
  netfs: Report on NULL folioq in netfs_writeback_unlock_folios()
  afs: Add a tracepoint for afs_read_receive()
  afs: Locally initialise the contents of a new symlink on creation
  afs: Use the contained hashtable to search a directory
  afs: Make afs_mkdir() locally initialise a new directory's content
  netfs: Change the read result collector to only use one work item
  afs: Make {Y,}FS.FetchData an asynchronous operation
  afs: Fix cleanup of immediately failed async calls
  afs: Eliminate afs_read
  afs: Use netfslib for symlinks, allowing them to be cached
  afs: Use netfslib for directories
  afs: Make afs_init_request() get a key if not given a file
  netfs: Add support for caching single monolithic objects such as AFS dirs
  netfs: Add functions to build/clean a buffer in a folio_queue
  afs: Add more tracepoints to do with tracking validity
  cachefiles: Add auxiliary data trace
  cachefiles: Add some subrequest tracepoints
  netfs: Remove some extraneous directory invalidations
  afs: Fix directory format encoding struct
  afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY
  ...
2025-01-20 09:29:11 -08:00
Antoine Viallon
3b7d93db45 ceph: fix memory leak in ceph_mds_auth_match()
We now free the temporary target path substring allocation on every
possible branch, instead of omitting the default branch.  In some
cases, a memory leak occurred, which could rapidly crash the system
(depending on how many file accesses were attempted).

This was detected in production because it caused a continuous memory
growth, eventually triggering kernel OOM and completely hard-locking
the kernel.

Relevant kmemleak stacktrace:

    unreferenced object 0xffff888131e69900 (size 128):
      comm "git", pid 66104, jiffies 4295435999
      hex dump (first 32 bytes):
        76 6f 6c 75 6d 65 73 2f 63 6f 6e 74 61 69 6e 65  volumes/containe
        72 73 2f 67 69 74 65 61 2f 67 69 74 65 61 2f 67  rs/gitea/gitea/g
      backtrace (crc 2f3bb450):
        [<ffffffffaa68fb49>] __kmalloc_noprof+0x359/0x510
        [<ffffffffc32bf1df>] ceph_mds_check_access+0x5bf/0x14e0 [ceph]
        [<ffffffffc3235722>] ceph_open+0x312/0xd80 [ceph]
        [<ffffffffaa7dd786>] do_dentry_open+0x456/0x1120
        [<ffffffffaa7e3729>] vfs_open+0x79/0x360
        [<ffffffffaa832875>] path_openat+0x1de5/0x4390
        [<ffffffffaa834fcc>] do_filp_open+0x19c/0x3c0
        [<ffffffffaa7e44a1>] do_sys_openat2+0x141/0x180
        [<ffffffffaa7e4945>] __x64_sys_open+0xe5/0x1a0
        [<ffffffffac2cc2f7>] do_syscall_64+0xb7/0x210
        [<ffffffffac400130>] entry_SYSCALL_64_after_hwframe+0x77/0x7f

It can be triggered by mounting a subdirectory of a CephFS filesystem
and then trying to access files in this subdirectory with an auth token
using a path-scoped capability:

    $ ceph auth get client.services
    [client.services]
            key = REDACTED
            caps mds = "allow rw fsname=cephfs path=/volumes/"
            caps mon = "allow r fsname=cephfs"
            caps osd = "allow rw tag cephfs data=cephfs"

    $ cat /proc/self/mounts
    services@[REDACTED].cephfs=/volumes/containers /ceph/containers ceph rw,noatime,name=services,secret=<hidden>,ms_mode=prefer-crc,mount_timeout=300,acl,mon_addr=[REDACTED]:3300,recover_session=clean 0 0

    $ seq 1 1000000 | xargs -P32 --replace={} touch /ceph/containers/file-{} && \
    seq 1 1000000 | xargs -P32 --replace={} cat /ceph/containers/file-{}

[ idryomov: combine if statements, rename rc to path_matched and make
            it a bool, formatting ]
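
The shape of the fix is the classic "free on every branch" pattern: compute
the match result into a local variable (hence the rc -> path_matched rename
above) and release the temporary substring on a single common exit path. A
standalone sketch under that assumption, with hypothetical names rather than
the actual ceph_mds_auth_match() code:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static bool demo_path_matches(const char *tpath, const char *auth_path)
    {
        size_t len = strlen(auth_path);
        char *sub = malloc(len + 1);       /* temporary substring allocation */
        bool path_matched = false;

        if (!sub)
            return false;

        if (strlen(tpath) >= len) {
            memcpy(sub, tpath, len);
            sub[len] = '\0';
            if (strcmp(sub, auth_path) == 0 &&
                (tpath[len] == '/' || tpath[len] == '\0'))
                path_matched = true;
        }

        free(sub);                         /* freed on every outcome */
        return path_matched;
    }

    int main(void)
    {
        printf("%d %d\n",
               demo_path_matches("volumes/containers/a", "volumes"),
               demo_path_matches("volumes2/x", "volumes"));
        return 0;
    }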

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Antoine Viallon <antoine@lesviallon.fr>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-15 20:35:06 +01:00
Easwar Hariharan
f5ea0319ef ceph: convert timeouts to secs_to_jiffies()
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@@ constant C; @@

- msecs_to_jiffies(C * 1000)
+ secs_to_jiffies(C)

@@ constant C; @@

- msecs_to_jiffies(C * MSEC_PER_SEC)
+ secs_to_jiffies(C)
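
As a concrete before/after, a tiny standalone program with simplified local
stand-ins for the two macros (HZ picked arbitrarily) shows that the conversion
simply drops the needless round trip through milliseconds:

    #include <stdio.h>

    /* Simplified stand-ins for the kernel macros, just to show the
     * equivalence the Coccinelle rule relies on. */
    #define HZ 250
    #define MSEC_PER_SEC 1000L
    #define msecs_to_jiffies(m) ((m) * HZ / MSEC_PER_SEC)
    #define secs_to_jiffies(s)  ((s) * HZ)

    int main(void)
    {
        /* before: msecs_to_jiffies(60 * 1000) -- scale up, then divide back
         * after:  secs_to_jiffies(60)         -- no trip through ms */
        printf("%ld %ld\n",
               (long)msecs_to_jiffies(60 * 1000),
               (long)secs_to_jiffies(60));
        return 0;
    }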

Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-17-ddfefd7e9f2a@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andrew Lunn <andrew+netdev@lunn.ch>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Daniel Mack <daniel@zonque.org>
Cc: David Airlie <airlied@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jeff Johnson <jjohnson@kernel.org>
Cc: Jeff Johnson <quic_jjohnson@quicinc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Louis Peens <louis.peens@corigine.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Praveen Kaligineedi <pkaligineedi@google.com>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Shailend Chand <shailend@google.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Simon Horman <horms@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 20:21:04 -08:00
David Howells
e2d46f2ec3
netfs: Change the read result collector to only use one work item
Change the way netfslib collects read results to do all the collection for
a particular read request using a single work item that walks along the
subrequest queue as subrequests make progress or complete, unlocking folios
progressively rather than doing the unlock in parallel as parallel requests
come in.

The code is remodelled to be more like the write-side code, though only
using a single stream.  This makes it more directly comparable and thus
easier to duplicate fixes between the two sides.

This has a number of advantages:

 (1) It's simpler.  There doesn't need to be a complex donation mechanism
     to handle mismatches between the size and alignment of subrequests and
     folios.  The collector unlocks folios as the subrequests covering
     each of them complete.

 (2) It should cause less scheduler overhead as there's a single work item
     in play unlocking pages in parallel when a read gets split up into a
     lot of subrequests instead of one per subrequest.

     Whilst the parallelism is nice in theory, in practice, the vast
     majority of loads are sequential reads of the whole file, so
     committing a bunch of threads to unlocking folios out of order doesn't
     help in those cases.

 (3) It should make it easier to implement content decryption.  A folio
     cannot be decrypted until all the requests that contribute to it have
     completed - and, again, most loads are sequential and so, most of the
     time, we want to begin decryption sequentially (though it's great if
     the decryption can happen in parallel).

There is a disadvantage in that we're losing the ability to decrypt and
unlock things on an as-things-arrive basis which may affect some
applications.
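
To show the shape of the single-collector approach, here is a hedged,
standalone C sketch: one pass walks a queue of hypothetical subrequests in
submission order and reports how far the in-order frontier has advanced,
i.e. the point up to which folios could be unlocked. Types and names are
invented for illustration and are not netfslib code:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical subrequest: a contiguous byte range plus a done flag. */
    struct demo_subreq {
        size_t start, len;
        bool done;
    };

    /* Walk subrequests in order; folios up to the returned offset could
     * be unlocked.  Anything past the first unfinished subrequest must
     * wait, which is what keeps wakeups sequential. */
    static size_t collect_in_order(const struct demo_subreq *q, size_t n)
    {
        size_t collected = 0;

        for (size_t i = 0; i < n; i++) {
            if (!q[i].done)
                break;                 /* the frontier stops here */
            collected = q[i].start + q[i].len;
        }
        return collected;
    }

    int main(void)
    {
        struct demo_subreq q[] = {
            { 0,    4096, true  },
            { 4096, 4096, false },     /* still in flight */
            { 8192, 4096, true  },     /* completed out of order */
        };

        printf("collected up to byte %zu\n", collect_in_order(q, 3));
        return 0;
    }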

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-28-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:34:08 +01:00
David Howells
31fc366aa7
netfs: Drop the was_async arg from netfs_read_subreq_terminated()
Drop the was_async argument from netfs_read_subreq_terminated().  Almost
every caller is in process context and passes false.  Some
filesystems delegate the call to a workqueue to avoid doing the work in
their network message queue parsing thread.

The only exception is netfs_cache_read_terminated() which handles
completion in the cache - which is usually a callback from the backing
filesystem in softirq context, though it can be from process context if an
error occurred.  In this case, delegate to a workqueue.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wiVC5Cgyz6QKXFu6fTaA6h4CjexDR-OV9kL6Vo5x9v8=A@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-10-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:34:03 +01:00
David Howells
360157829e
netfs: Drop the error arg from netfs_read_subreq_terminated()
Drop the error argument from netfs_read_subreq_terminated() in favour of
passing the value in subreq->error.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-9-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-20 22:34:03 +01:00
Ilya Dryomov
18d44c5d06 ceph: allocate sparse_ext map only for sparse reads
If mounted with the sparseread option, ceph_direct_read_write() ends up
making an unnecessary allocation for O_DIRECT writes.

Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16 23:25:44 +01:00
Ilya Dryomov
66e0c4f914 ceph: fix memory leak in ceph_direct_read_write()
The bvecs array which is allocated in iter_get_bvecs_alloc() is leaked
and pages remain pinned if ceph_alloc_sparse_ext_map() fails.

There is no need to delay the allocation of sparse_ext map until after
the bvecs array is set up, so fix this by moving sparse_ext allocation
a bit earlier.  Also, make a similar adjustment in __ceph_sync_read()
for consistency (a leak of the same kind in __ceph_sync_read() has been
addressed differently).

Cc: stable@vger.kernel.org
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16 23:25:44 +01:00
Alex Markuze
9abee47580 ceph: improve error handling and short/overflow-read logic in __ceph_sync_read()
This patch refines the read logic in __ceph_sync_read() to ensure more
predictable and efficient behavior in various edge cases.

- Return early if the requested read length is zero or if the file size
  (`i_size`) is zero.
- Initialize the index variable (`idx`) where needed and reorder some
  code to ensure it is always set before use.
- Improve error handling by checking for negative return values earlier.
- Remove redundant encrypted file checks after failures. Only attempt
  filesystem-level decryption if the read succeeded.
- Simplify leftover calculations to correctly handle cases where the
  read extends beyond the end of the file or stops short.  This can be
  hit by continuously reading a file while, on another client, we keep
  truncating and writing new data into it.
- This resolves multiple issues caused by integer overflow and the
  consequent buffer overflow (the `pages` array being accessed beyond
  `num_pages`):
  - https://tracker.ceph.com/issues/67524
  - https://tracker.ceph.com/issues/68980
  - https://tracker.ceph.com/issues/68981
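
Below is a standalone sketch of the leftover/clamping arithmetic referred to
above, under the assumption that the number of bytes copied back to the caller
must never exceed either the currently observed i_size or what the OSD
actually returned; names and parameters are illustrative, not the
__ceph_sync_read() code:

    #include <stdint.h>
    #include <stdio.h>

    /* Clamp how much of a read may be copied out, so that a racing
     * truncate or a short OSD reply can never push the copy beyond
     * the pages that were allocated for the request. */
    static uint64_t bytes_to_copy(uint64_t i_size, uint64_t off,
                                  uint64_t requested, uint64_t returned)
    {
        uint64_t copy;

        if (off >= i_size || requested == 0)
            return 0;                  /* read starts at or after EOF */

        copy = requested;
        if (copy > i_size - off)
            copy = i_size - off;       /* clamp to EOF */
        if (copy > returned)
            copy = returned;           /* clamp to a short read */
        return copy;
    }

    int main(void)
    {
        /* file shrank to 6000 bytes while 8192 were requested at offset 4096 */
        printf("%llu\n",
               (unsigned long long)bytes_to_copy(6000, 4096, 8192, 8192));
        return 0;
    }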

Cc: stable@vger.kernel.org
Fixes: 1065da21e5 ("ceph: stop copying to iter at EOF on sync reads")
Reported-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Ilya Dryomov
12eb22a5a6 ceph: validate snapdirname option length when mounting
It becomes a path component, so it shouldn't exceed NAME_MAX
characters.  This was hardened in commit c152737be2 ("ceph: Use
strscpy() instead of strcpy() in __get_snap_name()"), but no actual
check was put in place.
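
A minimal standalone sketch of the kind of check described, assuming the rule
is simply "reject option values longer than NAME_MAX with -ENAMETOOLONG"; the
helper name is hypothetical:

    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    #ifndef NAME_MAX
    #define NAME_MAX 255    /* per-component limit enforced by the VFS */
    #endif

    /* A mount-option value that becomes a path component must fit in
     * NAME_MAX, otherwise fail the mount with -ENAMETOOLONG. */
    static int validate_snapdirname(const char *name)
    {
        if (strlen(name) > NAME_MAX)
            return -ENAMETOOLONG;
        return 0;
    }

    int main(void)
    {
        char too_long[NAME_MAX + 2];

        memset(too_long, 'a', sizeof(too_long) - 1);
        too_long[sizeof(too_long) - 1] = '\0';

        printf("%d %d\n", validate_snapdirname(".snap"),
               validate_snapdirname(too_long));
        return 0;
    }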

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16 23:25:43 +01:00
Max Kellermann
550f7ca98e ceph: give up on paths longer than PATH_MAX
If the full path to be built by ceph_mdsc_build_path() happens to be
longer than PATH_MAX, then this function will enter an endless (retry)
loop, effectively blocking the whole task.  Most of the machine
becomes unusable, making this a very simple and effective DoS
vulnerability.

I cannot imagine why this retry was ever implemented, but it seems
rather useless and harmful to me.  Let's remove it and fail with
ENAMETOOLONG instead.

Cc: stable@vger.kernel.org
Reported-by: Dario Weißer <dario@cure53.de>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Max Kellermann
d6fd6f8280 ceph: fix memory leaks in __ceph_sync_read()
In two `break` statements, the call to ceph_release_page_vector() was
missing, leaking the allocation from ceph_alloc_page_vector().

Instead of adding the missing ceph_release_page_vector() calls, the
Ceph maintainers preferred to transfer page ownership to the
`ceph_osd_request` by passing `own_pages=true` to
osd_req_op_extent_osd_data_pages().  This requires postponing the
ceph_osdc_put_request() call until after the block that accesses the
`pages`.

Cc: stable@vger.kernel.org
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Fixes: f0fe1e54cf ("ceph: plumb in decryption during reads")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Casey Schaufler
b530104f50 lsm: lsm_context in security_dentry_init_security
Replace the (secctx,seclen) pointer pair with a single lsm_context
pointer to allow return of the LSM identifier along with the context
and context length. This allows security_release_secctx() to know how
to release the context. Callers have been modified to use or save the
returned data from the new structure.

Cc: ceph-devel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04 14:58:51 -05:00
Casey Schaufler
6fba89813c lsm: ensure the correct LSM context releaser
Add a new lsm_context data structure to hold all the information about a
"security context", including the string, its size and which LSM allocated
the string. The allocation information is necessary because LSMs have
different policies regarding the lifecycle of these strings. SELinux
allocates and destroys them on each use, whereas Smack provides a pointer
to an entry in a list that never goes away.

Update security_release_secctx() to use the lsm_context instead of a
(char *, len) pair. Change its callers to do likewise.  The LSMs
supporting this hook have had comments added to remind the developer
that there is more work to be done.

The BPF security module provides all LSM hooks. While there has yet to
be a known instance of a BPF configuration that uses security contexts,
the possibility is real. In the existing implementation there is
potential for multiple frees in that case.
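
A rough standalone sketch of the idea: one structure carrying the string, its
length and the allocating LSM's identity, so that the release path can apply
the right lifetime rule. Field and enum names are assumptions based on the
description above, not copied from the kernel header:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    enum demo_lsm_id { DEMO_LSM_SELINUX = 1, DEMO_LSM_SMACK = 2 };

    struct demo_lsm_context {
        char *context;      /* security context string */
        unsigned int len;   /* its length */
        int id;             /* which LSM allocated it */
    };

    static void demo_release_ctx(struct demo_lsm_context *cp)
    {
        /* SELinux-style contexts are per-call allocations and must be
         * freed; Smack-style contexts point into a long-lived list and
         * must not be. */
        if (cp->id == DEMO_LSM_SELINUX)
            free(cp->context);
        cp->context = NULL;
        cp->len = 0;
    }

    int main(void)
    {
        struct demo_lsm_context cp = {
            .context = strdup("system_u:object_r:etc_t:s0"),
            .id = DEMO_LSM_SELINUX,
        };

        if (!cp.context)
            return 1;
        cp.len = (unsigned int)strlen(cp.context);
        printf("%u bytes from LSM %d\n", cp.len, cp.id);
        demo_release_ctx(&cp);
        return 0;
    }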

Cc: linux-integrity@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: audit@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: linux-nfs@vger.kernel.org
Cc: Todd Kjos <tkjos@google.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04 10:46:26 -05:00
Linus Torvalds
9d0ad04553 A fix for the mount "device" string parser from Patrick and two cred
reference counting fixups from Max, marked for stable.  Also included
 a number of cleanups and a tweak to MAINTAINERS to avoid unnecessarily
 CCing netdev list.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmdJ/sMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8EOB/9Jhq1nOe0dN7aAWN1owZH85TXmOLuX
 eSS79AJp63lmJgx+mF0CbLLN6Vwjvm1vqz10Uhe5VCmqtxKy1/F4QxEOwk+zEMwT
 iGkM+6nUtMMqnxqItJpFC19YQONwgidsNcbi7v8nDEHqH8FXEC4R0pi0990bUQSj
 E8zVzq44TNFQrhWjDJHPnXsxbH9SijRuwu1O4KEZ2HK0QQ8LfpPptozJawH0p2Hs
 Wc6V6ppt7o9F9MHW137I4BOG9xm18aAQa9Ztd5GHhip63SuLpdQNwoM9JhC8J/bt
 bcci5CoCa34P57g9/1kh9ov/NXf9XgjSCFlDzk9zZH0IxaX5CYgTqYL5
 =xf5D
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A fix for the mount "device" string parser from Patrick and two cred
  reference counting fixups from Max, marked for stable.

  Also included a number of cleanups and a tweak to MAINTAINERS to avoid
  unnecessarily CCing netdev list"

* tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client:
  ceph: fix cred leak in ceph_mds_check_access()
  ceph: pass cred pointer to ceph_mds_auth_match()
  ceph: improve caps debugging output
  ceph: correct ceph_mds_cap_peer field name
  ceph: correct ceph_mds_cap_item field name
  ceph: miscellaneous spelling fixes
  ceph: Use strscpy() instead of strcpy() in __get_snap_name()
  ceph: Use str_true_false() helper in status_show()
  ceph: requalify some char pointers as const
  ceph: extract entity name from device id
  MAINTAINERS: exclude net/ceph from networking
  ceph: Remove fs/ceph deadcode
  libceph: Remove unused ceph_crypto_key_encode
  libceph: Remove unused ceph_osdc_watch_check
  libceph: Remove unused pagevec functions
  libceph: Remove unused ceph_pagelist functions
2024-11-30 10:22:38 -08:00
Max Kellermann
c5cf420303 ceph: fix cred leak in ceph_mds_check_access()
get_current_cred() increments the reference counter, but the
put_cred() call was missing.

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-25 15:51:13 +01:00
Max Kellermann
23426309a4 ceph: pass cred pointer to ceph_mds_auth_match()
This eliminates a redundant get_current_cred() call, because
ceph_mds_check_access() has already obtained this pointer.

As a side effect, this also fixes a reference leak in
ceph_mds_auth_match(): by omitting the get_current_cred() call, no
additional cred reference is taken.

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-25 15:51:13 +01:00
Patrick Donnelly
8ea412e181 ceph: improve caps debugging output
This improves uniformity and exposes important sequence numbers.

Now looks like:

    <7>[   73.749563] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds2 op export ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 0 iseq 0 mseq 0
    ...
    <7>[   73.749574] ceph:           caps.c:4102 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  cap 20000000000.fffffffffffffffe export to peer 1 piseq 1 pmseq 1
    ...
    <7>[   73.749645] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds1 op import ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 1 iseq 1 mseq 1
    ...
    <7>[   73.749681] ceph:           caps.c:4244 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  cap 20000000000.fffffffffffffffe import from peer 2 piseq 686 pmseq 0
    ...
    <7>[  248.645596] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds1 op revoke ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 2538 iseq 1 mseq 1

See also ceph.git commit cb4ff28af09f ("mds: add issue_seq to all cap
messages").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:47:16 +01:00
Patrick Donnelly
8b41ac43c7 ceph: correct ceph_mds_cap_peer field name
The peer seq is used as the issue_seq. Use that name for consistency.
See also ceph.git commit 1da6ef237fc7 ("include/ceph_fs: correct
ceph_mds_cap_peer field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Patrick Donnelly
50f42c4895 ceph: correct ceph_mds_cap_item field name
The issue_seq is sent with bulk cap releases, not the current sequence
number. See also ceph.git commit 655cddb7c9f3 ("include/ceph_fs: correct
ceph_mds_cap_item field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Linus Torvalds
56be9aaf98 vfs-6.13.pagecache
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
 onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
 Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
 =LbP6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagecache updates from Christian Brauner:
 "Cleanup filesystem page flag usage: This continues the work to make
  the mappedtodisk/owner_2 flag available to filesystems which don't use
  buffer heads. Further patches remove uses of Private2. This brings us
  very close to being rid of it entirely"

* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  migrate: Remove references to Private2
  ceph: Remove call to PagePrivate2()
  btrfs: Switch from using the private_2 flag to owner_2
  mm: Remove PageMappedToDisk
  nilfs2: Convert nilfs_copy_buffer() to use folios
  fs: Move clearing of mappedtodisk to buffer.c
2024-11-18 09:54:32 -08:00
Dmitry Antipov
3500000bb1 ceph: miscellaneous spelling fixes
Correct spelling here and there as suggested by codespell.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Abdul Rahim
c152737be2 ceph: Use strscpy() instead of strcpy() in __get_snap_name()
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehaviors [1].

This fixes checkpatch warning:
    WARNING: Prefer strscpy over strcpy

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy

[ idryomov: formatting ]
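
For illustration, a standalone program with a simplified local copy of the
strscpy() contract (copy at most size-1 characters, always NUL-terminate,
return the number of characters copied or -E2BIG on truncation) shows what
the bounded copy buys over strcpy():

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* Simplified local version with the same contract as the kernel helper. */
    static ssize_t demo_strscpy(char *dst, const char *src, size_t size)
    {
        size_t len = strlen(src);

        if (size == 0)
            return -E2BIG;
        if (len >= size) {
            memcpy(dst, src, size - 1);
            dst[size - 1] = '\0';
            return -E2BIG;          /* truncated, but still NUL-terminated */
        }
        memcpy(dst, src, len + 1);
        return (ssize_t)len;
    }

    int main(void)
    {
        char name[8];

        /* strcpy() would write past the end of name[] here; the bounded
         * copy truncates and reports it instead. */
        printf("%zd '%s'\n",
               demo_strscpy(name, "verylongsnapname", sizeof(name)), name);
        return 0;
    }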

Signed-off-by: Abdul Rahim <abdul.rahim@myyahoo.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Thorsten Blum
e50f960bea ceph: Use str_true_false() helper in status_show()
Remove hard-coded strings by using the str_true_false() helper function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Patrick Donnelly
64cf95d0b1 ceph: requalify some char pointers as const
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Patrick Donnelly
955710afcb ceph: extract entity name from device id
Previously, the "name" in the new device syntax "<name>@<fsid>.<fsname>"
was ignored because (presumably) tests were done using mount.ceph which
also passed the entity name using "-o name=foo". If mounting is done
without the mount.ceph helper, the new device id syntax fails to set
the name properly.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/68516
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:35 +01:00
Dr. David Alan Gilbert
6025b482e4 ceph: Remove fs/ceph deadcode
ceph_caps_revoking() has been unused since 2017's commit
3fb99d483e ("ceph: nuke startsync op")

ceph_mdsc_open_export_target_sessions() has been unused since 2013's
commit 11df2dfb61 ("ceph: add imported caps when handling cap export message")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:35 +01:00
Linus Torvalds
a3a37691e6 A fix from Patrick for a variety of CephFS lockup scenarios caused by
a regression in cap handling which sneaked in through the netfs helper
 library in 5.18 (marked for stable) and an unrelated one-line cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmcACf8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3DiCACDP9ZubGLZ2GKv3fEyXEURnOyrSBAd
 M4Ci3cc31W0Q1rWtQndGxW9QF/9smcp4kN/xgelWfSZc6FJ1HLR1RHUPjMJpBPx2
 dQs+RaaBTIe0coVaMmiDnBQ1wnfuNhEZH8opnLonjHygANnoR2aYZpyIxpgYD7eZ
 KHfk1F/2rmE+QzYGB+tE7sCpRs3bhbtL2xa/2yQJyw+EWfgyJUjPW1cZe0liiWHi
 dXTMxJqfZLLUW2otLdGnaaZS+7yF3ItcoAWh6cUoiZ9cpqDu43B0RGXOV7Kegui1
 2e8W0sK3QIIYbUKoHo/4VPqrmN4puSjeNteumDvpZE4SgG8BfKWzQnmh
 =3OKU
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix from Patrick for a variety of CephFS lockup scenarios caused by
  a regression in cap handling which sneaked in through the netfs helper
  library in 5.18 (marked for stable) and an unrelated one-line cleanup"

* tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client:
  ceph: fix cap ref leak via netfs init_request
  ceph: use struct_size() helper in __ceph_pool_perm_get()
2024-10-04 10:10:23 -07:00
Matthew Wilcox (Oracle)
fd15ba4cb0
ceph: Remove call to PagePrivate2()
Use the folio that we already have to call folio_test_private_2()
instead.  This is the last call to PagePrivate2(), so replace its
PAGEFLAG() definition with FOLIO_FLAG().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241002040111.1023018-6-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-04 09:24:25 +02:00
Patrick Donnelly
ccda9910d8 ceph: fix cap ref leak via netfs init_request
Log recovered from a user's cluster:

    <7>[ 5413.970692] ceph:  get_cap_refs 00000000958c114b ret 1 got Fr
    <7>[ 5413.970695] ceph:  start_read 00000000958c114b, no cache cap
    ...
    <7>[ 5473.934609] ceph:   my wanted = Fr, used = Fr, dirty -
    <7>[ 5473.934616] ceph:  revocation: pAsLsXsFr -> pAsLsXs (revoking Fr)
    <7>[ 5473.934632] ceph:  __ceph_caps_issued 00000000958c114b cap 00000000f7784259 issued pAsLsXs
    <7>[ 5473.934638] ceph:  check_caps 10000000e68.fffffffffffffffe file_want - used Fr dirty - flushing - issued pAsLsXs revoking Fr retain pAsLsXsFsr  AUTHONLY NOINVAL FLUSH_FORCE

The MDS subsequently complains that the kernel client is late releasing
caps.

Approximately, a series of changes to this code by commits 4987005600
("ceph: convert ceph_readpages to ceph_readahead"), 2de1604173
("netfs: Change ->init_request() to return an error code") and
a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
resulted in subtle resource cleanup being missed. The main culprit is
the change in error handling in 2de1604173 which meant that a failure
in init_request() would no longer cause cleanup to be called. That
would prevent the ceph_put_cap_refs() call which would clean up the
leaked cap ref.

Cc: stable@vger.kernel.org
Fixes: a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
Link: https://tracker.ceph.com/issues/67008
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Thorsten Blum
7264745d55 ceph: use struct_size() helper in __ceph_pool_perm_get()
Use struct_size() to calculate the number of bytes to be allocated.
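
For reference, struct_size(p, member, n) computes the allocation size of a
structure that ends in a flexible array member; the kernel macro in
<linux/overflow.h> additionally saturates on overflow. Below is a standalone
sketch with a simplified local macro and a hypothetical struct:

    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-in; the real struct_size() also guards against
     * integer overflow by saturating to SIZE_MAX. */
    #define demo_struct_size(p, member, count) \
        (sizeof(*(p)) + (count) * sizeof(*(p)->member))

    /* Hypothetical record with a flexible array member, similar in shape
     * to the per-pool permission entries being allocated. */
    struct demo_pool_perm {
        int pool;
        size_t pool_ns_len;
        char pool_ns[];        /* flexible array member */
    };

    int main(void)
    {
        size_t ns_len = 16;
        struct demo_pool_perm *perm =
            malloc(demo_struct_size(perm, pool_ns, ns_len + 1));

        if (!perm)
            return 1;
        perm->pool_ns_len = ns_len;
        printf("allocated %zu bytes\n",
               demo_struct_size(perm, pool_ns, ns_len + 1));
        free(perm);
        return 0;
    }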

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Linus Torvalds
894b3c35d1 Three CephFS fixes from Xiubo and Luis and a bunch of assorted
cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmb3JroTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizDiB/0elHQQaFxXMjuJRY1IzohozAHi0cHK
 gwgE4nEbECE8vRYK/QvyvZ3S+ep+N+r6jOIiIDyqhjtlY3//oSyyxL7RjMJlVFBq
 Ie37w8r4q1aL1mn9QDQ4iQxcRYyU+JxcUcPR1UUUvLiKgWaRixmq27zby/WQSrkA
 ke2ScBRDtEAYVtdxvxmUJK/DrPr3skwJAGY52KesjwgVhXSL8KG9X1zMUbWdJYDV
 THbQzLZsu4NVh7LlAsS/mh+z0EIZsXxQYU5IY3dIVEYcuLK93lXRGZb+7whtmUef
 wsDtYIe/w30QVxFdrN28qAQp8daUJhp+3t0EZSyecRcq5OPey6ICx1P4
 =+bdB
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Three CephFS fixes from Xiubo and Luis and a bunch of assorted
  cleanups"

* tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client:
  ceph: remove the incorrect Fw reference check when dirtying pages
  ceph: Remove empty definition in header file
  ceph: Fix typo in the comment
  ceph: fix a memory leak on cap_auths in MDS client
  ceph: flush all caps releases when syncing the whole filesystem
  ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
  libceph: use min() to simplify code in ceph_dns_resolve_name()
  ceph: Convert to use jiffies macro
  ceph: Remove unused declarations
2024-09-28 08:40:36 -07:00
Xiubo Li
c08dfb1b49 ceph: remove the incorrect Fw reference check when dirtying pages
When doing direct-io reads it will also try to mark pages dirty, but
the read path doesn't hold the Fw caps and in no case will it get the
Fw reference.

Fixes: 5dda377cf0 ("ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Zhang Zekun
74249188f3 ceph: Remove empty definition in header file
The real definition of ceph_acl_chmod() was removed by
commit 4db658ea0c ("ceph: Fix up after semantic merge conflict"),
leaving the empty definition untouched in the header file. Let's
remove the empty definition.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Yan Zhen
0039aebfe8 ceph: Fix typo in the comment
Correctly spelled comments make it easier for the reader to understand
the code.

Replace 'tagert' with 'target', 'vaild' with 'valid', 'carefull' with
'careful', and 'trsaverse' with 'traverse' in the comments.

Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Luis Henriques (SUSE)
d97079e97e ceph: fix a memory leak on cap_auths in MDS client
The cap_auths that are allocated during an MDS session opening are never
released, causing a memory leak detected by kmemleak.  Fix this by freeing
the memory allocated when shutting down the MDS client.

Fixes: 1d17de9534 ("ceph: save cap_auths in MDS client when session is opened")
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Xiubo Li
adc5246176 ceph: flush all caps releases when syncing the whole filesystem
We have hit a race between cap releases and a cap revoke request
that causes check_caps() to miss sending a cap revoke ack to the
MDS. The client then depends on the cap release to release the
revoking caps, which could be delayed for some unknown reasons.

In kclient we have figured out the RCA of the race, and a way to
explicitly trigger this flush manually helps to get rid of the
stuck cap revoke issue.

Link: https://tracker.ceph.com/issues/67221
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:28 +02:00
Xiubo Li
c085f6ca95 ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
Prepare for adding a helper to flush the cap releases for all
sessions.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:19 +02:00
Linus Torvalds
35219bc5c7 vfs-6.12.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 onQWAQD6IxAKPU0zom2FoWNilvSzPs7WglTtvddX9pu/lT1RNAD/YC/wOLW8mvAv
 9oTAmigQDQQhEWdJA9RgLZBiw7k+DAw=
 =zWFb
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull netfs updates from Christian Brauner:
 "This contains the work to improve read/write performance for the new
  netfs library.

  The main performance enhancing changes are:

   - Define a structure, struct folio_queue, and a new iterator type,
     ITER_FOLIOQ, to hold a buffer as a replacement for ITER_XARRAY. See
     that patch for questions about naming and form.

     ITER_FOLIOQ is provided as a replacement for ITER_XARRAY. The
     problem with an xarray is that accessing it requires the use of a
     lock (typically the RCU read lock) - and this means that we can't
     supply iterate_and_advance() with a step function that might sleep
     (crypto for example) without having to drop the lock between pages.
     ITER_FOLIOQ is the iterator for a chain of folio_queue structs,
     where each folio_queue holds a small list of folios. A folio_queue
     struct is a simpler structure than xarray and is not subject to
     concurrent manipulation by the VM. folio_queue is used rather than
     a bvec[] as it can form lists of indefinite size, adding to one end
     and removing from the other on the fly.

   - Provide a copy_folio_from_iter() wrapper.

   - Make cifs RDMA support ITER_FOLIOQ.

   - Use folio queues in the write-side helpers instead of xarrays.

   - Add a function to reset the iterator in a subrequest.

   - Simplify the write-side helpers to use sheaves to skip gaps rather
     than trying to work out where gaps are.

   - In afs, make the read subrequests asynchronous, putting them into
     work items to allow the next patch to do progressive
     unlocking/reading.

   - Overhaul the read-side helpers to improve performance.

   - Fix the caching of a partial block at the end of a file.

   - Allow a store to be cancelled.

  Then some changes for cifs to make it use folio queues instead of
  xarrays for crypto bufferage:

   - Use raw iteration functions rather than manually coding iteration
     when hashing data.

   - Switch to using folio_queue for crypto buffers.

   - Remove the xarray bits.

  Make some adjustments to the /proc/fs/netfs/stats file such that:

   - All the netfs stats lines begin 'Netfs:' but change this to
     something a bit more useful.

   - Add a couple of stats counters to track the numbers of skips and
     waits on the per-inode writeback serialisation lock to make it
     easier to check for this as a source of performance loss.

  Miscellaneous work:

   - Ensure that the sb_writers lock is taken around
     vfs_{set,remove}xattr() in the cachefiles code.

   - Reduce the number of conditional branches in netfs_perform_write().

   - Move the CIFS_INO_MODIFIED_ATTR flag to the netfs_inode struct and
     remove cifs_post_modify().

   - Move the max_len/max_nr_segs members from netfs_io_subrequest to
     netfs_io_request as they're only needed for one subreq at a time.

   - Add an 'unknown' source value for tracing purposes.

   - Remove NETFS_COPY_TO_CACHE as it's no longer used.

   - Set the request work function up front at allocation time.

   - Use bh-disabling spinlocks for rreq->lock as cachefiles completion
     may be run from block-filesystem DIO completion in softirq context.

   - Remove fs/netfs/io.c"
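
As a toy illustration of the struct folio_queue idea described above (a chain
of small fixed-size batches that grows at the tail and is consumed from the
head, with no global lock and no reallocation), here is a standalone C sketch;
it uses plain pointers in place of folios and is not the kernel structure:

    #include <stdio.h>
    #include <stdlib.h>

    #define DEMO_BATCH 8    /* each segment holds a small fixed batch */

    struct demo_queue_seg {
        void *entries[DEMO_BATCH];
        int nr;
        struct demo_queue_seg *next;
    };

    /* Append an entry, extending the chain when the tail segment is full. */
    static struct demo_queue_seg *demo_append(struct demo_queue_seg *tail,
                                              void *entry)
    {
        if (!tail || tail->nr == DEMO_BATCH) {
            struct demo_queue_seg *seg = calloc(1, sizeof(*seg));

            if (!seg)
                return NULL;
            if (tail)
                tail->next = seg;
            tail = seg;
        }
        tail->entries[tail->nr++] = entry;
        return tail;
    }

    int main(void)
    {
        struct demo_queue_seg *head = NULL, *tail = NULL;
        int vals[20], segs = 0;

        for (int i = 0; i < 20; i++) {
            vals[i] = i;
            tail = demo_append(tail, &vals[i]);
            if (!head)
                head = tail;
        }

        for (struct demo_queue_seg *s = head; s; s = s->next)
            segs++;
        printf("20 entries in %d segments\n", segs);

        while (head) {                 /* consume from the head */
            struct demo_queue_seg *next = head->next;

            free(head);
            head = next;
        }
        return 0;
    }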

* tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  docs: filesystems: corrected grammar of netfs page
  cifs: Don't support ITER_XARRAY
  cifs: Switch crypto buffer to use a folio_queue rather than an xarray
  cifs: Use iterate_and_advance*() routines directly for hashing
  netfs: Cancel dirty folios that have no storage destination
  cachefiles, netfs: Fix write to partial block at EOF
  netfs: Remove fs/netfs/io.c
  netfs: Speed up buffered reading
  afs: Make read subreqs async
  netfs: Simplify the writeback code
  netfs: Provide an iterator-reset function
  netfs: Use new folio_queue data type and iterator instead of xarray iter
  cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
  iov_iter: Provide copy_folio_from_iter()
  mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
  netfs: Use bh-disabling spinlocks for rreq->lock
  netfs: Set the request work function upon allocation
  netfs: Remove NETFS_COPY_TO_CACHE
  netfs: Reserve netfs_sreq_source 0 as unset/unknown
  netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
  ...
2024-09-16 12:13:31 +02:00
Linus Torvalds
3352633ce6 vfs-6.12.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 osS0AQCgIpvey9oW5DMyMw6Bv0hFMRv95gbNQZfHy09iK+NMNAD9GALhb/4cMIVB
 7YrZGXEz454lpgcs8AnrOVjVNfctOQg=
 =e9s9
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file updates from Christian Brauner:
 "This is the work to cleanup and shrink struct file significantly.

  Right now, (focusing on x86) struct file is 232 bytes. After this
  series struct file will be 184 bytes aka 3 cachelines and a spare 8
  bytes for future extensions at the end of the struct.

  With struct file being as ubiquitous as it is this should make a
  difference for file heavy workloads and allow further optimizations in
  the future.

   - struct fown_struct was embedded into struct file letting it take up
     32 bytes in total when really it shouldn't even be embedded in
     struct file in the first place. Instead, actual users of struct
     fown_struct now allocate the struct on demand. This frees up 24
     bytes.

   - Move struct file_ra_state into the union containing the cleanup hooks
     and move f_iocb_flags out of the union. This closes a 4 byte hole
     we created earlier and brings struct file to 192 bytes, which means
     struct file is 3 cachelines and we managed to shrink it by 40
     bytes.

   - Reorder struct file so that nothing crosses a cacheline.

     I suspect that in the future we will end up reordering some members
     to mitigate false sharing issues or just because someone does
     actually provide really good perf data.

   - Shrinking struct file to 192 bytes is only part of the work.

     Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache
     is created with SLAB_TYPESAFE_BY_RCU the free pointer must be
     located outside of the object because the cache doesn't know what
     part of the memory can safely be overwritten as it may be needed to
     prevent object recycling.

     That has the consequence that SLAB_TYPESAFE_BY_RCU may end up
     adding a new cacheline.

     So this also contains work to add a new kmem_cache_create_rcu()
     function that allows the caller to specify an offset where the
     freelist pointer is supposed to be placed. Thus avoiding the
     implicit addition of a fourth cacheline (see the sketch after this
     message).

   - And finally this removes the f_version member in struct file.

     The f_version member isn't particularly well-defined. It is mainly
     used as a cookie to detect concurrent seeks when iterating
     directories. But it is also abused by some subsystems for
     completely unrelated things.

     It is mostly a directory and filesystem specific thing that doesn't
     really need to live in struct file and with its wonky semantics it
     really lacks a specific function.

     For pipes, f_version is (ab)used to defer poll notifications until
     a write has happened. And struct pipe_inode_info is used by
     multiple struct files in their ->private_data so there's no chance
     of pushing that down into file->private_data without introducing
     another pointer indirection.

     But pipes don't rely on f_pos_lock so this adds a union into struct
     file encompassing f_pos_lock and a pipe specific f_pipe member that
     pipes can use. This union of course can be extended to other file
     types and is similar to what we do in struct inode already"
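
A minimal sketch of how a SLAB_TYPESAFE_BY_RCU cache with an explicit
freelist-pointer offset might be set up, assuming the
kmem_cache_create_rcu() interface described in the item above; the
struct and names are hypothetical, not the actual fs/file_table.c change:

    #include <linux/slab.h>

    /* Hypothetical object: the free pointer reuses a dedicated slot instead
     * of forcing the allocator to grow the object by another cacheline. */
    struct foo_obj {
            spinlock_t lock;
            void *payload;
            freeptr_t freeptr;      /* only touched by the allocator once freed */
    };

    static struct kmem_cache *foo_cachep;

    static int __init foo_cache_init(void)
    {
            foo_cachep = kmem_cache_create_rcu("foo_cache", sizeof(struct foo_obj),
                                               offsetof(struct foo_obj, freeptr),
                                               SLAB_TYPESAFE_BY_RCU | SLAB_ACCOUNT);
            return foo_cachep ? 0 : -ENOMEM;
    }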

* tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  fs: remove f_version
  pipe: use f_pipe
  fs: add f_pipe
  ubifs: store cookie in private data
  ufs: store cookie in private data
  udf: store cookie in private data
  proc: store cookie in private data
  ocfs2: store cookie in private data
  input: remove f_version abuse
  ext4: store cookie in private data
  ext2: store cookie in private data
  affs: store cookie in private data
  fs: add generic_llseek_cookie()
  fs: use must_set_pos()
  fs: add must_set_pos()
  fs: add vfs_setpos_cookie()
  s390: remove unused f_version
  ceph: remove unused f_version
  adi: remove unused f_version
  mm: Removed @freeptr_offset to prevent doc warning
  ...
2024-09-16 09:14:02 +02:00
Linus Torvalds
2775df6e5e vfs-6.12.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 ou77AQD3U1KjbdgzbUi6kaUmiiWOPhfYTlm8mho8dBjqvTCB+AD/XTWSFCWWhHB4
 KyQZTbjRD81xmVNbKjASazp0EA6Ahwc=
 =gIsD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs folio updates from Christian Brauner:
 "This contains work to port write_begin and write_end to rely on folios
  for various filesystems.

  This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
  ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
  squashfs.

  After this series lands a bunch of the filesystems in this list do not
  mention struct page anymore"

* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
  Squashfs: Ensure all readahead pages have been used
  Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
  Squashfs: Update squashfs_readpage_block() to not use page->index
  Squashfs: Update squashfs_readahead() to not use page->index
  Squashfs: Update page_actor to not use page->index
  jffs2: Use a folio in jffs2_garbage_collect_dnode()
  jffs2: Convert jffs2_do_readpage_nolock to take a folio
  buffer: Convert __block_write_begin() to take a folio
  ocfs2: Convert ocfs2_write_zero_page to use a folio
  fs: Convert aops->write_begin to take a folio
  fs: Convert aops->write_end to take a folio
  vboxsf: Use a folio in vboxsf_write_end()
  orangefs: Convert orangefs_write_begin() to use a folio
  orangefs: Convert orangefs_write_end() to use a folio
  jffs2: Convert jffs2_write_begin() to use a folio
  jffs2: Convert jffs2_write_end() to use a folio
  hostfs: Convert hostfs_write_end() to use a folio
  fuse: Convert fuse_write_begin() to use a folio
  fuse: Convert fuse_write_end() to use a folio
  f2fs: Convert f2fs_write_begin() to use a folio
  ...
2024-09-16 08:54:30 +02:00
David Howells
ee4cdf7ba8
netfs: Speed up buffered reading
Improve the efficiency of buffered reads in a number of ways:

 (1) Overhaul the algorithm in general so that it's a lot more compact and
     split the read submission code between buffered and unbuffered
     versions.  The unbuffered version can be vastly simplified.

 (2) Read-result collection is handed off to a work queue rather than being
     done in the I/O thread.  Multiple subrequests can be processed
     simultaneously.

 (3) When a subrequest is collected, any folios it fully spans are
     collected and "spare" data on either side is donated to either the
     previous or the next subrequest in the sequence.

Notes:

 (*) Readahead expansion massively slows down fio, presumably because it
     causes a load of extra allocations, both folio and xarray, up front
     before RPC requests can be transmitted.

 (*) RDMA with cifs does appear to work, both with SIW and RXE.

 (*) PG_private_2-based reading and copy-to-cache is split out into its own
     file and altered to use folio_queue.  Note that the copy to the cache
     now creates a new write transaction against the cache and adds the
     folios to be copied into it.  This allows it to use part of the
     writeback I/O code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
Christian Brauner
387b499b78
ceph: remove unused f_version
It's not used for ceph so don't bother with it at all.

Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-3-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-09 11:58:06 +02:00
Chen Yufan
2015716adb ceph: Convert to use jiffies macro
Use time_after_eq macro instead of using
jiffies directly to handle wraparound.
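
A minimal sketch of the pattern, with a hypothetical deadline field
rather than the actual fs/ceph code:

    #include <linux/jiffies.h>

    /*
     * Comparing against a saved deadline with time_after_eq() stays correct
     * across jiffies wraparound, unlike a plain "jiffies >= deadline" test.
     */
    static bool lease_needs_renew(unsigned long renew_deadline)
    {
            return time_after_eq(jiffies, renew_deadline);
    }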

[ xiubli: adjust the header files order ]

Signed-off-by: Chen Yufan <chenyufan@vivo.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-27 09:30:09 +02:00
Yue Haibing
9a948c0c8e ceph: Remove unused declarations
These functions are never implemented or used.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-27 09:28:48 +02:00
David Howells
92764e8822
netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
This partially reverts commit 2ff1e97587.

In addition to reverting the removal of PG_private_2 wrangling from the
buffered read code[1][2], the removal of the waits for PG_private_2 from
netfs_release_folio() and netfs_invalidate_folio() need reverting too.

It also adds a wait into ceph_evict_inode() to wait for netfs read and
copy-to-cache ops to complete.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2]
Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com
cc: Max Kellermann <max.kellermann@ionos.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-21 22:32:58 +02:00
Linus Torvalds
4ac0f08f44 vfs-6.11-rc4.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc
 oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns
 M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg=
 =HZBL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a typo in the take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate(),
     which avoids the deadlock, and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace as to why they can get pidfds for kthreads.

  squashfs:

   - Fix an uninitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Dominique Martinet
e3786b29c5
9p: Fix DIO read through netfs
If a program is watching a file on a 9p mount, it won't see any change in
size if the file being exported by the server is changed directly in the
source filesystem, presumably because 9p doesn't have change notifications,
and because netfs skips the reads if the file is empty.

Fix this by attempting to read the full size specified when a DIO read is
requested (such as when 9p is operating in unbuffered mode) and dealing
with a short read if the EOF was less than the expected read.

To make this work, filesystems using netfslib must not set
NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF.
I don't want to mandatorily clear this flag in netfslib for DIO because,
say, ceph might make a read from an object that is not completely filled,
but does not reside at the end of file - and so we need to clear the
excess.

This can be tested by watching an empty file over 9p within a VM (such as
in the ktest framework):

        while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo

then writing something into the empty file.  The watcher should immediately
display the file content and break out of the loop.  Without this fix, it
remains in the loop indefinitely.

Fixes: 80105ed2fd ("9p: Use netfslib read/write_iter")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:53:09 +02:00
David Howells
7b589a9b45
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly.  The problem is that we try to set them up in the request
initialisation, but the cache may still be in the process of setting up,
and so the state may not be correct.  Further, we secondarily sample the
cache state and make contradictory decisions later.

The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.

Fix this in the following way:

 (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
     ->init_request() rather than trying to juggle that in
     netfs_alloc_request().

 (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
     to be done, then PG_private_2 is to be used rather than only setting
     it if we decide to cache and then having netfs_rreq_unlock_folios()
     set the non-PG_private_2 writeback-to-cache if it wasn't set.

 (3) Split netfs_rreq_unlock_folios() into two functions, one of which
     contains the deprecated code for using PG_private_2 to avoid
     accidentally doing the writeback path - and always use it if
     USE_PGPRIV2 is set.

 (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
     wait for PG_private_2.  This function is deprecated and only used by
     ceph anyway, and so label it so.

 (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
     fscache_operation_valid() on the cache_resources instead.  This has
     the advantage of picking up the result of netfs_begin_cache_read() and
     fscache_begin_write_operation() - which are called after the object is
     initialised and will wait for the cache to come to a usable state.

Just reverting ae678317b95e[1] isn't a sufficient fix, so this needs to be
applied on top of that.  Without this as well, things like:

 rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {

and:

 WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386

may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
David Howells
8e5ced7804
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
This reverts commit ae678317b9.

Revert the patch that removes the deprecated use of PG_private_2 in
netfslib for the moment as Ceph is actually still using this to track
data copied to the cache.

Fixes: ae678317b9 ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
Matthew Wilcox (Oracle)
1da86618bd
fs: Convert aops->write_begin to take a folio
Convert all callers from working on a page to working on one page
of a folio (support for working on an entire folio can come later).
Removes a lot of folio->page->folio conversions.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:33:21 +02:00
Matthew Wilcox (Oracle)
a225800f32
fs: Convert aops->write_end to take a folio
Most callers have a folio, and most implementations operate on a folio,
so remove the conversion from folio->page->folio to fit through this
interface.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:32:02 +02:00
Xiubo Li
31634d7597 ceph: force sending a cap update msg back to MDS for revoke op
If a client sends out a cap update dropping caps with the prior 'seq'
just before an incoming cap revoke request, then the client may drop
the revoke because it believes it's already released the requested
capabilities.

This causes the MDS to wait indefinitely for the client to respond
to the revoke. It's therefore always a good idea to ack the cap
revoke request with the bumped up 'seq'.

Currently, if cap->issued equals the newcaps, check_caps() will do
nothing; we should force-flush the caps.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61782
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-01 13:14:28 +02:00
Linus Torvalds
6467dfdfc9 A small patchset to address bogus I/O errors and ultimately an
assertion failure in the face of watch errors with -o exclusive
 mappings in RBD marked for stable and some assorted CephFS fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmajtIUTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1fwB/4/CsFZLQC+ybWSqoZVYD01qND0muol
 44Nr6NyKdW302jQhAXchecB6s6L+N0azhVlAKKB8sO9XCifKA8RzuW75Y0+8z78B
 pgB6K7ZOzAIuPIG9mmNbutUHEd24CzGXNA28lEQkrNT8D6UZTENXQRJb1dS2GzQ7
 T2PyjFFoyF0z1bDZ85URHxeyMetEe/TzWUlG1P2QI98V+RP8nK+mGYmdXNGKhH87
 Ltf2pxjsiomtoH4QRm4LX7vwOUY1Ljf4HpSS1p+c5Fova2UTtTDbVfTFbh+ZjuQV
 hlh1ypNLM+igifu3nVeJ/Ga2f71zVFM66tnmpjcY3DxZAp70e1W2HMFD
 =Zfcy
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A small patchset to address bogus I/O errors and ultimately an
  assertion failure in the face of watch errors with -o exclusive
  mappings in RBD marked for stable and some assorted CephFS fixes"

* tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client:
  rbd: don't assume rbd_is_lock_owner() for exclusive mappings
  rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings
  rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
  ceph: fix incorrect kmalloc size of pagevec mempool
  ceph: periodically flush the cap releases
  ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
  ceph: use cap_wait_list only if debugfs is enabled
2024-07-26 10:34:42 -07:00
ethanwu
03230edb0b ceph: fix incorrect kmalloc size of pagevec mempool
The kmalloc size of the pagevec mempool is calculated incorrectly: it
accounts only for the number of entries in the array and misses the size
of each page pointer.
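
The general shape of the sizing fix, sketched with
mempool_create_kmalloc_pool() and hypothetical names, not the literal
fs/ceph diff:

    #include <linux/mempool.h>

    /* The pool hands out arrays of n page pointers, so the kmalloc size
     * must be n * sizeof(struct page *), not just n. */
    static mempool_t *pagevec_pool_sketch(int min_nr, unsigned int n)
    {
            /* buggy version: mempool_create_kmalloc_pool(min_nr, n); */
            return mempool_create_kmalloc_pool(min_nr, n * sizeof(struct page *));
    }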

Fixes: a0102bda5b ("ceph: move sb->wb_pagevec_pool to be a global mempool")
Signed-off-by: ethanwu <ethanwu@synology.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Xiubo Li
578eb54c4a ceph: periodically flush the cap releases
The MDS could be waiting for the cap releases indefinitely in some corner
cases and then reporting the caps revoke stuck warning. To fix this
we should periodically flush the cap releases.

Link: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Chen Ni
77bb4a501a ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
Replace a comma between expression statements by a semicolon.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Max Kellermann
65d284a38c ceph: use cap_wait_list only if debugfs is enabled
Only debugfs uses this list.  By omitting it, we save some memory and
reduce lock contention on `caps_list_lock`.
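
A rough sketch of the shape of such a change, with a hypothetical struct
rather than the actual fs/ceph definitions:

    /* Keep the wait list only when debugfs can actually show it. */
    struct mdsc_sketch {
            spinlock_t caps_list_lock;
    #ifdef CONFIG_DEBUG_FS
            struct list_head caps_waiters;  /* read by debugfs output only */
    #endif
    };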

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Kairui Song
5e425300af ceph: drop usage of page_index
page_index is needed for mixed usage of page cache and swap cache; for
pure page cache usage, the caller can just use page->index instead.

It can't be a swap cache page here, so just drop it.
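
A minimal illustration of the substitution, under the stated assumption
that the page can never be a swap cache page on this path:

    /* page_index() exists to cope with swap-cache pages; for a pure
     * page-cache page the plain field read is equivalent. */
    static pgoff_t sketch_page_offset(struct page *page)
    {
            return page->index;     /* previously: page_index(page) */
    }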

Link: https://lkml.kernel.org/r/20240521175854.96038-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03 19:29:55 -07:00
Linus Torvalds
74eca356f6 We have a series from Xiubo that adds support for additional access
checks based on MDS auth caps which were recently made available to
 clients.  This is needed to prevent scenarios where the MDS quietly
 discards updates that a UID-restricted client previously (wrongfully)
 acked to the user.  Other than that, just a documentation fixup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmZRsm0THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi6++B/4o7j4CNzjJcBw9UgxEUugwJYBe2Ht3
 vSTUkHD9NILVgrYSHNhgkCdvU8ckv8Sd+7W/Kb0BC/GRyQd57F7zoM6mvR1WMozt
 lLbYU/kdVI+TcIY2bhupMma5f+nWv6vIBTzca78UhogEBuIEYHAG3BnNaT/AEqF+
 yZ3uQEVQ2bHmwbPn3A5dibYxOR8zLyhmaq/RUvqpiiYcSkEfZ6QqKiMlZkcVD7F8
 c+NfjwXXGNTXDhfIbG4VndQi7xLXk3GI5E8xvdo2ALwumfp2KdVlMEYT8SHggSPS
 A8bWq+d5o7uVV6WviNK63XcGRpadCSSSL310vR78K+tNIIq5dnI81IsU
 =6c1e
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.10-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A series from Xiubo that adds support for additional access checks
  based on MDS auth caps which were recently made available to clients.

  This is needed to prevent scenarios where the MDS quietly discards
  updates that a UID-restricted client previously (wrongfully) acked to
  the user.

  Other than that, just a documentation fixup"

* tag 'ceph-for-6.10-rc1' of https://github.com/ceph/ceph-client:
  doc: ceph: update userspace command to get CephFS metadata
  ceph: add CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK feature bit
  ceph: check the cephx mds auth access for async dirop
  ceph: check the cephx mds auth access for open
  ceph: check the cephx mds auth access for setattr
  ceph: add ceph_mds_check_access() helper
  ceph: save cap_auths in MDS client when session is opened
2024-05-25 14:23:58 -07:00
Xiubo Li
d8fc89815f ceph: add CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK feature bit
Since kclient now supports checking the MDS auth caps, just set the
feature bit.

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:47 +02:00
Xiubo Li
2827badaf8 ceph: check the cephx mds auth access for async dirop
Before doing the op locally we need to check the cephx access.

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:47 +02:00
Xiubo Li
845ae9d492 ceph: check the cephx mds auth access for open
Before opening the file locally we need to check the cephx access.

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:47 +02:00
Xiubo Li
ded6783040 ceph: check the cephx mds auth access for setattr
If we hit any failure, just try to force a sync setattr.

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:47 +02:00
Xiubo Li
596afb0b89 ceph: add ceph_mds_check_access() helper
This will help check the MDS auth access on the client side. Always
insert the server path in front of the target path when matching
the paths.

[ idryomov: use u32 instead of uint32_t ]

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:46 +02:00
Xiubo Li
1d17de9534 ceph: save cap_auths in MDS client when session is opened
Save the cap_auths, which have been parsed by the MDS, in the opened
session.

[ idryomov: use s64 and u32 instead of int64_t and uint32_t, switch to
            bool for root_squash, readable and writeable ]

Link: https://tracker.ceph.com/issues/61333
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-05-23 10:35:46 +02:00
David Howells
7ba167c4c7 netfs: Switch to using unsigned long long rather than loff_t
Switch to using unsigned long long rather than loff_t in netfslib to avoid
problems with the sign flipping in the maths when we're dealing with the
byte at position 0x7fffffffffffffff.
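
A small worked example of the sign issue (plain userspace C, not netfs
code): the byte at 0x7fffffffffffffff is the last one a signed loff_t can
address, so arithmetic that steps past it needs an unsigned type.

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
            unsigned long long pos = 0x7fffffffffffffffULL; /* == LLONG_MAX */
            unsigned long long len = 1;

            /* Unsigned maths still has headroom for the end position... */
            printf("end (unsigned): %#llx\n", pos + len);

            /* ...but the same value does not fit in a signed long long, so
             * signed (loff_t-style) arithmetic would overflow and flip sign. */
            printf("LLONG_MAX:      %#llx\n", (unsigned long long)LLONG_MAX);
            return 0;
    }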

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
2024-05-01 18:07:35 +01:00
David Howells
ae678317b9 netfs: Remove deprecated use of PG_private_2 as a second writeback flag
Remove the deprecated use of PG_private_2 in netfslib.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-04-29 15:01:43 +01:00
David Howells
2e9d7e4b98 mm: Remove the PG_fscache alias for PG_private_2
Remove the PG_fscache alias for PG_private_2 and use the latter directly.
Use of this flag for marking pages undergoing writing to the cache should
be considered deprecated and the folios should be marked dirty instead and
the write done in ->writepages().

Note that PG_private_2 itself should be considered deprecated and up for
future removal by the MM folks too.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-04-29 15:01:42 +01:00
David Howells
2ff1e97587 netfs: Replace PG_fscache by setting folio->private and marking dirty
When dirty data is being written to the cache, setting/waiting on/clearing
the fscache flag is always done in tandem with setting/waiting on/clearing
the writeback flag.  The netfslib buffered write routines wait on and set
both flags and the write request cleanup clears both flags, so the fscache
flag is almost superfluous.

The reason it isn't superfluous is because the fscache flag is also used to
indicate that data just read from the server is being written to the cache.
The flag is used to prevent a race involving overlapping direct-I/O writes
to the cache.

Change this to indicate that a page is in need of being copied to the cache
by placing a magic value in folio->private and marking the folios dirty.
Then when the writeback code sees a folio marked in this way, it only
writes it to the cache and not to the server.

If a folio that has this magic value set is modified, the value is just
replaced and the folio will then be uploaded too.

With this, PG_fscache is no longer required by the netfslib core, 9p and
afs.

Ceph and nfs, however, still need to use the old PG_fscache-based tracking.
To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the
flags in the netfs_inode struct for those filesystems.  This reenables the
use of PG_fscache in that inode.  9p and afs use the netfslib write helpers
so get switched over; cifs, for the moment, does page-by-page manual access
to the cache, so doesn't use PG_fscache and is unaffected.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-04-29 15:01:42 +01:00
Xiubo Li
17f8dc2db5 ceph: switch to use cap_delay_lock for the unlink delay list
The same list item will be used in both cap_delay_list and
cap_unlink_delay_list, so it's buggy to use two different locks
to protect them.

Cc: stable@vger.kernel.org
Fixes: dbc347ef7f ("ceph: add ceph_cap_unlink_work to fire check_caps() immediately")
Link: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AODC76VXRAMXKLFDCTK4TKFDDPWUSCN5
Reported-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 22:56:28 +02:00
NeilBrown
b372e96bd0 ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
The page has been marked clean before writepage is called.  If we don't
redirty it before postponing the write, it might never get written.

Cc: stable@vger.kernel.org
Fixes: 503d4fa6ee ("ceph: remove reliance on bdi congestion")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 19:17:02 +02:00
Linus Torvalds
ff9c18e435 A patch to minimize blockage when processing very large batches of
dirty caps and two fixes to better handle EOF in the face of multiple
 clients performing reads and size-extending writes at the same time.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmX9xDETHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3HzCACJYjTUKq4v8/LkhzyJM0WTYXKQ+Orz
 BnwgFHGIEiihQKko/7Ks+fcuEGpdy97Rsn9mtmkN0UCKfbcCHwGwflaoYfkkIA4t
 V9pNX0xRwDTyKaENtiI5GVjC/nYcfotbRK4BfURRYKb1xYHq8lO0mOxXwvt5weqQ
 CISWACp7k7eMcX0R0fKT9LemfBDDu2Pxi5ZnDNSdI6Z87Bwdv96jOaCaJ93Azo1W
 Mjr9ddMmaaqsrmaUE3jp58b56nxTrcOUGR7XQUZtjNjEy5h91WazydD4TJFaEQrF
 CQsV5nXHSRT8E4ROUZk8fa7amLs6FGrx307fkOKW02exQPBF7ij/SAt4
 =STLZ
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A patch to minimize blockage when processing very large batches of
  dirty caps and two fixes to better handle EOF in the face of multiple
  clients performing reads and size-extending writes at the same time"

* tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client:
  ceph: set correct cap mask for getattr request for read
  ceph: stop copying to iter at EOF on sync reads
  ceph: remove SLAB_MEM_SPREAD flag usage
  ceph: break the check delayed cap loop every 5s
2024-03-22 11:15:45 -07:00
Xiubo Li
825b82f6b8 ceph: set correct cap mask for getattr request for read
In case of hitting the file EOF, ceph_read_iter() needs to retrieve the
file size from the MDS, and Fr caps aren't necessary.

[ idryomov: fold into existing retry_op == READ_INLINE branch ]

Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-19 14:35:55 +01:00
Xiubo Li
1065da21e5 ceph: stop copying to iter at EOF on sync reads
If EOF is encountered, ceph_sync_read() return value is adjusted down
according to i_size, but the "to" iter is advanced by the actual number
of bytes read.  Then, when retrying, the remainder of the range may be
skipped incorrectly.

Ensure that the "to" iter is advanced only until EOF.

[ idryomov: changelog ]

Fixes: c3d8e0b5de ("ceph: return the real size read when it hits EOF")
Reported-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Frank Hsiao <frankhsiao@qnap.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-19 14:35:55 +01:00
Chengming Zhou
a8922f7967 ceph: remove SLAB_MEM_SPREAD flag usage
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it has been a dead flag since commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series [1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-18 22:03:29 +01:00
Xiubo Li
09927e7ef1 ceph: break the check delayed cap loop every 5s
In some cases this may take a long time and will block renewing
the caps to MDS.

[ idryomov: massage comment ]

Link: https://tracker.ceph.com/issues/50223#note-21
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-03-18 22:03:29 +01:00
Linus Torvalds
f88c3fb81c mm, slab: remove last vestiges of SLAB_MEM_SPREAD
Yes, yes, I know the slab people were planning on going slow and letting
every subsystem fight this thing on their own.  But let's just rip off
the band-aid and get it over and done with.  I don't want to see a
number of unnecessary pull requests just to get rid of a flag that no
longer has any meaning.

This was mainly done with a couple of 'sed' scripts and then some manual
cleanup of the end result.

Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-12 20:32:19 -07:00
Linus Torvalds
0c750012e8 vfs-6.9.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4tQAKCRCRxhvAZXjc
 ohnfAP4sm946PZfiC4y5Euk96WDC3hC8WCSBar+fpFmYVzeD9wEAy+NVCsjkMElz
 vqNxwFULUwQjFxxvsM9gvhrgGUud1AE=
 =UZk/
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull file locking updates from Christian Brauner:
 "A few years ago struct file_lock_context was added to allow for
  separate lists to track different types of file locks instead of using
  a singly-linked list for all of them.

  Now leases no longer need to be tracked using struct file_lock.
  However, a lot of the infrastructure is identical for leases and locks
  so separating them isn't trivial.

  This splits a group of fields used by both file locks and leases into
  a new struct file_lock_core. The new core struct is embedded in struct
  file_lock. Coccinelle was used to convert a lot of the callers to deal
  with the move, with the remaining 25% or so converted by hand.

  Afterwards several internal functions in fs/locks.c are made to work
   with struct file_lock_core. Ultimately this allows splitting struct
   file_lock into struct file_lock and struct file_lease. The file lease
  APIs are then converted to take struct file_lease"

* tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
  filelock: fix deadlock detection in POSIX locking
  filelock: always define for_each_file_lock()
  smb: remove redundant check
  filelock: don't do security checks on nfsd setlease calls
  filelock: split leases out of struct file_lock
  filelock: remove temporary compatibility macros
  smb/server: adapt to breakup of struct file_lock
  smb/client: adapt to breakup of struct file_lock
  ocfs2: adapt to breakup of struct file_lock
  nfsd: adapt to breakup of struct file_lock
  nfs: adapt to breakup of struct file_lock
  lockd: adapt to breakup of struct file_lock
  fuse: adapt to breakup of struct file_lock
  gfs2: adapt to breakup of struct file_lock
  dlm: adapt to breakup of struct file_lock
  ceph: adapt to breakup of struct file_lock
  afs: adapt to breakup of struct file_lock
  9p: adapt to breakup of struct file_lock
  filelock: convert seqfile handling to use file_lock_core
  filelock: convert locks_translate_pid to take file_lock_core
  ...
2024-03-11 10:37:45 -07:00
Xiubo Li
51d31149a8 ceph: switch to corrected encoding of max_xattr_size in mdsmap
The addition of bal_rank_mask with encoding version 17 was merged
into ceph.git in Oct 2022 and made it into v18.2.0 release normally.
A few months later, the much delayed addition of max_xattr_size got
merged, also with encoding version 17, placed before bal_rank_mask
in the encoding -- but it didn't make v18.2.0 release.

The way this ended up being resolved on the MDS side is that
bal_rank_mask will continue to be encoded in version 17 while
max_xattr_size is now encoded in version 18.  This does mean that
older kernels will misdecode version 17, but this is also true for
v18.2.0 and v18.2.1 clients in userspace.

The best we can do is backport this adjustment -- see ceph.git
commit 78abfeaff27fee343fb664db633de5b221699a73 for details.

[ idryomov: changelog ]

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/64440
Fixes: d93231a6bc ("ceph: prevent a client from exceeding the MDS maximum xattr size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-26 19:20:30 +01:00
Xiubo Li
dbc347ef7f ceph: add ceph_cap_unlink_work to fire check_caps() immediately
When unlinking a file the check caps could be delayed for more than
5 seconds, but on the MDS side it may be waiting for the clients to
release caps.

This will use the cap_wq work queue and a dedicated list to help
fire the check_caps() and dirty buffer flushing immediately.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:54 +01:00
Xiubo Li
902d6d013f ceph: always queue a writeback when revoking the Fb caps
In case there are dirty 'Fw' caps and 'CHECK_CAPS_FLUSH' is set we
will always skip queuing a writeback. Queuing a writeback is very
important because skipping it will block the kclient from flushing the
snapcaps to the MDS, which will in turn block the MDS waiting to revoke
the 'Fb' caps.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-13 11:22:35 +01:00
Xiubo Li
07045648c0 ceph: always check dir caps asynchronously
The MDS will issue the 'Fr' caps for async dirops, but there is a bug
in kclient where it could miss releasing the async dirop caps, which
are 'Fsxr'. The MDS will then complain with:

"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."

So when releasing the async dirop requests, or when they fail, we
should always make sure that the caps being revoked can be released.

Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:58:02 +01:00
Rishabh Dave
cda4672da1 ceph: prevent use-after-free in encode_cap_msg()
In fs/ceph/caps.c, in encode_cap_msg(), a "use after free" error was
caught by KASAN at this line - 'ceph_buffer_get(arg->xattr_buf);'. This
implies that the buffer was freed before the refcount could be
incremented here.

In the same file, in handle_cap_grant(), the refcount is decremented by
this line - 'ceph_buffer_put(ci->i_xattrs.blob);'. It appears that a race
occurred and the resource was freed by the latter line before the former
line could increment it.

encode_cap_msg() is called by __send_cap() and __send_cap() is called by
ceph_check_caps() after calling __prep_cap(). __prep_cap() is where
arg->xattr_buf is assigned to ci->i_xattrs.blob. This is the spot where
the refcount must be increased to prevent "use after free" error.
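
A minimal sketch of the shape of the fix, using the helpers named above
rather than the literal patch:

    /*
     * Take the xattr blob reference at the point where __prep_cap() hands
     * it to the cap message arguments, so a concurrent ceph_buffer_put()
     * in handle_cap_grant() cannot free it first.
     */
    static void prep_cap_xattrs_sketch(struct cap_msg_args *arg,
                                       struct ceph_inode_info *ci)
    {
            arg->xattr_buf = ci->i_xattrs.blob ?
                             ceph_buffer_get(ci->i_xattrs.blob) : NULL;
    }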

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/59259
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:58:02 +01:00
Xiubo Li
bbb20ea993 ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT
The fscrypt code will use i_blkbits to set up ci_data_unit_bits when
allocating the new inode, but ceph initializes i_blkbits later, when
filling the inode, which is too late. Since ci_data_unit_bits will only
be used by the fscrypt framework, initializing i_blkbits with
CEPH_FSCRYPT_BLOCK_SHIFT is safe.
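
A minimal sketch of the idea, assuming it is applied where the inode is
allocated; the helper name is hypothetical:

    /* Initialize i_blkbits before fscrypt derives ci_data_unit_bits from
     * it; the value learned later when filling the inode arrives too late. */
    static void ceph_init_crypto_blkbits_sketch(struct inode *inode)
    {
            inode->i_blkbits = CEPH_FSCRYPT_BLOCK_SHIFT;
    }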

Link: https://tracker.ceph.com/issues/64035
Fixes: 5b11888471 ("fscrypt: support crypto data unit size less than filesystem block size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-02-07 14:43:29 +01:00
Jeff Layton
3956f35fbd
ceph: adapt to breakup of struct file_lock
Most of the existing APIs have remained the same, but subsystems that
access file_lock fields directly need to reach into struct
file_lock_core now.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-36-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:42 +01:00
Jeff Layton
a69ce85ec9
filelock: split common fields into struct file_lock_core
In a future patch, we're going to split file leases into their own
structure. Since a lot of the underlying machinery uses the same fields
move those into a new file_lock_core, and embed that inside struct
file_lock.

For now, add some macros to ensure that we can continue to build while
the conversion is in progress.
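
A simplified illustration of the embed-plus-compatibility-macro approach
(abbreviated, hypothetical field set, not the full fs/filelock
definitions):

    /* Fields shared by locks and leases move into a core struct that is
     * embedded in struct file_lock. */
    struct flc_core_sketch {
            struct list_head flc_list;
            fl_owner_t       flc_owner;
            unsigned int     flc_flags;
            unsigned char    flc_type;
    };

    struct file_lock_sketch {
            struct flc_core_sketch c;   /* shared core, embedded first */
            loff_t fl_start;            /* byte-range state stays here */
            loff_t fl_end;
    };

    /* Transitional macro so existing "fl->fl_flags" users keep building. */
    #define fl_flags c.flc_flags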

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:38 +01:00
Jeff Layton
75e9570c93
ceph: convert to using new filelock helpers
Convert to using the new file locking helper functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-7-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-05 13:11:35 +01:00
Linus Torvalds
556e2d17ca Assorted CephFS fixes and cleanups with nothing standing out.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmWqmP8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3cQB/0XJABiPkolqNtd3dSGw8x2YnpS6ciV
 yHxpJViF0+qmnS5l6Vn2lEDr/57h/jts0t3kXUUSDVbitK9glim5ar2FsBeuY7gi
 lQbqhFPfQ+G3APDn2Dn27JYvO1VQLMmvuFJyE4rJ03XZjvOYpq4zM3zPO0jPGvCN
 Gnw0VqPst/h4eobcsFEsHvHuMkkVy6YIOQPsDkiYUShaY6OBUWM4kewrlztmEvaK
 fyuo/FSNmZeEkoc5R7Pfo1FE4PZzfdUie7RmEznxqgHUWFmx2jKZ5TwnCZt1D2av
 dV2e2JWnZUZZL9vAnCQddvnYrj8j+an/IbGZ+0Wa5DZo/eMglDd01VV2
 =kNSw
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Assorted CephFS fixes and cleanups with nothing standing out"

* tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client:
  ceph: get rid of passing callbacks in __dentry_leases_walk()
  ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
  ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
  ceph: remove duplicated code in ceph_netfs_issue_read()
  ceph: send oldest_client_tid when renewing caps
  ceph: rename create_session_open_msg() to create_session_full_msg()
  ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
  ceph: fix deadlock or deadcode of misusing dget()
  ceph: try to allocate a smaller extent map for sparse read
  libceph: remove MAX_EXTENTS check for sparse reads
  ceph: reinitialize mds feature bit even when session in open
  ceph: skip reconnecting if MDS is not ready
2024-01-19 09:58:55 -08:00
Linus Torvalds
16df6e07d6 vfs-6.8.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZabMrQAKCRCRxhvAZXjc
 ovnUAQDgCOonb1tjtTvC8s8IMDUEoaVYZI91KVfsZQSJYN1sdQD+KfJmX1BhJnWG
 l0cEffGfnWGXMZkZqDgLPHUIPzFrmws=
 =1b3j
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull netfs updates from Christian Brauner:
 "This extends the netfs helper library that network filesystems can use
  to replace their own implementations. Both afs and 9p are ported. cifs
  is ready as well but the patches are way bigger and will be routed
  separately once this is merged. That will remove lots of code as well.

  The overall goal is to get high-level I/O and knowledge of the page
  cache out of the filesystem drivers. This includes knowledge about
  the existence of pages and folios.

  The pull request converts afs and 9p. This removes about 800 lines of
  code from afs and 300 from 9p. For 9p it is now possible to do writes
  in chunks larger than a page. Additionally, multipage folio support
  can be turned on for 9p. Separate patches exist for cifs removing
  another 2000+ lines. I've included detailed information in the
  individual pulls I took.

  Summary:

   - Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
     calls to prevent these from happening at the same time.

   - Support for direct and unbuffered I/O.

   - Support for write-through caching in the page cache.

   - O_*SYNC and RWF_*SYNC writes use write-through rather than writing
     to the page cache and then flushing afterwards.

   - Support for write-streaming.

   - Support for write grouping.

   - Skip reads for which the server could only return zeros or EOF.

   - The fscache module is now part of the netfs library and the
     corresponding maintainer entry is updated.

   - Some helpers from the fscache subsystem are renamed to mark them as
     belonging to the netfs library.

   - Follow-up fixes for the netfs library.

   - Follow-up fixes for the 9p conversion"

* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
  netfs: Fix wrong #ifdef hiding wait
  cachefiles: Fix signed/unsigned mixup
  netfs: Fix the loop that unmarks folios after writing to the cache
  netfs: Fix interaction between write-streaming and cachefiles culling
  netfs: Count DIO writes
  netfs: Mark netfs_unbuffered_write_iter_locked() static
  netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
  netfs: Rearrange netfs_io_subrequest to put request pointer first
  9p: Use length of data written to the server in preference to error
  9p: Do a couple of cleanups
  9p: Fix initialisation of netfs_inode for 9p
  cachefiles: Fix __cachefiles_prepare_write()
  9p: Use netfslib read/write_iter
  afs: Use the netfs write helpers
  netfs: Export the netfs_sreq tracepoint
  netfs: Optimise away reads above the point at which there can be no data
  netfs: Implement a write-through caching option
  netfs: Provide a launder_folio implementation
  netfs: Provide a writepages implementation
  netfs, cachefiles: Pass upper bound length to allow expansion
  ...
2024-01-19 09:10:23 -08:00
Al Viro
2a965d1b15 ceph: get rid of passing callbacks in __dentry_leases_walk()
__dentry_leases_walk() gets a callback and calls it for
a bunch of dentries; there are exactly two callers and
we already have a flag telling them apart - lwc->dir_lease.

Seeing that indirect calls are costly these days, let's
get rid of the callback and just call the right function
directly.  Has a side benefit of saner signatures...

[ xiubli: a minor fix in the commit title ]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:54:54 +01:00
Al Viro
f6fb21b22f ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
Clean up the code.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Wenchao Hao
0f4cf64eab ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
This issue is reported by smatch: get_quota_realm() might return an
ERR_PTR but we did not handle it. It's not an immediate bug, but we
should still address it to avoid potential bugs if get_quota_realm()
is changed to return other ERR_PTRs in the future.

Set the ceph_snap_realm pointer inside get_quota_realm() to address this
issue; the pointer is set to NULL if get_quota_realm() fails to get a
struct ceph_snap_realm, so no ERR_PTR can escape any more.

[ xiubli: minor code style clean up ]

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li
b36b03344f ceph: remove duplicated code in ceph_netfs_issue_read()
When allocating an osd request the libceph.ko will add the
'read_from_replica' flag by default.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li
6df89bf220 ceph: send oldest_client_tid when renewing caps
Update the oldest_client_tid via the session renew caps msg to
make sure that the MDSs won't pile up the completed request list
in a very large size.

[ idryomov: drop inapplicable comment ]

Link: https://tracker.ceph.com/issues/63364
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00
Xiubo Li
66207de308 ceph: rename create_session_open_msg() to create_session_full_msg()
Make the create session msg helper more general so that it can be
used by other ops.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-15 15:40:51 +01:00