Commit Graph

2164 Commits

Author SHA1 Message Date
Venky Shankar
adbed05ed6 ceph: mount syntax module parameter
Add read-only module parameters for supported mount syntaxes. Primary
user is the user-space mount helper for catching v2 syntax bugs during
testing by cross verifying if the kernel supports v2 syntax on mount
failure.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:06 +01:00
Venky Shankar
2167f2cc68 ceph: record updated mon_addr on remount
Note that the new monitors are just shown in /proc/mounts.
Ceph does not (re)connect to new monitors yet.

[ jlayton: s/printk\(KERN_NOTICE/pr_notice(/
	   s/strcmp/strcmp_null/ ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:06 +01:00
Venky Shankar
7b19b4db5a ceph: new device mount syntax
Old mount device syntax (source) has the following problems:

- mounts to the same cluster but with different fsnames
  and/or creds have identical device string which can
  confuse xfstests.

- Userspace mount helper tool resolves monitor addresses
  and fill in mon addrs automatically, but that means the
  device shown in /proc/mounts is different than what was
  used for mounting.

New device syntax is as follows:

  cephuser@fsid.mycephfs2=/path

Note, there is no "monitor address" in the device string.
That gets passed in as mount option. This keeps the device
string same when monitor addresses change (on remounts).

Also note that the userspace mount helper tool is backward
compatible. I.e., the mount helper will fallback to using
old syntax after trying to mount with the new syntax.

[ idryomov: drop CEPH_MON_ADDR_MNTOPT_DELIM ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:06 +01:00
Venky Shankar
2d7c86a8f9 libceph: generalize addr/ip parsing based on delimiter
... and remove hardcoded function name in ceph_parse_ips().

[ idryomov: delim parameter, drop CEPH_ADDR_PARSE_DEFAULT_DELIM ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:05 +01:00
Linus Torvalds
8834147f95 fscache rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmHeBGsACgkQ+7dXa6fL
 C2tyLw/8C2Gs/XvOZvRO7KPetKI9BbQSFoCe7uvGbiPq5CEmgcjWzQxvQGklBiZD
 qYa6pMNye1iGpsHOY3Yu210b7vMQiRLnnxvVle0UrjpZR7CcxYS0gGV+6yRdbDGy
 W1X6GFiX06qiNsgBH4msYp0SmbhhfkTyAx1BeBZAEtX8iFgaPfOldPY2nLMcTDD6
 6FT1nTzRcMHx9IUQZJtpeatzc70Qg8+fOr2UAY2nOIypXh6+vAMBO80xtUjGVU+1
 pWD1E+8cXSLfwEEzquFWoWTsTX7hNfsesEN10FmBf1bVCH9ZDFE01MOl6B8+CkFl
 +xfkvDNFC3yyUwAMVAV4+A4Be+cVLSqN2R91QIKJnAj9w1OjxASrwZJ1YeZp6KP4
 h0XKuPs3sRwwbNPVL/nP0UPNexoJnOUAaHesl4uKkRrExmxz9xGOIqIri2+tUIO+
 HkGyNns1huymj1K1ja4AQbDiZZX39GgYVleyg9g3uuy1FS4k+/myJcXo/CqWn3ON
 4oeNwxwLvlcqIQnPrESvwev50lFZYB4pfwvez6T2C5dL/Wk/xdeJK9iG81RWgx7y
 5XcDeoGDE08gMCGWVPjuhOCXypeiRGHhRNlcxTtq5kLwBZGkcYg/wFFnWn+6hzc4
 kyXw2kS5WZq4Q/FPh7BdY0eHp6xv0EpAOZwceneLB9lhNINdxcQ=
 =ISJ6
 -----END PGP SIGNATURE-----

Merge tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache rewrite from David Howells:
 "This is a set of patches that rewrites the fscache driver and the
  cachefiles driver, significantly simplifying the code compared to
  what's upstream, removing the complex operation scheduling and object
  state machine in favour of something much smaller and simpler.

  The series is structured such that the first few patches disable
  fscache use by the network filesystems using it, remove the cachefiles
  driver entirely and as much of the fscache driver as can be got away
  with without causing build failures in the network filesystems.

  The patches after that recreate fscache and then cachefiles,
  attempting to add the pieces in a logical order. Finally, the
  filesystems are reenabled and then the very last patch changes the
  documentation.

  [!] Note: I have dropped the cifs patch for the moment, leaving local
      caching in cifs disabled. I've been having trouble getting that
      working. I think I have it done, but it needs more testing (there
      seem to be some test failures occurring with v5.16 also from
      xfstests), so I propose deferring that patch to the end of the
      merge window.

  WHY REWRITE?
  ============

  Fscache's operation scheduling API was intended to handle sequencing
  of cache operations, which were all required (where possible) to run
  asynchronously in parallel with the operations being done by the
  network filesystem, whilst allowing the cache to be brought online and
  offline and to interrupt service for invalidation.

  With the advent of the tmpfile capacity in the VFS, however, an
  opportunity arises to do invalidation much more simply, without having
  to wait for I/O that's actually in progress: Cachefiles can simply
  create a tmpfile, cut over the file pointer for the backing object
  attached to a cookie and abandon the in-progress I/O, dismissing it
  upon completion.

  Future work here would involve using Omar Sandoval's vfs_link() with
  AT_LINK_REPLACE[1] to allow an extant file to be displaced by a new
  hard link from a tmpfile as currently I have to unlink the old file
  first.

  These patches can also simplify the object state handling as I/O
  operations to the cache don't all have to be brought to a stop in
  order to invalidate a file. To that end, and with an eye on to writing
  a new backing cache model in the future, I've taken the opportunity to
  simplify the indexing structure.

  I've separated the index cookie concept from the file cookie concept
  by C type now. The former is now called a "volume cookie" (struct
  fscache_volume) and there is a container of file cookies. There are
  then just the two levels. All the index cookie levels are collapsed
  into a single volume cookie, and this has a single printable string as
  a key. For instance, an AFS volume would have a key of something like
  "afs,example.com,1000555", combining the filesystem name, cell name
  and volume ID. This is freeform, but must not have '/' chars in it.

  I've also eliminated all pointers back from fscache into the network
  filesystem. This required the duplication of a little bit of data in
  the cookie (cookie key, coherency data and file size), but it's not
  actually that much. This gets rid of problems with making sure we keep
  netfs data structures around so that the cache can access them.

  These patches mean that most of the code that was in the drivers
  before is simply gone and those drivers are now almost entirely new
  code. That being the case, there doesn't seem any particular reason to
  try and maintain bisectability across it. Further, there has to be a
  point in the middle where things are cut over as there's a single
  point everything has to go through (ie. /dev/cachefiles) and it can't
  be in use by two drivers at once.

  ISSUES YET OUTSTANDING
  ======================

  There are some issues still outstanding, unaddressed by this patchset,
  that will need fixing in future patchsets, but that don't stop this
  series from being usable:

  (1) The cachefiles driver needs to stop using the backing filesystem's
      metadata to store information about what parts of the cache are
      populated. This is not reliable with modern extent-based
      filesystems.

      Fixing this is deferred to a separate patchset as it involves
      negotiation with the network filesystem and the VM as to how much
      data to download to fulfil a read - which brings me on to (2)...

  (2) NFS (and CIFS with the dropped patch) do not take account of how
      the cache would like I/O to be structured to meet its granularity
      requirements. Previously, the cache used page granularity, which
      was fine as the network filesystems also dealt in page
      granularity, and the backing filesystem (ext4, xfs or whatever)
      did whatever it did out of sight. However, we now have folios to
      deal with and the cache will now have to store its own metadata to
      track its contents.

      The change I'm looking at making for cachefiles is to store
      content bitmaps in one or more xattrs and making a bit in the map
      correspond to something like a 256KiB block. However, the size of
      an xattr and the fact that they have to be read/updated in one go
      means that I'm looking at covering 1GiB of data per 512-byte map
      and storing each map in an xattr. Cachefiles has the potential to
      grow into a fully fledged filesystem of its very own if I'm not
      careful.

      However, I'm also looking at changing things even more radically
      and going to a different model of how the cache is arranged and
      managed - one that's more akin to the way, say, openafs does
      things - which brings me on to (3)...

  (3) The way cachefilesd does culling is very inefficient for large
      caches and it would be better to move it into the kernel if I can
      as cachefilesd has to keep asking the kernel if it can cull a
      file. Changing the way the backend works would allow this to be
      addressed.

  BITS THAT MAY BE CONTROVERSIAL
  ==============================

  There are some bits I've added that may be controversial:

  (1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check
      if a files is already being used by some other kernel service
      (e.g. a duplicate cachefiles cache in the same directory) and
      reject it if it is. This isn't entirely necessary, but it helps
      prevent accidental data corruption.

      I don't want to use S_SWAPFILE as that has other effects, but
      quite possibly swapon() should set S_KERNEL_FILE too.

      Note that it doesn't prevent userspace from interfering, though
      perhaps it should. (I have made it prevent a marked directory from
      being rmdir-able).

  (2) Cachefiles wants to keep the backing file for a cookie open whilst
      we might need to write to it from network filesystem writeback.
      The problem is that the network filesystem unuses its cookie when
      its file is closed, and so we have nothing pinning the cachefiles
      file open and it will get closed automatically after a short time
      to avoid EMFILE/ENFILE problems.

      Reopening the cache file, however, is a problem if this is being
      done due to writeback triggered by exit(). Some filesystems will
      oops if we try to open a file in that context because they want to
      access current->fs or suchlike.

      To get around this, I added the following:

      (A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network
          filesystem inode to indicate that we have a usage count on the
          cookie caching that inode.

      (B) A flag in struct writeback_control, unpinned_fscache_wb, that
          is set when __writeback_single_inode() clears the last dirty
          page from i_pages - at which point it clears
          I_PINNING_FSCACHE_WB and sets this flag.

          This has to be done here so that clearing I_PINNING_FSCACHE_WB
          can be done atomically with the check of PAGECACHE_TAG_DIRTY
          that clears I_DIRTY_PAGES.

      (C) A function, fscache_set_page_dirty(), which if it is not set,
          sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to
          pin the cache resources.

      (D) A function, fscache_unpin_writeback(), to be called by
          ->write_inode() to unuse the cookie.

      (E) A function, fscache_clear_inode_writeback(), to be called when
          the inode is evicted, before clear_inode() is called. This
          cleans up any lingering I_PINNING_FSCACHE_WB.

      The network filesystem can then use these tools to make sure that
      fscache_write_to_cache() can write locally modified data to the
      cache as well as to the server.

      For the future, I'm working on write helpers for netfs lib that
      should allow this facility to be removed by keeping track of the
      dirty regions separately - but that's incomplete at the moment and
      is also going to be affected by folios, one way or another, since
      it deals with pages"

Link: https://lore.kernel.org/all/510611.1641942444@warthog.procyon.org.uk/
Tested-by: Dominique Martinet <asmadeus@codewreck.org> # 9p
Tested-by: kafs-testing@auristor.com # afs
Tested-by: Jeff Layton <jlayton@kernel.org> # ceph
Tested-by: Dave Wysochanski <dwysocha@redhat.com> # nfs
Tested-by: Daire Byrne <daire@dneg.com> # nfs

* tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (67 commits)
  9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking()
  fscache: Add a tracepoint for cookie use/unuse
  fscache: Rewrite documentation
  ceph: add fscache writeback support
  ceph: conversion to new fscache API
  nfs: Implement cache I/O by accessing the cache directly
  nfs: Convert to new fscache volume/cookie API
  9p: Copy local writes to the cache when writing to the server
  9p: Use fscache indexing rewrite and reenable caching
  afs: Skip truncation on the server of data we haven't written yet
  afs: Copy local writes to the cache when writing to the server
  afs: Convert afs to use the new fscache API
  fscache, cachefiles: Display stat of culling events
  fscache, cachefiles: Display stats of no-space events
  cachefiles: Allow cachefiles to actually function
  fscache, cachefiles: Store the volume coherency data
  cachefiles: Implement the I/O routines
  cachefiles: Implement cookie resize for truncate
  cachefiles: Implement begin and end I/O operation
  cachefiles: Implement backing file wrangling
  ...
2022-01-12 13:45:12 -08:00
David Howells
d7bdba1c81 9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking()
In 9p, afs ceph, and nfs, gfpflags_allow_blocking() (which wraps a
test for __GFP_DIRECT_RECLAIM being set) is used to determine if
->releasepage() should wait for the completion of a DIO write to fscache
with something like:

	if (folio_test_fscache(folio)) {
		if (!gfpflags_allow_blocking(gfp) || !(gfp & __GFP_FS))
			return false;
		folio_wait_fscache(folio);
	}

Instead, current_is_kswapd() should be used instead.

Note that this is based on a patch originally by Zhaoyang Huang[1].  In
addition to extending it to the other network filesystems and putting it on
top of my fscache rewrite, it also needs to include linux/swap.h in a bunch
of places.  Can current_is_kswapd() be moved to linux/mm.h?

Changes
=======
ver #5:
 - Dropping the changes for cifs.

Originally-signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <smfrench@gmail.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: linux-cachefs@redhat.com
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/1638952658-20285-1-git-send-email-huangzhaoyang@gmail.com/ [1]
Link: https://lore.kernel.org/r/164021590773.640689.16777975200823659231.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:27:42 +00:00
Jeff Layton
1702e79734 ceph: add fscache writeback support
When updating the backing store from the pagecache (a'la writepage or
writepages), write to the cache first. This allows us to keep caching
files even when they are being written, as long as we have appropriate
caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20211129162907.149445-3-jlayton@kernel.org/ # v1
Link: https://lore.kernel.org/r/20211207134451.66296-3-jlayton@kernel.org/ # v2
Link: https://lore.kernel.org/r/163906985808.143852.1383891557313186623.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967190257.1823006.16713609520911954804.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021585020.640689.6765214932458435472.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:13:01 +00:00
Jeff Layton
400e1286c0 ceph: conversion to new fscache API
Now that the fscache API has been reworked and simplified, change ceph
over to use it.

With the old API, we would only instantiate a cookie when the file was
open for reads. Change it to instantiate the cookie when the inode is
instantiated and call use/unuse when the file is opened/closed.

Also, ensure we resize the cached data on truncates, and invalidate the
cache in response to the appropriate events. This will allow us to
plumb in write support later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1
Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2
Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4
2022-01-11 22:13:01 +00:00
David Howells
01491a7565 fscache, cachefiles: Disable configuration
Disable fscache and cachefiles in Kconfig whilst it is rewritten.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819576672.215744.12444272479560406780.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906882835.143852.11073015983885872901.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967075113.1823006.277316290062782998.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021481179.640689.2004199594774033658.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:21:44 +00:00
Christian Brauner
fd84bfdddd ceph: fix up non-directory creation in SGID directories
Ceph always inherits the SGID bit if it is set on the parent inode,
while the generic inode_init_owner does not do this in a few cases where
it can create a possible security problem (cf. [1]).

Update ceph to strip the SGID bit just as inode_init_owner would.

This bug was detected by the mapped mount testsuite in [3]. The
testsuite tests all core VFS functionality and semantics with and
without mapped mounts. That is to say it functions as a generic VFS
testsuite in addition to a mapped mount testsuite. While working on
mapped mount support for ceph, SIGD inheritance was the only failing
test for ceph after the port.

The same bug was detected by the mapped mount testsuite in XFS in
January 2021 (cf. [2]).

[1]: commit 0fa3ecd878 ("Fix up non-directory creation in SGID directories")
[2]: commit 01ea173e10 ("xfs: fix up non-directory creation in SGID directories")
[3]: https://git.kernel.org/fs/xfs/xfstests-dev.git

Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-12-01 17:08:27 +01:00
Xiubo Li
ee2a095d3b ceph: initialize pathlen variable in reconnect_caps_cb
The smatch static checker warned about an uninitialized symbol usage in
this function, in the case where ceph_mdsc_build_path returns an error.

It turns out that that case is harmless, but it just looks sketchy.
Initialize the variable at declaration time, and remove the unneeded
setting of it later.

Fixes: a33f6432b3 ("ceph: encode inodes' parent/d_name in cap reconnect message")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-12-01 17:08:26 +01:00
Jeff Layton
e485d028bb ceph: initialize i_size variable in ceph_sync_read
Newer compilers seem to determine that this variable being uninitialized
isn't a problem, but older compilers (from the RHEL8 era) seem to choke
on it and complain that it could be used uninitialized.

Go ahead and initialize the variable at declaration time to silence
potential compiler warnings.

Fixes: c3d8e0b5de ("ceph: return the real size read when it hits EOF")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-12-01 17:08:26 +01:00
Hu Weiwen
973e524563 ceph: fix duplicate increment of opened_inodes metric
opened_inodes is incremented twice when the same inode is opened twice
with O_RDONLY and O_WRONLY respectively.

To reproduce, run this python script, then check the metrics:

import os
for _ in range(10000):
    fd_r = os.open('a', os.O_RDONLY)
    fd_w = os.open('a', os.O_WRONLY)
    os.close(fd_r)
    os.close(fd_w)

Fixes: 1dd8d47081 ("ceph: metrics for opened files, pinned caps and opened inodes")
Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-12-01 17:08:26 +01:00
Linus Torvalds
0ecca62beb One notable change here is that async creates and unlinks introduced
in 5.7 are now enabled by default.  This should greatly speed up things
 like rm, tar and rsync.  To opt out, wsync mount option can be used.
 
 Other than that we have a pile of bug fixes all across the filesystem
 from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from
 Luis.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmGORY8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi5XmB/0SRQTW+BRAhYvSD/Ib2/c6uVZGnmJU
 MUCO/uuD4fZvfxyMVb3qnAzHIGh3YlFWa/dgSZNStvLmY0L9P0MIvyaYolVuC4Tu
 7llX1I+yckTns9VmiULBNy9D812eRY282nMbRzikMGPO1eb6Yqo3r50AdTvkam/R
 Qs5pfwRqLerbP7VUv4vSsrBflwVyHOrCFZaUUVTu4f2kKz/FzZd/FCw5VV51KYzq
 ygN+8eFotf5zluMX0tl0FXVgJ13N9vbg+YNxyOzHmOAV0AhSmmtV8vJ+j+m7+sOj
 b7wNl5AuI5Ogeg8PHSGsXOHBVn6IbhdGDcougEVL+1gHkxxOQ5hfBHWQ
 =22Vd
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "One notable change here is that async creates and unlinks introduced
  in 5.7 are now enabled by default. This should greatly speed up things
  like rm, tar and rsync. To opt out, wsync mount option can be used.

  Other than that we have a pile of bug fixes all across the filesystem
  from Jeff, Xiubo and Kotresh and a metrics infrastructure rework from
  Luis"

* tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client:
  ceph: add a new metric to keep track of remote object copies
  libceph, ceph: move ceph_osdc_copy_from() into cephfs code
  ceph: clean-up metrics data structures to reduce code duplication
  ceph: split 'metric' debugfs file into several files
  ceph: return the real size read when it hits EOF
  ceph: properly handle statfs on multifs setups
  ceph: shut down mount on bad mdsmap or fsmap decode
  ceph: fix mdsmap decode when there are MDS's beyond max_mds
  ceph: ignore the truncate when size won't change with Fx caps issued
  ceph: don't rely on error_string to validate blocklisted session.
  ceph: just use ci->i_version for fscache aux info
  ceph: shut down access to inode when async create fails
  ceph: refactor remove_session_caps_cb
  ceph: fix auth cap handling logic in remove_session_caps_cb
  ceph: drop private list from remove_session_caps_cb
  ceph: don't use -ESTALE as special return code in try_get_cap_refs
  ceph: print inode numbers instead of pointer values
  ceph: enable async dirops by default
  libceph: drop ->monmap and err initialization
  ceph: convert to noop_direct_IO
2021-11-13 11:31:07 -08:00
David Howells
78525c74d9 netfs, 9p, afs, ceph: Use folios
Convert the netfs helper library to use folios throughout, convert the 9p
and afs filesystems to use folios in their file I/O paths and convert the
ceph filesystem to use just enough folios to compile.

With these changes, afs passes -g quick xfstests.

Changes
=======
ver #5:
 - Got rid of folio_end{io,_read,_write}() and inlined the stuff it does
   instead (Willy decided he didn't want this after all).

ver #4:
 - Fixed a bug in afs_redirty_page() whereby it didn't set the next page
   index in the loop and returned too early.
 - Simplified a check in v9fs_vfs_write_folio_locked()[1].
 - Undid a change to afs_symlink_readpage()[1].
 - Used offset_in_folio() in afs_write_end()[1].
 - Changed from using page_endio() to folio_end{io,_read,_write}()[1].

ver #2:
 - Add 9p foliation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: kafs-testing@auristor.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/YYKa3bfQZxK5/wDN@casper.infradead.org/ [1]
Link: https://lore.kernel.org/r/2408234.1628687271@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162877311459.3085614.10601478228012245108.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162981153551.1901565.3124454657133703341.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163005745264.2472992.9852048135392188995.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163584187452.4023316.500389675405550116.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163649328026.309189.1124218109373941936.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163657852454.834781.9265101983152100556.stgit@warthog.procyon.org.uk/ # v5
2021-11-10 21:16:56 +00:00
Luís Henriques
c02cb7bdc4 ceph: add a new metric to keep track of remote object copies
This patch adds latency and size metrics for remote object copies
operations ("copyfrom").  For now, these metrics will be available on the
client only, they won't be sent to the MDS.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Luís Henriques
aca39d9e86 libceph, ceph: move ceph_osdc_copy_from() into cephfs code
This patch moves ceph_osdc_copy_from() function out of libceph code into
cephfs.  There are no other users for this function, and there is the need
(in another patch) to access internal ceph_osd_request struct members.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Luís Henriques
17e9fc9fca ceph: clean-up metrics data structures to reduce code duplication
This patch modifies struct ceph_client_metric so that each metric block
(read, write and metadata) becomes an element in a array.  This allows to
also re-write the helper functions that handle these blocks, making them
simpler and, above all, reduce the amount of copy&paste every time a new
metric is added.

Thus, for each of these metrics there will be a new struct ceph_metric
entry that'll will contain all the sizes and latencies fields (and a lock).
Note however that the metadata metric doesn't really use the size_fields,
and thus this metric won't be shown in the debugfs '../metrics/size' file.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Luís Henriques
cbed4ff76b ceph: split 'metric' debugfs file into several files
Currently, all the metrics are grouped together in a single file, making
it difficult to process this file from scripts.  Furthermore, as new
metrics are added, processing this file will become even more challenging.

This patch turns the 'metric' file into a directory that will contain
several files, one for each metric.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Xiubo Li
c3d8e0b5de ceph: return the real size read when it hits EOF
Currently, if the sync read handler ends up reading more from the last
object in the file than the i_size indicates, then it'll end up
returning the wrong length. Ensure that we cap the returned length and
pos at the EOF.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Jeff Layton
8cfc0c7ed3 ceph: properly handle statfs on multifs setups
ceph_statfs currently stuffs the cluster fsid into the f_fsid field.
This was fine when we only had a single filesystem per cluster, but now
that we have multiples we need to use something that will vary between
them.

Change ceph_statfs to xor each 32-bit chunk of the fsid (aka cluster id)
into the lower bits of the statfs->f_fsid. Change the lower bits to hold
the fscid (filesystem ID within the cluster).

That should give us a value that is guaranteed to be unique between
filesystems within a cluster, and should minimize the chance of
collisions between mounts of different clusters.

URL: https://tracker.ceph.com/issues/52812
Reported-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Jeff Layton
631ed4b082 ceph: shut down mount on bad mdsmap or fsmap decode
As Greg pointed out, if we get a mangled mdsmap or fsmap, then something
has gone very wrong, and we should avoid doing any activity on the
filesystem.

When this occurs, shut down the mount the same way we would with a
forced umount by calling ceph_umount_begin when decoding fails on either
map. This causes most operations done against the filesystem to return
an error. Any dirty data or caps in the cache will be dropped as well.

The effect is not reversible, so the only remedy is to umount.

[ idryomov: print fsmap decoding error ]

URL: https://tracker.ceph.com/issues/52303
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Xiubo Li
0e24421ac4 ceph: fix mdsmap decode when there are MDS's beyond max_mds
If the max_mds is decreased in a cephfs cluster, there is a window
of time before the MDSs are removed. If a map goes out during this
period, the mdsmap may show the decreased max_mds but still shows
those MDSes as in or in the export target list.

Ensure that we don't fail the map decode in that case.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52436
Fixes: d517b3983d ("ceph: reconnect to the export targets on new mdsmaps")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Xiubo Li
e90334e89b ceph: ignore the truncate when size won't change with Fx caps issued
If the new size is the same as the current size, the MDS will do nothing
but change the mtime/atime. POSIX doesn't mandate that the filesystems
must update them in this case, so just ignore it instead.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:52 +01:00
Kotresh HR
e1c9788cb3 ceph: don't rely on error_string to validate blocklisted session.
The "error_string" in the metadata of MClientSession is being
parsed by kclient to validate whether the session is blocklisted.
The "error_string" is for humans and shouldn't be relied on it.
Hence added the flag to MClientsession to indicate the session
is blocklisted.

[ jlayton: minor formatting cleanup ]

URL: https://tracker.ceph.com/issues/47450
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
25b7351161 ceph: just use ci->i_version for fscache aux info
If the i_version regresses, then it's likely that the mtime will do the
same in lockstep with it. There's no need to track both here, just use
the i_version counter since it's just as good and gets the aux size down
to 64 bits.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
5d6451b148 ceph: shut down access to inode when async create fails
Add proper error handling for when an async create fails. The inode
never existed, so any dirty caps or data are now toast. We already
d_drop the dentry in that case, but the now-stale inode may still be
around. We want to shut down access to these inodes, and ensure that
they can't harbor any more dirty data, which can cause problems at
umount time.

When this occurs, flag such inodes as being SHUTDOWN, and trash any caps
and cap flushes that may be in flight for them, and invalidate the
pagecache for the inode. Add a new helper that can check whether an
inode or an entire mount is now shut down, and call it instead of
accessing the mount_state directly in places where we test that now.

URL: https://tracker.ceph.com/issues/51279
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
36e6da987e ceph: refactor remove_session_caps_cb
Move remove_capsnaps to caps.c. Move the part of remove_session_caps_cb
under i_ceph_lock into a separate function that lives in caps.c. Have
remove_session_caps_cb call the new helper after taking the lock.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
3c3050267e ceph: fix auth cap handling logic in remove_session_caps_cb
The existing logic relies on ci->i_auth_cap being NULL, but if we end up
removing the auth cap early, then we'll do a lot of useless work and
lock-taking on the remaining caps. Ensure that we only do the auth cap
removal when we're _actually_ removing the auth cap.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
c35cac610a ceph: drop private list from remove_session_caps_cb
This function does a lot of list-shuffling with cap flushes, all to
avoid possibly freeing a slab allocation under spinlock (which is
totally ok).  Simplify the code by just detaching and freeing the cap
flushes in place.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
8006daff5f ceph: don't use -ESTALE as special return code in try_get_cap_refs
In some cases, we may want to return -ESTALE if it ends up that we're
dealing with an inode that no longer exists. Switch to using -EUCLEAN as
the "special" error return.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
6407fbb9c3 ceph: print inode numbers instead of pointer values
We have a lot of log messages that print inode pointer values. This is
of dubious utility. Switch a random assortment of the ones I've found
most useful to use ceph_vinop to print the snap:inum tuple instead.

[ idryomov: use . as a separator, break unnecessarily long lines ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
f7a67b463f ceph: enable async dirops by default
Async dirops have been supported in mainline kernels for quite some time
now, and we've recently (as of June) started doing regular testing in
teuthology with '-o nowsync'. There were a few issues, but we've sorted
those out now.

Enable async dirops by default, and change /proc/mounts to show "wsync"
when they are disabled rather than "nowsync" when they are enabled.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Jeff Layton
9c43ff4490 ceph: convert to noop_direct_IO
We have our own op, but the WARN_ON is not terribly helpful, and it's
otherwise identical to the noop one. Just use that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08 03:29:51 +01:00
Linus Torvalds
cdab10bf32 selinux/stable-5.16 PR 20211101
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmGANbAUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNaMBAAg+9gZr0F7xiafu8JFZqZfx/AQdJ2
 G2cn3le+/tXGZmF8m/+82lOaR6LeQLatgSDJNSkXWkKr0nRwseQJDbtRfvYJdn0t
 Ax05/Fmz6OGxQ2wgRYgaFiSrKpE5p3NhDtiLFVdkCJaQNe/8DZOc7NhBl6EjZf3x
 ubhl2hUiJ4AmiXGwcYhr4uKgP4nhW8OM1/OkskVi+bBMmLA8KTY9kslmIDP5E3BW
 29W4qhqeLNQupY5dGMEMVcyxY9ZUWpO39q4uOaQVZrUGE7xABkj/jhnxT5gFTSlI
 pu8VhsYXm9KuRVveIsv0L5SZfadwoM9YAl7ki1wD3W5rHqOAte3rBTm6VmNlQwfU
 MqxP65Jiyxudxet5Be3/dCRH/+MDQuwBxivgmZXbeVxor2SeznVb0GDaEUC5FSHu
 CJIgWtQzsPJMxgAEGXN4F3QGP0htTTJni56GUPOsrf4TIBW02TT+oLTLFRIokQQL
 INNOfwVSRXElnCsvxsHR4oB+JZ9pJyBaAmeupcQ6jmcKiWlbLj4s+W0U0pM5h91v
 hmMpz7KMxrX6gVL4gB2Jj4aN3r5YRbq26NBu6D+wdwwBTeTTocaHSpAqkv4buClf
 uNk3cG8Hkp8TTg9cM8jYgpxMyzKH/AI/Uw3VhEa1xCiq2Ck3DgfnZvnvcRRaZevU
 FPgmwgqePJXGi60=
 =sb8J
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:

 - Add LSM/SELinux/Smack controls and auditing for io-uring.

   As usual, the individual commit descriptions have more detail, but we
   were basically missing two things which we're adding here:

      + establishment of a proper audit context so that auditing of
        io-uring ops works similarly to how it does for syscalls (with
        some io-uring additions because io-uring ops are *not* syscalls)

      + additional LSM hooks to enable access control points for some of
        the more unusual io-uring features, e.g. credential overrides.

   The additional audit callouts and LSM hooks were done in conjunction
   with the io-uring folks, based on conversations and RFC patches
   earlier in the year.

 - Fixup the binder credential handling so that the proper credentials
   are used in the LSM hooks; the commit description and the code
   comment which is removed in these patches are helpful to understand
   the background and why this is the proper fix.

 - Enable SELinux genfscon policy support for securityfs, allowing
   improved SELinux filesystem labeling for other subsystems which make
   use of securityfs, e.g. IMA.

* tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  security: Return xattr name from security_dentry_init_security()
  selinux: fix a sock regression in selinux_ip_postroute_compat()
  binder: use cred instead of task for getsecid
  binder: use cred instead of task for selinux checks
  binder: use euid from cred instead of using task
  LSM: Avoid warnings about potentially unused hook variables
  selinux: fix all of the W=1 build warnings
  selinux: make better use of the nf_hook_state passed to the NF hooks
  selinux: fix race condition when computing ocontext SIDs
  selinux: remove unneeded ipv6 hook wrappers
  selinux: remove the SELinux lockdown implementation
  selinux: enable genfscon labeling for securityfs
  Smack: Brutalist io_uring support
  selinux: add support for the io_uring access controls
  lsm,io_uring: add LSM hooks to io_uring
  io_uring: convert io_uring to the secure anon inode interface
  fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()
  audit: add filtering for io_uring records
  audit,io_uring,io-wq: add some basic audit support to io_uring
  audit: prepare audit_context for use in calling contexts beyond syscalls
2021-11-01 21:06:18 -07:00
Linus Torvalds
b6773cdb0e for-5.16/ki_complete-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MOUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmeqEACrayLMDMdlb1FduTYw29QAL7XxS375r92T
 bwLippmKQIFNi8p5ScHraelV5ixgxse2j68MexlQHpl9aHIn/oL7qHACIMgDP05m
 KaSy8Hr2abqr+zz+rLMhkm21zAva6aWjQu7NoEjBE4dC5L4l9p885LaA+jmqQUno
 1wvpaEcype8cITJ+sSCb3kD6nZx7y1Lt5zEefUfk6ruMm9x9FwvU6uc4rIHi+Zve
 Hwo8yGbTvlU8rGSi9naC/U8pIZ4bqEuTAcV5VHNrWG+b4aA/aFPpSjpIiSBZSXo0
 HXa+jmcr6gkejfPeOZkBbRub6Fm9Wq2pDAZskPWFX6zyX0pIV05GjJ2J/ba8rovn
 QrcfxaBv8XitKgrjFZeR0ZBqD2iJjPA/Yq5/r1ZmZ0wSHI3W4UuTGhQYEPyDLceH
 ZWq/wcfVFek4kAoCxCqy9kWiOujY90WWKQW3yD7b8FPZ0d+/R1Mn+drlYaSKN1Pk
 /9/+z1DaLtBWbJ2G+BQ9oUkYmNSapAiYc2YXVss86hmhLX+prFtSj3zECZUvhyAz
 b42A2DVsjU+65yT2zdPBXlMrbI91qNnvIXcz5szNdTfHTn9FiLQb4BffMV0FHT3g
 vap8N3Rb8UkZ3v4NCVAtlfcGr0kvYHQH+Qgh6oAlXB4NQoKJCVadzpTFPMWjx788
 oHBUjA0UTQ==
 =4vl/
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block

Pull kiocb->ki_complete() cleanup from Jens Axboe:
 "This removes the res2 argument from kiocb->ki_complete().

  Only the USB gadget code used it, everybody else passes 0. The USB
  guys checked the user gadget code they could find, and everybody just
  uses res as expected for the async interface"

* tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block:
  fs: get rid of the res2 iocb->ki_complete argument
  usb: remove res2 argument from gadget code completions
2021-11-01 10:17:11 -07:00
Linus Torvalds
9ac211426f File locking changes for v5.16
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmF77NQTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFfP9D/wN8rCAPA2J6SpBjdJXSpSQS4PoAOqC
 Z002bOc7sq/zg2cWk+pX1aOR/+wUpk+PvaQdyvfO+o4TVCpsTOklRh/yfYuOkJdP
 PoINUR7vb43/CGqd04YI45+pxOFMJk9JoLkNha0uIY4ukXdt9mA6u/+qBEDboyDQ
 Jbn1JGitRc9WYaE7BV26ba0l+Deb5h2/4c1DiDlsgmLkDPhpowkOznovUCkBnH7H
 bfwlssjZ5P0K5ttZDw6VlkC2xE+Yr56BsEco2bXO42LwUHOx6r6ZNp04rh9zh1Zp
 hFPybgU+ur17EOOmBbCq8aHZqRRcjQQDH/rZ1heHSOfTrEWWth1xNjmeewSgRZHL
 0oMi3yIJPwvuDBQPEQg0VD87k5Z8xbRPql6eM6GeGxDZvzXWqqYKW2OYXtNxG91m
 bGvu2OOGkdF/4WGYBixdjUQb5KjcOqdIFkq3/oHfLQ+cS2uc6eOfnCdxa7cTnTdd
 BcFDgZmWQDLFs6/DIbwUI0KWMAiLFMZnZACE937JvlE74EGiHu47JMzwcU15J6zO
 VD0Oq0XsWQN+TgY2RnjuTFqm6DTpbrkgw88sNDr5g3jZbgJZiZ/r/3M55lcBVWvk
 PFT4fjKhD1mT6/SscAAmOxUKUeDbN7EktiRsZmH9C2sUCERufDb/cmY/RYZ00C4t
 01ovPjs7VukS/w==
 =bcaq
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "Most of this is just follow-on cleanup work of documentation and
  comments from the mandatory locking removal in v5.15.

  The only real functional change is that LOCK_MAND flock() support is
  also being removed, as it has basically been non-functional since the
  v2.5 days"

* tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove leftover comments from mandatory locking removal
  locks: remove changelog comments
  docs: fs: locks.rst: update comment about mandatory file locking
  Documentation: remove reference to now removed mandatory-locking doc
  locks: remove LOCK_MAND flock lock support
2021-11-01 09:06:53 -07:00
Jens Axboe
6b19b766e8 fs: get rid of the res2 iocb->ki_complete argument
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.

Now that everybody is passing in zero, kill off the extra argument.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 10:36:24 -06:00
Vivek Goyal
15bf32398a security: Return xattr name from security_dentry_init_security()
Right now security_dentry_init_security() only supports single security
label and is used by SELinux only. There are two users of this hook,
namely ceph and nfs.

NFS does not care about xattr name. Ceph hardcodes the xattr name to
security.selinux (XATTR_NAME_SELINUX).

I am making changes to fuse/virtiofs to send security label to virtiofsd
and I need to send xattr name as well. I also hardcoded the name of
xattr to security.selinux.

Stephen Smalley suggested that it probably is a good idea to modify
security_dentry_init_security() to also return name of xattr so that
we can avoid this hardcoding in the callers.

This patch adds a new parameter "const char **xattr_name" to
security_dentry_init_security() and LSM puts the name of xattr
too if caller asked for it (xattr_name != NULL).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
[PM: fixed typos in the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-20 08:17:08 -04:00
Jeff Layton
1bd85aa65d ceph: fix handling of "meta" errors
Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.

We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).

There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.

Given that, we can just remove i_meta_err, and convert the places that
used it to instead just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-10-19 09:36:06 +02:00
Jeff Layton
98d0a6fb73 ceph: skip existing superblocks that are blocklisted or shut down when mounting
Currently when mounting, we may end up finding an existing superblock
that corresponds to a blocklisted MDS client. This means that the new
mount ends up being unusable.

If we've found an existing superblock with a client that is already
blocklisted, and the client is not configured to recover on its own,
fail the match. Ditto if the superblock has been forcibly unmounted.

While we're in here, also rename "other" to the more conventional "fsc".

Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-10-19 09:36:06 +02:00
Dan Carpenter
708c87168b ceph: fix off by one bugs in unsafe_request_wait()
The "> max" tests should be ">= max" to prevent an out of bounds access
on the next lines.

Fixes: e1a4541ec0 ("ceph: flush the mdlog before waiting on unsafe reqs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-21 17:39:20 +02:00
Jeff Layton
90f7d7a0d0 locks: remove LOCK_MAND flock lock support
As best I can tell, the logic for these has been broken for a long time
(at least before the move to git), such that they never conflict with
anything. Also, nothing checks for these flags and prevented opens or
read/write behavior on the files. They don't seem to do anything.

Given that, we can rip these symbols out of the kernel, and just make
flock(2) return 0 when LOCK_MAND is set in order to preserve existing
behavior.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-09-10 16:21:44 -04:00
Linus Torvalds
8a05abd0c9 We have:
- a set of patches to address fsync stalls caused by depending on
   periodic rather than triggered MDS journal flushes in some cases
   (Xiubo Li)
 
 - a fix for mtime effectively not getting updated in case of competing
   writers (Jeff Layton)
 
 - a couple of fixes for inode reference leaks and various WARNs after
   "umount -f" (Xiubo Li)
 
 - a new ceph.auth_mds extended attribute (Jeff Layton)
 
 - a smattering of fixups and cleanups from Jeff, Xiubo and Colin.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmE46mYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi9UEB/sGT4eqMzkQLzJ2XjpKUvxXaJNVdPvS
 Jmg26KV5wc9Y9v6L7ww/eQjxbTOnda3G2/XG0xiE8dC1vq54Vux/FKiAT+H2/z/9
 onShFK+SARoF4DilKnY0JNCwcGxQ3FjWAgPqPKqAyTAX2wjVxDKFHB0C+7yhhJay
 wyDrRaaHyFc4TwHeiEi8xU7dB55XsvxWGUgnHbcOLyUbbBKddt98FadNZ2t9b76y
 EVwAxgY0RbUUFxOJ9VVjiaNLUP4532iXUn+fehMjRGmDCmjaLNxCrsq6d0p//LJV
 nhVRG+Mv8IfTjqZwFbnWV8xbGwX0lY+g+hn0cdi7urUH3GDa97vmJF3u
 =z6dR
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - a set of patches to address fsync stalls caused by depending on
   periodic rather than triggered MDS journal flushes in some cases
   (Xiubo Li)

 - a fix for mtime effectively not getting updated in case of competing
   writers (Jeff Layton)

 - a couple of fixes for inode reference leaks and various WARNs after
   "umount -f" (Xiubo Li)

 - a new ceph.auth_mds extended attribute (Jeff Layton)

 - a smattering of fixups and cleanups from Jeff, Xiubo and Colin.

* tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client:
  ceph: fix dereference of null pointer cf
  ceph: drop the mdsc_get_session/put_session dout messages
  ceph: lockdep annotations for try_nonblocking_invalidate
  ceph: don't WARN if we're forcibly removing the session caps
  ceph: don't WARN if we're force umounting
  ceph: remove the capsnaps when removing caps
  ceph: request Fw caps before updating the mtime in ceph_write_iter
  ceph: reconnect to the export targets on new mdsmaps
  ceph: print more information when we can't find snaprealm
  ceph: add ceph_change_snap_realm() helper
  ceph: remove redundant initializations from mdsc and session
  ceph: cancel delayed work instead of flushing on mdsc teardown
  ceph: add a new vxattr to return auth mds for an inode
  ceph: remove some defunct forward declarations
  ceph: flush the mdlog before waiting on unsafe reqs
  ceph: flush mdlog before umounting
  ceph: make iterate_sessions a global symbol
  ceph: make ceph_create_session_msg a global symbol
  ceph: fix comment about short copies in ceph_write_end
  ceph: fix memory leak on decode error in ceph_handle_caps
2021-09-08 15:50:32 -07:00
Colin Ian King
05a444d3f9 ceph: fix dereference of null pointer cf
Currently in the case where kmem_cache_alloc fails the null pointer
cf is dereferenced when assigning cf->is_capsnap = false. Fix this
by adding a null pointer check and return path.

Cc: stable@vger.kernel.org
Addresses-Coverity: ("Dereference null return")
Fixes: b2f9fa1f3b ("ceph: correctly handle releasing an embedded cap flush")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-03 10:55:51 +02:00
Jeff Layton
9f3589993c ceph: drop the mdsc_get_session/put_session dout messages
These are very chatty, racy, and not terribly useful. Just remove them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
3eaf5aa1cf ceph: lockdep annotations for try_nonblocking_invalidate
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
a76d0a9c28 ceph: don't WARN if we're forcibly removing the session caps
For example in the case of a forced umount, we'll remove all the session
caps even if they are dirty. Move the warning to a wrapper function and
make most of the callers use it. Call the core function when removing
caps due to a forced umount.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
42ad631b4d ceph: don't WARN if we're force umounting
Force umount will try to close the sessions by setting the session
state to _CLOSING. We don't want to WARN in this situation, since it's
expected.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
a6d37ccdd2 ceph: remove the capsnaps when removing caps
capsnaps will take inode references via ihold when queueing to flush.
When force unmounting, the client will just close the sessions and
may never get a flush reply, causing a leak and inode ref leak.

Fix this by removing the capsnaps for an inode when removing the caps.

URL: https://tracker.ceph.com/issues/52295
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
b11ed50346 ceph: request Fw caps before updating the mtime in ceph_write_iter
The current code will update the mtime and then try to get caps to
handle the write. If we end up having to request caps from the MDS, then
the mtime in the cap grant will clobber the updated mtime and it'll be
lost.

This is most noticable when two clients are alternately writing to the
same file. Fw caps are continually being granted and revoked, and the
mtime ends up stuck because the updated mtimes are always being
overwritten with the old one.

Fix this by changing the order of operations in ceph_write_iter to get
the caps before updating the times. Also, make sure we check the pool
full conditions before even getting any caps or uninlining.

URL: https://tracker.ceph.com/issues/46574
Reported-by: Jozef Kováč <kovac@firma.zoznam.sk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
d517b3983d ceph: reconnect to the export targets on new mdsmaps
In the case where the export MDS has crashed just after the EImportStart
journal is flushed, a standby MDS takes over for it and when replaying
the EImportStart journal the MDS will wait the client to reconnect. That
may never happen because the client may not have registered or opened
the sessions yet.

When receiving a new map, ensure we reconnect to valid export targets as
well if their sessions don't exist yet.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
692e171597 ceph: print more information when we can't find snaprealm
Print a bit more information when we can't find the realm during
ceph_add_cap. Show both the inode number and the old realm inode
number.

Suggested-by: Sage Weil <sage@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
0ba92e1c5f ceph: add ceph_change_snap_realm() helper
Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
c80dc3aee9 ceph: remove redundant initializations from mdsc and session
The ceph_mds_client and ceph_mds_session structures are kzalloc'ed so
there's no need to explicitly initialize either of their fields to 0.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
b4002173b7 ceph: cancel delayed work instead of flushing on mdsc teardown
The first thing metric_delayed_work does is check mdsc->stopping,
and then return immediately if it's set. That's good since we would
have already torn down the metric structures at this point, otherwise,
but there is no locking around mdsc->stopping.

It's possible that the ceph_metric_destroy call could race with the
delayed_work, in which case we could end up with the delayed_work
accessing destroyed percpu variables.

At this point in the mdsc teardown, the "stopping" flag has already been
set, so there's no benefit to flushing the work. Move the work
cancellation in ceph_metric_destroy ahead of the percpu variable
destruction, and eliminate the flush_delayed_work call in
ceph_mdsc_destroy.

Fixes: 18f473b384 ("ceph: periodically send perf metrics to MDSes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
40e309de4d ceph: add a new vxattr to return auth mds for an inode
Add a new vxattr that shows what MDS is authoritative for an inode (if
we happen to have auth caps). If we don't have an auth cap for the inode
then just return -1.

URL: https://tracker.ceph.com/issues/1276
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
49f8899e5e ceph: remove some defunct forward declarations
We missed these in the recent fscache rework.

Reported-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
e1a4541ec0 ceph: flush the mdlog before waiting on unsafe reqs
For the client requests who will have unsafe and safe replies from
MDS daemons, in the MDS side the MDS daemons won't flush the mdlog
(journal log) immediatelly, because they think it's unnecessary.
That's true for most cases but not all, likes the fsync request.
The fsync will wait until all the unsafe replied requests to be
safely replied.

Normally if there have multiple threads or clients are running, the
whole mdlog in MDS daemons could be flushed in time if any request
will trigger the mdlog submit thread. So usually we won't experience
the normal operations will stuck for a long time. But in case there
has only one client with only thread is running, the stuck phenomenon
maybe obvious and the worst case it must wait at most 5 seconds to
wait the mdlog to be flushed by the MDS's tick thread periodically.

This patch will trigger to flush the mdlog in the relevant and auth
MDSes to which the in-flight requests are sent just before waiting
the unsafe requests to finish.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
d095559ce4 ceph: flush mdlog before umounting
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
59b312f362 ceph: make iterate_sessions a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
fba97e8025 ceph: make ceph_create_session_msg a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
ce3a8732ae ceph: fix comment about short copies in ceph_write_end
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
2ad32cf09b ceph: fix memory leak on decode error in ceph_handle_caps
If we hit a decoding error late in the frame, then we might exit the
function without putting the pool_ns string. Ensure that we always put
that reference on the way out of the function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Linus Torvalds
815409a12c overlayfs update for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ
 PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2
 aF0NomiXtJccE+S9+byjVCyqSzQJGQQ=
 =6L2Y
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

 - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

 - Improve performance by enabling RCU lookup.

 - Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument.  The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: enable RCU'd ->get_acl()
  vfs: add rcu argument to ->get_acl() callback
  ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
  ovl: use kvalloc in xattr copy-up
  ovl: update ctime when changing fileattr
  ovl: skip checking lower file's i_writecount on truncate
  ovl: relax lookup error on mismatch origin ftype
  ovl: do not set overlay.opaque for new directories
  ovl: add ovl_allow_offline_changes() helper
  ovl: disable decoding null uuid with redirect_dir
  ovl: consistent behavior for immutable/append-only inodes
  ovl: copy up sync/noatime fileattr flags
  ovl: pass ovl_fs to ovl_check_setxattr()
  fs: add generic helper for filling statx attribute flags
2021-09-02 09:21:27 -07:00
Linus Torvalds
6f01c935d9 File locking changes for v5.15.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEo3WcTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFbnuD/9Vj3dnXRgZ9LHFuQHp5Vy5yaGfujwP
 kIMN7bJiL0pCZ1zHapyn0cidZuTScDK2qMzGVR9Wrj/uYHEI+T08IEfaq/2CU3pS
 PdygHgqQi+WtJj4dks1OphVdlF0+46B/zy2YY0LE39zFUDO38VUIgb/gTiQ8JHIr
 BaDGtohlXmK8bDXo2XqMunSp7uQoRX92mv+bPjvIpsM60RnhPzhX0HeO/xhXO1PP
 OpswQG3nOTX1alZOXuSKdH7e3zfO+pLnC5NYeo5R4rys5xRiShU19OSAJfiPwf2w
 06OX7uLT7aF5LPQPjOP4MA9dHPWWeIuYnKzUpQZZ+D+zjuaxhF62saJG1Um0yQns
 qqPZgTWhxzkJ16bj94T2UZW0bufFKM5cEfBPFw9ZyzQshX5jIE2V5ecV+F/AdtTj
 EA55m/5N8NZMsNgnEOqn6o8TkWoV1seHKJXqja0+ZY9/Xs8GkmjHLhjNBCwGZIbw
 8S64Szp+yex3m1mRS+L6T0Gudic5WdFxOKQAzvvJcmT6u+ZGBzU0FjNTNNdnu3tp
 f9mANYxT1vNfYplRuNhMCp9oioSRpO2vfDQLCBRtsCy0QlKCum6qb7E/BcWRlwn1
 SOAFmjeXpr+VXGX+Rjp/bzCdHlNOVbR4ODWJ+h5D7QZaQwQ6FafkEq1D2yjPR0OR
 i722ECoM++vZ4w==
 =CfcC
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "This starts with a couple of fixes for potential deadlocks in the
  fowner/fasync handling.

  The next patch removes the old mandatory locking code from the kernel
  altogether.

  The last patch cleans up rw_verify_area a bit more after the mandatory
  locking removal"

* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: clean up after mandatory file locking support removal
  fs: remove mandatory file locking support
  fcntl: fix potential deadlock for &fasync_struct.fa_lock
  fcntl: fix potential deadlocks for &fown_struct.lock
2021-08-30 12:38:13 -07:00
Linus Torvalds
aa99f3c2b9 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
 QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
 yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
 ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
 iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
 hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
 sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
 =pDBR
 -----END PGP SIGNATURE-----

Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fs hole punching vs cache filling race fixes from Jan Kara:
 "Fix races leading to possible data corruption or stale data exposure
  in multiple filesystems when hole punching races with operations such
  as readahead.

  This is the series I was sending for the last merge window but with
  your objection fixed - now filemap_fault() has been modified to take
  invalidate_lock only when we need to create new page in the page cache
  and / or bring it uptodate"

* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  filesystems/locking: fix Malformed table warning
  cifs: Fix race between hole punch and page fault
  ceph: Fix race between hole punch and page fault
  fuse: Convert to using invalidate_lock
  f2fs: Convert to using invalidate_lock
  zonefs: Convert to using invalidate_lock
  xfs: Convert double locking of MMAPLOCK to use VFS helpers
  xfs: Convert to use invalidate_lock
  xfs: Refactor xfs_isilocked()
  ext2: Convert to using invalidate_lock
  ext4: Convert to use mapping->invalidate_lock
  mm: Add functions to lock invalidate_lock for two mappings
  mm: Protect operations adding pages to page cache with invalidate_lock
  documentation: Sync file_operations members with reality
  mm: Fix comments mentioning i_mutex
2021-08-30 10:24:50 -07:00
Tuo Li
a9e6ffbc5b ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
kcalloc() is called to allocate memory for m->m_info, and if it fails,
ceph_mdsmap_destroy() behind the label out_err will be called:
  ceph_mdsmap_destroy(m);

In ceph_mdsmap_destroy(), m->m_info is dereferenced through:
  kfree(m->m_info[i].export_targets);

To fix this possible null-pointer dereference, check m->m_info before the
for loop to free m->m_info[i].export_targets.

[ jlayton: fix up whitespace damage
	   only kfree(m->m_info) if it's non-NULL ]

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-25 16:34:11 +02:00
Xiubo Li
b2f9fa1f3b ceph: correctly handle releasing an embedded cap flush
The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.

When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads.  It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-25 16:34:11 +02:00
Jeff Layton
f7e33bdbd6 fs: remove mandatory file locking support
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-23 06:15:36 -04:00
Miklos Szeredi
0cad624662 vfs: add rcu argument to ->get_acl() callback
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-18 22:08:24 +02:00
Jeff Layton
8434ffe71c ceph: take snap_empty_lock atomically with snaprealm refcount change
There is a race in ceph_put_snap_realm. The change to the nref and the
spinlock acquisition are not done atomically, so you could decrement
nref, and before you take the spinlock, the nref is incremented again.
At that point, you end up putting it on the empty list when it
shouldn't be there. Eventually __cleanup_empty_realms runs and frees
it when it's still in-use.

Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
and just drop the spinlock if we can get the rwsem.

Because these objects can also undergo a 0->1 refcount transition, we
must protect that change as well with the spinlock. Increment locklessly
unless the value is at 0, in which case we take the spinlock, increment
and then take it off the empty list if it did the 0->1 transition.

With these changes, I'm removing the dout() messages from these
functions, as well as in __put_snap_realm. They've always been racy, and
it's better to not print values that may be misleading.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46419
Reported-by: Mark Nelson <mnelson@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04 19:20:29 +02:00
Luis Henriques
bf2ba43221 ceph: reduce contention in ceph_check_delayed_caps()
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list.  This may result in the
watchdog tainting the kernel with the softlockup flag.

This patch breaks this loop if the caps have been recently (i.e. during
the loop execution).  Any new caps added to the list will be handled in
the next run.

Also, allow schedule_delayed() callers to explicitly set the delay value
instead of defaulting to 5s, so we can ensure that it runs soon
afterward if it looks like there is more work.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46284
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04 19:20:05 +02:00
Luis Henriques
cdb330f4b4 ceph: don't WARN if we're still opening a session to an MDS
If MDSs aren't available while mounting a filesystem, the session state
will transition from SESSION_OPENING to SESSION_CLOSING.  And in that
scenario check_session_state() will be called from delayed_work() and
trigger this WARN.

Avoid this by only WARNing after a session has already been established
(i.e., the s_ttl will be different from 0).

Fixes: 62575e270f ("ceph: check session state after bumping session->s_seq")
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-07-20 17:57:33 +02:00
Jan Kara
057ba5b245 ceph: Fix race between hole punch and page fault
Ceph has a following race between hole punching and page fault:

CPU1                                  CPU2
ceph_fallocate()
  ...
  ceph_zero_pagecache_range()
                                      ceph_filemap_fault()
                                        faults in page in the range being
                                        punched
  ceph_zero_objects()

And now we have a page in punched range with invalid data. Fix the
problem by using mapping->invalidate_lock similarly to other
filesystems. Note that using invalidate_lock also fixes a similar race
wrt ->readpage().

CC: Jeff Layton <jlayton@kernel.org>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-07-13 14:29:01 +02:00
Jeff Layton
4c18347238 ceph: take reference to req->r_parent at point of assignment
Currently, we set the r_parent pointer but then don't take a reference
to it until we submit the request. If we end up freeing the req before
that point, then we'll do a iput when we shouldn't.

Instead, take the inode reference in the callers, so that it's always
safe to call ceph_mdsc_put_request on the req, even before submission.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
23c2c76ead ceph: eliminate ceph_async_iput()
Now that we don't need to hold session->s_mutex or the snap_rwsem when
calling ceph_check_caps, we can eliminate ceph_async_iput and just use
normal iput calls.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
7732fe168e ceph: don't take s_mutex in ceph_flush_snaps
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
0449a35222 ceph: don't take s_mutex in try_flush_caps
The s_mutex doesn't protect anything in this codepath.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
6a92b08fda ceph: don't take s_mutex or snap_rwsem in ceph_check_caps
These locks appear to be completely unnecessary. Almost all of this
function is done under the inode->i_ceph_lock, aside from the actual
sending of the message. Don't take either lock in this function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
52d60f8e18 ceph: eliminate session->s_gen_ttl_lock
Turn s_cap_gen field into an atomic_t, and just rely on the fact that we
hold the s_mutex when changing the s_cap_ttl field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
7e65624d32 ceph: allow ceph_put_mds_session to take NULL or ERR_PTR
...to simplify some error paths.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
df2c0cb7f8 ceph: clean up locking annotation for ceph_get_snap_realm and __lookup_snap_realm
They both say that the snap_rwsem must be held for write, but I don't
see any real reason for it, and it's not currently always called that
way.

The lookup is just walking the rbtree, so holding it for read should be
fine there. The "get" is bumping the refcount and (possibly) removing
it from the empty list. I see no need to hold the snap_rwsem for write
for that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Jeff Layton
a6862e6708 ceph: add some lockdep assertions around snaprealm handling
Turn some comments into lockdep asserts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Jeff Layton
f3fd3ea6a2 ceph: decoding error in ceph_update_snap_realm should return -EIO
Currently ceph_update_snap_realm returns -EINVAL when it hits a decoding
error, which is the wrong error code. -EINVAL implies that the user gave
us a bogus argument to a syscall or something similar. -EIO is more
descriptive when we hit a decoding error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Xiubo Li
903f4fec78 ceph: add IO size metrics support
This will collect IO's total size and then calculate the average
size, and also will collect the min/max IO sizes.

The debugfs will show the size metrics in bytes and will let the
userspace applications to switch to what they need.

URL: https://tracker.ceph.com/issues/49913
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Xiubo Li
fc123d5f50 ceph: update and rename __update_latency helper to __update_stdev
The new __update_stdev() helper will only compute the standard
deviation.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Xiubo Li
8ecd34c797 ceph: simplify the metrics struct
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Jeff Layton
4364c6938d ceph: make ceph_queue_cap_snap static
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-28 23:49:25 +02:00
Wei Yongjun
675d4d8997 ceph: make ceph_netfs_read_ops static
The sparse tool complains as follows:

fs/ceph/addr.c:316:37: warning:
 symbol 'ceph_netfs_read_ops' was not declared. Should it be static?

This symbol is not used outside of addr.c, so mark it static.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-28 23:49:24 +02:00
Jeff Layton
22d41cdcd3 ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
The checks for page->mapping are odd, as set_page_dirty is an
address_space operation, and I don't see where it would be called on a
non-pagecache page.

The warning about the page lock also seems bogus.  The comment over
set_page_dirty() says that it can be called without the page lock in
some rare cases. I don't think we want to warn if that's the case.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-28 23:49:24 +02:00
Jeff Layton
7a971e2c07 ceph: fix error handling in ceph_atomic_open and ceph_lookup
Commit aa60cfc3f7 broke the error handling in these functions such
that they don't handle non-ENOENT errors from ceph_mdsc_do_request
properly.

Move the checking of -ENOENT out of ceph_handle_snapdir and into the
callers, and if we get a different error, return it immediately.

Fixes: aa60cfc3f7 ("ceph: don't use d_add in ceph_handle_snapdir")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-22 13:53:28 +02:00
Jeff Layton
27171ae6a0 ceph: must hold snap_rwsem when filling inode for async create
...and add a lockdep assertion for it to ceph_fill_inode().

Cc: stable@vger.kernel.org # v5.7+
Fixes: 9a8d03ca2e ("ceph: attempt to do async create when possible")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-22 13:53:28 +02:00
Linus Torvalds
7ac86b3dca Notable items here are a series to take advantage of David Howells'
netfs helper library from Jeff, three new filesystem client metrics
 from Xiubo, ceph.dir.rsnaps vxattr from Yanhu and two auth-related
 fixes from myself, marked for stable.  Interspersed is a smattering
 of assorted fixes and cleanups across the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmCT8IITHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizgqCACYbyY4Yr/2C8fZsn+P9rd97zRTbcC6
 eufTZwnlECLnc89BxJQRk9a2UpDJfC8RMM3/9tmiulc8G4M+ggVbdFQTCzsZox3c
 vLAunGeVyfKIY+16Bv2RNuoO3KeeZm5aB3jXJ5QcUPcXmd4XnHKI1FU2ebC56UJb
 pxxfHpE6fb59r6Ek1e5uUFyta4KDMrvwXozghuAPEgT1GpKeA9zMIGI0CkQbBHlW
 PWHpcahTiT6GWa/d9ud0CnfssiBxVydWyKTz9xppYC6LNdsZUf9tBmYYGRklcjoA
 yAwPSuqxNmg+7uWubEawc0+a/3fXORgp2SF7Rbp1XYE+HpfnMF1J+nIn
 =IO5c
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Notable items here are

   - a series to take advantage of David Howells' netfs helper library
     from Jeff

   - three new filesystem client metrics from Xiubo

   - ceph.dir.rsnaps vxattr from Yanhu

   - two auth-related fixes from myself, marked for stable.

  Interspersed is a smattering of assorted fixes and cleanups across the
  filesystem"

* tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client: (24 commits)
  libceph: allow addrvecs with a single NONE/blank address
  libceph: don't set global_id until we get an auth ticket
  libceph: bump CephXAuthenticate encoding version
  ceph: don't allow access to MDS-private inodes
  ceph: fix up some bare fetches of i_size
  ceph: convert some PAGE_SIZE invocations to thp_size()
  ceph: support getting ceph.dir.rsnaps vxattr
  ceph: drop pinned_page parameter from ceph_get_caps
  ceph: fix inode leak on getattr error in __fh_to_dentry
  ceph: only check pool permissions for regular files
  ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon
  ceph: avoid counting the same request twice or more
  ceph: rename the metric helpers
  ceph: fix kerneldoc copypasta over ceph_start_io_direct
  ceph: use attach/detach_page_private for tracking snap context
  ceph: don't use d_add in ceph_handle_snapdir
  ceph: don't clobber i_snap_caps on non-I_NEW inode
  ceph: fix fall-through warnings for Clang
  ceph: convert ceph_readpages to ceph_readahead
  ceph: convert ceph_write_begin to netfs_write_begin
  ...
2021-05-06 10:27:02 -07:00
Jeff Layton
d4f6b31d72 ceph: don't allow access to MDS-private inodes
The MDS reserves a set of inodes for its own usage, and these should
never be accessible to clients. Add a new helper to vet a proposed
inode number against that range, and complain loudly and refuse to
create or look it up if it's in it.

Also, ensure that the MDS doesn't try to delegate inodes that are in
that range or lower. Print a warning if it does, and don't save the
range in the xarray.

URL: https://tracker.ceph.com/issues/49922
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
2d6795fbb8 ceph: fix up some bare fetches of i_size
We need to use i_size_read(), which properly handles the torn read
case on 32-bit arches.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
8ff2d290c8 ceph: convert some PAGE_SIZE invocations to thp_size()
Start preparing to allow the use of THPs in the pagecache with ceph by
making it use thp_size() in lieu of PAGE_SIZE in the appropriate places.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Yanhu Cao
e7f7295250 ceph: support getting ceph.dir.rsnaps vxattr
Add support for grabbing the rsnaps value out of the inode info in
traces, and exposing that via ceph.dir.rsnaps xattr.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
e72968e15b ceph: drop pinned_page parameter from ceph_get_caps
All of the existing callers that don't set this to NULL just drop the
page reference at some arbitrary point later in processing. There's no
point in keeping a page reference that we don't use, so just drop the
reference immediately after checking the Uptodate flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
1775c7ddac ceph: fix inode leak on getattr error in __fh_to_dentry
Fixes: 878dabb641 ("ceph: don't return -ESTALE if there's still an open file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00