Commit Graph

249 Commits

Author SHA1 Message Date
Kevin Wolf
d402da1360 file-posix: Fix aio=threads performance regression after enablign FUA
For aio=threads, we're currently not implementing REQ_FUA in any useful
way, but just do a separate raw_co_flush_to_disk() call. This changes
behaviour compared to the old state, which used bdrv_co_flush() with its
optimisations. As a quick fix, call bdrv_co_flush() again like before.
Eventually, we can use pwritev2() to make use of RWF_DSYNC if available,
but we'll still have to keep this code path as a fallback, so this fix
is required either way.

While the fix itself is a one-liner, some new graph locking annotations
are needed to convince TSA that the locking is correct.

Cc: qemu-stable@nongnu.org
Fixes: 984a32f17e ("file-posix: Support FUA writes")
Buglink: https://issues.redhat.com/browse/RHEL-96854
Reported-by: Tingting Mao <timao@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250625085019.27735-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 17:12:35 +02:00
Kevin Wolf
bf627788ef file-posix: Probe paths and retry SG_IO on potential path errors
When scsi-block is used on a host multipath device, it runs into the
problem that the kernel dm-mpath doesn't know anything about SCSI or
SG_IO and therefore can't decide if a SG_IO request returned an error
and needs to be retried on a different path. Instead of getting working
failover, an error is returned to scsi-block and handled according to
the configured error policy. Obviously, this is not what users want,
they want working failover.

QEMU can parse the SG_IO result and determine whether this could have
been a path error, but just retrying the same request could just send it
to the same failing path again and result in the same error.

With a kernel that supports the DM_MPATH_PROBE_PATHS ioctl on dm-mpath
block devices (queued in the device mapper tree for Linux 6.16), we can
tell the kernel to probe all paths and tell us if any usable paths
remained. If so, we can now retry the SG_IO ioctl and expect it to be
sent to a working path.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250522130803.34738-1-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-05-22 17:56:50 +02:00
Stefan Hajnoczi
5634622bcb file-posix: allow BLKZEROOUT with -t writeback
The Linux BLKZEROOUT ioctl is only invoked when BDRV_O_NOCACHE is set
because old kernels did not invalidate the page cache. In that case
mixing BLKZEROOUT with buffered I/O could lead to corruption.

However, Linux 4.9 commit 22dd6d356628 ("block: invalidate the page
cache when issuing BLKZEROOUT") made BLKZEROOUT coherent with the page
cache.

I have checked that Linux 4.9+ kernels are shipped at least as far back
as Debian 10 (buster), openSUSE Leap 15.2, and RHEL/CentOS 8.

Use BLKZEROOUT with buffered I/O, mostly so `qemu-img ... -t
writeback` can offload write zeroes.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250417211053.98700-1-stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-05-22 16:54:10 +02:00
Eric Blake
a6a0a7fb0e file-posix, gluster: Handle zero block status hint better
Although the previous patch to change 'bool want_zero' into a bitmask
made no semantic change, it is now time to differentiate.  When the
caller specifically wants to know what parts of the file read as zero,
we need to use lseek and actually reporting holes, rather than
short-circuiting and advertising full allocation.

This change will be utilized in later patches to let mirroring
optimize for the case when the destination already reads as zeroes.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250509204341.3553601-17-eblake@redhat.com>
2025-05-14 15:49:27 -05:00
Eric Blake
c33159dec7 block: Expand block status mode from bool to flags
This patch is purely mechanical, changing bool want_zero into an
unsigned int for bitwise-or of flags.  As of this patch, all
implementations are unchanged (the old want_zero==true is now
mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but
the callers in io.c that used to pass want_zero==false are now
prepared for future driver changes that can now distinguish bewteen
BDRV_WANT_ZERO vs. BDRV_WANT_ALLOCATED.  The next patch will actually
change the file-posix driver along those lines, now that we have
more-specific hints.

As for the background why this patch is useful: right now, the
file-posix driver recognizes that if allocation is being queried, the
entire image can be reported as allocated (there is no backing file to
refer to) - but this throws away information on whether the entire
image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0
returns -ENXIO, a bit more complicated to prove if the raw file was
created with 'qemu-img create' since we intentionally allocate a small
chunk of all-zero data to help with alignment probing).  Later patches
will add a generic algorithm for seeing if an entire file reads as
zeroes.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250509204341.3553601-16-eblake@redhat.com>
2025-05-14 15:33:34 -05:00
Kohei Tokunaga
208808242f block: Fix type conflict of the copy_file_range stub
Emscripten doesn't provide copy_file_range implementation but it declares
this function in its headers. Meson correctly detects the missing
implementation and unsets HAVE_COPY_FILE_RANGE. However, the stub defined in
file-posix.c causes a type conflict with the declaration from Emscripten
during compilation.

To fix this error, this commit updates the stub implementation in
file-posix.c to exactly match the declaration in Emscripten's headers. The
manpage also aligns with this signature.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/938d2beba15d4bd496a600ee401995fbaa385c62.1745820062.git.ktokunaga.mail@gmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-06 16:02:04 +02:00
Kohei Tokunaga
45e82e495d block: Add including of ioctl header for Emscripten build
Including <sys/ioctl.h> is still required on Emscripten, just like on other
platforms, to make the ioctl function available.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/49b6ecdbd23ff83e3f191ef8a9f7cc2feeaea43f.1745820062.git.ktokunaga.mail@gmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-06 16:02:04 +02:00
Kevin Wolf
71a30d54e6 file-posix: Fix crash on discard_granularity == 0
Block devices that don't support discard have a discard_granularity of
0. Currently, this results in a division by zero when we try to make
sure that it's a multiple of request_alignment. Only try to update
bs->bl.pdiscard_alignment when we got a non-zero discard_granularity
from sysfs.

Fixes: f605796aae ('file-posix: probe discard alignment on Linux block devices')
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250429155654.102735-1-kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-04-30 11:40:05 -04:00
Stefan Hajnoczi
f605796aae file-posix: probe discard alignment on Linux block devices
Populate the pdiscard_alignment block limit so the block layer is able
align discard requests correctly.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250417150528.76470-2-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-04-25 16:42:01 +02:00
Kevin Wolf
984a32f17e file-posix: Support FUA writes
Until now, FUA was always emulated with a separate flush after the write
for file-posix. The overhead of processing a second request can reduce
performance significantly for a guest disk that has disabled the write
cache, especially if the host disk is already write through, too, and
the flush isn't actually doing anything.

Advertise support for REQ_FUA in write requests and implement it for
Linux AIO and io_uring using the RWF_DSYNC flag for write requests. The
thread pool still performs a separate fdatasync() call. This can be
improved later by using the pwritev2() syscall if available.

As an example, this is how fio numbers can be improved in some scenarios
with this patch (all using virtio-blk with cache=directsync on an nvme
block device for the VM, fio with ioengine=libaio,direct=1,sync=1):

                              | old           | with FUA support
------------------------------+---------------+-------------------
bs=4k, iodepth=1, numjobs=1   |  45.6k iops   |  56.1k iops
bs=4k, iodepth=1, numjobs=16  | 183.3k iops   | 236.0k iops
bs=4k, iodepth=16, numjobs=1  | 258.4k iops   | 311.1k iops

However, not all scenarios are clear wins. On another slower disk I saw
little to no improvment. In fact, in two corner case scenarios, I even
observed a regression, which I however consider acceptable:

1. On slow host disks in a write through cache mode, when the guest is
   using virtio-blk in a separate iothread so that polling can be
   enabled, and each completion is quickly followed up with a new
   request (so that polling gets it), it can happen that enabling FUA
   makes things slower - the additional very fast no-op flush we used to
   have gave the adaptive polling algorithm a success so that it kept
   polling. Without it, we only have the slow write request, which
   disables polling. This is a problem in the polling algorithm that
   will be fixed later in this series.

2. With a high queue depth, it can be beneficial to have flush requests
   for another reason: The optimisation in bdrv_co_flush() that flushes
   only once per write generation acts as a synchronisation mechanism
   that lets all requests complete at the same time. This can result in
   better batching and if the disk is very fast (I only saw this with a
   null_blk backend), this can make up for the overhead of the flush and
   improve throughput. In theory, we could optionally introduce a
   similar artificial latency in the normal completion path to achieve
   the same kind of completion batching. This is not implemented in this
   series.

Compatibility is not a concern for the kernel side of io_uring, it has
supported RWF_DSYNC from the start. However, io_uring_prep_writev2() is
not available before liburing 2.2.

Linux AIO started supporting it in Linux 4.13 and libaio 0.3.111. The
kernel is not a problem for any supported build platform, so it's not
necessary to add runtime checks. However, openSUSE is still stuck with
an older libaio version that would break the build.

We must detect the presence of the writev2 functions in the user space
libraries at build time to avoid build failures.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-2-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-03-13 17:44:55 +01:00
Daniel P. Berrangé
407bc4bf90 qapi: Move include/qapi/qmp/ to include/qobject/
The general expectation is that header files should follow the same
file/path naming scheme as the corresponding source file. There are
various historical exceptions to this practice in QEMU, with one of
the most notable being the include/qapi/qmp/ directory. Most of the
headers there correspond to source files in qobject/.

This patch corrects most of that inconsistency by creating
include/qobject/ and moving the headers for qobject/ there.

This also fixes MAINTAINERS for include/qapi/qmp/dispatch.h:
scripts/get_maintainer.pl now reports "QAPI" instead of "No
maintainers found".

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com> #s390x
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20241118151235.2665921-2-armbru@redhat.com>
[Rebased]
2025-02-10 15:33:16 +01:00
Marc-André Lureau
eb5d28c783 block: fix -Werror=maybe-uninitialized false-positive
../block/file-posix.c:1405:17: error: ‘zoned’ may be used uninitialized [-Werror=maybe-uninitialized]
 1405 |     if (ret < 0 || zoned == BLK_Z_NONE) {

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2024-10-02 16:14:29 +04:00
Akihiko Odaki
e9c9d8dc3b block/file-posix: Drop ifdef for macOS versions older than 12.0
macOS versions older than 12.0 are no longer supported.

docs/about/build-platforms.rst says:
> Support for the previous major version will be dropped 2 years after
> the new major version is released or when the vendor itself drops
> support, whichever comes first.

macOS 12.0 was released 2021:
https://www.apple.com/newsroom/2021/10/macos-monterey-is-now-available/

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240629-macos-v1-3-6e70a6b700a0@daynix.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2024-07-02 06:58:48 +02:00
Paolo Bonzini
44b424dc4a block: remove separate bdrv_file_open callback
bdrv_file_open and bdrv_open are completely equivalent, they are
never checked except to see which one to invoke.  So merge them
into a single one.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-28 14:44:51 +02:00
Prasad Pandit
24687abf23 linux-aio: add IO_CMD_FDSYNC command support
Libaio defines IO_CMD_FDSYNC command to sync all outstanding
asynchronous I/O operations, by flushing out file data to the
disk storage. Enable linux-aio to submit such aio request.

When using aio=native without fdsync() support, QEMU creates
pthreads, and destroying these pthreads results in TLB flushes.
In a real-time guest environment, TLB flushes cause a latency
spike. This patch helps to avoid such spikes.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Message-ID: <20240425070412.37248-1-ppandit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-06-10 11:05:43 +02:00
Denis V. Lunev via
b67e353863 block: drop force_dup parameter of raw_reconfigure_getfd()
Since commit 72373e40fb, this parameter is always passed as 'false'
from the caller.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
Message-ID: <20240430170213.148558-1-den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-06-10 11:05:43 +02:00
Stefan Hajnoczi
cd0c0db0aa block/file-posix: set up Linux AIO and io_uring in the current thread
The file-posix block driver currently only sets up Linux AIO and
io_uring in the BDS's AioContext. In the multi-queue block layer we must
be able to submit I/O requests in AioContexts that do not have Linux AIO
and io_uring set up yet since any thread can call into the block driver.

Set up Linux AIO and io_uring for the current AioContext during request
submission. We lose the ability to return an error from
.bdrv_file_open() when Linux AIO and io_uring setup fails (e.g. due to
resource limits). Instead the user only gets warnings and we fall back
to aio=threads. This is still better than a fatal error after startup.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230914140101.1065008-2-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-12-21 22:49:27 +01:00
Naohiro Aota
ad4feaca61 file-posix: fix over-writing of returning zone_append offset
raw_co_zone_append() sets "s->offset" where "BDRVRawState *s". This pointer
is used later at raw_co_prw() to save the block address where the data is
written.

When multiple IOs are on-going at the same time, a later IO's
raw_co_zone_append() call over-writes a former IO's offset address before
raw_co_prw() completes. As a result, the former zone append IO returns the
initial value (= the start address of the writing zone), instead of the
proper address.

Fix the issue by passing the offset pointer to raw_co_prw() instead of
passing it through s->offset. Also, remove "offset" from BDRVRawState as
there is no usage anymore.

Fixes: 4751d09adc ("block: introduce zone append write for zoned devices")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Message-Id: <20231030073853.2601162-1-naohiro.aota@wdc.com>
Reviewed-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
2023-11-06 16:15:07 +01:00
Sam Li
10b9e0802a block/file-posix: fix update_zones_wp() caller
When the zoned request fail, it needs to update only the wp of
the target zones for not disrupting the in-flight writes on
these other zones. The wp is updated successfully after the
request completes.

Fixed the callers with right offset and nr_zones.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Message-Id: <20230825040556.4217-1-faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[hreitz: Rebased and fixed comment spelling]
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
2023-11-06 16:15:07 +01:00
Stefan Hajnoczi
416af8564f Block patches
- Fix for file-posix's zoning code crashing on I/O errors
 - Throttling refactoring
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEy2LXoO44KeRfAE00ofpA0JgBnN8FAmTxnMISHGhyZWl0ekBy
 ZWRoYXQuY29tAAoJEKH6QNCYAZzfYkUP+gMG9hhzvgjj/tw9rEBQjciihzcQmqQJ
 2Mm37RH2jj5bnnTdaTbMkcRRwVhncYSCwK9q5EYVbZmU9C/v4YJmsSEQlcl7wVou
 hbPUv6NHaBrJZX9nxNSa2RHui6pZMLKa/D0rJVB7NjYBrrRtiPo7kiLVQYjYXa2g
 kcCCfY4t3Z2RxOP31mMXRjYlhJE9bIuZdTEndrKme8KS2JGPZEJ9xjkoW1tj96EX
 oc/Cg2vk7AEtsFYA0bcD8fTFkBDJEwyYl3usu7Tk24pvH16jk7wFSqRVSsDMfnER
 tG8X3mHLIY0hbSkpzdHJdXINvZ6FWpQb0CGzIKr+pMiuWVdWr1HglBr0m4pVF+Y4
 A6AI6VX2JJgtacypoDyCZC9mzs1jIdeiwq9v5dyuikJ6ivTwEEoeoSLnLTN3AjXn
 0mtQYzgCg5Gd6+rTo7XjSO9SSlbaVrDl/B2eXle6tmIFT5k+86fh0hc+zTmP8Rkw
 Knbc+5Le95wlMrOUNx2GhXrTGwX510hLxKboho/LITxtAzqvXnEJKrYbnkm3WPnw
 wfHnR5VQH1NKEpiH/p33og6OV/vu9e7vgp0ZNZV136SnzC90C1zMUwg2simJW701
 34EtN0XBX8XBKrxfe7KscV9kRE8wrWWJVbhp+WOcQEomGI8uraxzWqDIk/v7NZXv
 m4XBscaB+Iri
 =oKgk
 -----END PGP SIGNATURE-----

Merge tag 'pull-block-2023-09-01' of https://gitlab.com/hreitz/qemu into staging

Block patches

- Fix for file-posix's zoning code crashing on I/O errors
- Throttling refactoring

# -----BEGIN PGP SIGNATURE-----
#
# iQJGBAABCAAwFiEEy2LXoO44KeRfAE00ofpA0JgBnN8FAmTxnMISHGhyZWl0ekBy
# ZWRoYXQuY29tAAoJEKH6QNCYAZzfYkUP+gMG9hhzvgjj/tw9rEBQjciihzcQmqQJ
# 2Mm37RH2jj5bnnTdaTbMkcRRwVhncYSCwK9q5EYVbZmU9C/v4YJmsSEQlcl7wVou
# hbPUv6NHaBrJZX9nxNSa2RHui6pZMLKa/D0rJVB7NjYBrrRtiPo7kiLVQYjYXa2g
# kcCCfY4t3Z2RxOP31mMXRjYlhJE9bIuZdTEndrKme8KS2JGPZEJ9xjkoW1tj96EX
# oc/Cg2vk7AEtsFYA0bcD8fTFkBDJEwyYl3usu7Tk24pvH16jk7wFSqRVSsDMfnER
# tG8X3mHLIY0hbSkpzdHJdXINvZ6FWpQb0CGzIKr+pMiuWVdWr1HglBr0m4pVF+Y4
# A6AI6VX2JJgtacypoDyCZC9mzs1jIdeiwq9v5dyuikJ6ivTwEEoeoSLnLTN3AjXn
# 0mtQYzgCg5Gd6+rTo7XjSO9SSlbaVrDl/B2eXle6tmIFT5k+86fh0hc+zTmP8Rkw
# Knbc+5Le95wlMrOUNx2GhXrTGwX510hLxKboho/LITxtAzqvXnEJKrYbnkm3WPnw
# wfHnR5VQH1NKEpiH/p33og6OV/vu9e7vgp0ZNZV136SnzC90C1zMUwg2simJW701
# 34EtN0XBX8XBKrxfe7KscV9kRE8wrWWJVbhp+WOcQEomGI8uraxzWqDIk/v7NZXv
# m4XBscaB+Iri
# =oKgk
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 01 Sep 2023 04:11:46 EDT
# gpg:                using RSA key CB62D7A0EE3829E45F004D34A1FA40D098019CDF
# gpg:                issuer "hreitz@redhat.com"
# gpg: Good signature from "Hanna Reitz <hreitz@redhat.com>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: CB62 D7A0 EE38 29E4 5F00  4D34 A1FA 40D0 9801 9CDF

* tag 'pull-block-2023-09-01' of https://gitlab.com/hreitz/qemu:
  tests/file-io-error: New test
  file-posix: Simplify raw_co_prw's 'out' zone code
  file-posix: Fix zone update in I/O error path
  file-posix: Check bs->bl.zoned for zone info
  file-posix: Clear bs->bl.zoned on error
  block/throttle-groups: Use ThrottleDirection instread of bool is_write
  fsdev: Use ThrottleDirection instread of bool is_write
  throttle: use THROTTLE_MAX/ARRAY_SIZE for hard code
  throttle: use enum ThrottleDirection instead of bool is_write
  cryptodev: use NULL throttle timer cb for read direction
  test-throttle: test read only and write only
  throttle: support read-only and write-only
  test-throttle: use enum ThrottleDirection
  throttle: introduce enum ThrottleDirection

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-09-21 09:05:10 -04:00
Michael Tokarev
3202d8e404 block: spelling fixes
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Eric Blake <eblake@redhat.com>
2023-09-08 13:08:52 +03:00
Hanna Czenczek
d31b50a15d file-posix: Simplify raw_co_prw's 'out' zone code
We duplicate the same condition three times here, pull it out to the top
level.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230824155345.109765-5-hreitz@redhat.com>
Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29 10:51:04 +02:00
Hanna Czenczek
deab5c9a4e file-posix: Fix zone update in I/O error path
We must check that zone information is present before running
update_zones_wp().

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2234374
Fixes: Coverity CID 1512459
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230824155345.109765-4-hreitz@redhat.com>
Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29 10:50:49 +02:00
Hanna Czenczek
4b5d80f3d0 file-posix: Check bs->bl.zoned for zone info
Instead of checking bs->wps or bs->bl.zone_size for whether zone
information is present, check bs->bl.zoned.  That is the flag that
raw_refresh_zoned_limits() reliably sets to indicate zone support.  If
it is set to something other than BLK_Z_NONE, other values and objects
like bs->wps and bs->bl.zone_size must be non-null/zero and valid; if it
is not, we cannot rely on their validity.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230824155345.109765-3-hreitz@redhat.com>
Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29 10:50:38 +02:00
Hanna Czenczek
56d1a022a7 file-posix: Clear bs->bl.zoned on error
bs->bl.zoned is what indicates whether the zone information is present
and valid; it is the only thing that raw_refresh_zoned_limits() sets if
CONFIG_BLKZONED is not defined, and it is also the only thing that it
sets if CONFIG_BLKZONED is defined, but there are no zones.

Make sure that it is always set to BLK_Z_NONE if there is an error
anywhere in raw_refresh_zoned_limits() so that we do not accidentally
announce zones while our information is incomplete or invalid.

This also fixes a memory leak in the last error path in
raw_refresh_zoned_limits().

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230824155345.109765-2-hreitz@redhat.com>
Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29 10:49:58 +02:00
Sam Li
29a242e165 block/file-posix: fix g_file_get_contents return path
The g_file_get_contents() function returns a g_boolean. If it fails, the
returned value will be 0 instead of -1. Solve the issue by skipping
assigning ret value.

This issue was found by Matthew Rosato using virtio-blk-{pci,ccw} backed
by an NVMe partition e.g. /dev/nvme0n1p1 on s390x.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-id: 20230727115844.8480-1-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-07-27 09:46:09 -04:00
Paolo Bonzini
36c6c8773a file-posix: remove incorrect coroutine_fn calls
raw_co_getlength is called by handle_aiocb_write_zeroes, which is not a coroutine
function.  This is harmless because raw_co_getlength does not actually suspend,
but in the interest of clarity make it a non-coroutine_fn that is just wrapped
by the coroutine_fn raw_co_getlength.  Likewise, check_cache_dropped was only
a coroutine_fn because it called raw_co_getlength, so it can be made non-coroutine
as well.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20230601115145.196465-2-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-06-28 09:46:14 +02:00
Stefan Hajnoczi
076682885d block/linux-aio: convert to blk_io_plug_call() API
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Note that a dev_max_batch check is dropped in laio_io_unplug() because
the semantics of unplug_fn() are different from .bdrv_co_unplug():
1. unplug_fn() is only called when the last blk_io_unplug() call occurs,
   not every time blk_io_unplug() is called.
2. unplug_fn() is per-thread, not per-BlockDriverState, so there is no
   way to get per-BlockDriverState fields like dev_max_batch.

Therefore this condition cannot be moved to laio_unplug_fn(). It is not
obvious that this condition affects performance in practice, so I am
removing it instead of trying to come up with a more complex mechanism
to preserve the condition.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530180959.1108766-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-06-01 07:34:03 -04:00
Stefan Hajnoczi
6a6da231b7 block/io_uring: convert to blk_io_plug_call() API
Stop using the .bdrv_co_io_plug() API because it is not multi-queue
block layer friendly. Use the new blk_io_plug_call() API to batch I/O
submission instead.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 20230530180959.1108766-5-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-06-01 07:34:03 -04:00
Sam Li
6c811e19bb block: add some trace events for zone append
Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508051510.177850-5-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:18:10 -04:00
Sam Li
4751d09adc block: introduce zone append write for zoned devices
A zone append command is a write operation that specifies the first
logical block of a zone as the write position. When writing to a zoned
block device using zone append, the byte offset of the call may point at
any position within the zone to which the data is being appended. Upon
completion the device will respond with the position where the data has
been written in the zone.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508051510.177850-3-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:18:10 -04:00
Sam Li
a3c41f06d5 file-posix: add tracking of the zone write pointers
Since Linux doesn't have a user API to issue zone append operations to
zoned devices from user space, the file-posix driver is modified to add
zone append emulation using regular writes. To do this, the file-posix
driver tracks the wp location of all zones of the device. It uses an
array of uint64_t. The most significant bit of each wp location indicates
if the zone type is conventional zones.

The zones wp can be changed due to the following operations issued:
- zone reset: change the wp to the start offset of that zone
- zone finish: change to the end location of that zone
- write to a zone
- zone append

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Message-id: 20230508051510.177850-2-faithilikerun@gmail.com
[Fix errno propagation from handle_aiocb_zone_mgmt()
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:17:55 -04:00
Sam Li
142e307e79 block: add some trace events for new block layer APIs
Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-8-faithilikerun@gmail.com
Message-id: 20230324090605.28361-8-faithilikerun@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:17:03 -04:00
Sam Li
774c726ceb block: add zoned BlockDriver check to block layer
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-6-faithilikerun@gmail.com
Message-id: 20230324090605.28361-6-faithilikerun@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
<philmd@linaro.org> and clarify that the check is about zoned
BlockDrivers.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:17:03 -04:00
Sam Li
6d43eaa396 block/block-backend: add block layer APIs resembling Linux ZonedBlockDevice ioctls
Add zoned device option to host_device BlockDriver. It will be presented only
for zoned host block devices. By adding zone management operations to the
host_block_device BlockDriver, users can use the new block layer APIs
including Report Zone and four zone management operations
(open, close, finish, reset, reset_all).

Qemu-io uses the new APIs to perform zoned storage commands of the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use following command:
$ ./build/qemu-io --image-opts -n driver=host_device, filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-4-faithilikerun@gmail.com
Message-id: 20230324090605.28361-4-faithilikerun@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
<philmd@linaro.org> and remove spurious ret = -errno in
raw_co_zone_mgmt().
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:17:03 -04:00
Sam Li
a735b56e49 block/file-posix: introduce helper functions for sysfs attributes
Use get_sysfs_str_val() to get the string value of device
zoned model. Then get_sysfs_zoned_model() can convert it to
BlockZoneModel type of QEMU.

Use get_sysfs_long_val() to get the long value of zoned device
information.

Signed-off-by: Sam Li <faithilikerun@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20230508045533.175575-3-faithilikerun@gmail.com
Message-id: 20230324090605.28361-3-faithilikerun@gmail.com
[Adjust commit message prefix as suggested by Philippe Mathieu-Daudé
<philmd@linaro.org>.
--Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15 08:17:03 -04:00
Emanuele Giuseppe Esposito
aef04fc790 thread-pool: avoid passing the pool parameter every time
thread_pool_submit_aio() is always called on a pool taken from
qemu_get_current_aio_context(), and that is the only intended
use: each pool runs only in the same thread that is submitting
work to it, it can't run anywhere else.

Therefore simplify the thread_pool_submit* API and remove the
ThreadPool function parameter.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230203131731.851116-5-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-04-25 13:17:28 +02:00
Emanuele Giuseppe Esposito
0fdb73112b thread-pool: use ThreadPool from the running thread
Use qemu_get_current_aio_context() where possible, since we always
submit work to the current thread anyways.

We want to also be sure that the thread submitting the work is
the same as the one processing the pool, to avoid adding
synchronization to the pool list.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230203131731.851116-4-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-04-25 13:17:28 +02:00
Emanuele Giuseppe Esposito
a75e4e4365 io_uring: use LuringState from the running thread
Remove usage of aio_context_acquire by always submitting asynchronous
AIO to the current thread's LuringState.

In order to prevent mistakes from the caller side, avoid passing LuringState
in luring_io_{plug/unplug} and luring_co_submit, and document the functions
to make clear that they work in the current thread's AioContext.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230203131731.851116-3-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-04-25 13:17:28 +02:00
Emanuele Giuseppe Esposito
ab50533b69 linux-aio: use LinuxAioState from the running thread
Remove usage of aio_context_acquire by always submitting asynchronous
AIO to the current thread's LinuxAioState.

In order to prevent mistakes from the caller side, avoid passing LinuxAioState
in laio_io_{plug/unplug} and laio_co_submit, and document the functions
to make clear that they work in the current thread's AioContext.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230203131731.851116-2-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-04-25 13:17:28 +02:00
Paolo Bonzini
8c6f27e7d8 block: remove has_variable_length from BlockDriver
Fill in the field in BlockLimits directly for host devices, and
copy it from there for the raw format.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20230407153303.391121-5-pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-04-11 16:39:01 +02:00
Kevin Wolf
4ec8df0183 block: Mark bdrv_co_create() and callers GRAPH_RDLOCK
This adds GRAPH_RDLOCK annotations to declare that callers of
bdrv_co_create() need to hold a reader lock for the graph.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230203152202.49054-17-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-23 19:49:23 +01:00
Emanuele Giuseppe Esposito
742bf09b20 block: Mark bdrv_co_copy_range() GRAPH_RDLOCK
This adds GRAPH_RDLOCK annotations to declare that callers of
bdrv_co_copy_range() need to hold a reader lock for the graph.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230203152202.49054-15-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-23 19:49:20 +01:00
Emanuele Giuseppe Esposito
8809534933 block: Mark bdrv_co_flush() and callers GRAPH_RDLOCK
This adds GRAPH_RDLOCK annotations to declare that callers of
bdrv_co_flush() need to hold a reader lock for the graph.

For some places, we know that they will hold the lock, but we don't have
the GRAPH_RDLOCK annotations yet. In this case, add assume_graph_lock()
with a FIXME comment. These places will be removed once everything is
properly annotated.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230203152202.49054-8-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-23 19:49:12 +01:00
Emanuele Giuseppe Esposito
005ee3cdc7 block/file-posix: don't use functions calling AIO_WAIT_WHILE in worker threads
When calling bdrv_getlength() in handle_aiocb_write_zeroes(), the
function creates a new coroutine and then waits that it finishes using
AIO_WAIT_WHILE.
The problem is that this function could also run in a worker thread,
that has a different AioContext from main loop and iothreads, therefore
in AIO_WAIT_WHILE we will have in_aio_context_home_thread(ctx) == false
and therefore
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
in the else branch will fail, crashing QEMU.

Aside from that, bdrv_getlength() is wrong also conceptually, because
it reads the BDS graph from another thread and is not protected by
any lock.

Replace it with raw_co_getlength, that doesn't create a coroutine and
doesn't read the BDS graph.

Reported-by: Ninad Palsule <ninad@linux.vnet.ibm.com>
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20230209154522.1164401-1-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-17 14:33:58 +01:00
Hanna Reitz
7f36a50ab4 block/file: Add file-specific image info
Add some (optional) information that the file driver can provide for
image files, namely the extent size hint.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220620162704.80987-3-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-01 16:52:33 +01:00
Emanuele Giuseppe Esposito
2c75261cc2 block: Convert bdrv_lock_medium() to co_wrapper
bdrv_lock_medium() is categorized as an I/O function, and it currently
doesn't run in a coroutine. We should let it take a graph rdlock since
it traverses the block nodes graph, which however is only possible in a
coroutine.

The only caller of this function is blk_lock_medium(). Therefore make
blk_lock_medium() a co_wrapper, so that it always creates a new
coroutine, and then make bdrv_lock_medium() a coroutine_fn where the
lock can be taken.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230113204212.359076-13-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-01 16:52:32 +01:00
Emanuele Giuseppe Esposito
2531b390fb block: Convert bdrv_eject() to co_wrapper
bdrv_eject() is categorized as an I/O function, and it currently
doesn't run in a coroutine. We should let it take a graph rdlock since
it traverses the block nodes graph, which however is only possible in a
coroutine.

The only caller of this function is blk_eject(). Therefore make
blk_eject() a co_wrapper, so that it always creates a new coroutine, and
then make bdrv_eject() coroutine_fn where the lock can be taken.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230113204212.359076-12-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-01 16:52:32 +01:00
Emanuele Giuseppe Esposito
3d47eb0a2a block: Convert bdrv_get_info() to co_wrapper_mixed
bdrv_get_info() is categorized as an I/O function, and it currently
doesn't run in a coroutine. We should let it take a graph rdlock since
it traverses the block nodes graph, which however is only possible in a
coroutine.

Therefore turn it into a co_wrapper to move the actual function into a
coroutine where the lock can be taken.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230113204212.359076-11-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-01 16:52:32 +01:00
Emanuele Giuseppe Esposito
82618d7bc3 block: Convert bdrv_get_allocated_file_size() to co_wrapper
bdrv_get_allocated_file_size() is categorized as an I/O function, and it
currently doesn't run in a coroutine. We should let it take a graph
rdlock since it traverses the block nodes graph, which however is only
possible in a coroutine.

Therefore turn it into a co_wrapper to move the actual function into a
coroutine where the lock can be taken.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230113204212.359076-10-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-02-01 16:52:32 +01:00