Commit graph

1675 commits

Author SHA1 Message Date
Klaus Jensen
3050b34921 hw/nvme: fix namespace atomic parameter setup
Coverity complains about a possible copy-paste error in the verification
of the namespace atomic parameters (CID 1642811). While the check is
correct, the code (and the intention) is unclear.

Fix this by reworking how the parameters are verified. Peter also
identified that the realize function was not correctly erroring out if
parameters were misconfigured, so fix that too.

Lastly, change the error messages to be more describing.

Coverity: CID 1642811
Fixes: bce51b8370 ("hw/nvme: add atomic boundary support")
Fixes: 3b41acc962 ("hw/nvme: enable ns atomic writes")
Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-11-25 09:21:35 +01:00
Hanna Czenczek
d45b2c65f2 block: Note in which AioContext AIO CBs are called
This doesn’t seem to be specified anywhere, but is something we probably
want to be clear.  I believe it is reasonable to implicitly assume that
callbacks are run in the current thread (unless explicitly noted
otherwise), so codify that assumption.

Some implementations don’t actually fulfill this contract yet.  The next
patches should rectify that.

Note: I don’t know of any user-visible bugs produced by not running AIO
callbacks in the original context.  AIO functionality is generally
mapped to coroutines through the use of bdrv_co_io_em_complete(), which
can run in any AioContext, and will always wake the yielding coroutine
in its original context.  The only benefit here is that running
bdrv_co_io_em_complete() in the original context will make that
aio_co_wake() most likely a simpler qemu_coroutine_enter() instead of
scheduling the wakeup through AioContext.co_schedule_bh.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-17-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-18 18:01:55 +01:00
Hanna Czenczek
aed74d3d62 block: Note on aio_co_wake use if not yet yielding
aio_co_wake() is generally safe to call regardless of whether the
coroutine is already yielding or not.  If it is not yet yielding, it
will be scheduled to run when it does yield.

Caveats:
- The caller must be independent of the coroutine (to ensure the
  coroutine must be yielding if both are in the same AioContext), i.e.
  must not be the same coroutine
- The coroutine must yield at some point

Make note of this so callers can reason that their use is safe.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-2-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-18 18:01:39 +01:00
Eric Blake
1bd7bfbc2b block: Allow drivers to control protocol prefix at creation
This patch is pure refactoring: instead of hard-coding permission to
use a protocol prefix when creating an image, the drivers can now pass
in a parameter, comparable to what they could already do for opening a
pre-existing image.  This patch is purely mechanical (all drivers pass
in true for now), but it will enable the next patch to cater to
drivers that want to differ in behavior for the primary image vs. any
secondary images that are opened at the same time as creating the
primary image.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250915213919.3121401-5-eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Yeqi Fu
9730b9974d block: replace TABs with space
Bring the block files in line with the QEMU coding style, with spaces
for indentation. This patch partially resolves the issue 371.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371
Signed-off-by: Yeqi Fu <fufuyqqqqqq@gmail.com>
Message-ID: <20230325085224.23842-1-fufuyqqqqqq@gmail.com>
[thuth: Rebased the patch to the current master branch]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251007163511.334178-1-thuth@redhat.com>
[kwolf: Fixed up vertical alignemnt]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
047dabef97 block/io_uring: use aio_add_sqe()
AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.

Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.

Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.

There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
   ring fd from the file descriptor monitoring io_uring instance. Disk
   I/O completions are now picked up directly. Also, sqes are
   accumulated in the sq ring until the end of the event loop iteration
   and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.

Note that error_setg() messages are not supposed to end with
punctuation, so I removed a '.' for the non-io_uring build error
message.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-15-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
1eebdab3c3 aio-posix: add aio_add_sqe() API for user-defined io_uring requests
Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.

This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: CQE callbacks
must be placed on a list so that nested event loops can invoke pending
CQE callbacks from parent event loops. If you're wondering why
CqeHandler exists instead of just a callback function pointer, this is
why.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-14-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
87e7a0f423 aio-posix: add fdmon_ops->dispatch()
The ppoll and epoll file descriptor monitoring implementations rely on
the event loop's generic file descriptor, timer, and BH dispatch code to
invoke user callbacks.

The io_uring file descriptor monitoring implementation will need
io_uring-specific dispatch logic for CQE handlers for custom SQEs.

Introduce a new FDMonOps ->dispatch() callback that allows file
descriptor monitoring implementations to invoke user callbacks. The next
patch will use this new callback.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-13-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
421dcc8023 aio: add errp argument to aio_context_setup()
When aio_context_new() -> aio_context_setup() fails at startup it
doesn't really matter whether errors are returned to the caller or the
process terminates immediately.

However, it is not acceptable to terminate when hotplugging --object
iothread at runtime. Refactor aio_context_setup() so that errors can be
propagated. The next commit will set errp when fdmon_io_uring_setup()
fails.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-10-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
3769b9abe9 aio: free AioContext when aio_context_new() fails
g_source_destroy() only removes the GSource from the GMainContext it's
attached to, if any. It does not free it.

Use g_source_unref() instead so that the AioContext (which embeds a
GSource) is freed. There is no need to call g_source_destroy() in
aio_context_new() because the GSource isn't attached to a GMainContext
yet.

aio_ctx_finalize() expects everything to be set up already, so introduce
the new ctx->initialized boolean and do nothing when called with
!initialized. This also requires moving aio_context_setup() down after
event_notifier_init() since aio_ctx_finalize() won't release any
resources that aio_context_setup() acquired.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-9-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
d1f42b600a aio: remove aio_context_use_g_source()
There is no need for aio_context_use_g_source() now that epoll(7) and
io_uring(7) file descriptor monitoring works with the glib event loop.
AioContext doesn't need to be notified that GSource is being used.

On hosts with io_uring support this now enables fdmon-io_uring.c by
default, replacing fdmon-poll.c and fdmon-epoll.c. In other words, the
event loop will use io_uring!

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-8-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:06:09 +01:00
Stefan Hajnoczi
ded29e64c6 aio-posix: integrate fdmon into glib event loop
AioContext's glib integration only supports ppoll(2) file descriptor
monitoring. epoll(7) and io_uring(7) disable themselves and switch back
to ppoll(2) when the glib event loop is used. The main loop thread
cannot use epoll(7) or io_uring(7) because it always uses the glib event
loop.

Future QEMU features may require io_uring(7). One example is uring_cmd
support in FUSE exports. Each feature could create its own io_uring(7)
context and integrate it into the event loop, but this is inefficient
due to extra syscalls. It would be more efficient to reuse the
AioContext's existing fdmon-io_uring.c io_uring(7) context because
fdmon-io_uring.c will already be active on systems where Linux io_uring
is available.

In order to keep fdmon-io_uring.c's AioContext operational even when the
glib event loop is used, extend FDMonOps with an API similar to
GSourceFuncs so that file descriptor monitoring can integrate into the
glib event loop.

A quick summary of the GSourceFuncs API:
- prepare() is called each event loop iteration before waiting for file
  descriptors and timers.
- check() is called to determine whether events are ready to be
  dispatched after waiting.
- dispatch() is called to process events.

More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html

Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
also implement epoll(7)- and io_uring(7)-specific file descriptor
monitoring code for glib event loops.

Note that it's still faster to use aio_poll() rather than the glib event
loop since glib waits for file descriptor activity with ppoll(2) and
does not support adaptive polling. But at least epoll(7) and io_uring(7)
now work in glib event loops.

Splitting this into multiple commits without temporarily breaking
AioContext proved difficult so this commit makes all the changes. The
next commit will remove the aio_context_use_g_source() API because it is
no longer needed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-7-stefanha@redhat.com>
[kwolf: Build fixes; fix AioContext.list_lock use after destroy]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-11-11 22:04:53 +01:00
Richard Henderson
c494afbb7d nvme queue
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmkDE7gACgkQTeGvMW1P
 DekCOwgAuOQKWWW/UA1MmZ4ZHs+djf4q5UDwqGDx8tra8d32mZWRHgpJ/OBBOY2z
 CmuHqWLgooAqfx4hsrXELdNBEe7ccNE9nvsE3GjnYWxjoe51yl2Xc0RD5CZBVrN4
 RRMbBZRCewxGShyUaT31eedolWdr4zBuqkpLf9gcG8Yk7YD+xUkHUPeMXeAy+vkS
 pxW59AkXdjJZgBktOdV5uVj9gaCPgTcGaQNH2FYSnzHwdu5VyV8BKiiZE/fXS6FU
 xZvu+5p1Ro5vOdwG+iFBrbBwcGyjVOF1OfBZctyc83foyFxwzxqoqj9gy0ewuT2g
 HsupUiJgbkZ1Ut9fzaS5pHx3dd3dKw==
 =WDrH
 -----END PGP SIGNATURE-----

Merge tag 'pull-nvme-20251030' of https://gitlab.com/birkelund/qemu into staging

nvme queue

# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmkDE7gACgkQTeGvMW1P
# DekCOwgAuOQKWWW/UA1MmZ4ZHs+djf4q5UDwqGDx8tra8d32mZWRHgpJ/OBBOY2z
# CmuHqWLgooAqfx4hsrXELdNBEe7ccNE9nvsE3GjnYWxjoe51yl2Xc0RD5CZBVrN4
# RRMbBZRCewxGShyUaT31eedolWdr4zBuqkpLf9gcG8Yk7YD+xUkHUPeMXeAy+vkS
# pxW59AkXdjJZgBktOdV5uVj9gaCPgTcGaQNH2FYSnzHwdu5VyV8BKiiZE/fXS6FU
# xZvu+5p1Ro5vOdwG+iFBrbBwcGyjVOF1OfBZctyc83foyFxwzxqoqj9gy0ewuT2g
# HsupUiJgbkZ1Ut9fzaS5pHx3dd3dKw==
# =WDrH
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 30 Oct 2025 08:28:56 AM CET
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* tag 'pull-nvme-20251030' of https://gitlab.com/birkelund/qemu:
  hw/nvme: add atomic boundary support
  hw/nvme: enable ns atomic writes
  hw/nvme: connect SPDM over NVMe Security Send/Recv
  spdm: define SPDM transport enum types
  hw/nvme: add NVMe Admin Security SPDM support
  spdm: add spdm storage transport virtual header
  spdm-socket: add seperate send/recv functions

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2025-10-31 12:56:05 +01:00
Wilfred Mallawa
e5534abeb4 hw/nvme: add NVMe Admin Security SPDM support
Adds the NVMe Admin Security Send/Receive command support with support
for DMTFs SPDM. The transport binding for SPDM is defined in the
DMTF DSP0286.

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2025-10-30 07:07:14 +01:00
Kevin Wolf
5b4b3bfdfc qemu-img info: Optionally show block limits
Add a new --limits option to 'qemu-img info' that displays the block
limits for the image and all of its children, making the information
more accessible for human users than in QMP. This option is not enabled
by default because it can be a lot of output that isn't usually relevant
if you're not specifically trying to diagnose some I/O problem.

This makes the same information automatically also available in HMP
'info block -v'.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251024123041.51254-4-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-10-29 12:10:10 +01:00
Kevin Wolf
46dd683d56 block: Improve comments in BlockLimits
Patches to expose the limits in QAPI have made clear that the existing
documentation of BlockLimits could be improved: The meaning of
min_mem_alignment and opt_mem_alignment could be clearer, and talking
about better alignment values isn't helpful when we only detect these
values and never choose them.

Make the changes in the BlockLimits documentation now, so that the
patches exposing the fields in QAPI can use descriptions consistent with
it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251024123041.51254-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-10-29 12:10:09 +01:00
Fiona Ebner
cbadaf57a7 block: implement 'resize' callback for child_of_bds class
If a filtered child is resized, the size of the parent node is now
also refreshed (recursively for chains of filtered children).

For filter block drivers that do not implement .bdrv_co_getlength(),
this commit does not change the current behavior, because
bdrv_co_refresh_total_sectors() will used the current size via the
passed-in hint. This is the case for block drivers for (some) block
jobs, as well as copy-before-write.

Block jobs already set up a blocker preventing a QMP block_resize
operation while the job is running. That does not directly cover an
associated 'file' node of a 'raw' node, but resizing such a 'file'
node is already prevented too (backup, commit, mirror and stream were
checked).

The other case is copy-before-write. This commit does not change the
fact that the copy-before-write node still has the same size after its
filtered child is resized.

Block drivers that do implement .bdrv_co_getlength() and where
.is_filter is true, already returned the length of the file child, so
there is no change before and after this commit, with two exceptions:
1. preallocate can return an early data_end and otherwise queries the
   file child, but that special casing is not changed.
2. blkverify returns the length of the test file. This commit does not
   affect that behavior.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250917115509.401015-4-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-10-29 12:10:09 +01:00
Fiona Ebner
08736e7584 block: make bdrv_co_parent_cb_resize() a proper IO API function
In preparation for calling it via the bdrv_child_cb_resize() callback
that will be added by the next commit. Rename it to include the "_co_"
part while at it.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20250917115509.401015-3-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-10-29 12:10:09 +01:00
Fiona Ebner
4120375420 include/block/block_int-common: document when resize callback is used
The 'resize' callback is only called by bdrv_parent_cb_resize() which
is only called by bdrv_co_write_req_finish() to notify the parent(s)
that the child was resized.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20250917115509.401015-2-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-10-29 12:10:09 +01:00
Chandan Somani
9f0c763e16 block: enable stats-intervals for storage devices
This patch allows stats-intervals to be used for storage
devices with the -device option. It accepts a list of interval
lengths in JSON format.

It configures and collects the stats in the BlockBackend layer
through the storage device that consumes the BlockBackend.

Signed-off-by: Chandan Somani <csomani@redhat.com>
Message-ID: <20251003220039.1336663-1-csomani@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-10-29 12:10:09 +01:00
Fiona Ebner
a256a427b0 blockjob: mark block_job_remove_all_bdrv() as GRAPH_UNLOCKED
The function block_job_remove_all_bdrv() calls
bdrv_graph_wrlock_drained(), which must be called with the graph
unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-49-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:28 +02:00
Fiona Ebner
2cf92b15cd block: mark bdrv_open_child_common() and its callers GRAPH_UNLOCKED
The function bdrv_open_child_common() calls
bdrv_graph_wrlock_drained(), which must be called with the graph
unlocked. Mark it and its two callers bdrv_open_file_child() and
bdrv_open_child() as GRAPH_UNLOCKED. This requires temporarily
unlocking in vmdk_parse_extents() and making the locked section
shorter in vmdk_open().

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-48-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:27 +02:00
Fiona Ebner
ede0859311 block: mark bdrv_close() as GRAPH_UNLOCKED
The functions blk_log_writes_close(), blkverify_close(),
quorum_close(), vmdk_close() via vmdk_free_extents(), and other
bdrv_close() implementations call bdrv_graph_wrlock_drained(), which
must be called with the graph unlocked. They are reached via the
BlockDriver's bdrv_close() callback and the bdrv_close() wrapper,
which are also marked as GRAPH_UNLOCKED_PTR and GRAPH_UNLOCKED.

Furthermore, the function bdrv_close() also calls bdrv_drained_begin()
and bdrv_graph_wrlock_drained(), so there are additional reasons for
marking it GRAPH_UNLOCKED.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-47-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:26 +02:00
Fiona Ebner
6d7e3f8de0 block: mark bdrv_close_all() as GRAPH_UNLOCKED
The function bdrv_close_all() calls bdrv_drain_all(), which must be
called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-46-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:25 +02:00
Fiona Ebner
94371745d7 block: mark bdrv_drop_intermediate() as GRAPH_UNLOCKED
The function bdrv_drop_intermediate() calls bdrv_drained_begin(),
which must be called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-45-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:23 +02:00
Fiona Ebner
04f4d9c555 block: mark bdrv_insert_node() as GRAPH_UNLOCKED
The function bdrv_insert_node() calls bdrv_drained_begin() which must
be called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-44-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:22 +02:00
Fiona Ebner
5d04823347 block: mark bdrv_replace_child_bs() as GRAPH_UNLOCKED
The function bdrv_replace_child_bs() calls bdrv_drained_begin() which
must be called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-43-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:21 +02:00
Fiona Ebner
7525aa25db block: mark bdrv_inactivate_all() as GRAPH_UNLOCKED
The function bdrv_inactivate_all() calls bdrv_drain_all_begin(), which
must be called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-37-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:08 +02:00
Fiona Ebner
e2d9cc5790 block: mark bdrv_inactivate() as GRAPH_RDLOCK and move drain to callers
The function bdrv_inactivate() calls bdrv_drain_all_begin(), which
needs to be called with the graph unlocked, so either
bdrv_inactivate() should be marked as GRAPH_UNLOCKED or the drain
needs to be moved to the callers. The caller in
qmp_blockdev_set_active() requires that the locked section covers
bdrv_find_node() too, so the latter alternative is chosen.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-36-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:07 +02:00
Fiona Ebner
6717dc3075 block: mark bdrv_reopen_queue() and bdrv_reopen_multiple() as GRAPH_UNLOCKED
The function bdrv_reopen_queue() can call bdrv_drain_all_begin(),
which must be called with the graph unlocked.

The function bdrv_reopen_multiple() calls bdrv_reopen_prepare() which
must be called with the graph unlocked.

To mark bdrv_reopen_queue() as GRAPH_UNLOCKED, it is necessary to make
the locked section in reopen_backing_file() shorter.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-35-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:05 +02:00
Fiona Ebner
c6b5328b5b block/snapshot: mark bdrv_all_delete_snapshot() as GRAPH_UNLOCKED
The function bdrv_all_delete_snapshot() calls bdrv_drain_all_begin(),
which must be called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-33-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:42:02 +02:00
Fiona Ebner
54eb59d668 block: drop wrapper for bdrv_set_backing_hd_drained()
Nearly all callers (outside of the tests) are already using the
_drained() variant of the function. It doesn't seem worth keeping.
Simply adapt the remaining callers of bdrv_set_backing_hd() and rename
bdrv_set_backing_hd_drained() to bdrv_set_backing_hd().

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-31-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:41:58 +02:00
Fiona Ebner
de0d24c711 block: mark bdrv_set_backing_hd() as GRAPH_UNLOCKED
The function bdrv_set_backing_hd() calls bdrv_drain_all_begin(), which
must be called with the graph unlocked.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-29-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:41:54 +02:00
Fiona Ebner
6b89e851fa block: add bdrv_graph_wrlock_drained() convenience wrapper
Many write-locked sections are also drained sections. A new
bdrv_graph_wrunlock_drained() wrapper around bdrv_graph_wrunlock() is
introduced, which will begin a drained section first. A global
variable is used so bdrv_graph_wrunlock() knows if it also needs
to end such a drained section. Both the aio_poll call in
bdrv_graph_wrlock() and the aio_bh_poll() in bdrv_graph_wrunlock()
can re-enter a write-locked section. While for the latter, ending the
drain could be moved to before the call, the former requires that the
variable is a counter and not just a boolean.

Since the wrapper calls bdrv_drain_all_begin(), which must be called
with the graph unlocked, mark the wrapper as GRAPH_UNLOCKED too.

The switch to the new helpers was generated with the following
commands and then manually checked:
find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock_drained();/g' {} ';'
find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock();/g' {} ';'

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-25-f.ebner@proxmox.com>
[kwolf: Removed redundant GRAPH_UNLOCKED]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:40:58 +02:00
Fiona Ebner
502f00c51a block: never use atomics to access bs->quiesce_counter
All accesses of bs->quiesce_counter are in the main thread, either
after a GLOBAL_STATE_CODE() macro or in a function with GRAPH_WRLOCK
annotation.

This is essentially a revert of 414c2ec358 ("block: access
quiesce_counter with atomic ops"). At that time, neither the
GLOBAL_STATE_CODE() macro nor the GRAPH_WRLOCK annotation existed.
Even if the field was only accessed in the main thread back then (did
not check if that is actually the case), it wouldn't have been easy to
verify.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-24-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-07-14 15:40:45 +02:00
Fiona Ebner
fc1d2f3eac block: mark bdrv_drained_begin() and friends as GRAPH_UNLOCKED
All of bdrv_drain_all_begin(), bdrv_drain_all() and
bdrv_drained_begin() poll and are not allowed to be called with the
block graph lock held. Mark the function as such.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-20-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:34 +02:00
Fiona Ebner
d75f8ed1d7 block: move drain outside of quorum_del_child()
The quorum_del_child() callback runs under the graph lock, so it is
not allowed to drain. It is only called as the .bdrv_del_child()
callback, which is only called in the bdrv_del_child() function, which
also runs under the graph lock.

The bdrv_del_child() function is called by qmp_x_blockdev_change().
A drained section was already introduced there by commit "block: move
drain out of quorum_add_child()".

This finally finishes moving out the drain to places that are not
under the graph lock started in "block: move draining out of
bdrv_change_aio_context() and mark GRAPH_RDLOCK".

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-17-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:34 +02:00
Fiona Ebner
0414930d3a block: move drain outside of quorum_add_child()
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".

The quorum_add_child() callback runs under the graph lock, so it is
not allowed to drain. It is only called as the .bdrv_add_child()
callback, which is only called in the bdrv_add_child() function, which
also runs under the graph lock.

The bdrv_add_child() function is called by qmp_x_blockdev_change(),
where a drained section is introduced.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-15-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:34 +02:00
Fiona Ebner
ffdcd081f5 block: move drain outside of bdrv_root_attach_child()
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".

The function bdrv_root_attach_child() runs under the graph lock, so it
is not allowed to drain. It is called by:
1. blk_insert_bs(), where a drained section is introduced.
2. block_job_add_bdrv(), which holds the graph lock itself.

block_job_add_bdrv() is called by:
1. mirror_start_job()
2. stream_start()
3. commit_start()
4. backup_job_create()
5. block_job_create()
6. In the test_blockjob_common_drain_node() unit test

In all callers, a drained section is introduced.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250530151125.955508-13-f.ebner@proxmox.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:34 +02:00
Fiona Ebner
a1ea8eb591 block: move drain outside of bdrv_try_change_aio_context()
This is part of resolving the deadlock mentioned in commit "block:
move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK".

Convert the function to a _locked() version that has to be called with
the graph lock held and add a convenience wrapper that has to be
called with the graph unlocked, which drains and takes the lock
itself. Since bdrv_try_change_aio_context() is global state code, the
wrapper is too.

Callers are adapted to use the appropriate variant, depending on
whether the caller already holds the lock. In the
test_set_aio_context() unit test, prior drains can be removed, because
draining already happens inside the new wrapper.

Note that bdrv_attach_child_common_abort(), bdrv_attach_child_common()
and bdrv_root_unref_child() hold the graph lock and are not actually
allowed to drain either. This will be addressed in the following
commits.

Functions like qmp_blockdev_mirror() query the nodes to act on before
draining and locking. In theory, draining could invalidate those nodes.
This kind of issue is not addressed by these commits.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250530151125.955508-10-f.ebner@proxmox.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:34 +02:00
Fiona Ebner
91ba0e1c38 block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK
This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED.

Note that even if bdrv_drained_begin() were already marked as
GRAPH_UNLOCKED, TSA would not complain about the instance in
bdrv_change_aio_context() before this change, because it is preceded
by a bdrv_graph_rdunlock_main_loop() call. It is not correct to
release the lock here, and in case the caller holds a write lock, it
wouldn't actually release the lock.

In combination with block-stream, there is a deadlock that can happen
because of this [0]. In particular, it can happen that
main thread              IO thread
1. acquires write lock
                         in blk_co_do_preadv_part():
                         2. have non-zero blk->in_flight
                         3. try to acquire read lock
4. begin drain

Steps 3 and 4 might be switched. Draining will poll and get stuck,
because it will see the non-zero in_flight counter. But the IO thread
will not make any progress either, because it cannot acquire the read
lock.

After this change, all paths to bdrv_change_aio_context() drain:
bdrv_change_aio_context() is called by:
1. bdrv_child_cb_change_aio_ctx() which is only called via the
   change_aio_ctx() callback, see below.
2. bdrv_child_change_aio_context(), see below.
3. bdrv_try_change_aio_context(), where a drained section is
   introduced.

The change_aio_ctx() callback is called by:
1. bdrv_attach_child_common_abort(), where a drained section is
   introduced.
2. bdrv_attach_child_common(), where a drained section is introduced.
3. bdrv_parent_change_aio_context(), see below.

bdrv_child_change_aio_context() is called by:
1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
   section is invariant.
2. child_job_change_aio_ctx(), which is only called via the
   change_aio_ctx() callback, see above.

bdrv_parent_change_aio_context() is called by:
1. bdrv_change_aio_context(), i.e. recursive, so being in a drained
   section is invariant.

This resolves all code paths. Note that bdrv_attach_child_common()
and bdrv_attach_child_common_abort() hold the graph write lock and
callers of bdrv_try_change_aio_context() might too, so they are not
actually allowed to drain either. This will be addressed in the
following commits.

More granular draining is not trivially possible, because
bdrv_change_aio_context() can recursively call itself e.g. via
bdrv_child_change_aio_context().

[0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/

Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250530151125.955508-9-f.ebner@proxmox.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:33 +02:00
Fiona Ebner
469422c45b block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK
This is a small step in preparation to mark bdrv_drained_begin() as
GRAPH_UNLOCKED. More concretely, it is in preparation to move the
drain out of bdrv_change_aio_context() and marking that function as
GRAPH_RDLOCK.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250530151125.955508-8-f.ebner@proxmox.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:33 +02:00
Fiona Ebner
844d550d09 block: mark change_aio_ctx() callback and instances as GRAPH_RDLOCK(_PTR)
This is a small step in preparation to mark bdrv_drained_begin() as
GRAPH_UNLOCKED. More concretely, it is in preparation to move the
drain out of bdrv_change_aio_context() and marking that function as
GRAPH_RDLOCK.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250530151125.955508-7-f.ebner@proxmox.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-06-04 18:16:33 +02:00
Eric Blake
253b43a290 mirror: Drop redundant zero_target parameter
The two callers to a mirror job (drive-mirror and blockdev-mirror) set
zero_target precisely when sync mode == FULL, with the one exception
that drive-mirror skips zeroing the target if it was newly created and
reads as zero.  But given the previous patch, that exception is
equally captured by target_is_zero.

Meanwhile, there is another slight wrinkle, fortunately caught by
iotest 185: if the caller uses "sync":"top" but the source has no
backing file, the code in blockdev.c was changing sync to be FULL, but
only after it had set zero_target=false.  In mirror.c, prior to recent
patches, this didn't matter: the only places that inspected sync were
setting is_none_mode (both TOP and FULL had set that to false), and
mirror_start() setting base = mode == MIRROR_SYNC_MODE_TOP ?
bdrv_backing_chain_next(bs) : NULL.  But now that we are passing sync
around, the slammed sync mode would result in a new pre-zeroing pass
even when the user had passed "sync":"top" in an effort to skip
pre-zeroing.  Fortunately, the assignment of base when bs has no
backing chain still works out to NULL if we don't slam things.  So
with the forced change of sync ripped out of blockdev.c, the sync mode
is passed through the full callstack unmolested, and we can now
reliably reconstruct the same settings as what used to be passed in by
zero_target=false, without the redundant parameter.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250509204341.3553601-24-eblake@redhat.com>
Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[eblake: Fix regression in iotest 185]
Signed-off-by: Eric Blake <eblake@redhat.com>
2025-05-14 20:10:12 -05:00
Eric Blake
d17a34bfb9 mirror: Allow QMP override to declare target already zero
QEMU has an optimization for a just-created drive-mirror destination
that is not possible for blockdev-mirror (which can't create the
destination) - any time we know the destination starts life as all
zeroes, we can skip a pre-zeroing pass on the destination.  Recent
patches have added an improved heuristic for detecting if a file
contains all zeroes, and we plan to use that heuristic in upcoming
patches.  But since a heuristic cannot quickly detect all scenarios,
and there may be cases where the caller is aware of information that
QEMU cannot learn quickly, it makes sense to have a way to tell QEMU
to assume facts about the destination that can make the mirror
operation faster.  Given our existing example of "qemu-img convert
--target-is-zero", it is time to expose this override in QMP for
blockdev-mirror as well.

This patch results in some slight redundancy between the older
s->zero_target (set any time mode==FULL and the destination image was
not just created - ie. clear if drive-mirror is asking to skip the
pre-zero pass) and the newly-introduced s->target_is_zero (in addition
to the QMP override, it is set when drive-mirror creates the
destination image); this will be cleaned up in the next patch.

There is also a subtlety that we must consider.  When drive-mirror is
passing target_is_zero on behalf of a just-created image, we know the
image is sparse (skipping the pre-zeroing keeps it that way), so it
doesn't matter whether the destination also has "discard":"unmap" and
"detect-zeroes":"unmap".  But now that we are letting the user set the
knob for target-is-zero, if the user passes a pre-existing file that
is fully allocated, it is fine to leave the file fully allocated under
"detect-zeroes":"on", but if the file is open with
"detect-zeroes":"unmap", we should really be trying harder to punch
holes in the destination for every region of zeroes copied from the
source.  The easiest way to do this is to still run the pre-zeroing
pass (turning the entire destination file sparse before populating
just the allocated portions of the source), even though that currently
results in double I/O to the portions of the file that are allocated.
A later patch will add further optimizations to reduce redundant
zeroing I/O during the mirror operation.

Since "target-is-zero":true is designed for optimizations, it is okay
to silently ignore the parameter rather than erroring if the user ever
sets the parameter in a scenario where the mirror job can't exploit it
(for example, when doing "sync":"top" instead of "sync":"full", we
can't pre-zero, so setting the parameter won't make a speed
difference).

Signed-off-by: Eric Blake <eblake@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20250509204341.3553601-23-eblake@redhat.com>
Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2025-05-14 16:55:10 -05:00
Eric Blake
5272609670 block: Add new bdrv_co_is_all_zeroes() function
There are some optimizations that require knowing if an image starts
out as reading all zeroes, such as making blockdev-mirror faster by
skipping the copying of source zeroes to the destination.  The
existing bdrv_co_is_zero_fast() is a good building block for answering
this question, but it tends to give an answer of 0 for a file we just
created via QMP 'blockdev-create' or similar (such as 'qemu-img create
-f raw').  Why?  Because file-posix.c insists on allocating a tiny
header to any file rather than leaving it 100% sparse, due to some
filesystems that are unable to answer alignment probes on a hole.  But
teaching file-posix.c to read the tiny header doesn't scale - the
problem of a small header is also visible when libvirt sets up an NBD
client to a just-created file on a migration destination host.

So, we need a wrapper function that handles a bit more complexity in a
common manner for all block devices - when the BDS is mostly a hole,
but has a small non-hole header, it is still worth the time to read
that header and check if it reads as all zeroes before giving up and
returning a pessimistic answer.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250509204341.3553601-19-eblake@redhat.com>
2025-05-14 16:08:23 -05:00
Eric Blake
c33159dec7 block: Expand block status mode from bool to flags
This patch is purely mechanical, changing bool want_zero into an
unsigned int for bitwise-or of flags.  As of this patch, all
implementations are unchanged (the old want_zero==true is now
mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but
the callers in io.c that used to pass want_zero==false are now
prepared for future driver changes that can now distinguish bewteen
BDRV_WANT_ZERO vs. BDRV_WANT_ALLOCATED.  The next patch will actually
change the file-posix driver along those lines, now that we have
more-specific hints.

As for the background why this patch is useful: right now, the
file-posix driver recognizes that if allocation is being queried, the
entire image can be reported as allocated (there is no backing file to
refer to) - but this throws away information on whether the entire
image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0
returns -ENXIO, a bit more complicated to prove if the raw file was
created with 'qemu-img create' since we intentionally allocate a small
chunk of all-zero data to help with alignment probing).  Later patches
will add a generic algorithm for seeing if an entire file reads as
zeroes.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20250509204341.3553601-16-eblake@redhat.com>
2025-05-14 15:33:34 -05:00
Raman Dzehtsiar
3d3911f16b blockdev-backup: Add error handling option for copy-before-write jobs
This patch extends the blockdev-backup QMP command to allow users to specify
how to behave when IO errors occur during copy-before-write operations.
Previously, the behavior was fixed and could not be controlled by the user.

The new 'on-cbw-error' option can be set to one of two values:
- 'break-guest-write': Forwards the IO error to the guest and triggers
  the on-source-error policy. This preserves snapshot integrity at the
  expense of guest IO operations.
- 'break-snapshot': Allows the guest OS to continue running normally,
  but invalidates the snapshot and aborts related jobs. This prioritizes
  guest operation over backup consistency.

This enhancement provides more flexibility for backup operations in different
environments where requirements for guest availability versus backup
consistency may vary.

The default behavior remains unchanged to maintain backward compatibility.

Signed-off-by: Raman Dzehtsiar <Raman.Dzehtsiar@gmail.com>
Message-ID: <20250414090025.828660-1-Raman.Dzehtsiar@gmail.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
[vsementsov: fix long lines]
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2025-05-12 18:19:31 +03:00
Sunny Zhu
ed1aef1716 block: Remove unused callback function *bdrv_aio_pdiscard
The bytes type in *bdrv_aio_pdiscard should be int64_t rather than int.

There are no drivers implementing the *bdrv_aio_pdiscard() callback,
it appears to be an unused function. Therefore, we'll simply remove it
instead of fixing it.

Additionally, coroutine-based callbacks are preferred. If someone needs
to implement bdrv_aio_pdiscard, a coroutine-based version would be
straightforward to implement.

Signed-off-by: Sunny Zhu <sunnyzhyy@qq.com>
Message-ID: <tencent_7140D2E54157D98CF3D9E64B1A007A1A7906@qq.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-04-25 17:06:50 +02:00
Kevin Wolf
ee416407b3 aio-posix: Separate AioPolledEvent per AioHandler
Adaptive polling has a big problem: It doesn't consider that an event
loop can wait for many different events that may have very different
typical latencies.

For example, think of a guest that tends to send a new I/O request soon
after the previous I/O request completes, but the storage on the host is
rather slow. In this case, getting the new request from guest quickly
means that polling is enabled, but the next thing is performing the I/O
request on the backend, which is slow and disables polling again for the
next guest request. This means that in such a scenario, polling could
help for every other event, but is only ever enabled when it can't
succeed.

In order to fix this, keep a separate AioPolledEvent for each
AioHandler. We will then know that the backend file descriptor always
has a high latency and isn't worth polling for, but we also know that
the guest is always fast and we should poll for it. This solves at least
half of the problem, we can now keep polling for those cases where it
makes sense and get the improved performance from it.

Since the event loop doesn't know which event will be next, we still do
some unnecessary polling while we're waiting for the slow disk. I made
some attempts to be more clever than just randomly growing and shrinking
the polling time, and even to let callers be explicit about when they
expect a new event, but so far this hasn't resulted in improved
performance or even caused performance regressions. For now, let's just
fix the part that is easy enough to fix, we can revisit the rest later.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250307221634.71951-6-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2025-03-13 17:57:23 +01:00