qemu-cr16

Author	SHA1	Message	Date
Klaus Jensen	3050b34921	hw/nvme: fix namespace atomic parameter setup Coverity complains about a possible copy-paste error in the verification of the namespace atomic parameters (CID 1642811). While the check is correct, the code (and the intention) is unclear. Fix this by reworking how the parameters are verified. Peter also identified that the realize function was not correctly erroring out if parameters were misconfigured, so fix that too. Lastly, change the error messages to be more descriptive. Coverity: CID 1642811 Fixes: `bce51b8370` ("hw/nvme: add atomic boundary support") Fixes: `3b41acc962` ("hw/nvme: enable ns atomic writes") Reviewed-by: Jesper Wendel Devantier <foss@defmacro.it> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-11-25 09:21:35 +01:00
Hanna Czenczek	d45b2c65f2	block: Note in which AioContext AIO CBs are called This doesn’t seem to be specified anywhere, but is something we probably want to be clear. I believe it is reasonable to implicitly assume that callbacks are run in the current thread (unless explicitly noted otherwise), so codify that assumption. Some implementations don’t actually fulfill this contract yet. The next patches should rectify that. Note: I don’t know of any user-visible bugs produced by not running AIO callbacks in the original context. AIO functionality is generally mapped to coroutines through the use of bdrv_co_io_em_complete(), which can run in any AioContext, and will always wake the yielding coroutine in its original context. The only benefit here is that running bdrv_co_io_em_complete() in the original context will make that aio_co_wake() most likely a simpler qemu_coroutine_enter() instead of scheduling the wakeup through AioContext.co_schedule_bh. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-17-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:55 +01:00
Hanna Czenczek	aed74d3d62	block: Note on aio_co_wake use if not yet yielding aio_co_wake() is generally safe to call regardless of whether the coroutine is already yielding or not. If it is not yet yielding, it will be scheduled to run when it does yield. Caveats: - The caller must be independent of the coroutine (to ensure the coroutine must be yielding if both are in the same AioContext), i.e. must not be the same coroutine - The coroutine must yield at some point Make note of this so callers can reason that their use is safe. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-2-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:39 +01:00
Eric Blake	1bd7bfbc2b	block: Allow drivers to control protocol prefix at creation This patch is pure refactoring: instead of hard-coding permission to use a protocol prefix when creating an image, the drivers can now pass in a parameter, comparable to what they could already do for opening a pre-existing image. This patch is purely mechanical (all drivers pass in true for now), but it will enable the next patch to cater to drivers that want to differ in behavior for the primary image vs. any secondary images that are opened at the same time as creating the primary image. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250915213919.3121401-5-eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Yeqi Fu	9730b9974d	block: replace TABs with space Bring the block files in line with the QEMU coding style, with spaces for indentation. This patch partially resolves the issue 371. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371 Signed-off-by: Yeqi Fu <fufuyqqqqqq@gmail.com> Message-ID: <20230325085224.23842-1-fufuyqqqqqq@gmail.com> [thuth: Rebased the patch to the current master branch] Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20251007163511.334178-1-thuth@redhat.com> [kwolf: Fixed up vertical alignment] Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	047dabef97	block/io_uring: use aio_add_sqe() AioContext has its own io_uring instance for file descriptor monitoring. The disk I/O io_uring code was developed separately. Originally I thought the characteristics of file descriptor monitoring and disk I/O were too different, requiring separate io_uring instances. Now it has become clear to me that it's feasible to share a single io_uring instance for file descriptor monitoring and disk I/O. We're not using io_uring's IOPOLL feature or anything else that would require a separate instance. Unify block/io_uring.c and util/fdmon-io_uring.c using the new aio_add_sqe() API that allows user-defined io_uring sqe submission. Now block/io_uring.c just needs to submit readv/writev/fsync and most of the io_uring-specific logic is handled by fdmon-io_uring.c. There are two immediate advantages: 1. Fewer system calls. There is no need to monitor the disk I/O io_uring ring fd from the file descriptor monitoring io_uring instance. Disk I/O completions are now picked up directly. Also, sqes are accumulated in the sq ring until the end of the event loop iteration and there are fewer io_uring_enter(2) syscalls. 2. Less code duplication. Note that error_setg() messages are not supposed to end with punctuation, so I removed a '.' for the non-io_uring build error message. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-ID: <20251104022933.618123-15-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	1eebdab3c3	aio-posix: add aio_add_sqe() API for user-defined io_uring requests Introduce the aio_add_sqe() API for submitting io_uring requests in the current AioContext. This allows other components in QEMU, like the block layer, to take advantage of io_uring features without creating their own io_uring context. This API supports nested event loops just like file descriptor monitoring and BHs do. This comes at a complexity cost: CQE callbacks must be placed on a list so that nested event loops can invoke pending CQE callbacks from parent event loops. If you're wondering why CqeHandler exists instead of just a callback function pointer, this is why. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-ID: <20251104022933.618123-14-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	87e7a0f423	aio-posix: add fdmon_ops->dispatch() The ppoll and epoll file descriptor monitoring implementations rely on the event loop's generic file descriptor, timer, and BH dispatch code to invoke user callbacks. The io_uring file descriptor monitoring implementation will need io_uring-specific dispatch logic for CQE handlers for custom SQEs. Introduce a new FDMonOps ->dispatch() callback that allows file descriptor monitoring implementations to invoke user callbacks. The next patch will use this new callback. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20251104022933.618123-13-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	421dcc8023	aio: add errp argument to aio_context_setup() When aio_context_new() -> aio_context_setup() fails at startup it doesn't really matter whether errors are returned to the caller or the process terminates immediately. However, it is not acceptable to terminate when hotplugging --object iothread at runtime. Refactor aio_context_setup() so that errors can be propagated. The next commit will set errp when fdmon_io_uring_setup() fails. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20251104022933.618123-10-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	3769b9abe9	aio: free AioContext when aio_context_new() fails g_source_destroy() only removes the GSource from the GMainContext it's attached to, if any. It does not free it. Use g_source_unref() instead so that the AioContext (which embeds a GSource) is freed. There is no need to call g_source_destroy() in aio_context_new() because the GSource isn't attached to a GMainContext yet. aio_ctx_finalize() expects everything to be set up already, so introduce the new ctx->initialized boolean and do nothing when called with !initialized. This also requires moving aio_context_setup() down after event_notifier_init() since aio_ctx_finalize() won't release any resources that aio_context_setup() acquired. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-ID: <20251104022933.618123-9-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	d1f42b600a	aio: remove aio_context_use_g_source() There is no need for aio_context_use_g_source() now that epoll(7) and io_uring(7) file descriptor monitoring works with the glib event loop. AioContext doesn't need to be notified that GSource is being used. On hosts with io_uring support this now enables fdmon-io_uring.c by default, replacing fdmon-poll.c and fdmon-epoll.c. In other words, the event loop will use io_uring! Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20251104022933.618123-8-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	ded29e64c6	aio-posix: integrate fdmon into glib event loop AioContext's glib integration only supports ppoll(2) file descriptor monitoring. epoll(7) and io_uring(7) disable themselves and switch back to ppoll(2) when the glib event loop is used. The main loop thread cannot use epoll(7) or io_uring(7) because it always uses the glib event loop. Future QEMU features may require io_uring(7). One example is uring_cmd support in FUSE exports. Each feature could create its own io_uring(7) context and integrate it into the event loop, but this is inefficient due to extra syscalls. It would be more efficient to reuse the AioContext's existing fdmon-io_uring.c io_uring(7) context because fdmon-io_uring.c will already be active on systems where Linux io_uring is available. In order to keep fdmon-io_uring.c's AioContext operational even when the glib event loop is used, extend FDMonOps with an API similar to GSourceFuncs so that file descriptor monitoring can integrate into the glib event loop. A quick summary of the GSourceFuncs API: - prepare() is called each event loop iteration before waiting for file descriptors and timers. - check() is called to determine whether events are ready to be dispatched after waiting. - dispatch() is called to process events. More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and also implement epoll(7)- and io_uring(7)-specific file descriptor monitoring code for glib event loops. Note that it's still faster to use aio_poll() rather than the glib event loop since glib waits for file descriptor activity with ppoll(2) and does not support adaptive polling. But at least epoll(7) and io_uring(7) now work in glib event loops. Splitting this into multiple commits without temporarily breaking AioContext proved difficult so this commit makes all the changes. The next commit will remove the aio_context_use_g_source() API because it is no longer needed. 
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-ID: <20251104022933.618123-7-stefanha@redhat.com> [kwolf: Build fixes; fix AioContext.list_lock use after destroy] Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:04:53 +01:00
Richard Henderson	c494afbb7d	nvme queue -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmkDE7gACgkQTeGvMW1P DekCOwgAuOQKWWW/UA1MmZ4ZHs+djf4q5UDwqGDx8tra8d32mZWRHgpJ/OBBOY2z CmuHqWLgooAqfx4hsrXELdNBEe7ccNE9nvsE3GjnYWxjoe51yl2Xc0RD5CZBVrN4 RRMbBZRCewxGShyUaT31eedolWdr4zBuqkpLf9gcG8Yk7YD+xUkHUPeMXeAy+vkS pxW59AkXdjJZgBktOdV5uVj9gaCPgTcGaQNH2FYSnzHwdu5VyV8BKiiZE/fXS6FU xZvu+5p1Ro5vOdwG+iFBrbBwcGyjVOF1OfBZctyc83foyFxwzxqoqj9gy0ewuT2g HsupUiJgbkZ1Ut9fzaS5pHx3dd3dKw== =WDrH -----END PGP SIGNATURE----- Merge tag 'pull-nvme-20251030' of https://gitlab.com/birkelund/qemu into staging nvme queue # gpg: Signature made Thu 30 Oct 2025 08:28:56 AM CET # gpg: using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9 # gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown] # gpg: aka "Klaus Jensen <k.jensen@samsung.com>" [unknown] # gpg: WARNING: This key is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. 
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468 4272 63D5 6FC5 E55D A838 # Subkey fingerprint: 5228 33AA 75E2 DCE6 A247 66C0 4DE1 AF31 6D4F 0DE9 * tag 'pull-nvme-20251030' of https://gitlab.com/birkelund/qemu: hw/nvme: add atomic boundary support hw/nvme: enable ns atomic writes hw/nvme: connect SPDM over NVMe Security Send/Recv spdm: define SPDM transport enum types hw/nvme: add NVMe Admin Security SPDM support spdm: add spdm storage transport virtual header spdm-socket: add seperate send/recv functions Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2025-10-31 12:56:05 +01:00
Wilfred Mallawa	e5534abeb4	hw/nvme: add NVMe Admin Security SPDM support Adds NVMe Admin Security Send/Receive command support, with support for DMTF's SPDM. The transport binding for SPDM is defined in DMTF DSP0286. Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>	2025-10-30 07:07:14 +01:00
Kevin Wolf	5b4b3bfdfc	qemu-img info: Optionally show block limits Add a new --limits option to 'qemu-img info' that displays the block limits for the image and all of its children, making the information more accessible for human users than in QMP. This option is not enabled by default because it can be a lot of output that isn't usually relevant if you're not specifically trying to diagnose some I/O problem. This makes the same information automatically also available in HMP 'info block -v'. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251024123041.51254-4-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:10 +01:00
Kevin Wolf	46dd683d56	block: Improve comments in BlockLimits Patches to expose the limits in QAPI have made clear that the existing documentation of BlockLimits could be improved: The meaning of min_mem_alignment and opt_mem_alignment could be clearer, and talking about better alignment values isn't helpful when we only detect these values and never choose them. Make the changes in the BlockLimits documentation now, so that the patches exposing the fields in QAPI can use descriptions consistent with it. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20251024123041.51254-2-kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Fiona Ebner	cbadaf57a7	block: implement 'resize' callback for child_of_bds class If a filtered child is resized, the size of the parent node is now also refreshed (recursively for chains of filtered children). For filter block drivers that do not implement .bdrv_co_getlength(), this commit does not change the current behavior, because bdrv_co_refresh_total_sectors() will use the current size via the passed-in hint. This is the case for block drivers for (some) block jobs, as well as copy-before-write. Block jobs already set up a blocker preventing a QMP block_resize operation while the job is running. That does not directly cover an associated 'file' node of a 'raw' node, but resizing such a 'file' node is already prevented too (backup, commit, mirror and stream were checked). The other case is copy-before-write. This commit does not change the fact that the copy-before-write node still has the same size after its filtered child is resized. Block drivers that do implement .bdrv_co_getlength() and where .is_filter is true, already returned the length of the file child, so there is no change before and after this commit, with two exceptions: 1. preallocate can return an early data_end and otherwise queries the file child, but that special casing is not changed. 2. blkverify returns the length of the test file. This commit does not affect that behavior. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250917115509.401015-4-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Fiona Ebner	08736e7584	block: make bdrv_co_parent_cb_resize() a proper IO API function In preparation for calling it via the bdrv_child_cb_resize() callback that will be added by the next commit. Rename it to include the "_co_" part while at it. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20250917115509.401015-3-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Fiona Ebner	4120375420	include/block/block_int-common: document when resize callback is used The 'resize' callback is only called by bdrv_parent_cb_resize() which is only called by bdrv_co_write_req_finish() to notify the parent(s) that the child was resized. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20250917115509.401015-2-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Chandan Somani	9f0c763e16	block: enable stats-intervals for storage devices This patch allows stats-intervals to be used for storage devices with the -device option. It accepts a list of interval lengths in JSON format. It configures and collects the stats in the BlockBackend layer through the storage device that consumes the BlockBackend. Signed-off-by: Chandan Somani <csomani@redhat.com> Message-ID: <20251003220039.1336663-1-csomani@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Fiona Ebner	a256a427b0	blockjob: mark block_job_remove_all_bdrv() as GRAPH_UNLOCKED The function block_job_remove_all_bdrv() calls bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-49-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:28 +02:00
Fiona Ebner	2cf92b15cd	block: mark bdrv_open_child_common() and its callers GRAPH_UNLOCKED The function bdrv_open_child_common() calls bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. Mark it and its two callers bdrv_open_file_child() and bdrv_open_child() as GRAPH_UNLOCKED. This requires temporarily unlocking in vmdk_parse_extents() and making the locked section shorter in vmdk_open(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-48-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:27 +02:00
Fiona Ebner	ede0859311	block: mark bdrv_close() as GRAPH_UNLOCKED The functions blk_log_writes_close(), blkverify_close(), quorum_close(), vmdk_close() via vmdk_free_extents(), and other bdrv_close() implementations call bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. They are reached via the BlockDriver's bdrv_close() callback and the bdrv_close() wrapper, which are also marked as GRAPH_UNLOCKED_PTR and GRAPH_UNLOCKED. Furthermore, the function bdrv_close() also calls bdrv_drained_begin() and bdrv_graph_wrlock_drained(), so there are additional reasons for marking it GRAPH_UNLOCKED. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-47-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:26 +02:00
Fiona Ebner	6d7e3f8de0	block: mark bdrv_close_all() as GRAPH_UNLOCKED The function bdrv_close_all() calls bdrv_drain_all(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-46-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:25 +02:00
Fiona Ebner	94371745d7	block: mark bdrv_drop_intermediate() as GRAPH_UNLOCKED The function bdrv_drop_intermediate() calls bdrv_drained_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-45-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:23 +02:00
Fiona Ebner	04f4d9c555	block: mark bdrv_insert_node() as GRAPH_UNLOCKED The function bdrv_insert_node() calls bdrv_drained_begin() which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-44-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:22 +02:00
Fiona Ebner	5d04823347	block: mark bdrv_replace_child_bs() as GRAPH_UNLOCKED The function bdrv_replace_child_bs() calls bdrv_drained_begin() which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-43-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:21 +02:00
Fiona Ebner	7525aa25db	block: mark bdrv_inactivate_all() as GRAPH_UNLOCKED The function bdrv_inactivate_all() calls bdrv_drain_all_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-37-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:08 +02:00
Fiona Ebner	e2d9cc5790	block: mark bdrv_inactivate() as GRAPH_RDLOCK and move drain to callers The function bdrv_inactivate() calls bdrv_drain_all_begin(), which needs to be called with the graph unlocked, so either bdrv_inactivate() should be marked as GRAPH_UNLOCKED or the drain needs to be moved to the callers. The caller in qmp_blockdev_set_active() requires that the locked section covers bdrv_find_node() too, so the latter alternative is chosen. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-36-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:07 +02:00
Fiona Ebner	6717dc3075	block: mark bdrv_reopen_queue() and bdrv_reopen_multiple() as GRAPH_UNLOCKED The function bdrv_reopen_queue() can call bdrv_drain_all_begin(), which must be called with the graph unlocked. The function bdrv_reopen_multiple() calls bdrv_reopen_prepare() which must be called with the graph unlocked. To mark bdrv_reopen_queue() as GRAPH_UNLOCKED, it is necessary to make the locked section in reopen_backing_file() shorter. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-35-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:05 +02:00
Fiona Ebner	c6b5328b5b	block/snapshot: mark bdrv_all_delete_snapshot() as GRAPH_UNLOCKED The function bdrv_all_delete_snapshot() calls bdrv_drain_all_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-33-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:02 +02:00
Fiona Ebner	54eb59d668	block: drop wrapper for bdrv_set_backing_hd_drained() Nearly all callers (outside of the tests) are already using the _drained() variant of the function. It doesn't seem worth keeping. Simply adapt the remaining callers of bdrv_set_backing_hd() and rename bdrv_set_backing_hd_drained() to bdrv_set_backing_hd(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-31-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:41:58 +02:00
Fiona Ebner	de0d24c711	block: mark bdrv_set_backing_hd() as GRAPH_UNLOCKED The function bdrv_set_backing_hd() calls bdrv_drain_all_begin(), which must be called with the graph unlocked. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-29-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:41:54 +02:00
Fiona Ebner	6b89e851fa	block: add bdrv_graph_wrlock_drained() convenience wrapper Many write-locked sections are also drained sections. A new bdrv_graph_wrlock_drained() wrapper around bdrv_graph_wrlock() is introduced, which will begin a drained section first. A global variable is used so bdrv_graph_wrunlock() knows if it also needs to end such a drained section. Both the aio_poll call in bdrv_graph_wrlock() and the aio_bh_poll() in bdrv_graph_wrunlock() can re-enter a write-locked section. While for the latter, ending the drain could be moved to before the call, the former requires that the variable is a counter and not just a boolean. Since the wrapper calls bdrv_drain_all_begin(), which must be called with the graph unlocked, mark the wrapper as GRAPH_UNLOCKED too. The switch to the new helpers was generated with the following commands and then manually checked: `find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock_drained();/g' {} ';'` `find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock();/g' {} ';'` Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-25-f.ebner@proxmox.com> [kwolf: Removed redundant GRAPH_UNLOCKED] Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:40:58 +02:00
Fiona Ebner	502f00c51a	block: never use atomics to access bs->quiesce_counter All accesses of bs->quiesce_counter are in the main thread, either after a GLOBAL_STATE_CODE() macro or in a function with GRAPH_WRLOCK annotation. This is essentially a revert of `414c2ec358` ("block: access quiesce_counter with atomic ops"). At that time, neither the GLOBAL_STATE_CODE() macro nor the GRAPH_WRLOCK annotation existed. Even if the field was only accessed in the main thread back then (did not check if that is actually the case), it wouldn't have been easy to verify. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-24-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:40:45 +02:00
Fiona Ebner	fc1d2f3eac	block: mark bdrv_drained_begin() and friends as GRAPH_UNLOCKED All of bdrv_drain_all_begin(), bdrv_drain_all() and bdrv_drained_begin() poll and are not allowed to be called with the block graph lock held. Mark the functions as such. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-20-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:34 +02:00
Fiona Ebner	d75f8ed1d7	block: move drain outside of quorum_del_child() The quorum_del_child() callback runs under the graph lock, so it is not allowed to drain. It is only called as the .bdrv_del_child() callback, which is only called in the bdrv_del_child() function, which also runs under the graph lock. The bdrv_del_child() function is called by qmp_x_blockdev_change(). A drained section was already introduced there by commit "block: move drain out of quorum_add_child()". This finally finishes moving out the drain to places that are not under the graph lock started in "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-17-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:34 +02:00
Fiona Ebner	0414930d3a	block: move drain outside of quorum_add_child() This is part of resolving the deadlock mentioned in commit "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". The quorum_add_child() callback runs under the graph lock, so it is not allowed to drain. It is only called as the .bdrv_add_child() callback, which is only called in the bdrv_add_child() function, which also runs under the graph lock. The bdrv_add_child() function is called by qmp_x_blockdev_change(), where a drained section is introduced. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-15-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:34 +02:00
Fiona Ebner	ffdcd081f5	block: move drain outside of bdrv_root_attach_child() This is part of resolving the deadlock mentioned in commit "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". The function bdrv_root_attach_child() runs under the graph lock, so it is not allowed to drain. It is called by: 1. blk_insert_bs(), where a drained section is introduced. 2. block_job_add_bdrv(), which holds the graph lock itself. block_job_add_bdrv() is called by: 1. mirror_start_job() 2. stream_start() 3. commit_start() 4. backup_job_create() 5. block_job_create() 6. In the test_blockjob_common_drain_node() unit test In all callers, a drained section is introduced. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-13-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:34 +02:00
Fiona Ebner	a1ea8eb591	block: move drain outside of bdrv_try_change_aio_context() This is part of resolving the deadlock mentioned in commit "block: move draining out of bdrv_change_aio_context() and mark GRAPH_RDLOCK". Convert the function to a _locked() version that has to be called with the graph lock held and add a convenience wrapper that has to be called with the graph unlocked, which drains and takes the lock itself. Since bdrv_try_change_aio_context() is global state code, the wrapper is too. Callers are adapted to use the appropriate variant, depending on whether the caller already holds the lock. In the test_set_aio_context() unit test, prior drains can be removed, because draining already happens inside the new wrapper. Note that bdrv_attach_child_common_abort(), bdrv_attach_child_common() and bdrv_root_unref_child() hold the graph lock and are not actually allowed to drain either. This will be addressed in the following commits. Functions like qmp_blockdev_mirror() query the nodes to act on before draining and locking. In theory, draining could invalidate those nodes. This kind of issue is not addressed by these commits. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-10-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:34 +02:00
Fiona Ebner	91ba0e1c38	block: move drain outside of bdrv_change_aio_context() and mark GRAPH_RDLOCK This is in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED. Note that even if bdrv_drained_begin() were already marked as GRAPH_UNLOCKED, TSA would not complain about the instance in bdrv_change_aio_context() before this change, because it is preceded by a bdrv_graph_rdunlock_main_loop() call. It is not correct to release the lock here, and in case the caller holds a write lock, it wouldn't actually release the lock. In combination with block-stream, there is a deadlock that can happen because of this [0]. In particular, the following can happen: 1. the main thread acquires the write lock; 2. the IO thread, in blk_co_do_preadv_part(), has a non-zero blk->in_flight; 3. the IO thread tries to acquire the read lock; 4. the main thread begins to drain. Steps 3 and 4 might be switched. Draining will poll and get stuck, because it will see the non-zero in_flight counter. But the IO thread will not make any progress either, because it cannot acquire the read lock. After this change, all paths to bdrv_change_aio_context() drain: bdrv_change_aio_context() is called by: 1. bdrv_child_cb_change_aio_ctx() which is only called via the change_aio_ctx() callback, see below. 2. bdrv_child_change_aio_context(), see below. 3. bdrv_try_change_aio_context(), where a drained section is introduced. The change_aio_ctx() callback is called by: 1. bdrv_attach_child_common_abort(), where a drained section is introduced. 2. bdrv_attach_child_common(), where a drained section is introduced. 3. bdrv_parent_change_aio_context(), see below. bdrv_child_change_aio_context() is called by: 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained section is invariant. 2. child_job_change_aio_ctx(), which is only called via the change_aio_ctx() callback, see above. bdrv_parent_change_aio_context() is called by: 1. bdrv_change_aio_context(), i.e. recursive, so being in a drained section is invariant. This resolves all code paths.
Note that bdrv_attach_child_common() and bdrv_attach_child_common_abort() hold the graph write lock and callers of bdrv_try_change_aio_context() might too, so they are not actually allowed to drain either. This will be addressed in the following commits. More granular draining is not trivially possible, because bdrv_change_aio_context() can recursively call itself e.g. via bdrv_child_change_aio_context(). [0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63f51@virtuozzo.com/ Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-9-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:33 +02:00
Fiona Ebner	469422c45b	block: mark bdrv_child_change_aio_context() GRAPH_RDLOCK This is a small step in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED. More concretely, it is in preparation to move the drain out of bdrv_change_aio_context() and to mark that function as GRAPH_RDLOCK. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-8-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:33 +02:00
Fiona Ebner	844d550d09	block: mark change_aio_ctx() callback and instances as GRAPH_RDLOCK(_PTR) This is a small step in preparation to mark bdrv_drained_begin() as GRAPH_UNLOCKED. More concretely, it is in preparation to move the drain out of bdrv_change_aio_context() and to mark that function as GRAPH_RDLOCK. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250530151125.955508-7-f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-06-04 18:16:33 +02:00
Eric Blake	253b43a290	mirror: Drop redundant zero_target parameter The two callers to a mirror job (drive-mirror and blockdev-mirror) set zero_target precisely when sync mode == FULL, with the one exception that drive-mirror skips zeroing the target if it was newly created and reads as zero. But given the previous patch, that exception is equally captured by target_is_zero. Meanwhile, there is another slight wrinkle, fortunately caught by iotest 185: if the caller uses "sync":"top" but the source has no backing file, the code in blockdev.c was changing sync to be FULL, but only after it had set zero_target=false. In mirror.c, prior to recent patches, this didn't matter: the only places that inspected sync were setting is_none_mode (both TOP and FULL had set that to false), and mirror_start() setting base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL. But now that we are passing sync around, the slammed sync mode would result in a new pre-zeroing pass even when the user had passed "sync":"top" in an effort to skip pre-zeroing. Fortunately, the assignment of base when bs has no backing chain still works out to NULL if we don't slam things. So with the forced change of sync ripped out of blockdev.c, the sync mode is passed through the full callstack unmolested, and we can now reliably reconstruct the same settings as what used to be passed in by zero_target=false, without the redundant parameter. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250509204341.3553601-24-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> [eblake: Fix regression in iotest 185] Signed-off-by: Eric Blake <eblake@redhat.com>	2025-05-14 20:10:12 -05:00
Eric Blake	d17a34bfb9	mirror: Allow QMP override to declare target already zero QEMU has an optimization for a just-created drive-mirror destination that is not possible for blockdev-mirror (which can't create the destination) - any time we know the destination starts life as all zeroes, we can skip a pre-zeroing pass on the destination. Recent patches have added an improved heuristic for detecting if a file contains all zeroes, and we plan to use that heuristic in upcoming patches. But since a heuristic cannot quickly detect all scenarios, and there may be cases where the caller is aware of information that QEMU cannot learn quickly, it makes sense to have a way to tell QEMU to assume facts about the destination that can make the mirror operation faster. Given our existing example of "qemu-img convert --target-is-zero", it is time to expose this override in QMP for blockdev-mirror as well. This patch results in some slight redundancy between the older s->zero_target (set any time mode==FULL and the destination image was not just created - ie. clear if drive-mirror is asking to skip the pre-zero pass) and the newly-introduced s->target_is_zero (in addition to the QMP override, it is set when drive-mirror creates the destination image); this will be cleaned up in the next patch. There is also a subtlety that we must consider. When drive-mirror is passing target_is_zero on behalf of a just-created image, we know the image is sparse (skipping the pre-zeroing keeps it that way), so it doesn't matter whether the destination also has "discard":"unmap" and "detect-zeroes":"unmap". But now that we are letting the user set the knob for target-is-zero, if the user passes a pre-existing file that is fully allocated, it is fine to leave the file fully allocated under "detect-zeroes":"on", but if the file is open with "detect-zeroes":"unmap", we should really be trying harder to punch holes in the destination for every region of zeroes copied from the source. 
The easiest way to do this is to still run the pre-zeroing pass (turning the entire destination file sparse before populating just the allocated portions of the source), even though that currently results in double I/O to the portions of the file that are allocated. A later patch will add further optimizations to reduce redundant zeroing I/O during the mirror operation. Since "target-is-zero":true is designed for optimizations, it is okay to silently ignore the parameter rather than erroring if the user ever sets the parameter in a scenario where the mirror job can't exploit it (for example, when doing "sync":"top" instead of "sync":"full", we can't pre-zero, so setting the parameter won't make a speed difference). Signed-off-by: Eric Blake <eblake@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-ID: <20250509204341.3553601-23-eblake@redhat.com> Reviewed-by: Sunny Zhu <sunnyzhyy@qq.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-05-14 16:55:10 -05:00
Eric Blake	5272609670	block: Add new bdrv_co_is_all_zeroes() function There are some optimizations that require knowing if an image starts out as reading all zeroes, such as making blockdev-mirror faster by skipping the copying of source zeroes to the destination. The existing bdrv_co_is_zero_fast() is a good building block for answering this question, but it tends to give an answer of 0 for a file we just created via QMP 'blockdev-create' or similar (such as 'qemu-img create -f raw'). Why? Because file-posix.c insists on allocating a tiny header to any file rather than leaving it 100% sparse, due to some filesystems that are unable to answer alignment probes on a hole. But teaching file-posix.c to read the tiny header doesn't scale - the problem of a small header is also visible when libvirt sets up an NBD client to a just-created file on a migration destination host. So, we need a wrapper function that handles a bit more complexity in a common manner for all block devices - when the BDS is mostly a hole, but has a small non-hole header, it is still worth the time to read that header and check if it reads as all zeroes before giving up and returning a pessimistic answer. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-19-eblake@redhat.com>	2025-05-14 16:08:23 -05:00
Eric Blake	c33159dec7	block: Expand block status mode from bool to flags This patch is purely mechanical, changing bool want_zero into an unsigned int for bitwise-or of flags. As of this patch, all implementations are unchanged (the old want_zero==true is now mode==BDRV_WANT_PRECISE which is a superset of BDRV_WANT_ZERO); but the callers in io.c that used to pass want_zero==false are now prepared for future driver changes that can now distinguish between BDRV_WANT_ZERO and BDRV_WANT_ALLOCATED. The next patch will actually change the file-posix driver along those lines, now that we have more-specific hints. As for the background why this patch is useful: right now, the file-posix driver recognizes that if allocation is being queried, the entire image can be reported as allocated (there is no backing file to refer to) - but this throws away information on whether the entire image reads as zero (trivially true if lseek(SEEK_HOLE) at offset 0 returns -ENXIO, a bit more complicated to prove if the raw file was created with 'qemu-img create' since we intentionally allocate a small chunk of all-zero data to help with alignment probing). Later patches will add a generic algorithm for seeing if an entire file reads as zeroes. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20250509204341.3553601-16-eblake@redhat.com>	2025-05-14 15:33:34 -05:00
Raman Dzehtsiar	3d3911f16b	blockdev-backup: Add error handling option for copy-before-write jobs This patch extends the blockdev-backup QMP command to allow users to specify how to behave when IO errors occur during copy-before-write operations. Previously, the behavior was fixed and could not be controlled by the user. The new 'on-cbw-error' option can be set to one of two values: - 'break-guest-write': Forwards the IO error to the guest and triggers the on-source-error policy. This preserves snapshot integrity at the expense of guest IO operations. - 'break-snapshot': Allows the guest OS to continue running normally, but invalidates the snapshot and aborts related jobs. This prioritizes guest operation over backup consistency. This enhancement provides more flexibility for backup operations in different environments where requirements for guest availability versus backup consistency may vary. The default behavior remains unchanged to maintain backward compatibility. Signed-off-by: Raman Dzehtsiar <Raman.Dzehtsiar@gmail.com> Message-ID: <20250414090025.828660-1-Raman.Dzehtsiar@gmail.com> Acked-by: Markus Armbruster <armbru@redhat.com> [vsementsov: fix long lines] Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>	2025-05-12 18:19:31 +03:00
Sunny Zhu	ed1aef1716	block: Remove unused callback function bdrv_aio_pdiscard The bytes type in bdrv_aio_pdiscard should be int64_t rather than int. There are no drivers implementing the *bdrv_aio_pdiscard() callback, it appears to be an unused function. Therefore, we'll simply remove it instead of fixing it. Additionally, coroutine-based callbacks are preferred. If someone needs to implement bdrv_aio_pdiscard, a coroutine-based version would be straightforward to implement. Signed-off-by: Sunny Zhu <sunnyzhyy@qq.com> Message-ID: <tencent_7140D2E54157D98CF3D9E64B1A007A1A7906@qq.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-04-25 17:06:50 +02:00
Kevin Wolf	ee416407b3	aio-posix: Separate AioPolledEvent per AioHandler Adaptive polling has a big problem: It doesn't consider that an event loop can wait for many different events that may have very different typical latencies. For example, think of a guest that tends to send a new I/O request soon after the previous I/O request completes, but the storage on the host is rather slow. In this case, getting the new request from guest quickly means that polling is enabled, but the next thing is performing the I/O request on the backend, which is slow and disables polling again for the next guest request. This means that in such a scenario, polling could help for every other event, but is only ever enabled when it can't succeed. In order to fix this, keep a separate AioPolledEvent for each AioHandler. We will then know that the backend file descriptor always has a high latency and isn't worth polling for, but we also know that the guest is always fast and we should poll for it. This solves at least half of the problem, we can now keep polling for those cases where it makes sense and get the improved performance from it. Since the event loop doesn't know which event will be next, we still do some unnecessary polling while we're waiting for the slow disk. I made some attempts to be more clever than just randomly growing and shrinking the polling time, and even to let callers be explicit about when they expect a new event, but so far this hasn't resulted in improved performance or even caused performance regressions. For now, let's just fix the part that is easy enough to fix, we can revisit the rest later. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250307221634.71951-6-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-03-13 17:57:23 +01:00
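
The per-handler idea in the aio-posix commit above can be illustrated with a small standalone sketch. This is not QEMU's code: `PolledEvent`, `PollParams`, `polled_event_adjust()` and the 4000 ns starting budget are hypothetical names and values chosen for illustration, and the grow/shrink rule is only a simplified stand-in for whatever heuristic the event loop really uses. The point it tries to show is that a separate polling budget per event source lets a fast source keep its budget while a slow source keeps polling disabled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-handler polling state (illustration only). */
typedef struct {
    int64_t poll_ns;      /* current polling budget; 0 means polling is off */
} PolledEvent;

/* Hypothetical tuning knobs shared by all handlers. */
typedef struct {
    int64_t poll_max_ns;  /* upper bound for any polling budget */
    int64_t grow;         /* multiplicative growth factor, must be >= 1 */
    int64_t shrink;       /* divisor used when shrinking; 0 resets to 0 */
} PollParams;

/*
 * Update one handler's budget after waiting block_ns for its event.
 * Only the state of the handler that fired is touched.
 */
static void polled_event_adjust(PolledEvent *ev, const PollParams *p,
                                int64_t block_ns)
{
    assert(p->grow >= 1);

    if (block_ns <= ev->poll_ns) {
        /* The event arrived within the current budget: keep it. */
        return;
    }
    if (block_ns > p->poll_max_ns) {
        /* Too slow to be worth polling for: shrink or disable. */
        ev->poll_ns = p->shrink ? ev->poll_ns / p->shrink : 0;
        return;
    }

    /* A larger budget would have caught the event: grow, then clamp. */
    int64_t next = ev->poll_ns ? ev->poll_ns * p->grow : 4000;
    if (next < block_ns) {
        next = block_ns;
    }
    if (next > p->poll_max_ns) {
        next = p->poll_max_ns;
    }
    ev->poll_ns = next;
}
```

With one shared `PolledEvent` fed alternately by a fast guest notifier and a slow backend, every slow completion resets the budget just before the next fast event is awaited, which is the failure mode the commit message describes. With one `PolledEvent` per source, the guest notifier settles on a budget that covers its latency and the backend stays at zero.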

1675 commits