qemu-cr16

Author	SHA1	Message	Date
Hanna Czenczek	b002acacc1	Revert "nvme: Fix coroutine waking" This reverts commit `0f142cbd91`. Said commit changed the replay_bh_schedule_oneshot_event() in nvme_rw_cb() to aio_co_wake(), allowing the request coroutine to be entered directly (instead of only being scheduled for later execution). This can cause the device to become stalled like so: It is possible that after completion the request coroutine goes on to submit another request without yielding, e.g. a flush after a write to emulate FUA. This will likely cause a nested nvme_process_completion() call because nvme_rw_cb() itself is called from there. (After submitting a request, we invoke nvme_process_completion() through defer_call(); but the fact that nvme_process_completion() ran in the first place indicates that we are not in a call-deferring section, so defer_call() will call nvme_process_completion() immediately.) If this inner nvme_process_completion() loop then processes any completions, it will write the final completion queue (CQ) head index to the CQ head doorbell, and subsequently execution will return to the outer nvme_process_completion() loop. Even if this loop now finds no further completions, it still processed at least one completion before, or it would not have called the nvme_rw_cb() which led to nesting. Therefore, it will now write the exact same CQ head index value to the doorbell, which effectively is an unrecoverable error[1]. Therefore, nesting of nvme_process_completion() does not work at this point. Reverting said commit removes the nesting (by scheduling the request coroutine instead of entering it immediately), and so fixes the stall. On the downside, reverting said commit breaks multiqueue for nvme, but better to have single-queue working than neither. For 11.0, we will have a solution that makes both work. A side note: There is a comment in nvme_process_completion() above qemu_bh_schedule() that claims nesting works, as long as it is done through the completion_bh. 
I am quite sure that is not true, for two reasons: - The problem described above, which is even worse when going through nvme_process_completion_bh() because that function unconditionally writes to the CQ head doorbell, - nvme_process_completion_bh() never takes q->lock, so nvme_process_completion() unlocking it will likely abort. Given the lack of reports of such aborts, I believe that completion_bh simply is unused in practice. [1] See the NVMe Base Specification revision 2.3, page 180, figure 152: “Invalid Doorbell Write Value: A host attempted to write an invalid doorbell value. Some possible causes of this error are: [...] the value written is the same as the previously written doorbell value.” To even be notified of this error, we would need to send an Asynchronous Event Request to the admin queue (p. 178ff), which we don’t do, and then to handle it, we would need to delete and recreate the queue (p. 88, section 3.3.1.2 Queue Usage). Cc: qemu-stable@nongnu.org Reported-by: Lukáš Doktor <ldoktor@redhat.com> Tested-by: Lukáš Doktor <ldoktor@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-id: 20251215141540.88915-1-hreitz@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-12-15 09:50:41 -05:00
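The nesting problem described in this revert can be modelled in a few lines of plain C. This is a simplified sketch, not the code in block/nvme.c: `cq_model`, `request_cb()` and `process_completions()` are hypothetical names, and the `enter_directly` flag stands in for the difference between entering the request coroutine immediately and only scheduling it.

```c
#include <assert.h>
#include <stdbool.h>

#define CQ_SIZE 8

/* Hypothetical model of one completion queue (not QEMU's data structure). */
struct cq_model {
    int head;             /* next entry to consume */
    int tail;             /* one past the last completed entry */
    int doorbell;         /* last value written to the CQ head doorbell */
    int duplicate_writes; /* doorbell writes that repeated the previous value */
    bool enter_directly;  /* true: the callback re-enters completion processing */
};

static void process_completions(struct cq_model *q, int depth);

/* Stand-in for the request callback: may submit a follow-up request at once. */
static void request_cb(struct cq_model *q, int depth)
{
    if (q->enter_directly && depth == 0) {
        /* Submitting without yielding leads to a nested processing call. */
        process_completions(q, depth + 1);
    }
}

static void process_completions(struct cq_model *q, int depth)
{
    bool progress = false;

    assert(depth <= 1); /* the model only nests one level deep */

    while (q->head != q->tail) {
        q->head = (q->head + 1) % CQ_SIZE;
        progress = true;
        request_cb(q, depth);
    }

    if (progress) {
        /* The outer loop repeats the value the inner loop already wrote. */
        if (q->doorbell == q->head) {
            q->duplicate_writes++;
        }
        q->doorbell = q->head;
    }
}
```

With two pending completions and `enter_directly` set, the inner call consumes the second entry and writes the doorbell, and the outer call then writes the same value again. With the flag cleared, which corresponds to scheduling the coroutine as the revert does, the doorbell is written once.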
Cédric Le Goater	326e620fc0	Fix const qualifier build errors with recent glibc A recent change in glibc 2.42.9000 [1] changes the return type of strstr() and other string functions to be 'const char *' when the input is a 'const char *'. This breaks the build in various files with errors such as: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers] 208 | char *pidstr = strstr(filename, "%"); | ^~~~~~ Fix this by changing the type of the variables that store the result of these functions to 'const char *'. [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=cd748a63ab1a7ae846175c532a3daab341c62690 Signed-off-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Laurent Vivier <laurent@vivier.eu> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20251209174328.698774-1-clg@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>	2025-12-09 21:00:15 +01:00
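The shape of this fix can be shown with a small stand-alone function. It is a sketch only: `percent_offset()` is a hypothetical helper, not a function from the QEMU tree; only the `pidstr` variable name is borrowed from the error message quoted in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from QEMU): return the offset of the first '%'
 * in a read-only file name, or -1 if there is none.
 */
static long percent_offset(const char *filename)
{
    const char *pidstr;

    assert(filename != NULL);

    /*
     * Keeping the result in a 'const char *' compiles whether strstr()
     * returns 'char *' or a const-preserving pointer type.
     */
    pidstr = strstr(filename, "%");

    return pidstr ? (long)(pidstr - filename) : -1;
}
```

Declaring the variable as `char *` instead is what triggers the discarded-qualifiers error once the library returns a const-qualified pointer for a const-qualified input.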
Kevin Wolf	2c3165a1a6	file-posix: Handle suspended dm-multipath better for SG_IO When introducing DM_MPATH_PROBE_PATHS, we already anticipated that dm-multipath devices might be suspended for a short time when the DM tables are reloaded and that they return -EAGAIN in this case. We then wait for a millisecond and retry. However, meanwhile it has also turned out that libmpathpersist (which is used by qemu-pr-helper) may need to perform more complex recovery operations to get reservations back to expected state if a path failure happened in the middle of a PR operation. In this case, the device is suspended for a longer time compared to the case we originally expected. This patch changes hdev_co_ioctl() to treat -EAGAIN separately so that it doesn't result in an immediate failure if the device is suspended for more than 1ms, and moves to incremental backoff to cover both quick and slow cases without excessive delays. Buglink: https://issues.redhat.com/browse/RHEL-121543 Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20251128221440.89125-1-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-12-04 18:34:15 +01:00
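A retry loop with incremental backoff might look roughly like the following. This is a sketch under assumptions, not the code in hdev_co_ioctl(): `fake_dev`, `fake_ioctl()` and `retry_eagain()` are hypothetical, the delay and budget values are left to the caller, and the delays are only summed rather than slept so the example stays deterministic.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a device that stays suspended for a few calls. */
struct fake_dev {
    int busy_calls; /* how many more calls will return -EAGAIN */
};

static int fake_ioctl(struct fake_dev *dev)
{
    if (dev->busy_calls > 0) {
        dev->busy_calls--;
        return -EAGAIN;
    }
    return 0;
}

/*
 * Retry while the device reports -EAGAIN, doubling the delay each time up to
 * cap_ns, and give up once budget_ns of waiting has been used up. The delays
 * are only added to *waited_ns; a real caller would sleep for each of them.
 */
static int retry_eagain(struct fake_dev *dev, uint64_t first_ns,
                        uint64_t cap_ns, uint64_t budget_ns,
                        uint64_t *waited_ns)
{
    uint64_t delay = first_ns;
    int ret;

    assert(first_ns > 0 && cap_ns >= first_ns);
    *waited_ns = 0;

    while ((ret = fake_ioctl(dev)) == -EAGAIN && *waited_ns < budget_ns) {
        *waited_ns += delay; /* placeholder for the actual sleep */
        delay = delay > cap_ns / 2 ? cap_ns : delay * 2;
    }

    return ret;
}
```

Starting with a short delay keeps the quick table-reload case cheap, while the doubling lets a longer suspension be covered without many wake-ups.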
Stefan Hajnoczi	d704a13d2c	block: use pwrite_zeroes_alignment when writing first sector Since commit `5634622bcb` ("file-posix: allow BLKZEROOUT with -t writeback"), qemu-img create errors out on a Linux loop block device with a 4 KB sector size: # dd if=/dev/zero of=blockfile bs=1M count=1024 # losetup --sector-size 4096 /dev/loop0 blockfile # qemu-img create -f raw /dev/loop0 1G Formatting '/dev/loop0', fmt=raw size=1073741824 qemu-img: /dev/loop0: Failed to clear the new image's first sector: Invalid argument Use the pwrite_zeroes_alignment block limit to avoid misaligned fallocate(2) or ioctl(BLKZEROOUT) in the block/file-posix.c block driver. Cc: qemu-stable@nongnu.org Fixes: `5634622bcb` ("file-posix: allow BLKZEROOUT with -t writeback") Reported-by: Jean-Louis Dupond <jean-louis@dupond.be> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/3127 Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20251007141700.71891-3-stefanha@redhat.com> Tested-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-25 15:26:22 +01:00
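The length calculation implied by this fix can be sketched as below. The function `first_sector_zero_len()` is hypothetical and the clamp to the image size is an assumption made for the example; only the 512-byte sector size (QEMU's BDRV_SECTOR_SIZE) and the idea of honouring `pwrite_zeroes_alignment` come from the commit.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* same value as QEMU's BDRV_SECTOR_SIZE */

/*
 * Hypothetical helper: how many bytes to zero at offset 0 so that the
 * write-zeroes request respects the device's alignment requirement.
 */
static uint64_t first_sector_zero_len(uint32_t pwrite_zeroes_alignment,
                                      uint64_t image_size)
{
    uint64_t bytes = SECTOR_SIZE;

    assert(image_size > 0); /* an empty image has nothing to clear */

    /* An alignment of 0 means "no requirement" in this sketch. */
    if (pwrite_zeroes_alignment > bytes) {
        bytes = pwrite_zeroes_alignment;
    }

    /* Assumption for the example: never zero past the end of the image. */
    if (bytes > image_size) {
        bytes = image_size;
    }

    return bytes;
}
```

For the loop device with 4 KB sectors from the reproducer, this yields a 4096-byte request instead of a misaligned 512-byte one.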
Stefan Hajnoczi	98e788b91a	file-posix: populate pwrite_zeroes_alignment Linux block devices require write zeroes alignment whereas files do not. It may come as a surprise that block devices opened in buffered I/O mode require the alignment for write zeroes requests although normal read/write requests do not. Therefore it is necessary to populate the pwrite_zeroes_alignment field. Cc: qemu-stable@nongnu.org Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20251007141700.71891-2-stefanha@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Tested-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-25 15:26:22 +01:00
Kevin Wolf	8eeaa706ba	block-backend: Fix race when resuming queued requests When new requests arrive at a BlockBackend that is currently drained, these requests are queued until the drain section ends. There is a race window between blk_root_drained_end() waking up a queued request in an iothread from the main thread and blk_wait_while_drained() actually being woken up in the iothread and calling blk_inc_in_flight(). If the BlockBackend is drained again during this window, drain won't wait for this request and it will sneak in when the BlockBackend is already supposed to be quiesced. This causes assertion failures in bdrv_drain_all_begin() and can have other unintended consequences. Fix this by increasing the in_flight counter immediately when scheduling the request to be resumed so that the next drain will wait for it to complete. Cc: qemu-stable@nongnu.org Reported-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20251119172720.135424-1-kwolf@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Reviewed-by: Fiona Ebner <f.ebner@proxmox.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-25 15:26:22 +01:00
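The accounting change can be illustrated with a single-threaded model of the race window. Everything here is hypothetical (`backend_model`, `drained_end()`, `drain_begin_sees_idle()`); it does not reproduce the BlockBackend code, only the ordering between scheduling a wake-up and counting the request as in flight.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the counters involved (not QEMU's BlockBackend). */
struct backend_model {
    int in_flight;       /* requests a drain has to wait for */
    int quiesce_counter; /* greater than 0 while drained */
    int queued;          /* requests parked while drained */
};

/*
 * End of a drained section: the queued requests are scheduled to resume in
 * another thread. With count_at_schedule set, they are accounted for right
 * away; otherwise the counter is only raised once each request actually runs.
 */
static void drained_end(struct backend_model *b, bool count_at_schedule)
{
    assert(b->quiesce_counter > 0);
    b->quiesce_counter--;

    if (count_at_schedule) {
        b->in_flight += b->queued;
    }
    /* The resumed requests have not run yet at this point. */
}

/* A new drain starting inside the window: does it believe nothing is pending? */
static bool drain_begin_sees_idle(struct backend_model *b)
{
    b->quiesce_counter++;
    return b->in_flight == 0;
}
```

Without the early increment, a drain that starts inside the window sees an idle backend even though a request is about to run. With it, the drain waits for that request.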
Hanna Czenczek	837c04e9fc	win32-aio: Run CB in original context AIO callbacks must be called in the originally calling AioContext, regardless of the BDS’s “main” AioContext. Note: I tried to test this (under wine), but failed. Whenever I tried to use multiqueue or even just an I/O thread for a virtio-blk (or virtio-scsi) device, I/O stalled, both with and without this patch. For what it’s worth, when not using an I/O thread, I/O continued to work with this patch. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-20-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:57 +01:00
Hanna Czenczek	f18782a8f5	null-aio: Run CB in original AioContext AIO callbacks must be called in the originally calling AioContext, regardless of the BDS’s “main” AioContext. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-19-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:57 +01:00
Hanna Czenczek	63d15c7aa5	iscsi: Create AIO BH in original AioContext AIO callbacks must be called in the original request’s AioContext, regardless of the BDS’s “main” AioContext. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-18-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:57 +01:00
Hanna Czenczek	b0cc742f84	blkreplay: Run BH in coroutine’s AioContext While it does not matter in which AioContext we run aio_co_wake() to continue an exactly-once-yielding coroutine, making this commit not strictly necessary, there is also no reason why the BH should run in any context but the request’s AioContext. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-16-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:55 +01:00
Hanna Czenczek	7c3e9b87f5	ssh: Run restart_coroutine in current AioContext restart_coroutine() is attached as an FD handler just to wake the current coroutine after yielding. It makes most sense to attach it to the current (request) AioContext instead of the BDS main context. This way, the coroutine can be entered directly from the BH instead of having yet another indirection through AioContext.co_schedule_bh. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-15-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:55 +01:00
Hanna Czenczek	94ce870f60	qcow2: Schedule cache-clean-timer in realtime There is no reason why the cache cleaning timer should run in virtual time, run it in realtime instead. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-14-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:55 +01:00
Hanna Czenczek	f86dde9a15	qcow2: Fix cache_clean_timer The cache-cleaner runs as a timer CB in the BDS AioContext. With multiqueue, it can run concurrently to I/O requests, and because it does not take any lock, this can break concurrent cache accesses, corrupting the image. While the chances of this happening are low, it can be reproduced e.g. by modifying the code to schedule the timer CB every 5 ms (instead of at most once per second) and modifying the last (inner) while loop of qcow2_cache_clean_unused() like so: while (i < c->size && can_clean_entry(c, i)) { for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) { usleep(100); } c->entries[i].offset = 0; c->entries[i].lru_counter = 0; i++; to_clean++; } i.e. making it wait on purpose for the point in time where the cache is in use by something else. The solution chosen for this in this patch is not the best solution, I hope, but I admittedly can’t come up with anything strictly better. We can protect from concurrent cache accesses either by taking the existing s->lock, or we introduce a new (non-coroutine) mutex specifically for cache accesses. I would prefer to avoid the latter so as not to introduce additional (very slight) overhead. Using s->lock, which is a coroutine mutex, however means that we need to take it in a coroutine, so the timer must run in a coroutine. We can transform it from the current timer CB style into a coroutine that sleeps for the set interval. As a result, however, we can no longer just deschedule the timer to instantly guarantee it won’t run anymore, but have to await the coroutine’s exit. (Note even before this patch there were places that may not have been so guaranteed after all: Anything calling cache_clean_timer_del() from the QEMU main AioContext could have been running concurrently to an existing timer CB invocation.) 
Polling to await the timer to actually settle seems very complicated for something that’s rather a minor problem, but I can’t come up with any better solution that doesn’t again just overlook potential problems. (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and I’m not too fond of this solution.) Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-13-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:55 +01:00
Hanna Czenczek	90db3a1721	qcow2: Re-initialize lock in invalidate_cache After clearing our state (memset()-ing it to 0), we should re-initialize objects that need it. Specifically, that applies to s->lock, which is originally initialized in qcow2_open(). Given qemu_co_mutex_init() is just a memset() to 0, this is functionally a no-op, but still seems like the right thing to do. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-12-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:55 +01:00
Hanna Czenczek	9b9ee60c07	block/io: Take reqs_lock for tracked_requests bdrv_co_get_self_request() does not take a lock around iterating through bs->tracked_requests. With multiqueue, it may thus iterate over a list that is in the process of being modified, producing an assertion failure: ../block/file-posix.c:3702: raw_do_pwrite_zeroes: Assertion `req' failed. [0] abort() at /lib64/libc.so.6 [1] __assert_fail_base.cold() at /lib64/libc.so.6 [2] raw_do_pwrite_zeroes() at ../block/file-posix.c:3702 [3] bdrv_co_do_pwrite_zeroes() at ../block/io.c:1910 [4] bdrv_aligned_pwritev() at ../block/io.c:2109 [5] bdrv_co_do_zero_pwritev() at ../block/io.c:2192 [6] bdrv_co_pwritev_part() at ../block/io.c:2292 [7] bdrv_co_pwritev() at ../block/io.c:2225 [8] handle_alloc_space() at ../block/qcow2.c:2573 [9] qcow2_co_pwritev_task() at ../block/qcow2.c:2625 Fix this by taking reqs_lock. Cc: qemu-stable@nongnu.org Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-11-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:54 +01:00
Hanna Czenczek	ac3520f599	nvme: Note in which AioContext some functions run Sprinkle comments throughout block/nvme.c noting for some functions (where it may not be obvious) that they require a certain AioContext, or in which AioContext they do happen to run (for callbacks, BHs, event notifiers). Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-10-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:53 +01:00
Hanna Czenczek	0f142cbd91	nvme: Fix coroutine waking nvme wakes the request coroutine via qemu_coroutine_enter() from a BH scheduled in the BDS AioContext. This may not be the same context as the one in which the request originally ran, which would be wrong: - It could mean we enter the coroutine before it yields, - We would move the coroutine in to a different context. (Can be reproduced with multiqueue by adding a usleep(100000) before the `while (data.ret == -EINPROGRESS)` loop.) To fix that, use aio_co_wake() to run the coroutine in its home context. Just like in the preceding iscsi and nfs patches, we can drop the trivial nvme_rw_cb_bh() and use aio_co_wake() directly. With this, we can remove NVMeCoData.ctx. Note the check of data->co == NULL to bypass the BH/yield combination in case nvme_rw_cb() is called from nvme_submit_command(): We probably want to keep this fast path for performance reasons, but we have to be quite careful about it: - We cannot overload .ret for this, but have to use a dedicated .skip_yield field. Otherwise, if nvme_rw_cb() runs in a different thread than the coroutine, it may see .ret set and skip the yield, while nvme_rw_cb() will still schedule a BH for waking. Therefore, the signal to skip the yield can only be set in nvme_rw_cb() if waking too is skipped, which is independent from communicating the return value. - We can only skip the yield if nvme_rw_cb() actually runs in the request coroutine. Otherwise (specifically if they run in different AioContexts), the order between this function’s execution and the coroutine yielding (or not yielding) is not reliable. - There is no point to yielding in a loop; there are no spurious wakes, so once we yield, we will only be re-entered once the command is done. Replace `while` by `if`. 
Cc: qemu-stable@nongnu.org Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-9-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:51 +01:00
Hanna Czenczek	7a501bbd51	nvme: Kick and check completions in BDS context nvme_process_completion() must run in the main BDS context, so schedule a BH for requests that aren’t there. The context in which we kick does not matter, but let’s just keep kick and process_completion together for simplicity’s sake. (For what it’s worth, a quick fio bandwidth test indicates that on my test hardware, if anything, this may be a bit better than kicking immediately before scheduling a pure nvme_process_completion() BH. But I wouldn’t take more from those results than that it doesn’t really seem to matter either way.) Cc: qemu-stable@nongnu.org Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-8-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:50 +01:00
Hanna Czenczek	7214ad20da	gluster: Do not move coroutine into BDS context The request coroutine may not run in the BDS AioContext. We should wake it in its own context, not move it. With that, we can remove GlusterAIOCB.aio_context. Also add a comment why aio_co_schedule() is safe to use in this way. Note: Due to a lack of a gluster set-up, I have not tested this commit. It seemed safe enough to send anyway, just maybe not to qemu-stable. To be clear, I don’t know of any user-visible bugs that would arise from the state without this patch; the request coroutine is moved into the main BDS AioContext, so guest device completion code will run in a different context than where the request started, which can’t be good, but I haven’t actually confirmed any bugs (due to not being able to test it). Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-7-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:50 +01:00
Hanna Czenczek	53d5c7ffac	curl: Fix coroutine waking If we wake a coroutine from a different context, we must ensure that it will yield exactly once (now or later), awaiting that wake. curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not yielding if the request finishes before the loop gets run. To fix it, we must drop the loop and yield exactly once, if we need to yield. Finding out that latter part ("if we need to yield") makes it a bit complicated: Requests may be served from a cache internal to the curl block driver, or fail before being submitted. In these cases, we must not yield. However, if we find a matching but still ongoing request in the cache, we will have to await that, i.e. still yield. To address this, move the yield inside of the respective functions: - Inside of curl_find_buf() when awaiting ongoing concurrent requests, - Inside of curl_setup_preadv() when having created a new request. Rename curl_setup_preadv() to curl_do_preadv() to reflect this. (Can be reproduced with multiqueue by adding a usleep(100000) before the `while (acb.ret == -EINPROGRESS)` loop.) Also, add a comment why aio_co_wake() is safe regardless of whether the coroutine and curl_multi_check_completion() run in the same context. Cc: qemu-stable@nongnu.org Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-6-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:50 +01:00
Hanna Czenczek	deb35c129b	nfs: Run co BH CB in the coroutine’s AioContext Like in “rbd: Run co BH CB in the coroutine’s AioContext”, drop the completion flag, yield exactly once, and run the BH in the coroutine’s AioContext. (Can be reproduced with multiqueue by adding a usleep(100000) before the `while (!task.complete)` loops.) Like in “iscsi: Run co BH CB in the coroutine’s AioContext”, this makes nfs_co_generic_bh_cb() trivial, so we can drop it in favor of just calling aio_co_wake() directly. Cc: qemu-stable@nongnu.org Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-5-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:50 +01:00
Hanna Czenczek	a9500527db	iscsi: Run co BH CB in the coroutine’s AioContext For rbd (and others), as described in “rbd: Run co BH CB in the coroutine’s AioContext”, the pattern of setting a completion flag and waking a coroutine that yields while the flag is not set can only work when both run in the same thread. iscsi has the same pattern, but the details are a bit different: iscsi_co_generic_cb() can (as far as I understand) only run through iscsi_service(), not just from a random thread at a random time. iscsi_service() in turn can only be run after iscsi_set_events() set up an FD event handler, which is done in iscsi_co_wait_for_task(). As a result, iscsi_co_wait_for_task() will always yield exactly once, because iscsi_co_generic_cb() can only run after iscsi_set_events(), after the completion flag has already been checked, and the yielding coroutine will then be woken only once the completion flag was set to true. So as far as I can tell, iscsi has no bug and already works fine. Still, we don’t need the completion flag because we know we have to yield exactly once, so we can drop it. This simplifies the code and makes it more obvious that the “rbd bug” isn’t present here. This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a bit boring, so at least the former we can drop and call aio_co_wake() directly from scsi_co_generic_cb() to the same effect. As for the latter, the timer needs a CB, so we can’t drop it (I suppose we could technically use aio_co_wake directly as the CB, but that would be nasty), but we can put it into the coroutine’s AioContext to make its aio_co_wake() a simple wrapper around qemu_coroutine_enter() without a further BH indirection. Finally, remove the iTask->co != NULL checks: This field is set by iscsi_co_init_iscsitask(), which all users of IscsiTask run before even setting up iscsi_co_generic_cb() as the callback, and it is never set or cleared elsewhere, so it is impossible to not be set in iscsi_co_generic_cb(). 
Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-4-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:50 +01:00
Hanna Czenczek	89d22536d1	rbd: Run co BH CB in the coroutine’s AioContext qemu_rbd_completion_cb() schedules the request completion code (qemu_rbd_finish_bh()) to run in the BDS’s AioContext, assuming that this is the same thread in which qemu_rbd_start_co() runs. To explain, this is how both latter functions interact: In qemu_rbd_start_co(): while (!task.complete) qemu_coroutine_yield(); In qemu_rbd_finish_bh(): task->complete = true; aio_co_wake(task->co); // task->co is qemu_rbd_start_co() For this interaction to work reliably, both must run in the same thread so that qemu_rbd_finish_bh() can only run once the coroutine yields. Otherwise, finish_bh() may run before start_co() checks task.complete, which will result in the latter seeing .complete as true immediately and skipping the yield altogether, even though finish_bh() still wakes it. With multiqueue, the BDS’s AioContext is not necessarily the thread start_co() runs in, and so finish_bh() may be scheduled to run in a different thread than start_co(). With the right timing, this will cause the problems described above; waking a non-yielding coroutine is not good, as can be reproduced by putting e.g. a usleep(100000) above the while loop in start_co() (and using multiqueue), giving finish_bh() a much better chance at exiting before start_co() can yield. So instead of scheduling finish_bh() in the BDS’s AioContext, schedule finish_bh() in task->co’s AioContext. In addition, we can get rid of task.complete altogether because we will get woken exactly once, when the task is indeed complete, no need to check. (We could go further and drop the BH, running aio_co_wake() directly in qemu_rbd_completion_cb() because we are allowed to do that even if the coroutine isn’t yet yielding and we’re in a different thread – but the doc comment on qemu_rbd_completion_cb() says to be careful, so I decided not to go so far here.) 
Buglink: https://issues.redhat.com/browse/RHEL-67115 Reported-by: Junyao Zhao <junzhao@redhat.com> Cc: qemu-stable@nongnu.org Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251110154854.151484-3-hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-18 18:01:50 +01:00
Eric Blake	2e909d7ca9	qcow2, vmdk: Restrict creation with secondary file using protocol Ever since CVE-2024-4467 (see commit `7ead9469` in qemu v9.1.0), we have intentionally treated the opening of secondary files whose name is specified in the contents of the primary file, such as a qcow2 data_file, as something that must be a local file and not a protocol prefix (it is still possible to open a qcow2 file that wraps an NBD data image by using QMP commands, but that is from the explicit action of the QMP overriding any string encoded in the qcow2 file). At the time, we did not prevent the use of protocol prefixes on the secondary image while creating a qcow2 file, but it results in a qcow2 file that records an empty string for the data_file, rather than the protocol passed in during creation: $ qemu-img create -f raw datastore.raw 2G $ qemu-nbd -e 0 -t -f raw datastore.raw & $ qemu-img create -f qcow2 -o data_file=nbd://localhost:10809/ \ datastore_nbd.qcow2 2G Formatting 'datastore_nbd.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 data_file=nbd://localhost:10809/ lazy_refcounts=off refcount_bits=16 $ qemu-img info datastore_nbd.qcow2 | grep data $ qemu-img info datastore_nbd.qcow2 | grep data image: datastore_nbd.qcow2 data file: data file raw: false filename: datastore_nbd.qcow2 And since an empty string was recorded in the file, attempting to open the image without using QMP to supply the NBD data store fails, with a somewhat confusing error message: $ qemu-io -f qcow2 datastore_nbd.qcow2 qemu-io: can't open device datastore_nbd.qcow2: The 'file' block driver requires a file name Although the ability to create an image with a convenience reference to a protocol data file is not a security hole (unlike the case with open, the image is not untrusted if we are the ones creating it), the above demo shows that it is still inconsistent. 
Thus, it makes more sense if we also insist that image creation rejects a protocol prefix when using the same syntax. Now, the above attempt produces: $ qemu-img create -f qcow2 -o data_file=nbd://localhost:10809/ \ datastore_nbd.qcow2 2G Formatting 'datastore_nbd.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 data_file=nbd://localhost:10809/ lazy_refcounts=off refcount_bits=16 qemu-img: datastore_nbd.qcow2: Could not create 'nbd://localhost:10809/': No such file or directory with datastore_nbd.qcow2 no longer created. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250915213919.3121401-6-eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Eric Blake	1bd7bfbc2b	block: Allow drivers to control protocol prefix at creation This patch is pure refactoring: instead of hard-coding permission to use a protocol prefix when creating an image, the drivers can now pass in a parameter, comparable to what they could already do for opening a pre-existing image. This patch is purely mechanical (all drivers pass in true for now), but it will enable the next patch to cater to drivers that want to differ in behavior for the primary image vs. any secondary images that are opened at the same time as creating the primary image. Signed-off-by: Eric Blake <eblake@redhat.com> Message-ID: <20250915213919.3121401-5-eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Jean-Louis Dupond	524d5ba8c0	qcow2: put discards in discard queue when discard-no-unref is enabled When discard-no-unref is enabled, discards are not queued as they should be. This has been broken since discard-no-unref was added. Add a helper function qcow2_discard_cluster which handles some common checks and calls the queue_discards function if needed to add the discard request to the queue. Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be> Message-ID: <20250513132628.1055549-3-jean-louis@dupond.be> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Jean-Louis Dupond	31242df6ca	qcow2: rename update_refcount_discard to queue_discard The function just queues discards, and doesn't do any refcount change. So let's change the function name to align with its function. Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be> Message-ID: <20250513132628.1055549-2-jean-louis@dupond.be> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Yeqi Fu	9730b9974d	block: replace TABs with space Bring the block files in line with the QEMU coding style, with spaces for indentation. This patch partially resolves the issue 371. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371 Signed-off-by: Yeqi Fu <fufuyqqqqqq@gmail.com> Message-ID: <20230325085224.23842-1-fufuyqqqqqq@gmail.com> [thuth: Rebased the patch to the current master branch] Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20251007163511.334178-1-thuth@redhat.com> [kwolf: Fixed up vertical alignment] Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	684363fa3b	block/io_uring: use non-vectored read/write when possible The io_uring_prep_readv2/writev2() man pages recommend using the non-vectored read/write operations when possible for performance reasons. I didn't measure a significant difference but it doesn't hurt to have this optimization in place. Suggested-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20251104022933.618123-16-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Stefan Hajnoczi	047dabef97	block/io_uring: use aio_add_sqe() AioContext has its own io_uring instance for file descriptor monitoring. The disk I/O io_uring code was developed separately. Originally I thought the characteristics of file descriptor monitoring and disk I/O were too different, requiring separate io_uring instances. Now it has become clear to me that it's feasible to share a single io_uring instance for file descriptor monitoring and disk I/O. We're not using io_uring's IOPOLL feature or anything else that would require a separate instance. Unify block/io_uring.c and util/fdmon-io_uring.c using the new aio_add_sqe() API that allows user-defined io_uring sqe submission. Now block/io_uring.c just needs to submit readv/writev/fsync and most of the io_uring-specific logic is handled by fdmon-io_uring.c. There are two immediate advantages: 1. Fewer system calls. There is no need to monitor the disk I/O io_uring ring fd from the file descriptor monitoring io_uring instance. Disk I/O completions are now picked up directly. Also, sqes are accumulated in the sq ring until the end of the event loop iteration and there are fewer io_uring_enter(2) syscalls. 2. Less code duplication. Note that error_setg() messages are not supposed to end with punctuation, so I removed a '.' for the non-io_uring build error message. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-ID: <20251104022933.618123-15-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-11-11 22:06:09 +01:00
Kevin Wolf	5b4b3bfdfc	qemu-img info: Optionally show block limits Add a new --limits option to 'qemu-img info' that displays the block limits for the image and all of its children, making the information more accessible for human users than in QMP. This option is not enabled by default because it can be a lot of output that isn't usually relevant if you're not specifically trying to diagnose some I/O problem. This makes the same information automatically also available in HMP 'info block -v'. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251024123041.51254-4-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:10 +01:00
Kevin Wolf	d2634e1828	block: Expose block limits for images in QMP This information can be useful both for debugging and for management tools trying to configure guest devices with the optimal limits (possibly across multiple hosts). There is no reason not to make it available, so just add it to BlockNodeInfo. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20251024123041.51254-3-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:10 +01:00
Fiona Ebner	08736e7584	block: make bdrv_co_parent_cb_resize() a proper IO API function In preparation for calling it via the bdrv_child_cb_resize() callback that will be added by the next commit. Rename it to include the "_co_" part while at it. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20250917115509.401015-3-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Chandan Somani	9f0c763e16	block: enable stats-intervals for storage devices This patch allows stats-intervals to be used for storage devices with the -device option. It accepts a list of interval lengths in JSON format. It configures and collects the stats in the BlockBackend layer through the storage device that consumes the BlockBackend. Signed-off-by: Chandan Somani <csomani@redhat.com> Message-ID: <20251003220039.1336663-1-csomani@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Richard W.M. Jones	ad97769e9d	block/curl.c: Fix CURLOPT_VERBOSE parameter type In commit `ed26056d90` ("block/curl.c: Use explicit long constants in curl_easy_setopt calls") we missed a further call that takes a long parameter. Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Message-ID: <20251013124127.604401-1-rjones@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Bin Guo	dec83ac02b	block/monitor: Use hmp_handle_error to report error According to writing-monitor-commands.rst, best practice is to use the 'hmp_handle_error' function, which ensures that the message gets an 'Error: ' prefix. Signed-off-by: Bin Guo <guobin@linux.alibaba.com> Message-ID: <20250916054850.40963-1-guobin@linux.alibaba.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> [kwolf: Fixed up iotests reference output] Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Daniel P. Berrangé	c86488abaf	block: fix luks 'amend' when run in coroutine Launch QEMU with $ qemu-img create \ --object secret,id=sec0,data=123456 \ -f luks -o key-secret=sec0 demo.luks 1g $ qemu-system-x86_64 \ --object secret,id=sec0,data=123456 \ -blockdev driver=luks,key-secret=sec0,file.filename=demo.luks,file.driver=file,node-name=luks Then in QMP shell attempt x-blockdev-amend job-id=fish node-name=luks options={'state':'active','new-secret':'sec0','driver':'luks'} It will result in an assertion #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 #1 0x00007fad18b73f63 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89 #2 0x00007fad18b19f3e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #3 0x00007fad18b016d0 in __GI_abort () at abort.c:77 #4 0x00007fad18b01639 in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:118 #5 0x00007fad18b120af in __assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:127 #6 0x000055ff74fdbd46 in bdrv_graph_rdlock_main_loop () at ../block/graph-lock.c:260 #7 0x000055ff7548521b in graph_lockable_auto_lock_mainloop (x=<optimized out>) at /usr/src/debug/qemu-9.2.4-1.fc42.x86_64/include/block/graph-lock.h:266 #8 block_crypto_read_func (block=<optimized out>, offset=4096, buf=0x55ffb6d66ef0 "", buflen=256000, opaque=0x55ffb5edcc30, errp=0x55ffb6f00700) at ../block/crypto.c:71 #9 0x000055ff75439f8b in qcrypto_block_luks_load_key (block=block@entry=0x55ffb5edbe90, slot_idx=slot_idx@entry=0, password=password@entry=0x55ffb67dc260 "123456", masterkey=masterkey@entry=0x55ffb5fb0c40 "", readfunc=readfunc@entry=0x55ff754851e0 <block_crypto_read_func>, opaque=opaque@entry=0x55ffb5edcc30, errp=0x55ffb6f00700) at ../crypto/block-luks.c:927 #10 
0x000055ff7543b90f in qcrypto_block_luks_find_key (block=<optimized out>, password=<optimized out>, masterkey=<optimized out>, readfunc=<optimized out>, opaque=<optimized out>, errp=<optimized out>) at ../crypto/block-luks.c:1045 #11 qcrypto_block_luks_amend_add_keyslot (block=0x55ffb5edbe90, readfunc=0x55ff754851e0 <block_crypto_read_func>, writefunc=0x55ff75485100 <block_crypto_write_func>, opaque=0x55ffb5edcc3, opts_luks=0x7fad1715aef8, force=<optimized out>, errp=0x55ffb6f00700) at ../crypto/block-luks.c:1673 #12 qcrypto_block_luks_amend_options (block=0x55ffb5edbe90, readfunc=0x55ff754851e0 <block_crypto_read_func>, writefunc=0x55ff75485100 <block_crypto_write_func>, opaque=0x55ffb5edcc30, options=0x7fad1715aef0, force=<optimized out>, errp=0x55ffb6f00700) at ../crypto/block-luks.c:1865 #13 0x000055ff75485b95 in block_crypto_amend_options_generic_luks (bs=<optimized out>, amend_options=<optimized out>, force=<optimized out>, errp=<optimized out>) at ../block/crypto.c:949 #14 0x000055ff75485c28 in block_crypto_co_amend_luks (bs=<optimized out>, opts=<optimized out>, force=<optimized out>, errp=<optimized out>) at ../block/crypto.c:1008 #15 0x000055ff754778e5 in blockdev_amend_run (job=0x55ffb6f00640, errp=0x55ffb6f00700) at ../block/amend.c:52 #16 0x000055ff75468b90 in job_co_entry (opaque=0x55ffb6f00640) at ../job.c:1106 #17 0x000055ff755a0fc2 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:175 This changes the read/write callbacks to not assert that they are run in mainloop context if already in a coroutine. This is also reproduced by qemu-iotests cases 295 and 296. Fixes: `1f051dcbdf` Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Message-ID: <20250919112213.1530079-1-berrange@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Daniel P. Berrangé	6eda39a87f	block: remove 'detached-header' option from opts after use The code for creating LUKS devices references a 'detached-header' option in the QemuOpts data, but does not consume (remove) the option. Thus when the code later tries to convert the remaining unused QemuOpts into a QCryptoBlockCreateOptions struct, an error is reported by the QAPI code that 'detached-header' is not a valid field. This fixes a regression caused by commit `e818c01ae6` Author: Daniel P. Berrangé <berrange@redhat.com> Date: Mon Feb 19 15:12:59 2024 +0000 qapi: drop unused QCryptoBlockCreateOptionsLUKS.detached-header which identified that the QAPI field was unused, but failed to realize the QemuOpts -> QCryptoBlockCreateOptions conversion was seeing the left-over 'detached-header' option which had not been removed from QemuOpts. This problem was identified by the 'luks-detached-header' I/O test, but unfortunately I/O tests are not run regularly for the LUKS format. Fixes: `e818c01ae6` Reported-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Message-ID: <20250919103810.1513109-1-berrange@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-10-29 12:10:09 +01:00
Richard W.M. Jones	ed26056d90	block/curl.c: Use explicit long constants in curl_easy_setopt calls curl_easy_setopt takes a variable argument that depends on what CURLOPT you are setting. Some require a long constant. Passing a plain int constant is potentially wrong on some platforms. With warnings enabled, multiple warnings like this were printed: ../block/curl.c: In function ‘curl_init_state’: ../block/curl.c:474:13: warning: call to ‘_curl_easy_setopt_err_long’ declared with attribute warning: curl_easy_setopt expects a long argument [-Wattribute-warning] 474 \| curl_easy_setopt(state->curl, CURLOPT_AUTOREFERER, 1) \|\| \| ^ Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Chenxi Mao <maochenxi@bosc.ac.cn> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org> Message-ID: <20251009141026.4042021-2-rjones@redhat.com>	2025-10-10 08:24:14 -07:00
Vladimir Sementsov-Ogievskiy	1ed8903916	treewide: handle result of qio_channel_set_blocking() Currently, we just always pass NULL as errp argument. That doesn't look good. Some realizations of interface may actually report errors. Channel-socket realization actually either ignores or crashes on errors, but we are going to straighten it out to always reporting an errp in further commits. So, convert all callers to either handle the error (where environment allows) or explicitly use &error_abort. Also take the chance to change the return value to a more convenient bool (keeping in mind that underlying realizations may return -1 on failure, not -errno). Suggested-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> [DB: fix return type mismatch in TLS/websocket channel impls for qio_channel_set_blocking] Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>	2025-09-19 12:46:07 +01:00
Michael Tokarev	29e68f41c0	block/curl: drop old/unsupported curl version checks We currently require libcurl >=7.29.0 (since `f9cd86fe72`). Drop older LIBCURL_VERSION_NUM checks from the driver. Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>	2025-09-03 10:57:50 +03:00
Michael Tokarev	606978500c	block/curl: fix curl internal handles handling block/curl.c uses CURLMOPT_SOCKETFUNCTION to register a socket callback. According to the documentation, this callback is called not just with application-created sockets but also with internal curl sockets, and for such sockets, user data pointer is not set by the application, so the result is qemu crashing. Pass BDRVCURLState directly to the callback function as user pointer, instead of relying on CURLINFO_PRIVATE. This problem started happening with update of libcurl from 8.9 to 8.10 -- apparently with this change curl started using private handles more. (CURLINFO_PRIVATE is used in one more place, in curl_multi_check_completion() - it might need a similar fix too) Resolves: https://gitlab.com/qemu-project/qemu/-/issues/3081 Cc: qemu-stable@qemu.org Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>	2025-09-03 10:57:50 +03:00
Kevin Wolf	4af976ef39	rbd: Fix .bdrv_get_specific_info implementation qemu_rbd_get_specific_info() has at least two problems: The first is that it issues a blocking rbd_read() call in order to probe the encryption format for the image while querying the node. This means that if the connection to the server goes down, not only I/O is stuck (which is unavoidable), but query-names-block-nodes will actually make the whole QEMU instance unresponsive. .bdrv_get_specific_info implementations shouldn't perform blocking operations, but only return what is already known. The second is that the information returned isn't even correct. If the image is already opened with encryption enabled at the RBD level, we'll probe for "double encryption", i.e. if the encrypted data contains another encryption header. If it doesn't (which is the normal case), we won't return the encryption format. If it does, we return misleading information because it looks like we're talking about the outer level (the encryption format of the image itself) while the information is about an encryption header in the guest data. Fix this by storing the encryption format in BDRVRBDState when the image is opened (and we do blocking operations anyway) and returning only the stored information in qemu_rbd_get_specific_info(). The information we'll store is either the actual encryption format that we enabled on the RBD level, or if the image is unencrypted, the result of the same probing as we previously did when querying the node. Probing image formats based on content that can be modified by the guest has long been known as problematic, but as long as we only output it to the user instead of making decisions based on it, it should be okay. It is undoubtedly useful in the context of 'qemu-img info' when you're trying to figure out which encryption options you have to use to open the image successfully. 
Fixes: `42e4ac9ef5` ("block/rbd: Add support for rbd image encryption") Buglink: https://issues.redhat.com/browse/RHEL-105440 Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250811134010.81787-1-kwolf@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-08-12 14:59:39 +02:00
Kevin Wolf	d402da1360	file-posix: Fix aio=threads performance regression after enabling FUA For aio=threads, we're currently not implementing REQ_FUA in any useful way, but just do a separate raw_co_flush_to_disk() call. This changes behaviour compared to the old state, which used bdrv_co_flush() with its optimisations. As a quick fix, call bdrv_co_flush() again like before. Eventually, we can use pwritev2() to make use of RWF_DSYNC if available, but we'll still have to keep this code path as a fallback, so this fix is required either way. While the fix itself is a one-liner, some new graph locking annotations are needed to convince TSA that the locking is correct. Cc: qemu-stable@nongnu.org Fixes: `984a32f17e` ("file-posix: Support FUA writes") Buglink: https://issues.redhat.com/browse/RHEL-96854 Reported-by: Tingting Mao <timao@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20250625085019.27735-1-kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 17:12:35 +02:00
Fiona Ebner	430e2be81e	block/qapi: make @node-name in @BlockDeviceInfo non-optional Since commit `15489c769b` ("block: auto-generated node-names"), if the node name of a block driver state is not explicitly specified, it will be auto-generated. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250702123204.325470-3-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 17:11:01 +02:00
Fiona Ebner	cfac5a963e	block/qapi: include child references in block device info In combination with using a throttle filter to enforce IO limits for a guest device, knowing the 'file' child of a block device can be useful. If the throttle filter is only intended for guest IO, block jobs should not also be limited by the throttle filter, so the block operations need to be done with the 'file' child of the top throttle node as the target. In combination with mirroring, the name of that child is not fixed. Another scenario is when unplugging a guest device after mirroring below a top throttle node, where the mirror target is added explicitly via blockdev-add. After mirroring, the target becomes the new 'file' child of the throttle node. For unplugging, both the top throttle node and the mirror target need to be deleted, because only implicitly added child nodes are deleted automatically, and the current 'file' child of the throttle node was explicitly added (as the mirror target). In other scenarios, it could be useful to follow the backing chain. Note that iotests 191 and 273 use _filter_img_info, so the 'children' information is filtered out there. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250702123204.325470-2-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 17:10:57 +02:00
Fiona Ebner	2cf92b15cd	block: mark bdrv_open_child_common() and its callers GRAPH_UNLOCKED The function bdrv_open_child_common() calls bdrv_graph_wrlock_drained(), which must be called with the graph unlocked. Mark it and its two callers bdrv_open_file_child() and bdrv_open_child() as GRAPH_UNLOCKED. This requires temporarily unlocking in vmdk_parse_extents() and making the locked section shorter in vmdk_open(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-48-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:27 +02:00
Fiona Ebner	60f609c152	block/commit: mark commit_abort() as GRAPH_UNLOCKED The function commit_abort() calls bdrv_drained_begin(), which must be called with the graph unlocked. Also mark the JobDriver's abort() callback as GRAPH_UNLOCKED_PTR, because that is the callback via which commit_abort() is reached. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-41-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:13 +02:00
Fiona Ebner	b326b127df	block: mark blk_remove_bs() as GRAPH_UNLOCKED The function blk_remove_bs() calls bdrv_graph_wrlock_drained() and can also call bdrv_drained_begin(), both of which must be called with the graph unlocked. Marking blk_remove_bs() as GRAPH_UNLOCKED requires temporarily unlocking in hmp_drive_del(). Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-38-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:10 +02:00
Fiona Ebner	6717dc3075	block: mark bdrv_reopen_queue() and bdrv_reopen_multiple() as GRAPH_UNLOCKED The function bdrv_reopen_queue() can call bdrv_drain_all_begin(), which must be called with the graph unlocked. The function bdrv_reopen_multiple() calls bdrv_reopen_prepare() which must be called with the graph unlocked. To mark bdrv_reopen_queue() as GRAPH_UNLOCKED, it is necessary to make the locked section in reopen_backing_file() shorter. Signed-off-by: Fiona Ebner <f.ebner@proxmox.com> Message-ID: <20250530151125.955508-35-f.ebner@proxmox.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>	2025-07-14 15:42:05 +02:00


6322 commits