Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250605100938.43133-1-f.ebner@proxmox.com>
[kwolf: Fixed up pylint warnings flagged by 297]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For aio=threads, we're currently not implementing REQ_FUA in any useful
way, but just do a separate raw_co_flush_to_disk() call. This changes
behaviour compared to the old state, which used bdrv_co_flush() with its
optimisations. As a quick fix, call bdrv_co_flush() again like before.
Eventually, we can use pwritev2() to make use of RWF_DSYNC if available,
but we'll still have to keep this code path as a fallback, so this fix
is required either way.
While the fix itself is a one-liner, some new graph locking annotations
are needed to convince TSA that the locking is correct.
Cc: qemu-stable@nongnu.org
Fixes: 984a32f17e ("file-posix: Support FUA writes")
Buglink: https://issues.redhat.com/browse/RHEL-96854
Reported-by: Tingting Mao <timao@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20250625085019.27735-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since commit 15489c769b ("block: auto-generated node-names"), if the
node name of a block driver state is not explicitly specified, it
will be auto-generated.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250702123204.325470-3-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In combination with using a throttle filter to enforce IO limits for
a guest device, knowing the 'file' child of a block device can be
useful. If the throttle filter is only intended for guest IO, block
jobs should not also be limited by the throttle filter, so the
block operations need to be done with the 'file' child of the top
throttle node as the target. In combination with mirroring, the name
of that child is not fixed.
Another scenario is when unplugging a guest device after mirroring
below a top throttle node, where the mirror target is added explicitly
via blockdev-add. After mirroring, the target becomes the new 'file'
child of the throttle node. For unplugging, both the top throttle node
and the mirror target need to be deleted, because only implicitly
added child nodes are deleted automatically, and the current 'file'
child of the throttle node was explicitly added (as the mirror
target).
In other scenarios, it could be useful to follow the backing chain.
Note that iotests 191 and 273 use _filter_img_info, so the 'children'
information is filtered out there.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250702123204.325470-2-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function block_job_remove_all_bdrv() calls
bdrv_graph_wrlock_drained(), which must be called with the graph
unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-49-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_open_child_common() calls
bdrv_graph_wrlock_drained(), which must be called with the graph
unlocked. Mark it and its two callers bdrv_open_file_child() and
bdrv_open_child() as GRAPH_UNLOCKED. This requires temporarily
unlocking in vmdk_parse_extents() and making the locked section
shorter in vmdk_open().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-48-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The functions blk_log_writes_close(), blkverify_close(),
quorum_close(), vmdk_close() via vmdk_free_extents(), and other
bdrv_close() implementations call bdrv_graph_wrlock_drained(), which
must be called with the graph unlocked. They are reached via the
BlockDriver's bdrv_close() callback and the bdrv_close() wrapper,
which are also marked as GRAPH_UNLOCKED_PTR and GRAPH_UNLOCKED.
Furthermore, the function bdrv_close() also calls bdrv_drained_begin()
and bdrv_graph_wrlock_drained(), so there are additional reasons for
marking it GRAPH_UNLOCKED.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-47-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_close_all() calls bdrv_drain_all(), which must be
called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-46-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_drop_intermediate() calls bdrv_drained_begin(),
which must be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-45-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_insert_node() calls bdrv_drained_begin() which must
be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-44-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_replace_child_bs() calls bdrv_drained_begin() which
must be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-43-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_new() calls bdrv_drained_begin(), which can poll and therefore
can't be called while holding the graph lock. One option to make sure
that this call is allowed would be marking bdrv_new() GRAPH_UNLOCKED.
However, this is actually an unnecessary restriction because we know
that we only just created the BlockDriverState and it isn't even part of
the graph yet. We can use bdrv_do_drained_begin_quiesce() instead to
avoid the polling, which means that bdrv_new() can now safely be called
from callers that hold the graph lock as well as from callers that
don't.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function commit_abort() calls bdrv_drained_begin(), which must be
called with the graph unlocked.
Also mark the JobDriver's abort() callback as GRAPH_UNLOCKED_PTR,
because that is the callback via which commit_abort() is reached.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-41-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function blk_io_limits_disable() calls bdrv_drained_begin(), which
must be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-40-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function blk_drain() calls bdrv_drained_begin(), which must be
called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-39-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function blk_remove_bs() calls bdrv_graph_wrlock_drained() and can
also call bdrv_drained_begin(), both of which which must be called with
the graph unlocked.
Marking blk_remove_bs() as GRAPH_UNLOCKED requires temporarily
unlocking in hmp_drive_del().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-38-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_inactivate_all() calls bdrv_drain_all_begin(), which
must be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-37-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_inactivate() calls bdrv_drain_all_begin(), which
needs to be called with the graph unlocked, so either
bdrv_inactivate() should be marked as GRAPH_UNLOCKED or the drain
needs to be moved to the callers. The caller in
qmp_blockdev_set_active() requires that the locked section covers
bdrv_find_node() too, so the latter alternative is chosen.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-36-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_reopen_queue() can call bdrv_drain_all_begin(),
which must be called with the graph unlocked.
The function bdrv_reopen_multiple() calls bdrv_reopen_prepare() which
must be called with the graph unlocked.
To mark bdrv_reopen_queue() as GRAPH_UNLOCKED, it is necessary to make
the locked section in reopen_backing_file() shorter.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-35-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function stream_prepare() calls bdrv_drain_all_begin(), which
must be called with the graph unlocked.
Also mark the JobDriver's prepare() callback as GRAPH_UNLOCKED_PTR,
because that is the callback via which stream_prepare() is reached.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-34-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_all_delete_snapshot() calls bdrv_drain_all_begin(),
which must be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-33-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function blk_drain_all() calls bdrv_drain_all_begin(), which must
be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-32-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Nearly all callers (outside of the tests) are already using the
_drained() variant of the function. It doesn't seem worth keeping.
Simply adapt the remaining callers of bdrv_set_backing_hd() and rename
bdrv_set_backing_hd_drained() to bdrv_set_backing_hd().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-31-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
By using the appropriate variants bdrv_set_backing_hd_drained() and
bdrv_try_change_aio_context_locked(), there only needs to be a single
drained and write-locked section in external_snapshot_abort().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-30-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The function bdrv_set_backing_hd() calls bdrv_drain_all_begin(), which
must be called with the graph unlocked.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-29-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is in preparation to mark bdrv_set_backing_hd() as
GRAPH_UNLOCKED.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-28-f.ebner@proxmox.com>
[kwolf: Removed an extra blank line]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is in preparation to mark bdrv_set_backing_hd() as
GRAPH_UNLOCKED.
Switch to using the bdrv_set_backing_hd_drained() variant. For the
first pair of calls to avoid draining and locking twice in a row
within the individual calls. For the third call, so that the drained
and locked section can also cover bdrv_cow_bs().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-27-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is in preparation to mark bdrv_set_backing_hd() as
GRAPH_UNLOCKED.
Switch to using the bdrv_set_backing_hd_drained() variant, so that the
drained and locked section can also cover the calls to
bdrv_skip_filters() and bdrv_cow_bs().
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-26-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Many write-locked sections are also drained sections. A new
bdrv_graph_wrunlock_drained() wrapper around bdrv_graph_wrunlock() is
introduced, which will begin a drained section first. A global
variable is used so bdrv_graph_wrunlock() knows if it also needs
to end such a drained section. Both the aio_poll call in
bdrv_graph_wrlock() and the aio_bh_poll() in bdrv_graph_wrunlock()
can re-enter a write-locked section. While for the latter, ending the
drain could be moved to before the call, the former requires that the
variable is a counter and not just a boolean.
Since the wrapper calls bdrv_drain_all_begin(), which must be called
with the graph unlocked, mark the wrapper as GRAPH_UNLOCKED too.
The switch to the new helpers was generated with the following
commands and then manually checked:
find . -name '*.c' -exec sed -i -z 's/bdrv_drain_all_begin();\n\s*bdrv_graph_wrlock();/bdrv_graph_wrlock_drained();/g' {} ';'
find . -name '*.c' -exec sed -i -z 's/bdrv_graph_wrunlock();\n\s*bdrv_drain_all_end();/bdrv_graph_wrunlock();/g' {} ';'
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-25-f.ebner@proxmox.com>
[kwolf: Removed redundant GRAPH_UNLOCKED]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
All accesses of bs->quiesce_counter are in the main thread, either
after a GLOBAL_STATE_CODE() macro or in a function with GRAPH_WRLOCK
annotation.
This is essentially a revert of 414c2ec358 ("block: access
quiesce_counter with atomic ops"). At that time, neither the
GLOBAL_STATE_CODE() macro nor the GRAPH_WRLOCK annotation existed.
Even if the field was only accessed in the main thread back then (did
not check if that is actually the case), it wouldn't have been easy to
verify.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20250530151125.955508-24-f.ebner@proxmox.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user: Implement fchmodat2 syscall
linux-user: Check for EFAULT failure in nanosleep
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
linux-user/gen-vdso: Handle fseek() failure
linux-user/gen-vdso: Don't read off the end of buf[]
-----BEGIN PGP SIGNATURE-----
iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmhxSAkdHHJpY2hhcmQu
aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV9wiQf+PrXwKj+FusE0YU1y
Lnx6+S0M/lDRCNhbgBrw7JK5WUwIfnZQuepf0vjuhoHH1rUdT1EUYdJ7Quwj9fgG
0YcKRD8OAVKNU8I3ydtzSaJ3TZ02nbbDbwGMoD/eNXGKx0Gt5907vD4PrjT+mByG
6QTLwuql3ahkl/Tiskk2LwbmHRe0CXiezVuzgprbNiyxrgDT8ArqCq+VJzv/wb2O
4t6BqRDvBzRe7MUUs2B2W+hs0HW4Rfqcye/3rRnYe7HA4CTiVNqY9rwgrQqGEO0P
3Cf+VaF6CaLz+HuHfM8rz+xBhfo+UpZYOVMXk/7VEAG6geMKTcQG1tCJYhL+xklJ
9r4ABw==
=rD+6
-----END PGP SIGNATURE-----
Merge tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu into staging
fpu: Process float_muladd_negate_result after rounding
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user: Implement fchmodat2 syscall
linux-user: Check for EFAULT failure in nanosleep
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
linux-user/gen-vdso: Handle fseek() failure
linux-user/gen-vdso: Don't read off the end of buf[]
# -----BEGIN PGP SIGNATURE-----
#
# iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmhxSAkdHHJpY2hhcmQu
# aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV9wiQf+PrXwKj+FusE0YU1y
# Lnx6+S0M/lDRCNhbgBrw7JK5WUwIfnZQuepf0vjuhoHH1rUdT1EUYdJ7Quwj9fgG
# 0YcKRD8OAVKNU8I3ydtzSaJ3TZ02nbbDbwGMoD/eNXGKx0Gt5907vD4PrjT+mByG
# 6QTLwuql3ahkl/Tiskk2LwbmHRe0CXiezVuzgprbNiyxrgDT8ArqCq+VJzv/wb2O
# 4t6BqRDvBzRe7MUUs2B2W+hs0HW4Rfqcye/3rRnYe7HA4CTiVNqY9rwgrQqGEO0P
# 3Cf+VaF6CaLz+HuHfM8rz+xBhfo+UpZYOVMXk/7VEAG6geMKTcQG1tCJYhL+xklJ
# 9r4ABw==
# =rD+6
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 13:21:13 EDT
# gpg: using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F
# gpg: issuer "richard.henderson@linaro.org"
# gpg: Good signature from "Richard Henderson <richard.henderson@linaro.org>" [full]
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A 05C0 64DF 38E8 AF7E 215F
* tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu:
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user/mips/o32: Drop sa_restorer functionality
linux-user/gen-vdso: Don't read off the end of buf[]
linux-user/gen-vdso: Handle fseek() failure
linux-user: Check for EFAULT failure in nanosleep
linux-user: Implement fchmodat2 syscall
fpu: Process float_muladd_negate_result after rounding
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* New board type max78000fthr
* Enable use of CXL on Arm 'virt' board
* Some more tidyup of ID register handling
* Refactor AT insns and PMU regs into separate source files
* Don't enforce NSE,NS check for EL3->EL3 returns
* hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
* Allow nested-virtualization with KVM on the 'virt' board
* system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
* hw/arm/virt-acpi-build: Don't create ITS id mappings by default
* target/arm: Remove unused helper_sme2_luti4_4b
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmhxEcoZHHBldGVyLm1h
eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3j5yEACWYnNeqo8Yph6/EJExE6eV
r0tC6FBb5ShPgA6kDxhpOc1lI6uXGh8+D7bL9BePEdz/brCf1QDfs2Z4q/hb5ysX
D0H6VI5Gr1j6MjkFRBo3+vvYz4Yh++XLn5Q9lZv8zaSEdraq/ay2kxnuhRCK+4Ar
+QoGtKrGMJ7UCpfiRlvNnd1UjgORZf10EE/bRImX13sxeDomP3CZhFzAyJyShOP9
JA7bAd4rYJ4oj8R33y8Yaxjwm4FOndj740B0zwpO8mpjzFiE5zbqsaO+mEgYSflc
OQisCu/KRFpyIR+UqP+4gNaJLfKQW5Y4r61zEaiJWV/c4RdKNnbK1f7MX11fNhOk
k1paF3GIXp6f794Hb14vtsYnKHF2eeNSmRkAomXxLgUSYzLezL+yj7cdYmRJhgYU
thc1PSiEmHYhjRmOaMC9+dkMtvIexWyDNYNFTygoOE5/kTMSazeTFQpFmw+ZuTee
9pjKsYRZJgTa64IkJy1L34jc2gds48Q20KpQsqZ22KQcjwt4PW4eQXkvMylawSut
mArHVH6AAxIK+defeEmnQCJ0OccyGCENjRDuWyWMMGoP/ggZpO47rGWmCUOK8xz8
IfGdPeF/9xsKSKWvjpiHyyKa48wuO2bVC+5bISS6IPA2uGneS2DpmjkHU+gHBqpk
GNlvEnXZfavZOHejE7/L/Q==
=hJ4/
-----END PGP SIGNATURE-----
Merge tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu into staging
target-arm queue:
* New board type max78000fthr
* Enable use of CXL on Arm 'virt' board
* Some more tidyup of ID register handling
* Refactor AT insns and PMU regs into separate source files
* Don't enforce NSE,NS check for EL3->EL3 returns
* hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
* Allow nested-virtualization with KVM on the 'virt' board
* system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
* hw/arm/virt-acpi-build: Don't create ITS id mappings by default
* target/arm: Remove unused helper_sme2_luti4_4b
# -----BEGIN PGP SIGNATURE-----
#
# iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmhxEcoZHHBldGVyLm1h
# eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3j5yEACWYnNeqo8Yph6/EJExE6eV
# r0tC6FBb5ShPgA6kDxhpOc1lI6uXGh8+D7bL9BePEdz/brCf1QDfs2Z4q/hb5ysX
# D0H6VI5Gr1j6MjkFRBo3+vvYz4Yh++XLn5Q9lZv8zaSEdraq/ay2kxnuhRCK+4Ar
# +QoGtKrGMJ7UCpfiRlvNnd1UjgORZf10EE/bRImX13sxeDomP3CZhFzAyJyShOP9
# JA7bAd4rYJ4oj8R33y8Yaxjwm4FOndj740B0zwpO8mpjzFiE5zbqsaO+mEgYSflc
# OQisCu/KRFpyIR+UqP+4gNaJLfKQW5Y4r61zEaiJWV/c4RdKNnbK1f7MX11fNhOk
# k1paF3GIXp6f794Hb14vtsYnKHF2eeNSmRkAomXxLgUSYzLezL+yj7cdYmRJhgYU
# thc1PSiEmHYhjRmOaMC9+dkMtvIexWyDNYNFTygoOE5/kTMSazeTFQpFmw+ZuTee
# 9pjKsYRZJgTa64IkJy1L34jc2gds48Q20KpQsqZ22KQcjwt4PW4eQXkvMylawSut
# mArHVH6AAxIK+defeEmnQCJ0OccyGCENjRDuWyWMMGoP/ggZpO47rGWmCUOK8xz8
# IfGdPeF/9xsKSKWvjpiHyyKa48wuO2bVC+5bISS6IPA2uGneS2DpmjkHU+gHBqpk
# GNlvEnXZfavZOHejE7/L/Q==
# =hJ4/
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 09:29:46 EDT
# gpg: using RSA key E1A5C593CD419DE28E8315CF3C2525ED14360CDE
# gpg: issuer "peter.maydell@linaro.org"
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>" [full]
# gpg: aka "Peter Maydell <pmaydell@gmail.com>" [full]
# gpg: aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>" [full]
# gpg: aka "Peter Maydell <peter@archaic.org.uk>" [unknown]
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83 15CF 3C25 25ED 1436 0CDE
* tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu: (36 commits)
tests/functional: Add a test for the MAX78000 arm machine
docs/system: arm: Add max78000 board description
target/arm: Remove helper_sme2_luti4_4b
hw/arm/virt-acpi-build: Don't create ITS id mappings by default
system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
hw/arm/virt: Allow virt extensions with KVM
hw/arm/arm_gicv3_kvm: Add a migration blocker with kvm nested virt
target/arm: Enable feature ARM_FEATURE_EL2 if EL2 is supported
target/arm/kvm: Add helper to detect EL2 when using KVM
hw/arm: Allow setting KVM vGIC maintenance IRQ
hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
target/arm: Don't enforce NSE,NS check for EL3->EL3 returns
target/arm: Split out performance monitor regs to cpregs-pmu.c
target/arm: Split out AT insns to tcg/cpregs-at.c
target/arm: Drop stub for define_tlb_insn_regs
arm/kvm: shorten one overly long line
arm/cpu: store clidr into the idregs array
arm/cpu: fix trailing ',' for SET_IDREG
arm/cpu: store id_aa64afr{0,1} into the idregs array
arm/cpu: store id_afr0 into the idregs array
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* Link s390-ccw.img statically
* Fix broken bamboo functional test
* s390x code cleanups and refactorings
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmhw2i0RHHRodXRoQHJl
ZGhhdC5jb20ACgkQLtnXdP5wLbUGtA//XVr5t2/iH+zFdaHHFglMtYkqwyYspa/O
zGPgcIZptQrzlbR+GFJwd4ae1HWb60E1YDyC7M1iWGQXeMNrDgeJJjUQfhB7693Y
CPT1FCWaqXdrTHQJhf5+EGJZopwY1K4EHs+bMxCpU3ManD+MKuXzCgOMzZATnPUZ
EcvOrzDBfEFEzQn5COUi5FF5Ds4DpOqQY1g1tpG92hQwWeAgdPPXSYlakG64Hm8C
Km6BzAcylrRiHdORk3GeMJ1cPQ3vCjMrjTd87ra/xuH+DvPeyZ31cRIWIP1dn44x
eog5dWo7pNmwfU50c4w/6dTSqwHG/bD/2ZPJH2nnJDLK02WeguantPN43fdoPU0c
NEMldVE5GAqEr7Sbd5YIw9lBqrROIDfeUAxje4VZa1gSY4N/GYMGEZaM5vqYJJTP
0ndWP83QdamWuE0eOYMA+4oZiPpW79+Igv/PV13lsm9JgvO0WQisPFxE0cZqMTQp
+wgbQ69rpyMiQxpusiL/6LA3khDyC8Z8g7cmjBfpqgwmVAZp7ly+GLk+ctG0zsjE
hB99hkujZVkBZQLnVs0C/pXn1NdJ0wEupiHOSsVlQtqzNHlbweRJoxuGSp4Rl0Et
0DnTr3YHB6bdvRazaKzlkBHLLAXKEw0/xaRWGbE4tftZIrkOEeE0LMLLaLWLNKhX
rqRoxq00OPs=
=SOH3
-----END PGP SIGNATURE-----
Merge tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu into staging
* s390x: Allow to select different entries when booting via pxelinux.cfg
* Link s390-ccw.img statically
* Fix broken bamboo functional test
* s390x code cleanups and refactorings
# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmhw2i0RHHRodXRoQHJl
# ZGhhdC5jb20ACgkQLtnXdP5wLbUGtA//XVr5t2/iH+zFdaHHFglMtYkqwyYspa/O
# zGPgcIZptQrzlbR+GFJwd4ae1HWb60E1YDyC7M1iWGQXeMNrDgeJJjUQfhB7693Y
# CPT1FCWaqXdrTHQJhf5+EGJZopwY1K4EHs+bMxCpU3ManD+MKuXzCgOMzZATnPUZ
# EcvOrzDBfEFEzQn5COUi5FF5Ds4DpOqQY1g1tpG92hQwWeAgdPPXSYlakG64Hm8C
# Km6BzAcylrRiHdORk3GeMJ1cPQ3vCjMrjTd87ra/xuH+DvPeyZ31cRIWIP1dn44x
# eog5dWo7pNmwfU50c4w/6dTSqwHG/bD/2ZPJH2nnJDLK02WeguantPN43fdoPU0c
# NEMldVE5GAqEr7Sbd5YIw9lBqrROIDfeUAxje4VZa1gSY4N/GYMGEZaM5vqYJJTP
# 0ndWP83QdamWuE0eOYMA+4oZiPpW79+Igv/PV13lsm9JgvO0WQisPFxE0cZqMTQp
# +wgbQ69rpyMiQxpusiL/6LA3khDyC8Z8g7cmjBfpqgwmVAZp7ly+GLk+ctG0zsjE
# hB99hkujZVkBZQLnVs0C/pXn1NdJ0wEupiHOSsVlQtqzNHlbweRJoxuGSp4Rl0Et
# 0DnTr3YHB6bdvRazaKzlkBHLLAXKEw0/xaRWGbE4tftZIrkOEeE0LMLLaLWLNKhX
# rqRoxq00OPs=
# =SOH3
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 05:32:29 EDT
# gpg: using RSA key 27B88847EEE0250118F3EAB92ED9D774FE702DB5
# gpg: issuer "thuth@redhat.com"
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [full]
# gpg: aka "Thomas Huth <thuth@redhat.com>" [full]
# gpg: aka "Thomas Huth <huth@tuxfamily.org>" [full]
# gpg: aka "Thomas Huth <th.huth@posteo.de>" [unknown]
# Primary key fingerprint: 27B8 8847 EEE0 2501 18F3 EAB9 2ED9 D774 FE70 2DB5
* tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu:
target/s390x: Have s390_cpu_halt() not return anything
target/s390x: Expose s390_count_running_cpus() method
target/s390x: Remove unused s390_cpu_[un]halt() user stubs
tests/functional/test_ppc_bamboo: Replace broken link with working assets
tests/functional: Add dependency to the keymap_targets
pc-bios: Update the s390 bios images with the pxelinux.cfg loadparm changes
pc-bios/s390-ccw: link statically
tests/functional: Add a test for s390x pxelinux.cfg network booting
pc-bios/s390-ccw: Add a boot menu for booting via pxelinux.cfg
pc-bios/s390-ccw: Make get_boot_index() from menu.c global
pc-bios/s390-ccw: Allow up to 31 entries for pxelinux.cfg
pc-bios/s390-ccw: Allow to select a different pxelinux.cfg entry via loadparm
hw/s390x/s390-pci-bus.c: Use g_assert_not_reached() in functions taking an ett
target/s390x/tcg: Use vaddr in s390_probe_access()
target/s390x/kvm: Use vaddr in find/insert_hw_breakpoint()
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In the linux-user do_fork() function we try to set the FD_CLOEXEC
flag on a pidfd like this:
fcntl(pid_fd, F_SETFD, fcntl(pid_fd, F_GETFL) | FD_CLOEXEC);
This has two problems:
(1) it doesn't check errors, which Coverity complains about
(2) we use F_GETFL when we mean F_GETFD
Deal with both of these problems by using qemu_set_cloexec() instead.
That function will assert() if the fcntls fail, which is fine (we are
inside fork_start()/fork_end() so we know nothing can mess around
with our file descriptors here, and we just got this one from
pidfd_open()).
(As we are touching the if() statement here, we correct the
indentation.)
Coverity: CID 1508111
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250711141217.1429412-1-peter.maydell@linaro.org>
Avoid ubsan failure with clang-20,
tcg.h:715:19: runtime error: applying non-zero offset 64 to null pointer
by not using pointers.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Recent patch [1] renames the save_live_complete_precopy handler to
save_complete, as the machine is not live in most cases when this
handler is executed. The same is true also for
save_live_complete_precopy_thread, therefore this patch removes the
"live" keyword from the handler itself and related types to keep the
naming unified.
In contrast to save_complete, this handler is only executed at the end
of precopy, therefore the "precopy" keyword is retained.
[1]: https://lore.kernel.org/all/20250613140801.474264-7-peterx@redhat.com/
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250626085235.294690-1-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add the latency distribution too for blocktime, using order-of-two buckets.
It accounts for all the faults, from either vCPU or non-vCPU threads. With
prior rework, it's very easy to achieve by adding an array to account for
faults in each buckets.
Sample output for HMP (while for QMP it's simply an array):
Postcopy Latency Distribution:
[ 1 us - 2 us ]: 0
[ 2 us - 4 us ]: 0
[ 4 us - 8 us ]: 1
[ 8 us - 16 us ]: 2
[ 16 us - 32 us ]: 2
[ 32 us - 64 us ]: 3
[ 64 us - 128 us ]: 10169
[ 128 us - 256 us ]: 50151
[ 256 us - 512 us ]: 12876
[ 512 us - 1 ms ]: 97
[ 1 ms - 2 ms ]: 42
[ 2 ms - 4 ms ]: 44
[ 4 ms - 8 ms ]: 93
[ 8 ms - 16 ms ]: 138
[ 16 ms - 32 ms ]: 0
[ 32 ms - 65 ms ]: 0
[ 65 ms - 131 ms ]: 0
[ 131 ms - 262 ms ]: 0
[ 262 ms - 524 ms ]: 0
[ 524 ms - 1 sec ]: 0
[ 1 sec - 2 sec ]: 0
[ 2 sec - 4 sec ]: 0
[ 4 sec - 8 sec ]: 0
[ 8 sec - 16 sec ]: 0
Cc: Markus Armbruster <armbru@redhat.com>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-15-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
When used to report page fault latencies, the blocktime feature can be
almost useless when KVM async page fault is enabled, because in most cases
such remote fault will kickoff async page faults, then it's not trackable
from blocktime layer.
After all these recent rewrites to blocktime layer, it's finally so easy to
also support tracking non-vCPU faults. It'll be even faster if we could
always index fault records with TIDs, unfortunately we need to maintain the
blocktime API which report things in vCPU indexes.
Of course this can work not only for kworkers, but also any guest accesses
that may reach a missing page, for example, very likely when in the QEMU
main thread too (and all other threads whenever applicable).
In this case, we don't care about "how long the threads are blocked", but
we only care about "how long the fault will be resolved".
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-14-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Currently, the postcopy blocktime feature maintains vCPU fault information
using an array (vcpu_addr[]). It has two issues.
Issue 1: Performance Concern
============================
The old algorithm was almost OK and fast on inserts, except that the lookup
is slow and won't scale if there are a lot of vCPUs: when a page is copied
during postcopy, mark_postcopy_blocktime_end() will walk the whole array
trying to find which vCPUs are blocked by the address. So it needs
constant O(N) walk for each page resolution.
Alexey (the author of postcopy blocktime) mentioned the perf issue and how
to optimize it in a piece of comment in the page resolution path. The
comment was (interestingly..) not complete, but it's relatively clear what
he wanted to say about this perf issue.
Issue 2: Wrong Accounting on re-entrancies
==========================================
People might think that each vCPU should only and always get one fault at a
time, so that when the blocktime layer captured one fault on one vCPU, we
should never see another fault message on this vCPU.
It's almost correct, except in some extreme rare cases.
Case 1: it's possible the fault thread processes the userfaultfd messages
too fast so it can see >1 messages on one vCPU before the previous one was
resolved.
Case 2: it's theoretically also possible one vCPU can get even more than
one message on the same fault address if a fault is retried by the
kernel (e.g., handle_userfault() got interrupted before page resolution).
As this info might be important, instead of using commit message, I put
more details into the code as comment, when introducing an array
maintaining concurrent faults on one vCPU. Please refer to the comments
for details on both cases, especially case 1 which can be tricky.
Case 1 sounds rare, but it can be easily reproduced locally for me when we
run blocktime together with the migration-test on the vanilla postcopy.
New Design
==========
This patch should do almost what Alexey mentioned, but slightly
differently: instead of having an array to maintain vCPU fault addresses,
for each of the fault message we push a message into a hash, indexed by the
fault address.
With the hash, it can replace the old two structs: both the vcpu_addr[]
array, and also the array to store the start time of the fault. However
due to above we need one more counter array to account concurrent faults on
the same vCPU - that should even be needed in the old code, it's just that
the old code was buggy and it will blindly overwrite an existing
entry.. now we'll start to really track everything.
The hash structure might be more efficient than tree to maintain such
addr->(cpu, fault_time) information, so that the insert() and lookup()
paths should ideally both be ~O(1). After all, we do not need to sort.
Here we need to do one remove() though after the lookup(). It could be
slow but only if many vCPUs faulted exactly on the same address (so when
the list of cpu entries is long), which should be unlikely. Even with that,
it's still a worst case O(N) (consider 400 vCPUs faulted on the same
address and how likely is it..) rather than a constant O(N) complexity.
When at it, touch up the tracepoints to make them slightly more useful.
One tracepoint is added when walking all the fault entries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-13-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The variable vcpu_total_blocktime isn't easy to follow. In reality, it
wants to capture the case where all vCPUs are stopped, and now there will
be some vCPUs starts running.
The name now starts to conflict with vcpu_blocktime_total[], meanwhile it's
actually not necessary to have the variable at all: since nobody is
touching smp_cpus_down except ourselves, we can safely do the calculation
at the end before decrementing smp_cpus_down.
Hopefully this makes the logic easier to read, side benefit is we drop one
temp var.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Looking up the vCPU index for each fault can be expensive when there're
hundreds of vCPUs. Provide a cache for tid->vcpu instead with a hash
table, then lookup from there.
When at it, add another counter to record how many non-vCPU faults it gets.
For example, the main thread can also access a guest page that was missing.
These kind of faults are not accounted by blocktime so far.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-11-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Before this patch, the blocktime context can be created very early, because
postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during
migration object init.
The trick here is the blocktime context needs system vCPU information,
which seems to be possible to change after that point. I didn't verify it,
but it doesn't sound right.
Now move it out and initialize the context only when postcopy listen
starts. That is already during a migration so it should be guaranteed the
vCPU topology can never change on both sides.
While at it, assert that the ctx isn't created instead this time; the old
"if" trick isn't needed when we're sure it will only happen once now.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Blocktime so far only cares about the time one vcpu (or the whole system)
got blocked. It would be also be helpful if it can also report the latency
of page requests, which could be very sensitive during postcopy.
Blocktime itself is sometimes not very important, especially when one
thinks about KVM async PF support, which means vCPUs are literally almost
not blocked at all because the guest OS is smart enough to switch to
another task when a remote fault is needed.
However, latency is still sensitive and important because even if the guest
vCPU is running on threads that do not need a remote fault, the workload
that accesses some missing page is still affected.
Add two entries to the report, showing how long it takes to resolve a
remote fault. Mention in the QAPI doc that this is not the real average
fault latency, but only the ones that was requested for a remote fault.
Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
meanwhile add the entry checks in qtests for all postcopy tests.
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add a field to count how many remote faults one vCPU has taken. So far
it's still not used, but will be soon.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
With 64-bit fields, it is trivial. The caution is when exposing any values
in QMP, it was still declared with milliseconds (ms). Hence it's needed to
do the convertion when exporting the values to existing QMP queries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Now with 64bits, the offseting using start_time is not needed anymore,
because the array can always remember the whole timestamp.
Then drop the unused parameter in get_low_time_offset() altogether.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
I am guessing it was used to be 32bits because of the atomic ops. Now all
the atomic ops are gone and we're protected by a mutex instead, it's ok we
can switch to 64 bits.
Reasons to move over:
- Allow further patches to change the unit from ms to us: with postcopy
preempt mode, we're really into hundreds of microseconds level on
blocktime. We'd better be able to trap those.
- This also paves way for some other tricks that the original version
used to avoid overflows, e.g., start_time was almost only useful before
to make sure the sampled timestamp won't overflow a 32-bit field.
- This prepares further reports on top of existing data collected,
e.g. average page fault latencies. When average operation is taken into
account, milliseconds are simply too coarse grained.
When at it:
- Rename page_fault_vcpu_time to vcpu_blocktime_start.
- Rename vcpu_blocktime to vcpu_blocktime_total.
- Touch up the trace-events to not dump blocktime ctx pointer
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>