Extend 'inhibit=on' setting with the option to specify a pinned XSK map
path along with a starting index (default 0) to push the created XSK
sockets into. Example usage:
# ./build/qemu-system-x86_64 [...] \
-netdev af-xdp,ifname=enp2s0f0np0,id=net0,mode=native,queues=2,start-queue=14,inhibit=on,map-path=/sys/fs/bpf/xsks_map,map-start-index=14 \
-device virtio-net-pci,netdev=net0 [...]
This is useful for the case where an existing XDP program with XSK map
is present on the AF_XDP supported phys device and the XSK map is not
yet populated. For example, the former could have been pre-loaded onto
the netdevice by a control plane, which later launches QEMU to populate
it with XSK sockets.
Normally, the main idea behind 'inhibit=on' is that the QEMU instance
doesn't need to have a lot of privileges to use the pre-loaded program
and the pre-created sockets, but this mentioned use-case here is different
where QEMU still needs privileges to create the sockets.
The 'map-start-index' parameter is optional and defaults to 0. It allows
flexible placement of the XSK sockets, and is up to the user to specify
when the XDP program with XSK map was already preloaded. In the simplest
case the queue-to-map-slot mapping is just 1:1 based on ctx->rx_queue_index
but the user might as well have a different scheme (or smaller map size,
e.g. ctx->rx_queue_index % max_size) to push the inbound traffic to one
of the XSK sockets.
Note that the bpf_xdp_query_id() is now only tested for 'inhibit=off'
since only in the latter case the libxdp takes care of installing the
XDP program which was installed based on the s->xdp_flags pointing to
either driver or skb mode. For 'inhibit=on' we don't make any assumptions
and neither go down the path of probing all possible options in which
way the user installed the XDP program.
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
While testing, it turned out that upon error in the queue creation loop,
we never trigger the af_xdp_cleanup() handler. This is because we pass
errp instead of a local err pointer into the various AF_XDP setup functions
instead of a scheme like:
bool fn(..., Error **errp)
{
Error *err = NULL;
foo(arg, &err);
if (err) {
handle the error...
error_propagate(errp, err);
return false;
}
...
}
The same is true for the attachment probing with bpf_xdp_query_id(). With a
conversion into the above format, the af_xdp_cleanup() handler is called as
expected. Note the error_propagate() handles a NULL err internally.
Fixes: cb039ef3d9 ("net: add initial support for AF_XDP network backend")
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
There are two issues with the XDP program removal in af_xdp_cleanup():
1) Starting from libxdp 1.3.0 [0] the XDP program gets automatically
detached when we call xsk_socket__delete() for the last successfully
configured queue. libxdp internally keeps track of that. For QEMU
we require libxdp >= 1.4.0. Given QEMU is not loading the program,
lets also not attempt to remove it and delegate this instead.
2) The removal logic is incorrect anyway because we are setting n_queues
into the last queue that never has xdp_flags on failure, so the logic
is always skipped since the non-zero test for s->xdp_flags in
af_xdp_cleanup() fails.
Fixes: cb039ef3d9 ("net: add initial support for AF_XDP network backend")
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Cc: Ilya Maximets <i.maximets@ovn.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Anton Protopopov <aspsk@isovalent.com>
Link: 38c2914988 [0]
Signed-off-by: Jason Wang <jasowang@redhat.com>
This commit adds support for the vhost-user interface to the passt
network backend, enabling high-performance, accelerated networking for
guests using passt.
The passt backend can now operate in a vhost-user mode, where it
communicates with the guest's virtio-net device over a socket pair
using the vhost-user protocol. This offloads the datapath from the
main QEMU loop, significantly improving network performance.
When the vhost-user=on option is used with -netdev passt, the new
vhost initialization path is taken instead of the standard
stream-based connection.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
This commit introduces support for passt as a new network backend.
passt is an unprivileged, user-mode networking solution that provides
connectivity for virtual machines by launching an external helper process.
The implementation reuses the generic stream data handling logic. It
launches the passt binary using GSubprocess, passing it a file
descriptor from a socketpair() for communication. QEMU connects to
the other end of the socket pair to establish the network data stream.
The PID of the passt daemon is tracked via a temporary file to
ensure it is terminated when QEMU exits.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Introduce a boolean is_vhost_user field to the vhost_net
structure. This flag is initialized during vhost_net_init based
on whether the backend is vhost-user.
This refactoring simplifies checks for vhost-user specific behavior,
replacing direct comparisons of 'net->nc->info->type' with the new
flag. It improves readability and encapsulates the backend type
information directly within the vhost_net instance.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
This commit refactors how the maximum transmit queue size for
virtio-net devices is determined, making the mechanism more generic
and extensible.
Previously, virtio_net_max_tx_queue_size() contained hardcoded
checks for specific network backend types (vhost-user and
vhost-vdpa) to determine their supported maximum queue size. This
created direct dependencies and would require modifications for
every new backend that supports variable queue sizes.
To improve flexibility, a new max_tx_queue_size field is added
to the vhost_net structure. This allows each network backend
to advertise its supported maximum transmit queue size directly.
The virtio_net_max_tx_queue_size() function now retrieves the max
TX queue size from the vhost_net struct, if available and set.
Otherwise, it defaults to VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
This commit introduces a save_acked_features function pointer to
vhost_net and converts the vhost_net function into a generic dispatcher.
The vhost-user backend provides the callback, making its function static.
With this change, no other module has a direct dependency on the
vhost-user implementation.
This cleanup allows for the complete removal of the net/vhost-user.h
header file.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
This patch continues the effort to decouple the generic vhost layer
from specific network backend implementations.
Previously, the vhost_net initialization code contained a hardcoded
check for the vhost-user client type to retrieve its acked features
by calling vhost_user_get_acked_features(). This exposed an
internal vhost-user function in a public header and coupled the two
modules.
The vhost-user backend is updated to provide a callback, and its
getter function is now static. The call site in vhost_net.c is
simplified to use the new generic helper, removing the type check and
the direct dependency.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Previously, the vhost_net_get_feature_bits() function in
hw/net/vhost_net.c used a large switch statement to determine
the appropriate feature bits based on the NetClientDriver type.
This created unnecessary coupling between the generic vhost layer
and specific network backends (like TAP, vhost-user, and
vhost-vdpa).
This patch moves the definition of vhost feature bits directly into the
vhost_net structure for each relevant network client.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
The get_vhost_net() function previously contained a large switch
statement to find the VHostNetState pointer based on the net
client's type. This created a tight coupling, requiring the generic
vhost layer to be aware of every specific backend that supported
vhost, such as tap, vhost-user, and vhost-vdpa.
This approach is not scalable and requires modifying a central function
for any new backend. It also forced each backend to expose its internal
getter function in a public header file.
This patch refactors the logic by introducing a new get_vhost_net
function pointer to the NetClientInfo struct. The central
get_vhost_net() function is now a simple, generic dispatcher that
invokes the callback provided by the net client.
Each backend now implements its own private getter and registers it in
its NetClientInfo.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
This is a cosmetic change with no functional impact.
The function vhost_set_vring_enable() is specific to vhost_net and
is used outside of vhost_net.c (specifically, in
hw/net/virtio-net.c). To prevent confusion with other similarly named
vhost functions, such as the one found in cryptodev-vhost.c, it has
been renamed to vhost_net_set_vring_enable(). This clarifies that the
function belongs to the vhost_net module.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
The code to set the link status is currently located in
qmp_set_link(). This function identifies the device by name,
searches for the corresponding NetClientState, and then updates
the link status.
In some parts of the code, such as vhost-user.c, the
NetClientState are already available. Calling qmp_set_link()
from these locations leads to a redundant search for the clients.
This patch refactors the logic by introducing a new function,
net_client_set_link(), which accepts a NetClientState array
directly. qmp_set_link() is simplified to be a wrapper that
performs the client search and then calls the new function.
The vhost-user implementation is updated to use net_client_set_link()
directly, thereby eliminating the unnecessary client lookup.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
To prepare for the implementation of '-net passt', this patch moves
the generic stream handling functions from net/stream.c into new
net/stream_data.c and net/stream_data.h files.
This refactoring introduces a NetStreamData struct that encapsulates
the generic fields and logic previously in NetStreamState. The
NetStreamState now embeds NetStreamData and delegates the core
stream operations to the new generic functions.
To maintain flexibility for different users of this generic code,
callbacks for send and listen operations are now passed via
function pointers within the NetStreamData struct. This allows
callers to provide their own specific implementations while reusing
the common connection and data transfer logic.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
virtio_net_pre_load_queues() inspects vdev->guest_features to tell if
VIRTIO_NET_F_RSS or VIRTIO_NET_F_MQ is enabled to infer the required
number of queues. This works for VIRTIO_NET_F_MQ but it doesn't for
VIRTIO_NET_F_RSS because only the lowest 32 bits of vdev->guest_features
is set at the point and VIRTIO_NET_F_RSS uses bit 60 while
VIRTIO_NET_F_MQ uses bit 22.
Instead of inferring the required number of queues from
vdev->guest_features, use the number loaded from the vm state. This
change also has a nice side effect to remove a duplicate peer queue
pair change by circumventing virtio_net_set_multiqueue().
Also update the comment in include/hw/virtio/virtio.h to prevent an
implementation of pre_load_queues() from refering to any fields being
loaded during migration by accident in the future.
Fixes: 8c49756825 ("virtio-net: Add only one queue pair when realizing")
Tested-by: Lei Yang <leiyang@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
s->pool has n_descs elements so maximum i should be
n_descs - 1. Fix the upper bound.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: cb039ef3d9 ("net: add initial support for AF_XDP network backend")
Cc: qemu-stable@nongnu.org
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Anastasia Belova <nabelova31@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user: Implement fchmodat2 syscall
linux-user: Check for EFAULT failure in nanosleep
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
linux-user/gen-vdso: Handle fseek() failure
linux-user/gen-vdso: Don't read off the end of buf[]
-----BEGIN PGP SIGNATURE-----
iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmhxSAkdHHJpY2hhcmQu
aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV9wiQf+PrXwKj+FusE0YU1y
Lnx6+S0M/lDRCNhbgBrw7JK5WUwIfnZQuepf0vjuhoHH1rUdT1EUYdJ7Quwj9fgG
0YcKRD8OAVKNU8I3ydtzSaJ3TZ02nbbDbwGMoD/eNXGKx0Gt5907vD4PrjT+mByG
6QTLwuql3ahkl/Tiskk2LwbmHRe0CXiezVuzgprbNiyxrgDT8ArqCq+VJzv/wb2O
4t6BqRDvBzRe7MUUs2B2W+hs0HW4Rfqcye/3rRnYe7HA4CTiVNqY9rwgrQqGEO0P
3Cf+VaF6CaLz+HuHfM8rz+xBhfo+UpZYOVMXk/7VEAG6geMKTcQG1tCJYhL+xklJ
9r4ABw==
=rD+6
-----END PGP SIGNATURE-----
Merge tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu into staging
fpu: Process float_muladd_negate_result after rounding
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user: Implement fchmodat2 syscall
linux-user: Check for EFAULT failure in nanosleep
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
linux-user/gen-vdso: Handle fseek() failure
linux-user/gen-vdso: Don't read off the end of buf[]
# -----BEGIN PGP SIGNATURE-----
#
# iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmhxSAkdHHJpY2hhcmQu
# aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV9wiQf+PrXwKj+FusE0YU1y
# Lnx6+S0M/lDRCNhbgBrw7JK5WUwIfnZQuepf0vjuhoHH1rUdT1EUYdJ7Quwj9fgG
# 0YcKRD8OAVKNU8I3ydtzSaJ3TZ02nbbDbwGMoD/eNXGKx0Gt5907vD4PrjT+mByG
# 6QTLwuql3ahkl/Tiskk2LwbmHRe0CXiezVuzgprbNiyxrgDT8ArqCq+VJzv/wb2O
# 4t6BqRDvBzRe7MUUs2B2W+hs0HW4Rfqcye/3rRnYe7HA4CTiVNqY9rwgrQqGEO0P
# 3Cf+VaF6CaLz+HuHfM8rz+xBhfo+UpZYOVMXk/7VEAG6geMKTcQG1tCJYhL+xklJ
# 9r4ABw==
# =rD+6
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 13:21:13 EDT
# gpg: using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F
# gpg: issuer "richard.henderson@linaro.org"
# gpg: Good signature from "Richard Henderson <richard.henderson@linaro.org>" [full]
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A 05C0 64DF 38E8 AF7E 215F
* tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu:
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user/mips/o32: Drop sa_restorer functionality
linux-user/gen-vdso: Don't read off the end of buf[]
linux-user/gen-vdso: Handle fseek() failure
linux-user: Check for EFAULT failure in nanosleep
linux-user: Implement fchmodat2 syscall
fpu: Process float_muladd_negate_result after rounding
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* New board type max78000fthr
* Enable use of CXL on Arm 'virt' board
* Some more tidyup of ID register handling
* Refactor AT insns and PMU regs into separate source files
* Don't enforce NSE,NS check for EL3->EL3 returns
* hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
* Allow nested-virtualization with KVM on the 'virt' board
* system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
* hw/arm/virt-acpi-build: Don't create ITS id mappings by default
* target/arm: Remove unused helper_sme2_luti4_4b
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmhxEcoZHHBldGVyLm1h
eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3j5yEACWYnNeqo8Yph6/EJExE6eV
r0tC6FBb5ShPgA6kDxhpOc1lI6uXGh8+D7bL9BePEdz/brCf1QDfs2Z4q/hb5ysX
D0H6VI5Gr1j6MjkFRBo3+vvYz4Yh++XLn5Q9lZv8zaSEdraq/ay2kxnuhRCK+4Ar
+QoGtKrGMJ7UCpfiRlvNnd1UjgORZf10EE/bRImX13sxeDomP3CZhFzAyJyShOP9
JA7bAd4rYJ4oj8R33y8Yaxjwm4FOndj740B0zwpO8mpjzFiE5zbqsaO+mEgYSflc
OQisCu/KRFpyIR+UqP+4gNaJLfKQW5Y4r61zEaiJWV/c4RdKNnbK1f7MX11fNhOk
k1paF3GIXp6f794Hb14vtsYnKHF2eeNSmRkAomXxLgUSYzLezL+yj7cdYmRJhgYU
thc1PSiEmHYhjRmOaMC9+dkMtvIexWyDNYNFTygoOE5/kTMSazeTFQpFmw+ZuTee
9pjKsYRZJgTa64IkJy1L34jc2gds48Q20KpQsqZ22KQcjwt4PW4eQXkvMylawSut
mArHVH6AAxIK+defeEmnQCJ0OccyGCENjRDuWyWMMGoP/ggZpO47rGWmCUOK8xz8
IfGdPeF/9xsKSKWvjpiHyyKa48wuO2bVC+5bISS6IPA2uGneS2DpmjkHU+gHBqpk
GNlvEnXZfavZOHejE7/L/Q==
=hJ4/
-----END PGP SIGNATURE-----
Merge tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu into staging
target-arm queue:
* New board type max78000fthr
* Enable use of CXL on Arm 'virt' board
* Some more tidyup of ID register handling
* Refactor AT insns and PMU regs into separate source files
* Don't enforce NSE,NS check for EL3->EL3 returns
* hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
* Allow nested-virtualization with KVM on the 'virt' board
* system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
* hw/arm/virt-acpi-build: Don't create ITS id mappings by default
* target/arm: Remove unused helper_sme2_luti4_4b
# -----BEGIN PGP SIGNATURE-----
#
# iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmhxEcoZHHBldGVyLm1h
# eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3j5yEACWYnNeqo8Yph6/EJExE6eV
# r0tC6FBb5ShPgA6kDxhpOc1lI6uXGh8+D7bL9BePEdz/brCf1QDfs2Z4q/hb5ysX
# D0H6VI5Gr1j6MjkFRBo3+vvYz4Yh++XLn5Q9lZv8zaSEdraq/ay2kxnuhRCK+4Ar
# +QoGtKrGMJ7UCpfiRlvNnd1UjgORZf10EE/bRImX13sxeDomP3CZhFzAyJyShOP9
# JA7bAd4rYJ4oj8R33y8Yaxjwm4FOndj740B0zwpO8mpjzFiE5zbqsaO+mEgYSflc
# OQisCu/KRFpyIR+UqP+4gNaJLfKQW5Y4r61zEaiJWV/c4RdKNnbK1f7MX11fNhOk
# k1paF3GIXp6f794Hb14vtsYnKHF2eeNSmRkAomXxLgUSYzLezL+yj7cdYmRJhgYU
# thc1PSiEmHYhjRmOaMC9+dkMtvIexWyDNYNFTygoOE5/kTMSazeTFQpFmw+ZuTee
# 9pjKsYRZJgTa64IkJy1L34jc2gds48Q20KpQsqZ22KQcjwt4PW4eQXkvMylawSut
# mArHVH6AAxIK+defeEmnQCJ0OccyGCENjRDuWyWMMGoP/ggZpO47rGWmCUOK8xz8
# IfGdPeF/9xsKSKWvjpiHyyKa48wuO2bVC+5bISS6IPA2uGneS2DpmjkHU+gHBqpk
# GNlvEnXZfavZOHejE7/L/Q==
# =hJ4/
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 09:29:46 EDT
# gpg: using RSA key E1A5C593CD419DE28E8315CF3C2525ED14360CDE
# gpg: issuer "peter.maydell@linaro.org"
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>" [full]
# gpg: aka "Peter Maydell <pmaydell@gmail.com>" [full]
# gpg: aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>" [full]
# gpg: aka "Peter Maydell <peter@archaic.org.uk>" [unknown]
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83 15CF 3C25 25ED 1436 0CDE
* tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu: (36 commits)
tests/functional: Add a test for the MAX78000 arm machine
docs/system: arm: Add max78000 board description
target/arm: Remove helper_sme2_luti4_4b
hw/arm/virt-acpi-build: Don't create ITS id mappings by default
system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
hw/arm/virt: Allow virt extensions with KVM
hw/arm/arm_gicv3_kvm: Add a migration blocker with kvm nested virt
target/arm: Enable feature ARM_FEATURE_EL2 if EL2 is supported
target/arm/kvm: Add helper to detect EL2 when using KVM
hw/arm: Allow setting KVM vGIC maintenance IRQ
hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
target/arm: Don't enforce NSE,NS check for EL3->EL3 returns
target/arm: Split out performance monitor regs to cpregs-pmu.c
target/arm: Split out AT insns to tcg/cpregs-at.c
target/arm: Drop stub for define_tlb_insn_regs
arm/kvm: shorten one overly long line
arm/cpu: store clidr into the idregs array
arm/cpu: fix trailing ',' for SET_IDREG
arm/cpu: store id_aa64afr{0,1} into the idregs array
arm/cpu: store id_afr0 into the idregs array
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* Link s390-ccw.img statically
* Fix broken bamboo functional test
* s390x code cleanups and refactorings
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmhw2i0RHHRodXRoQHJl
ZGhhdC5jb20ACgkQLtnXdP5wLbUGtA//XVr5t2/iH+zFdaHHFglMtYkqwyYspa/O
zGPgcIZptQrzlbR+GFJwd4ae1HWb60E1YDyC7M1iWGQXeMNrDgeJJjUQfhB7693Y
CPT1FCWaqXdrTHQJhf5+EGJZopwY1K4EHs+bMxCpU3ManD+MKuXzCgOMzZATnPUZ
EcvOrzDBfEFEzQn5COUi5FF5Ds4DpOqQY1g1tpG92hQwWeAgdPPXSYlakG64Hm8C
Km6BzAcylrRiHdORk3GeMJ1cPQ3vCjMrjTd87ra/xuH+DvPeyZ31cRIWIP1dn44x
eog5dWo7pNmwfU50c4w/6dTSqwHG/bD/2ZPJH2nnJDLK02WeguantPN43fdoPU0c
NEMldVE5GAqEr7Sbd5YIw9lBqrROIDfeUAxje4VZa1gSY4N/GYMGEZaM5vqYJJTP
0ndWP83QdamWuE0eOYMA+4oZiPpW79+Igv/PV13lsm9JgvO0WQisPFxE0cZqMTQp
+wgbQ69rpyMiQxpusiL/6LA3khDyC8Z8g7cmjBfpqgwmVAZp7ly+GLk+ctG0zsjE
hB99hkujZVkBZQLnVs0C/pXn1NdJ0wEupiHOSsVlQtqzNHlbweRJoxuGSp4Rl0Et
0DnTr3YHB6bdvRazaKzlkBHLLAXKEw0/xaRWGbE4tftZIrkOEeE0LMLLaLWLNKhX
rqRoxq00OPs=
=SOH3
-----END PGP SIGNATURE-----
Merge tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu into staging
* s390x: Allow to select different entries when booting via pxelinux.cfg
* Link s390-ccw.img statically
* Fix broken bamboo functional test
* s390x code cleanups and refactorings
# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmhw2i0RHHRodXRoQHJl
# ZGhhdC5jb20ACgkQLtnXdP5wLbUGtA//XVr5t2/iH+zFdaHHFglMtYkqwyYspa/O
# zGPgcIZptQrzlbR+GFJwd4ae1HWb60E1YDyC7M1iWGQXeMNrDgeJJjUQfhB7693Y
# CPT1FCWaqXdrTHQJhf5+EGJZopwY1K4EHs+bMxCpU3ManD+MKuXzCgOMzZATnPUZ
# EcvOrzDBfEFEzQn5COUi5FF5Ds4DpOqQY1g1tpG92hQwWeAgdPPXSYlakG64Hm8C
# Km6BzAcylrRiHdORk3GeMJ1cPQ3vCjMrjTd87ra/xuH+DvPeyZ31cRIWIP1dn44x
# eog5dWo7pNmwfU50c4w/6dTSqwHG/bD/2ZPJH2nnJDLK02WeguantPN43fdoPU0c
# NEMldVE5GAqEr7Sbd5YIw9lBqrROIDfeUAxje4VZa1gSY4N/GYMGEZaM5vqYJJTP
# 0ndWP83QdamWuE0eOYMA+4oZiPpW79+Igv/PV13lsm9JgvO0WQisPFxE0cZqMTQp
# +wgbQ69rpyMiQxpusiL/6LA3khDyC8Z8g7cmjBfpqgwmVAZp7ly+GLk+ctG0zsjE
# hB99hkujZVkBZQLnVs0C/pXn1NdJ0wEupiHOSsVlQtqzNHlbweRJoxuGSp4Rl0Et
# 0DnTr3YHB6bdvRazaKzlkBHLLAXKEw0/xaRWGbE4tftZIrkOEeE0LMLLaLWLNKhX
# rqRoxq00OPs=
# =SOH3
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 05:32:29 EDT
# gpg: using RSA key 27B88847EEE0250118F3EAB92ED9D774FE702DB5
# gpg: issuer "thuth@redhat.com"
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [full]
# gpg: aka "Thomas Huth <thuth@redhat.com>" [full]
# gpg: aka "Thomas Huth <huth@tuxfamily.org>" [full]
# gpg: aka "Thomas Huth <th.huth@posteo.de>" [unknown]
# Primary key fingerprint: 27B8 8847 EEE0 2501 18F3 EAB9 2ED9 D774 FE70 2DB5
* tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu:
target/s390x: Have s390_cpu_halt() not return anything
target/s390x: Expose s390_count_running_cpus() method
target/s390x: Remove unused s390_cpu_[un]halt() user stubs
tests/functional/test_ppc_bamboo: Replace broken link with working assets
tests/functional: Add dependency to the keymap_targets
pc-bios: Update the s390 bios images with the pxelinux.cfg loadparm changes
pc-bios/s390-ccw: link statically
tests/functional: Add a test for s390x pxelinux.cfg network booting
pc-bios/s390-ccw: Add a boot menu for booting via pxelinux.cfg
pc-bios/s390-ccw: Make get_boot_index() from menu.c global
pc-bios/s390-ccw: Allow up to 31 entries for pxelinux.cfg
pc-bios/s390-ccw: Allow to select a different pxelinux.cfg entry via loadparm
hw/s390x/s390-pci-bus.c: Use g_assert_not_reached() in functions taking an ett
target/s390x/tcg: Use vaddr in s390_probe_access()
target/s390x/kvm: Use vaddr in find/insert_hw_breakpoint()
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In the linux-user do_fork() function we try to set the FD_CLOEXEC
flag on a pidfd like this:
fcntl(pid_fd, F_SETFD, fcntl(pid_fd, F_GETFL) | FD_CLOEXEC);
This has two problems:
(1) it doesn't check errors, which Coverity complains about
(2) we use F_GETFL when we mean F_GETFD
Deal with both of these problems by using qemu_set_cloexec() instead.
That function will assert() if the fcntls fail, which is fine (we are
inside fork_start()/fork_end() so we know nothing can mess around
with our file descriptors here, and we just got this one from
pidfd_open()).
(As we are touching the if() statement here, we correct the
indentation.)
Coverity: CID 1508111
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250711141217.1429412-1-peter.maydell@linaro.org>
Avoid ubsan failure with clang-20,
tcg.h:715:19: runtime error: applying non-zero offset 64 to null pointer
by not using pointers.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Recent patch [1] renames the save_live_complete_precopy handler to
save_complete, as the machine is not live in most cases when this
handler is executed. The same is true also for
save_live_complete_precopy_thread, therefore this patch removes the
"live" keyword from the handler itself and related types to keep the
naming unified.
In contrast to save_complete, this handler is only executed at the end
of precopy, therefore the "precopy" keyword is retained.
[1]: https://lore.kernel.org/all/20250613140801.474264-7-peterx@redhat.com/
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250626085235.294690-1-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add the latency distribution too for blocktime, using order-of-two buckets.
It accounts for all the faults, from either vCPU or non-vCPU threads. With
prior rework, it's very easy to achieve by adding an array to account for
faults in each buckets.
Sample output for HMP (while for QMP it's simply an array):
Postcopy Latency Distribution:
[ 1 us - 2 us ]: 0
[ 2 us - 4 us ]: 0
[ 4 us - 8 us ]: 1
[ 8 us - 16 us ]: 2
[ 16 us - 32 us ]: 2
[ 32 us - 64 us ]: 3
[ 64 us - 128 us ]: 10169
[ 128 us - 256 us ]: 50151
[ 256 us - 512 us ]: 12876
[ 512 us - 1 ms ]: 97
[ 1 ms - 2 ms ]: 42
[ 2 ms - 4 ms ]: 44
[ 4 ms - 8 ms ]: 93
[ 8 ms - 16 ms ]: 138
[ 16 ms - 32 ms ]: 0
[ 32 ms - 65 ms ]: 0
[ 65 ms - 131 ms ]: 0
[ 131 ms - 262 ms ]: 0
[ 262 ms - 524 ms ]: 0
[ 524 ms - 1 sec ]: 0
[ 1 sec - 2 sec ]: 0
[ 2 sec - 4 sec ]: 0
[ 4 sec - 8 sec ]: 0
[ 8 sec - 16 sec ]: 0
Cc: Markus Armbruster <armbru@redhat.com>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-15-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
When used to report page fault latencies, the blocktime feature can be
almost useless when KVM async page fault is enabled, because in most cases
such remote fault will kickoff async page faults, then it's not trackable
from blocktime layer.
After all these recent rewrites to blocktime layer, it's finally so easy to
also support tracking non-vCPU faults. It'll be even faster if we could
always index fault records with TIDs, unfortunately we need to maintain the
blocktime API which report things in vCPU indexes.
Of course this can work not only for kworkers, but also any guest accesses
that may reach a missing page, for example, very likely when in the QEMU
main thread too (and all other threads whenever applicable).
In this case, we don't care about "how long the threads are blocked", but
we only care about "how long the fault will be resolved".
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-14-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Currently, the postcopy blocktime feature maintains vCPU fault information
using an array (vcpu_addr[]). It has two issues.
Issue 1: Performance Concern
============================
The old algorithm was almost OK and fast on inserts, except that the lookup
is slow and won't scale if there are a lot of vCPUs: when a page is copied
during postcopy, mark_postcopy_blocktime_end() will walk the whole array
trying to find which vCPUs are blocked by the address. So it needs
constant O(N) walk for each page resolution.
Alexey (the author of postcopy blocktime) mentioned the perf issue and how
to optimize it in a piece of comment in the page resolution path. The
comment was (interestingly..) not complete, but it's relatively clear what
he wanted to say about this perf issue.
Issue 2: Wrong Accounting on re-entrancies
==========================================
People might think that each vCPU should only and always get one fault at a
time, so that when the blocktime layer captured one fault on one vCPU, we
should never see another fault message on this vCPU.
It's almost correct, except in some extreme rare cases.
Case 1: it's possible the fault thread processes the userfaultfd messages
too fast so it can see >1 messages on one vCPU before the previous one was
resolved.
Case 2: it's theoretically also possible one vCPU can get even more than
one message on the same fault address if a fault is retried by the
kernel (e.g., handle_userfault() got interrupted before page resolution).
As this info might be important, instead of using commit message, I put
more details into the code as comment, when introducing an array
maintaining concurrent faults on one vCPU. Please refer to the comments
for details on both cases, especially case 1 which can be tricky.
Case 1 sounds rare, but it can be easily reproduced locally for me when we
run blocktime together with the migration-test on the vanilla postcopy.
New Design
==========
This patch should do almost what Alexey mentioned, but slightly
differently: instead of having an array to maintain vCPU fault addresses,
for each of the fault message we push a message into a hash, indexed by the
fault address.
With the hash, it can replace the old two structs: both the vcpu_addr[]
array, and also the array to store the start time of the fault. However
due to above we need one more counter array to account concurrent faults on
the same vCPU - that should even be needed in the old code, it's just that
the old code was buggy and it will blindly overwrite an existing
entry.. now we'll start to really track everything.
The hash structure might be more efficient than tree to maintain such
addr->(cpu, fault_time) information, so that the insert() and lookup()
paths should ideally both be ~O(1). After all, we do not need to sort.
Here we need to do one remove() though after the lookup(). It could be
slow but only if many vCPUs faulted exactly on the same address (so when
the list of cpu entries is long), which should be unlikely. Even with that,
it's still a worst case O(N) (consider 400 vCPUs faulted on the same
address and how likely is it..) rather than a constant O(N) complexity.
When at it, touch up the tracepoints to make them slightly more useful.
One tracepoint is added when walking all the fault entries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-13-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The variable vcpu_total_blocktime isn't easy to follow. In reality, it
wants to capture the case where all vCPUs are stopped, and now there will
be some vCPUs starts running.
The name now starts to conflict with vcpu_blocktime_total[], meanwhile it's
actually not necessary to have the variable at all: since nobody is
touching smp_cpus_down except ourselves, we can safely do the calculation
at the end before decrementing smp_cpus_down.
Hopefully this makes the logic easier to read, side benefit is we drop one
temp var.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Looking up the vCPU index for each fault can be expensive when there're
hundreds of vCPUs. Provide a cache for tid->vcpu instead with a hash
table, then lookup from there.
When at it, add another counter to record how many non-vCPU faults it gets.
For example, the main thread can also access a guest page that was missing.
These kind of faults are not accounted by blocktime so far.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-11-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Before this patch, the blocktime context can be created very early, because
postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during
migration object init.
The trick here is the blocktime context needs system vCPU information,
which seems to be possible to change after that point. I didn't verify it,
but it doesn't sound right.
Now move it out and initialize the context only when postcopy listen
starts. That is already during a migration so it should be guaranteed the
vCPU topology can never change on both sides.
While at it, assert that the ctx isn't created instead this time; the old
"if" trick isn't needed when we're sure it will only happen once now.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Blocktime so far only cares about the time one vcpu (or the whole system)
got blocked. It would be also be helpful if it can also report the latency
of page requests, which could be very sensitive during postcopy.
Blocktime itself is sometimes not very important, especially when one
thinks about KVM async PF support, which means vCPUs are literally almost
not blocked at all because the guest OS is smart enough to switch to
another task when a remote fault is needed.
However, latency is still sensitive and important because even if the guest
vCPU is running on threads that do not need a remote fault, the workload
that accesses some missing page is still affected.
Add two entries to the report, showing how long it takes to resolve a
remote fault. Mention in the QAPI doc that this is not the real average
fault latency, but only the ones that was requested for a remote fault.
Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
meanwhile add the entry checks in qtests for all postcopy tests.
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add a field to count how many remote faults one vCPU has taken. So far
it's still not used, but will be soon.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
With 64-bit fields, it is trivial. The caution is when exposing any values
in QMP, it was still declared with milliseconds (ms). Hence it's needed to
do the convertion when exporting the values to existing QMP queries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Now with 64bits, the offseting using start_time is not needed anymore,
because the array can always remember the whole timestamp.
Then drop the unused parameter in get_low_time_offset() altogether.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
I am guessing it was used to be 32bits because of the atomic ops. Now all
the atomic ops are gone and we're protected by a mutex instead, it's ok we
can switch to 64 bits.
Reasons to move over:
- Allow further patches to change the unit from ms to us: with postcopy
preempt mode, we're really into hundreds of microseconds level on
blocktime. We'd better be able to trap those.
- This also paves way for some other tricks that the original version
used to avoid overflows, e.g., start_time was almost only useful before
to make sure the sampled timestamp won't overflow a 32-bit field.
- This prepares further reports on top of existing data collected,
e.g. average page fault latencies. When average operation is taken into
account, milliseconds are simply too coarse grained.
When at it:
- Rename page_fault_vcpu_time to vcpu_blocktime_start.
- Rename vcpu_blocktime to vcpu_blocktime_total.
- Touch up the trace-events to not dump blocktime ctx pointer
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The postcopy blocktime feature was tricky that it used quite some atomic
operations over quite a few arrays and vars, without explaining how that
would be thread safe. The thread safety here is about concurrency between
the fault thread and the fault resolution threads, possible to access the
same chunk of data. All these atomic ops can be expensive too before
knowing clearly how it works.
OTOH, postcopy has one page_request_mutex used to serialize the received
bitmap updates. So far it's ok - we don't yet have a lot of threads
contending the lock. It might change after multifd will be supported, but
that's a separate story. What is important is, with that mutex, it's
pretty lightweight to move all the blocktime maintenance into the mutex
critical section. It's because the blocktime layer is lightweighted:
almost "remember which vcpu faulted on which address", and "ok we get some
fault resolved, calculate how long it takes". It's also an optional
feature for now (but I have thought of changing that, maybe in the future).
Let's push the blocktime layer into the mutex, so that it's always
thread-safe even without any atomic ops.
To achieve that, I'll need to add a tid parameter on fault path so that
it'll start to pass the faulted thread ID into deeper the stack, but not
too deep. When at it, add a comment for the shared fault handler (for
example, vhost-user devices running with postcopy), to mention a TODO. One
reason it might not be trivial is that vhost-user's userfaultfds should be
opened by vhost-user process, so it's pretty hard to control making sure
the TID feature will be around. It wasn't supported before, so keep it
like that for now.
Now we should be as ease when everything is protected by a mutex that we
always take anyway.
One side effect: we can finally remove one ramblock_recv_bitmap_test() in
mark_postcopy_blocktime_begin(), which was pretty weird and which also
includes a weird (but maybe necessary.. but maybe not?) operation to inject
a blocktime entry then quickly erase it.. When we're with the mutex, and
when we make sure it's invoked after checking the receive bitmap, it's not
needed anymore. Instead, we assert.
As another side effect, this paves way for removing all atomic ops in all
the mem accesses in blocktime layer.
Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
builds.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
This is a follow up on the other commit "migration/ram: avoid to do log
clear in the last round" but for postcopy.
https://lore.kernel.org/r/20250514115827.3216082-1-yanfei.xu@bytedance.com
I can observe more than 10% reduction of average page fault latency during
postcopy phase with this optimization:
Before: 268.00us (+-1.87%)
After: 232.67us (+-2.01%)
The test was done with a 16GB VM with 80 vCPUs, running a workload that
busy random writes to 13GB memory.
Cc: Yanfei Xu <yanfei.xu@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
There're a few things off here in that logic, rewrite it. When at it, add
rich comment to explain each of the decisions.
Since this is very sensitive path for migration, below are the list of
things changed with their reasonings.
(1) Exact pending size is only needed for precopy not postcopy
Fundamentally it's because "exact" version only does one more deep
sync to fetch the pending results, while in postcopy's case it's
never going to sync anything more than estimate as the VM on source
is stopped.
(2) Do _not_ rely on threshold_size anymore to decide whether postcopy
should complete
threshold_size was calculated from the expected downtime and
bandwidth only during precopy as an efficient way to decide when to
switchover. It's not sensible to rely on threshold_size in postcopy.
For precopy, if switchover is decided, the migration will complete
soon. It's not true for postcopy. Logically speaking, postcopy
should only complete the migration if all pending data is flushed.
Here it used to work because save_complete() used to implicitly
contain save_live_iterate() when there's pending size.
Even if that looks benign, having RAMs to be migrated in postcopy's
save_complete() has other bad side effects:
(a) Since save_complete() needs to be run once at a time, it means
when moving RAM there's no way moving other things (rather than
round-robin iterating the vmstate handlers like what we do with
ITERABLE phase). Not an immediate concern, but it may stop working
in the future when there're more than one iterables (e.g. vfio
postcopy).
(b) postcopy recovery, unfortunately, only works during ITERABLE
phase. IOW, if src QEMU moves RAM during postcopy's save_complete()
and network failed, then it'll crash both QEMUs... OTOH if it failed
during iteration it'll still be recoverable. IOW, this change should
further reduce the window QEMU split brain and crash in extreme cases.
If we enable the ram_save_complete() tracepoints, we'll see this
before this patch:
1267959@1748381938.294066:ram_save_complete dirty=9627, done=0
1267959@1748381938.308884:ram_save_complete dirty=0, done=1
It means in this migration there're 9627 pages migrated at complete()
of postcopy phase.
After this change, all the postcopy RAM should be migrated in iterable
phase, rather than save_complete():
1267959@1748381938.294066:ram_save_complete dirty=0, done=0
1267959@1748381938.308884:ram_save_complete dirty=0, done=1
(3) Adjust when to decide to switch to postcopy
This shouldn't be super important, the movement makes sure there's
only one in_postcopy check, then we are clear on what we do with the
two completely differnt use cases (precopy v.s. postcopy).
(4) Trivial touch up on threshold_size comparision
Which changes:
"(!pending_size || pending_size < s->threshold_size)"
into:
"(pending_size <= s->threshold_size)"
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-11-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Take notes on start/end state of dirty pages for the whole system.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The check over PAGE_DIRTY_FOUND isn't necessary. We could indent one less
and assert that instead.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Since we use the same save_complete() hook for both precopy and postcopy,
add a set of helpers to invoke the hook() to dedup the code.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Now after merging the precopy and postcopy version of complete() hook,
rename the precopy version from save_live_complete_precopy() to
save_complete().
Dropping the "live" when at it, because it's in most cases not live when
happening (in precopy).
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-7-peterx@redhat.com
[peterx: squash the fixup that covers a few more doc spots, per Juraj]
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The hook is only defined in two vmstate users ("ram" and "block dirty
bitmap"), meanwhile both of them define the hook exactly the same as the
precopy version. Hence, this postcopy version isn't needed.
No functional change intended.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Move it out of vanilla postcopy session, but instead a standalone feature.
When at it, removing the NOTE because it's incorrect now after introduction
of max-postcopy-bandwidth, which can control the throughput even for
postcopy phase.
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Unfortunately, it was never correctly shown..
This is only found when I started to look into making the blocktime feature
more useful (so as to avoid using bpftrace, even though I'm not sure which
one will be harder to use..).
So the old dump would look like this:
Postcopy vCPU Blocktime: 0-1,4,10,21,33,46,48,59
Even though there're actually 40 vcpus, and the string will merge same
elements and also sort them.
To fix it, simply loop over the uint32List manually. Now it looks like:
Postcopy vCPU Blocktime (ms):
[15, 0, 0, 43, 29, 34, 36, 29, 37, 41,
33, 37, 45, 52, 50, 38, 40, 37, 40, 49,
40, 35, 35, 35, 81, 19, 18, 19, 18, 30,
22, 3, 0, 0, 0, 0, 0, 0, 0, 0]
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Cc: Alexey Perevalov <a.perevalov@samsung.com>
Cc: Markus Armbruster <armbru@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613140801.474264-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Dave suggested the HMP output for "info migrate" can not only leverage the
lines but also better grouping:
https://lore.kernel.org/r/aC4_-nMc7FwsMf9p@gallifrey
I followed Dave's suggestion, and some more modifications on top:
- Added all elements into the picture
- Use size_to_str() and drop most of the units: benefit is more friendly
to most human eyes, bad side effect is lose of details, but that should
be corner case per my uses, and one can still leverage the QMP interface
when necessary.
- Sub-grouping for "Transfers" ("Channels" and "Page Types").
- Better indentations
Sample output:
(qemu) info migrate
Status: postcopy-active
Time (ms): total=47317, setup=5, down=8
RAM info:
Throughput (Mbps): 1342.83
Sizes: pagesize=4 KiB, total=4.02 GiB
Transfers: transferred=1.41 GiB, remain=2.46 GiB
Channels: precopy=15.2 MiB, multifd=0 B, postcopy=1.39 GiB
Page Types: normal=367713, zero=41195
Page Rates (pps): transfer=40900, dirty=4
Others: dirty_syncs=2, postcopy_req=57503
Suggested-by: Dr. David Alan Gilbert <dave@treblig.org>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Juraj Marcin <jmarcin@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613140801.474264-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Runs a binary from the max78000test repo used in
developing the qemu implementation of the max78000
to verify that the machine and implemented devices
generally still work.
Signed-off-by: Jackson Donaldson <jcksn@duck.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-id: 20250711110626.624534-3-jcksn@duck.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>