qemu-cr16

Author	SHA1	Message	Date
Daniel Borkmann	e53d9ec7cc	net/af-xdp: Support pinned map path for AF_XDP sockets Extend 'inhibit=on' setting with the option to specify a pinned XSK map path along with a starting index (default 0) to push the created XSK sockets into. Example usage: # ./build/qemu-system-x86_64 [...] \ -netdev af-xdp,ifname=enp2s0f0np0,id=net0,mode=native,queues=2,start-queue=14,inhibit=on,map-path=/sys/fs/bpf/xsks_map,map-start-index=14 \ -device virtio-net-pci,netdev=net0 [...] This is useful for the case where an existing XDP program with XSK map is present on the AF_XDP supported phys device and the XSK map is not yet populated. For example, the former could have been pre-loaded onto the netdevice by a control plane, which later launches QEMU to populate it with XSK sockets. Normally, the main idea behind 'inhibit=on' is that the QEMU instance doesn't need to have a lot of privileges to use the pre-loaded program and the pre-created sockets, but this mentioned use-case here is different where QEMU still needs privileges to create the sockets. The 'map-start-index' parameter is optional and defaults to 0. It allows flexible placement of the XSK sockets, and is up to the user to specify when the XDP program with XSK map was already preloaded. In the simplest case the queue-to-map-slot mapping is just 1:1 based on ctx->rx_queue_index but the user might as well have a different scheme (or smaller map size, e.g. ctx->rx_queue_index % max_size) to push the inbound traffic to one of the XSK sockets. Note that the bpf_xdp_query_id() is now only tested for 'inhibit=off' since only in the latter case the libxdp takes care of installing the XDP program which was installed based on the s->xdp_flags pointing to either driver or skb mode. For 'inhibit=on' we don't make any assumptions and neither go down the path of probing all possible options in which way the user installed the XDP program. Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ilya Maximets <i.maximets@ovn.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-15 10:26:55 +08:00
Daniel Borkmann	bdebcb49f4	net/af-xdp: Fix up cleanup path upon failure in queue creation While testing, it turned out that upon error in the queue creation loop, we never trigger the af_xdp_cleanup() handler. This is because we pass errp instead of a local err pointer into the various AF_XDP setup functions instead of a scheme like: bool fn(..., Error *errp) { Error err = NULL; foo(arg, &err); if (err) { handle the error... error_propagate(errp, err); return false; } ... } The same is true for the attachment probing with bpf_xdp_query_id(). With a conversion into the above format, the af_xdp_cleanup() handler is called as expected. Note the error_propagate() handles a NULL err internally. Fixes: `cb039ef3d9` ("net: add initial support for AF_XDP network backend") Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ilya Maximets <i.maximets@ovn.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-15 10:26:16 +08:00
Daniel Borkmann	d38c5e0d0c	net/af-xdp: Remove XDP program cleanup logic There are two issues with the XDP program removal in af_xdp_cleanup(): 1) Starting from libxdp 1.3.0 [0] the XDP program gets automatically detached when we call xsk_socket__delete() for the last successfully configured queue. libxdp internally keeps track of that. For QEMU we require libxdp >= 1.4.0. Given QEMU is not loading the program, lets also not attempt to remove it and delegate this instead. 2) The removal logic is incorrect anyway because we are setting n_queues into the last queue that never has xdp_flags on failure, so the logic is always skipped since the non-zero test for s->xdp_flags in af_xdp_cleanup() fails. Fixes: `cb039ef3d9` ("net: add initial support for AF_XDP network backend") Suggested-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Cc: Ilya Maximets <i.maximets@ovn.org> Cc: Jason Wang <jasowang@redhat.com> Cc: Anton Protopopov <aspsk@isovalent.com> Link: `38c2914988` [0] Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-15 10:26:08 +08:00
Laurent Vivier	da703b06a5	net/passt: Implement vhost-user backend support This commit adds support for the vhost-user interface to the passt network backend, enabling high-performance, accelerated networking for guests using passt. The passt backend can now operate in a vhost-user mode, where it communicates with the guest's virtio-net device over a socket pair using the vhost-user protocol. This offloads the datapath from the main QEMU loop, significantly improving network performance. When the vhost-user=on option is used with -netdev passt, the new vhost initialization path is taken instead of the standard stream-based connection. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	854ee02b22	net: Add passt network backend This commit introduces support for passt as a new network backend. passt is an unprivileged, user-mode networking solution that provides connectivity for virtual machines by launching an external helper process. The implementation reuses the generic stream data handling logic. It launches the passt binary using GSubprocess, passing it a file descriptor from a socketpair() for communication. QEMU connects to the other end of the socket pair to establish the network data stream. The PID of the passt daemon is tracked via a temporary file to ensure it is terminated when QEMU exits. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	ba5acc5d6e	net: Add is_vhost_user flag to vhost_net struct Introduce a boolean is_vhost_user field to the vhost_net structure. This flag is initialized during vhost_net_init based on whether the backend is vhost-user. This refactoring simplifies checks for vhost-user specific behavior, replacing direct comparisons of 'net->nc->info->type' with the new flag. It improves readability and encapsulates the backend type information directly within the vhost_net instance. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	33b78a30a3	net: Allow network backends to advertise max TX queue size This commit refactors how the maximum transmit queue size for virtio-net devices is determined, making the mechanism more generic and extensible. Previously, virtio_net_max_tx_queue_size() contained hardcoded checks for specific network backend types (vhost-user and vhost-vdpa) to determine their supported maximum queue size. This created direct dependencies and would require modifications for every new backend that supports variable queue sizes. To improve flexibility, a new max_tx_queue_size field is added to the vhost_net structure. This allows each network backend to advertise its supported maximum transmit queue size directly. The virtio_net_max_tx_queue_size() function now retrieves the max TX queue size from the vhost_net struct, if available and set. Otherwise, it defaults to VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	1652f1b335	net: Add save_acked_features callback to vhost_net This commit introduces a save_acked_features function pointer to vhost_net and converts the vhost_net function into a generic dispatcher. The vhost-user backend provides the callback, making its function static. With this change, no other module has a direct dependency on the vhost-user implementation. This cleanup allows for the complete removal of the net/vhost-user.h header file. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	bd38794a11	net: Add get_acked_features callback to VhostNetOptions This patch continues the effort to decouple the generic vhost layer from specific network backend implementations. Previously, the vhost_net initialization code contained a hardcoded check for the vhost-user client type to retrieve its acked features by calling vhost_user_get_acked_features(). This exposed an internal vhost-user function in a public header and coupled the two modules. The vhost-user backend is updated to provide a callback, and its getter function is now static. The call site in vhost_net.c is simplified to use the new generic helper, removing the type check and the direct dependency. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	effdacbf28	net: Consolidate vhost feature bits into vhost_net structure Previously, the vhost_net_get_feature_bits() function in hw/net/vhost_net.c used a large switch statement to determine the appropriate feature bits based on the NetClientDriver type. This created unnecessary coupling between the generic vhost layer and specific network backends (like TAP, vhost-user, and vhost-vdpa). This patch moves the definition of vhost feature bits directly into the vhost_net structure for each relevant network client. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	8f6e5c620a	net: Add get_vhost_net callback to NetClientInfo The get_vhost_net() function previously contained a large switch statement to find the VHostNetState pointer based on the net client's type. This created a tight coupling, requiring the generic vhost layer to be aware of every specific backend that supported vhost, such as tap, vhost-user, and vhost-vdpa. This approach is not scalable and requires modifying a central function for any new backend. It also forced each backend to expose its internal getter function in a public header file. This patch refactors the logic by introducing a new get_vhost_net function pointer to the NetClientInfo struct. The central get_vhost_net() function is now a simple, generic dispatcher that invokes the callback provided by the net client. Each backend now implements its own private getter and registers it in its NetClientInfo. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	7136352b40	vhost_net: Rename vhost_set_vring_enable() for clarity This is a cosmetic change with no functional impact. The function vhost_set_vring_enable() is specific to vhost_net and is used outside of vhost_net.c (specifically, in hw/net/virtio-net.c). To prevent confusion with other similarly named vhost functions, such as the one found in cryptodev-vhost.c, it has been renamed to vhost_net_set_vring_enable(). This clarifies that the function belongs to the vhost_net module. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	9b5b45c717	net: Define net_client_set_link() The code to set the link status is currently located in qmp_set_link(). This function identifies the device by name, searches for the corresponding NetClientState, and then updates the link status. In some parts of the code, such as vhost-user.c, the NetClientState are already available. Calling qmp_set_link() from these locations leads to a redundant search for the clients. This patch refactors the logic by introducing a new function, net_client_set_link(), which accepts a NetClientState array directly. qmp_set_link() is simplified to be a wrapper that performs the client search and then calls the new function. The vhost-user implementation is updated to use net_client_set_link() directly, thereby eliminating the unnecessary client lookup. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Laurent Vivier	adf684ce69	net: Refactor stream logic for reuse in '-net passt' To prepare for the implementation of '-net passt', this patch moves the generic stream handling functions from net/stream.c into new net/stream_data.c and net/stream_data.h files. This refactoring introduces a NetStreamData struct that encapsulates the generic fields and logic previously in NetStreamState. The NetStreamState now embeds NetStreamData and delegates the core stream operations to the new generic functions. To maintain flexibility for different users of this generic code, callbacks for send and listen operations are now passed via function pointers within the NetStreamData struct. This allows callers to provide their own specific implementations while reusing the common connection and data transfer logic. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:27:09 +08:00
Akihiko Odaki	adda0ad56b	virtio-net: Add queues for RSS during migration virtio_net_pre_load_queues() inspects vdev->guest_features to tell if VIRTIO_NET_F_RSS or VIRTIO_NET_F_MQ is enabled to infer the required number of queues. This works for VIRTIO_NET_F_MQ but it doesn't for VIRTIO_NET_F_RSS because only the lowest 32 bits of vdev->guest_features is set at the point and VIRTIO_NET_F_RSS uses bit 60 while VIRTIO_NET_F_MQ uses bit 22. Instead of inferring the required number of queues from vdev->guest_features, use the number loaded from the vm state. This change also has a nice side effect to remove a duplicate peer queue pair change by circumventing virtio_net_set_multiqueue(). Also update the comment in include/hw/virtio/virtio.h to prevent an implementation of pre_load_queues() from refering to any fields being loaded during migration by accident in the future. Fixes: `8c49756825` ("virtio-net: Add only one queue pair when realizing") Tested-by: Lei Yang <leiyang@redhat.com> Cc: qemu-stable@nongnu.org Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 13:26:52 +08:00
Anastasia Belova	110d0fa2d4	net: fix buffer overflow in af_xdp_umem_create() s->pool has n_descs elements so maximum i should be n_descs - 1. Fix the upper bound. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `cb039ef3d9` ("net: add initial support for AF_XDP network backend") Cc: qemu-stable@nongnu.org Reviewed-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Anastasia Belova <nabelova31@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com>	2025-07-14 10:13:37 +08:00
Stefan Hajnoczi	9a4e273dde	fpu: Process float_muladd_negate_result after rounding tcg: Use uintptr_t in tcg_malloc implementation linux-user: Hold the fd-trans lock across fork linux-user: Implement fchmodat2 syscall linux-user: Check for EFAULT failure in nanosleep linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC linux-user/gen-vdso: Handle fseek() failure linux-user/gen-vdso: Don't read off the end of buf[] -----BEGIN PGP SIGNATURE----- iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmhxSAkdHHJpY2hhcmQu aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV9wiQf+PrXwKj+FusE0YU1y Lnx6+S0M/lDRCNhbgBrw7JK5WUwIfnZQuepf0vjuhoHH1rUdT1EUYdJ7Quwj9fgG 0YcKRD8OAVKNU8I3ydtzSaJ3TZ02nbbDbwGMoD/eNXGKx0Gt5907vD4PrjT+mByG 6QTLwuql3ahkl/Tiskk2LwbmHRe0CXiezVuzgprbNiyxrgDT8ArqCq+VJzv/wb2O 4t6BqRDvBzRe7MUUs2B2W+hs0HW4Rfqcye/3rRnYe7HA4CTiVNqY9rwgrQqGEO0P 3Cf+VaF6CaLz+HuHfM8rz+xBhfo+UpZYOVMXk/7VEAG6geMKTcQG1tCJYhL+xklJ 9r4ABw== =rD+6 -----END PGP SIGNATURE----- Merge tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu into staging fpu: Process float_muladd_negate_result after rounding tcg: Use uintptr_t in tcg_malloc implementation linux-user: Hold the fd-trans lock across fork linux-user: Implement fchmodat2 syscall linux-user: Check for EFAULT failure in nanosleep linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC linux-user/gen-vdso: Handle fseek() failure linux-user/gen-vdso: Don't read off the end of buf[] # -----BEGIN PGP SIGNATURE----- # # iQFRBAABCgA7FiEEekgeeIaLTbaoWgXAZN846K9+IV8FAmhxSAkdHHJpY2hhcmQu # aGVuZGVyc29uQGxpbmFyby5vcmcACgkQZN846K9+IV9wiQf+PrXwKj+FusE0YU1y # Lnx6+S0M/lDRCNhbgBrw7JK5WUwIfnZQuepf0vjuhoHH1rUdT1EUYdJ7Quwj9fgG # 0YcKRD8OAVKNU8I3ydtzSaJ3TZ02nbbDbwGMoD/eNXGKx0Gt5907vD4PrjT+mByG # 6QTLwuql3ahkl/Tiskk2LwbmHRe0CXiezVuzgprbNiyxrgDT8ArqCq+VJzv/wb2O # 4t6BqRDvBzRe7MUUs2B2W+hs0HW4Rfqcye/3rRnYe7HA4CTiVNqY9rwgrQqGEO0P # 3Cf+VaF6CaLz+HuHfM8rz+xBhfo+UpZYOVMXk/7VEAG6geMKTcQG1tCJYhL+xklJ # 9r4ABw== # =rD+6 # -----END PGP SIGNATURE----- # gpg: Signature made Fri 11 Jul 2025 13:21:13 EDT # gpg: using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F # gpg: issuer "richard.henderson@linaro.org" # gpg: Good signature from "Richard Henderson <richard.henderson@linaro.org>" [full] # Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A 05C0 64DF 38E8 AF7E 215F * tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu: linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC tcg: Use uintptr_t in tcg_malloc implementation linux-user: Hold the fd-trans lock across fork linux-user/mips/o32: Drop sa_restorer functionality linux-user/gen-vdso: Don't read off the end of buf[] linux-user/gen-vdso: Handle fseek() failure linux-user: Check for EFAULT failure in nanosleep linux-user: Implement fchmodat2 syscall fpu: Process float_muladd_negate_result after rounding Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-07-13 01:46:04 -04:00
Stefan Hajnoczi	52af79811f	Migration pull request - General cleanups around: postcopy, bg-snapshot, migration hooks, migration completion and formatting of 'info migrate'. - Overhaul of postcopy blocktime tracking. -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEqhtIsKIjJqWkw2TPx5jcdBvsMZ0FAmhxGdgQHGZhcm9zYXNA c3VzZS5kZQAKCRDHmNx0G+wxnahoD/9uNXirlmRk3tDnhiJsiYx+HnXYPFEORSZq zlpUyqvhQ1POp3Fa5pRf+bJ5mmPw8h8PdOR2StMpnW2Xa1OatAZj5m1uityAVWOl EkVfZLl0j6j9HCCmE3c4dztOGIBsd9YY0GWizL05XHYZPrdX4zOpolMN4m53RwQY HUVD6T2y9eFDnCO6MsoA9EfmkFYCRvqlS0VzTcYzQFN4H+QHlcpDfweqJpTLPa+1 trahAN9PBuMjoewjDqwkNkf0CLaCXHszAfj6yv62Vi8Cbp9DDPywIYJKFnxspElW Fjg1b4MdsbYZNmeKgIawzgTOL1RrojvKkoi7KWp3D7M+/ZZl9kBwQuUcBXKI7N0R Y0GNfkkTycn18nM0JU/6QWSuVeiPbLArxQUGP1cLgvcHSSNgD9JxWbNBu5+1fFOG Gg3qnyYatJ6xJDiCrdKqV8fwozNlm/G6b9BiCDeVq+4nA2OKQ0shiNA1GZHvVSQL X4uAPexETdHfA/LeA2w5sgVBEw7BewBdjLntZDIFsyBnLrvqrDcU5Aav0wiHoI8U QBC2aIpJfMLHiIQ93mVX96NltXC7KvJTIZVl3iwfiYEYCvQtTYgdJ09ELXFJYxFX XpTTazqpmPSfuZpPRgx9YbDP/kS8Fg/PTOlPeD0T/frFgd1S6Thh6OW455PavMp8 ht2lE4sxjA== =vtRD -----END PGP SIGNATURE----- Merge tag 'migration-20250711-pull-request' of https://gitlab.com/farosas/qemu into staging Migration pull request - General cleanups around: postcopy, bg-snapshot, migration hooks, migration completion and formatting of 'info migrate'. - Overhaul of postcopy blocktime tracking. # -----BEGIN PGP SIGNATURE----- # # iQJEBAABCAAuFiEEqhtIsKIjJqWkw2TPx5jcdBvsMZ0FAmhxGdgQHGZhcm9zYXNA # c3VzZS5kZQAKCRDHmNx0G+wxnahoD/9uNXirlmRk3tDnhiJsiYx+HnXYPFEORSZq # zlpUyqvhQ1POp3Fa5pRf+bJ5mmPw8h8PdOR2StMpnW2Xa1OatAZj5m1uityAVWOl # EkVfZLl0j6j9HCCmE3c4dztOGIBsd9YY0GWizL05XHYZPrdX4zOpolMN4m53RwQY # HUVD6T2y9eFDnCO6MsoA9EfmkFYCRvqlS0VzTcYzQFN4H+QHlcpDfweqJpTLPa+1 # trahAN9PBuMjoewjDqwkNkf0CLaCXHszAfj6yv62Vi8Cbp9DDPywIYJKFnxspElW # Fjg1b4MdsbYZNmeKgIawzgTOL1RrojvKkoi7KWp3D7M+/ZZl9kBwQuUcBXKI7N0R # Y0GNfkkTycn18nM0JU/6QWSuVeiPbLArxQUGP1cLgvcHSSNgD9JxWbNBu5+1fFOG # Gg3qnyYatJ6xJDiCrdKqV8fwozNlm/G6b9BiCDeVq+4nA2OKQ0shiNA1GZHvVSQL # X4uAPexETdHfA/LeA2w5sgVBEw7BewBdjLntZDIFsyBnLrvqrDcU5Aav0wiHoI8U # QBC2aIpJfMLHiIQ93mVX96NltXC7KvJTIZVl3iwfiYEYCvQtTYgdJ09ELXFJYxFX # XpTTazqpmPSfuZpPRgx9YbDP/kS8Fg/PTOlPeD0T/frFgd1S6Thh6OW455PavMp8 # ht2lE4sxjA== # =vtRD # -----END PGP SIGNATURE----- # gpg: Signature made Fri 11 Jul 2025 10:04:08 EDT # gpg: using RSA key AA1B48B0A22326A5A4C364CFC798DC741BEC319D # gpg: issuer "farosas@suse.de" # gpg: Good signature from "Fabiano Rosas <farosas@suse.de>" [unknown] # gpg: aka "Fabiano Almeida Rosas <fabiano.rosas@suse.com>" [unknown] # gpg: WARNING: The key's User ID is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: AA1B 48B0 A223 26A5 A4C3 64CF C798 DC74 1BEC 319D * tag 'migration-20250711-pull-request' of https://gitlab.com/farosas/qemu: (26 commits) migration: Rename save_live_complete_precopy_thread to save_complete_precopy_thread migration/postcopy: Add latency distribution report for blocktime migration/postcopy: blocktime allows track / report non-vCPU faults migration/postcopy: Optimize blocktime fault tracking with hashtable migration/postcopy: Cleanup the total blocktime accounting migration/postcopy: Cache the tid->vcpu mapping for blocktime migration/postcopy: Initialize blocktime context only until listen migration/postcopy: Report fault latencies in blocktime migration/postcopy: Add blocktime fault counts per-vcpu migration/postcopy: Bring blocktime layer to ns level migration/postcopy: Drop PostcopyBlocktimeContext.start_time migration/postcopy: Make all blocktime vars 64bits migration/postcopy: Drop all atomic ops in blocktime feature migration/postcopy: Push blocktime start/end into page req mutex migration: Add option to set postcopy-blocktime migration/postcopy: Avoid clearing dirty bitmap for postcopy too migration: Rewrite the migration complete detect logic migration/ram: Add tracepoints for ram_save_complete() migration/ram: One less indent for ram_find_and_save_block() migration: qemu_savevm_complete*() helpers ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-07-13 01:45:30 -04:00
Stefan Hajnoczi	0edc2afe0c	target-arm queue: * New board type max78000fthr * Enable use of CXL on Arm 'virt' board * Some more tidyup of ID register handling * Refactor AT insns and PMU regs into separate source files * Don't enforce NSE,NS check for EL3->EL3 returns * hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ * Allow nested-virtualization with KVM on the 'virt' board * system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict * hw/arm/virt-acpi-build: Don't create ITS id mappings by default * target/arm: Remove unused helper_sme2_luti4_4b -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmhxEcoZHHBldGVyLm1h eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3j5yEACWYnNeqo8Yph6/EJExE6eV r0tC6FBb5ShPgA6kDxhpOc1lI6uXGh8+D7bL9BePEdz/brCf1QDfs2Z4q/hb5ysX D0H6VI5Gr1j6MjkFRBo3+vvYz4Yh++XLn5Q9lZv8zaSEdraq/ay2kxnuhRCK+4Ar +QoGtKrGMJ7UCpfiRlvNnd1UjgORZf10EE/bRImX13sxeDomP3CZhFzAyJyShOP9 JA7bAd4rYJ4oj8R33y8Yaxjwm4FOndj740B0zwpO8mpjzFiE5zbqsaO+mEgYSflc OQisCu/KRFpyIR+UqP+4gNaJLfKQW5Y4r61zEaiJWV/c4RdKNnbK1f7MX11fNhOk k1paF3GIXp6f794Hb14vtsYnKHF2eeNSmRkAomXxLgUSYzLezL+yj7cdYmRJhgYU thc1PSiEmHYhjRmOaMC9+dkMtvIexWyDNYNFTygoOE5/kTMSazeTFQpFmw+ZuTee 9pjKsYRZJgTa64IkJy1L34jc2gds48Q20KpQsqZ22KQcjwt4PW4eQXkvMylawSut mArHVH6AAxIK+defeEmnQCJ0OccyGCENjRDuWyWMMGoP/ggZpO47rGWmCUOK8xz8 IfGdPeF/9xsKSKWvjpiHyyKa48wuO2bVC+5bISS6IPA2uGneS2DpmjkHU+gHBqpk GNlvEnXZfavZOHejE7/L/Q== =hJ4/ -----END PGP SIGNATURE----- Merge tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu into staging target-arm queue: * New board type max78000fthr * Enable use of CXL on Arm 'virt' board * Some more tidyup of ID register handling * Refactor AT insns and PMU regs into separate source files * Don't enforce NSE,NS check for EL3->EL3 returns * hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ * Allow nested-virtualization with KVM on the 'virt' board * system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict * hw/arm/virt-acpi-build: Don't create ITS id mappings by default * target/arm: Remove unused helper_sme2_luti4_4b # -----BEGIN PGP SIGNATURE----- # # iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmhxEcoZHHBldGVyLm1h # eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3j5yEACWYnNeqo8Yph6/EJExE6eV # r0tC6FBb5ShPgA6kDxhpOc1lI6uXGh8+D7bL9BePEdz/brCf1QDfs2Z4q/hb5ysX # D0H6VI5Gr1j6MjkFRBo3+vvYz4Yh++XLn5Q9lZv8zaSEdraq/ay2kxnuhRCK+4Ar # +QoGtKrGMJ7UCpfiRlvNnd1UjgORZf10EE/bRImX13sxeDomP3CZhFzAyJyShOP9 # JA7bAd4rYJ4oj8R33y8Yaxjwm4FOndj740B0zwpO8mpjzFiE5zbqsaO+mEgYSflc # OQisCu/KRFpyIR+UqP+4gNaJLfKQW5Y4r61zEaiJWV/c4RdKNnbK1f7MX11fNhOk # k1paF3GIXp6f794Hb14vtsYnKHF2eeNSmRkAomXxLgUSYzLezL+yj7cdYmRJhgYU # thc1PSiEmHYhjRmOaMC9+dkMtvIexWyDNYNFTygoOE5/kTMSazeTFQpFmw+ZuTee # 9pjKsYRZJgTa64IkJy1L34jc2gds48Q20KpQsqZ22KQcjwt4PW4eQXkvMylawSut # mArHVH6AAxIK+defeEmnQCJ0OccyGCENjRDuWyWMMGoP/ggZpO47rGWmCUOK8xz8 # IfGdPeF/9xsKSKWvjpiHyyKa48wuO2bVC+5bISS6IPA2uGneS2DpmjkHU+gHBqpk # GNlvEnXZfavZOHejE7/L/Q== # =hJ4/ # -----END PGP SIGNATURE----- # gpg: Signature made Fri 11 Jul 2025 09:29:46 EDT # gpg: using RSA key E1A5C593CD419DE28E8315CF3C2525ED14360CDE # gpg: issuer "peter.maydell@linaro.org" # gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>" [full] # gpg: aka "Peter Maydell <pmaydell@gmail.com>" [full] # gpg: aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>" [full] # gpg: aka "Peter Maydell <peter@archaic.org.uk>" [unknown] # Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83 15CF 3C25 25ED 1436 0CDE * tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu: (36 commits) tests/functional: Add a test for the MAX78000 arm machine docs/system: arm: Add max78000 board description target/arm: Remove helper_sme2_luti4_4b hw/arm/virt-acpi-build: Don't create ITS id mappings by default system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict hw/arm/virt: Allow virt extensions with KVM hw/arm/arm_gicv3_kvm: Add a migration blocker with kvm nested virt target/arm: Enable feature ARM_FEATURE_EL2 if EL2 is supported target/arm/kvm: Add helper to detect EL2 when using KVM hw/arm: Allow setting KVM vGIC maintenance IRQ hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ target/arm: Don't enforce NSE,NS check for EL3->EL3 returns target/arm: Split out performance monitor regs to cpregs-pmu.c target/arm: Split out AT insns to tcg/cpregs-at.c target/arm: Drop stub for define_tlb_insn_regs arm/kvm: shorten one overly long line arm/cpu: store clidr into the idregs array arm/cpu: fix trailing ',' for SET_IDREG arm/cpu: store id_aa64afr{0,1} into the idregs array arm/cpu: store id_afr0 into the idregs array ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-07-13 01:45:18 -04:00
Stefan Hajnoczi	3adbf0bb8a	* s390x: Allow to select different entries when booting via pxelinux.cfg * Link s390-ccw.img statically * Fix broken bamboo functional test * s390x code cleanups and refactorings -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmhw2i0RHHRodXRoQHJl ZGhhdC5jb20ACgkQLtnXdP5wLbUGtA//XVr5t2/iH+zFdaHHFglMtYkqwyYspa/O zGPgcIZptQrzlbR+GFJwd4ae1HWb60E1YDyC7M1iWGQXeMNrDgeJJjUQfhB7693Y CPT1FCWaqXdrTHQJhf5+EGJZopwY1K4EHs+bMxCpU3ManD+MKuXzCgOMzZATnPUZ EcvOrzDBfEFEzQn5COUi5FF5Ds4DpOqQY1g1tpG92hQwWeAgdPPXSYlakG64Hm8C Km6BzAcylrRiHdORk3GeMJ1cPQ3vCjMrjTd87ra/xuH+DvPeyZ31cRIWIP1dn44x eog5dWo7pNmwfU50c4w/6dTSqwHG/bD/2ZPJH2nnJDLK02WeguantPN43fdoPU0c NEMldVE5GAqEr7Sbd5YIw9lBqrROIDfeUAxje4VZa1gSY4N/GYMGEZaM5vqYJJTP 0ndWP83QdamWuE0eOYMA+4oZiPpW79+Igv/PV13lsm9JgvO0WQisPFxE0cZqMTQp +wgbQ69rpyMiQxpusiL/6LA3khDyC8Z8g7cmjBfpqgwmVAZp7ly+GLk+ctG0zsjE hB99hkujZVkBZQLnVs0C/pXn1NdJ0wEupiHOSsVlQtqzNHlbweRJoxuGSp4Rl0Et 0DnTr3YHB6bdvRazaKzlkBHLLAXKEw0/xaRWGbE4tftZIrkOEeE0LMLLaLWLNKhX rqRoxq00OPs= =SOH3 -----END PGP SIGNATURE----- Merge tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu into staging * s390x: Allow to select different entries when booting via pxelinux.cfg * Link s390-ccw.img statically * Fix broken bamboo functional test * s390x code cleanups and refactorings # -----BEGIN PGP SIGNATURE----- # # iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmhw2i0RHHRodXRoQHJl # ZGhhdC5jb20ACgkQLtnXdP5wLbUGtA//XVr5t2/iH+zFdaHHFglMtYkqwyYspa/O # zGPgcIZptQrzlbR+GFJwd4ae1HWb60E1YDyC7M1iWGQXeMNrDgeJJjUQfhB7693Y # CPT1FCWaqXdrTHQJhf5+EGJZopwY1K4EHs+bMxCpU3ManD+MKuXzCgOMzZATnPUZ # EcvOrzDBfEFEzQn5COUi5FF5Ds4DpOqQY1g1tpG92hQwWeAgdPPXSYlakG64Hm8C # Km6BzAcylrRiHdORk3GeMJ1cPQ3vCjMrjTd87ra/xuH+DvPeyZ31cRIWIP1dn44x # eog5dWo7pNmwfU50c4w/6dTSqwHG/bD/2ZPJH2nnJDLK02WeguantPN43fdoPU0c # NEMldVE5GAqEr7Sbd5YIw9lBqrROIDfeUAxje4VZa1gSY4N/GYMGEZaM5vqYJJTP # 0ndWP83QdamWuE0eOYMA+4oZiPpW79+Igv/PV13lsm9JgvO0WQisPFxE0cZqMTQp # +wgbQ69rpyMiQxpusiL/6LA3khDyC8Z8g7cmjBfpqgwmVAZp7ly+GLk+ctG0zsjE # hB99hkujZVkBZQLnVs0C/pXn1NdJ0wEupiHOSsVlQtqzNHlbweRJoxuGSp4Rl0Et # 0DnTr3YHB6bdvRazaKzlkBHLLAXKEw0/xaRWGbE4tftZIrkOEeE0LMLLaLWLNKhX # rqRoxq00OPs= # =SOH3 # -----END PGP SIGNATURE----- # gpg: Signature made Fri 11 Jul 2025 05:32:29 EDT # gpg: using RSA key 27B88847EEE0250118F3EAB92ED9D774FE702DB5 # gpg: issuer "thuth@redhat.com" # gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [full] # gpg: aka "Thomas Huth <thuth@redhat.com>" [full] # gpg: aka "Thomas Huth <huth@tuxfamily.org>" [full] # gpg: aka "Thomas Huth <th.huth@posteo.de>" [unknown] # Primary key fingerprint: 27B8 8847 EEE0 2501 18F3 EAB9 2ED9 D774 FE70 2DB5 * tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu: target/s390x: Have s390_cpu_halt() not return anything target/s390x: Expose s390_count_running_cpus() method target/s390x: Remove unused s390_cpu_[un]halt() user stubs tests/functional/test_ppc_bamboo: Replace broken link with working assets tests/functional: Add dependency to the keymap_targets pc-bios: Update the s390 bios images with the pxelinux.cfg loadparm changes pc-bios/s390-ccw: link statically tests/functional: Add a test for s390x pxelinux.cfg network booting pc-bios/s390-ccw: Add a boot menu for booting via pxelinux.cfg pc-bios/s390-ccw: Make get_boot_index() from menu.c global pc-bios/s390-ccw: Allow up to 31 entries for pxelinux.cfg pc-bios/s390-ccw: Allow to select a different pxelinux.cfg entry via loadparm hw/s390x/s390-pci-bus.c: Use g_assert_not_reached() in functions taking an ett target/s390x/tcg: Use vaddr in s390_probe_access() target/s390x/kvm: Use vaddr in find/insert_hw_breakpoint() Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-07-13 01:44:51 -04:00
Stefan Hajnoczi	43ec52b4c8	loongarch queue -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQNhkKjomWfgLCz0aQfewwSUazn0QUCaHCzhAAKCRAfewwSUazn 0egkAP0eJcYWSaG1xH6Gevx/hGYthFhJrQ2IwMlTDHQsx8PAtQEArnm+nQ3+ckzN 5ZHx7GR+hFTAy0WJSSndnLttYC1zsws= =kcDz -----END PGP SIGNATURE----- Merge tag 'pull-loongarch-20250711' of https://github.com/bibo-mao/qemu into staging loongarch queue # -----BEGIN PGP SIGNATURE----- # # iHUEABYKAB0WIQQNhkKjomWfgLCz0aQfewwSUazn0QUCaHCzhAAKCRAfewwSUazn # 0egkAP0eJcYWSaG1xH6Gevx/hGYthFhJrQ2IwMlTDHQsx8PAtQEArnm+nQ3+ckzN # 5ZHx7GR+hFTAy0WJSSndnLttYC1zsws= # =kcDz # -----END PGP SIGNATURE----- # gpg: Signature made Fri 11 Jul 2025 02:47:32 EDT # gpg: using EDDSA key 0D8642A3A2659F80B0B3D1A41F7B0C1251ACE7D1 # gpg: Good signature from "bibo mao <maobibo@loongson.cn>" [unknown] # gpg: WARNING: This key is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: 7044 3A00 19C0 E97A 31C7 13C4 8E86 8FB7 A176 9D4C # Subkey fingerprint: 0D86 42A3 A265 9F80 B0B3 D1A4 1F7B 0C12 51AC E7D1 * tag 'pull-loongarch-20250711' of https://github.com/bibo-mao/qemu: target/loongarch: Remove unnecessary page size validity checking target/loongarch: Fix CSR STLBPS register write emulation target/loongarch: Correct spelling in helper_csrwr_pwcl() hw/intc/loongarch_extioi: Move unrealize function to common code Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-07-13 01:44:30 -04:00
Peter Maydell	d6390204c6	linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC In the linux-user do_fork() function we try to set the FD_CLOEXEC flag on a pidfd like this: fcntl(pid_fd, F_SETFD, fcntl(pid_fd, F_GETFL) \| FD_CLOEXEC); This has two problems: (1) it doesn't check errors, which Coverity complains about (2) we use F_GETFL when we mean F_GETFD Deal with both of these problems by using qemu_set_cloexec() instead. That function will assert() if the fcntls fail, which is fine (we are inside fork_start()/fork_end() so we know nothing can mess around with our file descriptors here, and we just got this one from pidfd_open()). (As we are touching the if() statement here, we correct the indentation.) Coverity: CID 1508111 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Richard Henderson <richard.henderson@linaro.org> Message-ID: <20250711141217.1429412-1-peter.maydell@linaro.org>	2025-07-11 10:45:14 -06:00
Richard Henderson	c86da2b1dd	tcg: Use uintptr_t in tcg_malloc implementation Avoid ubsan failure with clang-20, tcg.h:715:19: runtime error: applying non-zero offset 64 to null pointer by not using pointers. Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>	2025-07-11 10:43:47 -06:00
Juraj Marcin	beeac2df5f	migration: Rename save_live_complete_precopy_thread to save_complete_precopy_thread Recent patch [1] renames the save_live_complete_precopy handler to save_complete, as the machine is not live in most cases when this handler is executed. The same is true also for save_live_complete_precopy_thread, therefore this patch removes the "live" keyword from the handler itself and related types to keep the naming unified. In contrast to save_complete, this handler is only executed at the end of precopy, therefore the "precopy" keyword is retained. [1]: https://lore.kernel.org/all/20250613140801.474264-7-peterx@redhat.com/ Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Cédric Le Goater <clg@redhat.com> Signed-off-by: Juraj Marcin <jmarcin@redhat.com> Link: https://lore.kernel.org/r/20250626085235.294690-1-jmarcin@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:39 -03:00
Peter Xu	3345fb3b6d	migration/postcopy: Add latency distribution report for blocktime Add the latency distribution too for blocktime, using order-of-two buckets. It accounts for all the faults, from either vCPU or non-vCPU threads. With prior rework, it's very easy to achieve by adding an array to account for faults in each buckets. Sample output for HMP (while for QMP it's simply an array): Postcopy Latency Distribution: [ 1 us - 2 us ]: 0 [ 2 us - 4 us ]: 0 [ 4 us - 8 us ]: 1 [ 8 us - 16 us ]: 2 [ 16 us - 32 us ]: 2 [ 32 us - 64 us ]: 3 [ 64 us - 128 us ]: 10169 [ 128 us - 256 us ]: 50151 [ 256 us - 512 us ]: 12876 [ 512 us - 1 ms ]: 97 [ 1 ms - 2 ms ]: 42 [ 2 ms - 4 ms ]: 44 [ 4 ms - 8 ms ]: 93 [ 8 ms - 16 ms ]: 138 [ 16 ms - 32 ms ]: 0 [ 32 ms - 65 ms ]: 0 [ 65 ms - 131 ms ]: 0 [ 131 ms - 262 ms ]: 0 [ 262 ms - 524 ms ]: 0 [ 524 ms - 1 sec ]: 0 [ 1 sec - 2 sec ]: 0 [ 2 sec - 4 sec ]: 0 [ 4 sec - 8 sec ]: 0 [ 8 sec - 16 sec ]: 0 Cc: Markus Armbruster <armbru@redhat.com> Acked-by: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-15-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:39 -03:00
Peter Xu	ed23a15976	migration/postcopy: blocktime allows track / report non-vCPU faults When used to report page fault latencies, the blocktime feature can be almost useless when KVM async page fault is enabled, because in most cases such remote fault will kickoff async page faults, then it's not trackable from blocktime layer. After all these recent rewrites to blocktime layer, it's finally so easy to also support tracking non-vCPU faults. It'll be even faster if we could always index fault records with TIDs, unfortunately we need to maintain the blocktime API which report things in vCPU indexes. Of course this can work not only for kworkers, but also any guest accesses that may reach a missing page, for example, very likely when in the QEMU main thread too (and all other threads whenever applicable). In this case, we don't care about "how long the threads are blocked", but we only care about "how long the fault will be resolved". Cc: Markus Armbruster <armbru@redhat.com> Cc: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Fabiano Rosas <farosas@suse.de> Tested-by: Mario Casquero <mcasquer@redhat.com> Link: https://lore.kernel.org/r/20250613141217.474825-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:39 -03:00
Peter Xu	b63a2e9e4b	migration/postcopy: Optimize blocktime fault tracking with hashtable Currently, the postcopy blocktime feature maintains vCPU fault information using an array (vcpu_addr[]). It has two issues. Issue 1: Performance Concern ============================ The old algorithm was almost OK and fast on inserts, except that the lookup is slow and won't scale if there are a lot of vCPUs: when a page is copied during postcopy, mark_postcopy_blocktime_end() will walk the whole array trying to find which vCPUs are blocked by the address. So it needs constant O(N) walk for each page resolution. Alexey (the author of postcopy blocktime) mentioned the perf issue and how to optimize it in a piece of comment in the page resolution path. The comment was (interestingly..) not complete, but it's relatively clear what he wanted to say about this perf issue. Issue 2: Wrong Accounting on re-entrancies ========================================== People might think that each vCPU should only and always get one fault at a time, so that when the blocktime layer captured one fault on one vCPU, we should never see another fault message on this vCPU. It's almost correct, except in some extreme rare cases. Case 1: it's possible the fault thread processes the userfaultfd messages too fast so it can see >1 messages on one vCPU before the previous one was resolved. Case 2: it's theoretically also possible one vCPU can get even more than one message on the same fault address if a fault is retried by the kernel (e.g., handle_userfault() got interrupted before page resolution). As this info might be important, instead of using commit message, I put more details into the code as comment, when introducing an array maintaining concurrent faults on one vCPU. Please refer to the comments for details on both cases, especially case 1 which can be tricky. Case 1 sounds rare, but it can be easily reproduced locally for me when we run blocktime together with the migration-test on the vanilla postcopy. New Design ========== This patch should do almost what Alexey mentioned, but slightly differently: instead of having an array to maintain vCPU fault addresses, for each of the fault message we push a message into a hash, indexed by the fault address. With the hash, it can replace the old two structs: both the vcpu_addr[] array, and also the array to store the start time of the fault. However due to above we need one more counter array to account concurrent faults on the same vCPU - that should even be needed in the old code, it's just that the old code was buggy and it will blindly overwrite an existing entry.. now we'll start to really track everything. The hash structure might be more efficient than tree to maintain such addr->(cpu, fault_time) information, so that the insert() and lookup() paths should ideally both be ~O(1). After all, we do not need to sort. Here we need to do one remove() though after the lookup(). It could be slow but only if many vCPUs faulted exactly on the same address (so when the list of cpu entries is long), which should be unlikely. Even with that, it's still a worst case O(N) (consider 400 vCPUs faulted on the same address and how likely is it..) rather than a constant O(N) complexity. When at it, touch up the tracepoints to make them slightly more useful. One tracepoint is added when walking all the fault entries. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	4c8a119485	migration/postcopy: Cleanup the total blocktime accounting The variable vcpu_total_blocktime isn't easy to follow. In reality, it wants to capture the case where all vCPUs are stopped, and now there will be some vCPUs starts running. The name now starts to conflict with vcpu_blocktime_total[], meanwhile it's actually not necessary to have the variable at all: since nobody is touching smp_cpus_down except ourselves, we can safely do the calculation at the end before decrementing smp_cpus_down. Hopefully this makes the logic easier to read, side benefit is we drop one temp var. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	28a185204e	migration/postcopy: Cache the tid->vcpu mapping for blocktime Looking up the vCPU index for each fault can be expensive when there're hundreds of vCPUs. Provide a cache for tid->vcpu instead with a hash table, then lookup from there. When at it, add another counter to record how many non-vCPU faults it gets. For example, the main thread can also access a guest page that was missing. These kind of faults are not accounted by blocktime so far. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	f07f2a3092	migration/postcopy: Initialize blocktime context only until listen Before this patch, the blocktime context can be created very early, because postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during migration object init. The trick here is the blocktime context needs system vCPU information, which seems to be possible to change after that point. I didn't verify it, but it doesn't sound right. Now move it out and initialize the context only when postcopy listen starts. That is already during a migration so it should be guaranteed the vCPU topology can never change on both sides. While at it, assert that the ctx isn't created instead this time; the old "if" trick isn't needed when we're sure it will only happen once now. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	b4c82b4288	migration/postcopy: Report fault latencies in blocktime Blocktime so far only cares about the time one vcpu (or the whole system) got blocked. It would be also be helpful if it can also report the latency of page requests, which could be very sensitive during postcopy. Blocktime itself is sometimes not very important, especially when one thinks about KVM async PF support, which means vCPUs are literally almost not blocked at all because the guest OS is smart enough to switch to another task when a remote fault is needed. However, latency is still sensitive and important because even if the guest vCPU is running on threads that do not need a remote fault, the workload that accesses some missing page is still affected. Add two entries to the report, showing how long it takes to resolve a remote fault. Mention in the QAPI doc that this is not the real average fault latency, but only the ones that was requested for a remote fault. Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice, meanwhile add the entry checks in qtests for all postcopy tests. Cc: Markus Armbruster <armbru@redhat.com> Cc: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Fabiano Rosas <farosas@suse.de> Tested-by: Mario Casquero <mcasquer@redhat.com> Link: https://lore.kernel.org/r/20250613141217.474825-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	271a1940e9	migration/postcopy: Add blocktime fault counts per-vcpu Add a field to count how many remote faults one vCPU has taken. So far it's still not used, but will be soon. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	a098761f63	migration/postcopy: Bring blocktime layer to ns level With 64-bit fields, it is trivial. The caution is when exposing any values in QMP, it was still declared with milliseconds (ms). Hence it's needed to do the convertion when exporting the values to existing QMP queries. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:38 -03:00
Peter Xu	08fb2a9335	migration/postcopy: Drop PostcopyBlocktimeContext.start_time Now with 64bits, the offseting using start_time is not needed anymore, because the array can always remember the whole timestamp. Then drop the unused parameter in get_low_time_offset() altogether. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:37 -03:00
Peter Xu	b2819530e3	migration/postcopy: Make all blocktime vars 64bits I am guessing it was used to be 32bits because of the atomic ops. Now all the atomic ops are gone and we're protected by a mutex instead, it's ok we can switch to 64 bits. Reasons to move over: - Allow further patches to change the unit from ms to us: with postcopy preempt mode, we're really into hundreds of microseconds level on blocktime. We'd better be able to trap those. - This also paves way for some other tricks that the original version used to avoid overflows, e.g., start_time was almost only useful before to make sure the sampled timestamp won't overflow a 32-bit field. - This prepares further reports on top of existing data collected, e.g. average page fault latencies. When average operation is taken into account, milliseconds are simply too coarse grained. When at it: - Rename page_fault_vcpu_time to vcpu_blocktime_start. - Rename vcpu_blocktime to vcpu_blocktime_total. - Touch up the trace-events to not dump blocktime ctx pointer Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:37 -03:00
Peter Xu	c0f47dfb5b	migration/postcopy: Drop all atomic ops in blocktime feature Now with the mutex protection it's not needed anymore. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:37 -03:00
Peter Xu	d2a81ca8c6	migration/postcopy: Push blocktime start/end into page req mutex The postcopy blocktime feature was tricky that it used quite some atomic operations over quite a few arrays and vars, without explaining how that would be thread safe. The thread safety here is about concurrency between the fault thread and the fault resolution threads, possible to access the same chunk of data. All these atomic ops can be expensive too before knowing clearly how it works. OTOH, postcopy has one page_request_mutex used to serialize the received bitmap updates. So far it's ok - we don't yet have a lot of threads contending the lock. It might change after multifd will be supported, but that's a separate story. What is important is, with that mutex, it's pretty lightweight to move all the blocktime maintenance into the mutex critical section. It's because the blocktime layer is lightweighted: almost "remember which vcpu faulted on which address", and "ok we get some fault resolved, calculate how long it takes". It's also an optional feature for now (but I have thought of changing that, maybe in the future). Let's push the blocktime layer into the mutex, so that it's always thread-safe even without any atomic ops. To achieve that, I'll need to add a tid parameter on fault path so that it'll start to pass the faulted thread ID into deeper the stack, but not too deep. When at it, add a comment for the shared fault handler (for example, vhost-user devices running with postcopy), to mention a TODO. One reason it might not be trivial is that vhost-user's userfaultfds should be opened by vhost-user process, so it's pretty hard to control making sure the TID feature will be around. It wasn't supported before, so keep it like that for now. Now we should be as ease when everything is protected by a mutex that we always take anyway. One side effect: we can finally remove one ramblock_recv_bitmap_test() in mark_postcopy_blocktime_begin(), which was pretty weird and which also includes a weird (but maybe necessary.. but maybe not?) operation to inject a blocktime entry then quickly erase it.. When we're with the mutex, and when we make sure it's invoked after checking the receive bitmap, it's not needed anymore. Instead, we assert. As another side effect, this paves way for removing all atomic ops in all the mem accesses in blocktime layer. Note that we need a stub for mark_postcopy_blocktime_begin() for Windows builds. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:37 -03:00
Peter Xu	f62b7a0a29	migration: Add option to set postcopy-blocktime Add a global property to allow enabling postcopy-blocktime feature. Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:37 -03:00
Peter Xu	adb13d6e42	migration/postcopy: Avoid clearing dirty bitmap for postcopy too This is a follow up on the other commit "migration/ram: avoid to do log clear in the last round" but for postcopy. https://lore.kernel.org/r/20250514115827.3216082-1-yanfei.xu@bytedance.com I can observe more than 10% reduction of average page fault latency during postcopy phase with this optimization: Before: 268.00us (+-1.87%) After: 232.67us (+-2.01%) The test was done with a 16GB VM with 80 vCPUs, running a workload that busy random writes to 13GB memory. Cc: Yanfei Xu <yanfei.xu@bytedance.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613140801.474264-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:37 -03:00
Peter Xu	7aaa1fc072	migration: Rewrite the migration complete detect logic There're a few things off here in that logic, rewrite it. When at it, add rich comment to explain each of the decisions. Since this is very sensitive path for migration, below are the list of things changed with their reasonings. (1) Exact pending size is only needed for precopy not postcopy Fundamentally it's because "exact" version only does one more deep sync to fetch the pending results, while in postcopy's case it's never going to sync anything more than estimate as the VM on source is stopped. (2) Do _not_ rely on threshold_size anymore to decide whether postcopy should complete threshold_size was calculated from the expected downtime and bandwidth only during precopy as an efficient way to decide when to switchover. It's not sensible to rely on threshold_size in postcopy. For precopy, if switchover is decided, the migration will complete soon. It's not true for postcopy. Logically speaking, postcopy should only complete the migration if all pending data is flushed. Here it used to work because save_complete() used to implicitly contain save_live_iterate() when there's pending size. Even if that looks benign, having RAMs to be migrated in postcopy's save_complete() has other bad side effects: (a) Since save_complete() needs to be run once at a time, it means when moving RAM there's no way moving other things (rather than round-robin iterating the vmstate handlers like what we do with ITERABLE phase). Not an immediate concern, but it may stop working in the future when there're more than one iterables (e.g. vfio postcopy). (b) postcopy recovery, unfortunately, only works during ITERABLE phase. IOW, if src QEMU moves RAM during postcopy's save_complete() and network failed, then it'll crash both QEMUs... OTOH if it failed during iteration it'll still be recoverable. IOW, this change should further reduce the window QEMU split brain and crash in extreme cases. If we enable the ram_save_complete() tracepoints, we'll see this before this patch: 1267959@1748381938.294066:ram_save_complete dirty=9627, done=0 1267959@1748381938.308884:ram_save_complete dirty=0, done=1 It means in this migration there're 9627 pages migrated at complete() of postcopy phase. After this change, all the postcopy RAM should be migrated in iterable phase, rather than save_complete(): 1267959@1748381938.294066:ram_save_complete dirty=0, done=0 1267959@1748381938.308884:ram_save_complete dirty=0, done=1 (3) Adjust when to decide to switch to postcopy This shouldn't be super important, the movement makes sure there's only one in_postcopy check, then we are clear on what we do with the two completely differnt use cases (precopy v.s. postcopy). (4) Trivial touch up on threshold_size comparision Which changes: "(!pending_size \|\| pending_size < s->threshold_size)" into: "(pending_size <= s->threshold_size)" Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613140801.474264-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:36 -03:00
Peter Xu	f1549da610	migration/ram: Add tracepoints for ram_save_complete() Take notes on start/end state of dirty pages for the whole system. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Link: https://lore.kernel.org/r/20250613140801.474264-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:36 -03:00
Peter Xu	ff9dfc41d9	migration/ram: One less indent for ram_find_and_save_block() The check over PAGE_DIRTY_FOUND isn't necessary. We could indent one less and assert that instead. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Link: https://lore.kernel.org/r/20250613140801.474264-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:36 -03:00
Peter Xu	7dd95a9630	migration: qemu_savevm_complete*() helpers Since we use the same save_complete() hook for both precopy and postcopy, add a set of helpers to invoke the hook() to dedup the code. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613140801.474264-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:36 -03:00
Peter Xu	57c43e52bd	migration: Rename save_live_complete_precopy to save_complete Now after merging the precopy and postcopy version of complete() hook, rename the precopy version from save_live_complete_precopy() to save_complete(). Dropping the "live" when at it, because it's in most cases not live when happening (in precopy). No functional change intended. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613140801.474264-7-peterx@redhat.com [peterx: squash the fixup that covers a few more doc spots, per Juraj] Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:36 -03:00
Peter Xu	d7530a9682	migration: Drop save_live_complete_postcopy hook The hook is only defined in two vmstate users ("ram" and "block dirty bitmap"), meanwhile both of them define the hook exactly the same as the precopy version. Hence, this postcopy version isn't needed. No functional change intended. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613140801.474264-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:35 -03:00
Peter Xu	2145f38c31	migration/bg-snapshot: Do not check for SKIP in iterator It's not possible to happen in bg-snapshot case. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Link: https://lore.kernel.org/r/20250613140801.474264-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:35 -03:00
Peter Xu	35290df01b	migration/docs: Move docs for postcopy blocktime feature Move it out of vanilla postcopy session, but instead a standalone feature. When at it, removing the NOTE because it's incorrect now after introduction of max-postcopy-bandwidth, which can control the throughput even for postcopy phase. Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Link: https://lore.kernel.org/r/20250613140801.474264-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:35 -03:00
Peter Xu	2862d6d4fb	migration/hmp: Fix postcopy-blocktime per-vCPU results Unfortunately, it was never correctly shown.. This is only found when I started to look into making the blocktime feature more useful (so as to avoid using bpftrace, even though I'm not sure which one will be harder to use..). So the old dump would look like this: Postcopy vCPU Blocktime: 0-1,4,10,21,33,46,48,59 Even though there're actually 40 vcpus, and the string will merge same elements and also sort them. To fix it, simply loop over the uint32List manually. Now it looks like: Postcopy vCPU Blocktime (ms): [15, 0, 0, 43, 29, 34, 36, 29, 37, 41, 33, 37, 45, 52, 50, 38, 40, 37, 40, 49, 40, 35, 35, 35, 81, 19, 18, 19, 18, 30, 22, 3, 0, 0, 0, 0, 0, 0, 0, 0] Cc: Dr. David Alan Gilbert <dave@treblig.org> Cc: Alexey Perevalov <a.perevalov@samsung.com> Cc: Markus Armbruster <armbru@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613140801.474264-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:35 -03:00
Peter Xu	31a90360fd	migration/hmp: Reorg "info migrate" once more Dave suggested the HMP output for "info migrate" can not only leverage the lines but also better grouping: https://lore.kernel.org/r/aC4_-nMc7FwsMf9p@gallifrey I followed Dave's suggestion, and some more modifications on top: - Added all elements into the picture - Use size_to_str() and drop most of the units: benefit is more friendly to most human eyes, bad side effect is lose of details, but that should be corner case per my uses, and one can still leverage the QMP interface when necessary. - Sub-grouping for "Transfers" ("Channels" and "Page Types"). - Better indentations Sample output: (qemu) info migrate Status: postcopy-active Time (ms): total=47317, setup=5, down=8 RAM info: Throughput (Mbps): 1342.83 Sizes: pagesize=4 KiB, total=4.02 GiB Transfers: transferred=1.41 GiB, remain=2.46 GiB Channels: precopy=15.2 MiB, multifd=0 B, postcopy=1.39 GiB Page Types: normal=367713, zero=41195 Page Rates (pps): transfer=40900, dirty=4 Others: dirty_syncs=2, postcopy_req=57503 Suggested-by: Dr. David Alan Gilbert <dave@treblig.org> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Acked-by: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Juraj Marcin <jmarcin@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Link: https://lore.kernel.org/r/20250613140801.474264-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>	2025-07-11 10:37:35 -03:00
Jackson Donaldson	3a323a813f	tests/functional: Add a test for the MAX78000 arm machine Runs a binary from the max78000test repo used in developing the qemu implementation of the max78000 to verify that the machine and implemented devices generally still work. Signed-off-by: Jackson Donaldson <jcksn@duck.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Message-id: 20250711110626.624534-3-jcksn@duck.com Signed-off-by: Peter Maydell <peter.maydell@linaro.org>	2025-07-11 13:30:32 +01:00

1 2 3 4 5 ...

122494 commits