This fixes a compiler error when higher warning levels are enabled:
../migration/postcopy-ram.c: In function ‘postcopy_temp_pages_setup’:
../migration/postcopy-ram.c:1483:50: error: ‘g_malloc0_n’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Werror=calloc-transposed-args]
1483 | mis->postcopy_tmp_pages = g_malloc0_n(sizeof(PostcopyTmpPage), channels);
| ^~~~~~~~~~~~~~~
../migration/postcopy-ram.c:1483:50: note: earlier argument should specify number of elements, later size of each element
Avoid also a related int/unsigned mismatch by fixing the type of
two local variables.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20251209125049.764095-1-sw@weilnetz.de>
[PMD: Replace g_malloc0_n() by g_new0()]
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20251209195010.83219-1-philmd@linaro.org>
One of two error messages show __func__. Drop it; it doesn't help
users, and developers can grep for the message. This also permits
de-duplicating the code to prepend to the error message.
Both error messages show a numeric error code. I doubt that's
helpful, but I'm leaving it alone.
Use error_append_hint() for explaining that some dirty bitmaps may be
lost. Polish the prose.
Don't faff around with g_clear_pointer(), it's not worth its keep
here.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20251115083500.2753895-4-armbru@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Currently, when postcopy starts, the source VM starts switchover and
sends a package containing the state of all non-postcopiable devices.
When the destination loads this package, the switchover is complete and
the destination VM starts. However, if the device state load fails or
the destination side crashes, the source side is already in
POSTCOPY_ACTIVE state and cannot be recovered, even when it has the most
up-to-date machine state as the destination has not yet started.
This patch introduces a new POSTCOPY_DEVICE state which is active while
the destination machine is loading the device state, is not yet running,
and the source side can be resumed in case of a migration failure.
Return-path is required for this state to function, otherwise it will be
skipped in favor of POSTCOPY_ACTIVE.
To transition from POSTCOPY_DEVICE to POSTCOPY_ACTIVE, the source
side uses a PONG message that is a response to a PING message processed
just before the POSTCOPY_RUN command that starts the destination VM.
Thus, this feature is effective even if the destination side does not
yet support this new state.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Link: https://lore.kernel.org/r/20251103183301.3840862-9-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This patch makes the listen thread joinable instead detached, and joins
it alongside other postcopy threads.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20251103183301.3840862-8-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
When exit-on-error was added to migration, it wasn't added to postcopy.
Even though postcopy migration will usually pause and not fail, in cases
it does unrecoverably fail before destination side has been started,
exit-on-error will allow management to query the error.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20251103183301.3840862-7-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Currently, there are two functions that are responsible for calling the
cleanup of the incoming migration state. With successful precopy, it's
the incoming migration coroutine, and with successful postcopy it's the
postcopy listen thread. However, if postcopy fails during in the device
load, both functions will try to do the cleanup.
This patch refactors all cleanup that needs to be done on the incoming
side into a common function and defines a clear boundary, who is
responsible for the cleanup. The incoming migration coroutine is
responsible for calling the cleanup function, unless the listen thread
has been started, in which case the postcopy listen thread runs the
incoming migration cleanup in its BH.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Fixes: 9535435795 ("migration: push Error **errp into qemu_loadvm_state()")
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20251103183301.3840862-6-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
After moving postcopy_ram_listen_thread() to postcopy file, this patch
introduces a pair of functions, postcopy_incoming_setup() and
postcopy_incoming_cleanup(). These functions encapsulate setup and
cleanup of all incoming postcopy resources, postcopy-ram and postcopy
listen thread.
Furthermore, this patch also renames the postcopy_ram_listen_thread to
postcopy_listen_thread, as this thread handles not only postcopy-ram,
but also dirty-bitmaps and in the future it could handle other
postcopiable devices.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20251103183301.3840862-5-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This patch addresses a TODO about moving postcopy_ram_listen_thread() to
postcopy file.
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20251103183301.3840862-4-jmarcin@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
This is an incremental step in converting vmstate loading
code to report error via Error objects instead of directly
printing it to console/monitor.
It is ensured that ram_postcopy_incoming_init() must report an error
in errp, in case of failure.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Arun Menon <armenon@redhat.com>
Tested-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Link: https://lore.kernel.org/r/20250918-propagate_tpm_error-v14-14-36f11a6fb9d3@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
qemu_file_set_blocking() is a wrapper on qio_channel_set_blocking(),
so let's passthrough the errp.
Note the migration should not be using &error_abort in these calls,
however, this is done to expedite the API conversion.
The original code would have eventually ended up calling either
qemu_socket_set_nonblock which would asset on Linux, or
g_unix_set_fd_nonblocking which would propagate errors. We never
saw asserts in practice, and conceptually they should not happen,
but ideally this code will be later adapted to remove use of
&error_abort.
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Add the latency distribution too for blocktime, using order-of-two buckets.
It accounts for all the faults, from either vCPU or non-vCPU threads. With
prior rework, it's very easy to achieve by adding an array to account for
faults in each buckets.
Sample output for HMP (while for QMP it's simply an array):
Postcopy Latency Distribution:
[ 1 us - 2 us ]: 0
[ 2 us - 4 us ]: 0
[ 4 us - 8 us ]: 1
[ 8 us - 16 us ]: 2
[ 16 us - 32 us ]: 2
[ 32 us - 64 us ]: 3
[ 64 us - 128 us ]: 10169
[ 128 us - 256 us ]: 50151
[ 256 us - 512 us ]: 12876
[ 512 us - 1 ms ]: 97
[ 1 ms - 2 ms ]: 42
[ 2 ms - 4 ms ]: 44
[ 4 ms - 8 ms ]: 93
[ 8 ms - 16 ms ]: 138
[ 16 ms - 32 ms ]: 0
[ 32 ms - 65 ms ]: 0
[ 65 ms - 131 ms ]: 0
[ 131 ms - 262 ms ]: 0
[ 262 ms - 524 ms ]: 0
[ 524 ms - 1 sec ]: 0
[ 1 sec - 2 sec ]: 0
[ 2 sec - 4 sec ]: 0
[ 4 sec - 8 sec ]: 0
[ 8 sec - 16 sec ]: 0
Cc: Markus Armbruster <armbru@redhat.com>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-15-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
When used to report page fault latencies, the blocktime feature can be
almost useless when KVM async page fault is enabled, because in most cases
such remote fault will kickoff async page faults, then it's not trackable
from blocktime layer.
After all these recent rewrites to blocktime layer, it's finally so easy to
also support tracking non-vCPU faults. It'll be even faster if we could
always index fault records with TIDs, unfortunately we need to maintain the
blocktime API which report things in vCPU indexes.
Of course this can work not only for kworkers, but also any guest accesses
that may reach a missing page, for example, very likely when in the QEMU
main thread too (and all other threads whenever applicable).
In this case, we don't care about "how long the threads are blocked", but
we only care about "how long the fault will be resolved".
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-14-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Currently, the postcopy blocktime feature maintains vCPU fault information
using an array (vcpu_addr[]). It has two issues.
Issue 1: Performance Concern
============================
The old algorithm was almost OK and fast on inserts, except that the lookup
is slow and won't scale if there are a lot of vCPUs: when a page is copied
during postcopy, mark_postcopy_blocktime_end() will walk the whole array
trying to find which vCPUs are blocked by the address. So it needs
constant O(N) walk for each page resolution.
Alexey (the author of postcopy blocktime) mentioned the perf issue and how
to optimize it in a piece of comment in the page resolution path. The
comment was (interestingly..) not complete, but it's relatively clear what
he wanted to say about this perf issue.
Issue 2: Wrong Accounting on re-entrancies
==========================================
People might think that each vCPU should only and always get one fault at a
time, so that when the blocktime layer captured one fault on one vCPU, we
should never see another fault message on this vCPU.
It's almost correct, except in some extreme rare cases.
Case 1: it's possible the fault thread processes the userfaultfd messages
too fast so it can see >1 messages on one vCPU before the previous one was
resolved.
Case 2: it's theoretically also possible one vCPU can get even more than
one message on the same fault address if a fault is retried by the
kernel (e.g., handle_userfault() got interrupted before page resolution).
As this info might be important, instead of using commit message, I put
more details into the code as comment, when introducing an array
maintaining concurrent faults on one vCPU. Please refer to the comments
for details on both cases, especially case 1 which can be tricky.
Case 1 sounds rare, but it can be easily reproduced locally for me when we
run blocktime together with the migration-test on the vanilla postcopy.
New Design
==========
This patch should do almost what Alexey mentioned, but slightly
differently: instead of having an array to maintain vCPU fault addresses,
for each of the fault message we push a message into a hash, indexed by the
fault address.
With the hash, it can replace the old two structs: both the vcpu_addr[]
array, and also the array to store the start time of the fault. However
due to above we need one more counter array to account concurrent faults on
the same vCPU - that should even be needed in the old code, it's just that
the old code was buggy and it will blindly overwrite an existing
entry.. now we'll start to really track everything.
The hash structure might be more efficient than tree to maintain such
addr->(cpu, fault_time) information, so that the insert() and lookup()
paths should ideally both be ~O(1). After all, we do not need to sort.
Here we need to do one remove() though after the lookup(). It could be
slow but only if many vCPUs faulted exactly on the same address (so when
the list of cpu entries is long), which should be unlikely. Even with that,
it's still a worst case O(N) (consider 400 vCPUs faulted on the same
address and how likely is it..) rather than a constant O(N) complexity.
When at it, touch up the tracepoints to make them slightly more useful.
One tracepoint is added when walking all the fault entries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-13-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The variable vcpu_total_blocktime isn't easy to follow. In reality, it
wants to capture the case where all vCPUs are stopped, and now there will
be some vCPUs starts running.
The name now starts to conflict with vcpu_blocktime_total[], meanwhile it's
actually not necessary to have the variable at all: since nobody is
touching smp_cpus_down except ourselves, we can safely do the calculation
at the end before decrementing smp_cpus_down.
Hopefully this makes the logic easier to read, side benefit is we drop one
temp var.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-12-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Looking up the vCPU index for each fault can be expensive when there're
hundreds of vCPUs. Provide a cache for tid->vcpu instead with a hash
table, then lookup from there.
When at it, add another counter to record how many non-vCPU faults it gets.
For example, the main thread can also access a guest page that was missing.
These kind of faults are not accounted by blocktime so far.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-11-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Before this patch, the blocktime context can be created very early, because
postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during
migration object init.
The trick here is the blocktime context needs system vCPU information,
which seems to be possible to change after that point. I didn't verify it,
but it doesn't sound right.
Now move it out and initialize the context only when postcopy listen
starts. That is already during a migration so it should be guaranteed the
vCPU topology can never change on both sides.
While at it, assert that the ctx isn't created instead this time; the old
"if" trick isn't needed when we're sure it will only happen once now.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-10-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Blocktime so far only cares about the time one vcpu (or the whole system)
got blocked. It would be also be helpful if it can also report the latency
of page requests, which could be very sensitive during postcopy.
Blocktime itself is sometimes not very important, especially when one
thinks about KVM async PF support, which means vCPUs are literally almost
not blocked at all because the guest OS is smart enough to switch to
another task when a remote fault is needed.
However, latency is still sensitive and important because even if the guest
vCPU is running on threads that do not need a remote fault, the workload
that accesses some missing page is still affected.
Add two entries to the report, showing how long it takes to resolve a
remote fault. Mention in the QAPI doc that this is not the real average
fault latency, but only the ones that was requested for a remote fault.
Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
meanwhile add the entry checks in qtests for all postcopy tests.
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Link: https://lore.kernel.org/r/20250613141217.474825-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add a field to count how many remote faults one vCPU has taken. So far
it's still not used, but will be soon.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
With 64-bit fields, it is trivial. The caution is when exposing any values
in QMP, it was still declared with milliseconds (ms). Hence it's needed to
do the convertion when exporting the values to existing QMP queries.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Now with 64bits, the offseting using start_time is not needed anymore,
because the array can always remember the whole timestamp.
Then drop the unused parameter in get_low_time_offset() altogether.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
I am guessing it was used to be 32bits because of the atomic ops. Now all
the atomic ops are gone and we're protected by a mutex instead, it's ok we
can switch to 64 bits.
Reasons to move over:
- Allow further patches to change the unit from ms to us: with postcopy
preempt mode, we're really into hundreds of microseconds level on
blocktime. We'd better be able to trap those.
- This also paves way for some other tricks that the original version
used to avoid overflows, e.g., start_time was almost only useful before
to make sure the sampled timestamp won't overflow a 32-bit field.
- This prepares further reports on top of existing data collected,
e.g. average page fault latencies. When average operation is taken into
account, milliseconds are simply too coarse grained.
When at it:
- Rename page_fault_vcpu_time to vcpu_blocktime_start.
- Rename vcpu_blocktime to vcpu_blocktime_total.
- Touch up the trace-events to not dump blocktime ctx pointer
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The postcopy blocktime feature was tricky that it used quite some atomic
operations over quite a few arrays and vars, without explaining how that
would be thread safe. The thread safety here is about concurrency between
the fault thread and the fault resolution threads, possible to access the
same chunk of data. All these atomic ops can be expensive too before
knowing clearly how it works.
OTOH, postcopy has one page_request_mutex used to serialize the received
bitmap updates. So far it's ok - we don't yet have a lot of threads
contending the lock. It might change after multifd will be supported, but
that's a separate story. What is important is, with that mutex, it's
pretty lightweight to move all the blocktime maintenance into the mutex
critical section. It's because the blocktime layer is lightweighted:
almost "remember which vcpu faulted on which address", and "ok we get some
fault resolved, calculate how long it takes". It's also an optional
feature for now (but I have thought of changing that, maybe in the future).
Let's push the blocktime layer into the mutex, so that it's always
thread-safe even without any atomic ops.
To achieve that, I'll need to add a tid parameter on fault path so that
it'll start to pass the faulted thread ID into deeper the stack, but not
too deep. When at it, add a comment for the shared fault handler (for
example, vhost-user devices running with postcopy), to mention a TODO. One
reason it might not be trivial is that vhost-user's userfaultfds should be
opened by vhost-user process, so it's pretty hard to control making sure
the TID feature will be around. It wasn't supported before, so keep it
like that for now.
Now we should be as ease when everything is protected by a mutex that we
always take anyway.
One side effect: we can finally remove one ramblock_recv_bitmap_test() in
mark_postcopy_blocktime_begin(), which was pretty weird and which also
includes a weird (but maybe necessary.. but maybe not?) operation to inject
a blocktime entry then quickly erase it.. When we're with the mutex, and
when we make sure it's invoked after checking the receive bitmap, it's not
needed anymore. Instead, we assert.
As another side effect, this paves way for removing all atomic ops in all
the mem accesses in blocktime layer.
Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
builds.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250613141217.474825-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
thread_sync_sem is an one-shot event so it can be converted into
QemuEvent, which is more lightweight.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20250529-event-v5-9-53b285203794@daynix.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert the existing includes with sed.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Locking the memory without MCL_ONFAULT instantly prefaults any mmaped
anonymous memory with a write-fault, which introduces a lot of extra
overhead in terms of memory usage when all you want to do is to prevent
kcompactd from migrating and compacting QEMU pages. Add an option to
only lock pages lazily as they're faulted by the process by using
MCL_ONFAULT if asked.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250212143920.1269754-5-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
Replace the boolean value enable_mlock with an enum and add a helper to
decide whether we should be calling os_mlock.
This is a stepping stone towards introducing a new mlock mode, which
will be the third possible state of this enum.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Link: https://lore.kernel.org/r/20250212143920.1269754-4-d-tatianin@yandex-team.ru
Signed-off-by: Peter Xu <peterx@redhat.com>
This will be used in the following commits to make it possible to only
lock memory on fault instead of right away.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Link: https://lore.kernel.org/r/20250212143920.1269754-2-d-tatianin@yandex-team.ru
[peterx: fail os_mlock(on_fault=1) when not supported]
[peterx: use G_GNUC_UNUSED instead of "(void)on_fault", per Dan]
Signed-off-by: Peter Xu <peterx@redhat.com>
Headers in include/sysemu/ are not only related to system
*emulation*, they are also used by virtualization. Rename
as system/ which is clearer.
Files renamed manually then mechanical change using sed tool.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Lei Yang <leiyang@redhat.com>
Message-Id: <20241203172445.28576-1-philmd@linaro.org>
Keep migration thread names together, so it's easier to see a list of all
possible migration threads.
Still two functional changes below besides the macro defintions:
- There's one dirty rate thread that we overlooked before, now we add
that too and name it as "mig/dirtyrate" following the old rules.
- The old name "mig/src/rp-thr" has "-thr" but it may not be useful if
it's a thread name anyway, while "rp" can be slightly hard to read.
Taking this chance to rename it to "mig/src/return", hopefully a better
name.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Acked-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Link: https://lore.kernel.org/r/20241011153652.517440-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Use the uffd_copy_page, uffd_zero_page and uffd_wakeup helpers
rather than calling ioctl ourselves.
They return -errno on error, and print an error_report themselves.
I think this actually makes postcopy_place_page actually more
consistent in it's callers.
Signed-off-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20240919134626.166183-7-dave@treblig.org
[peterx: fix i386 build]
Signed-off-by: Peter Xu <peterx@redhat.com>
This patch is part of a series that moves towards a consistent use of
g_assert_not_reached() rather than an ad hoc mix of different
assertion mechanisms.
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20240919044641.386068-31-pierrick.bouvier@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
This patch is part of a series that moves towards a consistent use of
g_assert_not_reached() rather than an ad hoc mix of different
assertion mechanisms.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20240919044641.386068-5-pierrick.bouvier@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
This patch adds a migration state on src called "postcopy-recover-setup".
The new state will describe the intermediate step starting from when the
src QEMU received a postcopy recovery request, until the migration channels
are properly established, but before the recovery process take place.
The request came from Libvirt where Libvirt currently rely on the migration
state events to detect migration state changes. That works for most of the
migration process but except postcopy recovery failures at the beginning.
Currently postcopy recovery only has two major states:
- postcopy-paused: this is the state that both sides of QEMU will be in
for a long time as long as the migration channel was interrupted.
- postcopy-recover: this is the state where both sides of QEMU handshake
with each other, preparing for a continuation of postcopy which used to
be interrupted.
The issue here is when the recovery port is invalid, the src QEMU will take
the URI/channels, noticing the ports are not valid, and it'll silently keep
in the postcopy-paused state, with no event sent to Libvirt. In this case,
the only thing Libvirt can do is to poll the migration status with a proper
interval, however that's less optimal.
Considering that this is the only case where Libvirt won't get a
notification from QEMU on such events, let's add postcopy-recover-setup
state to mimic what we have with the "setup" state of a newly initialized
migration, describing the phase of connection establishment.
With that, postcopy recovery will have two paths to go now, and either path
will guarantee an event generated. Now the events will look like this
during a recovery process on src QEMU:
- Initially when the recovery is initiated on src, QEMU will go from
"postcopy-paused" -> "postcopy-recover-setup". Old QEMUs don't have
this event.
- Depending on whether the channel re-establishment is succeeded:
- In succeeded case, src QEMU will move from "postcopy-recover-setup"
to "postcopy-recover". Old QEMUs also have this event.
- In failure case, src QEMU will move from "postcopy-recover-setup" to
"postcopy-paused" again. Old QEMUs don't have this event.
This guarantees that Libvirt will always receive a notification for
recovery process properly.
One thing to mention is, such new status is only needed on src QEMU not
both. On dest QEMU, the state machine doesn't change. Hence the events
don't change either. It's done like so because dest QEMU may not have an
explicit point of setup start. E.g., it can happen that when dest QEMUs
doesn't use migrate-recover command to use a new URI/channel, but the old
URI/channels can be reused in recovery, in which case the old ports simply
can work again after the network routes are fixed up.
Add a new helper postcopy_is_paused() detecting whether postcopy is still
paused, taking RECOVER_SETUP into account too. When using it on both
src/dst, a slight change is done altogether to always wait for the
semaphore before checking the status, because for both sides a sem_post()
will be required for a recovery.
Cc: Jiri Denemark <jdenemar@redhat.com>
Cc: Prasad Pandit <ppandit@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Buglink: https://issues.redhat.com/browse/RHEL-38485
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The postcopy thread names on dest QEMU are slightly confusing, partly I'll
need to blame myself on 36f62f11e4 ("migration: Postcopy preemption
preparation on channel creation"). E.g., "fault-fast" reads like a fast
version of "fault-default", but it's actually the fast version of
"postcopy/listen".
Taking this chance, rename all the migration threads with proper rules.
Considering we only have 15 chars usable, prefix all threads with "mig/",
meanwhile identify src/dst threads properly this time. So now most thread
names will look like "mig/DIR/xxx", where DIR will be "src"/"dst", except
the bg-snapshot thread which doesn't have a direction.
For multifd threads, making them "mig/{src|dst}/{send|recv}_%d".
We used to have "live_migration" thread for a very long time, now it's
called "mig/src/main". We may hope to have "mig/dst/main" soon but not
yet.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Zhijian Li (Fujitsu) <lizhijian@fujitsu.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
It is defined and referred to exclusively from a .c file.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the commit 05e385d2a9 ("error: Move ERRP_GUARD() to the beginning
of the function"), there are new codes that don't put ERRP_GUARD() at
the beginning of the functions.
As stated in the commit 05e385d2a9: "include/qapi/error.h advises to put
ERRP_GUARD() right at the beginning of the function, because only then
can it guard the whole function.", so clean up the few spots
disregarding the advice.
Inspired-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20240312060337.3240965-1-zhao1.liu@linux.intel.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Remove the error object from opaque data passed to notifiers.
Use the new error parameter passed to the notifier instead.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/1708622920-68779-3-git-send-email-steven.sistare@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Pass an error object as the third parameter to "notifier with return"
notifiers, so clients no longer need to bundle an error object in the
opaque data. The new parameter is used in a later patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/1708622920-68779-2-git-send-email-steven.sistare@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
There is no need to use the Linux-internal __u64 type, 1ULL is
guaranteed to be wide enough.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20240117160313.175609-1-pbonzini@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
We hit intermit CI issue on failing at migration-test over the unit test
preempt/plain:
qemu-system-x86_64: Unable to read from socket: Connection reset by peer
Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc current = 4f hit_edge = 1
**
ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion failed: (bad == 0)
(test program exited with status code -6)
Fabiano debugged into it and found that the preempt thread can quit even
without receiving all the pages, which can cause guest not receiving all
the pages and corrupt the guest memory.
To make sure preempt thread finished receiving all the pages, we can rely
on the page_requested_count being zero because preempt channel will only
receive requested page faults. Note, not all the faulted pages are required
to be sent via the preempt channel/thread; imagine the case when a
requested page is just queued into the background main channel for
migration, the src qemu will just still send it via the background channel.
Here instead of spinning over reading the count, we add a condvar so the
main thread can wait on it if that unusual case happened, without burning
the cpu for no good reason, even if the duration is short; so even if we
spin in this rare case is probably fine. It's just better to not do so.
The condvar is only used when that special case is triggered. Some memory
ordering trick is needed to guarantee it from happening (against the
preempt thread status field), so the main thread will always get a kick
when that triggers correctly.
Closes: https://gitlab.com/qemu-project/qemu/-/issues/1886
Debugged-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20230918172822.19052-2-farosas@suse.de>
virtio-mem wants to know whether it should not mess with the RAMBlock
content (e.g., discard RAM, preallocate memory) on incoming migration.
So let's expose that function as migrate_ram_is_ignored() in
migration/misc.h
Message-ID: <20230706075612.67404-4-david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
It is not needed since we moved the accessor for tls properties to
options.c.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Instead of print it to STDERR, bring the error upwards so that it can be
reported via QMP responses.
E.g.:
{ "execute": "migrate-set-capabilities" ,
"arguments": { "capabilities":
[ { "capability": "postcopy-ram", "state": true } ] } }
{ "error":
{ "class": "GenericError",
"desc": "Postcopy is not supported: Host backend files need to be TMPFS
or HUGETLBFS only" } }
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
We move there all capabilities helpers from migration.c.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
Following David advise:
- looked through the history, capabilities are newer than 2012, so we
can remove that bit of the header.
- This part is posterior to Anthony.
Original Author is Orit. Once there,
I put myself. Peter Xu also did quite a bit of work here.
Anyone else wants/needs to be there? I didn't search too hard
because nobody asked before to be added.
What do you think?
Postcopy requires the memory support userfaultfd to work. Right now we
check it but it's a bit too late (when switching to postcopy migration).
Do that early right at enabling of postcopy.
Note that this is still only a best effort because ramblocks can be
dynamically created. We can add check in hostmem creations and fail if
postcopy enabled, but maybe that's too aggressive.
Still, we have chance to fail the most obvious where we know there's an
existing unsupported ramblock.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
It does not even pair with a qatomic_mb_set(), so it is clearer to use
load-acquire in this case; they are synonyms.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In 8.0 devel window we reworked preempt channel creation, so that there'll
be no race condition when the migration channel and preempt channel got
established in the wrong order in commit 5655aab079.
However no one noticed that the change will also be not compatible with
older qemus, majorly 7.1/7.2 versions where preempt mode started to be
supported.
Leverage the same pre-7.2 flag introduced in the previous patch to recover
the behavior hopefully before 8.0 releases, so we don't break migration
when we migrate from 8.0 to older qemu binaries.
Fixes: 5655aab079 ("migration: Postpone postcopy preempt channel to be after main")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
postcopy_qemufile_src object should be owned by one thread, either the main
thread (e.g. when at the beginning, or at the end of migration), or by the
return path thread (when during a preempt enabled postcopy migration). If
that's not the case the access to the object might be racy.
postcopy_preempt_shutdown_file() can be potentially racy, because it's
called at the end phase of migration on the main thread, however during
which the return path thread hasn't yet been recycled; the recycle happens
in await_return_path_close_on_source() which is after this point.
It means, logically it's posslbe the main thread and the return path thread
are both operating on the same qemufile. While I don't think qemufile is
thread safe at all.
postcopy_preempt_shutdown_file() used to be needed because that's where we
send EOS to dest so that dest can safely shutdown the preempt thread.
To avoid the possible race, remove this only place that a race can happen.
Instead we figure out another way to safely close the preempt thread on
dest.
The core idea during postcopy on deciding "when to stop" is that dest will
send a postcopy SHUT message to src, telling src that all data is there.
Hence to shut the dest preempt thread maybe better to do it directly on
dest node.
This patch proposed such a way that we change postcopy_prio_thread_created
into PreemptThreadStatus, so that we kick the preempt thread on dest qemu
by a sequence of:
mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
qemu_file_shutdown(mis->postcopy_qemufile_dst);
While here shutdown() is probably so far the easiest way to kick preempt
thread from a blocked qemu_get_be64(). Then it reads preempt_thread_status
to make sure it's not a network failure but a willingness to quit the
thread.
We could have avoided that extra status but just rely on migration status.
The problem is postcopy_ram_incoming_cleanup() is just called early enough
so we're still during POSTCOPY_ACTIVE no matter what.. So just make it
simple to have the status introduced.
One flag x-preempt-pre-7-2 is added to keep old pre-7.2 behaviors of
postcopy preempt.
Fixes: 9358982744 ("migration: Send requested page directly in rp-return thread")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
QEMU main thread will wait until dest preempt channel established during
processing the LISTEN command (within the whole postcopy PACKAGED data), by
waiting on the semaphore postcopy_qemufile_dst_done.
That's racy, because it's possible that the dest QEMU main thread hasn't
yet accept()ed the new connection when processing the LISTEN event. The
sem_wait() will yield the main thread without being able to run anything
else including the accept() of the new socket, which can cause deadlock
within the main thread.
To avoid the race, move the "wait channel" from main thread to the preempt
thread right at the start.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Fixes: 5655aab079 ("migration: Postpone postcopy preempt channel to be after main")
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>