The user calling himself "John Doe" reported a deadlock when
attempting to use qemu-storage-daemon to serve both a base file over
NBD, and a qcow2 file with that NBD export as its backing file, from
the same process, even though it worked just fine when there were two
q-s-d processes. The bulk of the NBD server code properly uses
coroutines to make progress in an event-driven manner, but the code
for spawning a new coroutine at the point when listen(2) detects a new
client was hard-coded to use the global GMainContext; in other words,
the callback that triggers nbd_client_new to let the server start the
negotiation sequence with the client requires the main loop to be
making progress. However, the code for bdrv_open of a qcow2 image
with an NBD backing file uses an AIO_WAIT_WHILE nested event loop to
ensure that the entire qcow2 backing chain is either fully loaded or
rejected, without any side effects from the main loop causing unwanted
changes to the disk being loaded (in short, an AioContext represents
the set of actions that are known to be safe while handling block
layer I/O, while excluding any other pending actions in the global
main loop with potentially larger risk of unwanted side effects).
This creates a classic case of deadlock: the server can't progress to
the point of accept(2)ing the client to write to the NBD socket
because the main loop is being starved until the AIO_WAIT_WHILE
completes the bdrv_open, but the AIO_WAIT_WHILE can't progress because
it is blocked on the client coroutine stuck in a read() of the
expected magic number from the server side of the socket.
This patch adds a new API to allow clients to opt in to listening via
an AioContext rather than a GMainContext. This will allow NBD to fix
the deadlock by performing all actions during bdrv_open in the main
loop AioContext.
Technical debt warning: I would have loved to utilize a notify
function with AioContext to guarantee that we don't finalize listener
due to an object_unref if there is any callback still running (the way
GSource does), but wiring up notify functions into AioContext is a
bigger task that will be deferred to a later QEMU release. But for
solving the NBD deadlock, it is sufficient to note that the QMP
commands for enabling and disabling the NBD server are really the only
points where we want to change the listener's callback. Furthermore,
those commands are serviced in the main loop, which is the same
AioContext that is also listening for connections. Since a thread
cannot interrupt itself, we are ensured that at the point where we are
changing the watch, there are no callbacks active. This is NOT as
powerful as the GSource cross-thread safety, but sufficient for the
needs of today.
An upcoming patch will then add a unit test (kept separate to make it
easier to rearrange the series to demonstrate the deadlock without
this patch).
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-26-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
For ease of review, this patch adds an AioContext pointer to the
QIONetListener struct, the code to trace it, and refactors
listener->io_source to instead be an array of utility structs; but the
aio_context pointer is always NULL until the next patch adds an API to
set it. There should be no semantic change in this patch.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-25-eblake@redhat.com>
An upcoming patch needs to pass more than just sioc as the opaque
pointer to an AioContext; but since our AioContext code in general
(and its QIO Channel wrapper code) lacks a notify callback present
with GSource, we do not have the trivial option of just g_malloc'ing a
small struct to hold all that data coupled with a notify of g_free.
Instead, the data pointer must outlive the registered handler; in
fact, having the data pointer have the same lifetime as QIONetListener
is adequate.
But the cleanest way to stick such a helper struct in QIONetListener
will be to rearrange internal struct members. And that in turn means
that all existing code that currently directly accesses
listener->nsioc and listener->sioc[] should instead go through
accessor functions, to be immune to the upcoming struct layout
changes. So this patch adds accessor methods qio_net_listener_nsioc()
and qio_net_listener_sioc(), and puts them to use.
While at it, notice that the pattern of grabbing an sioc from the
listener only to turn around can call
qio_channel_socket_get_local_address is common enough to also warrant
the helper of qio_net_listener_get_local_address, and fix a copy-paste
error in the corresponding documentation.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-24-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Without a mutex, NetListener can run into this data race between a
thread changing the async callback callback function to use when a
client connects, and the thread servicing polling of the listening
sockets:
Thread 1:
qio_net_listener_set_client_func(lstnr, f1, ...);
=> foreach sock: socket
=> object_ref(lstnr)
=> sock_src = qio_channel_socket_add_watch_source(sock, ...., lstnr, object_unref);
Thread 2:
poll()
=> event POLLIN on socket
=> ref(GSourceCallback)
=> if (lstnr->io_func) // while lstnr->io_func is f1
...interrupt..
Thread 1:
qio_net_listener_set_client_func(lstnr, f2, ...);
=> foreach sock: socket
=> g_source_unref(sock_src)
=> foreach sock: socket
=> object_ref(lstnr)
=> sock_src = qio_channel_socket_add_watch_source(sock, ...., lstnr, object_unref);
Thread 2:
=> call lstnr->io_func(lstnr->io_data) // now sees f2
=> return dispatch(sock)
=> unref(GSourceCallback)
=> destroy-notify
=> object_unref
Found by inspection; I did not spend the time trying to add sleeps or
execute under gdb to try and actually trigger the race in practice.
This is a SEGFAULT waiting to happen if f2 can become NULL because
thread 1 deregisters the user's callback while thread 2 is trying to
service the callback. Other messes are also theoretically possible,
such as running callback f1 with an opaque pointer that should only be
passed to f2 (if the client code were to use more than just a binary
choice between a single async function or NULL).
Mitigating factor: if the code that modifies the QIONetListener can
only be reached by the same thread that is executing the polling and
async callbacks, then we are not in a two-thread race documented above
(even though poll can see two clients trying to connect in the same
window of time, any changes made to the listener by the first async
callback will be completed before the thread moves on to the second
client). However, QEMU is complex enough that this is hard to
generically analyze. If QMP commands (like nbd-server-stop) are run
in the main loop and the listener uses the main loop, things should be
okay. But when a client uses an alternative GMainContext, or if
servicing a QMP command hands off to a coroutine to avoid blocking, I
am unable to state with certainty whether a given net listener can be
modified by a thread different from the polling thread running
callbacks.
At any rate, it is worth having the API be robust. To ensure that
modifying a NetListener can be safely done from any thread, add a
mutex that guarantees atomicity to all members of a listener object
related to callbacks. This problem has been present since
QIONetListener was introduced.
Note that this does NOT prevent the case of a second round of the
user's old async callback being invoked with the old opaque data, even
when the user has already tried to change the async callback during
the first async callback; it is only about ensuring that there is no
sharding (the eventual io_func(io_data) call that does get made will
correspond to a particular combination that the user had requested at
some point in time, and not be sharded to a combination that never
existed in practice). In other words, this patch maintains the status
quo that a user's async callback function already needs to be robust
to parallel clients landing in the same window of poll servicing, even
when only one client is desired, if that particular listener can be
amended in a thread other than the one doing the polling.
CC: qemu-stable@nongnu.org
Fixes: 53047392 ("io: introduce a network socket listener API", v2.12.0)
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-20-eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[eblake: minor commit message wording improvements]
Signed-off-by: Eric Blake <eblake@redhat.com>
io/net-listener.c has two modes of use: asynchronous (the user calls
qio_net_listener_set_client_func to wake up the callback via the
global GMainContext, or qio_net_listener_set_client_func_full to wake
up the callback via the caller's own alternative GMainContext), and
synchronous (the user calls qio_net_listener_wait_client which creates
its own GMainContext and waits for the first client connection before
returning, with no need for a user's callback). But commit 938c8b79
has a latent logic flaw: when qio_net_listener_wait_client finishes on
its temporary context, it reverts all of the siocs back to the global
GMainContext rather than the potentially non-NULL context they might
have been originally registered with. Similarly, if the user creates
a net-listener, adds initial addresses, registers an async callback
with a non-default context (which ties to all siocs for the initial
addresses), then adds more addresses with qio_net_listener_add, the
siocs for later addresses are blindly placed in the global context,
rather than sharing the context of the earlier ones.
In practice, I don't think this has caused issues. As pointed out by
the original commit, all async callers prior to that commit were
already okay with the NULL default context; and the typical usage
pattern is to first add ALL the addresses the listener will pay
attention to before ever setting the async callback. Likewise, if a
file uses only qio_net_listener_set_client_func instead of
qio_net_listener_set_client_func_full, then it is never using a custom
context, so later assignments of async callbacks will still be to the
same global context as earlier ones. Meanwhile, any callers that want
to do the sync operation to grab the first client are unlikely to
register an async callback; altogether bypassing the question of
whether later assignments of a GSource are being tied to a different
context over time.
I do note that chardev/char-socket.c is the only file that calls both
qio_net_listener_wait_client (sync for a single client in
tcp_chr_accept_server_sync), and qio_net_listener_set_client_func_full
(several places, all with chr->gcontext, but sometimes with a NULL
callback function during teardown). But as far as I can tell, the two
uses are mutually exclusive, based on the is_waitconnect parameter to
qmp_chardev_open_socket_server.
That said, it is more robust to remember when an async callback
function is tied to a non-default context, and have both the sync wait
and any late address additions honor that same context. That way, the
code will be robust even if a later user performs a sync wait for a
specific client in the middle of servicing a longer-lived
QIONetListener that has an async callback for all other clients.
CC: qemu-stable@nongnu.org
Fixes: 938c8b79 ("qio: store gsources for net listeners", v2.12.0)
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-19-eblake@redhat.com>
- Fabiano's patch to fix snapshot crash by rejecting some caps
- Marco's mapped-ram support on snapshot save/load
- Steve's cpr maintainers entry update on retirement
- Peter's coverity fixes
- Chenyi's tdx fix on hugetlbfs regression
- Peter's doc update on migrate resume flag
- Peter's doc update on HMP set parameter for cpr-exec-command's char** parsing
- Xiaoyao's guest-memfd fix for enabling shmem
- Arun's fix on error_fatal regression for migration errors
- Bin's fix on redundant error free for add block failures
- Markus's cleanup around MigMode sets
- Peter's two patches (out of loadvm threadify) to cleanup qio read peek process
- Thomas's vmstate-static-checker update for possible deprecation of argparse use
- Stefan's fix on windows deadlock by making unassigned MMIOs lockless
-----BEGIN PGP SIGNATURE-----
iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCaQkZPBIccGV0ZXJ4QHJl
ZGhhdC5jb20ACgkQO1/MzfOr1wZhTgEA8eCBMpM7PusNSdzzeIygKnIp2A8I70ca
eIJz3ZM+FiUBAPVDrIZ59EhZA6NPcJb8Ya9OY4lT63F4BxrvN+f+uG4N
=GUBi
-----END PGP SIGNATURE-----
Merge tag 'staging-pull-request' of https://gitlab.com/peterx/qemu into staging
mem + migration pull for 10.2
- Fabiano's patch to fix snapshot crash by rejecting some caps
- Marco's mapped-ram support on snapshot save/load
- Steve's cpr maintainers entry update on retirement
- Peter's coverity fixes
- Chenyi's tdx fix on hugetlbfs regression
- Peter's doc update on migrate resume flag
- Peter's doc update on HMP set parameter for cpr-exec-command's char** parsing
- Xiaoyao's guest-memfd fix for enabling shmem
- Arun's fix on error_fatal regression for migration errors
- Bin's fix on redundant error free for add block failures
- Markus's cleanup around MigMode sets
- Peter's two patches (out of loadvm threadify) to cleanup qio read peek process
- Thomas's vmstate-static-checker update for possible deprecation of argparse use
- Stefan's fix on windows deadlock by making unassigned MMIOs lockless
# -----BEGIN PGP SIGNATURE-----
#
# iIgEABYKADAWIQS5GE3CDMRX2s990ak7X8zN86vXBgUCaQkZPBIccGV0ZXJ4QHJl
# ZGhhdC5jb20ACgkQO1/MzfOr1wZhTgEA8eCBMpM7PusNSdzzeIygKnIp2A8I70ca
# eIJz3ZM+FiUBAPVDrIZ59EhZA6NPcJb8Ya9OY4lT63F4BxrvN+f+uG4N
# =GUBi
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 03 Nov 2025 10:06:04 PM CET
# gpg: using EDDSA key B9184DC20CC457DACF7DD1A93B5FCCCDF3ABD706
# gpg: issuer "peterx@redhat.com"
# gpg: Good signature from "Peter Xu <xzpeter@gmail.com>" [unknown]
# gpg: aka "Peter Xu <peterx@redhat.com>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: B918 4DC2 0CC4 57DA CF7D D1A9 3B5F CCCD F3AB D706
* tag 'staging-pull-request' of https://gitlab.com/peterx/qemu: (36 commits)
migration: Introduce POSTCOPY_DEVICE state
migration: Make postcopy listen thread joinable
migration: Respect exit-on-error when migration fails before resuming
migration: Refactor all incoming cleanup info migration_incoming_destroy()
migration: Introduce postcopy incoming setup and cleanup functions
migration: Move postcopy_ram_listen_thread() to postcopy-ram.c
migration: Do not try to start VM if disk activation fails
migration: Flush migration channel after sending data of CMD_PACKAGED
system/physmem: mark io_mem_unassigned lockless
scripts/vmstate-static-checker: Fix deprecation warnings with latest argparse
migration: vmsd errp handlers: return bool
migration/vmstate: stop reporting error number for new _errp APIs
tmp_emulator: improve and fix use of errp
migration: vmstate_save_state_v(): fix error path
migration: Properly wait on G_IO_IN when peeking messages
io: Add qio_channel_wait_cond() helper
migration: Put Error **errp parameter last
migration: Use bitset of MigMode instead of variable arguments
migration: Use unsigned instead of int for bit set of MigMode
migration: Don't free the reason after calling migrate_add_blocker
...
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Add the helper to wait for QIO channel's IO availability in any
context (coroutine, or non-coroutine). Use it tree-wide for three
occurences.
Cc: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Link: https://lore.kernel.org/r/20251022192612.2737648-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
The kernel allocates extra metadata SKBs in case of a zerocopy send,
eventually used for zerocopy's notification mechanism. This metadata
memory is accounted for in the OPTMEM limit. The kernel queues
completion notifications on the socket error queue and this error queue
is freed when userspace reads it.
Usually, in the case of in-order processing, the kernel will batch the
notifications and merge the metadata into a single SKB and free the
rest. As a result, it never exceeds the OPTMEM limit. However, if there
is any out-of-order processing or intermittent zerocopy failures, this
error chain can grow significantly, exhausting the OPTMEM limit. As a
result, all new sendmsg requests fail to allocate any new SKB, leading
to an ENOBUF error. Depending on the amount of data queued before the
flush (i.e., large live migration iterations), even large OPTMEM limits
are prone to failure.
To work around this, if we encounter an ENOBUF error with a zerocopy
sendmsg, flush the error queue and retry once more.
Co-authored-by: Manish Mishra <manish.mishra@nutanix.com>
Signed-off-by: Tejus GK <tejus.gk@nutanix.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[DB: change TRUE/FALSE to true/false for 'bool' type;
add more #ifdef QEMU_MSG_ZEROCOPY blocks]
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Add a 'blocking' boolean field to QIOChannelSocket to track whether the
underlying socket is in blocking or non-blocking mode.
Signed-off-by: Tejus GK <tejus.gk@nutanix.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
I/O channel read/write functions can operate on any area of
memory, regardless of the content their represent. Do not
restrict to array of char, use the void* type, which is also
the type of the underlying iovec::iov_base field.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
[DB: also adapt test-crypto-tlssession.c func signatures]
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
If the QIOChannelWebsock object is freed while it is waiting to
complete a handshake, a GSource is leaked. This can lead to the
callback firing later on and triggering a use-after-free in the
use of the channel. This was observed in the VNC server with the
following trace from valgrind:
==2523108== Invalid read of size 4
==2523108== at 0x4054A24: vnc_disconnect_start (vnc.c:1296)
==2523108== by 0x4054A24: vnc_client_error (vnc.c:1392)
==2523108== by 0x4068A09: vncws_handshake_done (vnc-ws.c:105)
==2523108== by 0x44863B4: qio_task_complete (task.c:197)
==2523108== by 0x448343D: qio_channel_websock_handshake_io (channel-websock.c:588)
==2523108== by 0x6EDB862: UnknownInlinedFun (gmain.c:3398)
==2523108== by 0x6EDB862: g_main_context_dispatch_unlocked.lto_priv.0 (gmain.c:4249)
==2523108== by 0x6EDBAE4: g_main_context_dispatch (gmain.c:4237)
==2523108== by 0x45EC79F: glib_pollfds_poll (main-loop.c:287)
==2523108== by 0x45EC79F: os_host_main_loop_wait (main-loop.c:310)
==2523108== by 0x45EC79F: main_loop_wait (main-loop.c:589)
==2523108== by 0x423A56D: qemu_main_loop (runstate.c:835)
==2523108== by 0x454F300: qemu_default_main (main.c:37)
==2523108== by 0x73D6574: (below main) (libc_start_call_main.h:58)
==2523108== Address 0x57a6e0dc is 28 bytes inside a block of size 103,608 free'd
==2523108== at 0x5F2FE43: free (vg_replace_malloc.c:989)
==2523108== by 0x6EDC444: g_free (gmem.c:208)
==2523108== by 0x4053F23: vnc_update_client (vnc.c:1153)
==2523108== by 0x4053F23: vnc_refresh (vnc.c:3225)
==2523108== by 0x4042881: dpy_refresh (console.c:880)
==2523108== by 0x4042881: gui_update (console.c:90)
==2523108== by 0x45EFA1B: timerlist_run_timers.part.0 (qemu-timer.c:562)
==2523108== by 0x45EFC8F: timerlist_run_timers (qemu-timer.c:495)
==2523108== by 0x45EFC8F: qemu_clock_run_timers (qemu-timer.c:576)
==2523108== by 0x45EFC8F: qemu_clock_run_all_timers (qemu-timer.c:663)
==2523108== by 0x45EC765: main_loop_wait (main-loop.c:600)
==2523108== by 0x423A56D: qemu_main_loop (runstate.c:835)
==2523108== by 0x454F300: qemu_default_main (main.c:37)
==2523108== by 0x73D6574: (below main) (libc_start_call_main.h:58)
==2523108== Block was alloc'd at
==2523108== at 0x5F343F3: calloc (vg_replace_malloc.c:1675)
==2523108== by 0x6EE2F81: g_malloc0 (gmem.c:133)
==2523108== by 0x4057DA3: vnc_connect (vnc.c:3245)
==2523108== by 0x448591B: qio_net_listener_channel_func (net-listener.c:54)
==2523108== by 0x6EDB862: UnknownInlinedFun (gmain.c:3398)
==2523108== by 0x6EDB862: g_main_context_dispatch_unlocked.lto_priv.0 (gmain.c:4249)
==2523108== by 0x6EDBAE4: g_main_context_dispatch (gmain.c:4237)
==2523108== by 0x45EC79F: glib_pollfds_poll (main-loop.c:287)
==2523108== by 0x45EC79F: os_host_main_loop_wait (main-loop.c:310)
==2523108== by 0x45EC79F: main_loop_wait (main-loop.c:589)
==2523108== by 0x423A56D: qemu_main_loop (runstate.c:835)
==2523108== by 0x454F300: qemu_default_main (main.c:37)
==2523108== by 0x73D6574: (below main) (libc_start_call_main.h:58)
==2523108==
The above can be reproduced by launching QEMU with
$ qemu-system-x86_64 -vnc localhost:0,websocket=5700
and then repeatedly running:
for i in {1..100}; do
(echo -n "GET / HTTP/1.1" && sleep 0.05) | nc -w 1 localhost 5700 &
done
CVE-2025-11234
Reported-by: Grant Millar | Cylo <rid@cylo.io>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
We want to switch from qemu_socket_set_block() to newer
qemu_set_blocking(), which provides return status of operation,
to handle errors.
Still, we want to keep qio_channel_socket_readv() interface clean,
as currently it allocate @fds only on success.
So, in case of error, we should close all incoming fds and keep
user's @fds untouched or zero.
Let's make separate functions qio_channel_handle_fds() and
qio_channel_cleanup_fds(), to achieve what we want.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Currently, we just always pass NULL as errp argument. That doesn't
look good.
Some realizations of interface may actually report errors.
Channel-socket realization actually either ignore or crash on
errors, but we are going to straighten it out to always reporting
an errp in further commits.
So, convert all callers to either handle the error (where environment
allows) or explicitly use &error_abort.
Take also a chance to change the return value to more convenient
bool (keeping also in mind, that underlying realizations may
return -1 on failure, not -errno).
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
[DB: fix return type mismatch in TLS/websocket channel
impls for qio_channel_set_blocking]
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The only realization, which may have incoming fds is
qio_channel_socket_readv() (in io/channel-socket.c).
qio_channel_socket_readv() do call (through
qio_channel_socket_copy_fds()) qemu_socket_set_block() and
qemu_set_cloexec() for each fd.
Also, qio_channel_socket_copy_fds() is called at the end of
qio_channel_socket_readv(), on success path.
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
In migration we want to pass fd "as is", not changing its
blocking status.
The only current user of these fds is CPR state (through VMSTATE_FD),
which of-course doesn't want to modify fds on target when source is
still running and use these fds.
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Add a QIO_CHANNEL_FEATURE_CONCURRENT_IO feature flag.
If this is set on a QIOChannelTLS session object, the TLS
session will be marked as requiring thread safety, which
will activate the workaround for GNUTLS bug 1717 if needed.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/qemu-devel/20250718150514.2635338-3-berrange@redhat.com
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Testing reading and writing from qemu-nbd using a unix domain socket
shows that the platform default send buffer size is too low, leading to
poor performance and hight cpu usage.
Add a helper for setting socket send buffer size to be used in NBD code.
It can also be used in other contexts.
We don't need a helper for receive buffer size since it is not used with
unix domain sockets. This is documented for Linux, and not documented
for macOS.
Failing to set the socket buffer size is not a fatal error, but the
caller may want to warn about the failure.
Signed-off-by: Nir Soffer <nirsof@gmail.com>
Message-ID: <20250517201154.88456-2-nirsof@gmail.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Add a read flag that can inform a channel that it's ok to receive an
EOF at any moment. Channels that have some form of strict EOF
tracking, such as TLS session termination, may choose to ignore EOF
errors with the use of this flag.
This is being added for compatibility with older migration streams
that do not include a TLS termination step.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
We want to pass flags into qio_channel_tls_readv() but
qio_channel_readv_full_all_eof() doesn't take a flags argument.
No functional change.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Add a task dispatcher for gnutls_bye similar to the
qio_channel_tls_handshake_task(). The gnutls_bye() call might be
interrupted and so it needs to be rescheduled.
The migration code will make use of this to help the migration
destination identify a premature EOF. Once the session termination is
in place, any EOF that happens before the source issued gnutls_bye()
will be considered an error.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
The function qio_channel_get_peercred() returns a pointer to the
credentials of the peer process connected to this socket.
This credentials structure is defined in <sys/socket.h> as follows:
struct ucred {
pid_t pid; /* Process ID of the sending process */
uid_t uid; /* User ID of the sending process */
gid_t gid; /* Group ID of the sending process */
};
The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.
On platform other than Linux, the function return 0.
Signed-off-by: Anthony Harivel <aharivel@redhat.com>
Link: https://lore.kernel.org/r/20240522153453.1230389-2-aharivel@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new helper function for creating a QIOChannelFile channel with a
duplicated file descriptor. This saves the calling code from having to
do error checking on the dup() call.
Suggested-by: "Daniel P. Berrangé" <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: "Daniel P. Berrangé" <berrange@redhat.com>
Link: https://lore.kernel.org/r/20240311233335.17299-2-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
Introduce basic pwritev/preadv support in the generic channel layer.
Specific implementation will follow for the file channel as this is
required in order to support migration streams with fixed location of
each ram page.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: "Daniel P. Berrangé" <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240229153017.2221-4-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
Add a generic QIOChannel feature SEEKABLE which would be used by the
qemu_file* apis. For the time being this will be only implemented for
file channels.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: "Daniel P. Berrangé" <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240229153017.2221-3-farosas@suse.de
Signed-off-by: Peter Xu <peterx@redhat.com>
The term "QEMU global mutex" is identical to the more widely used Big
QEMU Lock ("BQL"). Update the code comments and documentation to use
"BQL" instead of "QEMU global mutex".
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Message-id: 20240102153529.486531-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The ongoing QEMU multi-queue block layer effort makes it possible for multiple
threads to process I/O in parallel. The nbd block driver is not compatible with
the multi-queue block layer yet because QIOChannel cannot be used easily from
coroutines running in multiple threads. This series changes the QIOChannel API
to make that possible.
In the current API, calling qio_channel_attach_aio_context() sets the
AioContext where qio_channel_yield() installs an fd handler prior to yielding:
qio_channel_attach_aio_context(ioc, my_ctx);
...
qio_channel_yield(ioc); // my_ctx is used here
...
qio_channel_detach_aio_context(ioc);
This API design has limitations: reading and writing must be done in the same
AioContext and moving between AioContexts involves a cumbersome sequence of API
calls that is not suitable for doing on a per-request basis.
There is no fundamental reason why a QIOChannel needs to run within the
same AioContext every time qio_channel_yield() is called. QIOChannel
only uses the AioContext while inside qio_channel_yield(). The rest of
the time, QIOChannel is independent of any AioContext.
In the new API, qio_channel_yield() queries the AioContext from the current
coroutine using qemu_coroutine_get_aio_context(). There is no need to
explicitly attach/detach AioContexts anymore and
qio_channel_attach_aio_context() and qio_channel_detach_aio_context() are gone.
One coroutine can read from the QIOChannel while another coroutine writes from
a different AioContext.
This API change allows the nbd block driver to use QIOChannel from any thread.
It's important to keep in mind that the block driver already synchronizes
QIOChannel access and ensures that two coroutines never read simultaneously or
write simultaneously.
This patch updates all users of qio_channel_attach_aio_context() to the
new API. Most conversions are simple, but vhost-user-server requires a
new qemu_coroutine_yield() call to quiesce the vu_client_trip()
coroutine when not attached to any AioContext.
While the API is has become simpler, there is one wart: QIOChannel has a
special case for the iohandler AioContext (used for handlers that must not run
in nested event loops). I didn't find an elegant way preserve that behavior, so
I added a new API called qio_channel_set_follow_coroutine_ctx(ioc, true|false)
for opting in to the new AioContext model. By default QIOChannel uses the
iohandler AioHandler. Code that formerly called
qio_channel_attach_aio_context() now calls
qio_channel_set_follow_coroutine_ctx(ioc, true) once after the QIOChannel is
created.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20230830224802.493686-5-stefanha@redhat.com>
[eblake: also fix migration/rdma.c]
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230823065335.1919380-3-mjt@tls.msk.ru>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
The TLS handshake make take some time to complete, during which time an
I/O watch might be registered with the main loop. If the owner of the
I/O channel invokes qio_channel_close() while the handshake is waiting
to continue the I/O watch must be removed. Failing to remove it will
later trigger the completion callback which the owner is not expecting
to receive. In the case of the VNC server, this results in a SEGV as
vnc_disconnect_start() tries to shutdown a client connection that is
already gone / NULL.
CVE-2023-3354
Reported-by: jiangyegen <jiangyegen@huawei.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
nbd_drained_poll() generally runs in the main thread, not whatever
iothread the NBD server coroutine is meant to run in, so it can't
directly reenter the coroutines to wake them up.
The code seems to have the right intention, it specifies the correct
AioContext when it calls qemu_aio_coroutine_enter(). However, this
functions doesn't schedule the coroutine to run in that AioContext, but
it assumes it is already called in the home thread of the AioContext.
To fix this, add a new thread-safe qio_channel_wake_read() that can be
called in the main thread to wake up the coroutine in its AioContext,
and use this in nbd_drained_poll().
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230517152834.277483-3-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There should be no paths from a coroutine_fn to aio_poll, however in
practice coroutine_mixed_fn will call aio_poll in the !qemu_in_coroutine()
path. By marking mixed functions, we can track accurately the call paths
that execute entirely in coroutine context, and find more missing
coroutine_fn markers. This results in more accurate checks that
coroutine code does not end up blocking.
If the marking were extended transitively to all functions that call
these ones, static analysis could be done much more efficiently.
However, this is a start and makes it possible to use vrc's path-based
searches to find potential bugs where coroutine_fns call blocking functions.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSG_PEEK peeks at the channel, The data is treated as unread and
the next read shall still return this data. This support is
currently added only for socket class. Extra parameter 'flags'
is added to io_readv calls to pass extra read flags like MSG_PEEK.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Suggested-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: manish.mishra <manish.mishra@nutanix.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
qemu/coroutine.h and qemu/lockable.h include each other.
They need each other only in macro expansions, so we could simply drop
both inclusions to break the loop, and add suitable includes to files
that expand the macros.
Instead, move a part of qemu/coroutine.h to new qemu/coroutine-core.h
so that qemu/coroutine-core.h doesn't need qemu/lockable.h, and
qemu/lockable.h only needs qemu/coroutine-core.h. Result:
qemu/coroutine.h includes qemu/lockable.h includes
qemu/coroutine-core.h.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221221131435.3851212-5-armbru@redhat.com>
[Semantic rebase conflict with 7c10cb38cc "accel/tcg: Add debuginfo
support" resolved]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221121085054.683122-11-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
The initial implementation was changing the pipe state created by GLib
to PIPE_NOWAIT, but it turns out it doesn't work (read/write returns an
error). Since reading may return less than the requested amount, it
seems to be non-blocking already. However, the IO operation may block
until the FD is ready, I can't find good sources of information, to be
safe we can just poll for readiness before.
Alternatively, we could setup the FDs ourself, and use UNIX sockets on
Windows, which can be used in blocking/non-blocking mode. I haven't
tried it, as I am not sure it is necessary.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20221006113657.2656108-6-marcandre.lureau@redhat.com>
Simplify qio_channel_command_new_spawn() with GSpawn API. This will
allow to build for WIN32 in the following patches.
As pointed out by Daniel Berrangé: there is a change in semantics here
too. The current code only touches stdin/stdout/stderr. Any other FDs
which do NOT have O_CLOEXEC set will be inherited. With the new code,
all FDs except stdin/out/err will be explicitly closed, because we don't
set the flag G_SPAWN_LEAVE_DESCRIPTORS_OPEN. The only place we use
QIOChannelCommand today is the migration exec: protocol, and that is
only declared to use stdin/stdout.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20221006113657.2656108-5-marcandre.lureau@redhat.com>
This is for code which needs a portable equivalent to a QIOChannelFile
connected to /dev/null.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
For CONFIG_LINUX, implement the new zero copy flag and the optional callback
io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
feature is available in the host kernel, which is checked on
qio_channel_socket_connect_sync()
qio_channel_socket_flush() was implemented by counting how many times
sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue, in order to find how many of them finished sending.
Flush will loop until those counters are the same, or until some error occurs.
Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
some caution is necessary to avoid overwriting any buffer before it's sent.
If something like this happen, a newer version of the buffer may be sent instead.
- If this is a problem, it's recommended to call qio_channel_flush() before freeing
or re-using the buffer.
2: Locked memory
- When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, it may require
a larger amount of locked memory than usually available to non-root user.
- If the required amount of locked memory is not available, writev_zero_copy
will return an error, which can abort an operation like migration,
- Because of this, when an user code wants to add zero copy as a feature, it
requires a mechanism to disable it, so it can still be accessible to less
privileged users.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20220513062836.965425-4-leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Add flags to io_writev and introduce io_flush as optional callback to
QIOChannelClass, allowing the implementation of zero copy writes by
subclasses.
How to use them:
- Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
- Wait write completion with qio_channel_flush().
Notes:
As some zero copy write implementations work asynchronously, it's
recommended to keep the write buffer untouched until the return of
qio_channel_flush(), to avoid the risk of sending an updated buffer
instead of the buffer state during write.
As io_flush callback is optional, if a subclass does not implement it, then:
- io_flush will return 0 without changing anything.
Also, some functions like qio_channel_writev_full_all() were adapted to
receive a flag parameter. That allows shared code between zero copy and
non-zero copy writev, and also an easier implementation on new flags.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20220513062836.965425-3-leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The function isn't used outside of qio_channel_command_new_spawn(),
which is !win32-specific.
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Adds qio_channel_readv_full_all_eof() and qio_channel_readv_full_all()
to read both data and FDs. Refactors existing code to use these helpers.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-id: b059c4cc0fb741e794d644c144cc21372cad877d.1611938319.git.jag.raman@oracle.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Adds qio_channel_writev_full_all() to transmit both data and FDs.
Refactors existing code to use this helper.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-id: 480fbf1fe4152495d60596c9b665124549b426a5.1611938319.git.jag.raman@oracle.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Migration and yank code assume that qio_channel_shutdown is thread
-safe and can be called from qmp oob handler. Document this after
checking the code.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <32b8c27e256da043f0f00db05bd7ab8fbc506070.1609167865.git.lukasstraub2@web.de>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
There is no "version 2" of the "Lesser" General Public License.
It is either "GPL version 2.0" or "Lesser GPL version 2.1".
This patch replaces all occurrences of "Lesser GPL version 2" with
"Lesser GPL version 2.1" in comment section.
Signed-off-by: Chetan Pant <chetan4windows@gmail.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20201014134033.14095-1-chetan4windows@gmail.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
One of the goals of having less boilerplate on QOM declarations
is to avoid human error. Requiring an extra argument that is
never used is an opportunity for mistakes.
Remove the unused argument from OBJECT_DECLARE_TYPE and
OBJECT_DECLARE_SIMPLE_TYPE.
Coccinelle patch used to convert all users of the macros:
@@
declarer name OBJECT_DECLARE_TYPE;
identifier InstanceType, ClassType, lowercase, UPPERCASE;
@@
OBJECT_DECLARE_TYPE(InstanceType, ClassType,
- lowercase,
UPPERCASE);
@@
declarer name OBJECT_DECLARE_SIMPLE_TYPE;
identifier InstanceType, lowercase, UPPERCASE;
@@
OBJECT_DECLARE_SIMPLE_TYPE(InstanceType,
- lowercase,
UPPERCASE);
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Paul Durrant <paul@xen.org>
Acked-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20200916182519.415636-4-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
The requirement to specify the parent class type makes the macro
harder to use and easy to misuse (silent bugs can be introduced
if the wrong struct type is specified).
Simplify the macro by just not declaring any class struct,
allowing us to remove the class_size field from the TypeInfo
variables for those types.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200916182519.415636-3-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Some typedefs and macros are defined after the type check macros.
This makes it difficult to automatically replace their
definitions with OBJECT_DECLARE_TYPE.
Patch generated using:
$ ./scripts/codeconverter/converter.py -i \
--pattern=QOMStructTypedefSplit $(git grep -l '' -- '*.[ch]')
which will split "typdef struct { ... } TypedefName"
declarations.
Followed by:
$ ./scripts/codeconverter/converter.py -i --pattern=MoveSymbols \
$(git grep -l '' -- '*.[ch]')
which will:
- move the typedefs and #defines above the type check macros
- add missing #include "qom/object.h" lines if necessary
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-9-ehabkost@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-10-ehabkost@redhat.com>
Message-Id: <20200831210740.126168-11-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>