From d127294f265e6a17f8d614f2bef7df8455e81f56 Mon Sep 17 00:00:00 2001 From: Shameer Kolothum Date: Tue, 3 Dec 2024 12:49:43 +0000 Subject: [PATCH 01/25] migration/multifd: Fix compile error caused by page_size usage >From Commit 90fa121c6c07 ("migration/multifd: Inline page_size and page_count") onwards page_size is not part of MutiFD*Params but uses an inline constant instead. However, it missed updating an old usage, causing a compile error. Fixes: 90fa121c6c07 ("migration/multifd: Inline page_size and page_count") Signed-off-by: Shameer Kolothum Reviewed-by: Fabiano Rosas Message-Id: <20241203124943.52572-1-shameerali.kolothum.thodi@huawei.com> Signed-off-by: Fabiano Rosas --- migration/multifd-uadk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/migration/multifd-uadk.c b/migration/multifd-uadk.c index 6e6a290ae9..6895c1f65a 100644 --- a/migration/multifd-uadk.c +++ b/migration/multifd-uadk.c @@ -169,7 +169,7 @@ static int multifd_uadk_send_prepare(MultiFDSendParams *p, Error **errp) .src_len = page_size, .dst = buf, /* Set dst_len to double the src in case compressed out >= page_size */ - .dst_len = p->page_size * 2, + .dst_len = page_size * 2, }; if (uadk_data->handle) { From 1d457daf868191dc1c0b58dc7280799964f40334 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:49 -0500 Subject: [PATCH 02/25] migration/multifd: Further remove the SYNC on complete Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in complete()") stopped sending the RAM_SAVE_FLAG_MULTIFD_FLUSH flag at ram_save_complete(), because the sync on the destination side is not needed due to the last iteration of find_dirty_block() having already done it. However, that commit overlooked that multifd_ram_flush_and_sync() on the source side is also not needed at ram_save_complete(), for the same reason. Moreover, removing the RAM_SAVE_FLAG_MULTIFD_FLUSH but keeping the multifd_ram_flush_and_sync() means that currently the recv threads will hang when receiving the MULTIFD_FLAG_SYNC message, waiting for the destination sync which only happens when RAM_SAVE_FLAG_MULTIFD_FLUSH is received. Luckily, multifd is still all working fine because recv side cleanup code (mostly multifd_recv_sync_main()) is smart enough to make sure even if recv threads are stuck at SYNC it'll get kicked out. And since this is the completion phase of migration, nothing else will be sent after the SYNCs. This needs to be fixed because in the future VFIO will have data to push after ram_save_complete() and we don't want the recv thread to be stuck in the MULTIFD_FLAG_SYNC message. Remove the unnecessary (and buggy) invocation of multifd_ram_flush_and_sync(). For very old binaries (multifd_flush_after_each_section==true), the flush_and_sync is still needed because each EOS received on destination will enforce all-channel sync once. Stable branches do not need this patch, as no real bug I can think of that will go wrong there.. so not attaching Fixes to be clear on the backport not needed. Reviewed-by: Fabiano Rosas Signed-off-by: Peter Xu Message-Id: <20241206224755.1108686-2-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/ram.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index a60666d3f6..f0ddd5eabe 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque) } } - ret = multifd_ram_flush_and_sync(); - if (ret < 0) { - return ret; + if (migrate_multifd() && + migrate_multifd_flush_after_each_section()) { + /* + * Only the old dest QEMU will need this sync, because each EOS + * will require one SYNC message on each channel. + */ + ret = multifd_ram_flush_and_sync(); + if (ret < 0) { + return ret; + } } if (migrate_mapped_ram()) { From 10801e08ac926a5a6083a9bd2ff87b153ccb95b1 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:50 -0500 Subject: [PATCH 03/25] migration/multifd: Allow to sync with sender threads only Teach multifd_send_sync_main() to sync with threads only. We already have such requests, which is when mapped-ram is enabled with multifd. In that case, no SYNC messages will be pushed to the stream when multifd syncs the sender threads because there's no destination threads waiting for that. The whole point of the sync is to make sure all threads finished their jobs. So fundamentally we have a request to do the sync in different ways: - Either to sync the threads only, - Or to sync the threads but also with the destination side. Mapped-ram did it already because of the use_packet check in the sync handler of the sender thread. It works. However it may stop working when e.g. VFIO may start to reuse multifd channels to push device states. In that case VFIO has similar request on "thread-only sync" however we can't check a flag because such sync request can still come from RAM which needs the on-wire notifications. Paving way for that by allowing the multifd_send_sync_main() to specify what kind of sync the caller needs. We can use it for mapped-ram already. No functional change intended. Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206224755.1108686-3-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/multifd-nocomp.c | 7 ++++++- migration/multifd.c | 17 +++++++++++------ migration/multifd.h | 23 ++++++++++++++++++++--- 3 files changed, 37 insertions(+), 10 deletions(-) diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c index 55191152f9..219f9e58ef 100644 --- a/migration/multifd-nocomp.c +++ b/migration/multifd-nocomp.c @@ -345,6 +345,8 @@ retry: int multifd_ram_flush_and_sync(void) { + MultiFDSyncReq req; + if (!migrate_multifd()) { return 0; } @@ -356,7 +358,10 @@ int multifd_ram_flush_and_sync(void) } } - return multifd_send_sync_main(); + /* File migrations only need to sync with threads */ + req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL : MULTIFD_SYNC_ALL; + + return multifd_send_sync_main(req); } bool multifd_send_prepare_common(MultiFDSendParams *p) diff --git a/migration/multifd.c b/migration/multifd.c index 4f973d70e0..64e0ac2488 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -523,11 +523,13 @@ static int multifd_zero_copy_flush(QIOChannel *c) return ret; } -int multifd_send_sync_main(void) +int multifd_send_sync_main(MultiFDSyncReq req) { int i; bool flush_zero_copy; + assert(req != MULTIFD_SYNC_NONE); + flush_zero_copy = migrate_zero_copy_send(); for (i = 0; i < migrate_multifd_channels(); i++) { @@ -543,8 +545,8 @@ int multifd_send_sync_main(void) * We should be the only user so far, so not possible to be set by * others concurrently. */ - assert(qatomic_read(&p->pending_sync) == false); - qatomic_set(&p->pending_sync, true); + assert(qatomic_read(&p->pending_sync) == MULTIFD_SYNC_NONE); + qatomic_set(&p->pending_sync, req); qemu_sem_post(&p->sem); } for (i = 0; i < migrate_multifd_channels(); i++) { @@ -635,14 +637,17 @@ static void *multifd_send_thread(void *opaque) */ qatomic_store_release(&p->pending_job, false); } else { + MultiFDSyncReq req = qatomic_read(&p->pending_sync); + /* * If not a normal job, must be a sync request. Note that * pending_sync is a standalone flag (unlike pending_job), so * it doesn't require explicit memory barriers. */ - assert(qatomic_read(&p->pending_sync)); + assert(req != MULTIFD_SYNC_NONE); - if (use_packets) { + /* Only push the SYNC message if it involves a remote sync */ + if (req == MULTIFD_SYNC_ALL) { p->flags = MULTIFD_FLAG_SYNC; multifd_send_fill_packet(p); ret = qio_channel_write_all(p->c, (void *)p->packet, @@ -654,7 +659,7 @@ static void *multifd_send_thread(void *opaque) stat64_add(&mig_stats.multifd_bytes, p->packet_len); } - qatomic_set(&p->pending_sync, false); + qatomic_set(&p->pending_sync, MULTIFD_SYNC_NONE); qemu_sem_post(&p->sem_sync); } } diff --git a/migration/multifd.h b/migration/multifd.h index 50d58c0c9c..6493512305 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -19,6 +19,22 @@ typedef struct MultiFDRecvData MultiFDRecvData; typedef struct MultiFDSendData MultiFDSendData; +typedef enum { + /* No sync request */ + MULTIFD_SYNC_NONE = 0, + /* Sync locally on the sender threads without pushing messages */ + MULTIFD_SYNC_LOCAL, + /* + * Sync not only on the sender threads, but also push MULTIFD_FLAG_SYNC + * message to the wire for each iochannel (which is for a remote sync). + * + * When remote sync is used, need to be paired with a follow up + * RAM_SAVE_FLAG_EOS / RAM_SAVE_FLAG_MULTIFD_FLUSH message on the main + * channel. + */ + MULTIFD_SYNC_ALL, +} MultiFDSyncReq; + bool multifd_send_setup(void); void multifd_send_shutdown(void); void multifd_send_channel_created(void); @@ -28,7 +44,7 @@ void multifd_recv_shutdown(void); bool multifd_recv_all_channels_created(void); void multifd_recv_new_channel(QIOChannel *ioc, Error **errp); void multifd_recv_sync_main(void); -int multifd_send_sync_main(void); +int multifd_send_sync_main(MultiFDSyncReq req); bool multifd_queue_page(RAMBlock *block, ram_addr_t offset); bool multifd_recv(void); MultiFDRecvData *multifd_get_recv_data(void); @@ -143,7 +159,7 @@ typedef struct { /* multifd flags for each packet */ uint32_t flags; /* - * The sender thread has work to do if either of below boolean is set. + * The sender thread has work to do if either of below field is set. * * @pending_job: a job is pending * @pending_sync: a sync request is pending @@ -152,7 +168,8 @@ typedef struct { * cleared by the multifd sender threads. */ bool pending_job; - bool pending_sync; + MultiFDSyncReq pending_sync; + MultiFDSendData *data; /* thread local variables. No locking required */ From 604b4749c58f676aa37bd4d96496152f36f3b293 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:51 -0500 Subject: [PATCH 04/25] migration/ram: Move RAM_SAVE_FLAG* into ram.h Firstly, we're going to use the multifd flag soon in multifd code, so ram.c isn't gonna work. Secondly, we have a separate RDMA flag dangling around, which is definitely not obvious. There's one comment that helps, but not too much. Put all RAM save flags altogether, so nothing will get overlooked. Add a section explain why we can't use bits over 0x200. Remove RAM_SAVE_FLAG_FULL as it's already not used in QEMU, as the comment explained. Reviewed-by: Fabiano Rosas Signed-off-by: Peter Xu Message-Id: <20241206224755.1108686-4-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/ram.c | 21 --------------------- migration/ram.h | 28 ++++++++++++++++++++++++++++ migration/rdma.h | 7 ------- 3 files changed, 28 insertions(+), 28 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index f0ddd5eabe..01521de71f 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -71,27 +71,6 @@ /***********************************************************/ /* ram save/restore */ -/* - * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it - * worked for pages that were filled with the same char. We switched - * it to only search for the zero value. And to avoid confusion with - * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it. - * - * RAM_SAVE_FLAG_FULL was obsoleted in 2009. - * - * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1. - */ -#define RAM_SAVE_FLAG_FULL 0x01 -#define RAM_SAVE_FLAG_ZERO 0x02 -#define RAM_SAVE_FLAG_MEM_SIZE 0x04 -#define RAM_SAVE_FLAG_PAGE 0x08 -#define RAM_SAVE_FLAG_EOS 0x10 -#define RAM_SAVE_FLAG_CONTINUE 0x20 -#define RAM_SAVE_FLAG_XBZRLE 0x40 -/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */ -#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200 -/* We can't use any flag that is bigger than 0x200 */ - /* * mapped-ram migration supports O_DIRECT, so we need to make sure the * userspace buffer, the IO operation size and the file offset are diff --git a/migration/ram.h b/migration/ram.h index 0d1981f888..921c39a2c5 100644 --- a/migration/ram.h +++ b/migration/ram.h @@ -33,6 +33,34 @@ #include "exec/cpu-common.h" #include "io/channel.h" +/* + * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it + * worked for pages that were filled with the same char. We switched + * it to only search for the zero value. And to avoid confusion with + * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it. + * + * RAM_SAVE_FLAG_FULL (0x01) was obsoleted in 2009. + * + * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1. + * + * RAM_SAVE_FLAG_HOOK is only used in RDMA. Whenever this is found in the + * data stream, the flags will be passed to rdma functions in the + * incoming-migration side. + * + * We can't use any flag that is bigger than 0x200, because the flags are + * always assumed to be encoded in a ramblock address offset, which is + * multiple of PAGE_SIZE. Here it means QEMU supports migration with any + * architecture that has PAGE_SIZE>=1K (0x400). + */ +#define RAM_SAVE_FLAG_ZERO 0x002 +#define RAM_SAVE_FLAG_MEM_SIZE 0x004 +#define RAM_SAVE_FLAG_PAGE 0x008 +#define RAM_SAVE_FLAG_EOS 0x010 +#define RAM_SAVE_FLAG_CONTINUE 0x020 +#define RAM_SAVE_FLAG_XBZRLE 0x040 +#define RAM_SAVE_FLAG_HOOK 0x080 +#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200 + extern XBZRLECacheStats xbzrle_counters; /* Should be holding either ram_list.mutex, or the RCU lock. */ diff --git a/migration/rdma.h b/migration/rdma.h index a8d27f33b8..f55f28bbed 100644 --- a/migration/rdma.h +++ b/migration/rdma.h @@ -33,13 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp); #define RAM_CONTROL_ROUND 1 #define RAM_CONTROL_FINISH 3 -/* - * Whenever this is found in the data stream, the flags - * will be passed to rdma functions in the incoming-migration - * side. - */ -#define RAM_SAVE_FLAG_HOOK 0x80 - #define RAM_SAVE_CONTROL_NOT_SUPP -1000 #define RAM_SAVE_CONTROL_DELAYED -2000 From e5f14aa5fe7f44649f5413558cac81c09d6c7f93 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:52 -0500 Subject: [PATCH 05/25] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages RAM_SAVE_FLAG_MULTIFD_FLUSH message should always be correlated to a sync request on src. Unify such message into one place, and conditionally send the message only if necessary. Reviewed-by: Fabiano Rosas Signed-off-by: Peter Xu Message-Id: <20241206224755.1108686-5-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/multifd-nocomp.c | 27 +++++++++++++++++++++++++-- migration/multifd.h | 2 +- migration/ram.c | 18 ++++-------------- 3 files changed, 30 insertions(+), 17 deletions(-) diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c index 219f9e58ef..58372db0f4 100644 --- a/migration/multifd-nocomp.c +++ b/migration/multifd-nocomp.c @@ -20,6 +20,7 @@ #include "qemu/cutils.h" #include "qemu/error-report.h" #include "trace.h" +#include "qemu-file.h" static MultiFDSendData *multifd_ram_send; @@ -343,9 +344,10 @@ retry: return true; } -int multifd_ram_flush_and_sync(void) +int multifd_ram_flush_and_sync(QEMUFile *f) { MultiFDSyncReq req; + int ret; if (!migrate_multifd()) { return 0; @@ -361,7 +363,28 @@ int multifd_ram_flush_and_sync(void) /* File migrations only need to sync with threads */ req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL : MULTIFD_SYNC_ALL; - return multifd_send_sync_main(req); + ret = multifd_send_sync_main(req); + if (ret) { + return ret; + } + + /* If we don't need to sync with remote at all, nothing else to do */ + if (req == MULTIFD_SYNC_LOCAL) { + return 0; + } + + /* + * Old QEMUs don't understand RAM_SAVE_FLAG_MULTIFD_FLUSH, it relies + * on RAM_SAVE_FLAG_EOS instead. + */ + if (migrate_multifd_flush_after_each_section()) { + return 0; + } + + qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH); + qemu_fflush(f); + + return 0; } bool multifd_send_prepare_common(MultiFDSendParams *p) diff --git a/migration/multifd.h b/migration/multifd.h index 6493512305..0fef431f6b 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -354,7 +354,7 @@ static inline uint32_t multifd_ram_page_count(void) void multifd_ram_save_setup(void); void multifd_ram_save_cleanup(void); -int multifd_ram_flush_and_sync(void); +int multifd_ram_flush_and_sync(QEMUFile *f); size_t multifd_ram_payload_size(void); void multifd_ram_fill_packet(MultiFDSendParams *p); int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp); diff --git a/migration/ram.c b/migration/ram.c index 01521de71f..ef683d11f0 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1306,15 +1306,10 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss) (!migrate_multifd_flush_after_each_section() || migrate_mapped_ram())) { QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel; - int ret = multifd_ram_flush_and_sync(); + int ret = multifd_ram_flush_and_sync(f); if (ret < 0) { return ret; } - - if (!migrate_mapped_ram()) { - qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH); - qemu_fflush(f); - } } /* Hit the end of the list */ @@ -3044,18 +3039,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp) } bql_unlock(); - ret = multifd_ram_flush_and_sync(); + ret = multifd_ram_flush_and_sync(f); bql_lock(); if (ret < 0) { error_setg(errp, "%s: multifd synchronization failed", __func__); return ret; } - if (migrate_multifd() && !migrate_multifd_flush_after_each_section() - && !migrate_mapped_ram()) { - qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH); - } - qemu_put_be64(f, RAM_SAVE_FLAG_EOS); ret = qemu_fflush(f); if (ret < 0) { @@ -3190,7 +3180,7 @@ out: if (ret >= 0 && migration_is_running()) { if (migrate_multifd() && migrate_multifd_flush_after_each_section() && !migrate_mapped_ram()) { - ret = multifd_ram_flush_and_sync(); + ret = multifd_ram_flush_and_sync(f); if (ret < 0) { return ret; } @@ -3268,7 +3258,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque) * Only the old dest QEMU will need this sync, because each EOS * will require one SYNC message on each channel. */ - ret = multifd_ram_flush_and_sync(); + ret = multifd_ram_flush_and_sync(f); if (ret < 0) { return ret; } From de695b1399242da0c618049932a9a6f1a0a0a4f1 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:53 -0500 Subject: [PATCH 06/25] migration/multifd: Remove sync processing on postcopy Multifd never worked with postcopy, at least yet so far. Remove the sync processing there, because it's confusing, and they should never appear. Now if RAM_SAVE_FLAG_MULTIFD_FLUSH is observed, we fail hard instead of trying to invoke multifd code. Reviewed-by: Fabiano Rosas Signed-off-by: Peter Xu Message-Id: <20241206224755.1108686-6-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/ram.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index ef683d11f0..9eeb77665b 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -3772,15 +3772,7 @@ int ram_load_postcopy(QEMUFile *f, int channel) TARGET_PAGE_SIZE); } break; - case RAM_SAVE_FLAG_MULTIFD_FLUSH: - multifd_recv_sync_main(); - break; case RAM_SAVE_FLAG_EOS: - /* normal exit */ - if (migrate_multifd() && - migrate_multifd_flush_after_each_section()) { - multifd_recv_sync_main(); - } break; default: error_report("Unknown combination of migration flags: 0x%x" From 1aa81c3098f0270905deff516d455604fcbfaab5 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:54 -0500 Subject: [PATCH 07/25] migration/multifd: Cleanup src flushes on condition check The src flush condition check is over complicated, and it's getting more out of control if postcopy will be involved. In general, we have two modes to do the sync: legacy or modern ways. Legacy uses per-section flush, modern uses per-round flush. Mapped-ram always uses the modern, which is per-round. Introduce two helpers, which can greatly simplify the code, and hopefully make it readable again. Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206224755.1108686-7-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/multifd-nocomp.c | 42 ++++++++++++++++++++++++++++++++++++++ migration/multifd.h | 2 ++ migration/ram.c | 10 +++------ 3 files changed, 47 insertions(+), 7 deletions(-) diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c index 58372db0f4..c1f686c0ce 100644 --- a/migration/multifd-nocomp.c +++ b/migration/multifd-nocomp.c @@ -344,6 +344,48 @@ retry: return true; } +/* + * We have two modes for multifd flushes: + * + * - Per-section mode: this is the legacy way to flush, it requires one + * MULTIFD_FLAG_SYNC message for each RAM_SAVE_FLAG_EOS. + * + * - Per-round mode: this is the modern way to flush, it requires one + * MULTIFD_FLAG_SYNC message only for each round of RAM scan. Normally + * it's paired with a new RAM_SAVE_FLAG_MULTIFD_FLUSH message in network + * based migrations. + * + * One thing to mention is mapped-ram always use the modern way to sync. + */ + +/* Do we need a per-section multifd flush (legacy way)? */ +bool multifd_ram_sync_per_section(void) +{ + if (!migrate_multifd()) { + return false; + } + + if (migrate_mapped_ram()) { + return false; + } + + return migrate_multifd_flush_after_each_section(); +} + +/* Do we need a per-round multifd flush (modern way)? */ +bool multifd_ram_sync_per_round(void) +{ + if (!migrate_multifd()) { + return false; + } + + if (migrate_mapped_ram()) { + return true; + } + + return !migrate_multifd_flush_after_each_section(); +} + int multifd_ram_flush_and_sync(QEMUFile *f) { MultiFDSyncReq req; diff --git a/migration/multifd.h b/migration/multifd.h index 0fef431f6b..bd785b9873 100644 --- a/migration/multifd.h +++ b/migration/multifd.h @@ -355,6 +355,8 @@ static inline uint32_t multifd_ram_page_count(void) void multifd_ram_save_setup(void); void multifd_ram_save_cleanup(void); int multifd_ram_flush_and_sync(QEMUFile *f); +bool multifd_ram_sync_per_round(void); +bool multifd_ram_sync_per_section(void); size_t multifd_ram_payload_size(void); void multifd_ram_fill_packet(MultiFDSendParams *p); int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp); diff --git a/migration/ram.c b/migration/ram.c index 9eeb77665b..d9336d8a09 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1302,9 +1302,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss) pss->page = 0; pss->block = QLIST_NEXT_RCU(pss->block, next); if (!pss->block) { - if (migrate_multifd() && - (!migrate_multifd_flush_after_each_section() || - migrate_mapped_ram())) { + if (multifd_ram_sync_per_round()) { QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel; int ret = multifd_ram_flush_and_sync(f); if (ret < 0) { @@ -3178,8 +3176,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) out: if (ret >= 0 && migration_is_running()) { - if (migrate_multifd() && migrate_multifd_flush_after_each_section() && - !migrate_mapped_ram()) { + if (multifd_ram_sync_per_section()) { ret = multifd_ram_flush_and_sync(f); if (ret < 0) { return ret; @@ -3252,8 +3249,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque) } } - if (migrate_multifd() && - migrate_multifd_flush_after_each_section()) { + if (multifd_ram_sync_per_section()) { /* * Only the old dest QEMU will need this sync, because each EOS * will require one SYNC message on each channel. From baab4473dba2b85adf3c0622b92bc209f7a8dec0 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 17:47:55 -0500 Subject: [PATCH 08/25] migration/multifd: Document the reason to sync for save_setup() It's not straightforward to see why src QEMU needs to sync multifd during setup() phase. After all, there's no page queued at that point. For old QEMUs, there's a solid reason: EOS requires it to work. While it's clueless on the new QEMUs which do not take EOS message as sync requests. One will figure that out only when this is conditionally removed. In fact, the author did try it out. Logically we could still avoid doing this on new machine types, however that needs a separate compat field and that can be an overkill in some trivial overhead in setup() phase. Let's instead document it completely, to avoid someone else tries this again and do the debug one more time, or anyone confused on why this ever existed. Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206224755.1108686-8-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/ram.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/migration/ram.c b/migration/ram.c index d9336d8a09..ce28328141 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -3036,6 +3036,31 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp) migration_ops->ram_save_target_page = ram_save_target_page_legacy; } + /* + * This operation is unfortunate.. + * + * For legacy QEMUs using per-section sync + * ======================================= + * + * This must exist because the EOS below requires the SYNC messages + * per-channel to work. + * + * For modern QEMUs using per-round sync + * ===================================== + * + * Logically such sync is not needed, and recv threads should not run + * until setup ready (using things like channels_ready on src). Then + * we should be all fine. + * + * However even if we add channels_ready to recv side in new QEMUs, old + * QEMU won't have them so this sync will still be needed to make sure + * multifd recv threads won't start processing guest pages early before + * ram_load_setup() is properly done. + * + * Let's stick with this. Fortunately the overhead is low to sync + * during setup because the VM is running, so at least it's not + * accounted as part of downtime. + */ bql_unlock(); ret = multifd_ram_flush_and_sync(f); bql_lock(); From b93d897ea2f0abbe7fc341a9ac176b5ecd0f3c93 Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Fri, 13 Dec 2024 13:01:19 -0300 Subject: [PATCH 09/25] migration/multifd: Fix compat with QEMU < 9.0 Commit f5f48a7891 ("migration/multifd: Separate SYNC request with normal jobs") changed the multifd source side to stop sending data along with the MULTIFD_FLAG_SYNC, effectively introducing the concept of a SYNC-only packet. Relying on that, commit d7e58f412c ("migration/multifd: Don't send ram data during SYNC") later came along and skipped reading data from SYNC packets. In a versions timeline like this: 8.2 f5f48a7 9.0 9.1 d7e58f41 9.2 The issue arises that QEMUs < 9.0 still send data along with SYNC, but QEMUs > 9.1 don't gather that data anymore. This leads to various kinds of migration failures due to desync/missing data. Stop checking for a SYNC packet on the destination and unconditionally unfill the packet. >From now on: old -> new: the source sends data + sync, destination reads normally new -> new: source sends only sync, destination reads zeros new -> old: source sends only sync, destination reads zeros CC: qemu-stable@nongnu.org Fixes: d7e58f412c ("migration/multifd: Don't send ram data during SYNC") Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2720 Reviewed-by: Peter Xu Signed-off-by: Fabiano Rosas Message-Id: <20241213160120.23880-2-farosas@suse.de> Signed-off-by: Fabiano Rosas --- migration/multifd.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/migration/multifd.c b/migration/multifd.c index 64e0ac2488..ab73d6d984 100644 --- a/migration/multifd.c +++ b/migration/multifd.c @@ -252,9 +252,8 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp) p->packet_num = be64_to_cpu(packet->packet_num); p->packets_recved++; - if (!(p->flags & MULTIFD_FLAG_SYNC)) { - ret = multifd_ram_unfill_packet(p, errp); - } + /* Always unfill, old QEMUs (<9.0) send data along with SYNC */ + ret = multifd_ram_unfill_packet(p, errp); trace_multifd_recv_unfill(p->id, p->packet_num, p->flags, p->next_packet_size); @@ -1156,9 +1155,13 @@ static void *multifd_recv_thread(void *opaque) flags = p->flags; /* recv methods don't know how to handle the SYNC flag */ p->flags &= ~MULTIFD_FLAG_SYNC; - if (!(flags & MULTIFD_FLAG_SYNC)) { - has_data = p->normal_num || p->zero_num; - } + + /* + * Even if it's a SYNC packet, this needs to be set + * because older QEMUs (<9.0) still send data along with + * the SYNC packet. + */ + has_data = p->normal_num || p->zero_num; qemu_mutex_unlock(&p->mutex); } else { /* From 7815f69867da92335055d4b5248430b0f122ce4e Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 18:08:33 -0500 Subject: [PATCH 10/25] migration: Add helper to get target runstate In 99% cases, after QEMU migrates to dest host, it tries to detect the target VM runstate using global_state_get_runstate(). There's one outlier so far which is Xen that won't send global state. That's the major reason why global_state_received() check was always there together with global_state_get_runstate(). However it's utterly confusing why global_state_received() has anything to do with "let's start VM or not". Provide a helper to explain it, then we have an unified entry for getting the target dest QEMU runstate after migration. Suggested-by: Fabiano Rosas Signed-off-by: Peter Xu Message-Id: <20241206230838.1111496-2-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/migration.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index df61ca4e93..969b03cdcd 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -135,6 +135,21 @@ static bool migration_needs_multiple_sockets(void) return migrate_multifd() || migrate_postcopy_preempt(); } +static RunState migration_get_target_runstate(void) +{ + /* + * When the global state is not migrated, it means we don't know the + * runstate of the src QEMU. We don't have much choice but assuming + * the VM is running. NOTE: this is pretty rare case, so far only Xen + * uses it. + */ + if (!global_state_received()) { + return RUN_STATE_RUNNING; + } + + return global_state_get_runstate(); +} + static bool transport_supports_multi_channels(MigrationAddress *addr) { if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) { @@ -735,8 +750,7 @@ static void process_incoming_migration_bh(void *opaque) * unless we really are starting the VM. */ if (!migrate_late_block_activate() || - (autostart && (!global_state_received() || - runstate_is_live(global_state_get_runstate())))) { + (autostart && runstate_is_live(migration_get_target_runstate()))) { /* Make sure all file formats throw away their mutable metadata. * If we get an error here, just don't restart the VM yet. */ bdrv_activate_all(&local_err); @@ -759,8 +773,7 @@ static void process_incoming_migration_bh(void *opaque) dirty_bitmap_mig_before_vm_start(); - if (!global_state_received() || - runstate_is_live(global_state_get_runstate())) { + if (runstate_is_live(migration_get_target_runstate())) { if (autostart) { vm_start(); } else { From e4e5e89bbd8e731e86735d9d25b7b5f49e8f08b6 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 18:08:34 -0500 Subject: [PATCH 11/25] qmp/cont: Only activate disks if migration completed As the comment says, the activation of disks is for the case where migration has completed, rather than when QEMU is still during migration (RUN_STATE_INMIGRATE). Move the code over to reflect what the comment is describing. Cc: Kevin Wolf Cc: Markus Armbruster Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206230838.1111496-3-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- monitor/qmp-cmds.c | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c index 34f215097c..415e645130 100644 --- a/monitor/qmp-cmds.c +++ b/monitor/qmp-cmds.c @@ -96,21 +96,23 @@ void qmp_cont(Error **errp) } } - /* Continuing after completed migration. Images have been inactivated to - * allow the destination to take control. Need to get control back now. - * - * If there are no inactive block nodes (e.g. because the VM was just - * paused rather than completing a migration), bdrv_inactivate_all() simply - * doesn't do anything. */ - bdrv_activate_all(&local_err); - if (local_err) { - error_propagate(errp, local_err); - return; - } - if (runstate_check(RUN_STATE_INMIGRATE)) { autostart = 1; } else { + /* + * Continuing after completed migration. Images have been + * inactivated to allow the destination to take control. Need to + * get control back now. + * + * If there are no inactive block nodes (e.g. because the VM was + * just paused rather than completing a migration), + * bdrv_inactivate_all() simply doesn't do anything. + */ + bdrv_activate_all(&local_err); + if (local_err) { + error_propagate(errp, local_err); + return; + } vm_start(); } } From fca9aef1c8d8fc4482cc541638dbfac76dc125d6 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 18:08:35 -0500 Subject: [PATCH 12/25] migration/block: Make late-block-active the default Migration capability 'late-block-active' controls when the block drives will be activated. If enabled, block drives will only be activated until VM starts, either src runstate was "live" (RUNNING, or SUSPENDED), or it'll be postponed until qmp_cont(). Let's do this unconditionally. There's no harm to delay activation of block drives. Meanwhile there's no ABI breakage if dest does it, because src QEMU has nothing to do with it, so it's no concern on ABI breakage. IIUC we could avoid introducing this cap when introducing it before, but now it's still not too late to just always do it. Cap now prone to removal, but it'll be for later patches. Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206230838.1111496-4-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/migration.c | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 969b03cdcd..c80fc7b94c 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -743,24 +743,6 @@ static void process_incoming_migration_bh(void *opaque) trace_vmstate_downtime_checkpoint("dst-precopy-bh-enter"); - /* If capability late_block_activate is set: - * Only fire up the block code now if we're going to restart the - * VM, else 'cont' will do it. - * This causes file locking to happen; so we don't want it to happen - * unless we really are starting the VM. - */ - if (!migrate_late_block_activate() || - (autostart && runstate_is_live(migration_get_target_runstate()))) { - /* Make sure all file formats throw away their mutable metadata. - * If we get an error here, just don't restart the VM yet. */ - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - local_err = NULL; - autostart = false; - } - } - /* * This must happen after all error conditions are dealt with and * we're sure the VM is going to be running on this host. @@ -775,7 +757,25 @@ static void process_incoming_migration_bh(void *opaque) if (runstate_is_live(migration_get_target_runstate())) { if (autostart) { - vm_start(); + /* + * Block activation is always delayed until VM starts, either + * here (which means we need to start the dest VM right now..), + * or until qmp_cont() later. + * + * We used to have cap 'late-block-activate' but now we do this + * unconditionally, as it has no harm but only benefit. E.g., + * it's not part of migration ABI on the time of disk activation. + * + * Make sure all file formats throw away their mutable + * metadata. If error, don't restart the VM yet. + */ + bdrv_activate_all(&local_err); + if (local_err) { + error_report_err(local_err); + local_err = NULL; + } else { + vm_start(); + } } else { runstate_set(RUN_STATE_PAUSED); } From 61f2b489987c51159c53101a072c6aa901b50506 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 18:08:36 -0500 Subject: [PATCH 13/25] migration/block: Apply late-block-active behavior to postcopy Postcopy never cared about late-block-active. However there's no mention in the capability that it doesn't apply to postcopy. Considering that we _assumed_ late activation is always good, do that too for postcopy unconditionally, just like precopy. After this patch, we should have unified the behavior across all. Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206230838.1111496-5-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/savevm.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/migration/savevm.c b/migration/savevm.c index 927b1146c0..d4842b519d 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -2137,22 +2137,21 @@ static void loadvm_postcopy_handle_run_bh(void *opaque) trace_vmstate_downtime_checkpoint("dst-postcopy-bh-announced"); - /* Make sure all file formats throw away their mutable metadata. - * If we get an error here, just don't restart the VM yet. */ - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - local_err = NULL; - autostart = false; - } - - trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated"); - dirty_bitmap_mig_before_vm_start(); if (autostart) { - /* Hold onto your hats, starting the CPU */ - vm_start(); + /* + * Make sure all file formats throw away their mutable metadata. + * If we get an error here, just don't restart the VM yet. + */ + bdrv_activate_all(&local_err); + trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated"); + if (local_err) { + error_report_err(local_err); + local_err = NULL; + } else { + vm_start(); + } } else { /* leave it paused and let management decide when to start the CPU */ runstate_set(RUN_STATE_PAUSED); From 8c97c5a476d146b35b2873ef73df601216a494d9 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 18:08:37 -0500 Subject: [PATCH 14/25] migration/block: Fix possible race with block_inactive Src QEMU sets block_inactive=true very early before the invalidation takes place. It means if something wrong happened during setting the flag but before reaching qemu_savevm_state_complete_precopy_non_iterable() where it did the invalidation work, it'll make block_inactive flag inconsistent. For example, think about when qemu_savevm_state_complete_precopy_iterable() can fail: it will have block_inactive set to true even if all block drives are active. Fix that by only update the flag after the invalidation is done. No Fixes for any commit, because it's not an issue if bdrv_activate_all() is re-entrant upon all-active disks - false positive block_inactive can bring nothing more than "trying to active the blocks but they're already active". However let's still do it right to avoid the inconsistent flag v.s. reality. Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206230838.1111496-6-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- migration/migration.c | 9 +++------ migration/savevm.c | 2 ++ 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index c80fc7b94c..fd42a549e6 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -2742,14 +2742,11 @@ static int migration_completion_precopy(MigrationState *s, goto out_unlock; } - /* - * Inactivate disks except in COLO, and track that we have done so in order - * to remember to reactivate them if migration fails or is cancelled. - */ - s->block_inactive = !migrate_colo(); migration_rate_set(RATE_LIMIT_DISABLED); + + /* Inactivate disks except in COLO */ ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false, - s->block_inactive); + !migrate_colo()); out_unlock: bql_unlock(); return ret; diff --git a/migration/savevm.c b/migration/savevm.c index d4842b519d..3a414aff52 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1558,6 +1558,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, qemu_file_set_error(f, ret); return ret; } + /* Remember that we did this */ + s->block_inactive = true; } if (!in_postcopy) { /* Postcopy stream will still be going */ From 8597af76153a87068b675d8099063c3ad8695773 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 6 Dec 2024 18:08:38 -0500 Subject: [PATCH 15/25] migration/block: Rewrite disk activation This patch proposes a flag to maintain disk activation status globally. It mostly rewrites disk activation mgmt for QEMU, including COLO and QMP command xen_save_devices_state. Backgrounds =========== We have two problems on disk activations, one resolved, one not. Problem 1: disk activation recover (for switchover interruptions) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When migration is either cancelled or failed during switchover, especially when after the disks are inactivated, QEMU needs to remember re-activate the disks again before vm starts. It used to be done separately in two paths: one in qmp_migrate_cancel(), the other one in the failure path of migration_completion(). It used to be fixed in different commits, all over the places in QEMU. So these are the relevant changes I saw, I'm not sure if it's complete list: - In 2016, commit fe904ea824 ("migration: regain control of images when migration fails to complete") - In 2017, commit 1d2acc3162 ("migration: re-active images while migration been canceled after inactive them") - In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in more failure scenarios") Now since we have a slightly better picture maybe we can unify the reactivation in a single path. One side benefit of doing so is, we can move the disk operation outside QMP command "migrate_cancel". It's possible that in the future we may want to make "migrate_cancel" be OOB-compatible, while that requires the command doesn't need BQL in the first place. This will already do that and make migrate_cancel command lightweight. Problem 2: disk invalidation on top of invalidated disks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is an unresolved bug for current QEMU. Link in "Resolves:" at the end. It turns out besides the src switchover phase (problem 1 above), QEMU also needs to remember block activation on destination. Consider two continuous migration in a row, where the VM was always paused. In that scenario, the disks are not activated even until migration completed in the 1st round. When the 2nd round starts, if QEMU doesn't know the status of the disks, it needs to try inactivate the disk again. Here the issue is the block layer API bdrv_inactivate_all() will crash a QEMU if invoked on already inactive disks for the 2nd migration. For detail, see the bug link at the end. Implementation ============== This patch proposes to maintain disk activation with a global flag, so we know: - If we used to inactivate disks for migration, but migration got cancelled, or failed, QEMU will know it should reactivate the disks. - On incoming side, if the disks are never activated but then another migration is triggered, QEMU should be able to tell that inactivate is not needed for the 2nd migration. We used to have disk_inactive, but it only solves the 1st issue, not the 2nd. Also, it's done in completely separate paths so it's extremely hard to follow either how the flag changes, or the duration that the flag is valid, and when we will reactivate the disks. Convert the existing disk_inactive flag into that global flag (also invert its naming), and maintain the disk activation status for the whole lifecycle of qemu. That includes the incoming QEMU. Put both of the error cases of source migration (failure, cancelled) together into migration_iteration_finish(), which will be invoked for either of the scenario. So from that part QEMU should behave the same as before. However with such global maintenance on disk activation status, we not only cleanup quite a few temporary paths that we try to maintain the disk activation status (e.g. in postcopy code), meanwhile it fixes the crash for problem 2 in one shot. For freshly started QEMU, the flag is initialized to TRUE showing that the QEMU owns the disks by default. For incoming migrated QEMU, the flag will be initialized to FALSE once and for all showing that the dest QEMU doesn't own the disks until switchover. That is guaranteed by the "once" variable. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395 Signed-off-by: Peter Xu Reviewed-by: Fabiano Rosas Message-Id: <20241206230838.1111496-7-peterx@redhat.com> Signed-off-by: Fabiano Rosas --- include/migration/misc.h | 4 ++ migration/block-active.c | 94 ++++++++++++++++++++++++++++++++++++++++ migration/colo.c | 2 +- migration/meson.build | 1 + migration/migration.c | 80 ++++++++-------------------------- migration/migration.h | 6 +-- migration/savevm.c | 33 ++++++-------- migration/trace-events | 3 ++ monitor/qmp-cmds.c | 8 +--- 9 files changed, 140 insertions(+), 91 deletions(-) create mode 100644 migration/block-active.c diff --git a/include/migration/misc.h b/include/migration/misc.h index c0e23fdac9..67f7ef7a0e 100644 --- a/include/migration/misc.h +++ b/include/migration/misc.h @@ -104,4 +104,8 @@ bool migration_incoming_postcopy_advised(void); /* True if background snapshot is active */ bool migration_in_bg_snapshot(void); +/* Wrapper for block active/inactive operations */ +bool migration_block_activate(Error **errp); +bool migration_block_inactivate(void); + #endif diff --git a/migration/block-active.c b/migration/block-active.c new file mode 100644 index 0000000000..d477cf8182 --- /dev/null +++ b/migration/block-active.c @@ -0,0 +1,94 @@ +/* + * Block activation tracking for migration purpose + * + * SPDX-License-Identifier: GPL-2.0-or-later + * + * Copyright (C) 2024 Red Hat, Inc. + */ +#include "qemu/osdep.h" +#include "block/block.h" +#include "qapi/error.h" +#include "migration/migration.h" +#include "qemu/error-report.h" +#include "trace.h" + +/* + * Migration-only cache to remember the block layer activation status. + * Protected by BQL. + * + * We need this because.. + * + * - Migration can fail after block devices are invalidated (during + * switchover phase). When that happens, we need to be able to recover + * the block drive status by re-activating them. + * + * - Currently bdrv_inactivate_all() is not safe to be invoked on top of + * invalidated drives (even if bdrv_activate_all() is actually safe to be + * called any time!). It means remembering this could help migration to + * make sure it won't invalidate twice in a row, crashing QEMU. It can + * happen when we migrate a PAUSED VM from host1 to host2, then migrate + * again to host3 without starting it. TODO: a cleaner solution is to + * allow safe invoke of bdrv_inactivate_all() at anytime, like + * bdrv_activate_all(). + * + * For freshly started QEMU, the flag is initialized to TRUE reflecting the + * scenario where QEMU owns block device ownerships. + * + * For incoming QEMU taking a migration stream, the flag is initialized to + * FALSE reflecting that the incoming side doesn't own the block devices, + * not until switchover happens. + */ +static bool migration_block_active; + +/* Setup the disk activation status */ +void migration_block_active_setup(bool active) +{ + migration_block_active = active; +} + +bool migration_block_activate(Error **errp) +{ + ERRP_GUARD(); + + assert(bql_locked()); + + if (migration_block_active) { + trace_migration_block_activation("active-skipped"); + return true; + } + + trace_migration_block_activation("active"); + + bdrv_activate_all(errp); + if (*errp) { + error_report_err(error_copy(*errp)); + return false; + } + + migration_block_active = true; + return true; +} + +bool migration_block_inactivate(void) +{ + int ret; + + assert(bql_locked()); + + if (!migration_block_active) { + trace_migration_block_activation("inactive-skipped"); + return true; + } + + trace_migration_block_activation("inactive"); + + ret = bdrv_inactivate_all(); + if (ret) { + error_report("%s: bdrv_inactivate_all() failed: %d", + __func__, ret); + return false; + } + + migration_block_active = false; + return true; +} diff --git a/migration/colo.c b/migration/colo.c index afc9869020..9a8e5fbe9b 100644 --- a/migration/colo.c +++ b/migration/colo.c @@ -836,7 +836,7 @@ static void *colo_process_incoming_thread(void *opaque) /* Make sure all file formats throw away their mutable metadata */ bql_lock(); - bdrv_activate_all(&local_err); + migration_block_activate(&local_err); bql_unlock(); if (local_err) { error_report_err(local_err); diff --git a/migration/meson.build b/migration/meson.build index d53cf3417a..dac687ee3a 100644 --- a/migration/meson.build +++ b/migration/meson.build @@ -11,6 +11,7 @@ migration_files = files( system_ss.add(files( 'block-dirty-bitmap.c', + 'block-active.c', 'channel.c', 'channel-block.c', 'cpu-throttle.c', diff --git a/migration/migration.c b/migration/migration.c index fd42a549e6..2d1da917c7 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -738,7 +738,6 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels, static void process_incoming_migration_bh(void *opaque) { - Error *local_err = NULL; MigrationIncomingState *mis = opaque; trace_vmstate_downtime_checkpoint("dst-precopy-bh-enter"); @@ -769,11 +768,7 @@ static void process_incoming_migration_bh(void *opaque) * Make sure all file formats throw away their mutable * metadata. If error, don't restart the VM yet. */ - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - local_err = NULL; - } else { + if (migration_block_activate(NULL)) { vm_start(); } } else { @@ -1560,16 +1555,6 @@ static void migrate_fd_cancel(MigrationState *s) } } } - if (s->state == MIGRATION_STATUS_CANCELLING && s->block_inactive) { - Error *local_err = NULL; - - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - } else { - s->block_inactive = false; - } - } } void migration_add_notifier_mode(NotifierWithReturn *notify, @@ -1853,6 +1838,12 @@ void qmp_migrate_incoming(const char *uri, bool has_channels, return; } + /* + * Newly setup incoming QEMU. Mark the block active state to reflect + * that the src currently owns the disks. + */ + migration_block_active_setup(false); + once = false; } @@ -2505,7 +2496,6 @@ static int postcopy_start(MigrationState *ms, Error **errp) QIOChannelBuffer *bioc; QEMUFile *fb; uint64_t bandwidth = migrate_max_postcopy_bandwidth(); - bool restart_block = false; int cur_state = MIGRATION_STATUS_ACTIVE; if (migrate_postcopy_preempt()) { @@ -2541,13 +2531,10 @@ static int postcopy_start(MigrationState *ms, Error **errp) goto fail; } - ret = bdrv_inactivate_all(); - if (ret < 0) { - error_setg_errno(errp, -ret, "%s: Failed in bdrv_inactivate_all()", - __func__); + if (!migration_block_inactivate()) { + error_setg(errp, "%s: Failed in bdrv_inactivate_all()", __func__); goto fail; } - restart_block = true; /* * Cause any non-postcopiable, but iterative devices to @@ -2617,8 +2604,6 @@ static int postcopy_start(MigrationState *ms, Error **errp) goto fail_closefb; } - restart_block = false; - /* Now send that blob */ if (qemu_savevm_send_packaged(ms->to_dst_file, bioc->data, bioc->usage)) { error_setg(errp, "%s: Failed to send packaged data", __func__); @@ -2663,17 +2648,7 @@ fail_closefb: fail: migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE, MIGRATION_STATUS_FAILED); - if (restart_block) { - /* A failure happened early enough that we know the destination hasn't - * accessed block devices, so we're safe to recover. - */ - Error *local_err = NULL; - - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - } - } + migration_block_activate(NULL); migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL); bql_unlock(); return -1; @@ -2771,31 +2746,6 @@ static void migration_completion_postcopy(MigrationState *s) trace_migration_completion_postcopy_end_after_complete(); } -static void migration_completion_failed(MigrationState *s, - int current_active_state) -{ - if (s->block_inactive && (s->state == MIGRATION_STATUS_ACTIVE || - s->state == MIGRATION_STATUS_DEVICE)) { - /* - * If not doing postcopy, vm_start() will be called: let's - * regain control on images. - */ - Error *local_err = NULL; - - bql_lock(); - bdrv_activate_all(&local_err); - if (local_err) { - error_report_err(local_err); - } else { - s->block_inactive = false; - } - bql_unlock(); - } - - migrate_set_state(&s->state, current_active_state, - MIGRATION_STATUS_FAILED); -} - /** * migration_completion: Used by migration_thread when there's not much left. * The caller 'breaks' the loop when this returns. @@ -2849,7 +2799,8 @@ fail: error_free(local_err); } - migration_completion_failed(s, current_active_state); + migrate_set_state(&s->state, current_active_state, + MIGRATION_STATUS_FAILED); } /** @@ -3279,6 +3230,11 @@ static void migration_iteration_finish(MigrationState *s) case MIGRATION_STATUS_FAILED: case MIGRATION_STATUS_CANCELLED: case MIGRATION_STATUS_CANCELLING: + /* + * Re-activate the block drives if they're inactivated. Note, COLO + * shouldn't use block_active at all, so it should be no-op there. + */ + migration_block_activate(NULL); if (runstate_is_live(s->vm_old_state)) { if (!runstate_check(RUN_STATE_SHUTDOWN)) { vm_start(); @@ -3852,6 +3808,8 @@ static void migration_instance_init(Object *obj) ms->state = MIGRATION_STATUS_NONE; ms->mbps = -1; ms->pages_per_second = -1; + /* Freshly started QEMU owns all the block devices */ + migration_block_active_setup(true); qemu_sem_init(&ms->pause_sem, 0); qemu_mutex_init(&ms->error_mutex); diff --git a/migration/migration.h b/migration/migration.h index 7b6e718690..0df2a187af 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -370,9 +370,6 @@ struct MigrationState { /* Flag set once the migration thread is running (and needs joining) */ bool migration_thread_running; - /* Flag set once the migration thread called bdrv_inactivate_all */ - bool block_inactive; - /* Migration is waiting for guest to unplug device */ QemuSemaphore wait_unplug_sem; @@ -556,4 +553,7 @@ void migration_bitmap_sync_precopy(bool last_stage); /* migration/block-dirty-bitmap.c */ void dirty_bitmap_mig_init(void); +/* migration/block-active.c */ +void migration_block_active_setup(bool active); + #endif diff --git a/migration/savevm.c b/migration/savevm.c index 3a414aff52..c929da1ca5 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1547,19 +1547,18 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, } if (inactivate_disks) { - /* Inactivate before sending QEMU_VM_EOF so that the - * bdrv_activate_all() on the other end won't fail. */ - ret = bdrv_inactivate_all(); - if (ret) { - error_setg(&local_err, "%s: bdrv_inactivate_all() failed (%d)", - __func__, ret); + /* + * Inactivate before sending QEMU_VM_EOF so that the + * bdrv_activate_all() on the other end won't fail. + */ + if (!migration_block_inactivate()) { + error_setg(&local_err, "%s: bdrv_inactivate_all() failed", + __func__); migrate_set_error(ms, local_err); error_report_err(local_err); - qemu_file_set_error(f, ret); + qemu_file_set_error(f, -EFAULT); return ret; } - /* Remember that we did this */ - s->block_inactive = true; } if (!in_postcopy) { /* Postcopy stream will still be going */ @@ -2123,7 +2122,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis) static void loadvm_postcopy_handle_run_bh(void *opaque) { - Error *local_err = NULL; MigrationIncomingState *mis = opaque; trace_vmstate_downtime_checkpoint("dst-postcopy-bh-enter"); @@ -2146,12 +2144,11 @@ static void loadvm_postcopy_handle_run_bh(void *opaque) * Make sure all file formats throw away their mutable metadata. * If we get an error here, just don't restart the VM yet. */ - bdrv_activate_all(&local_err); + bool success = migration_block_activate(NULL); + trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated"); - if (local_err) { - error_report_err(local_err); - local_err = NULL; - } else { + + if (success) { vm_start(); } } else { @@ -3193,11 +3190,7 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live, * side of the migration take control of the images. */ if (live && !saved_vm_running) { - ret = bdrv_inactivate_all(); - if (ret) { - error_setg(errp, "%s: bdrv_inactivate_all() failed (%d)", - __func__, ret); - } + migration_block_inactivate(); } } diff --git a/migration/trace-events b/migration/trace-events index bb0e0cc6dc..b82a1c5e40 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -383,3 +383,6 @@ migration_pagecache_insert(void) "Error allocating page" # cpu-throttle.c cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%" cpu_throttle_dirty_sync(void) "" + +# block-active.c +migration_block_activation(const char *name) "%s" diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c index 415e645130..1ca44fbd72 100644 --- a/monitor/qmp-cmds.c +++ b/monitor/qmp-cmds.c @@ -31,6 +31,7 @@ #include "qapi/type-helpers.h" #include "hw/mem/memory-device.h" #include "hw/intc/intc.h" +#include "migration/misc.h" NameInfo *qmp_query_name(Error **errp) { @@ -103,13 +104,8 @@ void qmp_cont(Error **errp) * Continuing after completed migration. Images have been * inactivated to allow the destination to take control. Need to * get control back now. - * - * If there are no inactive block nodes (e.g. because the VM was - * just paused rather than completing a migration), - * bdrv_inactivate_all() simply doesn't do anything. */ - bdrv_activate_all(&local_err); - if (local_err) { + if (!migration_block_activate(&local_err)) { error_propagate(errp, local_err); return; } From 86bee9e0c761a3d0e67c43b44001fd752f894cb0 Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Thu, 9 Jan 2025 15:52:43 -0300 Subject: [PATCH 16/25] migration: Add more error handling to analyze-migration.py The analyze-migration script was seen failing in s390x in misterious ways. It seems we're reaching the VMSDFieldStruct constructor without any fields, which would indicate an empty .subsection entry, a VMSTATE_STRUCT with no fields or a vmsd with no fields. We don't have any of those, at least not without the unmigratable flag set, so this should never happen. Add some debug statements so that we can see what's going on the next time the issue happens. Reviewed-by: Peter Xu Message-Id: <20250109185249.23952-2-farosas@suse.de> Signed-off-by: Fabiano Rosas --- scripts/analyze-migration.py | 75 +++++++++++++++++++++--------------- 1 file changed, 45 insertions(+), 30 deletions(-) diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py index 8a254a5b6a..f2457b1dde 100755 --- a/scripts/analyze-migration.py +++ b/scripts/analyze-migration.py @@ -429,6 +429,9 @@ class VMSDFieldStruct(VMSDFieldGeneric): super(VMSDFieldStruct, self).__init__(desc, file) self.data = collections.OrderedDict() + if 'fields' not in self.desc['struct']: + raise Exception("No fields in struct. VMSD:\n%s" % self.desc) + # When we see compressed array elements, unfold them here new_fields = [] for field in self.desc['struct']['fields']: @@ -477,6 +480,10 @@ class VMSDFieldStruct(VMSDFieldGeneric): raise Exception("Subsection %s not found at offset %x" % ( subsection['vmsd_name'], self.file.tell())) name = self.file.readstr() version_id = self.file.read32() + + if not subsection: + raise Exception("Empty description for subsection: %s" % name) + self.data[name] = VMSDSection(self.file, version_id, subsection, (name, 0)) self.data[name].read() @@ -574,10 +581,13 @@ class MigrationDump(object): } self.filename = filename self.vmsd_desc = None + self.vmsd_json = "" - def read(self, desc_only = False, dump_memory = False, write_memory = False): + def read(self, desc_only = False, dump_memory = False, + write_memory = False): # Read in the whole file file = MigrationFile(self.filename) + self.vmsd_json = file.read_migration_debug_json() # File magic data = file.read32() @@ -635,9 +645,11 @@ class MigrationDump(object): file.close() def load_vmsd_json(self, file): - vmsd_json = file.read_migration_debug_json() - self.vmsd_desc = json.loads(vmsd_json, object_pairs_hook=collections.OrderedDict) + self.vmsd_desc = json.loads(self.vmsd_json, + object_pairs_hook=collections.OrderedDict) for device in self.vmsd_desc['devices']: + if 'fields' not in device: + raise Exception("vmstate for device %s has no fields" % device['name']) key = (device['name'], device['instance_id']) value = ( VMSDSection, device ) self.section_classes[key] = value @@ -666,31 +678,34 @@ args = parser.parse_args() jsonenc = JSONEncoder(indent=4, separators=(',', ': ')) -if args.extract: - dump = MigrationDump(args.file) - - dump.read(desc_only = True) - print("desc.json") - f = open("desc.json", "w") - f.truncate() - f.write(jsonenc.encode(dump.vmsd_desc)) - f.close() - - dump.read(write_memory = True) - dict = dump.getDict() - print("state.json") - f = open("state.json", "w") - f.truncate() - f.write(jsonenc.encode(dict)) - f.close() -elif args.dump == "state": - dump = MigrationDump(args.file) - dump.read(dump_memory = args.memory) - dict = dump.getDict() - print(jsonenc.encode(dict)) -elif args.dump == "desc": - dump = MigrationDump(args.file) - dump.read(desc_only = True) - print(jsonenc.encode(dump.vmsd_desc)) -else: +if not any([args.extract, args.dump == "state", args.dump == "desc"]): raise Exception("Please specify either -x, -d state or -d desc") + +try: + dump = MigrationDump(args.file) + + if args.extract: + dump.read(desc_only = True) + + print("desc.json") + f = open("desc.json", "w") + f.truncate() + f.write(jsonenc.encode(dump.vmsd_desc)) + f.close() + + dump.read(write_memory = True) + dict = dump.getDict() + print("state.json") + f = open("state.json", "w") + f.truncate() + f.write(jsonenc.encode(dict)) + f.close() + elif args.dump == "state": + dump.read(dump_memory = args.memory) + dict = dump.getDict() + print(jsonenc.encode(dict)) + elif args.dump == "desc": + dump.read(desc_only = True) + print(jsonenc.encode(dump.vmsd_desc)) +except Exception: + raise Exception("Full JSON dump:\n%s", dump.vmsd_json) From 2aead53d39b828f8d9d0769ffa3579dadd64d846 Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Thu, 9 Jan 2025 15:52:44 -0300 Subject: [PATCH 17/25] migration: Remove unused argument in vmsd_desc_field_end Reviewed-by: Peter Xu Message-Id: <20250109185249.23952-3-farosas@suse.de> Signed-off-by: Fabiano Rosas --- migration/vmstate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/migration/vmstate.c b/migration/vmstate.c index fa002b24e8..aa2821dec6 100644 --- a/migration/vmstate.c +++ b/migration/vmstate.c @@ -311,7 +311,7 @@ static void vmsd_desc_field_start(const VMStateDescription *vmsd, static void vmsd_desc_field_end(const VMStateDescription *vmsd, JSONWriter *vmdesc, - const VMStateField *field, size_t size, int i) + const VMStateField *field, size_t size) { if (!vmdesc) { return; @@ -420,7 +420,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd, } written_bytes = qemu_file_transferred(f) - old_offset; - vmsd_desc_field_end(vmsd, vmdesc_loop, field, written_bytes, i); + vmsd_desc_field_end(vmsd, vmdesc_loop, field, written_bytes); /* Compressed arrays only care about the first element */ if (vmdesc_loop && vmsd_can_compress(field)) { From 69d1f784569fdb950f2923c3b6d00d7c1b71acc1 Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Thu, 9 Jan 2025 15:52:45 -0300 Subject: [PATCH 18/25] migration: Fix parsing of s390 stream The parsing for the S390StorageAttributes section is currently leaving an unconsumed token that is later interpreted by the generic code as QEMU_VM_EOF, cutting the parsing short. The migration will issue a STATTR_FLAG_DONE between iterations, which the script consumes correctly, but there's a final STATTR_FLAG_EOS at .save_complete that the script is ignoring. Since the EOS flag is a u64 0x1ULL and the stream is big endian, on little endian hosts a byte read from it will be 0x0, the same as QEMU_VM_EOF. Fixes: 81c2c9dd5d ("tests/qtest/migration-test: Fix analyze-migration.py for s390x") Reviewed-by: Peter Xu Message-Id: <20250109185249.23952-4-farosas@suse.de> Signed-off-by: Fabiano Rosas --- scripts/analyze-migration.py | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py index f2457b1dde..fcda11f31d 100755 --- a/scripts/analyze-migration.py +++ b/scripts/analyze-migration.py @@ -65,6 +65,9 @@ class MigrationFile(object): def tell(self): return self.file.tell() + def seek(self, a, b): + return self.file.seek(a, b) + # The VMSD description is at the end of the file, after EOF. Look for # the last NULL byte, then for the beginning brace of JSON. def read_migration_debug_json(self): @@ -272,11 +275,24 @@ class S390StorageAttributes(object): self.section_key = section_key def read(self): + pos = 0 while True: addr_flags = self.file.read64() flags = addr_flags & 0xfff - if (flags & (self.STATTR_FLAG_DONE | self.STATTR_FLAG_EOS)): + + if flags & self.STATTR_FLAG_DONE: + pos = self.file.tell() + continue + elif flags & self.STATTR_FLAG_EOS: return + else: + # No EOS came after DONE, that's OK, but rewind the + # stream because this is not our data. + if pos: + self.file.seek(pos, os.SEEK_SET) + return + raise Exception("Unknown flags %x", flags) + if (flags & self.STATTR_FLAG_ERROR): raise Exception("Error in migration stream") count = self.file.read64() From f52965bf0eeee28e89933264f1a9dbdcdaa76a7e Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Thu, 9 Jan 2025 15:52:46 -0300 Subject: [PATCH 19/25] migration: Rename vmstate_info_nullptr Rename vmstate_info_nullptr from "uint64_t" to "nullptr". This vmstate actually reads and writes just a byte, so the proper name would be uint8. However, since this is a marker for a NULL pointer, it's convenient to have a more explicit name that can be identified by the consumers of the JSON part of the stream. Change the name to "nullptr" and add support for it in the analyze-migration.py script. Arbitrarily use the name of the type as the value of the field to avoid the script showing 0x30 or '0', which could be confusing for readers. Reviewed-by: Peter Xu Message-Id: <20250109185249.23952-5-farosas@suse.de> Signed-off-by: Fabiano Rosas --- migration/vmstate-types.c | 2 +- scripts/analyze-migration.py | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c index e83bfccb9e..d70d573dbd 100644 --- a/migration/vmstate-types.c +++ b/migration/vmstate-types.c @@ -338,7 +338,7 @@ static int put_nullptr(QEMUFile *f, void *pv, size_t size, } const VMStateInfo vmstate_info_nullptr = { - .name = "uint64", + .name = "nullptr", .get = get_nullptr, .put = put_nullptr, }; diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py index fcda11f31d..923f174f1b 100755 --- a/scripts/analyze-migration.py +++ b/scripts/analyze-migration.py @@ -417,6 +417,28 @@ class VMSDFieldIntLE(VMSDFieldInt): super(VMSDFieldIntLE, self).__init__(desc, file) self.dtype = ' Date: Thu, 9 Jan 2025 15:52:47 -0300 Subject: [PATCH 20/25] migration: Dump correct JSON format for nullptr replacement QEMU plays a trick with null pointers inside an array of pointers in a VMSD field. See 07d4e69147 ("migration/vmstate: fix array of ptr with nullptrs") for more details on why. The idea makes sense in general, but it may overlooked the JSON writer where it could write nothing in a "struct" in the JSON hints section. We hit some analyze-migration.py issues on s390 recently, showing that some of the struct field contains nothing, like: {"name": "css", "array_len": 256, "type": "struct", "struct": {}, "size": 1} As described in details by Fabiano: https://lore.kernel.org/r/87pll37cin.fsf@suse.de It could be that we hit some null pointers there, and JSON was gone when they're null pointers. To fix it, instead of hacking around only at VMStateInfo level, do that from VMStateField level, so that JSON writer can also be involved. In this case, JSON writer will replace the pointer array (which used to be a "struct") to be the real representation of the nullptr field. Signed-off-by: Peter Xu Message-Id: <20250109185249.23952-6-farosas@suse.de> Signed-off-by: Fabiano Rosas --- migration/vmstate.c | 118 ++++++++++++++++++++++++++++++++++---------- 1 file changed, 91 insertions(+), 27 deletions(-) diff --git a/migration/vmstate.c b/migration/vmstate.c index aa2821dec6..52704c822c 100644 --- a/migration/vmstate.c +++ b/migration/vmstate.c @@ -51,6 +51,36 @@ vmstate_field_exists(const VMStateDescription *vmsd, const VMStateField *field, return result; } +/* + * Create a fake nullptr field when there's a NULL pointer detected in the + * array of a VMS_ARRAY_OF_POINTER VMSD field. It's needed because we + * can't dereference the NULL pointer. + */ +static const VMStateField * +vmsd_create_fake_nullptr_field(const VMStateField *field) +{ + VMStateField *fake = g_new0(VMStateField, 1); + + /* It can only happen on an array of pointers! */ + assert(field->flags & VMS_ARRAY_OF_POINTER); + + /* Some of fake's properties should match the original's */ + fake->name = field->name; + fake->version_id = field->version_id; + + /* Do not need "field_exists" check as it always exists (which is null) */ + fake->field_exists = NULL; + + /* See vmstate_info_nullptr - use 1 byte to represent nullptr */ + fake->size = 1; + fake->info = &vmstate_info_nullptr; + fake->flags = VMS_SINGLE; + + /* All the rest fields shouldn't matter.. */ + + return (const VMStateField *)fake; +} + static int vmstate_n_elems(void *opaque, const VMStateField *field) { int n_elems = 1; @@ -143,23 +173,39 @@ int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd, } for (i = 0; i < n_elems; i++) { void *curr_elem = first_elem + size * i; + const VMStateField *inner_field; if (field->flags & VMS_ARRAY_OF_POINTER) { curr_elem = *(void **)curr_elem; } + if (!curr_elem && size) { - /* if null pointer check placeholder and do not follow */ - assert(field->flags & VMS_ARRAY_OF_POINTER); - ret = vmstate_info_nullptr.get(f, curr_elem, size, NULL); - } else if (field->flags & VMS_STRUCT) { - ret = vmstate_load_state(f, field->vmsd, curr_elem, - field->vmsd->version_id); - } else if (field->flags & VMS_VSTRUCT) { - ret = vmstate_load_state(f, field->vmsd, curr_elem, - field->struct_version_id); + /* + * If null pointer found (which should only happen in + * an array of pointers), use null placeholder and do + * not follow. + */ + inner_field = vmsd_create_fake_nullptr_field(field); } else { - ret = field->info->get(f, curr_elem, size, field); + inner_field = field; } + + if (inner_field->flags & VMS_STRUCT) { + ret = vmstate_load_state(f, inner_field->vmsd, curr_elem, + inner_field->vmsd->version_id); + } else if (inner_field->flags & VMS_VSTRUCT) { + ret = vmstate_load_state(f, inner_field->vmsd, curr_elem, + inner_field->struct_version_id); + } else { + ret = inner_field->info->get(f, curr_elem, size, + inner_field); + } + + /* If we used a fake temp field.. free it now */ + if (inner_field != field) { + g_clear_pointer((gpointer *)&inner_field, g_free); + } + if (ret >= 0) { ret = qemu_file_get_error(f); } @@ -387,29 +433,50 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd, } for (i = 0; i < n_elems; i++) { void *curr_elem = first_elem + size * i; + const VMStateField *inner_field; - vmsd_desc_field_start(vmsd, vmdesc_loop, field, i, n_elems); old_offset = qemu_file_transferred(f); if (field->flags & VMS_ARRAY_OF_POINTER) { assert(curr_elem); curr_elem = *(void **)curr_elem; } + if (!curr_elem && size) { - /* if null pointer write placeholder and do not follow */ - assert(field->flags & VMS_ARRAY_OF_POINTER); - ret = vmstate_info_nullptr.put(f, curr_elem, size, NULL, - NULL); - } else if (field->flags & VMS_STRUCT) { - ret = vmstate_save_state(f, field->vmsd, curr_elem, - vmdesc_loop); - } else if (field->flags & VMS_VSTRUCT) { - ret = vmstate_save_state_v(f, field->vmsd, curr_elem, - vmdesc_loop, - field->struct_version_id, errp); + /* + * If null pointer found (which should only happen in + * an array of pointers), use null placeholder and do + * not follow. + */ + inner_field = vmsd_create_fake_nullptr_field(field); } else { - ret = field->info->put(f, curr_elem, size, field, - vmdesc_loop); + inner_field = field; } + + vmsd_desc_field_start(vmsd, vmdesc_loop, inner_field, + i, n_elems); + + if (inner_field->flags & VMS_STRUCT) { + ret = vmstate_save_state(f, inner_field->vmsd, + curr_elem, vmdesc_loop); + } else if (inner_field->flags & VMS_VSTRUCT) { + ret = vmstate_save_state_v(f, inner_field->vmsd, + curr_elem, vmdesc_loop, + inner_field->struct_version_id, + errp); + } else { + ret = inner_field->info->put(f, curr_elem, size, + inner_field, vmdesc_loop); + } + + written_bytes = qemu_file_transferred(f) - old_offset; + vmsd_desc_field_end(vmsd, vmdesc_loop, inner_field, + written_bytes); + + /* If we used a fake temp field.. free it now */ + if (inner_field != field) { + g_clear_pointer((gpointer *)&inner_field, g_free); + } + if (ret) { error_setg(errp, "Save of field %s/%s failed", vmsd->name, field->name); @@ -419,9 +486,6 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd, return ret; } - written_bytes = qemu_file_transferred(f) - old_offset; - vmsd_desc_field_end(vmsd, vmdesc_loop, field, written_bytes); - /* Compressed arrays only care about the first element */ if (vmdesc_loop && vmsd_can_compress(field)) { vmdesc_loop = NULL; From 35049eb0d2fc72bb8c563196ec75b4d6c13fce02 Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Thu, 9 Jan 2025 15:52:48 -0300 Subject: [PATCH 21/25] migration: Fix arrays of pointers in JSON writer Currently, if an array of pointers contains a NULL pointer, that pointer will be encoded as '0' in the stream. Since the JSON writer doesn't define a "pointer" type, that '0' will now be an uint8, which is different from the original type being pointed to, e.g. struct. (we're further calling uint8 "nullptr", but that's irrelevant to the issue) That mixed-type array shouldn't be compressed, otherwise data is lost as the code currently makes the whole array have the type of the first element: css = {NULL, NULL, ..., 0x5555568a7940, NULL}; {"name": "s390_css", "instance_id": 0, "vmsd_name": "s390_css", "version": 1, "fields": [ ..., {"name": "css", "array_len": 256, "type": "nullptr", "size": 1}, ..., ]} In the above, the valid pointer at position 254 got lost among the compressed array of nullptr. While we could disable the array compression when a NULL pointer is found, the JSON part of the stream still makes part of downtime, so we should avoid writing unecessary bytes to it. Keep the array compression in place, but if NULL and non-NULL pointers are mixed break the array into several type-contiguous pieces : css = {NULL, NULL, ..., 0x5555568a7940, NULL}; {"name": "s390_css", "instance_id": 0, "vmsd_name": "s390_css", "version": 1, "fields": [ ..., {"name": "css", "array_len": 254, "type": "nullptr", "size": 1}, {"name": "css", "type": "struct", "struct": {"vmsd_name": "s390_css_img", ... }, "size": 768}, {"name": "css", "type": "nullptr", "size": 1}, ..., ]} Now each type-discontiguous region will become a new JSON entry. The reader should interpret this as a concatenation of values, all part of the same field. Parsing the JSON with analyze-script.py now shows the proper data being pointed to at the places where the pointer is valid and "nullptr" where there's NULL: "s390_css (14)": { ... "css": [ "nullptr", "nullptr", ... "nullptr", { "chpids": [ { "in_use": "0x00", "type": "0x00", "is_virtual": "0x00" }, ... ] }, "nullptr", } Reviewed-by: Peter Xu Message-Id: <20250109185249.23952-7-farosas@suse.de> Signed-off-by: Fabiano Rosas --- migration/vmstate.c | 33 ++++++++++++++++++++++++++++++++- scripts/analyze-migration.py | 26 ++++++++++++++++++-------- 2 files changed, 50 insertions(+), 9 deletions(-) diff --git a/migration/vmstate.c b/migration/vmstate.c index 52704c822c..82bd005a83 100644 --- a/migration/vmstate.c +++ b/migration/vmstate.c @@ -425,15 +425,19 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd, int size = vmstate_size(opaque, field); uint64_t old_offset, written_bytes; JSONWriter *vmdesc_loop = vmdesc; + bool is_prev_null = false; trace_vmstate_save_state_loop(vmsd->name, field->name, n_elems); if (field->flags & VMS_POINTER) { first_elem = *(void **)first_elem; assert(first_elem || !n_elems || !size); } + for (i = 0; i < n_elems; i++) { void *curr_elem = first_elem + size * i; const VMStateField *inner_field; + bool is_null; + int max_elems = n_elems - i; old_offset = qemu_file_transferred(f); if (field->flags & VMS_ARRAY_OF_POINTER) { @@ -448,12 +452,39 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd, * not follow. */ inner_field = vmsd_create_fake_nullptr_field(field); + is_null = true; } else { inner_field = field; + is_null = false; + } + + /* + * Due to the fake nullptr handling above, if there's mixed + * null/non-null data, it doesn't make sense to emit a + * compressed array representation spanning the entire array + * because the field types will be different (e.g. struct + * vs. nullptr). Search ahead for the next null/non-null element + * and start a new compressed array if found. + */ + if (field->flags & VMS_ARRAY_OF_POINTER && + is_null != is_prev_null) { + + is_prev_null = is_null; + vmdesc_loop = vmdesc; + + for (int j = i + 1; j < n_elems; j++) { + void *elem = *(void **)(first_elem + size * j); + bool elem_is_null = !elem && size; + + if (is_null != elem_is_null) { + max_elems = j - i; + break; + } + } } vmsd_desc_field_start(vmsd, vmdesc_loop, inner_field, - i, n_elems); + i, max_elems); if (inner_field->flags & VMS_STRUCT) { ret = vmstate_save_state(f, inner_field->vmsd, diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py index 923f174f1b..8e1fbf4c9d 100755 --- a/scripts/analyze-migration.py +++ b/scripts/analyze-migration.py @@ -502,15 +502,25 @@ class VMSDFieldStruct(VMSDFieldGeneric): field['data'] = reader(field, self.file) field['data'].read() - if 'index' in field: - if field['name'] not in self.data: - self.data[field['name']] = [] - a = self.data[field['name']] - if len(a) != int(field['index']): - raise Exception("internal index of data field unmatched (%d/%d)" % (len(a), int(field['index']))) - a.append(field['data']) + fname = field['name'] + fdata = field['data'] + + # The field could be: + # i) a single data entry, e.g. uint64 + # ii) an array, indicated by it containing the 'index' key + # + # However, the overall data after parsing the whole + # stream, could be a mix of arrays and single data fields, + # all sharing the same field name due to how QEMU breaks + # up arrays with NULL pointers into multiple compressed + # array segments. + if fname not in self.data: + self.data[fname] = fdata + elif type(self.data[fname]) == list: + self.data[fname].append(fdata) else: - self.data[field['name']] = field['data'] + tmp = self.data[fname] + self.data[fname] = [tmp, fdata] if 'subsections' in self.desc['struct']: for subsection in self.desc['struct']['subsections']: From c76ee1f6255c3988a9447d363bb17072f1ec84e1 Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Thu, 9 Jan 2025 15:52:49 -0300 Subject: [PATCH 22/25] s390x: Fix CSS migration Commit a55ae46683 ("s390: move css_migration_enabled from machine to css.c") disabled CSS migration globally instead of doing it per-instance. CC: Paolo Bonzini CC: qemu-stable@nongnu.org #9.1 Fixes: a55ae46683 ("s390: move css_migration_enabled from machine to css.c") Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2704 Reviewed-by: Thomas Huth Message-Id: <20250109185249.23952-8-farosas@suse.de> Signed-off-by: Fabiano Rosas --- hw/s390x/s390-virtio-ccw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c index 8a242cc1ec..38aeba14ee 100644 --- a/hw/s390x/s390-virtio-ccw.c +++ b/hw/s390x/s390-virtio-ccw.c @@ -1244,6 +1244,7 @@ static void ccw_machine_2_9_instance_options(MachineState *machine) s390_cpudef_featoff_greater(12, 1, S390_FEAT_ZPCI); s390_cpudef_featoff_greater(12, 1, S390_FEAT_ADAPTER_INT_SUPPRESSION); s390_cpudef_featoff_greater(12, 1, S390_FEAT_ADAPTER_EVENT_NOTIFICATION); + css_migration_enabled = false; } static void ccw_machine_2_9_class_options(MachineClass *mc) @@ -1256,7 +1257,6 @@ static void ccw_machine_2_9_class_options(MachineClass *mc) ccw_machine_2_10_class_options(mc); compat_props_add(mc->compat_props, hw_compat_2_9, hw_compat_2_9_len); compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat)); - css_migration_enabled = false; } DEFINE_CCW_MACHINE(2, 9); From cdc3970f8597ebdc1a4c2090cfb4d11e297329ed Mon Sep 17 00:00:00 2001 From: Yuan Liu Date: Wed, 18 Dec 2024 17:14:11 +0800 Subject: [PATCH 23/25] multifd: bugfix for migration using compression methods When compression is enabled on the migration channel and the pages processed are all zero pages, these pages will not be sent and updated on the target side, resulting in incorrect memory data on the source and target sides. The root cause is that all compression methods call multifd_send_prepare_common to determine whether to compress dirty pages, but multifd_send_prepare_common does not update the IOV of MultiFDPacket_t when all dirty pages are zero pages. The solution is to always update the IOV of MultiFDPacket_t regardless of whether the dirty pages are all zero pages. Fixes: 303e6f54f9 ("migration/multifd: Implement zero page transmission on the multifd thread.") Cc: qemu-stable@nongnu.org #9.0+ Signed-off-by: Yuan Liu Reviewed-by: Jason Zeng Reviewed-by: Peter Xu Message-Id: <20241218091413.140396-2-yuan1.liu@intel.com> Signed-off-by: Fabiano Rosas --- migration/multifd-nocomp.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c index c1f686c0ce..1325dba97c 100644 --- a/migration/multifd-nocomp.c +++ b/migration/multifd-nocomp.c @@ -432,6 +432,7 @@ int multifd_ram_flush_and_sync(QEMUFile *f) bool multifd_send_prepare_common(MultiFDSendParams *p) { MultiFDPages_t *pages = &p->data->u.ram; + multifd_send_prepare_header(p); multifd_send_zero_page_detect(p); if (!pages->normal_num) { @@ -439,8 +440,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p) return false; } - multifd_send_prepare_header(p); - return true; } From 2588a5f99b0c3493b4690e3ff01ed36f80e830cc Mon Sep 17 00:00:00 2001 From: Yuan Liu Date: Wed, 18 Dec 2024 17:14:12 +0800 Subject: [PATCH 24/25] multifd: bugfix for incorrect migration data with QPL compression When QPL compression is enabled on the migration channel and the same dirty page changes from a normal page to a zero page in the iterative memory copy, the dirty page will not be updated to a zero page again on the target side, resulting in incorrect memory data on the source and target sides. The root cause is that the target side does not record the normal pages to the receivedmap. The solution is to add ramblock_recv_bitmap_set_offset in target side to record the normal pages. Signed-off-by: Yuan Liu Reviewed-by: Jason Zeng Reviewed-by: Peter Xu Message-Id: <20241218091413.140396-3-yuan1.liu@intel.com> Signed-off-by: Fabiano Rosas --- migration/multifd-qpl.c | 1 + 1 file changed, 1 insertion(+) diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c index bbe466617f..88e2344af2 100644 --- a/migration/multifd-qpl.c +++ b/migration/multifd-qpl.c @@ -679,6 +679,7 @@ static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp) qpl->zlen[i] = be32_to_cpu(qpl->zlen[i]); assert(qpl->zlen[i] <= multifd_ram_page_size()); zbuf_len += qpl->zlen[i]; + ramblock_recv_bitmap_set_offset(p->block, p->normal[i]); } /* read compressed pages */ From a523bc52166c80d8a04d46584f9f3868bd53ef69 Mon Sep 17 00:00:00 2001 From: Yuan Liu Date: Wed, 18 Dec 2024 17:14:13 +0800 Subject: [PATCH 25/25] multifd: bugfix for incorrect migration data with qatzip compression When QPL compression is enabled on the migration channel and the same dirty page changes from a normal page to a zero page in the iterative memory copy, the dirty page will not be updated to a zero page again on the target side, resulting in incorrect memory data on the source and target sides. The root cause is that the target side does not record the normal pages to the receivedmap. The solution is to add ramblock_recv_bitmap_set_offset in target side to record the normal pages. Signed-off-by: Yuan Liu Reviewed-by: Jason Zeng Reviewed-by: Peter Xu Message-Id: <20241218091413.140396-4-yuan1.liu@intel.com> Signed-off-by: Fabiano Rosas --- migration/multifd-qatzip.c | 1 + 1 file changed, 1 insertion(+) diff --git a/migration/multifd-qatzip.c b/migration/multifd-qatzip.c index 7b68397625..6a0e989fae 100644 --- a/migration/multifd-qatzip.c +++ b/migration/multifd-qatzip.c @@ -373,6 +373,7 @@ static int qatzip_recv(MultiFDRecvParams *p, Error **errp) /* Copy each page to its appropriate location. */ for (int i = 0; i < p->normal_num; i++) { memcpy(p->host + p->normal[i], q->out_buf + page_size * i, page_size); + ramblock_recv_bitmap_set_offset(p->block, p->normal[i]); } return 0; }