Andrea Fioraldi c6a00ab288
Full system hooks (#8)
* scsi-disk: add new quirks bitmap to SCSIDiskState

Since the MacOS SCSI implementation is quite old (and Apple added some firmware
customisations to their drives for m68k Macs) there is a need to add a mechanism
to correctly handle Apple-specific quirks.

Add a new quirks bitmap to SCSIDiskState that can be used to enable these
features as required.
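
As a rough illustration of the idea (not the code added by this patch), a
quirks bitmap can be a plain integer field with one bit per quirk. The
DiskState type, the helper names and the bit positions below are
assumptions made for the sketch; only the quirk names appear in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are illustrative; only the quirk names come from the series. */
enum {
    SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR = 0,
    SCSI_DISK_QUIRK_MODE_SENSE_ROM_USE_DBD = 1,
};

/* Hypothetical stand-in for the quirks field added to SCSIDiskState. */
typedef struct {
    uint32_t quirks;
} DiskState;

/* Turn one quirk on by setting its bit in the bitmap. */
static void quirk_enable(DiskState *s, unsigned bit)
{
    assert(bit < 32);
    s->quirks |= UINT32_C(1) << bit;
}

/* Report whether a quirk bit is currently set. */
static bool quirk_enabled(const DiskState *s, unsigned bit)
{
    assert(bit < 32);
    return ((s->quirks >> bit) & 1u) != 0;
}
```

A device with no quirks enabled keeps quirks == 0, so every check is false
and existing behaviour is unchanged.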

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20220622105314.802852-2-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: add MODE_PAGE_APPLE_VENDOR quirk for Macintosh

One of the mechanisms MacOS uses to identify CDROM drives compatible with MacOS
is to send a custom MODE SELECT command for page 0x30 to the drive. The
response to this is a hard-coded manufacturer string which must match in order
for the CDROM to be usable within MacOS.

Add an implementation of the MODE SELECT page 0x30 response guarded by a newly
defined SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR quirk bit so that CDROM drives
attached to non-Apple machines function exactly as before.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20220622105314.802852-3-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* q800: implement compat_props to enable quirk_mode_page_apple_vendor for scsi-cd devices

By default quirk_mode_page_apple_vendor should be enabled for all scsi-cd devices
connected to the q800 machine to enable MacOS to detect and use them.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-4-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: add SCSI_DISK_QUIRK_MODE_SENSE_ROM_USE_DBD quirk for Macintosh

During SCSI bus enumeration A/UX sends a MODE SENSE command to the CDROM with
the DBD bit unset and expects the response to include a block descriptor. As per
the latest SCSI documentation, QEMU currently force-disables the block
descriptor for CDROM devices but the A/UX driver expects the requested block
descriptor to be returned.

If the block descriptor is not returned in the response then A/UX becomes
confused, since the block descriptor returned in the MODE SENSE response is
used to generate a subsequent MODE SELECT command which is then invalid.

Add a new SCSI_DISK_QUIRK_MODE_SENSE_ROM_USE_DBD quirk to allow this behaviour
to be enabled as required. Note that an additional workaround is required for
the previous SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR quirk which must never
return a block descriptor even though the DBD bit is left unset.
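
The decision described above can be sketched roughly as follows. This is
a simplified model, not the scsi-disk code: the function name, the quirk
bit values and the assumption that non-CDROM disks simply honour the DBD
bit are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUIRK_APPLE_VENDOR      (1u << 0)   /* illustrative bit values */
#define QUIRK_ROM_USE_DBD       (1u << 1)
#define MODE_PAGE_APPLE_VENDOR  0x30

/*
 * Decide whether a MODE SENSE reply should carry a block descriptor.
 * dbd_bit is the DBD ("disable block descriptors") bit from the CDB.
 */
static bool want_block_descriptor(bool is_cdrom, bool dbd_bit,
                                  uint32_t quirks, uint8_t page)
{
    bool dbd;

    if (!is_cdrom) {
        dbd = dbd_bit;          /* assumed: disks honour the request */
    } else if (quirks & QUIRK_ROM_USE_DBD) {
        dbd = dbd_bit;          /* quirk: CDROM honours the request too */
    } else {
        dbd = true;             /* default: force-disable for CDROM */
    }

    /* The Apple vendor page must never return a block descriptor. */
    if ((quirks & QUIRK_APPLE_VENDOR) && page == MODE_PAGE_APPLE_VENDOR) {
        dbd = true;
    }
    return !dbd;
}
```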

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-5-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* q800: implement compat_props to enable quirk_mode_sense_rom_use_dbd for scsi-cd devices

By default quirk_mode_sense_rom_use_dbd should be enabled for all scsi-cd devices
connected to the q800 machine to correctly report the CDROM block descriptor back
to A/UX.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20220622105314.802852-6-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: add SCSI_DISK_QUIRK_MODE_PAGE_VENDOR_SPECIFIC_APPLE quirk for Macintosh

Both MacOS and A/UX make use of vendor-specific MODE SELECT commands with PF=0
to identify SCSI devices:

- MacOS sends a MODE SELECT command with PF=0 for the MODE_PAGE_VENDOR_SPECIFIC
  (0x0) mode page containing 2 bytes before initialising a disk

- A/UX (installed on disk) sends a MODE SELECT command with PF=0 during SCSI
  bus enumeration, and gets stuck in an infinite loop if it fails

Add a new SCSI_DISK_QUIRK_MODE_PAGE_VENDOR_SPECIFIC_APPLE quirk to allow both
PF=0 MODE SELECT commands and implement a MODE_PAGE_VENDOR_SPECIFIC (0x0)
mode page which is compatible with MacOS.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-7-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* q800: implement compat_props to enable quirk_mode_page_vendor_specific_apple for scsi devices

By default quirk_mode_page_vendor_specific_apple should be enabled for both scsi-hd
and scsi-cd devices to allow MacOS to format SCSI disk devices, and A/UX to
enumerate SCSI CDROM devices successfully without getting stuck in a loop.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-8-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: add FORMAT UNIT command

When initialising a drive ready to install MacOS, Apple HD SC Setup first attempts
to format the drive. Add a simple FORMAT UNIT command which simply returns success
to allow the format to succeed.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220622105314.802852-9-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: add SCSI_DISK_QUIRK_MODE_PAGE_TRUNCATED quirk for Macintosh

When A/UX configures the CDROM device it sends a truncated MODE SELECT request
for page 1 (MODE_PAGE_R_W_ERROR) which is only 6 bytes in length rather than
10. This seems to be due to a bug in Apple's code which calculates the CDB message
length incorrectly.

The work at [1] suggests that this truncated request is accepted on real
hardware whereas in QEMU it generates an INVALID_PARAM_LEN sense code which
causes A/UX to get stuck in a loop retrying the command in an attempt to succeed.

Alter the mode page request length check so that truncated requests are allowed
if the SCSI_DISK_QUIRK_MODE_PAGE_TRUNCATED quirk is enabled, whilst also adding a
trace event to enable the condition to be detected.
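
A minimal sketch of the relaxed length check, assuming a helper that is
given the declared page length and the number of bytes actually left in
the request. The helper name and the quirk bit value are hypothetical,
and the trace event is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUIRK_MODE_PAGE_TRUNCATED (1u << 0)   /* illustrative bit value */

/*
 * Return true if a mode page that declares page_len bytes may be
 * processed when only avail bytes remain in the request.
 */
static bool mode_page_len_ok(size_t page_len, size_t avail, uint32_t quirks)
{
    if (page_len <= avail) {
        return true;                /* complete page: always accepted */
    }
    /* Truncated page: accepted only when the quirk is enabled. */
    return (quirks & QUIRK_MODE_PAGE_TRUNCATED) != 0;
}
```

With the quirk off, the 6-byte request for a 10-byte page is still
rejected, which is the case that reports INVALID_PARAM_LEN.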

[1] https://68kmla.org/bb/index.php?threads/scsi2sd-project-anyone-interested.29040/page-7#post-316444

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-10-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* q800: implement compat_props to enable quirk_mode_page_truncated for scsi-cd devices

By default quirk_mode_page_truncated should be enabled for all scsi-cd devices
connected to the q800 machine to allow A/UX to enumerate SCSI CDROM devices
without hanging.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-11-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: allow the MODE_PAGE_R_W_ERROR AWRE bit to be changeable for CDROM drives

A/UX sends a MODE_PAGE_R_W_ERROR command with the AWRE bit set to 0 when enumerating
CDROM drives. Since the bit is currently hardcoded to 1, indicate that the AWRE
bit can be changed (even though we don't care about the value) so that
the MODE_PAGE_R_W_ERROR page can be set successfully.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-12-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* scsi-disk: allow MODE SELECT block descriptor to set the block size

The MODE SELECT command can contain an optional block descriptor that can be used
to set the device block size. If the block descriptor is present then update the
block size on the SCSI device accordingly.

This allows CDROMs to be used with A/UX which requires a CDROM drive which is
capable of switching from a 2048 byte sector size to a 512 byte sector size.
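
For reference, a sketch of how the block length can be read from
MODE SELECT(6) parameter data: a 4-byte header whose last byte is the
block descriptor length, followed by an 8-byte short block descriptor
with the block length in its last three bytes. The function name and
return conventions are assumptions for this example, not the scsi-disk
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODE6_HEADER_LEN 4   /* MODE SELECT(6) parameter header */
#define BLOCK_DESC_LEN   8   /* short block descriptor */

/*
 * Return the block length requested by the parameter data, 0 if no
 * block descriptor is present, or -1 if the data is malformed.
 */
static long mode_select6_block_size(const uint8_t *data, size_t len)
{
    const uint8_t *bd;
    size_t bd_len;

    if (data == NULL || len < MODE6_HEADER_LEN) {
        return -1;
    }
    bd_len = data[3];               /* block descriptor length */
    if (bd_len == 0) {
        return 0;
    }
    if (bd_len < BLOCK_DESC_LEN || len < MODE6_HEADER_LEN + BLOCK_DESC_LEN) {
        return -1;
    }
    bd = data + MODE6_HEADER_LEN;
    /* Bytes 5..7 of the descriptor hold the block length, big-endian. */
    return ((long)bd[5] << 16) | ((long)bd[6] << 8) | (long)bd[7];
}
```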

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-Id: <20220622105314.802852-13-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* q800: add default vendor and product information for scsi-hd devices

The Apple HD SC Setup program uses a SCSI INQUIRY command to check that any SCSI
hard disks detected match a whitelist of vendors and products before allowing
the "Initialise" button to prepare an empty disk.

Add known-good default vendor and product information using the existing
compat_prop mechanism so the user doesn't have to use long command lines to set
the qdev properties manually.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20220622105314.802852-14-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* q800: add default vendor and product information for scsi-cd devices

The MacOS CDROM driver uses a SCSI INQUIRY command to check that any SCSI CDROMs
detected match a whitelist of vendors and products before adding them to the
list of available devices.

Add known-good default vendor and product information using the existing
compat_prop mechanism so the user doesn't have to use long command lines to set
the qdev properties manually.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20220622105314.802852-15-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* pc-bios/s390-ccw: add -Wno-array-bounds

The -Warray-bounds option generates a lot of warnings for integers cast to pointers,
for example:

/home/pbonzini/work/upstream/qemu/pc-bios/s390-ccw/dasd-ipl.c:174:19: warning: array subscript 0 is outside array bounds of ‘CcwSeekData[0]’ [-Warray-bounds]
  174 |     seekData->cyl = 0x00;
      |     ~~~~~~~~~~~~~~^~~~~~

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* aspeed: sbc: Allow per-machine settings

In order to correctly report secure boot running firmware the values
of certain registers must be set.

We don't yet have documentation from ASPEED on what they mean. The
meaning is inferred from u-boot's use of them.

Introduce properties so the settings can be configured per-machine.

Reviewed-by: Peter Delevoryas <pdel@fb.com>
Tested-by: Peter Delevoryas <pdel@fb.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Message-Id: <20220628154740.1117349-4-clg@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw/i2c/pmbus: Add idle state to return 0xff's

Signed-off-by: Peter Delevoryas <pdel@fb.com>
Reviewed-by: Titus Rwantare <titusr@google.com>
Message-Id: <20220701000626.77395-2-me@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw/sensor: Add IC_DEVICE_ID to ISL voltage regulators

This commit adds a passthrough for PMBUS_IC_DEVICE_ID to allow Renesas
voltage regulators to return the integrated circuit device ID if they
would like to.

The behavior is very device specific, so it hasn't been added to the
general PMBUS model. Additionally, if the device ID hasn't been set,
then the voltage regulator will respond with the error byte value.  The
guest error message will change slightly for IC_DEVICE_ID with this
commit.

Signed-off-by: Peter Delevoryas <pdel@fb.com>
Reviewed-by: Titus Rwantare <titusr@google.com>
Message-Id: <20220701000626.77395-3-me@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw/sensor: Add Renesas ISL69259 device model

This adds the ISL69259, using all the same functionality as the existing
ISL69260 but overriding the IC_DEVICE_ID.

Signed-off-by: Peter Delevoryas <pdel@fb.com>
Reviewed-by: Titus Rwantare <titusr@google.com>
Message-Id: <20220701000626.77395-4-me@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Create SRAM name from first CPU index

To support multiple SoC's running simultaneously, we need a unique name for
each RAM region. DRAM is created by the machine, but SRAM is created by the
SoC, since in hardware it is part of the SoC's internals.

We need a way to uniquely identify each SRAM region though, for VM
migration. Since each of the SoC's CPU's has an index which identifies it
uniquely from other CPU's in the machine, we can use the index of any of the
CPU's in the SoC to uniquely differentiate the SRAM name from other
SoC SRAM's. In this change, I just elected to use the index of the first CPU
in each SoC.
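
A small sketch of deriving the region name from the first CPU index. The
"aspeed.sram." prefix and the helper name are assumptions made for the
example; the point is only that two SoCs with different first CPU indexes
end up with different names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a per-SoC SRAM region name into buf.
 * Returns 0 on success, -1 if the name did not fit.
 */
static int sram_region_name(char *buf, size_t size, int first_cpu_index)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, size, "aspeed.sram.%d", first_cpu_index);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```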

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220705191400.41632-3-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Refactor UART init for multi-SoC machines

This change moves the code that connects the SoC UART's to serial_hd's
to the machine.

It makes each UART a proper child member of the SoC, and then allows the
machine to selectively initialize the chardev for each UART with a
serial_hd.

This should preserve backwards compatibility, but also allow multi-SoC
boards to completely change the wiring of serial devices from the
command line to specific SoC UART's.

This also removes the uart-default property from the SoC, since the SoC
doesn't need to know what UART is the "default" on the machine anymore.

I tested this using the images and commands from the previous
refactoring, and another test image for the ast1030:

    wget https://github.com/facebook/openbmc/releases/download/v2021.49.0/fuji.mtd
    wget https://github.com/facebook/openbmc/releases/download/v2021.49.0/wedge100.mtd
    wget https://github.com/peterdelevoryas/OpenBIC/releases/download/oby35-cl-2022.13.01/Y35BCL.elf

Fuji uses UART1:

    qemu-system-arm -machine fuji-bmc \
        -drive file=fuji.mtd,format=raw,if=mtd \
        -nographic

ast2600-evb uses uart-default=UART5:

    qemu-system-arm -machine ast2600-evb \
        -drive file=fuji.mtd,format=raw,if=mtd \
        -serial null -serial mon:stdio -display none

Wedge100 uses UART3:

    qemu-system-arm -machine palmetto-bmc \
        -drive file=wedge100.mtd,format=raw,if=mtd \
        -serial null -serial null -serial null \
        -serial mon:stdio -display none

AST1030 EVB uses UART5:

    qemu-system-arm -machine ast1030-evb \
        -kernel Y35BCL.elf -nographic

Fixes: 6827ff20b2975 ("hw: aspeed: Init all UART's with serial devices")
Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220705191400.41632-4-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Make aspeed_board_init_flashes public

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220705191400.41632-5-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Add fby35 skeleton

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220705191400.41632-6-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Add AST2600 (BMC) to fby35

You can test booting the BMC with both '-device loader' and '-drive
file'. This is necessary because of how the fb-openbmc boot sequence
works (jump to 0x20000000 after U-Boot SPL).

    wget https://github.com/facebook/openbmc/releases/download/openbmc-e2294ff5d31d/fby35.mtd
    qemu-system-arm -machine fby35 -nographic \
        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -drive file=fby35.mtd,format=raw,if=mtd

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220705191400.41632-7-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: fby35: Add a bootrom for the BMC

The BMC boots from the first flash device by fetching instructions
from the flash contents. Add an alias region on 0x0 for this
purpose. There are currently performance issues with this method (TBs
being flushed too often), so as a faster alternative, install the
flash contents as a ROM in the BMC memory space.

See commit 1a15311a12fa ("hw/arm/aspeed: add a 'execute-in-place'
property to boot directly from CE0")

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Peter Delevoryas <peter@pjd.dev>
[ clg: blk_pread() fixes ]
Message-Id: <20220705191400.41632-8-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Add AST1030 (BIC) to fby35

With the BIC, the easiest way to run everything is to create two pty's
for each SoC and reserve stdin/stdout for the monitor:

    wget https://github.com/facebook/openbmc/releases/download/openbmc-e2294ff5d31d/fby35.mtd
    wget https://github.com/peterdelevoryas/OpenBIC/releases/download/oby35-cl-2022.13.01/Y35BCL.elf
    qemu-system-arm -machine fby35 \
        -drive file=fby35.mtd,format=raw,if=mtd \
        -device loader,file=fby35.mtd,addr=0,cpu-num=0 \
        -serial pty -serial pty -serial mon:stdio -display none -S

    screen /dev/ttys0
    screen /dev/ttys1
    (qemu) c

This commit only adds the first server board's Bridge IC, but in the
future we'll try to include the other three server board Bridge IC's
too.

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220705191400.41632-9-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* docs: aspeed: Add fby35 multi-SoC machine section

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
[ clg: - fixed URL links
       - Moved Facebook Yosemite section at the end of the file ]
Message-Id: <20220705191400.41632-10-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* docs: aspeed: Minor updates

Some more controllers have been modeled recently. Reflect that in the
list of supported devices. New machines were also added.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Message-Id: <20220706172131.809255-1-clg@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* test/avocado/machine_aspeed.py: Add SDK tests

The Aspeed SDK kernel usually includes support for the latest HW
features. This is useful for exercising QEMU and discovering the gaps
in the models.

Add extra I2C tests for the AST2600 EVB machine to check the new
register interface.

Message-Id: <20220707091239.1029561-1-clg@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw: m25p80: Add Block Protect and Top Bottom bits for write protect

Signed-off-by: Iris Chen <irischenlj@fb.com>
Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com>
Message-Id: <20220708164552.3462620-1-irischenlj@fb.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw: m25p80: add tests for BP and TB bit write protect

Signed-off-by: Iris Chen <irischenlj@fb.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220627185234.1911337-3-irischenlj@fb.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* qtest/aspeed_gpio: Add input pin modification test

Verify the current behavior, which is that input pins can be modified by
guest OS register writes.

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220712023219.41065-2-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw/gpio/aspeed: Don't let guests modify input pins

Up until now, guests could modify input pins by overwriting the data
value register. The guest OS should only be allowed to modify output pin
values, and the QOM property setter should only be permitted to modify
input pins.
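
The masking this implies can be sketched as below, assuming a direction
register in which a set bit marks the pin as an output. The function name
is hypothetical and the real model handles more registers than this.

```c
#include <stdint.h>

/*
 * Apply a guest write to the data value register.
 * Only pins configured as outputs (direction bit set, an assumption of
 * this sketch) take the written value; input pins keep their old state.
 */
static uint32_t gpio_guest_write(uint32_t old_data, uint32_t direction,
                                 uint32_t value)
{
    return (old_data & ~direction) | (value & direction);
}
```

The QOM property setter would use the inverse mask, so that it can only
change pins that are currently inputs.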

This change also updates the gpio input pin test to match this
expectation.

Andrew suggested this particular refactoring here:

    https://lore.kernel.org/qemu-devel/23523aa1-ba81-412b-92cc-8174faba3612@www.fastmail.com/

Suggested-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Fixes: 4b7f956862dc ("hw/gpio: Add basic Aspeed GPIO model for AST2400 and AST2500")
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220712023219.41065-3-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* aspeed: Add fby35-bmc slot GPIO's

Signed-off-by: Peter Delevoryas <peter@pjd.dev>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220712023219.41065-4-peter@pjd.dev>
Signed-off-by: Cédric Le Goater <clg@kaod.org>

* hw/nvme: Implement shadow doorbell buffer support

Implement Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
and Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
command, the nvme_dbbuf_config function tries to associate each existing
SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
Queues created after the Doorbell Buffer Config command will have the
doorbell buffers associated with them when they are initialized.

In nvme_process_sq and nvme_post_cqe, proactively check for Shadow
Doorbell buffer changes instead of waiting for doorbell register changes.
This reduces the number of MMIOs.

In nvme_process_db(), update the shadow doorbell buffer value with
the doorbell register value if it is the admin queue. This is a hack
since hosts like Linux NVMe driver and SPDK do not use shadow
doorbell buffer for the admin queue. Copying the doorbell register
value to the shadow doorbell buffer allows us to support these hosts
as well as spec-compliant hosts that use shadow doorbell buffer for
the admin queue.
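
A simplified model of the two paths described above. A plain pointer
stands in for the guest-memory shadow doorbell slot (the device would use
DMA accesses instead), and the SubQueue type and function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t sqid;          /* 0 is the admin queue */
    uint32_t tail;          /* the device's view of the tail pointer */
    uint32_t *shadow_tail;  /* shadow doorbell slot, NULL if not configured */
} SubQueue;

/* Before processing a queue, pick up the tail published by the host. */
static void sq_refresh_tail(SubQueue *sq)
{
    assert(sq != NULL);
    if (sq->shadow_tail != NULL) {
        sq->tail = *sq->shadow_tail;
    }
}

/*
 * Handle an MMIO doorbell write. For the admin queue the value is also
 * mirrored into the shadow buffer, so a later refresh does not read a
 * stale value from hosts that never update the admin queue's shadow slot.
 */
static void sq_doorbell_write(SubQueue *sq, uint32_t new_tail)
{
    assert(sq != NULL);
    sq->tail = new_tail;
    if (sq->sqid == 0 && sq->shadow_tail != NULL) {
        *sq->shadow_tail = new_tail;
    }
}
```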

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
[k.jensen: rebased]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

* hw/nvme: Add trace events for shadow doorbell buffer

When shadow doorbell buffer is enabled, doorbell registers are lazily
updated. The actual queue head and tail pointers are stored in Shadow
Doorbell buffers.

Add trace events for updates on the Shadow Doorbell buffers and EventIdx
buffers. Also add trace event for the Doorbell Buffer Config command.

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
[k.jensen: rebased]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

* hw/nvme: fix example serial in documentation

The serial prop on the controller is actually describing the nvme
subsystem serial, which has to be identical for all controllers within
the same nvme subsystem.

This is enforced since commit a859eb9f8f64 ("hw/nvme: enforce common
serial per subsystem").

Fix the documentation, so that people copying the qemu command line
example won't get an error on qemu start.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

* hw/nvme: force nvme-ns param 'shared' to false if no nvme-subsys node

Since commit 916b0f0b5264 ("hw/nvme: change nvme-ns 'shared' default")
the default value of nvme-ns param 'shared' is set to true, regardless
if there is a nvme-subsys node or not.

On a system without a nvme-subsys node, a namespace will never be able
to be attached to more than one controller, so for this configuration,
it is counterintuitive for this parameter to be set by default.

Force the nvme-ns param 'shared' to false for configurations where
there is no nvme-subsys node, as the namespace will never be able to
attach to more than one controller anyway.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

* nvme: Fix misleading macro when mixed with ternary operator

Using the Parfait source code analyser, an issue was found in
hw/nvme/ctrl.c where the macros NVME_CAP_SET_CMBS and NVME_CAP_SET_PMRS
are called with a ternary operator in the second parameter, resulting
in a potentially unexpected expansion of the form:

  x ? a: b & FLAG_TEST

which will give a different result to:

  (x ? a: b) & FLAG_TEST.

The macros should wrap each of the parameters in brackets to ensure the
correct result on expansion.
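
The effect is easy to reproduce in isolation. The macros below are
illustrative stand-ins, not the NVME_CAP_SET_* definitions: because '&'
binds more tightly than '?:', the unparenthesised form masks only the
"else" arm of the ternary.

```c
#define FLAG_TEST 0x1

/* Unsafe: the parameter is pasted into the expression as-is. */
#define MASK_UNSAFE(val) (val & FLAG_TEST)

/* Safe: the parameter is wrapped in brackets before use. */
#define MASK_SAFE(val)   ((val) & FLAG_TEST)

/* Expands to (x ? a : b & FLAG_TEST), i.e. x ? a : (b & FLAG_TEST). */
static int masked_unsafe(int x, int a, int b)
{
    return MASK_UNSAFE(x ? a : b);
}

/* Expands to ((x ? a : b) & FLAG_TEST), which is what was intended. */
static int masked_safe(int x, int a, int b)
{
    return MASK_SAFE(x ? a : b);
}
```

With x = 1, a = 2, b = 3 the unsafe form yields 2 (the mask is never
applied to a), while the safe form yields 2 & 1 = 0.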

Signed-off-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

* hw/nvme: Use ioeventfd to handle doorbell updates

Add property "ioeventfd" which is enabled by default. When this is
enabled, updates on the doorbell registers will cause KVM to signal
an event to the QEMU main loop to handle the doorbell updates.
Therefore, instead of letting the vcpu thread run both guest VM and
IO emulation, we now use the main loop thread to do IO emulation and
thus the vcpu thread has more cycles for the guest VM.

Since ioeventfd does not tell us the exact value that is written, it is
only useful when shadow doorbell buffer is enabled, where we check
for the value in the shadow doorbell buffer when we get the doorbell
update event.

IOPS comparison on Linux 5.19-rc2: (Unit: KIOPS)

qd           1   4  16  64
qemu        35 121 176 153
ioeventfd   41 133 258 313

Changes since v3:
 - Do not deregister ioeventfd when it was not enabled on a SQ/CQ

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

* MAINTAINERS: Add myself as Guest Agent co-maintainer

Signed-off-by: Konstantin Kostiuk <kkostiuk@redhat.com>
Acked-by: Michael Roth <michael.roth@amd.com>

* hw/intc/armv7m_nvic: ICPRn must not unpend an IRQ that is being held high

In the M-profile Arm ARM, rule R_CVJS defines when an interrupt should
be set to the Pending state:
 A) when the input line is high and the interrupt is not Active
 B) when the input line transitions from low to high and the interrupt
    is Active
(Note that the first of these is an ongoing condition, and the
second is a point-in-time event.)

This can be rephrased as:
 1 when the line goes from low to high, set Pending
 2 when Active goes from 1 to 0, if line is high then set Pending
 3 ignore attempts to clear Pending when the line is high
   and Active is 0

where 1 covers both B and one of the "transition into condition A"
cases, 2 deals with the other "transition into condition A"
possibility, and 3 is "don't drop Pending if we're already in
condition A".  Transitions out of condition A don't affect Pending
state.

We handle case 1 in set_irq_level(). For an interrupt (as opposed
to other kinds of exception) the only place where we clear Active
is in armv7m_nvic_complete_irq(), where we handle case 2 by
checking for whether we need to re-pend the exception. For case 3,
the only places where we clear Pending state on an interrupt are in
armv7m_nvic_acknowledge_irq() (where we are setting Active so it
doesn't count) and for writes to NVIC_ICPRn.

It is the "write to NVIC_ICPRn" case that we missed: we must ignore
this if the input line is high and the interrupt is not Active.
(This required behaviour is differently and perhaps more clearly
stated in the v7M Arm ARM, which has pseudocode in section B3.4.1
that implies it.)
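
The three rephrased rules can be modelled with a tiny state machine. This
is a sketch for illustration only; the Irq type and function names are
hypothetical and do not correspond to the NVIC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool level;     /* current state of the input line */
    bool pending;
    bool active;
} Irq;

/* Rule 1: a low-to-high transition on the line sets Pending. */
static void irq_set_level(Irq *irq, bool level)
{
    assert(irq != NULL);
    if (level && !irq->level) {
        irq->pending = true;
    }
    irq->level = level;
}

/* Taking the interrupt: Pending is cleared while Active is being set. */
static void irq_acknowledge(Irq *irq)
{
    assert(irq != NULL);
    irq->pending = false;
    irq->active = true;
}

/* Rule 2: when Active goes 1 -> 0 and the line is high, re-pend. */
static void irq_complete(Irq *irq)
{
    assert(irq != NULL);
    irq->active = false;
    if (irq->level) {
        irq->pending = true;
    }
}

/* Rule 3: an ICPRn-style clear is ignored if line high and not Active. */
static void irq_clear_pending(Irq *irq)
{
    assert(irq != NULL);
    if (irq->level && !irq->active) {
        return;
    }
    irq->pending = false;
}
```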

Reported-by: Igor Kotrasiński <i.kotrasinsk@samsung.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-id: 20220628154724.3297442-1-peter.maydell@linaro.org

* target/arm: Fill in VL for tbflags when SME enabled and SVE disabled

When PSTATE.SM is set, VL = SVL even if SVE is disabled.
This is visible in kselftest ssve-test.

Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220713045848.217364-2-richard.henderson@linaro.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

* target/arm: Fix aarch64_sve_change_el for SME

We were only checking for SVE disabled and not taking into
account PSTATE.SM to check SME disabled, which resulted in
vectors being incorrectly truncated.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220713045848.217364-3-richard.henderson@linaro.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

* linux-user/aarch64: Do not clear PROT_MTE on mprotect

The documentation for PROT_MTE says that it cannot be cleared
by mprotect.  Further, the implementation of the VM_ARCH_CLEAR bit
contains PROT_BTI, confirming that bit should be cleared.

Introduce PAGE_TARGET_STICKY to allow target/arch/cpu.h to control
which bits may be reset during page_set_flags.  This is sort of the
opposite of VM_ARCH_CLEAR, but works better with qemu's PAGE_* bits
that are separate from PROT_* bits.
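
In outline, a sticky mask works like this. The flag values and the
function name are made up for the sketch; only PAGE_TARGET_STICKY is a
name from this change, and here it is assumed to cover a single
hypothetical MTE bit.

```c
#include <stdint.h>

#define PAGE_READ           (1u << 0)   /* illustrative flag values */
#define PAGE_WRITE          (1u << 1)
#define PAGE_MTE            (1u << 2)
#define PAGE_TARGET_STICKY  PAGE_MTE    /* bits a later update cannot clear */

/*
 * Compute the flags stored for a page when its protection is changed:
 * the caller's new flags, plus any sticky bits that were already set.
 */
static uint32_t page_update_flags(uint32_t old_flags, uint32_t new_flags)
{
    return new_flags | (old_flags & PAGE_TARGET_STICKY);
}
```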

Reported-by: Vitaly Buka <vitalybuka@google.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220711031420.17820-1-richard.henderson@linaro.org
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

* target/arm: Define and use new regime_tcr_value() function

The regime_tcr() function returns a pointer to a struct TCR
corresponding to the TCR controlling a translation regime.  The
struct TCR has the raw value of the register, plus two fields mask
and base_mask which are used as a small optimization in the case of
32-bit short-descriptor lookups.  Almost all callers of regime_tcr()
only want the raw register value.  Define and use a new
regime_tcr_value() function which returns only the raw 64-bit
register value.

This is a preliminary to removing the 32-bit short descriptor
optimization -- it only saves a handful of bit operations, which is
tiny compared to the overhead of doing a page table walk at all, and
the TCR struct is awkward and makes fixing
https://gitlab.com/qemu-project/qemu/-/issues/1103 unnecessarily
difficult.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-2-peter.maydell@linaro.org

* target/arm: Calculate mask/base_mask in get_level1_table_address()

In get_level1_table_address(), instead of using precalculated values
of mask and base_mask from the TCR struct, calculate them directly
(in the same way we currently do in vmsa_ttbcr_raw_write() to
populate the TCR struct fields).
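
For the short-descriptor format, both values depend only on TTBCR.N
(bits [2:0]). The sketch below follows my reading of the architecture and
is not copied from the patch; the function names are hypothetical.

```c
#include <stdint.h>

/* TTBCR.N lives in bits [2:0] of the register. */
static uint32_t ttbcr_n(uint32_t ttbcr)
{
    return ttbcr & 0x7u;
}

/* Addresses with any of these top N bits set are translated via TTBR1. */
static uint32_t ttbr_select_mask(uint32_t ttbcr)
{
    return ~(UINT32_C(0xffffffff) >> ttbcr_n(ttbcr));
}

/* Bits of TTBR0 that hold the level 1 table base: [31:14-N]. */
static uint32_t ttbr0_base_mask(uint32_t ttbcr)
{
    return ~(UINT32_C(0x3fff) >> ttbcr_n(ttbcr));
}
```

With N = 0 the select mask is 0, so every address uses TTBR0 and the
table base is 16KB aligned.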

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-3-peter.maydell@linaro.org

* target/arm: Fold regime_tcr() and regime_tcr_value() together

The only caller of regime_tcr() is now regime_tcr_value(); fold the
two together, and use the shorter and more natural 'regime_tcr'
name for the new function.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-4-peter.maydell@linaro.org

* target/arm: Fix big-endian host handling of VTCR

We have a bug in our handling of accesses to the AArch32 VTCR
register on big-endian hosts: we were not adjusting the part of the
uint64_t field within TCR that the generated code would access.  That
can be done with offsetoflow32(), by using an ARM_CP_STATE_BOTH cpreg
struct, or by defining a full set of read/write/reset functions --
the various other TCR cpreg structs used one or another of those
strategies, but for VTCR we did not, so on a big-endian host VTCR
accesses would touch the wrong half of the register.

Use offsetoflow32() in the VTCR register struct.  This works even
though the field in the CPU struct is currently a struct TCR, because
the first field in that struct is the uint64_t raw_tcr.

None of the other TCR registers have this bug -- either they are
AArch64 only, or else they define resetfn, writefn, etc, and
expect to be passed the full struct pointer.
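
The underlying host-endianness issue can be shown with a runtime
analogue. This is not the offsetoflow32() macro itself (which is
evaluated at compile time); the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of the low 32 bits inside a uint64_t on this host. */
static size_t low32_offset_in_u64(void)
{
    const uint64_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    /* Little-endian hosts store the low byte first. */
    return first == 1 ? 0 : sizeof(uint32_t);
}

/* Read the low half of a 64-bit register field as a 32-bit value. */
static uint32_t read_low32(const uint64_t *reg)
{
    uint32_t out;

    memcpy(&out, (const uint8_t *)reg + low32_offset_in_u64(), sizeof(out));
    return out;
}
```

A 32-bit access at offset 0 only hits the low half on a little-endian
host; on a big-endian host it would touch the high half instead.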

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-5-peter.maydell@linaro.org

* target/arm: Store VTCR_EL2, VSTCR_EL2 registers as uint64_t

Change the representation of the VSTCR_EL2 and VTCR_EL2 registers in
the CPU state struct from struct TCR to uint64_t.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-6-peter.maydell@linaro.org

* target/arm: Store TCR_EL* registers as uint64_t

Change the representation of the TCR_EL* registers in the CPU state
struct from struct TCR to uint64_t.  This allows us to drop the
custom vmsa_ttbcr_raw_write() function, moving the "enforce RES0"
checks to their more usual location in the writefn
vmsa_ttbcr_write().  We also don't need the resetfn any more.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-7-peter.maydell@linaro.org

* target/arm: Honour VTCR_EL2 bits in Secure EL2

In regime_tcr() we return the appropriate TCR register for the
translation regime.  For Secure EL2, we return the VSTCR_EL2 value,
but in this translation regime some fields that control behaviour are
in VTCR_EL2.  When this code was originally written (as the comment
notes), QEMU didn't care about any of those fields, but we have since
added support for features such as LPA2 which do need the values from
those fields.

Synthesize a TCR value by merging in the relevant VTCR_EL2 fields to
the VSTCR_EL2 value.
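
The merge can be pictured with the minimal sketch below. The mask value
and the demo_* names are assumptions made up for illustration; the real
set of fields taken from VTCR_EL2 is defined by the architecture and by
the QEMU sources, not by this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mask of the bits that should come from VTCR_EL2.
 * This value is only for illustration and does not match real hardware.
 */
#define DEMO_SHARED_FIELD_MASK UINT64_C(0x000000000000ff00)

/* Take the shared fields from vtcr and every other bit from vstcr. */
static uint64_t demo_merge_tcr(uint64_t vstcr, uint64_t vtcr)
{
    uint64_t v = vstcr & ~DEMO_SHARED_FIELD_MASK;

    v |= vtcr & DEMO_SHARED_FIELD_MASK;
    return v;
}
```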

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1103
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220714132303.1287193-8-peter.maydell@linaro.org

* hw/adc: Fix CONV bit in NPCM7XX ADC CON register

The CONV bit in the NPCM7XX ADC CON register is bit 13. This patch
fixes that in the module, and also lowers the IRQ when the guest
is done handling an interrupt event from the ADC module.

Signed-off-by: Hao Wu <wuhaotsh@google.com>
Reviewed-by: Patrick Venture <venture@google.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 20220714182836.89602-4-wuhaotsh@google.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

* hw/adc: Make adci[*] R/W in NPCM7XX ADC

Our sensor test requires both reading and writing from a sensor's
QOM property. So we need to make the input of ADC module R/W instead
of write only for that to work.

Signed-off-by: Hao Wu <wuhaotsh@google.com>
Reviewed-by: Titus Rwantare <titusr@google.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 20220714182836.89602-5-wuhaotsh@google.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

* target/arm: Don't set syndrome ISS for loads and stores with writeback

The architecture requires that for faults on loads and stores which
do writeback, the syndrome information does not have the ISS
instruction syndrome information (i.e. ISV is 0).  We got this wrong
for the load and store instructions covered by disas_ldst_reg_imm9().
Calculate iss_valid correctly so that if the insn is a writeback one
it is false.
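
A minimal sketch of the rule, assuming a simplified model with three
addressing forms. The enum and function names are hypothetical and do
not mirror the decoder's actual variables.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical addressing forms; the indexed forms update the base register. */
typedef enum {
    DEMO_OFFSET,
    DEMO_POST_INDEX,
    DEMO_PRE_INDEX
} DemoAddrMode;

/* True for the forms that write the updated address back to the base register. */
static bool demo_has_writeback(DemoAddrMode mode)
{
    return mode == DEMO_POST_INDEX || mode == DEMO_PRE_INDEX;
}

/*
 * The instruction syndrome is reported only when it would otherwise be
 * valid and the instruction does not perform writeback.
 */
static bool demo_iss_valid(DemoAddrMode mode, bool otherwise_valid)
{
    return otherwise_valid && !demo_has_writeback(mode);
}
```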

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1057
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20220715123323.1550983-1-peter.maydell@linaro.org

* Align Raspberry Pi DMA interrupts with Linux DTS

There is nothing in the specs on DMA engine interrupt lines: it should have
been in the "BCM2835 ARM Peripherals" datasheet but the appropriate
"ARM peripherals interrupt table" (p.113) is nearly empty.

All Raspberry Pi models 1-3 (based on bcm2835) share a Linux device tree
(arch/arm/boot/dts/bcm2835-common.dtsi +25) that notes:

    /* dma channel 11-14 share one irq */

This information is repeated in the driver code
(drivers/dma/bcm2835-dma.c +1344):

    /*
     * in case of channel >= 11
     * use the 11th interrupt and that is shared
     */

In this patch channels 0--10 and 11--14 are handled separately.
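
The channel split quoted above can be sketched as a small mapping
function. This is an illustration only: the demo_* names are
hypothetical, and how the shared line is actually combined in the
device model is not shown.

```c
#include <assert.h>

#define DEMO_DMA_CHANNELS   15  /* channels 0..14, as described above */
#define DEMO_DMA_SHARED_IRQ 11  /* index of the line shared by channels 11..14 */

/* Map a DMA channel number to the index of the interrupt line it raises. */
static int demo_dma_irq_index(int channel)
{
    assert(channel >= 0 && channel < DEMO_DMA_CHANNELS);
    return channel < DEMO_DMA_SHARED_IRQ ? channel : DEMO_DMA_SHARED_IRQ;
}
```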

Signed-off-by: Andrey Makarov <andrey.makarov@auriga.com>
Message-id: 20220716113210.349153-1-andrey.makarov@auriga.com
[PMM: fixed checkpatch nits]
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

* monitor: add support for boolean statistics

The next version of Linux will introduce boolean statistics, which
can only have 0 or 1 values.  Support them in the schema and in
the HMP command.

Suggested-by: Amneesh Singh <natto@weirdnatto.in>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* kvm: add support for boolean statistics

The next version of Linux will introduce boolean statistics, which
can only have 0 or 1 values.  Convert them to the new QAPI fields
added in the previous commit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* ppc64: Allocate IRQ lines with qdev_init_gpio_in()

This replaces the IRQ array 'irq_inputs' with GPIO lines, the goal
being to remove 'irq_inputs' when all CPUs have been converted.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220705145814.461723-2-clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* ppc/40x: Allocate IRQ lines with qdev_init_gpio_in()

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220705145814.461723-3-clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* ppc/6xx: Allocate IRQ lines with qdev_init_gpio_in()

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220705145814.461723-4-clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* ppc/e500: Allocate IRQ lines with qdev_init_gpio_in()

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220705145814.461723-5-clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* ppc: Remove unused irq_inputs

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220705145814.461723-6-clg@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* hw/ppc: pass random seed to fdt

If the FDT contains /chosen/rng-seed, then the Linux RNG will use it to
initialize early. Set this using the usual guest random number
generation function. This is confirmed to successfully initialize the
RNG on Linux 5.19-rc6. The rng-seed node is part of the DT spec. Set
this on the paravirt platforms, spapr and e500, just as is done on other
architectures with paravirt hardware.

Cc: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220712135114.289855-1-Jason@zx2c4.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc/kvm: Skip current and parent directories in kvmppc_find_cpu_dt

Some systems have /proc/device-tree/cpus/../clock-frequency. However,
this is not the expected path for a CPU device tree directory.
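
A minimal sketch of the filtering step, assuming the directory entries
have already been read into an array of names. The helper names are
hypothetical and this is not the code from kvmppc_find_cpu_dt().

```c
#include <assert.h>
#include <string.h>

/* Return 1 for the "." and ".." directory entries, 0 for anything else. */
static int demo_is_dot_or_dotdot(const char *name)
{
    return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/* Count the entries a directory scan would keep after skipping "." and "..". */
static int demo_count_candidates(const char *const *names, int n)
{
    int kept = 0;

    for (int i = 0; i < n; i++) {
        if (!demo_is_dot_or_dotdot(names[i])) {
            kept++;
        }
    }
    return kept;
}
```

Note that only the two exact names are skipped; other entries that
merely begin with a dot are still considered.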

Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220712210810.35514-1-muriloo@linux.ibm.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Fix gen_priv_exception error value in mfspr/mtspr

The code in linux-user/ppc/cpu_loop.c expects POWERPC_EXCP_PRIV
exception with error POWERPC_EXCP_PRIV_OPC or POWERPC_EXCP_PRIV_REG,
while POWERPC_EXCP_INVAL_SPR is expected in POWERPC_EXCP_INVAL
exceptions. This mismatch caused an EXCP_DUMP with the message "Unknown
privilege violation (03)", as seen in [1].

[1] https://gitlab.com/qemu-project/qemu/-/issues/588
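
To make the mismatch concrete, here is a minimal sketch of a consumer
that decodes the low bits of a privilege-violation error code. The
DEMO_* constants and their numeric values are assumptions chosen for
illustration; the real definitions live in the PowerPC target headers.

```c
#include <assert.h>

/* Illustrative values only, not copied from the QEMU headers. */
enum {
    DEMO_EXCP_PRIV      = 0x30,  /* exception class in the high bits */
    DEMO_EXCP_PRIV_OPC  = 0x01,  /* privileged opcode */
    DEMO_EXCP_PRIV_REG  = 0x02,  /* privileged register */
    DEMO_EXCP_INVAL_SPR = 0x03   /* belongs to the "invalid" class instead */
};

typedef enum {
    DEMO_KIND_UNKNOWN,
    DEMO_KIND_OPCODE,
    DEMO_KIND_REGISTER
} DemoPrivKind;

/* Classify a privilege-violation error code by its low four bits. */
static DemoPrivKind demo_classify_priv(int error_code)
{
    assert((error_code & 0xF0) == DEMO_EXCP_PRIV);
    switch (error_code & 0x0F) {
    case DEMO_EXCP_PRIV_OPC:
        return DEMO_KIND_OPCODE;
    case DEMO_EXCP_PRIV_REG:
        return DEMO_KIND_REGISTER;
    default:
        return DEMO_KIND_UNKNOWN;
    }
}
```

Combining the privilege class with a sub-code from the invalid class
lands in the default branch, which is the shape of failure described
above.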

Fixes: 9b2fadda3e01 ("ppc: Rework generation of priv and inval interrupts")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/588
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Message-Id: <20220627141104.669152-2-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: fix exception error value in slbfee

Testing on a POWER9 DD2.3, we observed that the Linux kernel delivers a
signal with si_code ILL_PRVOPC (5) when a userspace application tries to
use slbfee. To obtain this behavior on linux-user, we should use
POWERPC_EXCP_PRIV with POWERPC_EXCP_PRIV_OPC.

No functional change is intended for softmmu targets as
gen_hvpriv_exception uses the same 'exception' argument
(POWERPC_EXCP_HV_EMU) for raise_exception_*, and the powerpc_excp_*
methods do not use lower bits of the exception error code when handling
POWERPC_EXCP_{INVAL,PRIV}.

Reported-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220627141104.669152-3-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: remove mfdcrux and mtdcrux

The only PowerPC implementations with these insns were the 460 and 460F,
which had their definitions removed in [1].

[1] 7ff26aa6c657 ("target/ppc: Remove unused PPC 460 and 460F definitions")

Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20220627141104.669152-4-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: fix exception error code in helper_{load, store}_dcr

POWERPC_EXCP_INVAL should only be or-ed with other constants prefixed
with POWERPC_EXCP_INVAL_. Also, take the opportunity to move both
helpers under #if !defined(CONFIG_USER_ONLY) as the instructions that
use them are privileged.

No functional change is intended: the lower 4 bits of the error code are
ignored by all powerpc_excp_* methods on POWERPC_EXCP_INVAL exceptions.

Reported-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220627141104.669152-5-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: fix PMU Group A register read/write exceptions

A call to "gen_(hv)priv_exception" should use POWERPC_EXCP_PRIV_* as the
'error' argument instead of POWERPC_EXCP_INVAL_*, and POWERPC_EXCP_FU is
an exception type, not an exception error code. To correctly set
FSCR[IC], we should raise Facility Unavailable with this exception type
and IC value as the error code.

Fixes: 565cb1096733 ("target/ppc: add user read/write functions for MMCR0")
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220627141104.669152-6-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: fix exception error code in spr_write_excp_vector

The 'error' argument of gen_inval_exception will be or-ed with
POWERPC_EXCP_INVAL, so it should always be a constant prefixed with
POWERPC_EXCP_INVAL_. No functional change is intended:
spr_write_excp_vector is only used by register_BookE_sprs, and
powerpc_excp_booke ignores the lower 4 bits of the error code on
POWERPC_EXCP_INVAL exceptions.

Also, take the opportunity to replace printf with qemu_log_mask.

Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220627141104.669152-7-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move tlbie[l] to decode tree

Also decode RIC, PRS and R operands.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220712193741.59134-2-leandro.lupori@eldorado.org.br>
[danielhb: mark bit 31 in @X_tlbie pattern as ignored]
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Implement ISA 3.00 tlbie[l]

This initial version supports the invalidation of one or all
TLB entries. Flushing by PID/LPID, or based on process/partition
scope, is not supported, because it would make using the
generic QEMU TLB implementation hard. In these cases, all
entries are flushed.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220712193741.59134-3-leandro.lupori@eldorado.org.br>
[danielhb: moved 'set' declaration to TLBIE_RIC_PWC block]
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: receive DisasContext explicitly in GEN_PRIV

GEN_PRIV and related CHK_* macros just assumed that a variable named
"ctx" would be in scope when they are used, and that it would be a
pointer to DisasContext. Change these macros to receive the pointer
explicitly.
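
The difference between the two macro styles can be sketched as follows.
DemoContext and the DEMO_* macros are hypothetical, cut-down stand-ins
and are not the actual GEN_PRIV or CHK_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for a translator context. */
typedef struct DemoContext {
    bool supervisor;
    int faults;
} DemoContext;

/* Implicit style: silently depends on a variable called "ctx" being in scope. */
#define DEMO_CHK_SV_IMPLICIT()          \
    do {                                \
        if (!ctx->supervisor) {         \
            ctx->faults++;              \
            return;                     \
        }                               \
    } while (0)

/* Explicit style: the caller names the context pointer. */
#define DEMO_CHK_SV(CTX)                \
    do {                                \
        if (!(CTX)->supervisor) {       \
            (CTX)->faults++;            \
            return;                     \
        }                               \
    } while (0)

/* Works with any parameter name because the pointer is passed in. */
static void demo_privileged_insn(DemoContext *dc, int *executed)
{
    DEMO_CHK_SV(dc);
    (*executed)++;
}
```

The implicit form would fail to compile inside demo_privileged_insn()
because the parameter is named dc rather than ctx.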

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-2-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: add macros to check privilege level

Equivalent to CHK_SV and CHK_HV, but can be used in decodetree methods.

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-3-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbie to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-4-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbieg to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-5-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbia to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-6-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbmte to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-7-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbmfev to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-8-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbmfee to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-9-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbfee to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-10-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Move slbsync to decodetree

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-11-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Implement slbiag

Reviewed-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Signed-off-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Message-Id: <20220701133507.740619-12-lucas.coutinho@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: check tb_env != 0 before printing TBU/TBL/DECR

When using "-machine none", env->tb_env is not allocated, causing the
segmentation fault reported in issue #85 (launchpad bug #811683). To
avoid this problem, check if the pointer != NULL before calling the
methods to print TBU/TBL/DECR.
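
A minimal sketch of the guard, using hypothetical DemoEnv and
DemoTimebase types rather than the real CPU state, and formatting into
a buffer instead of the monitor output:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the timebase state and the CPU environment. */
typedef struct DemoTimebase {
    unsigned long long tb;
} DemoTimebase;

typedef struct DemoEnv {
    DemoTimebase *tb_env;  /* may be NULL when no machine set it up */
} DemoEnv;

/*
 * Format the timebase into buf only if it was allocated.
 * Returns the number of characters written, or 0 if there is nothing to print.
 */
static int demo_dump_timebase(const DemoEnv *env, char *buf, size_t len)
{
    assert(len > 0);
    buf[0] = '\0';
    if (env->tb_env == NULL) {
        return 0;
    }
    return snprintf(buf, len, "TB %016llx", env->tb_env->tb);
}
```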

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/85
Signed-off-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220714172343.80539-1-matheus.ferst@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* ppc: Check partition and process table alignment

Check if partition and process tables are properly aligned to
their size, according to the PowerISA 3.1B, Book III 6.7.6 programming
note. Hardware and KVM also raise an exception in these cases.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20220628133959.15131-2-leandro.lupori@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Improve Radix xlate level validation

Check if the number and size of Radix levels are valid on
POWER9/POWER10 CPUs, according to the supported Radix Tree
Configurations described in their User Manuals.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20220628133959.15131-3-leandro.lupori@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* target/ppc: Check page dir/table base alignment

According to PowerISA 3.1B, Book III 6.7.6 programming note, the
page directory base addresses are expected to be aligned to their
size. Real hardware seems to rely on that and will access the
wrong address if they are misaligned. This results in a
translation failure even if the page tables seem to be properly
populated.
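
An "aligned to its size" check can be sketched as below, assuming the
size is a power of two. The function name is hypothetical and this is
not the code added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if base is a multiple of size; size must be a nonzero power of two. */
static bool demo_is_aligned_to_size(uint64_t base, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (base & (size - 1)) == 0;
}
```

For example, a 4 KiB table at 0x10000 passes the check, while the same
table at 0x10800 does not.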

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20220628133959.15131-4-leandro.lupori@eldorado.org.br>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

* qga: treat get-guest-fsinfo as "best effort"

In some container environments, there may be references to block devices
witnessable from a container through /proc/self/mountinfo that reference
devices we simply don't have access to in the container, and cannot
provide information about.

Instead of failing the entire fsinfo command, return stub information
for these failed lookups.

This allows test-qga to pass under docker tests, which are in turn used
by the CentOS VM tests.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220708153503.18864-2-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: use 'cp' instead of 'ln' for temporary vm images

If the initial setup fails, you've permanently altered the state of the
downloaded image in an unknowable way. Use 'cp' like our other test
setup scripts do.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-3-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: switch CentOS 8 to CentOS 8 Stream

The old CentOS image didn't work anymore because it was already EOL at
the beginning of 2022.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-4-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: switch centos.aarch64 to CentOS 8 Stream

Switch this test over to using a cloud image like the base CentOS8 VM
test, which helps make this script a bit simpler too.

Note: At time of writing, this test seems pretty flaky when run without
KVM support for aarch64. Certain unit tests like migration-test,
virtio-net-failover, test-hmp and qom-test seem quite prone to fail
under TCG. Still, this is an improvement in that at least pure build
tests are functional.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-5-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: upgrade Ubuntu 18.04 VM to 20.04

18.04 has fallen out of our support window, so move ubuntu.aarch64
forward to ubuntu 20.04, which is now our oldest supported Ubuntu
release.

Notes:

This checksum changes periodically; use a fixed point image with a known
checksum so that the image isn't re-downloaded on every single
invocation. (The checksum for the 18.04 image was already incorrect at
the time of writing.)

Just like the centos.aarch64 test, this test currently seems very
flaky when run as a TCG test.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-6-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: remove ubuntu.i386 VM test

Ubuntu 18.04 is out of our support window, and Ubuntu 20.04 does not
support i386 anymore. The Debian project does, but it does not provide
any cloud images for it, so a new expect-style script would have to be
written.

Since we have i386 cross-compiler tests hosted on GitLab CI, we don't
need to support this VM test anymore.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-7-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: remove duplicate 'centos' VM test

This is listed twice by accident; we require genisoimage to run the
test, so remove the unconditional entry.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-8-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: add 1GB extra memory per core

If you try to run a 16 or 32 threaded test, you're going to run out of
memory very quickly with qom-test and a few others. Bump the memory
limit to try to scale with larger-core machines.

Granted, this means that a 16 core processor is going to ask for 16GB,
but you *probably* meet that requirement if you have such a machine.

512MB per core didn't seem to be enough to avoid ENOMEM and SIGABRTs in
the test cases in practice on a six core machine; so I bumped it up to
1GB which seemed to help.

Add this magic in early to the configuration process so that the
config file, if provided, can still override it.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-9-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/vm: Remove docker cross-compile test from CentOS VM

The fedora container has since been split apart, so there's no suitable
nearby target that would support "test-mingw" as it requires both x32
and x64 support -- so neither fedora-cross-win32 nor fedora-cross-win64
would be truly suitable.

Just remove this test as superfluous with our current CI infrastructure.

Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220708153503.18864-10-jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* qtest/machine-none: Add LoongArch support

Update the cpu_maps[] to support the LoongArch target.

Signed-off-by: Song Gao <gaosong@loongson.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220713020258.601424-1-gaosong@loongson.cn>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/unit: Replace g_memdup() by g_memdup2()

Per https://discourse.gnome.org/t/port-your-module-from-g-memdup-to-g-memdup2-now/5538

  The old API took the size of the memory to duplicate as a guint,
  whereas most memory functions take memory sizes as a gsize. This
  made it easy to accidentally pass a gsize to g_memdup(). For large
  values, that would lead to a silent truncation of the size from 64
  to 32 bits, and result in a heap area being returned which is
  significantly smaller than what the caller expects. This can likely
  be exploited in various modules to cause a heap buffer overflow.

Replace g_memdup() by the safer g_memdup2() wrapper.
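
The truncation described in the quote can be reproduced without GLib.
The sketch below only models a size parameter narrower than size_t; the
demo_* functions are hypothetical and do not call g_memdup() or
g_memdup2().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Models an API whose size parameter is an unsigned int rather than a size_t. */
static unsigned int demo_size_seen_by_narrow_api(size_t requested)
{
    return (unsigned int)requested;  /* high bits are silently dropped */
}

/* True if passing this size through an unsigned int would lose bits. */
static int demo_would_truncate(size_t requested)
{
    return requested > UINT_MAX;
}
```

On a host where size_t is wider than unsigned int, a request for
UINT_MAX + 17 bytes is seen by the narrow API as a request for 16
bytes, so the caller receives a far smaller buffer than it expects.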

Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210903174510.751630-24-philmd@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* Replace 'whitelist' with 'allow'

Let's use more inclusive language here and avoid terms
that are frowned upon nowadays.

Message-Id: <20220711095300.60462-1-thuth@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* util: Fix broken build on Haiku

A recent commit moved some Haiku-specific code parts from oslib-posix.c
to cutils.c, but failed to move the corresponding header #include
statement, too, so "make vm-build-haiku.x86_64" is currently broken.
Fix it by moving the header #include, too.

Fixes: 06680b15b4 ("include: move qemu_*_exec_dir() to cutils")
Message-Id: <20220718172026.139004-1-thuth@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* python/qemu/qmp/legacy: Replace 'returns-whitelist' with the correct type

'returns-whitelist' has been renamed to 'command-returns-exceptions' in
commit b86df3747848 ("qapi: Rename pragma *-whitelist to *-exceptions").

Message-Id: <20220711095721.61280-1-thuth@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* pl050: move PL050State from pl050.c to new pl050.h header file

This allows the QOM types in pl050.c to be used elsewhere by simply including
pl050.h.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-2-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: rename pl050_keyboard_init() to pl050_kbd_init()

This is for consistency with all of the other devices that use the PS2 keyboard
device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-3-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: change PL050State dev pointer from void to PS2State

This allows the compiler to enforce that the PS2 device pointer is always of
type PS2State. Update the name of the pointer from dev to ps2dev to emphasise
this type change.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-4-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: introduce new PL050_KBD_DEVICE QOM type

This will soon be used to hold the underlying PS2_KBD_DEVICE object.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-5-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: introduce new PL050_MOUSE_DEVICE QOM type

This will soon be used to hold the underlying PS2_MOUSE_DEVICE object.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-6-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: move logic from pl050_realize() to pl050_init()

The logic for initialising the register memory region and the sysbus output IRQ
does not depend upon any device properties and so can be moved from
pl050_realize() to pl050_init().

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-7-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: introduce PL050DeviceClass for the PL050 device

This will soon be used to store the reference to the PL050 parent device
for PL050_KBD_DEVICE and PL050_MOUSE_DEVICE.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-8-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: introduce pl050_kbd_class_init() and pl050_kbd_realize()

Introduce a new pl050_kbd_class_init() function containing a call to
device_class_set_parent_realize() which calls a new pl050_kbd_realize()
function to initialise the PS2 keyboard device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-9-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: introduce pl050_mouse_class_init() and pl050_mouse_realize()

Introduce a new pl050_mouse_class_init() function containing a call to
device_class_set_parent_realize() which calls a new pl050_mouse_realize()
function to initialise the PS2 mouse device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-10-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: don't use legacy ps2_kbd_init() function

Instantiate the PS2 keyboard device within PL050KbdState using
object_initialize_child() in pl050_kbd_init() and realize it in
pl050_kbd_realize() accordingly.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-11-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pl050: don't use legacy ps2_mouse_init() function

Instantiate the PS2 mouse device within PL050MouseState using
object_initialize_child() in pl050_mouse_init() and realize it in
pl050_mouse_realize() accordingly.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-12-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: don't use vmstate_register() in lasips2_realize()

Since lasips2 is a qdev device then vmstate_ps2_mouse can be registered using
the DeviceClass vmsd field instead.

Note that due to the use of the base parameter in the original vmstate_register()
function call, this is actually a migration break for the HPPA B160L machine.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-13-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: remove the qdev base property and the lasips2_properties array

The base property was only needed for use by vmstate_register() in order to
preserve migration compatibility. Now that the lasips2 migration state is
registered through the DeviceClass vmsd field, the base property and also
the lasips2_properties array can be removed completely as they are no longer
required.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-14-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: remove legacy lasips2_initfn() function

There is only one user of the legacy lasips2_initfn() function which is in
machine_hppa_init(), so inline its functionality into machine_hppa_init() and
then remove it.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-15-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: change LASIPS2State dev pointer from void to PS2State

This allows the compiler to enforce that the PS2 device pointer is always of
type PS2State. Update the name of the pointer from dev to ps2dev to emphasise
this type change.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-16-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: QOMify LASIPS2Port

This becomes an abstract QOM type which will be a parent type for separate
keyboard and mouse port types.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-17-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: introduce new LASIPS2_KBD_PORT QOM type

This will soon be used to hold the underlying PS2_KBD_DEVICE object.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-18-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: introduce new LASIPS2_MOUSE_PORT QOM type

This will soon be used to hold the underlying PS2_MOUSE_DEVICE object.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-19-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: move keyboard port initialisation to new lasips2_kbd_port_init() function

Move the initialisation of the keyboard port from lasips2_init() to
a new lasips2_kbd_port_init() function which will be invoked using
object_initialize_child() during the LASIPS2 device init.

Update LASIPS2State so that it now holds the new LASIPS2KbdPort child object and
ensure that it is realised in lasips2_realize().

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-20-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: move mouse port initialisation to new lasips2_mouse_port_init() function

Move the initialisation of the mouse port from lasips2_init() to
a new lasips2_mouse_port_init() function which will be invoked using
object_initialize_child() during the LASIPS2 device init.

Update LASIPS2State so that it now holds the new LASIPS2MousePort child object and
ensure that it is realised in lasips2_realize().

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-21-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: introduce lasips2_kbd_port_class_init() and lasips2_kbd_port_realize()

Introduce a new lasips2_kbd_port_class_init() function which uses a new
lasips2_kbd_port_realize() function to initialise the PS2 keyboard device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-22-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: introduce lasips2_mouse_port_class_init() and lasips2_mouse_port_realize()

Introduce a new lasips2_mouse_port_class_init() function which uses a new
lasips2_mouse_port_realize() function to initialise the PS2 mouse device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-23-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: rename LASIPS2Port irq field to birq

The existing boolean irq field in LASIPS2Port will soon be replaced by a proper
qemu_irq, so rename the field to birq to allow the upcoming qemu_irq to use the
irq name.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-24-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: introduce port IRQ and new lasips2_port_init() function

Introduce a new lasips2_port_init() QOM init function for the LASIPS2_PORT type
and use it to initialise a new gpio for use as a port IRQ. Add a new qemu_irq
representing the gpio as a new irq field within LASIPS2Port.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-25-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: introduce LASIPS2PortDeviceClass for the LASIPS2_PORT device

This will soon be used to store the reference to the LASIPS2_PORT parent device
for LASIPS2_KBD_PORT and LASIPS2_MOUSE_PORT.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-26-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: add named input gpio to port for downstream PS2 device IRQ

The named input gpio is to be connected to the IRQ output of the downstream
PS2 device and used to drive the port IRQ. Initialise the named input gpio
in lasips2_port_init() and add new lasips2_port_class_init() and
lasips2_port_realize() functions to connect the PS2 device output gpio to
the new named input gpio.

Note that the reference to lasips2_port_realize() is stored in
LASIPS2PortDeviceClass but not yet used.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-27-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: add named input gpio to handle incoming port IRQs

The LASIPS2 device named input gpio is soon to be connected to the port output
IRQs. Add a new int_status field to LASIPS2State which is a bitmap representing
the port input IRQ status which will be enabled in the next patch.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Message-Id: <20220712215251.7944-28-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: switch to using port-based IRQs

Now we can implement port-based IRQs by wiring the PS2 device IRQs to the
LASIPS2Port named input gpios rather than directly to the LASIPS2 device, and
generate the LASIPS2 output IRQ from the int_status bitmap representing the
individual port IRQs instead of the birq boolean.

This enables us to remove the separate PS2 keyboard and PS2 mouse named input
gpios from the LASIPS2 device and simplify the register implementation to
drive the port IRQ using qemu_set_irq() rather than accessing the LASIPS2
device IRQs directly. As a consequence the IRQ level logic in lasips2_set_irq()
can also be simplified accordingly.

For now this patch ignores adding the int_status bitmap and simply drops the
birq boolean from the vmstate_lasips2 VMStateDescription. This is because the
migration stream is already missing some required LASIPS2 fields, and as this
series already introduces a migration break for the lasips2 device it is
easiest to fix this in a follow-up patch.
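
The idea of deriving the single output IRQ from a bitmap of per-port IRQs
can be sketched in standalone C. This is an illustration under assumptions,
not the lasips2 code: the MockLasips2 struct, the port numbering and
mock_set_port_irq() are all hypothetical, and the rule "output is asserted
while any port bit is set" is assumed for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical port numbering, for illustration only. */
enum { MOCK_PORT_KBD = 0, MOCK_PORT_MOUSE = 1 };

typedef struct MockLasips2 {
    uint8_t int_status;   /* one bit per port IRQ */
    bool out_irq;         /* level of the device's single output IRQ */
} MockLasips2;

/* Called when a port raises or lowers its IRQ line. */
static void mock_set_port_irq(MockLasips2 *s, int port, bool level)
{
    if (level) {
        s->int_status |= (uint8_t)(1u << port);
    } else {
        s->int_status &= (uint8_t)~(1u << port);
    }
    /* Assumed rule: the output is asserted while any port IRQ is pending. */
    s->out_irq = s->int_status != 0;
}
```

With this shape, each port only ever touches its own bit, and the output
level is recomputed from the whole bitmap on every change.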

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Message-Id: <20220712215251.7944-29-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: rename LASIPS2Port parent pointer to lasips2

This makes it clearer that the pointer is a reference to the LASIPS2 container
device rather than an implied part of the QOM hierarchy.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-30-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: standardise on lp name for LASIPS2Port variables

This is shorter to type and keeps the naming convention consistent within the
LASIPS2 device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-31-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: switch register memory region to DEVICE_BIG_ENDIAN

The LASI device (and so also the LASIPS2 device) is only used for the HPPA
B160L machine, which is a big endian architecture.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-32-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: don't use legacy ps2_kbd_init() function

Instantiate the PS2 keyboard device within LASIPS2KbdPort using
object_initialize_child() in lasips2_kbd_port_init() and realize it in
lasips2_kbd_port_realize() accordingly.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-33-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: don't use legacy ps2_mouse_init() function

Instantiate the PS2 mouse device within LASIPS2MousePort using
object_initialize_child() in lasips2_mouse_port_init() and realize it in
lasips2_mouse_port_realize() accordingly.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-34-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* lasips2: update VMStateDescription for LASIPS2 device

Since this series has already introduced a migration break for the HPPA B160L
machine, we can use this opportunity to improve the VMStateDescription for
the LASIPS2 device.

Add the new int_status field to the VMStateDescription and remodel the ports
as separate VMSTATE_STRUCT instances representing each LASIPS2Port. Once this
is done, the migration stream can be updated to include buf and loopback_rbne
for each port (which is necessary since the values are accessed across separate
IO accesses), and drop the port id as this is hardcoded for each port type.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Message-Id: <20220712215251.7944-35-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pckbd: introduce new vmstate_kbd_mmio VMStateDescription for the I8042_MMIO device

This enables us to register the VMStateDescription using the DeviceClass vmsd
property rather than having to call vmstate_register() from i8042_mmio_realize().

Note that this is a migration break for the MIPS magnum machine which is the only
user of the I8042_MMIO device.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-36-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pckbd: don't use legacy ps2_kbd_init() function

Instantiate the PS2 keyboard device within KBDState using
object_initialize_child() in i8042_initfn() and i8042_mmio_init() and realize
it in i8042_realizefn() and i8042_mmio_realize() accordingly.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-37-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* ps2: remove unused legacy ps2_kbd_init() function

Now that the legacy ps2_kbd_init() function is no longer used, it can be completely
removed along with its associated trace-event.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-38-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pckbd: don't use legacy ps2_mouse_init() function

Instantiate the PS2 mouse device within KBDState using
object_initialize_child() in i8042_initfn() and i8042_mmio_init() and realize
it in i8042_realizefn() and i8042_mmio_realize() accordingly.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-39-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* ps2: remove unused legacy ps2_mouse_init() function

Now that the legacy ps2_mouse_init() function is no longer used, it can be completely
removed along with its associated trace-event.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-40-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* pckbd: remove legacy i8042_mm_init() function

This legacy function is only used during the initialisation of the MIPS magnum
machine, so inline its functionality directly into mips_jazz_init() and then
remove it.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Tested-by: Helge Deller <deller@gmx.de>
Acked-by: Helge Deller <deller@gmx.de>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20220712215251.7944-41-mark.cave-ayland@ilande.co.uk>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

* util: Fix broken build on Haiku

A recent commit moved some Haiku-specific code parts from oslib-posix.c
to cutils.c, but failed to move the corresponding header #include
statement, too, so "make vm-build-haiku.x86_64" is currently broken.
Fix it by moving the header #include, too.

Fixes: 06680b15b4 ("include: move qemu_*_exec_dir() to cutils")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220718172026.139004-1-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* target/s390x: fix handling of zeroes in vfmin/vfmax

vfmin_res() / vfmax_res() are trying to check whether a and b are both
zeroes, but in reality they check that they are the same kind of zero.
This causes incorrect results when comparing positive and negative
zeroes.
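
The difference between the two checks can be illustrated with host
doubles. This is a standalone sketch, not the softfloat-based
vfmin_res() / vfmax_res() code, and both helper names are made up for
the example.

```c
#include <math.h>
#include <stdbool.h>

/* True only when a and b are the same kind of zero (same sign). */
static bool same_kind_of_zero(double a, double b)
{
    return a == 0.0 && b == 0.0 &&
           (signbit(a) != 0) == (signbit(b) != 0);
}

/* True when both are zeroes, regardless of sign. */
static bool both_zero(double a, double b)
{
    return a == 0.0 && b == 0.0;
}
```

For the pair (+0.0, -0.0), both_zero() holds while same_kind_of_zero()
does not, which is exactly the mixed-sign case the commit is about.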

Fixes: da4807527f3b ("s390x/tcg: Implement VECTOR FP (MAXIMUM|MINIMUM)")
Co-developed-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20220713182612.3780050-2-iii@linux.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* target/s390x: fix NaN propagation rules

s390x has the same NaN propagation rules as ARM, and not as x86.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20220713182612.3780050-3-iii@linux.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* tests/tcg/s390x: test signed vfmin/vfmax

Add a test to prevent regressions. Try all floating point value sizes
and all combinations of floating point value classes. Verify the results
against PoP tables, which are represented as close to the original as
possible - this produces a lot of checkpatch complaints, but it seems
to be justified in this case.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220713182612.3780050-4-iii@linux.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>

* dbus-display: fix test race when initializing p2p connection

The D-Bus connection starts processing messages before QEMU has the time
to set the object manager server. This is causing dbus-display-test to
fail randomly with:

ERROR:../tests/qtest/dbus-display-test.c:68:test_dbus_display_vm:
assertion failed
(qemu_dbus_display1_vm_get_name(QEMU_DBUS_DISPLAY1_VM(vm)) ==
"dbus-test"): (NULL == "dbus-test") ERROR

Use the delayed message processing flag and method to avoid that
situation.

(the bus connection doesn't need a fix, as the initialization is done
synchronously)

Reported-by: Robinson, Cole <crobinso@redhat.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Tested-by: Cole Robinson <crobinso@redhat.com>
Message-Id: <20220609152647.870373-1-marcandre.lureau@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

* microvm: turn off io reservations for pcie root ports

The pcie host bridge has no io window on microvm,
so io reservations will not work.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Message-Id: <20220701091516.43489-1-kraxel@redhat.com>

* usb/hcd-xhci: check slotid in xhci_wakeup_endpoint()

This prevents an OOB read (followed by an assertion failure in
xhci_kick_ep) when slotid > xhci->numslots.
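
A range check of this kind might look as follows. This is a minimal
sketch, assuming 1-based slot IDs; the MockXhci struct and
mock_slotid_valid() are hypothetical and do not reproduce the actual
hcd-xhci code.

```c
#include <stdbool.h>

/* Hypothetical controller state; only the slot count matters here. */
typedef struct MockXhci {
    unsigned numslots;
} MockXhci;

/* Slot IDs are assumed to run from 1 to numslots inclusive. */
static bool mock_slotid_valid(const MockXhci *xhci, unsigned slotid)
{
    return slotid >= 1 && slotid <= xhci->numslots;
}
```

A caller would bail out early when the check fails, before using the
slot ID to index any per-slot array.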

Reported-by: Soul Chen <soulchen8650@gmail.com>
Signed-off-by: Mauro Matteo Cascella <mcascell@redhat.com>
Message-Id: <20220705174734.2348829-1-mcascell@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

* usb: document guest-reset and guest-reset-all

Suggested-by: Michal Prívozník <mprivozn@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Message-Id: <20220711094437.3995927-2-kraxel@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

* usb: document pcap (aka usb traffic capture)

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Message-Id: <20220711094437.3995927-3-kraxel@redhat.com>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

* gtk: Add show_tabs=on|off command line option.

The patch adds a "show_tabs" command line option for the GTK ui, similar
to "grab_on_hover". With this option, tabbed view mode no longer has to
be enabled by hand at each start of the VM.

Signed-off-by: Felix "xq" Queißner <xq@random-projects.net>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220712133753.18937-1-xq@random-projects.net>
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

* tests/docker/dockerfiles: Add debian-loongarch-cross.docker

Use the pre-packaged toolchain provided by Loongson via github.

Tested-by: Song Gao <gaosong@loongson.cn>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220704070824.965429-1-richard.henderson@linaro.org>

* target/loongarch: Fix loongarch_cpu_class_by_name

The cpu_model argument may already have the '-loongarch-cpu' suffix,
e.g. when using the default for the LS7A1000 machine.  If that fails,
try again with the suffix.  Validate that the object created by the
function is derived from the proper base class.
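
The lookup order (name as given, then name plus suffix) can be sketched
in standalone C. The registry table and the mock_* helpers are
hypothetical stand-ins for the QOM class lookup, and the base-class
validation step is omitted from this sketch.

```c
#include <stdio.h>
#include <string.h>

#define MOCK_CPU_SUFFIX "-loongarch-cpu"

/* Hypothetical registry of type names, for illustration only. */
static const char *const mock_types[] = {
    "la464-loongarch-cpu",
};

static const char *mock_lookup(const char *name)
{
    for (size_t i = 0; i < sizeof(mock_types) / sizeof(mock_types[0]); i++) {
        if (strcmp(mock_types[i], name) == 0) {
            return mock_types[i];
        }
    }
    return NULL;
}

/* Try the name as given, then retry with the suffix appended. */
static const char *mock_cpu_class_by_name(const char *cpu_model)
{
    char buf[128];
    const char *found = mock_lookup(cpu_model);

    if (found) {
        return found;
    }
    int n = snprintf(buf, sizeof(buf), "%s%s", cpu_model, MOCK_CPU_SUFFIX);
    if (n < 0 || (size_t)n >= sizeof(buf)) {
        return NULL;   /* name too long for this sketch */
    }
    return mock_lookup(buf);
}
```

Both "la464" and "la464-loongarch-cpu" resolve to the same entry here,
while an unknown model name yields NULL.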

Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220715060740.1500628-2-yangxiaojuan@loongson.cn>
[rth: Try without and then with the suffix, to avoid testsuite breakage.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/intc/loongarch_pch_pic: Fix bugs for update_irq function

Fix the following errors:
1. We should not use the 'unsigned long' type as the argument when we
use find_first_bit(), so use ctz64() instead of find_first_bit() to fix
this bug.
2. It is not standard to use '1ULL << irq' to generate an irq mask,
so replace it with 'MAKE_64BIT_MASK(irq, 1)'.

Fix coverity CID: 1489761 1489764 1489765
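
As a rough illustration of walking a 64-bit pending bitmap with a
count-trailing-zeros helper and a 64-bit mask: the MOCK_/mock_ names
below are local stand-ins written for this sketch (modelled on, but not
copied from, the QEMU helpers named above), and __builtin_ctzll assumes
a GCC- or Clang-compatible compiler.

```c
#include <stdint.h>

/* Local stand-in: a mask of 'length' bits starting at bit 'shift'. */
#define MOCK_MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))

/* Local stand-in: index of the lowest set bit, or 64 if val is zero. */
static int mock_ctz64(uint64_t val)
{
    return val ? __builtin_ctzll(val) : 64;
}

/*
 * Visit each set bit from lowest to highest, clearing it as we go.
 * Stores the lowest set bit in *first_irq (64 if none) and returns
 * the number of bits that were set.
 */
static int mock_drain_pending(uint64_t pending, int *first_irq)
{
    int count = 0;

    *first_irq = mock_ctz64(pending);
    while (pending) {
        int irq = mock_ctz64(pending);

        pending &= ~MOCK_MAKE_64BIT_MASK(irq, 1);
        count++;
    }
    return count;
}
```

Because the value stays a uint64_t throughout, the result does not
depend on how wide 'unsigned long' is on the host.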

Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220715060740.1500628-3-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* target/loongarch/cpu: Fix coverity errors about excp_names

Fix out-of-bounds errors when accessing the excp_names[] array. The
valid index range of excp_names is 0 to ARRAY_SIZE(excp_names)-1.
However, the current code does not check the upper bound.

Fix coverity CID: 1489758
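
A bounds-checked name lookup of this kind could be sketched as below.
The table contents, the "Unknown" fallback and the mock_/MOCK_ names are
assumptions made for the example, not the contents of the real
excp_names[] array.

```c
#include <stddef.h>

#define MOCK_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical names, for illustration only. */
static const char *const mock_excp_names[] = {
    "Interrupt", "Page invalid", "Address error",
};

/* Return a name for excp, never indexing outside the table. */
static const char *mock_excp_name(int excp)
{
    if (excp < 0 || (size_t)excp >= MOCK_ARRAY_SIZE(mock_excp_names)) {
        return "Unknown";
    }
    return mock_excp_names[excp];
}
```

The comparison uses '>=' so that an index equal to the array size is
rejected as well.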

Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220715060740.1500628-4-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* target/loongarch/tlb_helper: Fix coverity integer overflow error

Replace '1 << shift' with 'MAKE_64BIT_MASK(shift, 1)' to fix
unintentional integer overflow errors in tlb_helper file.

Fix coverity CID: 1489759 1489762
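
The overflow arises because '1 << shift' is evaluated in type int
(typically 32 bits) before any widening. A minimal sketch of the 64-bit
alternative follows; the MOCK_ macro is a local stand-in written for
this example and mock_page_size() is a hypothetical helper.

```c
#include <stdint.h>

/* Local stand-in: a mask of 'length' bits starting at bit 'shift'. */
#define MOCK_MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))

/* Size covered by a given shift (0..63), computed in 64 bits throughout. */
static uint64_t mock_page_size(unsigned shift)
{
    return MOCK_MAKE_64BIT_MASK(shift, 1);
}
```

For example, a shift of 36 yields 0x1000000000, a value that does not
fit in a 32-bit int.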

Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220715060740.1500628-5-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* target/loongarch/op_helper: Fix coverity cond_at_most error

The valid index range of the cpucfg array is 0 to ARRAY_SIZE(cpucfg)-1,
so accessing cpucfg[] with an index beyond the upper bound must be
forbidden.

Fix coverity CID: 1489760

Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220715060740.1500628-6-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* target/loongarch/cpu: Fix cpucfg default value

We should configure cpucfg[20] to set values for the scache's ways, sets,
and size arguments when the loongarch cpu is initialised. However, the
old code writes the 'sets argument' twice, so we change one of them to
the 'size argument'.

Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220715064829.1521482-1-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* fpu/softfloat: Add LoongArch specializations for pickNaN*

The muladd (inf,zero,nan) case sets InvalidOp and returns the
input value 'c'; muladd prefers sNaN over qNaN, in c,a,b order.
Binary operations prefer sNaN over qNaN, in a,b order.
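
The rule for binary operations can be written out as a small selection
function. This is a sketch under assumptions: the MockFloatClass enum
and mock_pick_nan() are hypothetical, the function assumes at least one
operand is a NaN, and the return convention (0 for a, 1 for b) is chosen
for the example.

```c
typedef enum MockFloatClass {
    MOCK_NORMAL,
    MOCK_QNAN,
    MOCK_SNAN,
} MockFloatClass;

/* Returns 0 to select operand a, 1 to select operand b. */
static int mock_pick_nan(MockFloatClass a, MockFloatClass b)
{
    /* Signalling NaNs win over quiet NaNs, checked in a,b order. */
    if (a == MOCK_SNAN) {
        return 0;
    }
    if (b == MOCK_SNAN) {
        return 1;
    }
    /* No sNaN present: take the first quiet NaN in a,b order. */
    if (a == MOCK_QNAN) {
        return 0;
    }
    return 1;
}
```

So a quiet NaN in 'a' loses to a signalling NaN in 'b', but wins over a
quiet NaN in 'b'.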

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-3-gaosong@loongson.cn>
[rth: Add specialization for pickNaN]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* target/loongarch: Fix float_convd/float_convs test failing

We should return zero when the exception is invalid and the operand is NaN.

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-4-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* tests/tcg/loongarch64: Add float reference files

Generated on Loongson-3A5000 (CPU revision 0x0014c011).

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20220104132022.2146857-1-f4bug@amsat.org>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-2-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* tests/tcg/loongarch64: Add clo related instructions test

This includes:
- CL{O/Z}.{W/D}
- CT{O/Z}.{W/D}

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-5-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* tests/tcg/loongarch64: Add div and mod related instructions test

This includes:
- DIV.{W[U]/D[U]}
- MOD.{W[U]/D[U]}

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-6-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* tests/tcg/loongarch64: Add fclass test

This includes:
- FCLASS.{S/D}

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-7-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* tests/tcg/loongarch64: Add fp comparison instructions test

Choose some instructions to test:
- FCMP.cond.S
- cond: ceq clt cle cne seq slt sle sne

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-8-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* tests/tcg/loongarch64: Add pcadd related instructions test

This includes:
- PCADDI
- PCADDU12I
- PCADDU18I
- PCALAU12I

Signed-off-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20220716085426.3098060-9-gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/loongarch: Add fw_cfg table support

Add fw_cfg table for loongarch virt machine, including memmap table.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220712083206.4187715-2-yangxiaojuan@loongson.cn>
[rth: Replace fprintf with assert; drop unused return value;
      initialize reserved slot to zero.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/loongarch: Add uefi bios loading support

Add uefi bios loading support. Currently the uefi bios is the only
firmware ported to the loongarch virt machine.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220712083206.4187715-3-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/loongarch: Add linux kernel booting support

There are two ways to start the system from a kernel file. If the bios
option is given, the system boots from the loaded bios file; otherwise
it boots from hardcoded auxcode and jumps to the kernel elf entry.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220712083206.4187715-4-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/loongarch: Add smbios support

Add smbios support for the loongarch virt machine, and put the tables
into the fw_cfg table so that the bios can parse them quickly. The smbios
spec is available at https://www.dmtf.org/dsp/DSP0134; the version is 3.6.0.

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220712083206.4187715-5-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/loongarch: Add acpi ged support

The loongarch virt machine uses the generic hardware-reduced acpi method
rather than the LS7A acpi device. For now only the power management
function of the acpi ged device is used; memory hotplug will be added
later. Also add acpi tables such as RSDP/RSDT/FADT etc.

The acpi table has been submitted to the acpi spec and will be released
soon.

Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220712083206.4187715-6-yangxiaojuan@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* hw/loongarch: Add fdt support

Add the LoongArch flattened device tree, adding the cpu device node,
firmware cfg node and pcie node to it, and create the fdt rom memory
region. The fdt info is not yet complete, since only the uefi bios uses
the fdt; the linux kernel does not. The LoongArch Linux kernel uses the
acpi table, which is complete in the qemu virt machine.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Message-Id: <20220712083206.4187715-7-yangxiaojuan@loongson.cn>
[rth: Set TARGET_NEED_FDT, add fdt to meson.build]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

* Hexagon (target/hexagon) fix store w/mem_noshuf & predicated load

Call the CHECK_NOSHUF macro multiple times: once in the
fGEN_TCG_PRED_LOAD() and again in fLOAD().

Before this commit, a packet with a store and a predicated
load with mem_noshuf that gets encoded like this:

    { P0 = cmp.eq(R17,#0x0)
      memw(R18+#0x0) = R2
      if (!P0.new) R3 = memw(R17+#0x4) }

... would end up generating a branch over both the load
and the store like so:

    ...
    brcond_i32 loc17,$0x0,eq,$L1
    mov_i32 loc18,store_addr_1
    qemu_st_i32 store_val32_1,store_addr_1,leul,0
    qemu_ld_i32 loc16,loc7,leul,0
    set_label $L1
    ...

Test cases added to tests/tcg/hexagon/mem_noshuf.c

Co-authored-by: Taylor Simpson <tsimpson@quicinc.com>
Signed-off-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220707210546.15985-2-tsimpson@quicinc.com>

* Hexagon (target/hexagon) fix bug in mem_noshuf load exception

The semantics of a mem_noshuf packet are that the store effectively
happens before the load.  However, in cases where the load raises an
exception, we cannot simply execute the store first.

This change adds a probe to check that the load will not raise an
exception before executing the store.

If the load is predicated, this requires special handling.  We check
the condition before performing the probe.  Since we need the EA to
perform the check, we move the GET_EA portion inside CHECK_NOSHUF_PRED.
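
The ordering described above can be illustrated with a toy model. This
is only a sketch under stated assumptions, not the TCG code generated by
target/hexagon: the memory array, TOY_MEM_WORDS and every toy_* name are
hypothetical, and a "fault" is simply an out-of-range word index.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MEM_WORDS 16   /* hypothetical toy memory size, in words */

/* Toy probe: would an access to this word index fault? */
static bool toy_access_would_fault(size_t idx)
{
    return idx >= TOY_MEM_WORDS;
}

/*
 * Toy model of a mem_noshuf packet { store; [if (pred)] load }.
 * Returns false (an "exception") without modifying memory when the
 * load would fault, so the store is never committed in that case.
 */
static bool toy_exec_noshuf_packet(uint32_t *mem,
                                   size_t st_idx, uint32_t st_val,
                                   bool pred, size_t ld_idx, uint32_t *dst)
{
    /* Check the predicate first: a cancelled load cannot fault. */
    if (pred && toy_access_would_fault(ld_idx)) {
        return false;
    }
    if (toy_access_would_fault(st_idx)) {
        return false;
    }
    mem[st_idx] = st_val;      /* the store effectively happens first */
    if (pred) {
        *dst = mem[ld_idx];    /* the load observes the store */
    }
    return true;
}
```

With the probe in place, a faulting load leaves memory untouched, while
a load whose predicate is false never blocks the store.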

Test case added in tests/tcg/hexagon/mem_noshuf_exception.c

Suggested-by: Alessandro Di Federico <ale@rev.ng>
Suggested-by: Anton Johansson <anjo@rev.ng>
Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220707210546.15985-3-tsimpson@quicinc.com>

* vhost: move descriptor translation to vhost_svq_vring_write_descs

It's done for both in and out descriptors so it's better placed here.

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* virtio-net: Expose MAC_TABLE_ENTRIES

vhost-vdpa control virtqueue needs to know the maximum entries supported
by the virtio-net device, so we know if it is possible to apply the
filter.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* virtio-net: Expose ctrl virtqueue logic

This allows external vhost-net devices to modify the state of the
VirtIO device model once the vhost-vdpa device has acknowledged the
control commands.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: Avoid compiler to squash reads to used idx

In the next patch we will allow busypolling of this value. The compiler
can see a code path where shadow_used_idx, last_used_idx, and the vring
used idx are not modified within the same thread that is busypolling, so
it may squash the reads.

This was not an issue before since we always cleared the device event
notifier before checking it, and that could act as a memory barrier.
However, the busypoll needs something similar to the kernel's READ_ONCE.

Let's add it here, separated from the polling.
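
A minimal sketch of a READ_ONCE-style accessor follows. It is an
illustration only: toy_read_once_u16 and toy_used_idx_advanced are
hypothetical names, not the helpers this patch adds. The volatile cast
constrains the compiler; it is not a memory barrier.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Reading through a volatile-qualified pointer forces the compiler to
 * emit a fresh load on every call instead of reusing a cached value.
 */
static inline uint16_t toy_read_once_u16(const uint16_t *p)
{
    return *(const volatile uint16_t *)p;
}

/* Hypothetical busy-poll check: has the used index moved on? */
static inline bool toy_used_idx_advanced(const uint16_t *used_idx,
                                         uint16_t last_seen)
{
    return toy_read_once_u16(used_idx) != last_seen;
}
```

A polling loop built on toy_used_idx_advanced() reloads the index on
each iteration even if nothing in the loop body writes to it.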

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Reorder vhost_svq_kick

Future code needs to call it from vhost_svq_add.

No functional change intended.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Move vhost_svq_kick call to vhost_svq_add

The series needs to expose vhost_svq_add with full functionality,
including the kick.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Check for queue full at vhost_svq_add

The series needs to expose vhost_svq_add with full functionality,
including checking for full queue.
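
A sketch of what a queue-full check at add time can look like, assuming
a toy queue that only counts descriptors. ToySVQ and the toy_* functions
are hypothetical, and the -ENOSPC return value is an assumption of this
sketch rather than something stated in this message.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical toy queue: only tracks how many descriptors are in use. */
typedef struct {
    uint16_t num;        /* total descriptors in the ring */
    uint16_t in_flight;  /* descriptors currently owned by the device */
} ToySVQ;

static uint16_t toy_svq_available_slots(const ToySVQ *q)
{
    return (uint16_t)(q->num - q->in_flight);
}

/* Refuse the request up front when it cannot fit in the ring. */
static int toy_svq_add(ToySVQ *q, uint16_t ndescs)
{
    if (ndescs > toy_svq_available_slots(q)) {
        return -ENOSPC;
    }
    q->in_flight = (uint16_t)(q->in_flight + ndescs);
    return 0;
}
```

Doing the check inside the add function means every caller gets it,
including external users once the function is exposed.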

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Decouple vhost_svq_add from VirtQueueElement

VirtQueueElement comes from the guest, but we're heading towards SVQ
being able to modify the element presented to the device without the
guest's knowledge.

To do so, make SVQ accept sg buffers directly, instead of using
VirtQueueElement.

Add vhost_svq_add_element to maintain element convenience.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Add SVQDescState

This will allow SVQ to add context to the different queue elements.

This patch only stores the actual element; no functional change intended.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Track number of descs in SVQDescState

A guest's buffer that is contiguous in GPA may need multiple descriptors
in qemu's VA, so SVQ should track its length separately.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: add vhost_svq_push_elem

This function allows external SVQ users to return guest's available
buffers.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Expose vhost_svq_add

This allows external parts of SVQ to forward custom buffers to the
device.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: add vhost_svq_poll

It allows the Shadow Control VirtQueue to wait for the device to use the
available buffers.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost: Add svq avail_handler callback

This allows external handlers to be aware of new buffers that the guest
places in the virtqueue.

When this callback is defined the ownership of the guest's virtqueue
element is transferred to the callback. This means that if the user
wants to forward the descriptor it needs to manually inject it. The
callback is also free to process the command by itself and use the
element with svq_push.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: Export vhost_vdpa_dma_map and unmap calls

Shadow CVQ will copy buffers on qemu VA, so we avoid TOCTOU attacks from
the guest that could set a different state in qemu device model and vdpa
device.

To do so, it needs to be able to map these new buffers to the device.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vhost-net-vdpa: add stubs for when no virtio-net device is present

net/vhost-vdpa.c will need functions that are declared in
vhost-shadow-virtqueue.c, which in turn needs functions from
virtio-net.c.

Copy the vhost-vdpa-stub.c code so that only the constructor
net_init_vhost_vdpa needs to be defined.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: manual forward CVQ buffers

Do a simple forwarding of CVQ buffers, the same work SVQ could do but
through callbacks. No functional change intended.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: Buffer CVQ support on shadow virtqueue

Introduce the control virtqueue support for vDPA shadow virtqueue. This
is needed for advanced networking features like rx filtering.

Virtio-net control VQ copies the descriptors to qemu's VA, so we avoid
TOCTOU with the guest's or device's memory every time there is a device
model change.  Otherwise, the guest could change the memory content in
the time between qemu and the device read it.

To demonstrate command handling, VIRTIO_NET_F_CTRL_MACADDR is
implemented.  If the virtio-net driver changes MAC the virtio-net device
model will be updated with the new one, and a rx filtering change event
will be raised.

More cvq commands could be added here straightforwardly but they have
not been tested.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: Extract get features part from vhost_vdpa_get_max_queue_pairs

CVQ SVQ needs to know the device features, so that SVQ knows whether it
can handle all commands. Extract that part from
vhost_vdpa_get_max_queue_pairs so we can reuse it.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: Add device migration blocker

Since the vhost-vdpa device is exposing _F_LOG, add a migration blocker
if it uses CVQ.

However, qemu is able to migrate simple devices with no CVQ as long as
they use SVQ. To allow it, add a placeholder error to vhost_vdpa, and
only add it to vhost_dev when used. The vhost_dev machinery places the
migration blocker if needed.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* vdpa: Add x-svq to NetdevVhostVDPAOptions

Finally offering the possibility to enable SVQ from the command line.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* softmmu/runstate.c: add RunStateTransition support from COLO to PRELAUNCH

If the checkpoint occurs when the guest finishes restarting
but has not started running, the runstate_set() may reject
the transition from COLO to PRELAUNCH with the crash log:

{"timestamp": {"seconds": 1593484591, "microseconds": 26605},\
"event": "RESET", "data": {"guest": true, "reason": "guest-reset"}}
qemu-system-x86_64: invalid runstate transition: 'colo' -> 'prelaunch'

Long-term testing says that it's pretty safe.
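
For illustration, a table-driven transition check might look like the
sketch below. The enum, the table contents and the function name are
hypothetical and much smaller than the real run state machine; the point
is only that allowing 'colo' -> 'prelaunch' amounts to one more entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of run states, for illustration only. */
typedef enum {
    TOY_RS_PRELAUNCH,
    TOY_RS_RUNNING,
    TOY_RS_COLO
} ToyRunState;

typedef struct {
    ToyRunState from;
    ToyRunState to;
} ToyTransition;

static const ToyTransition toy_transitions[] = {
    { TOY_RS_PRELAUNCH, TOY_RS_RUNNING },
    { TOY_RS_RUNNING,   TOY_RS_COLO },
    { TOY_RS_COLO,      TOY_RS_RUNNING },
    { TOY_RS_COLO,      TOY_RS_PRELAUNCH },  /* the kind of entry added here */
};

/* A transition is valid only if it is listed in the table. */
static bool toy_transition_valid(ToyRunState from, ToyRunState to)
{
    size_t n = sizeof(toy_transitions) / sizeof(toy_transitions[0]);

    for (size_t i = 0; i < n; i++) {
        if (toy_transitions[i].from == from && toy_transitions[i].to == to) {
            return true;
        }
    }
    return false;
}
```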

Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* net/colo: Fix a "double free" crash to clear the conn_list

We noticed that QEMU may crash when the guest has too many incoming
network connections, with the following log:

15197@1593578622.668573:colo_proxy_main : colo proxy connection hashtable full, clear it
free(): invalid pointer
[1]    15195 abort (core dumped)  qemu-system-x86_64 ....

This is because we create the s->connection_track_table with
g_hash_table_new_full() which is defined as:

GHashTable * g_hash_table_new_full (GHashFunc hash_func,
                       GEqualFunc key_equal_func,
                       GDestroyNotify key_destroy_func,
                       GDestroyNotify value_destroy_func);

The fourth parameter connection_destroy() will be called to free the
memory allocated for all 'Connection' values in the hashtable when
we call g_hash_table_remove_all() in the connection_hashtable_reset().

But both connection_track_table and conn_list refer to the same conn
instance, which triggers a double free when conn_list is cleared. So
this patch removes the free action on the hash table side to avoid
freeing the conn twice.
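
The ownership rule can be sketched without glib. In this toy model two
containers hold the same pointers, but only one of them frees. All
names are hypothetical; the arrays merely stand in for the hash table
and the list.

```c
#include <stdlib.h>

typedef struct ToyConn {
    int id;
} ToyConn;

static int toy_free_count;   /* counts frees so the sketch can be checked */

static void toy_conn_free(ToyConn *c)
{
    free(c);
    toy_free_count++;
}

#define TOY_MAX_CONNS 4

typedef struct {
    ToyConn *table[TOY_MAX_CONNS];  /* lookup index: borrows, never frees */
    ToyConn *list[TOY_MAX_CONNS];   /* owner: the only place that frees */
    int n;
} ToyTracker;

static ToyConn *toy_track(ToyTracker *t, int id)
{
    ToyConn *c;

    if (t->n == TOY_MAX_CONNS) {
        return NULL;
    }
    c = malloc(sizeof(*c));
    if (!c) {
        return NULL;
    }
    c->id = id;
    t->table[t->n] = c;
    t->list[t->n] = c;
    t->n++;
    return c;
}

static void toy_reset(ToyTracker *t)
{
    for (int i = 0; i < t->n; i++) {
        t->table[i] = NULL;          /* drop the borrowed reference only */
        toy_conn_free(t->list[i]);   /* single free, done by the owner */
        t->list[i] = NULL;
    }
    t->n = 0;
}
```

If the table side also called toy_conn_free(), each connection would be
freed twice, which is the crash described above.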

Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* net/colo.c: No need to track conn_list for filter-rewriter

Filter-rewriter does not need to track connections in conn_list.
This patch fixes the glib g_queue_is_empty assertion when the COLO guest
keeps a lot of network connections.

Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* net/colo.c: fix segmentation fault when packet is not parsed correctly

When COLO uses only one vnet_hdr_support parameter between
filter-redirector and filter-mirror (or colo-compare), COLO will crash
with a segmentation fault. The backtrace is as follows:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x0000555555cb200b in eth_get_l2_hdr_length (p=0x0)
    at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
296         uint16_t proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
(gdb) bt
0  0x0000555555cb200b in eth_get_l2_hdr_length (p=0x0)
    at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
1  0x0000555555cb22b4 in parse_packet_early (pkt=0x555556a44840) at
net/colo.c:49
2  0x0000555555cb2b91 in is_tcp_packet (pkt=0x555556a44840) at
net/filter-rewriter.c:63

So a wrong vnet_hdr_len will cause pkt->data to become NULL. Add a check
to raise an error, and add trace-events to track vnet_hdr_len.

Signed-off-by: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

* accel/kvm/kvm-all: Refactor per-vcpu dirty ring reaping

Add an optional argument 'CPUState' to kvm_dirty_ring_reap so
that it can cover the single-vcpu dirty-ring-reaping scenario.

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <c32001242875e83b0d9f78f396fe2dcd380ba9e8.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* cpus: Introduce cpu_list_generation_id

Introduce cpu_list_generation_id to track cpu list generation so
that cpu hotplug/unplug can be detected during measurement of
dirty page rate.

cpu_list_generation_id could be used to detect changes of cpu
list, which is prepared for dirty page rate measurement.
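
A sketch of the generation-counter idea, under the assumption that every
list change bumps the counter and a measurement compares it before and
after. All toy_* names are hypothetical and there is no locking here.

```c
#include <stdbool.h>

/* Hypothetical counter, bumped on every change to the cpu list. */
static unsigned int toy_cpu_list_generation_id;

static void toy_cpu_list_add(void)
{
    toy_cpu_list_generation_id++;
}

static void toy_cpu_list_remove(void)
{
    toy_cpu_list_generation_id++;
}

typedef void (*toy_measure_fn)(void);

/*
 * Run a measurement and report whether the cpu list stayed unchanged
 * while it ran. A caller could retry when this returns false.
 */
static bool toy_measure_is_stable(toy_measure_fn measure)
{
    unsigned int gen = toy_cpu_list_generation_id;

    measure();
    return gen == toy_cpu_list_generation_id;
}

/* Two stand-in measurements used to exercise the sketch. */
static void toy_quiet_measurement(void)
{
}

static void toy_measurement_with_hotplug(void)
{
    toy_cpu_list_add();      /* a cpu is plugged mid-measurement */
    toy_cpu_list_remove();   /* and unplugged again */
}
```

Note that the second stand-in leaves the list the same size as before,
yet the counter still reveals that it changed in between.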

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <06e1f1362b2501a471dce796abb065b04f320fa5.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration/dirtyrate: Refactor dirty page rate calculation

abstract out dirty log change logic into function
global_dirty_log_change.

abstract out dirty page rate calculation logic via
dirty-ring into function vcpu_calculate_dirtyrate.

abstract out mathematical dirty page rate calculation
into do_calculate_dirtyrate, decouple it from DirtyStat.

rename set_sample_page_period to dirty_stat_wait, which
is well-understood and will be reused in dirtylimit.

handle cpu hotplug/unplug scenario during measurement of
dirty page rate.

export util functions outside migration.
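
As a worked example of the arithmetic involved, the sketch below turns a
dirty page count into a rate in MiB/s. The function name, the parameter
list and the choice of unit are assumptions made for illustration; they
are not the signature of do_calculate_dirtyrate.

```c
#include <stdint.h>

/*
 * Dirty rate in MiB/s from the number of pages dirtied during
 * elapsed_ms milliseconds. Returns 0 for an empty interval.
 */
static uint64_t toy_dirtyrate_mibps(uint64_t dirty_pages,
                                    uint64_t page_size,
                                    uint64_t elapsed_ms)
{
    uint64_t bytes;

    if (elapsed_ms == 0) {
        return 0;
    }
    bytes = dirty_pages * page_size;
    /* bytes per second, then bytes -> MiB */
    return (bytes * 1000 / elapsed_ms) >> 20;
}
```

For instance, 25600 pages of 4096 bytes are 100 MiB; dirtied within one
second that gives a rate of 100, and within half a second a rate of 200.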

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <7b6f6f4748d5b3d017b31a0429e630229ae97538.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* softmmu/dirtylimit: Implement vCPU dirtyrate calculation periodically

Introduce the third method GLOBAL_DIRTY_LIMIT of dirty
tracking to calculate the dirtyrate periodically for dirty page
rate limit.

Add dirtylimit.c to implement periodic dirtyrate calculation,
which will be used for dirty page rate limit.

Add dirtylimit.h to export util functions for dirty page rate
limit implementation.

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <5d0d641bffcb9b1c4cc3e323b6dfecb36050d948.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* accel/kvm/kvm-all: Introduce kvm_dirty_ring_size function

Introduce kvm_dirty_ring_size util function to help calculate
dirty ring full time.

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <f9ce1f550bfc0e3a1f711e17b1dbc8f701700e56.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* softmmu/dirtylimit: Implement virtual CPU throttle

Set up a negative feedback system for when a vCPU thread
handles a KVM_EXIT_DIRTY_RING_FULL exit, by introducing the
throttle_us_per_full field in struct CPUState. Sleep
throttle_us_per_full microseconds to throttle the vCPU
if dirtylimit is in service.
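
The negative feedback idea can be sketched as a simple step controller.
This is a deliberately naive illustration: the fixed step and the
function name are hypothetical and do not describe the adjustment
policy implemented by this patch.

```c
#include <stdint.h>

/*
 * Return the next per-ring-full sleep time in microseconds: sleep
 * longer while the vCPU dirties faster than its quota, and shorter
 * (never below zero) while it is under the quota.
 */
static int64_t toy_adjust_throttle_us(int64_t throttle_us,
                                      uint64_t current_rate,
                                      uint64_t quota_rate,
                                      int64_t step_us)
{
    if (current_rate > quota_rate) {
        return throttle_us + step_us;
    }
    if (current_rate < quota_rate) {
        throttle_us -= step_us;
        return throttle_us < 0 ? 0 : throttle_us;
    }
    return throttle_us;
}
```

The value returned would play the role of throttle_us_per_full, that is,
how long the vCPU thread sleeps each time its dirty ring fills up.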

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <977e808e03a1cef5151cae75984658b6821be618.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* softmmu/dirtylimit: Implement dirty page rate limit

Implement periodic dirtyrate calculation based on
dirty-ring, and throttle the virtual CPU until it reaches the quota
dirty page rate given by the user.

Introduce qmp commands "set-vcpu-dirty-limit",
"cancel-vcpu-dirty-limit", "query-vcpu-dirty-limit"
to enable, disable, query dirty page limit for virtual CPU.

Meanwhile, introduce corresponding hmp commands
"set_vcpu_dirty_limit", "cancel_vcpu_dirty_limit",
"info vcpu_dirty_limit" so the feature can be more usable.

"query-vcpu-dirty-limit" success depends on enabling dirty
page rate limit, so just add it to the list of skipped
commands to ensure qmp-cmd-test runs successfully.

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <4143f26706d413dd29db0b672fe58b3d3fbe34bc.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* tests: Add dirty page rate limit test

Add dirty page rate limit test if the kernel supports dirty ring.

The following qmp commands are covered by this test case:
"calc-dirty-rate", "query-dirty-rate", "set-vcpu-dirty-limit",
"cancel-vcpu-dirty-limit" and "query-vcpu-dirty-limit".

Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <eed5b847a6ef0a9c02a36383dbdd7db367dd1e7e.1656177590.git.huangy81@chinatelecom.cn>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* multifd: Copy pages before compressing them with zlib

zlib_send_prepare() compresses pages of a running VM. zlib does not
make any thread-safety guarantees with respect to changing deflate()
input concurrently with deflate() [1].

One can observe problems due to this with the IBM zEnterprise Data
Compression accelerator capable zlib [2]. When the hardware
acceleration is enabled, migration/multifd/tcp/plain/zlib test fails
intermittently [3] due to sliding window corruption. The accelerator's
architecture explicitly discourages concurrent accesses [4]:

    Page 26-57, "Other Conditions":

    As observed by this CPU, other CPUs, and channel
    programs, references to the parameter block, first,
    second, and third operands may be multiple-access
    references, accesses to these storage locations are
    not necessarily block-concurrent, and the sequence
    of these accesses or references is undefined.

Mark Adler pointed out that vanilla zlib performs double fetches under
certain circumstances as well [5], therefore we need to copy data
before passing it to deflate().

[1] https://zlib.net/manual.html
[2] https://github.com/madler/zlib/pull/410
[3] https://lists.nongnu.org/archive/html/qemu-devel/2022-03/msg03988.html
[4] http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf
[5] https://lists.gnu.org/archive/html/qemu-devel/2022-07/msg00889.html
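
The copy-then-compress pattern can be sketched as follows. To keep the
example free of external libraries, the compressor is passed in as a
callback; toy_compress_fn, toy_compress_page and toy_store_only are
hypothetical names, and the identity "compressor" merely stands in for
deflate().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical compressor signature; returns bytes written, 0 on error. */
typedef size_t (*toy_compress_fn)(const uint8_t *in, size_t len,
                                  uint8_t *out, size_t out_cap);

/*
 * Snapshot the guest page into a private scratch buffer first, so the
 * compressor never reads memory that a running guest can still modify.
 */
static size_t toy_compress_page(const uint8_t *guest_page, uint8_t *scratch,
                                toy_compress_fn compress,
                                uint8_t *out, size_t out_cap)
{
    memcpy(scratch, guest_page, TOY_PAGE_SIZE);
    return compress(scratch, TOY_PAGE_SIZE, out, out_cap);
}

/* Identity "compressor", used only to exercise the sketch. */
static size_t toy_store_only(const uint8_t *in, size_t len,
                             uint8_t *out, size_t out_cap)
{
    if (len > out_cap) {
        return 0;
    }
    memcpy(out, in, len);
    return len;
}
```

The extra memcpy costs one page copy per compressed page, in exchange
for input that is stable for as long as the compressor reads it.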

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Message-Id: <20220705203559.2960949-1-iii@linux.ibm.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Add postcopy-preempt capability

Firstly, postcopy already preempts precopy due to the fact that we do
unqueue_page() first before looking into dirty bits.

However that's not enough.  For example, when host huge pages are enabled
and a precopy huge page is being sent, a postcopy request needs to wait
until the whole huge page has been sent.  That could introduce quite some
delay; the bigger the huge page is, the larger the delay it'll bring.

This patch adds a new capability to allow postcopy requests to preempt an
existing precopy page while a huge page is being sent, so that postcopy
requests can be serviced even faster.

Meanwhile to send it even faster, bypass the precopy stream by providing a
standalone postcopy socket for sending requested pages.

Since the new behavior will not be compatible with the old behavior, this will
not be the default, it's enabled only when the new capability is set on both
src/dst QEMUs.

This patch only adds the capability itself, the logic will be added in follow
up patches.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185342.26794-2-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Postcopy preemption preparation on channel creation

Create a new socket for postcopy to be prepared to send postcopy requested
pages via this specific channel, so as to not get blocked by precopy pages.

A new thread is also created on dest qemu to receive data from this new channel
based on the ram_load_postcopy() routine.

The ram_load_postcopy(POSTCOPY) branch and the thread have not started to
function yet; that'll be done in follow up patches.

Cleanup the new sockets on both src/dst QEMUs, meanwhile look after the new
thread too to make sure it'll be recycled properly.

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185502.27149-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: With Peter's fix to quieten compiler warning on
       start_migration

* migration: Postcopy preemption enablement

This patch enables postcopy-preempt feature.

It contains two major changes to the migration logic:

(1) Postcopy requests are now sent via a different socket from precopy
    background migration stream, so as to be isolated from very high page
    request delays.

(2) For huge page enabled hosts: when there's postcopy requests, they can now
    intercept a partial sending of huge host pages on src QEMU.

After this patch, we'll live migrate a VM with two channels for postcopy: (1)
PRECOPY channel, which is the default channel that transfers background pages;
and (2) POSTCOPY channel, which only transfers requested pages.

There's no strict rule of which channel to use, e.g., if a requested page is
already being transferred on precopy channel, then we will keep using the same
precopy channel to transfer the page even if it's explicitly requested.  In 99%
of the cases we'll prioritize the channels so we send requested pages via the
postcopy channel whenever possible.

On the source QEMU, when we find a postcopy request, we'll interrupt the
PRECOPY channel sending process and quickly switch to the POSTCOPY channel.
After we have serviced all the high priority postcopy pages, we'll switch back
to the PRECOPY channel so that we'll continue to send the interrupted huge page.
There's no new thread introduced on src QEMU.

On the destination QEMU, one new thread is introduced to receive page data from
the postcopy specific socket (done in the preparation patch).

This patch has a side effect: after sending postcopy pages, previously we
assumed the guest would access follow up pages, so we kept sending from there.
Now it's changed.  Instead of going on with a postcopy requested page, we'll go
back and continue sending the precopy huge page (which can be intercepted by a
postcopy request so the huge page can be sent partially before).

Whether that's a problem is debatable, because "assuming the guest will
continue to access the next page" may not really suit when huge pages are
used, especially if the huge page is large (e.g. 1GB pages).  So that locality
hint is much less meaningful if huge pages are used.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185504.27203-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Postcopy recover with preempt enabled

To allow postcopy recovery, the ram fast load (preempt-only) dest QEMU thread
needs similar handling on fault tolerance.  When ram_load_postcopy() fails,
instead of stopping the thread it halts with a semaphore, preparing to be
kicked again when recovery is detected.

A mutex is introduced to make sure there's no concurrent operation upon the
socket.  To make it simple, the fast ram load thread will take the mutex during
its whole procedure, and only release it if it's paused.  The fast-path socket
will be properly released by the main loading thread safely when there's
network failures during postcopy with that mutex held.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185506.27257-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Create the postcopy preempt channel asynchronously

This patch allows the postcopy preempt channel to be created
asynchronously.  The benefit is that when the connection is slow, we won't
take the BQL (and potentially block all things like QMP) for a long time
without releasing.

A function postcopy_preempt_wait_channel() is introduced, allowing the
migration thread to be able to wait on the channel creation.  The channel
is always created by the main thread, in which we'll kick a new semaphore
to tell the migration thread that the channel has been created.

We'll need to wait for the new channel in two places: (1) when there's a
new postcopy migration that is starting, or (2) when there's a postcopy
migration to resume.

For the start of migration, we don't need to wait for this channel until
we want to start postcopy, aka, postcopy_start().  We'll fail the
migration if we find that the channel creation failed (which should
probably not happen at all in 99% of the cases, because the main channel is
using the same network topology).

For a postcopy recovery, we'll need to wait in postcopy_pause().  In that
case if the channel creation failed, we can't fail the migration or we'll
crash the VM, instead we keep in PAUSED state, waiting for yet another
recovery.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Manish Mishra <manish.mishra@nutanix.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185509.27311-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Add property x-postcopy-preempt-break-huge

Add a property field that can conditionally disable the "break sending huge
page" behavior in postcopy preemption.  By default it's enabled.

It should only be used for debugging purposes, and we should never remove
the "x-" prefix.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Manish Mishra <manish.mishra@nutanix.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185511.27366-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Add helpers to detect TLS capability

Add migrate_channel_requires_tls() to detect whether the specific channel
requires TLS, leveraging the recently introduced migrate_use_tls().  No
functional change intended.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185513.27421-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Export tls-[creds|hostname|authz] params to cmdline too

It's useful for specifying tls credentials all in the cmdline (along with
the -object tls-creds-*), especially for debugging purpose.

The trick here is we must remember to not free these fields again in the
finalize() function of migration object, otherwise it'll cause double-free.

The thing is when destroying an object, we'll first destroy the properties
that are bound to the object, then the object itself.  To be explicit, when
destroying the object in object_finalize() we have the following sequence of
operations:

    object_property_del_all(obj);
    object_deinit(obj, ti);

So after this change the two fields are properly released already even
before reaching the finalize() function but in object_property_del_all(),
hence we don't need to free them anymore in finalize() or it's double-free.

This also fixes a trivial memory leak for tls-authz as we forgot to free it
before this patch.

Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185515.27475-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Enable TLS for preempt channel

This patch is based on the async preempt channel creation.  It continues
wiring up the new channel with TLS handshake to destination when enabled.

Note that only the src QEMU needs such operation; the dest QEMU does not
need any change for TLS support due to the fact that all channels are
established synchronously there, so all the TLS magic is already properly
handled by migration_tls_channel_process_incoming().

Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185518.27529-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration: Respect postcopy request order in preemption mode

With preemption mode on, when we see a postcopy request that was requesting
for exactly the page that we have preempted before (so we've partially sent
the page already via PRECOPY channel and it got preempted by another
postcopy request), currently we drop the request so that after all the
other postcopy requests are serviced then we'll go back to precopy stream
and start to handle that.

We dropped the request because we can't send it via postcopy channel since
the precopy channel already contains part of the data, and we can only
send a huge page via one channel as a whole.  We can't split a huge page
into two channels.

That's a very corner case and that works, but there's a change on the order
of postcopy requests that we handle since we're postponing this (unlucky)
postcopy request to be later than the other queued postcopy requests.  The
problem is there's a possibility that when the guest was very busy, the
postcopy queue can be always non-empty, it means this dropped request will
never be handled until the end of postcopy migration. So, there's a chance
that there's one dest QEMU vcpu thread waiting for a page fault for an
extremely long time just because it's unluckily accessing the specific page
that was preempted before.

The worst case time it needs can be as long as the whole postcopy migration
procedure.  It's extremely unlikely to happen, but when it happens it's not
good.

The root cause of this problem is that we treat the pss->postcopy_requested
variable as having two meanings bound together:

  1. Whether this page request is urgent, and,
  2. Which channel we should use for this page request.

With the old code, when we set postcopy_requested it means either both (1)
and (2) are true, or both (1) and (2) are false.  We can never have (1)
and (2) take different values.

However it doesn't necessarily need to be like that.  It's perfectly legal
for a request to have (1) very high urgency while (2) we'd still like to
use the precopy channel, just like the corner case discussed above.

To differentiate the two meanings better, introduce a new field called
postcopy_target_channel, showing which channel we should use for this page
request, so as to cover the old meaning (2) only.  Then we leave the
postcopy_requested variable to stand only for meaning (1), which is the
urgency of this page request.

With this change, we can easily boost the priority of a preempted precopy
page as long as we know that page is also requested as a postcopy page.  So
with the new approach, in get_queued_page(), instead of dropping that
request we send it right away via the precopy channel, so that page faults
are serviced in the order in which they were requested on dest.
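
As a rough illustration of the split described above, here is a minimal
sketch. It is not the actual QEMU code: the sketch_* names, the reduced
struct and the helper are hypothetical, and only the two field names are
taken from this commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical channel identifiers, for illustration only. */
enum sketch_channel {
    SKETCH_CH_PRECOPY = 0,
    SKETCH_CH_POSTCOPY = 1,
};

/* Hypothetical, heavily reduced stand-in for the page search status. */
struct sketch_pss {
    /* Meaning (1): is this page request urgent? */
    bool postcopy_requested;
    /* Meaning (2): which channel should carry this page? */
    enum sketch_channel postcopy_target_channel;
};

/*
 * Classify one page request.
 * @urgent: the request comes from a page fault on the destination.
 * @partially_sent_on_precopy: the huge page was preempted while being
 *     sent on the precopy channel, so it has to be finished there.
 */
static void sketch_classify_request(struct sketch_pss *pss, bool urgent,
                                    bool partially_sent_on_precopy)
{
    pss->postcopy_requested = urgent;
    if (urgent && !partially_sent_on_precopy) {
        pss->postcopy_target_channel = SKETCH_CH_POSTCOPY;
    } else {
        /* Urgent or not, a partially sent page stays on precopy. */
        pss->postcopy_target_channel = SKETCH_CH_PRECOPY;
    }
}
```

The case of interest is urgent == true together with
partially_sent_on_precopy == true: the request keeps its urgency but is
routed to the precopy channel, a combination a single flag could not
express.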

Reported-by: Manish Mishra <manish.mishra@nutanix.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Manish Mishra <manish.mishra@nutanix.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185520.27583-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* tests: Move MigrateCommon upper

So that it can soon be used in postcopy tests too.

Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185522.27638-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* tests: Add postcopy tls migration test

We just added TLS tests for precopy but not postcopy.  Add the
corresponding test for vanilla postcopy.

Rename the vanilla postcopy to "postcopy/plain" because all postcopy tests
will only use unix sockets as channel.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185525.27692-1-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: Manual merge

* tests: Add postcopy tls recovery migration test

It's easy to build this upon the postcopy tls test.  Rename the old
postcopy recovery test to postcopy/recovery/plain.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185527.27747-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: Manual merge

* tests: Add postcopy preempt tests

Four tests are added for preempt mode:

  - Postcopy plain
  - Postcopy recovery
  - Postcopy tls
  - Postcopy tls+recovery

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220707185530.27801-1-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: Manual merge

* migration: remove unreachable code after reading data

The code calls qio_channel_read() in a loop when it reports
QIO_CHANNEL_ERR_BLOCK. This code is reported when errno==EAGAIN.

As such the later block of code will always hit the 'errno != EAGAIN'
condition, making the final 'else' unreachable.
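
The reasoning can be illustrated with a reduced retry loop. This is only a
sketch under stated assumptions: SKETCH_ERR_BLOCK, sketch_read_fn and both
functions are hypothetical stand-ins rather than the QEMU channel API, and
the real code waits for the channel to become readable instead of retrying
immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "would block" return code (stands in for EAGAIN). */
#define SKETCH_ERR_BLOCK (-2L)

/* Hypothetical reader callback: bytes read, or a negative code. */
typedef long (*sketch_read_fn)(void *opaque, char *buf, size_t len);

/* Keep reading for as long as the reader reports "would block". */
static long sketch_read_retry(sketch_read_fn rd, void *opaque,
                              char *buf, size_t len)
{
    long n;

    do {
        n = rd(opaque, buf, len);
    } while (n == SKETCH_ERR_BLOCK);

    /*
     * The loop only exits when n != SKETCH_ERR_BLOCK, so any later
     * branch that handles the "would block" case can never run.
     */
    return n;
}

/* Demo reader: blocks *opaque times, then delivers up to 3 bytes. */
static long sketch_flaky_read(void *opaque, char *buf, size_t len)
{
    int *blocks_left = opaque;
    size_t n = len < 3 ? len : 3;

    if (*blocks_left > 0) {
        (*blocks_left)--;
        return SKETCH_ERR_BLOCK;
    }
    memcpy(buf, "abc", n);
    return (long)n;
}
```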

Fixes: Coverity CID 1490203
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220627135318.156121-1-berrange@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent

If flush is called when no buffer was sent with MSG_ZEROCOPY, it currently
returns 1. This return code should be used only when Linux fails to use
MSG_ZEROCOPY on a lot of sendmsg().

Fix this by returning early from flush if no sendmsg(...,MSG_ZEROCOPY)
was attempted.
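
A minimal sketch of the early return, assuming a hypothetical bookkeeping
struct. The counter names and the simplified return convention (0 for
nothing to report, 1 when every pending send fell back to copying) are
illustrative only and do not reproduce the QIOChannelSocket code.

```c
#include <assert.h>

/* Hypothetical bookkeeping for zero-copy sends, illustration only. */
struct sketch_zc_state {
    unsigned queued;         /* sendmsg(..., MSG_ZEROCOPY) calls issued */
    unsigned flushed;        /* calls already covered by earlier flushes */
    unsigned used_zero_copy; /* pending calls that really were zero-copy */
};

/*
 * Returns 0 when there is nothing to report, 1 when sends were pending
 * and none of them actually used zero-copy.
 */
static int sketch_flush(struct sketch_zc_state *s)
{
    int ret;

    if (s->queued == s->flushed) {
        /* No zero-copy send was attempted since the last flush. */
        return 0;
    }

    ret = (s->used_zero_copy == 0) ? 1 : 0;
    s->flushed = s->queued;
    s->used_zero_copy = 0;
    return ret;
}
```

Without the first check, a flush on an idle channel would take the
"no send used zero-copy" path and report 1 even though nothing was sent.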

Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX")
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220711211112.18951-2-leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* Add dirty-sync-missed-zero-copy migration stat

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220711211112.18951-3-leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* migration/multifd: Report to user when zerocopy not working

Some errors, like the lack of Scatter-Gather support by the network
interface (NETIF_F_SG), may cause sendmsg(...,MSG_ZEROCOPY) to fail to use
zero-copy, which causes it to fall back to the default copying mechanism.

After each full dirty-bitmap scan there should be a zero-copy flush
happening, which checks each of the previous calls to
sendmsg(...,MSG_ZEROCOPY) for errors. If all of them failed to use
zero-copy, then increment the dirty_sync_missed_zero_copy migration stat to
let the user know about it.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220711211112.18951-4-leobras@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* multifd: Document the locking of MultiFD{Send/Recv}Params

Reorder the structures so we can know if the fields are:
- Read only
- Using their own locking (i.e. sems)
- Protected by 'mutex'
- Only for the multifd channel

Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20220531104318.7494-2-quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: Typo fixes from Chen Zhang

* migration: Avoid false-positive on non-supported scenarios for zero-copy-send

Migration with zero-copy-send currently has its limitations, as it can't
be used with TLS nor any kind of compression. In such scenarios, it should
output errors during parameter / capability setting.

But currently there are some ways of setting these unsupported scenarios
without printing the error message:

1) For the 'compression' capability, it works by enabling it together with
zero-copy-send. This happens because the validity test for zero-copy uses
the helper function migrate_use_compression(), which checks for compression
presence in s->enabled_capabilities[MIGRATION_CAPABILITY_COMPRESS].

The point here is: the validity test happens before the capability gets
enabled. If all of them get enabled together, this test will not return
error.

In order to fix that, replace migrate_use_compression() by directly testing
the cap_list parameter of migrate_caps_check().

2) For features enabled by parameters such as TLS & 'multifd_compression',
there was also a possibility of setting non-supported scenarios: setting
zero-copy-send first, then setting the unsupported parameter.

In order to fix that, also add a check for parameters conflicting with
zero-copy-send on migrate_params_check().

3) XBZRLE is also a compression capability, so it makes sense to also add
it to the list of capabilities which are not supported with zero-copy-send.
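
The ordering problem in (1) and the extra check in (3) can be sketched as
follows. This is a simplified model: the sketch_* enum and functions are
hypothetical, and only the idea of validating the proposed capability list
instead of the already-enabled state comes from this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, reduced capability set, illustration only. */
enum sketch_cap {
    SKETCH_CAP_XBZRLE,
    SKETCH_CAP_COMPRESS,
    SKETCH_CAP_ZERO_COPY_SEND,
    SKETCH_CAP_MAX,
};

/*
 * Validate the *proposed* list. Consulting the currently enabled state
 * instead would miss conflicts between capabilities that are being
 * enabled in the same request.
 */
static bool sketch_caps_check(const bool *cap_list)
{
    if (cap_list[SKETCH_CAP_ZERO_COPY_SEND] &&
        (cap_list[SKETCH_CAP_COMPRESS] || cap_list[SKETCH_CAP_XBZRLE])) {
        return false;
    }
    return true;
}

/* Apply @proposed to @enabled only if the combination is valid. */
static bool sketch_set_caps(bool *enabled, const bool *proposed)
{
    if (!sketch_caps_check(proposed)) {
        return false;
    }
    memcpy(enabled, proposed, sizeof(bool) * SKETCH_CAP_MAX);
    return true;
}
```

Enabling compression and zero-copy-send in one request is rejected here
because both bits are visible in the proposed list at check time.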

Fixes: 1abaec9a1b2c ("migration: Change zero_copy_send from migration parameter to migration capability")
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220719122345.253713-1-leobras@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

* Revert "gitlab: disable accelerated zlib for s390x"

This reverts commit 309df6acb29346f89e1ee542b1986f60cab12b87.
With Ilya's 'multifd: Copy pages before compressing them with zlib'
in the latest migration series, this shouldn't be a problem any more.

Suggested-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>

* slow snapshots api

Co-authored-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Co-authored-by: Paolo Bonzini <pbonzini@redhat.com>
Co-authored-by: Peter Maydell <peter.maydell@linaro.org>
Co-authored-by: Joel Stanley <joel@jms.id.au>
Co-authored-by: Peter Delevoryas <pdel@fb.com>
Co-authored-by: Peter Delevoryas <peter@pjd.dev>
Co-authored-by: Cédric Le Goater <clg@kaod.org>
Co-authored-by: Iris Chen <irischenlj@fb.com>
Co-authored-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Co-authored-by: Niklas Cassel <niklas.cassel@wdc.com>
Co-authored-by: Darren Kenny <darren.kenny@oracle.com>
Co-authored-by: Konstantin Kostiuk <kkostiuk@redhat.com>
Co-authored-by: Richard Henderson <richard.henderson@linaro.org>
Co-authored-by: Hao Wu <wuhaotsh@google.com>
Co-authored-by: Andrey Makarov <ph.makarov@gmail.com>
Co-authored-by: Jason A. Donenfeld <Jason@zx2c4.com>
Co-authored-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Co-authored-by: Matheus Ferst <matheus.ferst@eldorado.org.br>
Co-authored-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
Co-authored-by: Lucas Coutinho <lucas.coutinho@eldorado.org.br>
Co-authored-by: John Snow <jsnow@redhat.com>
Co-authored-by: Song Gao <gaosong@loongson.cn>
Co-authored-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Co-authored-by: Thomas Huth <thuth@redhat.com>
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
Co-authored-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Co-authored-by: Gerd Hoffmann <kraxel@redhat.com>
Co-authored-by: Mauro Matteo Cascella <mcascell@redhat.com>
Co-authored-by: Felix xq Queißner <xq@random-projects.net>
Co-authored-by: Xiaojuan Yang <yangxiaojuan@loongson.cn>
Co-authored-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Co-authored-by: Taylor Simpson <tsimpson@quicinc.com>
Co-authored-by: Eugenio Pérez <eperezma@redhat.com>
Co-authored-by: Zhang Chen <chen.zhang@intel.com>
Co-authored-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Co-authored-by: Peter Xu <peterx@redhat.com>
Co-authored-by: Daniel P. Berrangé <berrange@redhat.com>
Co-authored-by: Leonardo Bras <leobras@redhat.com>
Co-authored-by: Juan Quintela <quintela@redhat.com>
Co-authored-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2022-07-22 17:02:58 +02:00


/*
* QEMU System Emulator block driver
*
* Copyright (c) 2003 Fabrice Bellard
* Copyright (c) 2020 Virtuozzo International GmbH.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#include "qemu/osdep.h"
#include "block/trace.h"
#include "block/block_int.h"
#include "block/blockjob.h"
#include "block/fuse.h"
#include "block/nbd.h"
#include "block/qdict.h"
#include "qemu/error-report.h"
#include "block/module_block.h"
#include "qemu/main-loop.h"
#include "qemu/module.h"
#include "qapi/error.h"
#include "qapi/qmp/qdict.h"
#include "qapi/qmp/qjson.h"
#include "qapi/qmp/qnull.h"
#include "qapi/qmp/qstring.h"
#include "qapi/qobject-output-visitor.h"
#include "qapi/qapi-visit-block-core.h"
#include "sysemu/block-backend.h"
#include "qemu/notify.h"
#include "qemu/option.h"
#include "qemu/coroutine.h"
#include "block/qapi.h"
#include "qemu/timer.h"
#include "qemu/cutils.h"
#include "qemu/id.h"
#include "qemu/range.h"
#include "qemu/rcu.h"
#include "block/coroutines.h"
#ifdef CONFIG_BSD
#include <sys/ioctl.h>
#include <sys/queue.h>
#if defined(HAVE_SYS_DISK_H)
#include <sys/disk.h>
#endif
#endif
#ifdef _WIN32
#include <windows.h>
#endif
#define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
/* Protected by BQL */
static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
/* Protected by BQL */
static QTAILQ_HEAD(, BlockDriverState) all_bdrv_states =
QTAILQ_HEAD_INITIALIZER(all_bdrv_states);
/* Protected by BQL */
static QLIST_HEAD(, BlockDriver) bdrv_drivers =
QLIST_HEAD_INITIALIZER(bdrv_drivers);
static BlockDriverState *bdrv_open_inherit(const char *filename,
const char *reference,
QDict *options, int flags,
BlockDriverState *parent,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
Error **errp);
static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child);
static void bdrv_child_free(BdrvChild *child);
static void bdrv_replace_child_noperm(BdrvChild **child,
BlockDriverState *new_bs,
bool free_empty_child);
static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
BdrvChild *child,
Transaction *tran);
static void bdrv_remove_filter_or_cow_child(BlockDriverState *bs,
Transaction *tran);
static int bdrv_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue,
Transaction *change_child_tran, Error **errp);
static void bdrv_reopen_commit(BDRVReopenState *reopen_state);
static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
static bool bdrv_backing_overridden(BlockDriverState *bs);
/* If non-zero, use only whitelisted block drivers */
static int use_bdrv_whitelist;
#ifdef _WIN32
static int is_windows_drive_prefix(const char *filename)
{
return (((filename[0] >= 'a' && filename[0] <= 'z') ||
(filename[0] >= 'A' && filename[0] <= 'Z')) &&
filename[1] == ':');
}
int is_windows_drive(const char *filename)
{
if (is_windows_drive_prefix(filename) &&
filename[2] == '\0') {
return 1;
}
if (strstart(filename, "\\\\.\\", NULL) ||
strstart(filename, "//./", NULL)) {
return 1;
}
return 0;
}
#endif
size_t bdrv_opt_mem_align(BlockDriverState *bs)
{
if (!bs || !bs->drv) {
/* page size or 4k (hdd sector size) should be on the safe side */
return MAX(4096, qemu_real_host_page_size());
}
IO_CODE();
return bs->bl.opt_mem_alignment;
}
size_t bdrv_min_mem_align(BlockDriverState *bs)
{
if (!bs || !bs->drv) {
/* page size or 4k (hdd sector size) should be on the safe side */
return MAX(4096, qemu_real_host_page_size());
}
IO_CODE();
return bs->bl.min_mem_alignment;
}
/* check if the path starts with "<protocol>:" */
int path_has_protocol(const char *path)
{
const char *p;
#ifdef _WIN32
if (is_windows_drive(path) ||
is_windows_drive_prefix(path)) {
return 0;
}
p = path + strcspn(path, ":/\\");
#else
p = path + strcspn(path, ":/");
#endif
return *p == ':';
}
int path_is_absolute(const char *path)
{
#ifdef _WIN32
/* specific case for names like: "\\.\d:" */
if (is_windows_drive(path) || is_windows_drive_prefix(path)) {
return 1;
}
return (*path == '/' || *path == '\\');
#else
return (*path == '/');
#endif
}
/* If filename is absolute, just return its duplicate. Otherwise, build a
path to it by treating it as relative to base_path. URLs are
supported. */
char *path_combine(const char *base_path, const char *filename)
{
const char *protocol_stripped = NULL;
const char *p, *p1;
char *result;
int len;
if (path_is_absolute(filename)) {
return g_strdup(filename);
}
if (path_has_protocol(base_path)) {
protocol_stripped = strchr(base_path, ':');
if (protocol_stripped) {
protocol_stripped++;
}
}
p = protocol_stripped ?: base_path;
p1 = strrchr(base_path, '/');
#ifdef _WIN32
{
const char *p2;
p2 = strrchr(base_path, '\\');
if (!p1 || p2 > p1) {
p1 = p2;
}
}
#endif
if (p1) {
p1++;
} else {
p1 = base_path;
}
if (p1 > p) {
p = p1;
}
len = p - base_path;
result = g_malloc(len + strlen(filename) + 1);
memcpy(result, base_path, len);
strcpy(result + len, filename);
return result;
}
/*
* Helper function for bdrv_parse_filename() implementations to remove optional
* protocol prefixes (especially "file:") from a filename and for putting the
* stripped filename into the options QDict if there is such a prefix.
*/
void bdrv_parse_filename_strip_prefix(const char *filename, const char *prefix,
QDict *options)
{
if (strstart(filename, prefix, &filename)) {
/* Stripping the explicit protocol prefix may result in a protocol
* prefix being (wrongly) detected (if the filename contains a colon) */
if (path_has_protocol(filename)) {
GString *fat_filename;
/* This means there is some colon before the first slash; therefore,
* this cannot be an absolute path */
assert(!path_is_absolute(filename));
/* And we can thus fix the protocol detection issue by prefixing it
* by "./" */
fat_filename = g_string_new("./");
g_string_append(fat_filename, filename);
assert(!path_has_protocol(fat_filename->str));
qdict_put(options, "filename",
qstring_from_gstring(fat_filename));
} else {
/* If no protocol prefix was detected, we can use the shortened
* filename as-is */
qdict_put_str(options, "filename", filename);
}
}
}
/* Returns whether the image file is opened as read-only. Note that this can
* return false and writing to the image file is still not possible because the
* image is inactivated. */
bool bdrv_is_read_only(BlockDriverState *bs)
{
IO_CODE();
return !(bs->open_flags & BDRV_O_RDWR);
}
int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
bool ignore_allow_rdw, Error **errp)
{
IO_CODE();
/* Do not set read_only if copy_on_read is enabled */
if (bs->copy_on_read && read_only) {
error_setg(errp, "Can't set node '%s' to r/o with copy-on-read enabled",
bdrv_get_device_or_node_name(bs));
return -EINVAL;
}
/* Do not clear read_only if it is prohibited */
if (!read_only && !(bs->open_flags & BDRV_O_ALLOW_RDWR) &&
!ignore_allow_rdw)
{
error_setg(errp, "Node '%s' is read only",
bdrv_get_device_or_node_name(bs));
return -EPERM;
}
return 0;
}
/*
* Called by a driver that can only provide a read-only image.
*
* Returns 0 if the node is already read-only or it could switch the node to
* read-only because BDRV_O_AUTO_RDONLY is set.
*
* Returns -EACCES if the node is read-write and BDRV_O_AUTO_RDONLY is not set
* or bdrv_can_set_read_only() forbids making the node read-only. If @errmsg
* is not NULL, it is used as the error message for the Error object.
*/
int bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
Error **errp)
{
int ret = 0;
IO_CODE();
if (!(bs->open_flags & BDRV_O_RDWR)) {
return 0;
}
if (!(bs->open_flags & BDRV_O_AUTO_RDONLY)) {
goto fail;
}
ret = bdrv_can_set_read_only(bs, true, false, NULL);
if (ret < 0) {
goto fail;
}
bs->open_flags &= ~BDRV_O_RDWR;
return 0;
fail:
error_setg(errp, "%s", errmsg ?: "Image is read-only");
return -EACCES;
}
/*
* If @backing is empty, this function returns NULL without setting
* @errp. In all other cases, NULL will only be returned with @errp
* set.
*
* Therefore, a return value of NULL without @errp set means that
* there is no backing file; if @errp is set, there is one but its
* absolute filename cannot be generated.
*/
char *bdrv_get_full_backing_filename_from_filename(const char *backed,
const char *backing,
Error **errp)
{
if (backing[0] == '\0') {
return NULL;
} else if (path_has_protocol(backing) || path_is_absolute(backing)) {
return g_strdup(backing);
} else if (backed[0] == '\0' || strstart(backed, "json:", NULL)) {
error_setg(errp, "Cannot use relative backing file names for '%s'",
backed);
return NULL;
} else {
return path_combine(backed, backing);
}
}
/*
* If @filename is empty or NULL, this function returns NULL without
* setting @errp. In all other cases, NULL will only be returned with
* @errp set.
*/
static char *bdrv_make_absolute_filename(BlockDriverState *relative_to,
const char *filename, Error **errp)
{
char *dir, *full_name;
if (!filename || filename[0] == '\0') {
return NULL;
} else if (path_has_protocol(filename) || path_is_absolute(filename)) {
return g_strdup(filename);
}
dir = bdrv_dirname(relative_to, errp);
if (!dir) {
return NULL;
}
full_name = g_strconcat(dir, filename, NULL);
g_free(dir);
return full_name;
}
char *bdrv_get_full_backing_filename(BlockDriverState *bs, Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_make_absolute_filename(bs, bs->backing_file, errp);
}
void bdrv_register(BlockDriver *bdrv)
{
assert(bdrv->format_name);
GLOBAL_STATE_CODE();
QLIST_INSERT_HEAD(&bdrv_drivers, bdrv, list);
}
BlockDriverState *bdrv_new(void)
{
BlockDriverState *bs;
int i;
GLOBAL_STATE_CODE();
bs = g_new0(BlockDriverState, 1);
QLIST_INIT(&bs->dirty_bitmaps);
for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
QLIST_INIT(&bs->op_blockers[i]);
}
qemu_co_mutex_init(&bs->reqs_lock);
qemu_mutex_init(&bs->dirty_bitmap_mutex);
bs->refcnt = 1;
bs->aio_context = qemu_get_aio_context();
qemu_co_queue_init(&bs->flush_queue);
qemu_co_mutex_init(&bs->bsc_modify_lock);
bs->block_status_cache = g_new0(BdrvBlockStatusCache, 1);
for (i = 0; i < bdrv_drain_all_count; i++) {
bdrv_drained_begin(bs);
}
QTAILQ_INSERT_TAIL(&all_bdrv_states, bs, bs_list);
return bs;
}
static BlockDriver *bdrv_do_find_format(const char *format_name)
{
BlockDriver *drv1;
GLOBAL_STATE_CODE();
QLIST_FOREACH(drv1, &bdrv_drivers, list) {
if (!strcmp(drv1->format_name, format_name)) {
return drv1;
}
}
return NULL;
}
BlockDriver *bdrv_find_format(const char *format_name)
{
BlockDriver *drv1;
int i;
GLOBAL_STATE_CODE();
drv1 = bdrv_do_find_format(format_name);
if (drv1) {
return drv1;
}
/* The driver isn't registered, maybe we need to load a module */
for (i = 0; i < (int)ARRAY_SIZE(block_driver_modules); ++i) {
if (!strcmp(block_driver_modules[i].format_name, format_name)) {
block_module_load_one(block_driver_modules[i].library_name);
break;
}
}
return bdrv_do_find_format(format_name);
}
static int bdrv_format_is_whitelisted(const char *format_name, bool read_only)
{
static const char *whitelist_rw[] = {
CONFIG_BDRV_RW_WHITELIST
NULL
};
static const char *whitelist_ro[] = {
CONFIG_BDRV_RO_WHITELIST
NULL
};
const char **p;
if (!whitelist_rw[0] && !whitelist_ro[0]) {
return 1; /* no whitelist, anything goes */
}
for (p = whitelist_rw; *p; p++) {
if (!strcmp(format_name, *p)) {
return 1;
}
}
if (read_only) {
for (p = whitelist_ro; *p; p++) {
if (!strcmp(format_name, *p)) {
return 1;
}
}
}
return 0;
}
int bdrv_is_whitelisted(BlockDriver *drv, bool read_only)
{
GLOBAL_STATE_CODE();
return bdrv_format_is_whitelisted(drv->format_name, read_only);
}
bool bdrv_uses_whitelist(void)
{
return use_bdrv_whitelist;
}
typedef struct CreateCo {
BlockDriver *drv;
char *filename;
QemuOpts *opts;
int ret;
Error *err;
} CreateCo;
static void coroutine_fn bdrv_create_co_entry(void *opaque)
{
Error *local_err = NULL;
int ret;
CreateCo *cco = opaque;
assert(cco->drv);
GLOBAL_STATE_CODE();
ret = cco->drv->bdrv_co_create_opts(cco->drv,
cco->filename, cco->opts, &local_err);
error_propagate(&cco->err, local_err);
cco->ret = ret;
}
int bdrv_create(BlockDriver *drv, const char *filename,
QemuOpts *opts, Error **errp)
{
int ret;
GLOBAL_STATE_CODE();
Coroutine *co;
CreateCo cco = {
.drv = drv,
.filename = g_strdup(filename),
.opts = opts,
.ret = NOT_DONE,
.err = NULL,
};
if (!drv->bdrv_co_create_opts) {
error_setg(errp, "Driver '%s' does not support image creation",
drv->format_name);
ret = -ENOTSUP;
goto out;
}
if (qemu_in_coroutine()) {
/* Fast-path if already in coroutine context */
bdrv_create_co_entry(&cco);
} else {
co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
qemu_coroutine_enter(co);
while (cco.ret == NOT_DONE) {
aio_poll(qemu_get_aio_context(), true);
}
}
ret = cco.ret;
if (ret < 0) {
if (cco.err) {
error_propagate(errp, cco.err);
} else {
error_setg_errno(errp, -ret, "Could not create image");
}
}
out:
g_free(cco.filename);
return ret;
}
/**
* Helper function for bdrv_create_file_fallback(): Resize @blk to at
* least the given @minimum_size.
*
* On success, return @blk's actual length.
* Otherwise, return -errno.
*/
static int64_t create_file_fallback_truncate(BlockBackend *blk,
int64_t minimum_size, Error **errp)
{
Error *local_err = NULL;
int64_t size;
int ret;
GLOBAL_STATE_CODE();
ret = blk_truncate(blk, minimum_size, false, PREALLOC_MODE_OFF, 0,
&local_err);
if (ret < 0 && ret != -ENOTSUP) {
error_propagate(errp, local_err);
return ret;
}
size = blk_getlength(blk);
if (size < 0) {
error_free(local_err);
error_setg_errno(errp, -size,
"Failed to inquire the new image file's length");
return size;
}
if (size < minimum_size) {
/* Need to grow the image, but we failed to do that */
error_propagate(errp, local_err);
return -ENOTSUP;
}
error_free(local_err);
local_err = NULL;
return size;
}
/**
* Helper function for bdrv_create_file_fallback(): Zero the first
* sector to remove any potentially pre-existing image header.
*/
static int create_file_fallback_zero_first_sector(BlockBackend *blk,
int64_t current_size,
Error **errp)
{
int64_t bytes_to_clear;
int ret;
GLOBAL_STATE_CODE();
bytes_to_clear = MIN(current_size, BDRV_SECTOR_SIZE);
if (bytes_to_clear) {
ret = blk_pwrite_zeroes(blk, 0, bytes_to_clear, BDRV_REQ_MAY_UNMAP);
if (ret < 0) {
error_setg_errno(errp, -ret,
"Failed to clear the new image's first sector");
return ret;
}
}
return 0;
}
/**
* Simple implementation of bdrv_co_create_opts for protocol drivers
* which only support creation via opening a file
* (usually existing raw storage device)
*/
int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv,
const char *filename,
QemuOpts *opts,
Error **errp)
{
BlockBackend *blk;
QDict *options;
int64_t size = 0;
char *buf = NULL;
PreallocMode prealloc;
Error *local_err = NULL;
int ret;
GLOBAL_STATE_CODE();
size = qemu_opt_get_size_del(opts, BLOCK_OPT_SIZE, 0);
buf = qemu_opt_get_del(opts, BLOCK_OPT_PREALLOC);
prealloc = qapi_enum_parse(&PreallocMode_lookup, buf,
PREALLOC_MODE_OFF, &local_err);
g_free(buf);
if (local_err) {
error_propagate(errp, local_err);
return -EINVAL;
}
if (prealloc != PREALLOC_MODE_OFF) {
error_setg(errp, "Unsupported preallocation mode '%s'",
PreallocMode_str(prealloc));
return -ENOTSUP;
}
options = qdict_new();
qdict_put_str(options, "driver", drv->format_name);
blk = blk_new_open(filename, NULL, options,
BDRV_O_RDWR | BDRV_O_RESIZE, errp);
if (!blk) {
error_prepend(errp, "Protocol driver '%s' does not support image "
"creation, and opening the image failed: ",
drv->format_name);
return -EINVAL;
}
size = create_file_fallback_truncate(blk, size, errp);
if (size < 0) {
ret = size;
goto out;
}
ret = create_file_fallback_zero_first_sector(blk, size, errp);
if (ret < 0) {
goto out;
}
ret = 0;
out:
blk_unref(blk);
return ret;
}
int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
{
QemuOpts *protocol_opts;
BlockDriver *drv;
QDict *qdict;
int ret;
GLOBAL_STATE_CODE();
drv = bdrv_find_protocol(filename, true, errp);
if (drv == NULL) {
return -ENOENT;
}
if (!drv->create_opts) {
error_setg(errp, "Driver '%s' does not support image creation",
drv->format_name);
return -ENOTSUP;
}
/*
* 'opts' contains a QemuOptsList with a combination of format and protocol
* default values.
*
* The format properly removes its options, but the default values remain
* in 'opts->list'. So if the protocol has options with the same name
* (e.g. rbd has 'cluster_size' as qcow2), it will see the default values
* of the format, since for overlapping options, the format wins.
*
* To avoid this issue, let's convert the QemuOpts to a QDict, so that we
* take only the options that were explicitly set, and then convert it back
* to QemuOpts using the create_opts of the protocol. The new QemuOpts will
* then contain only the protocol defaults.
*/
qdict = qemu_opts_to_qdict(opts, NULL);
protocol_opts = qemu_opts_from_qdict(drv->create_opts, qdict, errp);
if (protocol_opts == NULL) {
ret = -EINVAL;
goto out;
}
ret = bdrv_create(drv, filename, protocol_opts, errp);
out:
qemu_opts_del(protocol_opts);
qobject_unref(qdict);
return ret;
}
int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp)
{
Error *local_err = NULL;
int ret;
IO_CODE();
assert(bs != NULL);
if (!bs->drv) {
error_setg(errp, "Block node '%s' is not opened", bs->filename);
return -ENOMEDIUM;
}
if (!bs->drv->bdrv_co_delete_file) {
error_setg(errp, "Driver '%s' does not support image deletion",
bs->drv->format_name);
return -ENOTSUP;
}
ret = bs->drv->bdrv_co_delete_file(bs, &local_err);
if (ret < 0) {
error_propagate(errp, local_err);
}
return ret;
}
void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs)
{
Error *local_err = NULL;
int ret;
IO_CODE();
if (!bs) {
return;
}
ret = bdrv_co_delete_file(bs, &local_err);
/*
* ENOTSUP will happen if the block driver doesn't support
* the 'bdrv_co_delete_file' interface. This is a predictable
* scenario and shouldn't be reported back to the user.
*/
if (ret == -ENOTSUP) {
error_free(local_err);
} else if (ret < 0) {
error_report_err(local_err);
}
}
/**
* Try to get @bs's logical and physical block size.
* On success, store them in @bsz struct and return 0.
* On failure return -errno.
* @bs must not be empty.
*/
int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
{
BlockDriver *drv = bs->drv;
BlockDriverState *filtered = bdrv_filter_bs(bs);
GLOBAL_STATE_CODE();
if (drv && drv->bdrv_probe_blocksizes) {
return drv->bdrv_probe_blocksizes(bs, bsz);
} else if (filtered) {
return bdrv_probe_blocksizes(filtered, bsz);
}
return -ENOTSUP;
}
/**
* Try to get @bs's geometry (cyls, heads, sectors).
* On success, store them in @geo struct and return 0.
* On failure return -errno.
* @bs must not be empty.
*/
int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo)
{
BlockDriver *drv = bs->drv;
BlockDriverState *filtered = bdrv_filter_bs(bs);
GLOBAL_STATE_CODE();
if (drv && drv->bdrv_probe_geometry) {
return drv->bdrv_probe_geometry(bs, geo);
} else if (filtered) {
return bdrv_probe_geometry(filtered, geo);
}
return -ENOTSUP;
}
/*
* Create a uniquely-named empty temporary file.
* Return 0 upon success, otherwise a negative errno value.
*/
int get_tmp_filename(char *filename, int size)
{
#ifdef _WIN32
char temp_dir[MAX_PATH];
/* GetTempFileName requires that its output buffer (4th param)
have length MAX_PATH or greater. */
assert(size >= MAX_PATH);
return (GetTempPath(MAX_PATH, temp_dir)
&& GetTempFileName(temp_dir, "qem", 0, filename)
? 0 : -GetLastError());
#else
int fd;
const char *tmpdir;
tmpdir = getenv("TMPDIR");
if (!tmpdir) {
tmpdir = "/var/tmp";
}
if (snprintf(filename, size, "%s/vl.XXXXXX", tmpdir) >= size) {
return -EOVERFLOW;
}
fd = mkstemp(filename);
if (fd < 0) {
return -errno;
}
if (close(fd) != 0) {
int saved_errno = errno; /* unlink() may overwrite errno */
unlink(filename);
return -saved_errno;
}
return 0;
#endif
}
/*
* Detect host devices. By convention, /dev/cdrom[N] is always
* recognized as a host CDROM.
*/
static BlockDriver *find_hdev_driver(const char *filename)
{
int score_max = 0, score;
BlockDriver *drv = NULL, *d;
GLOBAL_STATE_CODE();
QLIST_FOREACH(d, &bdrv_drivers, list) {
if (d->bdrv_probe_device) {
score = d->bdrv_probe_device(filename);
if (score > score_max) {
score_max = score;
drv = d;
}
}
}
return drv;
}
static BlockDriver *bdrv_do_find_protocol(const char *protocol)
{
BlockDriver *drv1;
GLOBAL_STATE_CODE();
QLIST_FOREACH(drv1, &bdrv_drivers, list) {
if (drv1->protocol_name && !strcmp(drv1->protocol_name, protocol)) {
return drv1;
}
}
return NULL;
}
BlockDriver *bdrv_find_protocol(const char *filename,
bool allow_protocol_prefix,
Error **errp)
{
BlockDriver *drv1;
char protocol[128];
int len;
const char *p;
int i;
GLOBAL_STATE_CODE();
/* TODO Drivers without bdrv_file_open must be specified explicitly */
/*
* XXX(hch): we really should not let host device detection
* override an explicit protocol specification, but moving this
* later breaks access to device names with colons in them.
* Thanks to the brain-dead persistent naming schemes on udev-
* based Linux systems those actually are quite common.
*/
drv1 = find_hdev_driver(filename);
if (drv1) {
return drv1;
}
if (!path_has_protocol(filename) || !allow_protocol_prefix) {
return &bdrv_file;
}
p = strchr(filename, ':');
assert(p != NULL);
len = p - filename;
if (len > sizeof(protocol) - 1) {
len = sizeof(protocol) - 1;
}
memcpy(protocol, filename, len);
protocol[len] = '\0';
drv1 = bdrv_do_find_protocol(protocol);
if (drv1) {
return drv1;
}
for (i = 0; i < (int)ARRAY_SIZE(block_driver_modules); ++i) {
if (block_driver_modules[i].protocol_name &&
!strcmp(block_driver_modules[i].protocol_name, protocol)) {
block_module_load_one(block_driver_modules[i].library_name);
break;
}
}
drv1 = bdrv_do_find_protocol(protocol);
if (!drv1) {
error_setg(errp, "Unknown protocol '%s'", protocol);
}
return drv1;
}
/*
* Guess image format by probing its contents.
* This is not a good idea when your image is raw (CVE-2008-2004), but
* we do it anyway for backward compatibility.
*
* @buf contains the image's first @buf_size bytes.
* @buf_size is the buffer size in bytes (generally BLOCK_PROBE_BUF_SIZE,
* but can be smaller if the image file is smaller)
* @filename is its filename.
*
* For all block drivers, call the bdrv_probe() method to get its
* probing score.
* Return the first block driver with the highest probing score.
*/
BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size,
const char *filename)
{
int score_max = 0, score;
BlockDriver *drv = NULL, *d;
IO_CODE();
QLIST_FOREACH(d, &bdrv_drivers, list) {
if (d->bdrv_probe) {
score = d->bdrv_probe(buf, buf_size, filename);
if (score > score_max) {
score_max = score;
drv = d;
}
}
}
return drv;
}
static int find_image_format(BlockBackend *file, const char *filename,
BlockDriver **pdrv, Error **errp)
{
BlockDriver *drv;
uint8_t buf[BLOCK_PROBE_BUF_SIZE];
int ret = 0;
GLOBAL_STATE_CODE();
/* Return the raw BlockDriver * for scsi-generic devices or empty drives */
if (blk_is_sg(file) || !blk_is_inserted(file) || blk_getlength(file) == 0) {
*pdrv = &bdrv_raw;
return ret;
}
ret = blk_pread(file, 0, sizeof(buf), buf, 0);
if (ret < 0) {
error_setg_errno(errp, -ret, "Could not read image for determining its "
"format");
*pdrv = NULL;
return ret;
}
drv = bdrv_probe_all(buf, sizeof(buf), filename);
if (!drv) {
error_setg(errp, "Could not determine image format: No compatible "
"driver found");
*pdrv = NULL;
return -ENOENT;
}
*pdrv = drv;
return 0;
}
/**
* Set the current 'total_sectors' value
* Return 0 on success, -errno on error.
*/
int refresh_total_sectors(BlockDriverState *bs, int64_t hint)
{
BlockDriver *drv = bs->drv;
IO_CODE();
if (!drv) {
return -ENOMEDIUM;
}
/* Do not attempt drv->bdrv_getlength() on scsi-generic devices */
if (bdrv_is_sg(bs)) {
return 0;
}
/* query actual device if possible, otherwise just trust the hint */
if (drv->bdrv_getlength) {
int64_t length = drv->bdrv_getlength(bs);
if (length < 0) {
return length;
}
hint = DIV_ROUND_UP(length, BDRV_SECTOR_SIZE);
}
bs->total_sectors = hint;
if (bs->total_sectors * BDRV_SECTOR_SIZE > BDRV_MAX_LENGTH) {
return -EFBIG;
}
return 0;
}
/**
* Combines a QDict of new block driver @options with any missing options taken
* from @old_options, so that leaving out an option defaults to its old value.
*/
static void bdrv_join_options(BlockDriverState *bs, QDict *options,
QDict *old_options)
{
GLOBAL_STATE_CODE();
if (bs->drv && bs->drv->bdrv_join_options) {
bs->drv->bdrv_join_options(options, old_options);
} else {
qdict_join(options, old_options, false);
}
}
static BlockdevDetectZeroesOptions bdrv_parse_detect_zeroes(QemuOpts *opts,
int open_flags,
Error **errp)
{
Error *local_err = NULL;
char *value = qemu_opt_get_del(opts, "detect-zeroes");
BlockdevDetectZeroesOptions detect_zeroes =
qapi_enum_parse(&BlockdevDetectZeroesOptions_lookup, value,
BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, &local_err);
GLOBAL_STATE_CODE();
g_free(value);
if (local_err) {
error_propagate(errp, local_err);
return detect_zeroes;
}
if (detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP &&
!(open_flags & BDRV_O_UNMAP))
{
error_setg(errp, "setting detect-zeroes to unmap is not allowed "
"without setting discard operation to unmap");
}
return detect_zeroes;
}
/**
* Set open flags for aio engine
*
* Return 0 on success, -1 if the engine specified is invalid
*/
int bdrv_parse_aio(const char *mode, int *flags)
{
if (!strcmp(mode, "threads")) {
/* do nothing, default */
} else if (!strcmp(mode, "native")) {
*flags |= BDRV_O_NATIVE_AIO;
#ifdef CONFIG_LINUX_IO_URING
} else if (!strcmp(mode, "io_uring")) {
*flags |= BDRV_O_IO_URING;
#endif
} else {
return -1;
}
return 0;
}
/**
* Set open flags for a given discard mode
*
* Return 0 on success, -1 if the discard mode was invalid.
*/
int bdrv_parse_discard_flags(const char *mode, int *flags)
{
*flags &= ~BDRV_O_UNMAP;
if (!strcmp(mode, "off") || !strcmp(mode, "ignore")) {
/* do nothing */
} else if (!strcmp(mode, "on") || !strcmp(mode, "unmap")) {
*flags |= BDRV_O_UNMAP;
} else {
return -1;
}
return 0;
}
/**
* Set open flags for a given cache mode
*
* Return 0 on success, -1 if the cache mode was invalid.
*/
int bdrv_parse_cache_mode(const char *mode, int *flags, bool *writethrough)
{
*flags &= ~BDRV_O_CACHE_MASK;
if (!strcmp(mode, "off") || !strcmp(mode, "none")) {
*writethrough = false;
*flags |= BDRV_O_NOCACHE;
} else if (!strcmp(mode, "directsync")) {
*writethrough = true;
*flags |= BDRV_O_NOCACHE;
} else if (!strcmp(mode, "writeback")) {
*writethrough = false;
} else if (!strcmp(mode, "unsafe")) {
*writethrough = false;
*flags |= BDRV_O_NO_FLUSH;
} else if (!strcmp(mode, "writethrough")) {
*writethrough = true;
} else {
return -1;
}
return 0;
}
static char *bdrv_child_get_parent_desc(BdrvChild *c)
{
BlockDriverState *parent = c->opaque;
return g_strdup_printf("node '%s'", bdrv_get_node_name(parent));
}
static void bdrv_child_cb_drained_begin(BdrvChild *child)
{
BlockDriverState *bs = child->opaque;
bdrv_do_drained_begin_quiesce(bs, NULL, false);
}
static bool bdrv_child_cb_drained_poll(BdrvChild *child)
{
BlockDriverState *bs = child->opaque;
return bdrv_drain_poll(bs, false, NULL, false);
}
static void bdrv_child_cb_drained_end(BdrvChild *child,
int *drained_end_counter)
{
BlockDriverState *bs = child->opaque;
bdrv_drained_end_no_poll(bs, drained_end_counter);
}
static int bdrv_child_cb_inactivate(BdrvChild *child)
{
BlockDriverState *bs = child->opaque;
GLOBAL_STATE_CODE();
assert(bs->open_flags & BDRV_O_INACTIVE);
return 0;
}
static bool bdrv_child_cb_can_set_aio_ctx(BdrvChild *child, AioContext *ctx,
GSList **ignore, Error **errp)
{
BlockDriverState *bs = child->opaque;
return bdrv_can_set_aio_context(bs, ctx, ignore, errp);
}
static void bdrv_child_cb_set_aio_ctx(BdrvChild *child, AioContext *ctx,
GSList **ignore)
{
BlockDriverState *bs = child->opaque;
bdrv_set_aio_context_ignore(bs, ctx, ignore);
}
/*
* Returns the options and flags that a temporary snapshot should get, based on
* the originally requested flags (the originally requested image will have
* flags like a backing file)
*/
static void bdrv_temp_snapshot_options(int *child_flags, QDict *child_options,
int parent_flags, QDict *parent_options)
{
GLOBAL_STATE_CODE();
*child_flags = (parent_flags & ~BDRV_O_SNAPSHOT) | BDRV_O_TEMPORARY;
/* For temporary files, unconditional cache=unsafe is fine */
qdict_set_default_str(child_options, BDRV_OPT_CACHE_DIRECT, "off");
qdict_set_default_str(child_options, BDRV_OPT_CACHE_NO_FLUSH, "on");
/* Copy the read-only and discard options from the parent */
qdict_copy_default(child_options, parent_options, BDRV_OPT_READ_ONLY);
qdict_copy_default(child_options, parent_options, BDRV_OPT_DISCARD);
/* aio=native doesn't work for cache.direct=off, so disable it for the
* temporary snapshot */
*child_flags &= ~BDRV_O_NATIVE_AIO;
}
static void bdrv_backing_attach(BdrvChild *c)
{
BlockDriverState *parent = c->opaque;
BlockDriverState *backing_hd = c->bs;
GLOBAL_STATE_CODE();
assert(!parent->backing_blocker);
error_setg(&parent->backing_blocker,
"node is used as backing hd of '%s'",
bdrv_get_device_or_node_name(parent));
bdrv_refresh_filename(backing_hd);
parent->open_flags &= ~BDRV_O_NO_BACKING;
bdrv_op_block_all(backing_hd, parent->backing_blocker);
/* Otherwise we won't be able to commit or stream */
bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
parent->backing_blocker);
bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_STREAM,
parent->backing_blocker);
/*
* We do backup in 3 ways:
* 1. drive backup
* The target bs is newly opened, and the source is the top BDS
* 2. blockdev backup
* Both the source and the target are top BDSes.
* 3. internal backup (used for block replication)
* Both the source and the target are backing files
*
* In cases 1 and 2, neither the source nor the target is a backing file.
* In case 3, we will block the top BDS, so there is only one block job
* for the top BDS and its backing chain.
*/
bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
parent->backing_blocker);
bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
parent->backing_blocker);
}
static void bdrv_backing_detach(BdrvChild *c)
{
BlockDriverState *parent = c->opaque;
GLOBAL_STATE_CODE();
assert(parent->backing_blocker);
bdrv_op_unblock_all(c->bs, parent->backing_blocker);
error_free(parent->backing_blocker);
parent->backing_blocker = NULL;
}
static int bdrv_backing_update_filename(BdrvChild *c, BlockDriverState *base,
const char *filename, Error **errp)
{
BlockDriverState *parent = c->opaque;
bool read_only = bdrv_is_read_only(parent);
int ret;
GLOBAL_STATE_CODE();
if (read_only) {
ret = bdrv_reopen_set_read_only(parent, false, errp);
if (ret < 0) {
return ret;
}
}
ret = bdrv_change_backing_file(parent, filename,
base->drv ? base->drv->format_name : "",
false);
if (ret < 0) {
error_setg_errno(errp, -ret, "Could not update backing file link");
}
if (read_only) {
bdrv_reopen_set_read_only(parent, true, NULL);
}
return ret;
}
/*
* Returns the options and flags that a generic child of a BDS should
* get, based on the given options and flags for the parent BDS.
*/
static void bdrv_inherited_options(BdrvChildRole role, bool parent_is_format,
int *child_flags, QDict *child_options,
int parent_flags, QDict *parent_options)
{
int flags = parent_flags;
GLOBAL_STATE_CODE();
/*
* First, decide whether to set, clear, or leave BDRV_O_PROTOCOL.
* Generally, the question to answer is: Should this child be
* format-probed by default?
*/
/*
* Pure and non-filtered data children of non-format nodes should
* be probed by default (even when the node itself has BDRV_O_PROTOCOL
* set). This only affects a very limited set of drivers (namely
* quorum and blkverify when this comment was written).
* Force-clear BDRV_O_PROTOCOL then.
*/
if (!parent_is_format &&
(role & BDRV_CHILD_DATA) &&
!(role & (BDRV_CHILD_METADATA | BDRV_CHILD_FILTERED)))
{
flags &= ~BDRV_O_PROTOCOL;
}
/*
* All children of format nodes (except for COW children) and all
* metadata children in general should never be format-probed.
* Force-set BDRV_O_PROTOCOL then.
*/
if ((parent_is_format && !(role & BDRV_CHILD_COW)) ||
(role & BDRV_CHILD_METADATA))
{
flags |= BDRV_O_PROTOCOL;
}
/*
* If the cache mode isn't explicitly set, inherit direct and no-flush from
* the parent.
*/
qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_DIRECT);
qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_NO_FLUSH);
qdict_copy_default(child_options, parent_options, BDRV_OPT_FORCE_SHARE);
if (role & BDRV_CHILD_COW) {
/* backing files are opened read-only by default */
qdict_set_default_str(child_options, BDRV_OPT_READ_ONLY, "on");
qdict_set_default_str(child_options, BDRV_OPT_AUTO_READ_ONLY, "off");
} else {
/* Inherit the read-only option from the parent if it's not set */
qdict_copy_default(child_options, parent_options, BDRV_OPT_READ_ONLY);
qdict_copy_default(child_options, parent_options,
BDRV_OPT_AUTO_READ_ONLY);
}
/*
* bdrv_co_pdiscard() respects the unmap policy of the parent, so we
* can enable it by default on lower layers regardless of the
* parent option.
*/
qdict_set_default_str(child_options, BDRV_OPT_DISCARD, "unmap");
/* Clear flags that only apply to the top layer */
flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING | BDRV_O_COPY_ON_READ);
if (role & BDRV_CHILD_METADATA) {
flags &= ~BDRV_O_NO_IO;
}
if (role & BDRV_CHILD_COW) {
flags &= ~BDRV_O_TEMPORARY;
}
*child_flags = flags;
}
static void bdrv_child_cb_attach(BdrvChild *child)
{
BlockDriverState *bs = child->opaque;
assert_bdrv_graph_writable(bs);
QLIST_INSERT_HEAD(&bs->children, child, next);
if (child->role & BDRV_CHILD_COW) {
bdrv_backing_attach(child);
}
bdrv_apply_subtree_drain(child, bs);
}
static void bdrv_child_cb_detach(BdrvChild *child)
{
BlockDriverState *bs = child->opaque;
if (child->role & BDRV_CHILD_COW) {
bdrv_backing_detach(child);
}
bdrv_unapply_subtree_drain(child, bs);
assert_bdrv_graph_writable(bs);
QLIST_REMOVE(child, next);
}
static int bdrv_child_cb_update_filename(BdrvChild *c, BlockDriverState *base,
const char *filename, Error **errp)
{
if (c->role & BDRV_CHILD_COW) {
return bdrv_backing_update_filename(c, base, filename, errp);
}
return 0;
}
AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c)
{
BlockDriverState *bs = c->opaque;
IO_CODE();
return bdrv_get_aio_context(bs);
}
const BdrvChildClass child_of_bds = {
.parent_is_bds = true,
.get_parent_desc = bdrv_child_get_parent_desc,
.inherit_options = bdrv_inherited_options,
.drained_begin = bdrv_child_cb_drained_begin,
.drained_poll = bdrv_child_cb_drained_poll,
.drained_end = bdrv_child_cb_drained_end,
.attach = bdrv_child_cb_attach,
.detach = bdrv_child_cb_detach,
.inactivate = bdrv_child_cb_inactivate,
.can_set_aio_ctx = bdrv_child_cb_can_set_aio_ctx,
.set_aio_ctx = bdrv_child_cb_set_aio_ctx,
.update_filename = bdrv_child_cb_update_filename,
.get_parent_aio_context = child_of_bds_get_parent_aio_context,
};
AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c)
{
GLOBAL_STATE_CODE();
return c->klass->get_parent_aio_context(c);
}
static int bdrv_open_flags(BlockDriverState *bs, int flags)
{
int open_flags = flags;
GLOBAL_STATE_CODE();
/*
* Clear flags that are internal to the block layer before opening the
* image.
*/
open_flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING | BDRV_O_PROTOCOL);
return open_flags;
}
static void update_flags_from_options(int *flags, QemuOpts *opts)
{
GLOBAL_STATE_CODE();
*flags &= ~(BDRV_O_CACHE_MASK | BDRV_O_RDWR | BDRV_O_AUTO_RDONLY);
if (qemu_opt_get_bool_del(opts, BDRV_OPT_CACHE_NO_FLUSH, false)) {
*flags |= BDRV_O_NO_FLUSH;
}
if (qemu_opt_get_bool_del(opts, BDRV_OPT_CACHE_DIRECT, false)) {
*flags |= BDRV_O_NOCACHE;
}
if (!qemu_opt_get_bool_del(opts, BDRV_OPT_READ_ONLY, false)) {
*flags |= BDRV_O_RDWR;
}
if (qemu_opt_get_bool_del(opts, BDRV_OPT_AUTO_READ_ONLY, false)) {
*flags |= BDRV_O_AUTO_RDONLY;
}
}
static void update_options_from_flags(QDict *options, int flags)
{
GLOBAL_STATE_CODE();
if (!qdict_haskey(options, BDRV_OPT_CACHE_DIRECT)) {
qdict_put_bool(options, BDRV_OPT_CACHE_DIRECT, flags & BDRV_O_NOCACHE);
}
if (!qdict_haskey(options, BDRV_OPT_CACHE_NO_FLUSH)) {
qdict_put_bool(options, BDRV_OPT_CACHE_NO_FLUSH,
flags & BDRV_O_NO_FLUSH);
}
if (!qdict_haskey(options, BDRV_OPT_READ_ONLY)) {
qdict_put_bool(options, BDRV_OPT_READ_ONLY, !(flags & BDRV_O_RDWR));
}
if (!qdict_haskey(options, BDRV_OPT_AUTO_READ_ONLY)) {
qdict_put_bool(options, BDRV_OPT_AUTO_READ_ONLY,
flags & BDRV_O_AUTO_RDONLY);
}
}
static void bdrv_assign_node_name(BlockDriverState *bs,
const char *node_name,
Error **errp)
{
char *gen_node_name = NULL;
GLOBAL_STATE_CODE();
if (!node_name) {
node_name = gen_node_name = id_generate(ID_BLOCK);
} else if (!id_wellformed(node_name)) {
/*
* Check for empty string or invalid characters, but not if it is
* generated (generated names use characters not available to the user)
*/
error_setg(errp, "Invalid node-name: '%s'", node_name);
return;
}
/* takes care of avoiding namespace collisions */
if (blk_by_name(node_name)) {
error_setg(errp, "node-name=%s is conflicting with a device id",
node_name);
goto out;
}
/* takes care of avoiding duplicate node names */
if (bdrv_find_node(node_name)) {
error_setg(errp, "Duplicate nodes with node-name='%s'", node_name);
goto out;
}
/* Make sure that the node name isn't truncated */
if (strlen(node_name) >= sizeof(bs->node_name)) {
error_setg(errp, "Node name too long");
goto out;
}
/* copy node name into the bs and insert it into the graph list */
pstrcpy(bs->node_name, sizeof(bs->node_name), node_name);
QTAILQ_INSERT_TAIL(&graph_bdrv_states, bs, node_list);
out:
g_free(gen_node_name);
}
static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
const char *node_name, QDict *options,
int open_flags, Error **errp)
{
Error *local_err = NULL;
int i, ret;
GLOBAL_STATE_CODE();
bdrv_assign_node_name(bs, node_name, &local_err);
if (local_err) {
error_propagate(errp, local_err);
return -EINVAL;
}
bs->drv = drv;
bs->opaque = g_malloc0(drv->instance_size);
if (drv->bdrv_file_open) {
assert(!drv->bdrv_needs_filename || bs->filename[0]);
ret = drv->bdrv_file_open(bs, options, open_flags, &local_err);
} else if (drv->bdrv_open) {
ret = drv->bdrv_open(bs, options, open_flags, &local_err);
} else {
ret = 0;
}
if (ret < 0) {
if (local_err) {
error_propagate(errp, local_err);
} else if (bs->filename[0]) {
error_setg_errno(errp, -ret, "Could not open '%s'", bs->filename);
} else {
error_setg_errno(errp, -ret, "Could not open image");
}
goto open_failed;
}
ret = refresh_total_sectors(bs, bs->total_sectors);
if (ret < 0) {
error_setg_errno(errp, -ret, "Could not refresh total sector count");
return ret;
}
bdrv_refresh_limits(bs, NULL, &local_err);
if (local_err) {
error_propagate(errp, local_err);
return -EINVAL;
}
assert(bdrv_opt_mem_align(bs) != 0);
assert(bdrv_min_mem_align(bs) != 0);
assert(is_power_of_2(bs->bl.request_alignment));
for (i = 0; i < bs->quiesce_counter; i++) {
if (drv->bdrv_co_drain_begin) {
drv->bdrv_co_drain_begin(bs);
}
}
return 0;
open_failed:
bs->drv = NULL;
if (bs->file != NULL) {
bdrv_unref_child(bs, bs->file);
bs->file = NULL;
}
g_free(bs->opaque);
bs->opaque = NULL;
return ret;
}
/*
* Create and open a block node.
*
* @options is a QDict of options to pass to the block drivers, or NULL for an
* empty set of options. The reference to the QDict belongs to the block layer
* after the call (even on failure), so if the caller intends to reuse the
* dictionary, it needs to use qobject_ref() before calling this function.
*/
BlockDriverState *bdrv_new_open_driver_opts(BlockDriver *drv,
const char *node_name,
QDict *options, int flags,
Error **errp)
{
BlockDriverState *bs;
int ret;
GLOBAL_STATE_CODE();
bs = bdrv_new();
bs->open_flags = flags;
bs->options = options ?: qdict_new();
bs->explicit_options = qdict_clone_shallow(bs->options);
bs->opaque = NULL;
update_options_from_flags(bs->options, flags);
ret = bdrv_open_driver(bs, drv, node_name, bs->options, flags, errp);
if (ret < 0) {
qobject_unref(bs->explicit_options);
bs->explicit_options = NULL;
qobject_unref(bs->options);
bs->options = NULL;
bdrv_unref(bs);
return NULL;
}
return bs;
}
/* Create and open a block node. */
BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, const char *node_name,
int flags, Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_new_open_driver_opts(drv, node_name, NULL, flags, errp);
}
QemuOptsList bdrv_runtime_opts = {
.name = "bdrv_common",
.head = QTAILQ_HEAD_INITIALIZER(bdrv_runtime_opts.head),
.desc = {
{
.name = "node-name",
.type = QEMU_OPT_STRING,
.help = "Node name of the block device node",
},
{
.name = "driver",
.type = QEMU_OPT_STRING,
.help = "Block driver to use for the node",
},
{
.name = BDRV_OPT_CACHE_DIRECT,
.type = QEMU_OPT_BOOL,
.help = "Bypass software writeback cache on the host",
},
{
.name = BDRV_OPT_CACHE_NO_FLUSH,
.type = QEMU_OPT_BOOL,
.help = "Ignore flush requests",
},
{
.name = BDRV_OPT_READ_ONLY,
.type = QEMU_OPT_BOOL,
.help = "Node is opened in read-only mode",
},
{
.name = BDRV_OPT_AUTO_READ_ONLY,
.type = QEMU_OPT_BOOL,
.help = "Node can become read-only if opening read-write fails",
},
{
.name = "detect-zeroes",
.type = QEMU_OPT_STRING,
.help = "try to optimize zero writes (off, on, unmap)",
},
{
.name = BDRV_OPT_DISCARD,
.type = QEMU_OPT_STRING,
.help = "discard operation (ignore/off, unmap/on)",
},
{
.name = BDRV_OPT_FORCE_SHARE,
.type = QEMU_OPT_BOOL,
.help = "always accept other writers (default: off)",
},
{ /* end of list */ }
},
};
QemuOptsList bdrv_create_opts_simple = {
.name = "simple-create-opts",
.head = QTAILQ_HEAD_INITIALIZER(bdrv_create_opts_simple.head),
.desc = {
{
.name = BLOCK_OPT_SIZE,
.type = QEMU_OPT_SIZE,
.help = "Virtual disk size"
},
{
.name = BLOCK_OPT_PREALLOC,
.type = QEMU_OPT_STRING,
.help = "Preallocation mode (allowed values: off)"
},
{ /* end of list */ }
}
};
/*
* Common part for opening disk images and files
*
* Removes all processed options from *options.
*/
static int bdrv_open_common(BlockDriverState *bs, BlockBackend *file,
QDict *options, Error **errp)
{
int ret, open_flags;
const char *filename;
const char *driver_name = NULL;
const char *node_name = NULL;
const char *discard;
QemuOpts *opts;
BlockDriver *drv;
Error *local_err = NULL;
bool ro;
assert(bs->file == NULL);
assert(options != NULL && bs->options != options);
GLOBAL_STATE_CODE();
opts = qemu_opts_create(&bdrv_runtime_opts, NULL, 0, &error_abort);
if (!qemu_opts_absorb_qdict(opts, options, errp)) {
ret = -EINVAL;
goto fail_opts;
}
update_flags_from_options(&bs->open_flags, opts);
driver_name = qemu_opt_get(opts, "driver");
drv = bdrv_find_format(driver_name);
assert(drv != NULL);
bs->force_share = qemu_opt_get_bool(opts, BDRV_OPT_FORCE_SHARE, false);
if (bs->force_share && (bs->open_flags & BDRV_O_RDWR)) {
error_setg(errp,
BDRV_OPT_FORCE_SHARE
"=on can only be used with read-only images");
ret = -EINVAL;
goto fail_opts;
}
if (file != NULL) {
bdrv_refresh_filename(blk_bs(file));
filename = blk_bs(file)->filename;
} else {
/*
* Caution: while qdict_get_try_str() is fine, getting
* non-string types would require more care. When @options
* come from -blockdev or blockdev_add, its members are typed
* according to the QAPI schema, but when they come from
* -drive, they're all QString.
*/
filename = qdict_get_try_str(options, "filename");
}
if (drv->bdrv_needs_filename && (!filename || !filename[0])) {
error_setg(errp, "The '%s' block driver requires a file name",
drv->format_name);
ret = -EINVAL;
goto fail_opts;
}
trace_bdrv_open_common(bs, filename ?: "", bs->open_flags,
drv->format_name);
ro = bdrv_is_read_only(bs);
if (use_bdrv_whitelist && !bdrv_is_whitelisted(drv, ro)) {
if (!ro && bdrv_is_whitelisted(drv, true)) {
ret = bdrv_apply_auto_read_only(bs, NULL, NULL);
} else {
ret = -ENOTSUP;
}
if (ret < 0) {
error_setg(errp,
!ro && bdrv_is_whitelisted(drv, true)
? "Driver '%s' can only be used for read-only devices"
: "Driver '%s' is not whitelisted",
drv->format_name);
goto fail_opts;
}
}
/* bdrv_new() and bdrv_close() make it so */
assert(qatomic_read(&bs->copy_on_read) == 0);
if (bs->open_flags & BDRV_O_COPY_ON_READ) {
if (!ro) {
bdrv_enable_copy_on_read(bs);
} else {
error_setg(errp, "Can't use copy-on-read on read-only device");
ret = -EINVAL;
goto fail_opts;
}
}
discard = qemu_opt_get(opts, BDRV_OPT_DISCARD);
if (discard != NULL) {
if (bdrv_parse_discard_flags(discard, &bs->open_flags) != 0) {
error_setg(errp, "Invalid discard option");
ret = -EINVAL;
goto fail_opts;
}
}
bs->detect_zeroes =
bdrv_parse_detect_zeroes(opts, bs->open_flags, &local_err);
if (local_err) {
error_propagate(errp, local_err);
ret = -EINVAL;
goto fail_opts;
}
if (filename != NULL) {
pstrcpy(bs->filename, sizeof(bs->filename), filename);
} else {
bs->filename[0] = '\0';
}
pstrcpy(bs->exact_filename, sizeof(bs->exact_filename), bs->filename);
/* Open the image, either directly or using a protocol */
open_flags = bdrv_open_flags(bs, bs->open_flags);
node_name = qemu_opt_get(opts, "node-name");
assert(!drv->bdrv_file_open || file == NULL);
ret = bdrv_open_driver(bs, drv, node_name, options, open_flags, errp);
if (ret < 0) {
goto fail_opts;
}
qemu_opts_del(opts);
return 0;
fail_opts:
qemu_opts_del(opts);
return ret;
}
static QDict *parse_json_filename(const char *filename, Error **errp)
{
QObject *options_obj;
QDict *options;
int ret;
GLOBAL_STATE_CODE();
ret = strstart(filename, "json:", &filename);
assert(ret);
options_obj = qobject_from_json(filename, errp);
if (!options_obj) {
error_prepend(errp, "Could not parse the JSON options: ");
return NULL;
}
options = qobject_to(QDict, options_obj);
if (!options) {
qobject_unref(options_obj);
error_setg(errp, "Invalid JSON object given");
return NULL;
}
qdict_flatten(options);
return options;
}
static void parse_json_protocol(QDict *options, const char **pfilename,
Error **errp)
{
QDict *json_options;
Error *local_err = NULL;
GLOBAL_STATE_CODE();
/* Parse json: pseudo-protocol */
if (!*pfilename || !g_str_has_prefix(*pfilename, "json:")) {
return;
}
json_options = parse_json_filename(*pfilename, &local_err);
if (local_err) {
error_propagate(errp, local_err);
return;
}
/* Options given in the filename have lower priority than options
* specified directly */
qdict_join(options, json_options, false);
qobject_unref(json_options);
*pfilename = NULL;
}
/*
* Fills in default options for opening images and converts the legacy
* filename/flags pair to option QDict entries.
* The BDRV_O_PROTOCOL flag in *flags will be set or cleared accordingly if a
* block driver has been specified explicitly.
*/
static int bdrv_fill_options(QDict **options, const char *filename,
int *flags, Error **errp)
{
const char *drvname;
bool protocol = *flags & BDRV_O_PROTOCOL;
bool parse_filename = false;
BlockDriver *drv = NULL;
Error *local_err = NULL;
GLOBAL_STATE_CODE();
/*
* Caution: while qdict_get_try_str() is fine, getting non-string
* types would require more care. When @options come from
* -blockdev or blockdev_add, its members are typed according to
* the QAPI schema, but when they come from -drive, they're all
* QString.
*/
drvname = qdict_get_try_str(*options, "driver");
if (drvname) {
drv = bdrv_find_format(drvname);
if (!drv) {
error_setg(errp, "Unknown driver '%s'", drvname);
return -ENOENT;
}
/* If the user has explicitly specified the driver, this choice should
* override the BDRV_O_PROTOCOL flag */
protocol = drv->bdrv_file_open;
}
if (protocol) {
*flags |= BDRV_O_PROTOCOL;
} else {
*flags &= ~BDRV_O_PROTOCOL;
}
/* Translate cache options from flags into options */
update_options_from_flags(*options, *flags);
/* Fetch the file name from the options QDict if necessary */
if (protocol && filename) {
if (!qdict_haskey(*options, "filename")) {
qdict_put_str(*options, "filename", filename);
parse_filename = true;
} else {
error_setg(errp, "Can't specify 'file' and 'filename' options at "
"the same time");
return -EINVAL;
}
}
/* Find the right block driver */
/* See cautionary note on accessing @options above */
filename = qdict_get_try_str(*options, "filename");
if (!drvname && protocol) {
if (filename) {
drv = bdrv_find_protocol(filename, parse_filename, errp);
if (!drv) {
return -EINVAL;
}
drvname = drv->format_name;
qdict_put_str(*options, "driver", drvname);
} else {
error_setg(errp, "Must specify either driver or file");
return -EINVAL;
}
}
assert(drv || !protocol);
/* Driver-specific filename parsing */
if (drv && drv->bdrv_parse_filename && parse_filename) {
drv->bdrv_parse_filename(filename, *options, &local_err);
if (local_err) {
error_propagate(errp, local_err);
return -EINVAL;
}
if (!drv->bdrv_needs_filename) {
qdict_del(*options, "filename");
}
}
return 0;
}
typedef struct BlockReopenQueueEntry {
bool prepared;
bool perms_checked;
BDRVReopenState state;
QTAILQ_ENTRY(BlockReopenQueueEntry) entry;
} BlockReopenQueueEntry;
/*
* Return the flags that @bs will have after the reopens in @q have
* successfully completed. If @q is NULL (or @bs is not contained in @q),
* return the current flags.
*/
static int bdrv_reopen_get_flags(BlockReopenQueue *q, BlockDriverState *bs)
{
BlockReopenQueueEntry *entry;
if (q != NULL) {
QTAILQ_FOREACH(entry, q, entry) {
if (entry->state.bs == bs) {
return entry->state.flags;
}
}
}
return bs->open_flags;
}
/* Returns whether the image file can be written to after the reopen queue @q
* has been successfully applied, or right now if @q is NULL. */
static bool bdrv_is_writable_after_reopen(BlockDriverState *bs,
BlockReopenQueue *q)
{
int flags = bdrv_reopen_get_flags(q, bs);
return (flags & (BDRV_O_RDWR | BDRV_O_INACTIVE)) == BDRV_O_RDWR;
}
/*
* Return whether the BDS can be written to. This is not necessarily
* the same as !bdrv_is_read_only(bs), as inactivated images may not
* be written to but do not count as read-only images.
*/
bool bdrv_is_writable(BlockDriverState *bs)
{
IO_CODE();
return bdrv_is_writable_after_reopen(bs, NULL);
}
static char *bdrv_child_user_desc(BdrvChild *c)
{
GLOBAL_STATE_CODE();
return c->klass->get_parent_desc(c);
}
/*
* Check that @a allows everything that @b needs. @a and @b must reference the
* same child node.
*/
static bool bdrv_a_allow_b(BdrvChild *a, BdrvChild *b, Error **errp)
{
const char *child_bs_name;
g_autofree char *a_user = NULL;
g_autofree char *b_user = NULL;
g_autofree char *perms = NULL;
assert(a->bs);
assert(a->bs == b->bs);
GLOBAL_STATE_CODE();
if ((b->perm & a->shared_perm) == b->perm) {
return true;
}
child_bs_name = bdrv_get_node_name(b->bs);
a_user = bdrv_child_user_desc(a);
b_user = bdrv_child_user_desc(b);
perms = bdrv_perm_names(b->perm & ~a->shared_perm);
error_setg(errp, "Permission conflict on node '%s': permissions '%s' are "
"both required by %s (uses node '%s' as '%s' child) and "
"unshared by %s (uses node '%s' as '%s' child).",
child_bs_name, perms,
b_user, child_bs_name, b->name,
a_user, child_bs_name, a->name);
return false;
}
static bool bdrv_parent_perms_conflict(BlockDriverState *bs, Error **errp)
{
BdrvChild *a, *b;
GLOBAL_STATE_CODE();
/*
* During the loop we'll look at each pair twice. That's correct because
* bdrv_a_allow_b() is asymmetric and we should check each pair in both
* directions.
*/
QLIST_FOREACH(a, &bs->parents, next_parent) {
QLIST_FOREACH(b, &bs->parents, next_parent) {
if (a == b) {
continue;
}
if (!bdrv_a_allow_b(a, b, errp)) {
return true;
}
}
}
return false;
}
static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
BdrvChild *c, BdrvChildRole role,
BlockReopenQueue *reopen_queue,
uint64_t parent_perm, uint64_t parent_shared,
uint64_t *nperm, uint64_t *nshared)
{
assert(bs->drv && bs->drv->bdrv_child_perm);
GLOBAL_STATE_CODE();
bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
parent_perm, parent_shared,
nperm, nshared);
/* TODO Take force_share from reopen_queue */
if (child_bs && child_bs->force_share) {
*nshared = BLK_PERM_ALL;
}
}
/*
* Adds the whole subtree of @bs (including @bs itself) to the @list (except for
* nodes that are already in the @list, of course) so that the final list is
* topologically sorted. Returns the result (the GSList @list object is updated,
* so don't use the old reference after the function call).
*
* On function start @list must already be topologically sorted, and for any
* node in the @list the whole subtree of the node must be in the @list as well.
* The simplest way to satisfy this criterion: use only the result of
* bdrv_topological_dfs() or NULL as the @list parameter.
*/
static GSList *bdrv_topological_dfs(GSList *list, GHashTable *found,
BlockDriverState *bs)
{
BdrvChild *child;
g_autoptr(GHashTable) local_found = NULL;
GLOBAL_STATE_CODE();
if (!found) {
assert(!list);
found = local_found = g_hash_table_new(NULL, NULL);
}
if (g_hash_table_contains(found, bs)) {
return list;
}
g_hash_table_add(found, bs);
QLIST_FOREACH(child, &bs->children, next) {
list = bdrv_topological_dfs(list, found, child->bs);
}
return g_slist_prepend(list, bs);
}
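/*
 * Illustrative sketch, not part of QEMU: the post-order trick used by
 * bdrv_topological_dfs() above, reduced to a standalone fragment. Node,
 * topo_dfs(), list_pos() and topo_demo() are hypothetical names. An
 * intrusive next pointer stands in for GSList, and a per-node flag stands
 * in for the @found hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_KIDS 4

// Hypothetical stand-in for a graph node; not a QEMU type.
typedef struct Node {
    int id;
    struct Node *kids[MAX_KIDS];
    size_t n_kids;
    bool seen;            // plays the role of the @found hash table
    struct Node *next;    // intrusive link, stands in for GSList
} Node;

// Visit the children first, then prepend the node itself, so every node
// ends up in front of all of its descendants.
static Node *topo_dfs(Node *list, Node *n)
{
    if (n->seen) {
        return list;
    }
    n->seen = true;
    for (size_t i = 0; i < n->n_kids; i++) {
        list = topo_dfs(list, n->kids[i]);
    }
    n->next = list;
    return n;
}

// Position of the node with @id in @list, or -1 if it is absent.
static int list_pos(const Node *list, int id)
{
    int pos = 0;
    for (; list; list = list->next, pos++) {
        if (list->id == id) {
            return pos;
        }
    }
    return -1;
}

// Build a diamond (0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3) and sort it.
static Node *topo_demo(Node nodes[4])
{
    for (int i = 0; i < 4; i++) {
        nodes[i] = (Node){ .id = i };
    }
    nodes[0].kids[0] = &nodes[1];
    nodes[0].kids[1] = &nodes[2];
    nodes[0].n_kids = 2;
    nodes[1].kids[0] = &nodes[3];
    nodes[1].n_kids = 1;
    nodes[2].kids[0] = &nodes[3];
    nodes[2].n_kids = 1;
    return topo_dfs(NULL, &nodes[0]);
}
```

 * For the diamond built by topo_demo(), node 0 comes first and node 3 last,
 * so every parent precedes its children. That ordering is what a
 * parent-before-child permission walk such as bdrv_list_refresh_perms()
 * appears to depend on.
 */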
typedef struct BdrvChildSetPermState {
BdrvChild *child;
uint64_t old_perm;
uint64_t old_shared_perm;
} BdrvChildSetPermState;
static void bdrv_child_set_perm_abort(void *opaque)
{
BdrvChildSetPermState *s = opaque;
GLOBAL_STATE_CODE();
s->child->perm = s->old_perm;
s->child->shared_perm = s->old_shared_perm;
}
static TransactionActionDrv bdrv_child_set_perm_drv = {
.abort = bdrv_child_set_perm_abort,
.clean = g_free,
};
static void bdrv_child_set_perm(BdrvChild *c, uint64_t perm,
uint64_t shared, Transaction *tran)
{
BdrvChildSetPermState *s = g_new(BdrvChildSetPermState, 1);
GLOBAL_STATE_CODE();
*s = (BdrvChildSetPermState) {
.child = c,
.old_perm = c->perm,
.old_shared_perm = c->shared_perm,
};
c->perm = perm;
c->shared_perm = shared;
tran_add(tran, &bdrv_child_set_perm_drv, s);
}
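/*
 * Illustrative sketch, not part of QEMU: the "apply now, remember how to
 * undo" pattern that bdrv_child_set_perm() follows, as a standalone
 * fragment. MiniTran, Undo, mini_set() and mini_finalize() are hypothetical
 * names. The real Transaction API has separate abort, commit and clean
 * callbacks, which this sketch folds into one loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical miniature of a transaction: a stack of undo records.
typedef struct Undo {
    uint64_t *slot;       // location that was modified
    uint64_t old_value;   // value to restore on abort
    struct Undo *next;
} Undo;

typedef struct MiniTran {
    Undo *head;
} MiniTran;

// Record the old value, then apply the new one right away.
static void mini_set(MiniTran *t, uint64_t *slot, uint64_t value)
{
    Undo *u = malloc(sizeof(*u));
    assert(u);
    u->slot = slot;
    u->old_value = *slot;
    u->next = t->head;
    t->head = u;
    *slot = value;
}

// ret < 0 rolls back in reverse order of addition; otherwise the new
// values stay. Either way the undo records are freed.
static void mini_finalize(MiniTran *t, int ret)
{
    while (t->head) {
        Undo *u = t->head;
        t->head = u->next;
        if (ret < 0) {
            *u->slot = u->old_value;
        }
        free(u);
    }
}
```

 * Undoing in reverse order matters when the same slot is written twice:
 * the most recent record restores the intermediate value, and the older
 * record then restores the original one.
 */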
static void bdrv_drv_set_perm_commit(void *opaque)
{
BlockDriverState *bs = opaque;
uint64_t cumulative_perms, cumulative_shared_perms;
GLOBAL_STATE_CODE();
if (bs->drv->bdrv_set_perm) {
bdrv_get_cumulative_perm(bs, &cumulative_perms,
&cumulative_shared_perms);
bs->drv->bdrv_set_perm(bs, cumulative_perms, cumulative_shared_perms);
}
}
static void bdrv_drv_set_perm_abort(void *opaque)
{
BlockDriverState *bs = opaque;
GLOBAL_STATE_CODE();
if (bs->drv->bdrv_abort_perm_update) {
bs->drv->bdrv_abort_perm_update(bs);
}
}
TransactionActionDrv bdrv_drv_set_perm_drv = {
.abort = bdrv_drv_set_perm_abort,
.commit = bdrv_drv_set_perm_commit,
};
static int bdrv_drv_set_perm(BlockDriverState *bs, uint64_t perm,
uint64_t shared_perm, Transaction *tran,
Error **errp)
{
GLOBAL_STATE_CODE();
if (!bs->drv) {
return 0;
}
if (bs->drv->bdrv_check_perm) {
int ret = bs->drv->bdrv_check_perm(bs, perm, shared_perm, errp);
if (ret < 0) {
return ret;
}
}
if (tran) {
tran_add(tran, &bdrv_drv_set_perm_drv, bs);
}
return 0;
}
typedef struct BdrvReplaceChildState {
BdrvChild *child;
BdrvChild **childp;
BlockDriverState *old_bs;
bool free_empty_child;
} BdrvReplaceChildState;
static void bdrv_replace_child_commit(void *opaque)
{
BdrvReplaceChildState *s = opaque;
GLOBAL_STATE_CODE();
if (s->free_empty_child && !s->child->bs) {
bdrv_child_free(s->child);
}
bdrv_unref(s->old_bs);
}
static void bdrv_replace_child_abort(void *opaque)
{
BdrvReplaceChildState *s = opaque;
BlockDriverState *new_bs = s->child->bs;
GLOBAL_STATE_CODE();
/*
* old_bs reference is transparently moved from @s to s->child.
*
* Pass &s->child here instead of s->childp, because:
* (1) s->old_bs must be non-NULL, so bdrv_replace_child_noperm() will not
* modify the BdrvChild * pointer we indirectly pass to it, i.e. it
* will not modify s->child. From that perspective, it does not matter
* whether we pass s->childp or &s->child.
* (2) If new_bs is not NULL, s->childp will be NULL. We then cannot use
* it here.
* (3) If new_bs is NULL, *s->childp will have been NULLed by
* bdrv_replace_child_tran()'s bdrv_replace_child_noperm() call, and we
* must not pass a NULL *s->childp here.
*
* So whether new_bs was NULL or not, we cannot pass s->childp here; and in
* any case, there is no reason to pass it anyway.
*/
bdrv_replace_child_noperm(&s->child, s->old_bs, true);
/*
* The child was pre-existing, so s->old_bs must be non-NULL, and
* s->child thus must not have been freed
*/
assert(s->child != NULL);
if (!new_bs) {
/* As described above, *s->childp was cleared, so restore it */
assert(s->childp != NULL);
*s->childp = s->child;
}
bdrv_unref(new_bs);
}
static TransactionActionDrv bdrv_replace_child_drv = {
.commit = bdrv_replace_child_commit,
.abort = bdrv_replace_child_abort,
.clean = g_free,
};
/*
* bdrv_replace_child_tran
*
* Note: the real unref of old_bs is done only on commit.
*
* The function doesn't update permissions; the caller is responsible for this.
*
* (*childp)->bs must not be NULL.
*
* Note that if new_bs == NULL, @childp is stored in a state object attached
* to @tran, so that the old child can be reinstated in the abort handler.
* Therefore, if @new_bs can be NULL, @childp must stay valid until the
* transaction is committed or aborted.
*
* If @free_empty_child is true and @new_bs is NULL, the BdrvChild is
* freed (on commit). @free_empty_child should only be false if the
* caller will free the BdrvChild themselves (which may be important
* if this is in turn called in another transactional context).
*/
static void bdrv_replace_child_tran(BdrvChild **childp,
BlockDriverState *new_bs,
Transaction *tran,
bool free_empty_child)
{
BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
*s = (BdrvReplaceChildState) {
.child = *childp,
.childp = new_bs == NULL ? childp : NULL,
.old_bs = (*childp)->bs,
.free_empty_child = free_empty_child,
};
tran_add(tran, &bdrv_replace_child_drv, s);
/* The abort handler relies on this */
assert(s->old_bs != NULL);
if (new_bs) {
bdrv_ref(new_bs);
}
/*
* Pass free_empty_child=false, we will free the child (if
* necessary) in bdrv_replace_child_commit() (if our
* @free_empty_child parameter was true).
*/
bdrv_replace_child_noperm(childp, new_bs, false);
/* old_bs reference is transparently moved from *childp to @s */
}
/*
* Refresh permissions in @bs subtree. The function is intended to be called
* after some graph modification that was done without permission update.
*/
static int bdrv_node_refresh_perm(BlockDriverState *bs, BlockReopenQueue *q,
Transaction *tran, Error **errp)
{
BlockDriver *drv = bs->drv;
BdrvChild *c;
int ret;
uint64_t cumulative_perms, cumulative_shared_perms;
GLOBAL_STATE_CODE();
bdrv_get_cumulative_perm(bs, &cumulative_perms, &cumulative_shared_perms);
/* Write permissions never work with read-only images */
if ((cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) &&
!bdrv_is_writable_after_reopen(bs, q))
{
if (!bdrv_is_writable_after_reopen(bs, NULL)) {
error_setg(errp, "Block node is read-only");
} else {
error_setg(errp, "Read-only block node '%s' cannot support "
"read-write users", bdrv_get_node_name(bs));
}
return -EPERM;
}
/*
* Unaligned requests will automatically be aligned to bl.request_alignment
* and without RESIZE we can't extend requests to write to space beyond the
* end of the image, so it's required that the image size is aligned.
*/
if ((cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) &&
!(cumulative_perms & BLK_PERM_RESIZE))
{
if ((bs->total_sectors * BDRV_SECTOR_SIZE) % bs->bl.request_alignment) {
error_setg(errp, "Cannot get 'write' permission without 'resize': "
"Image size is not a multiple of request "
"alignment");
return -EPERM;
}
}
/* Check this node */
if (!drv) {
return 0;
}
ret = bdrv_drv_set_perm(bs, cumulative_perms, cumulative_shared_perms, tran,
errp);
if (ret < 0) {
return ret;
}
/* Drivers that never have children can omit .bdrv_child_perm() */
if (!drv->bdrv_child_perm) {
assert(QLIST_EMPTY(&bs->children));
return 0;
}
/* Check all children */
QLIST_FOREACH(c, &bs->children, next) {
uint64_t cur_perm, cur_shared;
bdrv_child_perm(bs, c->bs, c, c->role, q,
cumulative_perms, cumulative_shared_perms,
&cur_perm, &cur_shared);
bdrv_child_set_perm(c, cur_perm, cur_shared, tran);
}
return 0;
}
static int bdrv_list_refresh_perms(GSList *list, BlockReopenQueue *q,
Transaction *tran, Error **errp)
{
int ret;
BlockDriverState *bs;
GLOBAL_STATE_CODE();
for ( ; list; list = list->next) {
bs = list->data;
if (bdrv_parent_perms_conflict(bs, errp)) {
return -EINVAL;
}
ret = bdrv_node_refresh_perm(bs, q, tran, errp);
if (ret < 0) {
return ret;
}
}
return 0;
}
void bdrv_get_cumulative_perm(BlockDriverState *bs, uint64_t *perm,
uint64_t *shared_perm)
{
BdrvChild *c;
uint64_t cumulative_perms = 0;
uint64_t cumulative_shared_perms = BLK_PERM_ALL;
GLOBAL_STATE_CODE();
QLIST_FOREACH(c, &bs->parents, next_parent) {
cumulative_perms |= c->perm;
cumulative_shared_perms &= c->shared_perm;
}
*perm = cumulative_perms;
*shared_perm = cumulative_shared_perms;
}
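/*
 * Illustrative sketch, not part of QEMU: how bdrv_get_cumulative_perm()
 * above folds the parents' requirements, as a standalone fragment. The
 * P_* bit values, Parent and cumulative() are hypothetical names chosen for
 * this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bit values, chosen for this sketch only.
enum {
    P_READ   = 1 << 0,
    P_WRITE  = 1 << 1,
    P_RESIZE = 1 << 2,
    P_ALL    = P_READ | P_WRITE | P_RESIZE,
};

typedef struct Parent {
    uint64_t perm;         // what this parent needs
    uint64_t shared_perm;  // what this parent tolerates from others
} Parent;

// A node needs the union of what its parents need, and may share only
// what every single parent is willing to share.
static void cumulative(const Parent *p, size_t n,
                       uint64_t *perm, uint64_t *shared)
{
    uint64_t need = 0;
    uint64_t tolerate = P_ALL;
    for (size_t i = 0; i < n; i++) {
        need |= p[i].perm;
        tolerate &= p[i].shared_perm;
    }
    *perm = need;
    *shared = tolerate;
}
```

 * With no parents at all the result is "needs nothing, shares everything",
 * which is why the accumulators start at 0 and at the all-bits mask.
 */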
char *bdrv_perm_names(uint64_t perm)
{
struct perm_name {
uint64_t perm;
const char *name;
} permissions[] = {
{ BLK_PERM_CONSISTENT_READ, "consistent read" },
{ BLK_PERM_WRITE, "write" },
{ BLK_PERM_WRITE_UNCHANGED, "write unchanged" },
{ BLK_PERM_RESIZE, "resize" },
{ 0, NULL }
};
GString *result = g_string_sized_new(30);
struct perm_name *p;
for (p = permissions; p->name; p++) {
if (perm & p->perm) {
if (result->len > 0) {
g_string_append(result, ", ");
}
g_string_append(result, p->name);
}
}
return g_string_free(result, FALSE);
}
static int bdrv_refresh_perms(BlockDriverState *bs, Error **errp)
{
int ret;
Transaction *tran = tran_new();
g_autoptr(GSList) list = bdrv_topological_dfs(NULL, NULL, bs);
GLOBAL_STATE_CODE();
ret = bdrv_list_refresh_perms(list, NULL, tran, errp);
tran_finalize(tran, ret);
return ret;
}
int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
Error **errp)
{
Error *local_err = NULL;
Transaction *tran = tran_new();
int ret;
GLOBAL_STATE_CODE();
bdrv_child_set_perm(c, perm, shared, tran);
ret = bdrv_refresh_perms(c->bs, &local_err);
tran_finalize(tran, ret);
if (ret < 0) {
if ((perm & ~c->perm) || (c->shared_perm & ~shared)) {
/* The caller tightens permissions, so report the failure */
error_propagate(errp, local_err);
} else {
/*
* Our caller may intend to only loosen restrictions and
* does not expect this function to fail. Errors are not
* fatal in such a case, so we can just hide them from our
* caller.
*/
error_free(local_err);
ret = 0;
}
}
return ret;
}
int bdrv_child_refresh_perms(BlockDriverState *bs, BdrvChild *c, Error **errp)
{
uint64_t parent_perms, parent_shared;
uint64_t perms, shared;
GLOBAL_STATE_CODE();
bdrv_get_cumulative_perm(bs, &parent_perms, &parent_shared);
bdrv_child_perm(bs, c->bs, c, c->role, NULL,
parent_perms, parent_shared, &perms, &shared);
return bdrv_child_try_set_perm(c, perms, shared, errp);
}
/*
* Default implementation for .bdrv_child_perm() for block filters:
* Forward CONSISTENT_READ, WRITE, WRITE_UNCHANGED, and RESIZE to the
* filtered child.
*/
static void bdrv_filter_default_perms(BlockDriverState *bs, BdrvChild *c,
BdrvChildRole role,
BlockReopenQueue *reopen_queue,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
{
GLOBAL_STATE_CODE();
*nperm = perm & DEFAULT_PERM_PASSTHROUGH;
*nshared = (shared & DEFAULT_PERM_PASSTHROUGH) | DEFAULT_PERM_UNCHANGED;
}
static void bdrv_default_perms_for_cow(BlockDriverState *bs, BdrvChild *c,
BdrvChildRole role,
BlockReopenQueue *reopen_queue,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
{
assert(role & BDRV_CHILD_COW);
GLOBAL_STATE_CODE();
/*
* We want consistent read from backing files if the parent needs it.
* No other operations are performed on backing files.
*/
perm &= BLK_PERM_CONSISTENT_READ;
/*
* If the parent can deal with changing data, we're okay with a
* writable and resizable backing file.
* TODO Require !(perm & BLK_PERM_CONSISTENT_READ), too?
*/
if (shared & BLK_PERM_WRITE) {
shared = BLK_PERM_WRITE | BLK_PERM_RESIZE;
} else {
shared = 0;
}
shared |= BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED;
if (bs->open_flags & BDRV_O_INACTIVE) {
shared |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
}
*nperm = perm;
*nshared = shared;
}
static void bdrv_default_perms_for_storage(BlockDriverState *bs, BdrvChild *c,
BdrvChildRole role,
BlockReopenQueue *reopen_queue,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
{
int flags;
GLOBAL_STATE_CODE();
assert(role & (BDRV_CHILD_METADATA | BDRV_CHILD_DATA));
flags = bdrv_reopen_get_flags(reopen_queue, bs);
/*
* Apart from the modifications below, the same permissions are
* forwarded (and the same ones left alone) as for filters.
*/
bdrv_filter_default_perms(bs, c, role, reopen_queue,
perm, shared, &perm, &shared);
if (role & BDRV_CHILD_METADATA) {
/* Format drivers may touch metadata even if the guest doesn't write */
if (bdrv_is_writable_after_reopen(bs, reopen_queue)) {
perm |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
}
/*
* bs->file always needs to be consistent because of the
* metadata. We can never allow other users to resize or write
* to it.
*/
if (!(flags & BDRV_O_NO_IO)) {
perm |= BLK_PERM_CONSISTENT_READ;
}
shared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
}
if (role & BDRV_CHILD_DATA) {
/*
* Technically, everything in this block is a subset of the
* BDRV_CHILD_METADATA path taken above, and so this could
* be an "else if" branch. However, that is not obvious, and
* this function is not performance critical, therefore we let
* this be an independent "if".
*/
/*
* We cannot allow other users to resize the file because the
* format driver might have some assumptions about the size
* (e.g. because it is stored in metadata, or because the file
* is split into fixed-size data files).
*/
shared &= ~BLK_PERM_RESIZE;
/*
* WRITE_UNCHANGED often cannot be performed as such on the
* data file. For example, the qcow2 driver may still need to
* write copied clusters on copy-on-read.
*/
if (perm & BLK_PERM_WRITE_UNCHANGED) {
perm |= BLK_PERM_WRITE;
}
/*
* If the data file is written to, the format driver may
* expect to be able to resize it by writing beyond the EOF.
*/
if (perm & BLK_PERM_WRITE) {
perm |= BLK_PERM_RESIZE;
}
}
if (bs->open_flags & BDRV_O_INACTIVE) {
shared |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
}
*nperm = perm;
*nshared = shared;
}
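/*
 * Illustrative sketch, not part of QEMU: only the BDRV_CHILD_DATA
 * adjustments from bdrv_default_perms_for_storage() above, as a standalone
 * fragment. The X_* bit values and data_child_perms() are hypothetical, and
 * the filter passthrough step, the metadata branch and the BDRV_O_INACTIVE
 * override are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical bit values for this sketch; the real BLK_PERM_* constants
// are defined elsewhere in QEMU.
enum {
    X_CONSISTENT_READ = 1 << 0,
    X_WRITE           = 1 << 1,
    X_WRITE_UNCHANGED = 1 << 2,
    X_RESIZE          = 1 << 3,
};

static void data_child_perms(uint64_t perm, uint64_t shared,
                             uint64_t *nperm, uint64_t *nshared)
{
    // Nobody else may resize the data file.
    shared &= ~(uint64_t)X_RESIZE;
    // WRITE_UNCHANGED may turn into real writes on the data file.
    if (perm & X_WRITE_UNCHANGED) {
        perm |= X_WRITE;
    }
    // Writing may grow the file past EOF, so resize is needed as well.
    if (perm & X_WRITE) {
        perm |= X_RESIZE;
    }
    *nperm = perm;
    *nshared = shared;
}
```

 * The two checks are ordered so that they chain: a request for
 * WRITE_UNCHANGED alone ends up holding WRITE and RESIZE too, while a pure
 * read request passes through untouched.
 */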
void bdrv_default_perms(BlockDriverState *bs, BdrvChild *c,
BdrvChildRole role, BlockReopenQueue *reopen_queue,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
{
GLOBAL_STATE_CODE();
if (role & BDRV_CHILD_FILTERED) {
assert(!(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
BDRV_CHILD_COW)));
bdrv_filter_default_perms(bs, c, role, reopen_queue,
perm, shared, nperm, nshared);
} else if (role & BDRV_CHILD_COW) {
assert(!(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA)));
bdrv_default_perms_for_cow(bs, c, role, reopen_queue,
perm, shared, nperm, nshared);
} else if (role & (BDRV_CHILD_METADATA | BDRV_CHILD_DATA)) {
bdrv_default_perms_for_storage(bs, c, role, reopen_queue,
perm, shared, nperm, nshared);
} else {
g_assert_not_reached();
}
}
uint64_t bdrv_qapi_perm_to_blk_perm(BlockPermission qapi_perm)
{
static const uint64_t permissions[] = {
[BLOCK_PERMISSION_CONSISTENT_READ] = BLK_PERM_CONSISTENT_READ,
[BLOCK_PERMISSION_WRITE] = BLK_PERM_WRITE,
[BLOCK_PERMISSION_WRITE_UNCHANGED] = BLK_PERM_WRITE_UNCHANGED,
[BLOCK_PERMISSION_RESIZE] = BLK_PERM_RESIZE,
};
QEMU_BUILD_BUG_ON(ARRAY_SIZE(permissions) != BLOCK_PERMISSION__MAX);
QEMU_BUILD_BUG_ON(1UL << ARRAY_SIZE(permissions) != BLK_PERM_ALL + 1);
assert(qapi_perm < BLOCK_PERMISSION__MAX);
return permissions[qapi_perm];
}
/**
* Replace (*childp)->bs by @new_bs.
*
* If @new_bs is NULL, *childp will be set to NULL, too: BDS parents
* generally cannot handle a BdrvChild with .bs == NULL, so clearing
* BdrvChild.bs should generally immediately be followed by the
* BdrvChild pointer being cleared as well.
*
* If @free_empty_child is true and @new_bs is NULL, the BdrvChild is
* freed. @free_empty_child should only be false if the caller will
* free the BdrvChild themselves (this may be important in a
* transactional context, where it may only be freed on commit).
*/
static void bdrv_replace_child_noperm(BdrvChild **childp,
BlockDriverState *new_bs,
bool free_empty_child)
{
BdrvChild *child = *childp;
BlockDriverState *old_bs = child->bs;
int new_bs_quiesce_counter;
int drain_saldo;
assert(!child->frozen);
assert(old_bs != new_bs);
GLOBAL_STATE_CODE();
if (old_bs && new_bs) {
assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
}
new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
/*
* If the new child node is drained but the old one was not, flush
* all outstanding requests to the old child node.
*/
while (drain_saldo > 0 && child->klass->drained_begin) {
bdrv_parent_drained_begin_single(child, true);
drain_saldo--;
}
if (old_bs) {
/* Detach first so that the recursive drain sections coming from @child
* are already gone and we only end the drain sections that came from
* elsewhere. */
if (child->klass->detach) {
child->klass->detach(child);
}
assert_bdrv_graph_writable(old_bs);
QLIST_REMOVE(child, next_parent);
}
child->bs = new_bs;
if (!new_bs) {
*childp = NULL;
}
if (new_bs) {
assert_bdrv_graph_writable(new_bs);
QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
/*
* Detaching the old node may have led to the new node's
* quiesce_counter having been decreased. Not a problem, we
* just need to recognize this here and then invoke
* drained_end appropriately more often.
*/
assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
/* Attach only after starting new drained sections, so that recursive
* drain sections coming from @child don't get an extra .drained_begin
* callback. */
if (child->klass->attach) {
child->klass->attach(child);
}
}
/*
* If the old child node was drained but the new one is not, allow
* requests to come in only after the new node has been attached.
*/
while (drain_saldo < 0 && child->klass->drained_end) {
bdrv_parent_drained_end_single(child);
drain_saldo++;
}
if (free_empty_child && !child->bs) {
bdrv_child_free(child);
}
}
/**
* Free the given @child.
*
* The child must be empty (i.e. `child->bs == NULL`) and it must be
* unused (i.e. not in a children list).
*/
static void bdrv_child_free(BdrvChild *child)
{
assert(!child->bs);
GLOBAL_STATE_CODE();
assert(!child->next.le_prev); /* not in children list */
g_free(child->name);
g_free(child);
}
typedef struct BdrvAttachChildCommonState {
BdrvChild **child;
AioContext *old_parent_ctx;
AioContext *old_child_ctx;
} BdrvAttachChildCommonState;
static void bdrv_attach_child_common_abort(void *opaque)
{
BdrvAttachChildCommonState *s = opaque;
BdrvChild *child = *s->child;
BlockDriverState *bs = child->bs;
GLOBAL_STATE_CODE();
/*
* Pass free_empty_child=false, because we still need the child
* for the AioContext operations on the parent below; those
* BdrvChildClass methods all work on a BdrvChild object, so we
* need to keep it as an empty shell (after this function, it will
* not be attached to any parent, and it will not have a .bs).
*/
bdrv_replace_child_noperm(s->child, NULL, false);
if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
bdrv_try_set_aio_context(bs, s->old_child_ctx, &error_abort);
}
if (bdrv_child_get_parent_aio_context(child) != s->old_parent_ctx) {
GSList *ignore;
/* No need to ignore `child`, because it has been detached already */
ignore = NULL;
child->klass->can_set_aio_ctx(child, s->old_parent_ctx, &ignore,
&error_abort);
g_slist_free(ignore);
ignore = NULL;
child->klass->set_aio_ctx(child, s->old_parent_ctx, &ignore);
g_slist_free(ignore);
}
bdrv_unref(bs);
bdrv_child_free(child);
}
static TransactionActionDrv bdrv_attach_child_common_drv = {
.abort = bdrv_attach_child_common_abort,
.clean = g_free,
};
/*
* Common part of attaching a bdrv child to a bs, a blk, or a job.
*
* The resulting new child is returned through @child.
* On entry, *@child must be NULL.
* @child is saved in a new entry of @tran so that *@child can be reset to
* NULL on abort. The referenced variable must therefore live at least until
* the end of the transaction.
*
* The function doesn't update permissions; the caller is responsible for this.
*/
static int bdrv_attach_child_common(BlockDriverState *child_bs,
const char *child_name,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
uint64_t perm, uint64_t shared_perm,
void *opaque, BdrvChild **child,
Transaction *tran, Error **errp)
{
BdrvChild *new_child;
AioContext *parent_ctx;
AioContext *child_ctx = bdrv_get_aio_context(child_bs);
assert(child);
assert(*child == NULL);
assert(child_class->get_parent_desc);
GLOBAL_STATE_CODE();
new_child = g_new(BdrvChild, 1);
*new_child = (BdrvChild) {
.bs = NULL,
.name = g_strdup(child_name),
.klass = child_class,
.role = child_role,
.perm = perm,
.shared_perm = shared_perm,
.opaque = opaque,
};
/*
* If the AioContexts don't match, first try to move the subtree of
* child_bs into the AioContext of the new parent. If this doesn't work,
* try moving the parent into the AioContext of child_bs instead.
*/
parent_ctx = bdrv_child_get_parent_aio_context(new_child);
if (child_ctx != parent_ctx) {
Error *local_err = NULL;
int ret = bdrv_try_set_aio_context(child_bs, parent_ctx, &local_err);
if (ret < 0 && child_class->can_set_aio_ctx) {
GSList *ignore = g_slist_prepend(NULL, new_child);
if (child_class->can_set_aio_ctx(new_child, child_ctx, &ignore,
NULL))
{
error_free(local_err);
ret = 0;
g_slist_free(ignore);
ignore = g_slist_prepend(NULL, new_child);
child_class->set_aio_ctx(new_child, child_ctx, &ignore);
}
g_slist_free(ignore);
}
if (ret < 0) {
error_propagate(errp, local_err);
bdrv_child_free(new_child);
return ret;
}
}
bdrv_ref(child_bs);
bdrv_replace_child_noperm(&new_child, child_bs, true);
/* child_bs was non-NULL, so new_child must not have been freed */
assert(new_child != NULL);
*child = new_child;
BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
*s = (BdrvAttachChildCommonState) {
.child = child,
.old_parent_ctx = parent_ctx,
.old_child_ctx = child_ctx,
};
tran_add(tran, &bdrv_attach_child_common_drv, s);
return 0;
}
/*
* The variable referenced by @child must live at least until the end of the
* transaction (see the bdrv_attach_child_common() doc for details).
*
* The function doesn't update permissions; the caller is responsible for this.
*/
static int bdrv_attach_child_noperm(BlockDriverState *parent_bs,
BlockDriverState *child_bs,
const char *child_name,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
BdrvChild **child,
Transaction *tran,
Error **errp)
{
int ret;
uint64_t perm, shared_perm;
assert(parent_bs->drv);
GLOBAL_STATE_CODE();
if (bdrv_recurse_has_child(child_bs, parent_bs)) {
error_setg(errp, "Making '%s' a %s child of '%s' would create a cycle",
child_bs->node_name, child_name, parent_bs->node_name);
return -EINVAL;
}
bdrv_get_cumulative_perm(parent_bs, &perm, &shared_perm);
bdrv_child_perm(parent_bs, child_bs, NULL, child_role, NULL,
perm, shared_perm, &perm, &shared_perm);
ret = bdrv_attach_child_common(child_bs, child_name, child_class,
child_role, perm, shared_perm, parent_bs,
child, tran, errp);
if (ret < 0) {
return ret;
}
return 0;
}
static void bdrv_detach_child(BdrvChild **childp)
{
BlockDriverState *old_bs = (*childp)->bs;
GLOBAL_STATE_CODE();
bdrv_replace_child_noperm(childp, NULL, true);
if (old_bs) {
/*
* Update permissions for old node. We're just taking a parent away, so
* we're loosening restrictions. Errors of permission update are not
* fatal in this case, ignore them.
*/
bdrv_refresh_perms(old_bs, NULL);
/*
* When the parent requiring a non-default AioContext is removed, the
* node moves back to the main AioContext
*/
bdrv_try_set_aio_context(old_bs, qemu_get_aio_context(), NULL);
}
}
/*
* This function steals the reference to child_bs from the caller.
* That reference is later dropped by bdrv_root_unref_child().
*
* On failure NULL is returned, errp is set and the reference to
* child_bs is also dropped.
*
* The caller must hold the AioContext lock for @child_bs, but not that of @ctx
* (unless @child_bs is already in @ctx).
*/
BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
const char *child_name,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
uint64_t perm, uint64_t shared_perm,
void *opaque, Error **errp)
{
int ret;
BdrvChild *child = NULL;
Transaction *tran = tran_new();
GLOBAL_STATE_CODE();
ret = bdrv_attach_child_common(child_bs, child_name, child_class,
child_role, perm, shared_perm, opaque,
&child, tran, errp);
if (ret < 0) {
goto out;
}
ret = bdrv_refresh_perms(child_bs, errp);
out:
tran_finalize(tran, ret);
/* child is unset on failure by bdrv_attach_child_common_abort() */
assert((ret < 0) == !child);
bdrv_unref(child_bs);
return child;
}
/*
* This function transfers the reference to child_bs from the caller
* to parent_bs. That reference is later dropped by parent_bs on
* bdrv_close() or if someone calls bdrv_unref_child().
*
* On failure NULL is returned, errp is set and the reference to
* child_bs is also dropped.
*
* If @parent_bs and @child_bs are in different AioContexts, the caller must
* hold the AioContext lock for @child_bs, but not for @parent_bs.
*/
BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
BlockDriverState *child_bs,
const char *child_name,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
Error **errp)
{
int ret;
BdrvChild *child = NULL;
Transaction *tran = tran_new();
GLOBAL_STATE_CODE();
ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
child_role, &child, tran, errp);
if (ret < 0) {
goto out;
}
ret = bdrv_refresh_perms(parent_bs, errp);
if (ret < 0) {
goto out;
}
out:
tran_finalize(tran, ret);
/* child is unset on failure by bdrv_attach_child_common_abort() */
assert((ret < 0) == !child);
bdrv_unref(child_bs);
return child;
}
/* Callers must ensure that child->frozen is false. */
void bdrv_root_unref_child(BdrvChild *child)
{
BlockDriverState *child_bs;
GLOBAL_STATE_CODE();
child_bs = child->bs;
bdrv_detach_child(&child);
bdrv_unref(child_bs);
}
typedef struct BdrvSetInheritsFrom {
BlockDriverState *bs;
BlockDriverState *old_inherits_from;
} BdrvSetInheritsFrom;
static void bdrv_set_inherits_from_abort(void *opaque)
{
BdrvSetInheritsFrom *s = opaque;
s->bs->inherits_from = s->old_inherits_from;
}
static TransactionActionDrv bdrv_set_inherits_from_drv = {
.abort = bdrv_set_inherits_from_abort,
.clean = g_free,
};
/* @tran is allowed to be NULL. In this case no rollback is possible */
static void bdrv_set_inherits_from(BlockDriverState *bs,
BlockDriverState *new_inherits_from,
Transaction *tran)
{
if (tran) {
BdrvSetInheritsFrom *s = g_new(BdrvSetInheritsFrom, 1);
*s = (BdrvSetInheritsFrom) {
.bs = bs,
.old_inherits_from = bs->inherits_from,
};
tran_add(tran, &bdrv_set_inherits_from_drv, s);
}
bs->inherits_from = new_inherits_from;
}
/**
* Clear all inherits_from pointers from children and grandchildren of
* @root that point to @root, where necessary.
* @tran is allowed to be NULL. In this case no rollback is possible
*/
static void bdrv_unset_inherits_from(BlockDriverState *root, BdrvChild *child,
Transaction *tran)
{
BdrvChild *c;
if (child->bs->inherits_from == root) {
/*
* Remove inherits_from only when the last reference between root and
* child->bs goes away.
*/
QLIST_FOREACH(c, &root->children, next) {
if (c != child && c->bs == child->bs) {
break;
}
}
if (c == NULL) {
bdrv_set_inherits_from(child->bs, NULL, tran);
}
}
QLIST_FOREACH(c, &child->bs->children, next) {
bdrv_unset_inherits_from(root, c, tran);
}
}
/* Callers must ensure that child->frozen is false. */
void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child)
{
GLOBAL_STATE_CODE();
if (child == NULL) {
return;
}
bdrv_unset_inherits_from(parent, child, NULL);
bdrv_root_unref_child(child);
}
static void bdrv_parent_cb_change_media(BlockDriverState *bs, bool load)
{
BdrvChild *c;
GLOBAL_STATE_CODE();
QLIST_FOREACH(c, &bs->parents, next_parent) {
if (c->klass->change_media) {
c->klass->change_media(c, load);
}
}
}
/* Return true if you can reach parent going through child->inherits_from
* recursively. If parent or child are NULL, return false */
static bool bdrv_inherits_from_recursive(BlockDriverState *child,
BlockDriverState *parent)
{
while (child && child != parent) {
child = child->inherits_from;
}
return child != NULL;
}
/*
* Return the BdrvChildRole for @bs's backing child. bs->backing is
* mostly used for COW backing children (role = COW), but also for
* filtered children (role = FILTERED | PRIMARY).
*/
static BdrvChildRole bdrv_backing_role(BlockDriverState *bs)
{
if (bs->drv && bs->drv->is_filter) {
return BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY;
} else {
return BDRV_CHILD_COW;
}
}
/*
* Sets the bs->backing or bs->file link of a BDS. A new reference is created;
* callers which don't need their own reference any more must call bdrv_unref().
*
* Function doesn't update permissions, caller is responsible for this.
*/
static int bdrv_set_file_or_backing_noperm(BlockDriverState *parent_bs,
BlockDriverState *child_bs,
bool is_backing,
Transaction *tran, Error **errp)
{
int ret = 0;
bool update_inherits_from =
bdrv_inherits_from_recursive(child_bs, parent_bs);
BdrvChild *child = is_backing ? parent_bs->backing : parent_bs->file;
BdrvChildRole role;
GLOBAL_STATE_CODE();
if (!parent_bs->drv) {
/*
* A node without drv is an object without a class :/. TODO: finally fix
* the qcow2 driver to never clear bs->drv and implement format corruption
* handling in another way.
*/
error_setg(errp, "Node corrupted");
return -EINVAL;
}
if (child && child->frozen) {
error_setg(errp, "Cannot change frozen '%s' link from '%s' to '%s'",
child->name, parent_bs->node_name, child->bs->node_name);
return -EPERM;
}
if (is_backing && !parent_bs->drv->is_filter &&
!parent_bs->drv->supports_backing)
{
error_setg(errp, "Driver '%s' of node '%s' does not support backing "
"files", parent_bs->drv->format_name, parent_bs->node_name);
return -EINVAL;
}
if (parent_bs->drv->is_filter) {
role = BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY;
} else if (is_backing) {
role = BDRV_CHILD_COW;
} else {
/*
* We can only reuse the role of the existing child. We don't have the
* infrastructure to determine the role of a file child in a generic way.
*/
if (!child) {
error_setg(errp, "Cannot set file child to format node without "
"file child");
return -EINVAL;
}
role = child->role;
}
if (child) {
bdrv_unset_inherits_from(parent_bs, child, tran);
bdrv_remove_file_or_backing_child(parent_bs, child, tran);
}
if (!child_bs) {
goto out;
}
ret = bdrv_attach_child_noperm(parent_bs, child_bs,
is_backing ? "backing" : "file",
&child_of_bds, role,
is_backing ? &parent_bs->backing :
&parent_bs->file,
tran, errp);
if (ret < 0) {
return ret;
}
/*
* If inherits_from pointed recursively to bs then let's update it to
* point directly to bs (else it will become NULL).
*/
if (update_inherits_from) {
bdrv_set_inherits_from(child_bs, parent_bs, tran);
}
out:
bdrv_refresh_limits(parent_bs, tran, NULL);
return 0;
}
static int bdrv_set_backing_noperm(BlockDriverState *bs,
BlockDriverState *backing_hd,
Transaction *tran, Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_set_file_or_backing_noperm(bs, backing_hd, true, tran, errp);
}
int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
Error **errp)
{
int ret;
Transaction *tran = tran_new();
GLOBAL_STATE_CODE();
bdrv_drained_begin(bs);
ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
if (ret < 0) {
goto out;
}
ret = bdrv_refresh_perms(bs, errp);
out:
tran_finalize(tran, ret);
bdrv_drained_end(bs);
return ret;
}
/*
* Opens the backing file for a BlockDriverState if not yet open
*
* bdref_key specifies the key for the image's BlockdevRef in the options QDict.
* That QDict has to be flattened; therefore, if the BlockdevRef is a QDict
* itself, all options starting with "${bdref_key}." are considered part of the
* BlockdevRef.
*
* TODO Can this be unified with bdrv_open_image()?
*/
int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
const char *bdref_key, Error **errp)
{
char *backing_filename = NULL;
char *bdref_key_dot;
const char *reference = NULL;
int ret = 0;
bool implicit_backing = false;
BlockDriverState *backing_hd;
QDict *options;
QDict *tmp_parent_options = NULL;
Error *local_err = NULL;
GLOBAL_STATE_CODE();
if (bs->backing != NULL) {
goto free_exit;
}
/* NULL means an empty set of options */
if (parent_options == NULL) {
tmp_parent_options = qdict_new();
parent_options = tmp_parent_options;
}
bs->open_flags &= ~BDRV_O_NO_BACKING;
bdref_key_dot = g_strdup_printf("%s.", bdref_key);
qdict_extract_subqdict(parent_options, &options, bdref_key_dot);
g_free(bdref_key_dot);
/*
* Caution: while qdict_get_try_str() is fine, getting non-string
* types would require more care. When @parent_options come from
* -blockdev or blockdev_add, its members are typed according to
* the QAPI schema, but when they come from -drive, they're all
* QString.
*/
reference = qdict_get_try_str(parent_options, bdref_key);
if (reference || qdict_haskey(options, "file.filename")) {
/* keep backing_filename NULL */
} else if (bs->backing_file[0] == '\0' && qdict_size(options) == 0) {
qobject_unref(options);
goto free_exit;
} else {
if (qdict_size(options) == 0) {
/* If the user specifies options that do not modify the
* backing file's behavior, we might still consider it the
* implicit backing file. But it's easier this way, and
* just specifying some of the backing BDS's options is
* only possible with -drive anyway (otherwise the QAPI
* schema forces the user to specify everything). */
implicit_backing = !strcmp(bs->auto_backing_file, bs->backing_file);
}
backing_filename = bdrv_get_full_backing_filename(bs, &local_err);
if (local_err) {
ret = -EINVAL;
error_propagate(errp, local_err);
qobject_unref(options);
goto free_exit;
}
}
if (!bs->drv || !bs->drv->supports_backing) {
ret = -EINVAL;
error_setg(errp, "Driver doesn't support backing files");
qobject_unref(options);
goto free_exit;
}
if (!reference &&
bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
qdict_put_str(options, "driver", bs->backing_format);
}
backing_hd = bdrv_open_inherit(backing_filename, reference, options, 0, bs,
&child_of_bds, bdrv_backing_role(bs), errp);
if (!backing_hd) {
bs->open_flags |= BDRV_O_NO_BACKING;
error_prepend(errp, "Could not open backing file: ");
ret = -EINVAL;
goto free_exit;
}
if (implicit_backing) {
bdrv_refresh_filename(backing_hd);
pstrcpy(bs->auto_backing_file, sizeof(bs->auto_backing_file),
backing_hd->filename);
}
/* Hook up the backing file link; drop our reference, bs owns the
* backing_hd reference now */
ret = bdrv_set_backing_hd(bs, backing_hd, errp);
bdrv_unref(backing_hd);
if (ret < 0) {
goto free_exit;
}
qdict_del(parent_options, bdref_key);
free_exit:
g_free(backing_filename);
qobject_unref(tmp_parent_options);
return ret;
}
static BlockDriverState *
bdrv_open_child_bs(const char *filename, QDict *options, const char *bdref_key,
BlockDriverState *parent, const BdrvChildClass *child_class,
BdrvChildRole child_role, bool allow_none, Error **errp)
{
BlockDriverState *bs = NULL;
QDict *image_options;
char *bdref_key_dot;
const char *reference;
assert(child_class != NULL);
bdref_key_dot = g_strdup_printf("%s.", bdref_key);
qdict_extract_subqdict(options, &image_options, bdref_key_dot);
g_free(bdref_key_dot);
/*
* Caution: while qdict_get_try_str() is fine, getting non-string
* types would require more care. When @options come from
* -blockdev or blockdev_add, its members are typed according to
* the QAPI schema, but when they come from -drive, they're all
* QString.
*/
reference = qdict_get_try_str(options, bdref_key);
if (!filename && !reference && !qdict_size(image_options)) {
if (!allow_none) {
error_setg(errp, "A block device must be specified for \"%s\"",
bdref_key);
}
qobject_unref(image_options);
goto done;
}
bs = bdrv_open_inherit(filename, reference, image_options, 0,
parent, child_class, child_role, errp);
if (!bs) {
goto done;
}
done:
qdict_del(options, bdref_key);
return bs;
}
/*
* Opens a disk image whose options are given as BlockdevRef in another block
* device's options.
*
* If allow_none is true, no image will be opened if filename is NULL and no
* BlockdevRef is given. NULL will be returned, but errp remains unset.
*
* bdref_key specifies the key for the image's BlockdevRef in the options QDict.
* That QDict has to be flattened; therefore, if the BlockdevRef is a QDict
* itself, all options starting with "${bdref_key}." are considered part of the
* BlockdevRef.
*
* The BlockdevRef will be removed from the options QDict.
*/
BdrvChild *bdrv_open_child(const char *filename,
QDict *options, const char *bdref_key,
BlockDriverState *parent,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
bool allow_none, Error **errp)
{
BlockDriverState *bs;
GLOBAL_STATE_CODE();
bs = bdrv_open_child_bs(filename, options, bdref_key, parent, child_class,
child_role, allow_none, errp);
if (bs == NULL) {
return NULL;
}
return bdrv_attach_child(parent, bs, bdref_key, child_class, child_role,
errp);
}
/*
* TODO Future callers may need to specify parent/child_class in order for
* option inheritance to work. Existing callers use it for the root node.
*/
BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp)
{
BlockDriverState *bs = NULL;
QObject *obj = NULL;
QDict *qdict = NULL;
const char *reference = NULL;
Visitor *v = NULL;
GLOBAL_STATE_CODE();
if (ref->type == QTYPE_QSTRING) {
reference = ref->u.reference;
} else {
BlockdevOptions *options = &ref->u.definition;
assert(ref->type == QTYPE_QDICT);
v = qobject_output_visitor_new(&obj);
visit_type_BlockdevOptions(v, NULL, &options, &error_abort);
visit_complete(v, &obj);
qdict = qobject_to(QDict, obj);
qdict_flatten(qdict);
/* bdrv_open_inherit() defaults to the values in bdrv_flags (for
* compatibility with other callers) rather than what we want as the
* real defaults. Apply the defaults here instead. */
qdict_set_default_str(qdict, BDRV_OPT_CACHE_DIRECT, "off");
qdict_set_default_str(qdict, BDRV_OPT_CACHE_NO_FLUSH, "off");
qdict_set_default_str(qdict, BDRV_OPT_READ_ONLY, "off");
qdict_set_default_str(qdict, BDRV_OPT_AUTO_READ_ONLY, "off");
}
bs = bdrv_open_inherit(NULL, reference, qdict, 0, NULL, NULL, 0, errp);
/* bdrv_open_inherit() took ownership of qdict (and thus of obj) */
obj = NULL;
qobject_unref(obj);
visit_free(v);
return bs;
}
static BlockDriverState *bdrv_append_temp_snapshot(BlockDriverState *bs,
int flags,
QDict *snapshot_options,
Error **errp)
{
/* TODO: extra byte is a hack to ensure MAX_PATH space on Windows. */
char *tmp_filename = g_malloc0(PATH_MAX + 1);
int64_t total_size;
QemuOpts *opts = NULL;
BlockDriverState *bs_snapshot = NULL;
int ret;
GLOBAL_STATE_CODE();
/* if snapshot, we create a temporary backing file and open it
instead of opening 'filename' directly */
/* Get the required size from the image */
total_size = bdrv_getlength(bs);
if (total_size < 0) {
error_setg_errno(errp, -total_size, "Could not get image size");
goto out;
}
/* Create the temporary image */
ret = get_tmp_filename(tmp_filename, PATH_MAX + 1);
if (ret < 0) {
error_setg_errno(errp, -ret, "Could not get temporary filename");
goto out;
}
opts = qemu_opts_create(bdrv_qcow2.create_opts, NULL, 0,
&error_abort);
qemu_opt_set_number(opts, BLOCK_OPT_SIZE, total_size, &error_abort);
ret = bdrv_create(&bdrv_qcow2, tmp_filename, opts, errp);
qemu_opts_del(opts);
if (ret < 0) {
error_prepend(errp, "Could not create temporary overlay '%s': ",
tmp_filename);
goto out;
}
/* Prepare options QDict for the temporary file */
qdict_put_str(snapshot_options, "file.driver", "file");
qdict_put_str(snapshot_options, "file.filename", tmp_filename);
qdict_put_str(snapshot_options, "driver", "qcow2");
bs_snapshot = bdrv_open(NULL, NULL, snapshot_options, flags, errp);
snapshot_options = NULL;
if (!bs_snapshot) {
goto out;
}
ret = bdrv_append(bs_snapshot, bs, errp);
if (ret < 0) {
bs_snapshot = NULL;
goto out;
}
out:
qobject_unref(snapshot_options);
g_free(tmp_filename);
return bs_snapshot;
}
/*
* Opens a disk image (raw, qcow2, vmdk, ...)
*
* options is a QDict of options to pass to the block drivers, or NULL for an
* empty set of options. The reference to the QDict belongs to the block layer
* after the call (even on failure), so if the caller intends to reuse the
* dictionary, it needs to use qobject_ref() before calling bdrv_open.
*
* Returns the opened BDS, for which the caller holds a reference, or NULL on
* failure.
*
* The reference parameter may be used to specify an existing block device which
* should be opened. If specified, neither options nor a filename may be given.
*/
static BlockDriverState *bdrv_open_inherit(const char *filename,
const char *reference,
QDict *options, int flags,
BlockDriverState *parent,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
Error **errp)
{
int ret;
BlockBackend *file = NULL;
BlockDriverState *bs;
BlockDriver *drv = NULL;
BdrvChild *child;
const char *drvname;
const char *backing;
Error *local_err = NULL;
QDict *snapshot_options = NULL;
int snapshot_flags = 0;
assert(!child_class || !flags);
assert(!child_class == !parent);
GLOBAL_STATE_CODE();
if (reference) {
bool options_non_empty = options ? qdict_size(options) : false;
qobject_unref(options);
if (filename || options_non_empty) {
error_setg(errp, "Cannot reference an existing block device with "
"additional options or a new filename");
return NULL;
}
bs = bdrv_lookup_bs(reference, reference, errp);
if (!bs) {
return NULL;
}
bdrv_ref(bs);
return bs;
}
bs = bdrv_new();
/* NULL means an empty set of options */
if (options == NULL) {
options = qdict_new();
}
/* json: syntax counts as explicit options, as if in the QDict */
parse_json_protocol(options, &filename, &local_err);
if (local_err) {
goto fail;
}
bs->explicit_options = qdict_clone_shallow(options);
if (child_class) {
bool parent_is_format;
if (parent->drv) {
parent_is_format = parent->drv->is_format;
} else {
/*
* parent->drv is not set yet because this node is opened for
* (potential) format probing. That means that @parent is going
* to be a format node.
*/
parent_is_format = true;
}
bs->inherits_from = parent;
child_class->inherit_options(child_role, parent_is_format,
&flags, options,
parent->open_flags, parent->options);
}
ret = bdrv_fill_options(&options, filename, &flags, &local_err);
if (ret < 0) {
goto fail;
}
/*
* Set the BDRV_O_RDWR and BDRV_O_ALLOW_RDWR flags.
* Caution: getting a boolean member of @options requires care.
* When @options come from -blockdev or blockdev_add, members are
* typed according to the QAPI schema, but when they come from
* -drive, they're all QString.
*/
if (g_strcmp0(qdict_get_try_str(options, BDRV_OPT_READ_ONLY), "on") &&
!qdict_get_try_bool(options, BDRV_OPT_READ_ONLY, false)) {
flags |= (BDRV_O_RDWR | BDRV_O_ALLOW_RDWR);
} else {
flags &= ~BDRV_O_RDWR;
}
if (flags & BDRV_O_SNAPSHOT) {
snapshot_options = qdict_new();
bdrv_temp_snapshot_options(&snapshot_flags, snapshot_options,
flags, options);
/* Let bdrv_inherited_options() override "read-only" */
qdict_del(options, BDRV_OPT_READ_ONLY);
bdrv_inherited_options(BDRV_CHILD_COW, true,
&flags, options, flags, options);
}
bs->open_flags = flags;
bs->options = options;
options = qdict_clone_shallow(options);
/* Find the right image format driver */
/* See cautionary note on accessing @options above */
drvname = qdict_get_try_str(options, "driver");
if (drvname) {
drv = bdrv_find_format(drvname);
if (!drv) {
error_setg(errp, "Unknown driver: '%s'", drvname);
goto fail;
}
}
assert(drvname || !(flags & BDRV_O_PROTOCOL));
/* See cautionary note on accessing @options above */
backing = qdict_get_try_str(options, "backing");
if (qobject_to(QNull, qdict_get(options, "backing")) != NULL ||
(backing && *backing == '\0'))
{
if (backing) {
warn_report("Use of \"backing\": \"\" is deprecated; "
"use \"backing\": null instead");
}
flags |= BDRV_O_NO_BACKING;
qdict_del(bs->explicit_options, "backing");
qdict_del(bs->options, "backing");
qdict_del(options, "backing");
}
/* Open image file without format layer. This BlockBackend is only used for
* probing, the block drivers will do their own bdrv_open_child() for the
* same BDS, which is why we put the node name back into options. */
if ((flags & BDRV_O_PROTOCOL) == 0) {
BlockDriverState *file_bs;
file_bs = bdrv_open_child_bs(filename, options, "file", bs,
&child_of_bds, BDRV_CHILD_IMAGE,
true, &local_err);
if (local_err) {
goto fail;
}
if (file_bs != NULL) {
/* Not requesting BLK_PERM_CONSISTENT_READ because we're only
* looking at the header to guess the image format. This works even
* in cases where a guest would not see a consistent state. */
file = blk_new(bdrv_get_aio_context(file_bs), 0, BLK_PERM_ALL);
blk_insert_bs(file, file_bs, &local_err);
bdrv_unref(file_bs);
if (local_err) {
goto fail;
}
qdict_put_str(options, "file", bdrv_get_node_name(file_bs));
}
}
/* Image format probing */
bs->probed = !drv;
if (!drv && file) {
ret = find_image_format(file, filename, &drv, &local_err);
if (ret < 0) {
goto fail;
}
/*
* This option update would logically belong in bdrv_fill_options(),
* but we first need to open bs->file for the probing to work, while
* opening bs->file already requires the (mostly) final set of options
* so that cache mode etc. can be inherited.
*
* Adding the driver later is somewhat ugly, but it's not an option
* that would ever be inherited, so it's correct. We just need to make
* sure to update both bs->options (which has the full effective
* options for bs) and options (which has file.* already removed).
*/
qdict_put_str(bs->options, "driver", drv->format_name);
qdict_put_str(options, "driver", drv->format_name);
} else if (!drv) {
error_setg(errp, "Must specify either driver or file");
goto fail;
}
/* BDRV_O_PROTOCOL must be set iff a protocol BDS is about to be created */
assert(!!(flags & BDRV_O_PROTOCOL) == !!drv->bdrv_file_open);
/* file must be NULL if a protocol BDS is about to be created
* (the inverse results in an error message from bdrv_open_common()) */
assert(!(flags & BDRV_O_PROTOCOL) || !file);
/* Open the image */
ret = bdrv_open_common(bs, file, options, &local_err);
if (ret < 0) {
goto fail;
}
if (file) {
blk_unref(file);
file = NULL;
}
/* If there is a backing file, use it */
if ((flags & BDRV_O_NO_BACKING) == 0) {
ret = bdrv_open_backing_file(bs, options, "backing", &local_err);
if (ret < 0) {
goto close_and_fail;
}
}
/* Remove all children options and references
* from bs->options and bs->explicit_options */
QLIST_FOREACH(child, &bs->children, next) {
char *child_key_dot;
child_key_dot = g_strdup_printf("%s.", child->name);
qdict_extract_subqdict(bs->explicit_options, NULL, child_key_dot);
qdict_extract_subqdict(bs->options, NULL, child_key_dot);
qdict_del(bs->explicit_options, child->name);
qdict_del(bs->options, child->name);
g_free(child_key_dot);
}
/* Check if any unknown options were used */
if (qdict_size(options) != 0) {
const QDictEntry *entry = qdict_first(options);
if (flags & BDRV_O_PROTOCOL) {
error_setg(errp, "Block protocol '%s' doesn't support the option "
"'%s'", drv->format_name, entry->key);
} else {
error_setg(errp,
"Block format '%s' does not support the option '%s'",
drv->format_name, entry->key);
}
goto close_and_fail;
}
bdrv_parent_cb_change_media(bs, true);
qobject_unref(options);
options = NULL;
/* For snapshot=on, create a temporary qcow2 overlay. bs points to the
* temporary snapshot afterwards. */
if (snapshot_flags) {
BlockDriverState *snapshot_bs;
snapshot_bs = bdrv_append_temp_snapshot(bs, snapshot_flags,
snapshot_options, &local_err);
snapshot_options = NULL;
if (local_err) {
goto close_and_fail;
}
/* We are not going to return bs but the overlay on top of it
* (snapshot_bs); thus, we have to drop the strong reference to bs
* (which we obtained by calling bdrv_new()). bs will not be deleted,
* though, because the overlay still has a reference to it. */
bdrv_unref(bs);
bs = snapshot_bs;
}
return bs;
fail:
blk_unref(file);
qobject_unref(snapshot_options);
qobject_unref(bs->explicit_options);
qobject_unref(bs->options);
qobject_unref(options);
bs->options = NULL;
bs->explicit_options = NULL;
bdrv_unref(bs);
error_propagate(errp, local_err);
return NULL;
close_and_fail:
bdrv_unref(bs);
qobject_unref(snapshot_options);
qobject_unref(options);
error_propagate(errp, local_err);
return NULL;
}
BlockDriverState *bdrv_open(const char *filename, const char *reference,
QDict *options, int flags, Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_open_inherit(filename, reference, options, flags, NULL,
NULL, 0, errp);
}
/* Return true if the NULL-terminated @list contains @str */
static bool is_str_in_list(const char *str, const char *const *list)
{
if (str && list) {
int i;
for (i = 0; list[i] != NULL; i++) {
if (!strcmp(str, list[i])) {
return true;
}
}
}
return false;
}
/*
* Check that every option set in @bs->options is also set in
* @new_opts.
*
* Options listed in the common_options list and in
* @bs->drv->mutable_opts are skipped.
*
* Return 0 on success, otherwise return -EINVAL and set @errp.
*/
static int bdrv_reset_options_allowed(BlockDriverState *bs,
const QDict *new_opts, Error **errp)
{
const QDictEntry *e;
/* These options are common to all block drivers and are handled
* in bdrv_reopen_prepare() so they can be left out of @new_opts */
const char *const common_options[] = {
"node-name", "discard", "cache.direct", "cache.no-flush",
"read-only", "auto-read-only", "detect-zeroes", NULL
};
for (e = qdict_first(bs->options); e; e = qdict_next(bs->options, e)) {
if (!qdict_haskey(new_opts, e->key) &&
!is_str_in_list(e->key, common_options) &&
!is_str_in_list(e->key, bs->drv->mutable_opts)) {
error_setg(errp, "Option '%s' cannot be reset "
"to its default value", e->key);
return -EINVAL;
}
}
return 0;
}
/*
* Returns true if @child can be reached recursively from @bs
*/
static bool bdrv_recurse_has_child(BlockDriverState *bs,
BlockDriverState *child)
{
BdrvChild *c;
if (bs == child) {
return true;
}
QLIST_FOREACH(c, &bs->children, next) {
if (bdrv_recurse_has_child(c->bs, child)) {
return true;
}
}
return false;
}
/*
* Adds a BlockDriverState to a simple queue for an atomic, transactional
* reopen of multiple devices.
*
* bs_queue can either be an existing BlockReopenQueue that has had QTAILQ_INIT
* already performed, or alternatively may be NULL, in which case a new
* BlockReopenQueue will be created and initialized. This newly created
* BlockReopenQueue should be passed back in for subsequent calls that are
* intended to be of the same atomic 'set'.
*
* bs is the BlockDriverState to add to the reopen queue.
*
* options contains the changed options for the associated bs
* (the BlockReopenQueue takes ownership)
*
* flags contains the open flags for the associated bs
*
* returns a pointer to bs_queue, which is either the newly allocated
* bs_queue, or the existing bs_queue being used.
*
* bs must be drained between bdrv_reopen_queue() and bdrv_reopen_multiple().
*/
static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
BlockDriverState *bs,
QDict *options,
const BdrvChildClass *klass,
BdrvChildRole role,
bool parent_is_format,
QDict *parent_options,
int parent_flags,
bool keep_old_opts)
{
assert(bs != NULL);
BlockReopenQueueEntry *bs_entry;
BdrvChild *child;
QDict *old_options, *explicit_options, *options_copy;
int flags;
QemuOpts *opts;
/* Make sure that the caller remembered to use a drained section. This is
* important to avoid graph changes between the recursive queuing here and
* bdrv_reopen_multiple(). */
assert(bs->quiesce_counter > 0);
GLOBAL_STATE_CODE();
if (bs_queue == NULL) {
bs_queue = g_new0(BlockReopenQueue, 1);
QTAILQ_INIT(bs_queue);
}
if (!options) {
options = qdict_new();
}
/* Check if this BlockDriverState is already in the queue */
QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
if (bs == bs_entry->state.bs) {
break;
}
}
/*
* Precedence of options:
* 1. Explicitly passed in options (highest)
* 2. Retained from explicitly set options of bs
* 3. Inherited from parent node
* 4. Retained from effective options of bs
*/
/* Old explicitly set values (don't overwrite by inherited value) */
if (bs_entry || keep_old_opts) {
old_options = qdict_clone_shallow(bs_entry ?
bs_entry->state.explicit_options :
bs->explicit_options);
bdrv_join_options(bs, options, old_options);
qobject_unref(old_options);
}
explicit_options = qdict_clone_shallow(options);
/* Inherit from parent node */
if (parent_options) {
flags = 0;
klass->inherit_options(role, parent_is_format, &flags, options,
parent_flags, parent_options);
} else {
flags = bdrv_get_flags(bs);
}
if (keep_old_opts) {
/* Old values are used for options that aren't set yet */
old_options = qdict_clone_shallow(bs->options);
bdrv_join_options(bs, options, old_options);
qobject_unref(old_options);
}
/* We have the final set of options so let's update the flags */
options_copy = qdict_clone_shallow(options);
opts = qemu_opts_create(&bdrv_runtime_opts, NULL, 0, &error_abort);
qemu_opts_absorb_qdict(opts, options_copy, NULL);
update_flags_from_options(&flags, opts);
qemu_opts_del(opts);
qobject_unref(options_copy);
/* bdrv_open_inherit() sets and clears some additional flags internally */
flags &= ~BDRV_O_PROTOCOL;
if (flags & BDRV_O_RDWR) {
flags |= BDRV_O_ALLOW_RDWR;
}
if (!bs_entry) {
bs_entry = g_new0(BlockReopenQueueEntry, 1);
QTAILQ_INSERT_TAIL(bs_queue, bs_entry, entry);
} else {
qobject_unref(bs_entry->state.options);
qobject_unref(bs_entry->state.explicit_options);
}
bs_entry->state.bs = bs;
bs_entry->state.options = options;
bs_entry->state.explicit_options = explicit_options;
bs_entry->state.flags = flags;
/*
* If keep_old_opts is false then it means that unspecified
* options must be reset to their original value. We don't allow
* resetting 'backing' but we need to know if the option is
* missing in order to decide if we have to return an error.
*/
if (!keep_old_opts) {
bs_entry->state.backing_missing =
!qdict_haskey(options, "backing") &&
!qdict_haskey(options, "backing.driver");
}
QLIST_FOREACH(child, &bs->children, next) {
QDict *new_child_options = NULL;
bool child_keep_old = keep_old_opts;
/* reopen can only change the options of block devices that were
* implicitly created and that inherited their options from bs. For
* other (referenced) block devices, a syntax like "backing.foo"
* results in an error. */
if (child->bs->inherits_from != bs) {
continue;
}
/* Check if the options contain a child reference */
if (qdict_haskey(options, child->name)) {
const char *childref = qdict_get_try_str(options, child->name);
/*
* The current child must not be reopened if the child
* reference is null or points to a different node.
*/
if (g_strcmp0(childref, child->bs->node_name)) {
continue;
}
/*
* If the child reference points to the current child then
* reopen it with its existing set of options (note that
* it can still inherit new options from the parent).
*/
child_keep_old = true;
} else {
/* Extract child options ("child-name.*") */
char *child_key_dot = g_strdup_printf("%s.", child->name);
qdict_extract_subqdict(explicit_options, NULL, child_key_dot);
qdict_extract_subqdict(options, &new_child_options, child_key_dot);
g_free(child_key_dot);
}
bdrv_reopen_queue_child(bs_queue, child->bs, new_child_options,
child->klass, child->role, bs->drv->is_format,
options, flags, child_keep_old);
}
return bs_queue;
}
BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
BlockDriverState *bs,
QDict *options, bool keep_old_opts)
{
GLOBAL_STATE_CODE();
return bdrv_reopen_queue_child(bs_queue, bs, options, NULL, 0, false,
NULL, 0, keep_old_opts);
}
void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
{
GLOBAL_STATE_CODE();
if (bs_queue) {
BlockReopenQueueEntry *bs_entry, *next;
QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
qobject_unref(bs_entry->state.explicit_options);
qobject_unref(bs_entry->state.options);
g_free(bs_entry);
}
g_free(bs_queue);
}
}
/*
* Reopen multiple BlockDriverStates atomically & transactionally.
*
* The queue passed in (bs_queue) must have been built up previously
* via bdrv_reopen_queue().
*
* Reopens all BDS specified in the queue, with the appropriate
* flags. All devices are prepared for reopen, and failure of any
* device will cause all device changes to be abandoned, and intermediate
* data cleaned up.
*
* If all devices prepare successfully, then the changes are committed
* to all devices.
*
* All affected nodes must be drained between bdrv_reopen_queue() and
* bdrv_reopen_multiple().
*
* To be called from the main thread, with all other AioContexts unlocked.
*/
int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
{
int ret = -1;
BlockReopenQueueEntry *bs_entry, *next;
AioContext *ctx;
Transaction *tran = tran_new();
g_autoptr(GHashTable) found = NULL;
g_autoptr(GSList) refresh_list = NULL;
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
assert(bs_queue != NULL);
GLOBAL_STATE_CODE();
QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
ctx = bdrv_get_aio_context(bs_entry->state.bs);
aio_context_acquire(ctx);
ret = bdrv_flush(bs_entry->state.bs);
aio_context_release(ctx);
if (ret < 0) {
error_setg_errno(errp, -ret, "Error flushing drive");
goto abort;
}
}
QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
assert(bs_entry->state.bs->quiesce_counter > 0);
ctx = bdrv_get_aio_context(bs_entry->state.bs);
aio_context_acquire(ctx);
ret = bdrv_reopen_prepare(&bs_entry->state, bs_queue, tran, errp);
aio_context_release(ctx);
if (ret < 0) {
goto abort;
}
bs_entry->prepared = true;
}
found = g_hash_table_new(NULL, NULL);
QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
BDRVReopenState *state = &bs_entry->state;
refresh_list = bdrv_topological_dfs(refresh_list, found, state->bs);
if (state->old_backing_bs) {
refresh_list = bdrv_topological_dfs(refresh_list, found,
state->old_backing_bs);
}
if (state->old_file_bs) {
refresh_list = bdrv_topological_dfs(refresh_list, found,
state->old_file_bs);
}
}
/*
* Note that the file-posix driver relies on the permission update done during
* reopen (even if no permission changed), because it wants "new" permissions
* for reconfiguring the fd, and that's why it does it in raw_check_perm(), not
* in raw_reopen_prepare(), which is called with "old" permissions.
*/
ret = bdrv_list_refresh_perms(refresh_list, bs_queue, tran, errp);
if (ret < 0) {
goto abort;
}
/*
* If we reach this point, we have success and just need to apply the
* changes.
*
* Reverse order is used to accommodate the qcow2 driver: on commit it needs
* to write the IN_USE flag to the image, to mark bitmaps in the image as
* invalid. Children usually come after parents in the reopen queue, so go
* from the last to the first element.
*/
QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) {
ctx = bdrv_get_aio_context(bs_entry->state.bs);
aio_context_acquire(ctx);
bdrv_reopen_commit(&bs_entry->state);
aio_context_release(ctx);
}
tran_commit(tran);
QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) {
BlockDriverState *bs = bs_entry->state.bs;
if (bs->drv->bdrv_reopen_commit_post) {
ctx = bdrv_get_aio_context(bs);
aio_context_acquire(ctx);
bs->drv->bdrv_reopen_commit_post(&bs_entry->state);
aio_context_release(ctx);
}
}
ret = 0;
goto cleanup;
abort:
tran_abort(tran);
QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
if (bs_entry->prepared) {
ctx = bdrv_get_aio_context(bs_entry->state.bs);
aio_context_acquire(ctx);
bdrv_reopen_abort(&bs_entry->state);
aio_context_release(ctx);
}
}
cleanup:
bdrv_reopen_queue_free(bs_queue);
return ret;
}
int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
Error **errp)
{
AioContext *ctx = bdrv_get_aio_context(bs);
BlockReopenQueue *queue;
int ret;
GLOBAL_STATE_CODE();
bdrv_subtree_drained_begin(bs);
if (ctx != qemu_get_aio_context()) {
aio_context_release(ctx);
}
queue = bdrv_reopen_queue(NULL, bs, opts, keep_old_opts);
ret = bdrv_reopen_multiple(queue, errp);
if (ctx != qemu_get_aio_context()) {
aio_context_acquire(ctx);
}
bdrv_subtree_drained_end(bs);
return ret;
}
int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
Error **errp)
{
QDict *opts = qdict_new();
GLOBAL_STATE_CODE();
qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
return bdrv_reopen(bs, opts, true, errp);
}
/*
* Take a BDRVReopenState and check if the value of 'backing' (or 'file' if
* @is_backing is false) in the reopen_state->options QDict is valid or not.
*
* If the option is missing from the QDict then return 0.
*
* If the option contains the node name of the current child of
* reopen_state->bs then return 0.
*
* If the option contains a different node name (or 'backing' is null) then
* check whether the current child can be replaced with the new one.
* If that's the case then the old child is recorded in
* reopen_state->old_backing_bs (or reopen_state->old_file_bs) and the
* replacement is staged in @tran.
*
* Return 0 on success, otherwise return < 0 and set @errp.
*/
static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
bool is_backing, Transaction *tran,
Error **errp)
{
BlockDriverState *bs = reopen_state->bs;
BlockDriverState *new_child_bs;
BlockDriverState *old_child_bs = is_backing ? child_bs(bs->backing) :
child_bs(bs->file);
const char *child_name = is_backing ? "backing" : "file";
QObject *value;
const char *str;
GLOBAL_STATE_CODE();
value = qdict_get(reopen_state->options, child_name);
if (value == NULL) {
return 0;
}
switch (qobject_type(value)) {
case QTYPE_QNULL:
assert(is_backing); /* The 'file' option does not allow a null value */
new_child_bs = NULL;
break;
case QTYPE_QSTRING:
str = qstring_get_str(qobject_to(QString, value));
new_child_bs = bdrv_lookup_bs(NULL, str, errp);
if (new_child_bs == NULL) {
return -EINVAL;
} else if (bdrv_recurse_has_child(new_child_bs, bs)) {
error_setg(errp, "Making '%s' a %s child of '%s' would create a "
"cycle", str, child_name, bs->node_name);
return -EINVAL;
}
break;
default:
/*
* The options QDict has been flattened, so 'backing' and 'file'
* do not allow any other data type here.
*/
g_assert_not_reached();
}
if (old_child_bs == new_child_bs) {
return 0;
}
if (old_child_bs) {
if (bdrv_skip_implicit_filters(old_child_bs) == new_child_bs) {
return 0;
}
if (old_child_bs->implicit) {
error_setg(errp, "Cannot replace implicit %s child of %s",
child_name, bs->node_name);
return -EPERM;
}
}
if (bs->drv->is_filter && !old_child_bs) {
/*
* Filters always have a file or a backing child, so we are trying to
* change the wrong child
*/
error_setg(errp, "'%s' is a %s filter node that does not support a "
"%s child", bs->node_name, bs->drv->format_name, child_name);
return -EINVAL;
}
if (is_backing) {
reopen_state->old_backing_bs = old_child_bs;
} else {
reopen_state->old_file_bs = old_child_bs;
}
return bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
tran, errp);
}
/*
* Prepares a BlockDriverState for reopen. All changes are staged in the
* 'opaque' field of the BDRVReopenState, which is used and allocated by
* the block driver layer .bdrv_reopen_prepare()
*
* reopen_state->bs is the BlockDriverState to reopen
* reopen_state->flags are the new open flags
* queue is the reopen queue
*
* Returns 0 on success, non-zero on error. On error errp will be set
* as well.
*
* On failure, bdrv_reopen_abort() will be called to clean up any data.
* It is the responsibility of the caller to then call the abort() or
* commit() for any other BDS that have been left in a prepare() state
*
*/
static int bdrv_reopen_prepare(BDRVReopenState *reopen_state,
BlockReopenQueue *queue,
Transaction *change_child_tran, Error **errp)
{
int ret = -1;
int old_flags;
Error *local_err = NULL;
BlockDriver *drv;
QemuOpts *opts;
QDict *orig_reopen_opts;
char *discard = NULL;
bool read_only;
bool drv_prepared = false;
assert(reopen_state != NULL);
assert(reopen_state->bs->drv != NULL);
GLOBAL_STATE_CODE();
drv = reopen_state->bs->drv;
/* This function and each driver's bdrv_reopen_prepare() remove
* entries from reopen_state->options as they are processed, so
* we need to make a copy of the original QDict. */
orig_reopen_opts = qdict_clone_shallow(reopen_state->options);
/* Process generic block layer options */
opts = qemu_opts_create(&bdrv_runtime_opts, NULL, 0, &error_abort);
if (!qemu_opts_absorb_qdict(opts, reopen_state->options, errp)) {
ret = -EINVAL;
goto error;
}
/* This was already called in bdrv_reopen_queue_child() so the flags
* are up-to-date. This time we simply want to remove the options from
* QemuOpts in order to indicate that they have been processed. */
old_flags = reopen_state->flags;
update_flags_from_options(&reopen_state->flags, opts);
assert(old_flags == reopen_state->flags);
discard = qemu_opt_get_del(opts, BDRV_OPT_DISCARD);
if (discard != NULL) {
if (bdrv_parse_discard_flags(discard, &reopen_state->flags) != 0) {
error_setg(errp, "Invalid discard option");
ret = -EINVAL;
goto error;
}
}
reopen_state->detect_zeroes =
bdrv_parse_detect_zeroes(opts, reopen_state->flags, &local_err);
if (local_err) {
error_propagate(errp, local_err);
ret = -EINVAL;
goto error;
}
/* All other options (including node-name and driver) must be unchanged.
* Put them back into the QDict, so that they are checked at the end
* of this function. */
qemu_opts_to_qdict(opts, reopen_state->options);
/* If we are to stay read-only, do not allow permission change
* to r/w. Attempting to set to r/w may fail if either BDRV_O_ALLOW_RDWR is
* not set, or if the BDS still has copy_on_read enabled */
read_only = !(reopen_state->flags & BDRV_O_RDWR);
ret = bdrv_can_set_read_only(reopen_state->bs, read_only, true, &local_err);
if (local_err) {
error_propagate(errp, local_err);
goto error;
}
if (drv->bdrv_reopen_prepare) {
/*
* If a driver-specific option is missing, it means that we
* should reset it to its default value.
* But not all options allow that, so we need to check it first.
*/
ret = bdrv_reset_options_allowed(reopen_state->bs,
reopen_state->options, errp);
if (ret) {
goto error;
}
ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);
if (ret) {
if (local_err != NULL) {
error_propagate(errp, local_err);
} else {
bdrv_refresh_filename(reopen_state->bs);
error_setg(errp, "failed while preparing to reopen image '%s'",
reopen_state->bs->filename);
}
goto error;
}
} else {
/* It is currently mandatory to have a bdrv_reopen_prepare()
* handler for each supported drv. */
error_setg(errp, "Block format '%s' used by node '%s' "
"does not support reopening files", drv->format_name,
bdrv_get_device_or_node_name(reopen_state->bs));
ret = -1;
goto error;
}
drv_prepared = true;
/*
* We must provide the 'backing' option if the BDS has a backing
* file or if the image file has a backing file name as part of
* its metadata. Otherwise the 'backing' option can be omitted.
*/
if (drv->supports_backing && reopen_state->backing_missing &&
(reopen_state->bs->backing || reopen_state->bs->backing_file[0])) {
error_setg(errp, "backing is missing for '%s'",
reopen_state->bs->node_name);
ret = -EINVAL;
goto error;
}
/*
* Allow changing the 'backing' option. The new value can be
* either a reference to an existing node (using its node name)
* or NULL to simply detach the current backing file.
*/
ret = bdrv_reopen_parse_file_or_backing(reopen_state, true,
change_child_tran, errp);
if (ret < 0) {
goto error;
}
qdict_del(reopen_state->options, "backing");
/* Allow changing the 'file' option. In this case NULL is not allowed */
ret = bdrv_reopen_parse_file_or_backing(reopen_state, false,
change_child_tran, errp);
if (ret < 0) {
goto error;
}
qdict_del(reopen_state->options, "file");
/* Options that are not handled are only okay if they are unchanged
* compared to the old state. It is expected that some options are only
* used for the initial open, but not reopen (e.g. filename) */
if (qdict_size(reopen_state->options)) {
const QDictEntry *entry = qdict_first(reopen_state->options);
do {
QObject *new = entry->value;
QObject *old = qdict_get(reopen_state->bs->options, entry->key);
/* Allow child references (child_name=node_name) as long as they
* point to the current child (i.e. everything stays the same). */
if (qobject_type(new) == QTYPE_QSTRING) {
BdrvChild *child;
QLIST_FOREACH(child, &reopen_state->bs->children, next) {
if (!strcmp(child->name, entry->key)) {
break;
}
}
if (child) {
if (!strcmp(child->bs->node_name,
qstring_get_str(qobject_to(QString, new)))) {
continue; /* Found child with this name, skip option */
}
}
}
/*
* TODO: When using -drive to specify blockdev options, all values
* will be strings; however, when using -blockdev, blockdev-add or
* filenames using the json:{} pseudo-protocol, they will be
* correctly typed.
* In contrast, reopening options are (currently) always strings
* (because you can only specify them through qemu-io; all other
* callers do not specify any options).
* Therefore, when using anything other than -drive to create a BDS,
* this cannot detect non-string options as unchanged, because
* qobject_is_equal() always returns false for objects of different
* type. In the future, this should be remedied by correctly typing
* all options. For now, this is not too big of an issue because
* the user can simply omit options which cannot be changed anyway,
* so they will stay unchanged.
*/
if (!qobject_is_equal(new, old)) {
error_setg(errp, "Cannot change the option '%s'", entry->key);
ret = -EINVAL;
goto error;
}
} while ((entry = qdict_next(reopen_state->options, entry)));
}
ret = 0;
/* Restore the original reopen_state->options QDict */
qobject_unref(reopen_state->options);
reopen_state->options = qobject_ref(orig_reopen_opts);
error:
if (ret < 0 && drv_prepared) {
/* drv->bdrv_reopen_prepare() has succeeded, so we need to
* call drv->bdrv_reopen_abort() before signaling an error
* (bdrv_reopen_multiple() will not call bdrv_reopen_abort()
* when the respective bdrv_reopen_prepare() has failed) */
if (drv->bdrv_reopen_abort) {
drv->bdrv_reopen_abort(reopen_state);
}
}
qemu_opts_del(opts);
qobject_unref(orig_reopen_opts);
g_free(discard);
return ret;
}
/*
* Takes the staged changes for the reopen from bdrv_reopen_prepare(), and
* makes them final by swapping the staging BlockDriverState contents into
* the active BlockDriverState contents.
*/
static void bdrv_reopen_commit(BDRVReopenState *reopen_state)
{
BlockDriver *drv;
BlockDriverState *bs;
BdrvChild *child;
assert(reopen_state != NULL);
bs = reopen_state->bs;
drv = bs->drv;
assert(drv != NULL);
GLOBAL_STATE_CODE();
/* If there are any driver level actions to take */
if (drv->bdrv_reopen_commit) {
drv->bdrv_reopen_commit(reopen_state);
}
/* set BDS specific flags now */
qobject_unref(bs->explicit_options);
qobject_unref(bs->options);
qobject_ref(reopen_state->explicit_options);
qobject_ref(reopen_state->options);
bs->explicit_options = reopen_state->explicit_options;
bs->options = reopen_state->options;
bs->open_flags = reopen_state->flags;
bs->detect_zeroes = reopen_state->detect_zeroes;
/* Remove child references from bs->options and bs->explicit_options.
* Child options were already removed in bdrv_reopen_queue_child() */
QLIST_FOREACH(child, &bs->children, next) {
qdict_del(bs->explicit_options, child->name);
qdict_del(bs->options, child->name);
}
/* The backing child may already be removed, so the loop above misses it */
qdict_del(bs->explicit_options, "backing");
qdict_del(bs->options, "backing");
bdrv_refresh_limits(bs, NULL, NULL);
}
/*
* Abort the reopen, and delete and free the staged changes in
* reopen_state
*/
static void bdrv_reopen_abort(BDRVReopenState *reopen_state)
{
BlockDriver *drv;
assert(reopen_state != NULL);
drv = reopen_state->bs->drv;
assert(drv != NULL);
GLOBAL_STATE_CODE();
if (drv->bdrv_reopen_abort) {
drv->bdrv_reopen_abort(reopen_state);
}
}
static void bdrv_close(BlockDriverState *bs)
{
BdrvAioNotifier *ban, *ban_next;
BdrvChild *child, *next;
GLOBAL_STATE_CODE();
assert(!bs->refcnt);
bdrv_drained_begin(bs); /* complete I/O */
bdrv_flush(bs);
bdrv_drain(bs); /* in case flush left pending I/O */
if (bs->drv) {
if (bs->drv->bdrv_close) {
/* Must unfreeze all children, so bdrv_unref_child() works */
bs->drv->bdrv_close(bs);
}
bs->drv = NULL;
}
QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
bdrv_unref_child(bs, child);
}
bs->backing = NULL;
bs->file = NULL;
g_free(bs->opaque);
bs->opaque = NULL;
qatomic_set(&bs->copy_on_read, 0);
bs->backing_file[0] = '\0';
bs->backing_format[0] = '\0';
bs->total_sectors = 0;
bs->encrypted = false;
bs->sg = false;
qobject_unref(bs->options);
qobject_unref(bs->explicit_options);
bs->options = NULL;
bs->explicit_options = NULL;
qobject_unref(bs->full_open_options);
bs->full_open_options = NULL;
g_free(bs->block_status_cache);
bs->block_status_cache = NULL;
bdrv_release_named_dirty_bitmaps(bs);
assert(QLIST_EMPTY(&bs->dirty_bitmaps));
QLIST_FOREACH_SAFE(ban, &bs->aio_notifiers, list, ban_next) {
g_free(ban);
}
QLIST_INIT(&bs->aio_notifiers);
bdrv_drained_end(bs);
/*
* If we're still inside some bdrv_drain_all_begin()/end() sections, end
* them now since this BDS won't exist anymore when bdrv_drain_all_end()
* gets called.
*/
if (bs->quiesce_counter) {
bdrv_drain_all_end_quiesce(bs);
}
}
void bdrv_close_all(void)
{
assert(job_next(NULL) == NULL);
GLOBAL_STATE_CODE();
/* Drop references from requests still in flight, such as canceled block
* jobs whose AIO context has not been polled yet */
bdrv_drain_all();
blk_remove_all_bs();
blockdev_close_all_bdrv_states();
assert(QTAILQ_EMPTY(&all_bdrv_states));
}
static bool should_update_child(BdrvChild *c, BlockDriverState *to)
{
GQueue *queue;
GHashTable *found;
bool ret;
if (c->klass->stay_at_node) {
return false;
}
/* If the child @c belongs to the BDS @to, replacing the current
* c->bs by @to would create a loop.
*
* Such a case occurs when appending a BDS to a backing chain.
* For instance, imagine the following chain:
*
* guest device -> node A -> further backing chain...
*
* Now we create a new BDS B which we want to put on top of this
* chain, so we first attach A as its backing node:
*
* node B
* |
* v
* guest device -> node A -> further backing chain...
*
* Finally we want to replace A by B. When doing that, we want to
* replace all pointers to A by pointers to B -- except for the
* pointer from B because (1) that would create a loop, and (2)
* that pointer should simply stay intact:
*
* guest device -> node B
* |
* v
* node A -> further backing chain...
*
* In general, when replacing a node A (c->bs) by a node B (@to),
* if A is a child of B, that means we cannot replace A by B there
* because that would create a loop. Silently detaching A from B
* is also not really an option. So overall just leaving A in
* place there is the most sensible choice.
*
* We would also create a loop in any case where @c is only
* indirectly referenced by @to. Prevent this by returning false
* if @c is found (by breadth-first search) anywhere in the whole
* subtree of @to.
*/
ret = true;
found = g_hash_table_new(NULL, NULL);
g_hash_table_add(found, to);
queue = g_queue_new();
g_queue_push_tail(queue, to);
while (!g_queue_is_empty(queue)) {
BlockDriverState *v = g_queue_pop_head(queue);
BdrvChild *c2;
QLIST_FOREACH(c2, &v->children, next) {
if (c2 == c) {
ret = false;
break;
}
if (g_hash_table_contains(found, c2->bs)) {
continue;
}
g_queue_push_tail(queue, c2->bs);
g_hash_table_add(found, c2->bs);
}
}
g_queue_free(queue);
g_hash_table_destroy(found);
return ret;
}
typedef struct BdrvRemoveFilterOrCowChild {
BdrvChild *child;
BlockDriverState *bs;
bool is_backing;
} BdrvRemoveFilterOrCowChild;
static void bdrv_remove_filter_or_cow_child_abort(void *opaque)
{
BdrvRemoveFilterOrCowChild *s = opaque;
BlockDriverState *parent_bs = s->child->opaque;
if (s->is_backing) {
parent_bs->backing = s->child;
} else {
parent_bs->file = s->child;
}
/*
* We don't have to restore child->bs here to undo bdrv_replace_child_tran()
* because that function is transactionable and has registered its own
* completion entries in @tran, so its .abort() will be called
* automatically.
*/
}
static void bdrv_remove_filter_or_cow_child_commit(void *opaque)
{
BdrvRemoveFilterOrCowChild *s = opaque;
GLOBAL_STATE_CODE();
bdrv_child_free(s->child);
}
static void bdrv_remove_filter_or_cow_child_clean(void *opaque)
{
BdrvRemoveFilterOrCowChild *s = opaque;
/* Drop the bs reference after the transaction is done */
bdrv_unref(s->bs);
g_free(s);
}
static TransactionActionDrv bdrv_remove_filter_or_cow_child_drv = {
.abort = bdrv_remove_filter_or_cow_child_abort,
.commit = bdrv_remove_filter_or_cow_child_commit,
.clean = bdrv_remove_filter_or_cow_child_clean,
};
/*
* Remove the backing or file child of @bs.
* This function does not update permissions; the caller is responsible for
* doing so.
*/
static void bdrv_remove_file_or_backing_child(BlockDriverState *bs,
BdrvChild *child,
Transaction *tran)
{
BdrvChild **childp;
BdrvRemoveFilterOrCowChild *s;
if (!child) {
return;
}
/*
* Keep a reference to @bs so @childp will stay valid throughout the
* transaction (required by bdrv_replace_child_tran())
*/
bdrv_ref(bs);
if (child == bs->backing) {
childp = &bs->backing;
} else if (child == bs->file) {
childp = &bs->file;
} else {
g_assert_not_reached();
}
if (child->bs) {
/*
* Pass free_empty_child=false, we will free the child in
* bdrv_remove_filter_or_cow_child_commit()
*/
bdrv_replace_child_tran(childp, NULL, tran, false);
}
s = g_new(BdrvRemoveFilterOrCowChild, 1);
*s = (BdrvRemoveFilterOrCowChild) {
.child = child,
.bs = bs,
.is_backing = (childp == &bs->backing),
};
tran_add(tran, &bdrv_remove_filter_or_cow_child_drv, s);
}
/*
* Remove the backing-chain child of @bs, if it exists: the COW child for
* format nodes (always .backing) and the filter child for filters (may be
* .file or .backing).
*/
static void bdrv_remove_filter_or_cow_child(BlockDriverState *bs,
Transaction *tran)
{
bdrv_remove_file_or_backing_child(bs, bdrv_filter_or_cow_child(bs), tran);
}
static int bdrv_replace_node_noperm(BlockDriverState *from,
BlockDriverState *to,
bool auto_skip, Transaction *tran,
Error **errp)
{
BdrvChild *c, *next;
assert(to != NULL);
GLOBAL_STATE_CODE();
QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
assert(c->bs == from);
if (!should_update_child(c, to)) {
if (auto_skip) {
continue;
}
error_setg(errp, "Should not change '%s' link to '%s'",
c->name, from->node_name);
return -EINVAL;
}
if (c->frozen) {
error_setg(errp, "Cannot change '%s' link to '%s'",
c->name, from->node_name);
return -EPERM;
}
/*
* Passing a pointer to the local variable @c is fine here, because
* @to is not NULL, and so &c will not be attached to the transaction.
*/
bdrv_replace_child_tran(&c, to, tran, true);
}
return 0;
}
/*
* With auto_skip=true, bdrv_replace_node_common() skips updating any parent
* of @from if the update would create a parent-child relation loop or if the
* parent is a block job.
*
* With auto_skip=false, an error is returned if @from has a parent which
* should not be updated.
*
* With @detach_subchain=true, @to must be in the backing chain of @from. In
* this case the backing link of the cow-parent of @to is removed.
*
* @to must not be NULL.
*/
static int bdrv_replace_node_common(BlockDriverState *from,
BlockDriverState *to,
bool auto_skip, bool detach_subchain,
Error **errp)
{
Transaction *tran = tran_new();
g_autoptr(GHashTable) found = NULL;
g_autoptr(GSList) refresh_list = NULL;
BlockDriverState *to_cow_parent = NULL;
int ret;
GLOBAL_STATE_CODE();
assert(to != NULL);
if (detach_subchain) {
assert(bdrv_chain_contains(from, to));
assert(from != to);
for (to_cow_parent = from;
bdrv_filter_or_cow_bs(to_cow_parent) != to;
to_cow_parent = bdrv_filter_or_cow_bs(to_cow_parent))
{
;
}
}
/* Make sure that @from doesn't go away until we have successfully attached
* all of its parents to @to. */
bdrv_ref(from);
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
bdrv_drained_begin(from);
/*
* Do the replacement without updating permissions.
* The replacement may influence the permissions, so we must calculate the
* new permissions based on the new graph. If that fails, we roll back the
* replacement.
*/
ret = bdrv_replace_node_noperm(from, to, auto_skip, tran, errp);
if (ret < 0) {
goto out;
}
if (detach_subchain) {
bdrv_remove_filter_or_cow_child(to_cow_parent, tran);
}
found = g_hash_table_new(NULL, NULL);
refresh_list = bdrv_topological_dfs(refresh_list, found, to);
refresh_list = bdrv_topological_dfs(refresh_list, found, from);
ret = bdrv_list_refresh_perms(refresh_list, NULL, tran, errp);
if (ret < 0) {
goto out;
}
ret = 0;
out:
tran_finalize(tran, ret);
bdrv_drained_end(from);
bdrv_unref(from);
return ret;
}
/**
* Replace node @from by @to (where neither may be NULL).
*/
int bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,
Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_replace_node_common(from, to, true, false, errp);
}
int bdrv_drop_filter(BlockDriverState *bs, Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_replace_node_common(bs, bdrv_filter_or_cow_bs(bs), true, true,
errp);
}
/*
* Add new bs contents at the top of an image chain while the chain is
* live, while keeping required fields on the top layer.
*
* This will modify the BlockDriverState fields, and swap contents
* between bs_new and bs_top. Both bs_new and bs_top are modified.
*
* bs_new must not be attached to a BlockBackend and must not have a backing
* child.
*
* This function does not create any image files.
*/
int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
Error **errp)
{
int ret;
Transaction *tran = tran_new();
GLOBAL_STATE_CODE();
assert(!bs_new->backing);
ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
&child_of_bds, bdrv_backing_role(bs_new),
&bs_new->backing, tran, errp);
if (ret < 0) {
goto out;
}
ret = bdrv_replace_node_noperm(bs_top, bs_new, true, tran, errp);
if (ret < 0) {
goto out;
}
ret = bdrv_refresh_perms(bs_new, errp);
out:
tran_finalize(tran, ret);
bdrv_refresh_limits(bs_top, NULL, NULL);
return ret;
}
/* Not for empty child */
int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
Error **errp)
{
int ret;
Transaction *tran = tran_new();
g_autoptr(GHashTable) found = NULL;
g_autoptr(GSList) refresh_list = NULL;
BlockDriverState *old_bs = child->bs;
GLOBAL_STATE_CODE();
bdrv_ref(old_bs);
bdrv_drained_begin(old_bs);
bdrv_drained_begin(new_bs);
bdrv_replace_child_tran(&child, new_bs, tran, true);
/* @new_bs must have been non-NULL, so @child must not have been freed */
assert(child != NULL);
found = g_hash_table_new(NULL, NULL);
refresh_list = bdrv_topological_dfs(refresh_list, found, old_bs);
refresh_list = bdrv_topological_dfs(refresh_list, found, new_bs);
ret = bdrv_list_refresh_perms(refresh_list, NULL, tran, errp);
tran_finalize(tran, ret);
bdrv_drained_end(old_bs);
bdrv_drained_end(new_bs);
bdrv_unref(old_bs);
return ret;
}
static void bdrv_delete(BlockDriverState *bs)
{
assert(bdrv_op_blocker_is_empty(bs));
assert(!bs->refcnt);
GLOBAL_STATE_CODE();
/* remove from list, if necessary */
if (bs->node_name[0] != '\0') {
QTAILQ_REMOVE(&graph_bdrv_states, bs, node_list);
}
QTAILQ_REMOVE(&all_bdrv_states, bs, bs_list);
bdrv_close(bs);
g_free(bs);
}
/*
* Replace @bs by a newly created block node.
*
* @options is a QDict of options to pass to the block drivers, or NULL for an
* empty set of options. The reference to the QDict belongs to the block layer
* after the call (even on failure), so if the caller intends to reuse the
* dictionary, it needs to use qobject_ref() before calling this function.
*/
BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
int flags, Error **errp)
{
ERRP_GUARD();
int ret;
BlockDriverState *new_node_bs = NULL;
const char *drvname, *node_name;
BlockDriver *drv;
drvname = qdict_get_try_str(options, "driver");
if (!drvname) {
error_setg(errp, "driver is not specified");
goto fail;
}
drv = bdrv_find_format(drvname);
if (!drv) {
error_setg(errp, "Unknown driver: '%s'", drvname);
goto fail;
}
node_name = qdict_get_try_str(options, "node-name");
GLOBAL_STATE_CODE();
new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options, flags,
errp);
options = NULL; /* bdrv_new_open_driver_opts() eats options */
if (!new_node_bs) {
error_prepend(errp, "Could not create node: ");
goto fail;
}
bdrv_drained_begin(bs);
ret = bdrv_replace_node(bs, new_node_bs, errp);
bdrv_drained_end(bs);
if (ret < 0) {
error_prepend(errp, "Could not replace node: ");
goto fail;
}
return new_node_bs;
fail:
qobject_unref(options);
bdrv_unref(new_node_bs);
return NULL;
}
/*
* Run consistency checks on an image
*
* Returns 0 if the check could be completed (it doesn't mean that the image is
* free of errors) or -errno when an internal error occurred. The results of the
* check are stored in res.
*/
int coroutine_fn bdrv_co_check(BlockDriverState *bs,
BdrvCheckResult *res, BdrvCheckMode fix)
{
IO_CODE();
if (bs->drv == NULL) {
return -ENOMEDIUM;
}
if (bs->drv->bdrv_co_check == NULL) {
return -ENOTSUP;
}
memset(res, 0, sizeof(*res));
return bs->drv->bdrv_co_check(bs, res, fix);
}
/*
* Return values:
* 0 - success
* -EINVAL - backing format specified, but no file; or backing file
* specified without a format while @require is set
* -ENOMEDIUM - @bs has no driver
* -ENOSPC - can't update the backing file because no space is left in the
* image file header
* -ENOTSUP - format driver doesn't support changing the backing file
*/
int bdrv_change_backing_file(BlockDriverState *bs, const char *backing_file,
const char *backing_fmt, bool require)
{
BlockDriver *drv = bs->drv;
int ret;
GLOBAL_STATE_CODE();
if (!drv) {
return -ENOMEDIUM;
}
/* Backing file format doesn't make sense without a backing file */
if (backing_fmt && !backing_file) {
return -EINVAL;
}
if (require && backing_file && !backing_fmt) {
return -EINVAL;
}
if (drv->bdrv_change_backing_file != NULL) {
ret = drv->bdrv_change_backing_file(bs, backing_file, backing_fmt);
} else {
ret = -ENOTSUP;
}
if (ret == 0) {
pstrcpy(bs->backing_file, sizeof(bs->backing_file), backing_file ?: "");
pstrcpy(bs->backing_format, sizeof(bs->backing_format), backing_fmt ?: "");
pstrcpy(bs->auto_backing_file, sizeof(bs->auto_backing_file),
backing_file ?: "");
}
return ret;
}
/*
* Finds the first non-filter node above bs in the chain between
* active and bs. The returned node is either an immediate parent of
* bs, or there are only filter nodes between the two.
*
* Returns NULL if bs is not found in active's image chain,
* or if active == bs.
*
* Returns the bottommost base image if bs == NULL.
*/
BlockDriverState *bdrv_find_overlay(BlockDriverState *active,
BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
bs = bdrv_skip_filters(bs);
active = bdrv_skip_filters(active);
while (active) {
BlockDriverState *next = bdrv_backing_chain_next(active);
if (bs == next) {
return active;
}
active = next;
}
return NULL;
}
/* Given a BDS, searches for the base layer. */
BlockDriverState *bdrv_find_base(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
return bdrv_find_overlay(bs, NULL);
}
/*
* Return true if at least one of the COW (backing) and filter links
* between @bs and @base is frozen. @errp is set if that's the case.
* @base must be reachable from @bs, or NULL.
*/
bool bdrv_is_backing_chain_frozen(BlockDriverState *bs, BlockDriverState *base,
Error **errp)
{
BlockDriverState *i;
BdrvChild *child;
GLOBAL_STATE_CODE();
for (i = bs; i != base; i = child_bs(child)) {
child = bdrv_filter_or_cow_child(i);
if (child && child->frozen) {
error_setg(errp, "Cannot change '%s' link from '%s' to '%s'",
child->name, i->node_name, child->bs->node_name);
return true;
}
}
return false;
}
/*
* Freeze all COW (backing) and filter links between @bs and @base.
* If any of the links is already frozen the operation is aborted and
* none of the links are modified.
* @base must be reachable from @bs, or NULL.
* Returns 0 on success. On failure returns < 0 and sets @errp.
*/
int bdrv_freeze_backing_chain(BlockDriverState *bs, BlockDriverState *base,
Error **errp)
{
BlockDriverState *i;
BdrvChild *child;
GLOBAL_STATE_CODE();
if (bdrv_is_backing_chain_frozen(bs, base, errp)) {
return -EPERM;
}
for (i = bs; i != base; i = child_bs(child)) {
child = bdrv_filter_or_cow_child(i);
if (child && child->bs->never_freeze) {
error_setg(errp, "Cannot freeze '%s' link to '%s'",
child->name, child->bs->node_name);
return -EPERM;
}
}
for (i = bs; i != base; i = child_bs(child)) {
child = bdrv_filter_or_cow_child(i);
if (child) {
child->frozen = true;
}
}
return 0;
}
/*
* Unfreeze all COW (backing) and filter links between @bs and @base.
* The caller must ensure that all links are frozen before using this
* function.
* @base must be reachable from @bs, or NULL.
*/
void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base)
{
BlockDriverState *i;
BdrvChild *child;
GLOBAL_STATE_CODE();
for (i = bs; i != base; i = child_bs(child)) {
child = bdrv_filter_or_cow_child(i);
if (child) {
assert(child->frozen);
child->frozen = false;
}
}
}
/*
* Drops images above 'base' up to and including 'top', and sets the image
* above 'top' to have base as its backing file.
*
* Requires that the overlay to 'top' is opened r/w, so that the backing file
* information in that overlay can be properly updated.
*
* E.g., this will convert the following chain:
* bottom <- base <- intermediate <- top <- active
*
* to
*
* bottom <- base <- active
*
* It is allowed for bottom==base, in which case it converts:
*
* base <- intermediate <- top <- active
*
* to
*
* base <- active
*
* If backing_file_str is non-NULL, it will be used when modifying top's
* overlay image metadata.
*
* Error conditions:
* if active == top, that is considered an error
*
*/
int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
const char *backing_file_str)
{
BlockDriverState *explicit_top = top;
bool update_inherits_from;
BdrvChild *c;
Error *local_err = NULL;
int ret = -EIO;
g_autoptr(GSList) updated_children = NULL;
GSList *p;
GLOBAL_STATE_CODE();
bdrv_ref(top);
bdrv_subtree_drained_begin(top);
if (!top->drv || !base->drv) {
goto exit;
}
/* Make sure that base is in the backing chain of top */
if (!bdrv_chain_contains(top, base)) {
goto exit;
}
/* If 'base' recursively inherits from 'top' then we should set
* base->inherits_from to top->inherits_from after 'top' and all
* other intermediate nodes have been dropped.
* If 'top' is an implicit node (e.g. "commit_top") we should skip
* it because no one inherits from it. We use explicit_top for that. */
explicit_top = bdrv_skip_implicit_filters(explicit_top);
update_inherits_from = bdrv_inherits_from_recursive(base, explicit_top);
/* success - we can delete the intermediate states, and link top->base */
if (!backing_file_str) {
bdrv_refresh_filename(base);
backing_file_str = base->filename;
}
QLIST_FOREACH(c, &top->parents, next_parent) {
updated_children = g_slist_prepend(updated_children, c);
}
/*
* It seems correct to pass detach_subchain=true here, but doing so triggers
* another, not yet fixed bug: due to a nested aio_poll loop we switch to
* another drained section, which modifies the graph (for example, by
* removing the child that we keep in the updated_children list).
*
* FIXME: The bug can be reproduced by passing detach_subchain=true here and
* running test-bdrv-drain; the test_drop_intermediate_poll() test case
* will crash.
*/
bdrv_replace_node_common(top, base, false, false, &local_err);
if (local_err) {
error_report_err(local_err);
goto exit;
}
for (p = updated_children; p; p = p->next) {
c = p->data;
if (c->klass->update_filename) {
ret = c->klass->update_filename(c, base, backing_file_str,
&local_err);
if (ret < 0) {
/*
* TODO: Ideally we would roll back all previous iterations
* of this loop and (which is almost impossible) the previous
* bdrv_replace_node()...
*
* Note that c->klass->update_filename may lead to a permission
* update, so it is a bad idea to call it inside the permission
* update transaction of bdrv_replace_node().
*/
error_report_err(local_err);
goto exit;
}
}
}
if (update_inherits_from) {
base->inherits_from = explicit_top->inherits_from;
}
ret = 0;
exit:
bdrv_subtree_drained_end(top);
bdrv_unref(top);
return ret;
}
/**
* Implementation of BlockDriver.bdrv_get_allocated_file_size() that
* sums the size of all data-bearing children. (This excludes backing
* children.)
*/
static int64_t bdrv_sum_allocated_file_size(BlockDriverState *bs)
{
BdrvChild *child;
int64_t child_size, sum = 0;
QLIST_FOREACH(child, &bs->children, next) {
if (child->role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
BDRV_CHILD_FILTERED))
{
child_size = bdrv_get_allocated_file_size(child->bs);
if (child_size < 0) {
return child_size;
}
sum += child_size;
}
}
return sum;
}
/**
* Length of an allocated file in bytes. Sparse files are counted by actual
* allocated space. Return < 0 if error or unknown.
*/
int64_t bdrv_get_allocated_file_size(BlockDriverState *bs)
{
BlockDriver *drv = bs->drv;
IO_CODE();
if (!drv) {
return -ENOMEDIUM;
}
if (drv->bdrv_get_allocated_file_size) {
return drv->bdrv_get_allocated_file_size(bs);
}
if (drv->bdrv_file_open) {
/*
* Protocol drivers default to -ENOTSUP (most of their data is
* not stored in any of their children (if they even have any),
* so there is no generic way to figure it out).
*/
return -ENOTSUP;
} else if (drv->is_filter) {
/* Filter drivers default to the size of their filtered child */
return bdrv_get_allocated_file_size(bdrv_filter_bs(bs));
} else {
/* Other drivers default to summing their children's sizes */
return bdrv_sum_allocated_file_size(bs);
}
}
/*
* bdrv_measure:
* @drv: Format driver
* @opts: Creation options for new image
* @in_bs: Existing image containing data for new image (may be NULL)
* @errp: Error object
* Returns: A #BlockMeasureInfo (free using qapi_free_BlockMeasureInfo())
* or NULL on error
*
* Calculate file size required to create a new image.
*
* If @in_bs is given then space for allocated clusters and zero clusters
* from that image are included in the calculation. If @opts contains a
* backing file that is shared by @in_bs then backing clusters may be omitted
* from the calculation.
*
* If @in_bs is NULL then the calculation includes no allocated clusters
* unless a preallocation option is given in @opts.
*
* Note that @in_bs may use a different BlockDriver from @drv.
*
* If an error occurs the @errp pointer is set.
*/
BlockMeasureInfo *bdrv_measure(BlockDriver *drv, QemuOpts *opts,
BlockDriverState *in_bs, Error **errp)
{
IO_CODE();
if (!drv->bdrv_measure) {
error_setg(errp, "Block driver '%s' does not support size measurement",
drv->format_name);
return NULL;
}
return drv->bdrv_measure(opts, in_bs, errp);
}
/**
* Return number of sectors on success, -errno on error.
*/
int64_t bdrv_nb_sectors(BlockDriverState *bs)
{
BlockDriver *drv = bs->drv;
IO_CODE();
if (!drv) {
return -ENOMEDIUM;
}
if (drv->has_variable_length) {
int ret = refresh_total_sectors(bs, bs->total_sectors);
if (ret < 0) {
return ret;
}
}
return bs->total_sectors;
}
/**
* Return length in bytes on success, -errno on error.
* The length is always a multiple of BDRV_SECTOR_SIZE.
*/
int64_t bdrv_getlength(BlockDriverState *bs)
{
int64_t ret = bdrv_nb_sectors(bs);
IO_CODE();
if (ret < 0) {
return ret;
}
if (ret > INT64_MAX / BDRV_SECTOR_SIZE) {
return -EFBIG;
}
return ret * BDRV_SECTOR_SIZE;
}
/* return 0 as number of sectors if no device present or error */
void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr)
{
int64_t nb_sectors = bdrv_nb_sectors(bs);
IO_CODE();
*nb_sectors_ptr = nb_sectors < 0 ? 0 : nb_sectors;
}
bool bdrv_is_sg(BlockDriverState *bs)
{
IO_CODE();
return bs->sg;
}
/**
* Return whether the given node supports compressed writes.
*/
bool bdrv_supports_compressed_writes(BlockDriverState *bs)
{
BlockDriverState *filtered;
IO_CODE();
if (!bs->drv || !block_driver_can_compress(bs->drv)) {
return false;
}
filtered = bdrv_filter_bs(bs);
if (filtered) {
/*
* Filters can only forward compressed writes, so we have to
* check the child.
*/
return bdrv_supports_compressed_writes(filtered);
}
return true;
}
const char *bdrv_get_format_name(BlockDriverState *bs)
{
IO_CODE();
return bs->drv ? bs->drv->format_name : NULL;
}
static int qsort_strcmp(const void *a, const void *b)
{
return strcmp(*(char *const *)a, *(char *const *)b);
}
void bdrv_iterate_format(void (*it)(void *opaque, const char *name),
void *opaque, bool read_only)
{
BlockDriver *drv;
int count = 0;
int i;
const char **formats = NULL;
GLOBAL_STATE_CODE();
QLIST_FOREACH(drv, &bdrv_drivers, list) {
if (drv->format_name) {
bool found = false;
int i = count;
if (use_bdrv_whitelist && !bdrv_is_whitelisted(drv, read_only)) {
continue;
}
while (formats && i && !found) {
found = !strcmp(formats[--i], drv->format_name);
}
if (!found) {
formats = g_renew(const char *, formats, count + 1);
formats[count++] = drv->format_name;
}
}
}
for (i = 0; i < (int)ARRAY_SIZE(block_driver_modules); i++) {
const char *format_name = block_driver_modules[i].format_name;
if (format_name) {
bool found = false;
int j = count;
if (use_bdrv_whitelist &&
!bdrv_format_is_whitelisted(format_name, read_only)) {
continue;
}
while (formats && j && !found) {
found = !strcmp(formats[--j], format_name);
}
if (!found) {
formats = g_renew(const char *, formats, count + 1);
formats[count++] = format_name;
}
}
}
qsort(formats, count, sizeof(formats[0]), qsort_strcmp);
for (i = 0; i < count; i++) {
it(opaque, formats[i]);
}
g_free(formats);
}
/* Find a node in the BDS graph by its node name; return NULL if none matches */
BlockDriverState *bdrv_find_node(const char *node_name)
{
BlockDriverState *bs;
assert(node_name);
GLOBAL_STATE_CODE();
QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
if (!strcmp(node_name, bs->node_name)) {
return bs;
}
}
return NULL;
}
/* Put this QMP function here so it can access the static graph_bdrv_states. */
BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
Error **errp)
{
BlockDeviceInfoList *list;
BlockDriverState *bs;
GLOBAL_STATE_CODE();
list = NULL;
QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
if (!info) {
qapi_free_BlockDeviceInfoList(list);
return NULL;
}
QAPI_LIST_PREPEND(list, info);
}
return list;
}
typedef struct XDbgBlockGraphConstructor {
XDbgBlockGraph *graph;
GHashTable *graph_nodes;
} XDbgBlockGraphConstructor;
static XDbgBlockGraphConstructor *xdbg_graph_new(void)
{
XDbgBlockGraphConstructor *gr = g_new(XDbgBlockGraphConstructor, 1);
gr->graph = g_new0(XDbgBlockGraph, 1);
gr->graph_nodes = g_hash_table_new(NULL, NULL);
return gr;
}
static XDbgBlockGraph *xdbg_graph_finalize(XDbgBlockGraphConstructor *gr)
{
XDbgBlockGraph *graph = gr->graph;
g_hash_table_destroy(gr->graph_nodes);
g_free(gr);
return graph;
}
static uintptr_t xdbg_graph_node_num(XDbgBlockGraphConstructor *gr, void *node)
{
uintptr_t ret = (uintptr_t)g_hash_table_lookup(gr->graph_nodes, node);
if (ret != 0) {
return ret;
}
/*
* Start counting from 1, not 0, because 0 interferes with not-found (NULL)
* answer of g_hash_table_lookup.
*/
ret = g_hash_table_size(gr->graph_nodes) + 1;
g_hash_table_insert(gr->graph_nodes, node, (void *)ret);
return ret;
}
static void xdbg_graph_add_node(XDbgBlockGraphConstructor *gr, void *node,
XDbgBlockGraphNodeType type, const char *name)
{
XDbgBlockGraphNode *n;
n = g_new0(XDbgBlockGraphNode, 1);
n->id = xdbg_graph_node_num(gr, node);
n->type = type;
n->name = g_strdup(name);
QAPI_LIST_PREPEND(gr->graph->nodes, n);
}
static void xdbg_graph_add_edge(XDbgBlockGraphConstructor *gr, void *parent,
const BdrvChild *child)
{
BlockPermission qapi_perm;
XDbgBlockGraphEdge *edge;
GLOBAL_STATE_CODE();
edge = g_new0(XDbgBlockGraphEdge, 1);
edge->parent = xdbg_graph_node_num(gr, parent);
edge->child = xdbg_graph_node_num(gr, child->bs);
edge->name = g_strdup(child->name);
for (qapi_perm = 0; qapi_perm < BLOCK_PERMISSION__MAX; qapi_perm++) {
uint64_t flag = bdrv_qapi_perm_to_blk_perm(qapi_perm);
if (flag & child->perm) {
QAPI_LIST_PREPEND(edge->perm, qapi_perm);
}
if (flag & child->shared_perm) {
QAPI_LIST_PREPEND(edge->shared_perm, qapi_perm);
}
}
QAPI_LIST_PREPEND(gr->graph->edges, edge);
}
XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp)
{
BlockBackend *blk;
BlockJob *job;
BlockDriverState *bs;
BdrvChild *child;
XDbgBlockGraphConstructor *gr = xdbg_graph_new();
GLOBAL_STATE_CODE();
for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
char *allocated_name = NULL;
const char *name = blk_name(blk);
if (!*name) {
name = allocated_name = blk_get_attached_dev_id(blk);
}
xdbg_graph_add_node(gr, blk, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_BACKEND,
name);
g_free(allocated_name);
if (blk_root(blk)) {
xdbg_graph_add_edge(gr, blk, blk_root(blk));
}
}
for (job = block_job_next(NULL); job; job = block_job_next(job)) {
GSList *el;
xdbg_graph_add_node(gr, job, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_JOB,
job->job.id);
for (el = job->nodes; el; el = el->next) {
xdbg_graph_add_edge(gr, job, (BdrvChild *)el->data);
}
}
QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
xdbg_graph_add_node(gr, bs, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_DRIVER,
bs->node_name);
QLIST_FOREACH(child, &bs->children, next) {
xdbg_graph_add_edge(gr, bs, child);
}
}
return xdbg_graph_finalize(gr);
}
BlockDriverState *bdrv_lookup_bs(const char *device,
const char *node_name,
Error **errp)
{
BlockBackend *blk;
BlockDriverState *bs;
GLOBAL_STATE_CODE();
if (device) {
blk = blk_by_name(device);
if (blk) {
bs = blk_bs(blk);
if (!bs) {
error_setg(errp, "Device '%s' has no medium", device);
}
return bs;
}
}
if (node_name) {
bs = bdrv_find_node(node_name);
if (bs) {
return bs;
}
}
error_setg(errp, "Cannot find device='%s' nor node-name='%s'",
device ? device : "",
node_name ? node_name : "");
return NULL;
}
/* If 'base' is in the same chain as 'top', return true. Otherwise,
* return false. If either argument is NULL, return false. */
bool bdrv_chain_contains(BlockDriverState *top, BlockDriverState *base)
{
GLOBAL_STATE_CODE();
while (top && top != base) {
top = bdrv_filter_or_cow_bs(top);
}
return top != NULL;
}
BlockDriverState *bdrv_next_node(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
if (!bs) {
return QTAILQ_FIRST(&graph_bdrv_states);
}
return QTAILQ_NEXT(bs, node_list);
}
BlockDriverState *bdrv_next_all_states(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
if (!bs) {
return QTAILQ_FIRST(&all_bdrv_states);
}
return QTAILQ_NEXT(bs, bs_list);
}
const char *bdrv_get_node_name(const BlockDriverState *bs)
{
IO_CODE();
return bs->node_name;
}
const char *bdrv_get_parent_name(const BlockDriverState *bs)
{
BdrvChild *c;
const char *name;
IO_CODE();
/* If multiple parents have a name, just pick the first one. */
QLIST_FOREACH(c, &bs->parents, next_parent) {
if (c->klass->get_name) {
name = c->klass->get_name(c);
if (name && *name) {
return name;
}
}
}
return NULL;
}
/* TODO check what callers really want: bs->node_name or blk_name() */
const char *bdrv_get_device_name(const BlockDriverState *bs)
{
IO_CODE();
return bdrv_get_parent_name(bs) ?: "";
}
/* This can be used to identify nodes that might not have a device
* name associated. Since node and device names live in the same
* namespace, the result is unambiguous. The exception is if both are
* absent, then this returns an empty (non-null) string. */
const char *bdrv_get_device_or_node_name(const BlockDriverState *bs)
{
IO_CODE();
return bdrv_get_parent_name(bs) ?: bs->node_name;
}
int bdrv_get_flags(BlockDriverState *bs)
{
IO_CODE();
return bs->open_flags;
}
int bdrv_has_zero_init_1(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
return 1;
}
int bdrv_has_zero_init(BlockDriverState *bs)
{
BlockDriverState *filtered;
GLOBAL_STATE_CODE();
if (!bs->drv) {
return 0;
}
/* If BS is a copy on write image, it is initialized to
the contents of the base image, which may not be zeroes. */
if (bdrv_cow_child(bs)) {
return 0;
}
if (bs->drv->bdrv_has_zero_init) {
return bs->drv->bdrv_has_zero_init(bs);
}
filtered = bdrv_filter_bs(bs);
if (filtered) {
return bdrv_has_zero_init(filtered);
}
/* safe default */
return 0;
}
bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs)
{
IO_CODE();
if (!(bs->open_flags & BDRV_O_UNMAP)) {
return false;
}
return bs->supported_zero_flags & BDRV_REQ_MAY_UNMAP;
}
void bdrv_get_backing_filename(BlockDriverState *bs,
char *filename, int filename_size)
{
IO_CODE();
pstrcpy(filename, filename_size, bs->backing_file);
}
int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
{
int ret;
BlockDriver *drv = bs->drv;
IO_CODE();
/* if bs->drv == NULL, bs is closed, so there's nothing to do here */
if (!drv) {
return -ENOMEDIUM;
}
if (!drv->bdrv_get_info) {
BlockDriverState *filtered = bdrv_filter_bs(bs);
if (filtered) {
return bdrv_get_info(filtered, bdi);
}
return -ENOTSUP;
}
memset(bdi, 0, sizeof(*bdi));
ret = drv->bdrv_get_info(bs, bdi);
if (ret < 0) {
return ret;
}
if (bdi->cluster_size > BDRV_MAX_ALIGNMENT) {
return -EINVAL;
}
return 0;
}
ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
Error **errp)
{
BlockDriver *drv = bs->drv;
IO_CODE();
if (drv && drv->bdrv_get_specific_info) {
return drv->bdrv_get_specific_info(bs, errp);
}
return NULL;
}
BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
{
BlockDriver *drv = bs->drv;
IO_CODE();
if (!drv || !drv->bdrv_get_specific_stats) {
return NULL;
}
return drv->bdrv_get_specific_stats(bs);
}
void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
{
IO_CODE();
if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
return;
}
bs->drv->bdrv_debug_event(bs, event);
}
static BlockDriverState *bdrv_find_debug_node(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
while (bs && bs->drv && !bs->drv->bdrv_debug_breakpoint) {
bs = bdrv_primary_bs(bs);
}
if (bs && bs->drv && bs->drv->bdrv_debug_breakpoint) {
assert(bs->drv->bdrv_debug_remove_breakpoint);
return bs;
}
return NULL;
}
int bdrv_debug_breakpoint(BlockDriverState *bs, const char *event,
const char *tag)
{
GLOBAL_STATE_CODE();
bs = bdrv_find_debug_node(bs);
if (bs) {
return bs->drv->bdrv_debug_breakpoint(bs, event, tag);
}
return -ENOTSUP;
}
int bdrv_debug_remove_breakpoint(BlockDriverState *bs, const char *tag)
{
GLOBAL_STATE_CODE();
bs = bdrv_find_debug_node(bs);
if (bs) {
return bs->drv->bdrv_debug_remove_breakpoint(bs, tag);
}
return -ENOTSUP;
}
int bdrv_debug_resume(BlockDriverState *bs, const char *tag)
{
GLOBAL_STATE_CODE();
while (bs && (!bs->drv || !bs->drv->bdrv_debug_resume)) {
bs = bdrv_primary_bs(bs);
}
if (bs && bs->drv && bs->drv->bdrv_debug_resume) {
return bs->drv->bdrv_debug_resume(bs, tag);
}
return -ENOTSUP;
}
bool bdrv_debug_is_suspended(BlockDriverState *bs, const char *tag)
{
GLOBAL_STATE_CODE();
while (bs && bs->drv && !bs->drv->bdrv_debug_is_suspended) {
bs = bdrv_primary_bs(bs);
}
if (bs && bs->drv && bs->drv->bdrv_debug_is_suspended) {
return bs->drv->bdrv_debug_is_suspended(bs, tag);
}
return false;
}
/* backing_file can either be relative, or absolute, or a protocol. If it is
* relative, it must be relative to the chain. So, passing in bs->filename
* from a BDS as backing_file should not be done, as that may be relative to
* the CWD rather than the chain. */
BlockDriverState *bdrv_find_backing_image(BlockDriverState *bs,
const char *backing_file)
{
char *filename_full = NULL;
char *backing_file_full = NULL;
char *filename_tmp = NULL;
int is_protocol = 0;
bool filenames_refreshed = false;
BlockDriverState *curr_bs = NULL;
BlockDriverState *retval = NULL;
BlockDriverState *bs_below;
GLOBAL_STATE_CODE();
if (!bs || !bs->drv || !backing_file) {
return NULL;
}
filename_full = g_malloc(PATH_MAX);
backing_file_full = g_malloc(PATH_MAX);
is_protocol = path_has_protocol(backing_file);
/*
* Being largely a legacy function, skip any filters here
* (because filters do not have normal filenames, so they cannot
* match anyway; and allowing json:{} filenames is a bit out of
* scope).
*/
for (curr_bs = bdrv_skip_filters(bs);
bdrv_cow_child(curr_bs) != NULL;
curr_bs = bs_below)
{
bs_below = bdrv_backing_chain_next(curr_bs);
if (bdrv_backing_overridden(curr_bs)) {
/*
* If the backing file was overridden, we can only compare
* directly against the backing node's filename.
*/
if (!filenames_refreshed) {
/*
* This will automatically refresh all of the
* filenames in the rest of the backing chain, so we
* only need to do this once.
*/
bdrv_refresh_filename(bs_below);
filenames_refreshed = true;
}
if (strcmp(backing_file, bs_below->filename) == 0) {
retval = bs_below;
break;
}
} else if (is_protocol || path_has_protocol(curr_bs->backing_file)) {
/*
* If either of the filename paths is actually a protocol, then
* compare unmodified paths; otherwise make paths relative.
*/
char *backing_file_full_ret;
if (strcmp(backing_file, curr_bs->backing_file) == 0) {
retval = bs_below;
break;
}
/* Also check against the full backing filename for the image */
backing_file_full_ret = bdrv_get_full_backing_filename(curr_bs,
NULL);
if (backing_file_full_ret) {
bool equal = strcmp(backing_file, backing_file_full_ret) == 0;
g_free(backing_file_full_ret);
if (equal) {
retval = bs_below;
break;
}
}
} else {
/* If not an absolute filename path, make it relative to the current
* image's filename path */
filename_tmp = bdrv_make_absolute_filename(curr_bs, backing_file,
NULL);
/* We are going to compare canonicalized absolute pathnames */
if (!filename_tmp || !realpath(filename_tmp, filename_full)) {
g_free(filename_tmp);
continue;
}
g_free(filename_tmp);
/* We need to make sure the backing filename we are comparing against
* is relative to the current image filename (or absolute) */
filename_tmp = bdrv_get_full_backing_filename(curr_bs, NULL);
if (!filename_tmp || !realpath(filename_tmp, backing_file_full)) {
g_free(filename_tmp);
continue;
}
g_free(filename_tmp);
if (strcmp(backing_file_full, filename_full) == 0) {
retval = bs_below;
break;
}
}
}
g_free(filename_full);
g_free(backing_file_full);
return retval;
}
void bdrv_init(void)
{
#ifdef CONFIG_BDRV_WHITELIST_TOOLS
use_bdrv_whitelist = 1;
#endif
module_call_init(MODULE_INIT_BLOCK);
}
void bdrv_init_with_whitelist(void)
{
use_bdrv_whitelist = 1;
bdrv_init();
}
int bdrv_activate(BlockDriverState *bs, Error **errp)
{
BdrvChild *child, *parent;
Error *local_err = NULL;
int ret;
BdrvDirtyBitmap *bm;
GLOBAL_STATE_CODE();
if (!bs->drv) {
return -ENOMEDIUM;
}
QLIST_FOREACH(child, &bs->children, next) {
bdrv_activate(child->bs, &local_err);
if (local_err) {
error_propagate(errp, local_err);
return -EINVAL;
}
}
/*
* Update permissions, they may differ for inactive nodes.
*
* Note that the required permissions of inactive images are always a
* subset of the permissions required after activating the image. This
* allows us to just get the permissions upfront without restricting
* bdrv_co_invalidate_cache().
*
* It also means that in error cases, we don't have to try and revert to
* the old permissions (which is an operation that could fail, too). We can
* just keep the extended permissions for the next time that an activation
* of the image is tried.
*/
if (bs->open_flags & BDRV_O_INACTIVE) {
bs->open_flags &= ~BDRV_O_INACTIVE;
ret = bdrv_refresh_perms(bs, errp);
if (ret < 0) {
bs->open_flags |= BDRV_O_INACTIVE;
return ret;
}
ret = bdrv_invalidate_cache(bs, errp);
if (ret < 0) {
bs->open_flags |= BDRV_O_INACTIVE;
return ret;
}
FOR_EACH_DIRTY_BITMAP(bs, bm) {
bdrv_dirty_bitmap_skip_store(bm, false);
}
ret = refresh_total_sectors(bs, bs->total_sectors);
if (ret < 0) {
bs->open_flags |= BDRV_O_INACTIVE;
error_setg_errno(errp, -ret, "Could not refresh total sector count");
return ret;
}
}
QLIST_FOREACH(parent, &bs->parents, next_parent) {
if (parent->klass->activate) {
parent->klass->activate(parent, &local_err);
if (local_err) {
bs->open_flags |= BDRV_O_INACTIVE;
error_propagate(errp, local_err);
return -EINVAL;
}
}
}
return 0;
}
int coroutine_fn bdrv_co_invalidate_cache(BlockDriverState *bs, Error **errp)
{
Error *local_err = NULL;
IO_CODE();
assert(!(bs->open_flags & BDRV_O_INACTIVE));
if (bs->drv->bdrv_co_invalidate_cache) {
bs->drv->bdrv_co_invalidate_cache(bs, &local_err);
if (local_err) {
error_propagate(errp, local_err);
return -EINVAL;
}
}
return 0;
}
void bdrv_activate_all(Error **errp)
{
BlockDriverState *bs;
BdrvNextIterator it;
GLOBAL_STATE_CODE();
for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
AioContext *aio_context = bdrv_get_aio_context(bs);
int ret;
aio_context_acquire(aio_context);
ret = bdrv_activate(bs, errp);
aio_context_release(aio_context);
if (ret < 0) {
bdrv_next_cleanup(&it);
return;
}
}
}
static bool bdrv_has_bds_parent(BlockDriverState *bs, bool only_active)
{
BdrvChild *parent;
GLOBAL_STATE_CODE();
QLIST_FOREACH(parent, &bs->parents, next_parent) {
if (parent->klass->parent_is_bds) {
BlockDriverState *parent_bs = parent->opaque;
if (!only_active || !(parent_bs->open_flags & BDRV_O_INACTIVE)) {
return true;
}
}
}
return false;
}
static int bdrv_inactivate_recurse(BlockDriverState *bs)
{
BdrvChild *child, *parent;
int ret;
uint64_t cumulative_perms, cumulative_shared_perms;
GLOBAL_STATE_CODE();
if (!bs->drv) {
return -ENOMEDIUM;
}
/* Make sure that we don't inactivate a child before its parent.
* It will be covered by recursion from the yet active parent. */
if (bdrv_has_bds_parent(bs, true)) {
return 0;
}
assert(!(bs->open_flags & BDRV_O_INACTIVE));
/* Inactivate this node */
if (bs->drv->bdrv_inactivate) {
ret = bs->drv->bdrv_inactivate(bs);
if (ret < 0) {
return ret;
}
}
QLIST_FOREACH(parent, &bs->parents, next_parent) {
if (parent->klass->inactivate) {
ret = parent->klass->inactivate(parent);
if (ret < 0) {
return ret;
}
}
}
bdrv_get_cumulative_perm(bs, &cumulative_perms,
&cumulative_shared_perms);
if (cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) {
/* Our inactive parents still need write access. Inactivation failed. */
return -EPERM;
}
bs->open_flags |= BDRV_O_INACTIVE;
/*
* Update permissions, they may differ for inactive nodes.
* We only tried to loosen restrictions, so errors are not fatal, ignore
* them.
*/
bdrv_refresh_perms(bs, NULL);
/* Recursively inactivate children */
QLIST_FOREACH(child, &bs->children, next) {
ret = bdrv_inactivate_recurse(child->bs);
if (ret < 0) {
return ret;
}
}
return 0;
}
int bdrv_inactivate_all(void)
{
BlockDriverState *bs = NULL;
BdrvNextIterator it;
int ret = 0;
GSList *aio_ctxs = NULL, *ctx;
GLOBAL_STATE_CODE();
for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
AioContext *aio_context = bdrv_get_aio_context(bs);
if (!g_slist_find(aio_ctxs, aio_context)) {
aio_ctxs = g_slist_prepend(aio_ctxs, aio_context);
aio_context_acquire(aio_context);
}
}
for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
/* Nodes with BDS parents are covered by recursion from the last
* parent that gets inactivated. Don't inactivate them a second
* time if that has already happened. */
if (bdrv_has_bds_parent(bs, false)) {
continue;
}
ret = bdrv_inactivate_recurse(bs);
if (ret < 0) {
bdrv_next_cleanup(&it);
goto out;
}
}
out:
for (ctx = aio_ctxs; ctx != NULL; ctx = ctx->next) {
AioContext *aio_context = ctx->data;
aio_context_release(aio_context);
}
g_slist_free(aio_ctxs);
return ret;
}
/**************************************************************/
/* removable device support */
/**
* Return true if a medium is present
*/
bool bdrv_is_inserted(BlockDriverState *bs)
{
BlockDriver *drv = bs->drv;
BdrvChild *child;
IO_CODE();
if (!drv) {
return false;
}
if (drv->bdrv_is_inserted) {
return drv->bdrv_is_inserted(bs);
}
QLIST_FOREACH(child, &bs->children, next) {
if (!bdrv_is_inserted(child->bs)) {
return false;
}
}
return true;
}
/**
* If eject_flag is true, eject the medium. Otherwise, close the tray.
*/
void bdrv_eject(BlockDriverState *bs, bool eject_flag)
{
BlockDriver *drv = bs->drv;
IO_CODE();
if (drv && drv->bdrv_eject) {
drv->bdrv_eject(bs, eject_flag);
}
}
/**
* Lock or unlock the media (if it is locked, the user won't be able
* to eject it manually).
*/
void bdrv_lock_medium(BlockDriverState *bs, bool locked)
{
BlockDriver *drv = bs->drv;
IO_CODE();
trace_bdrv_lock_medium(bs, locked);
if (drv && drv->bdrv_lock_medium) {
drv->bdrv_lock_medium(bs, locked);
}
}
/* Get a reference to bs */
void bdrv_ref(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
bs->refcnt++;
}
/* Release a previously grabbed reference to bs.
* If after releasing, reference count is zero, the BlockDriverState is
* deleted. */
void bdrv_unref(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
if (!bs) {
return;
}
assert(bs->refcnt > 0);
if (--bs->refcnt == 0) {
bdrv_delete(bs);
}
}
struct BdrvOpBlocker {
Error *reason;
QLIST_ENTRY(BdrvOpBlocker) list;
};
bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp)
{
BdrvOpBlocker *blocker;
GLOBAL_STATE_CODE();
assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
if (!QLIST_EMPTY(&bs->op_blockers[op])) {
blocker = QLIST_FIRST(&bs->op_blockers[op]);
error_propagate_prepend(errp, error_copy(blocker->reason),
"Node '%s' is busy: ",
bdrv_get_device_or_node_name(bs));
return true;
}
return false;
}
void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason)
{
BdrvOpBlocker *blocker;
GLOBAL_STATE_CODE();
assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
blocker = g_new0(BdrvOpBlocker, 1);
blocker->reason = reason;
QLIST_INSERT_HEAD(&bs->op_blockers[op], blocker, list);
}
void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason)
{
BdrvOpBlocker *blocker, *next;
GLOBAL_STATE_CODE();
assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) {
if (blocker->reason == reason) {
QLIST_REMOVE(blocker, list);
g_free(blocker);
}
}
}
void bdrv_op_block_all(BlockDriverState *bs, Error *reason)
{
int i;
GLOBAL_STATE_CODE();
for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
bdrv_op_block(bs, i, reason);
}
}
void bdrv_op_unblock_all(BlockDriverState *bs, Error *reason)
{
int i;
GLOBAL_STATE_CODE();
for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
bdrv_op_unblock(bs, i, reason);
}
}
bool bdrv_op_blocker_is_empty(BlockDriverState *bs)
{
int i;
GLOBAL_STATE_CODE();
for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
if (!QLIST_EMPTY(&bs->op_blockers[i])) {
return false;
}
}
return true;
}
void bdrv_img_create(const char *filename, const char *fmt,
const char *base_filename, const char *base_fmt,
char *options, uint64_t img_size, int flags, bool quiet,
Error **errp)
{
QemuOptsList *create_opts = NULL;
QemuOpts *opts = NULL;
const char *backing_fmt, *backing_file;
int64_t size;
BlockDriver *drv, *proto_drv;
Error *local_err = NULL;
int ret = 0;
GLOBAL_STATE_CODE();
/* Find driver and parse its options */
drv = bdrv_find_format(fmt);
if (!drv) {
error_setg(errp, "Unknown file format '%s'", fmt);
return;
}
proto_drv = bdrv_find_protocol(filename, true, errp);
if (!proto_drv) {
return;
}
if (!drv->create_opts) {
error_setg(errp, "Format driver '%s' does not support image creation",
drv->format_name);
return;
}
if (!proto_drv->create_opts) {
error_setg(errp, "Protocol driver '%s' does not support image creation",
proto_drv->format_name);
return;
}
/* Create parameter list */
create_opts = qemu_opts_append(create_opts, drv->create_opts);
create_opts = qemu_opts_append(create_opts, proto_drv->create_opts);
opts = qemu_opts_create(create_opts, NULL, 0, &error_abort);
/* Parse -o options */
if (options) {
if (!qemu_opts_do_parse(opts, options, NULL, errp)) {
goto out;
}
}
if (!qemu_opt_get(opts, BLOCK_OPT_SIZE)) {
qemu_opt_set_number(opts, BLOCK_OPT_SIZE, img_size, &error_abort);
} else if (img_size != UINT64_C(-1)) {
error_setg(errp, "The image size must be specified only once");
goto out;
}
if (base_filename) {
if (!qemu_opt_set(opts, BLOCK_OPT_BACKING_FILE, base_filename,
NULL)) {
error_setg(errp, "Backing file not supported for file format '%s'",
fmt);
goto out;
}
}
if (base_fmt) {
if (!qemu_opt_set(opts, BLOCK_OPT_BACKING_FMT, base_fmt, NULL)) {
error_setg(errp, "Backing file format not supported for file "
"format '%s'", fmt);
goto out;
}
}
backing_file = qemu_opt_get(opts, BLOCK_OPT_BACKING_FILE);
if (backing_file) {
if (!strcmp(filename, backing_file)) {
error_setg(errp, "Error: Trying to create an image with the "
"same filename as the backing file");
goto out;
}
if (backing_file[0] == '\0') {
error_setg(errp, "Expected backing file name, got empty string");
goto out;
}
}
backing_fmt = qemu_opt_get(opts, BLOCK_OPT_BACKING_FMT);
/* The size for the image must always be specified, unless we have a backing
* file and we have not been forbidden from opening it. */
size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, img_size);
if (backing_file && !(flags & BDRV_O_NO_BACKING)) {
BlockDriverState *bs;
char *full_backing;
int back_flags;
QDict *backing_options = NULL;
full_backing =
bdrv_get_full_backing_filename_from_filename(filename, backing_file,
&local_err);
if (local_err) {
goto out;
}
assert(full_backing);
/*
* No need to do I/O here, which allows us to open encrypted
* backing images without needing the secret
*/
back_flags = flags;
back_flags &= ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING);
back_flags |= BDRV_O_NO_IO;
backing_options = qdict_new();
if (backing_fmt) {
qdict_put_str(backing_options, "driver", backing_fmt);
}
qdict_put_bool(backing_options, BDRV_OPT_FORCE_SHARE, true);
bs = bdrv_open(full_backing, NULL, backing_options, back_flags,
&local_err);
g_free(full_backing);
if (!bs) {
error_append_hint(&local_err, "Could not open backing image.\n");
goto out;
} else {
if (!backing_fmt) {
error_setg(&local_err,
"Backing file specified without backing format");
error_append_hint(&local_err, "Detected format of %s.\n",
bs->drv->format_name);
/* Drop the reference taken by bdrv_open() before bailing out */
bdrv_unref(bs);
goto out;
}
if (size == -1) {
/* Backing image is open but no size was given: use its length */
size = bdrv_getlength(bs);
if (size < 0) {
error_setg_errno(errp, -size, "Could not get size of '%s'",
backing_file);
bdrv_unref(bs);
goto out;
}
qemu_opt_set_number(opts, BLOCK_OPT_SIZE, size, &error_abort);
}
bdrv_unref(bs);
}
/* (backing_file && !(flags & BDRV_O_NO_BACKING)) */
} else if (backing_file && !backing_fmt) {
error_setg(&local_err,
"Backing file specified without backing format");
goto out;
}
if (size == -1) {
error_setg(errp, "Image creation needs a size parameter");
goto out;
}
if (!quiet) {
printf("Formatting '%s', fmt=%s ", filename, fmt);
qemu_opts_print(opts, " ");
puts("");
fflush(stdout);
}
ret = bdrv_create(drv, filename, opts, &local_err);
if (ret == -EFBIG) {
/* This is generally a better message than whatever the driver would
* deliver (especially because of the cluster_size_hint), since that
* is most probably not much different from "image too large". */
const char *cluster_size_hint = "";
if (qemu_opt_get_size(opts, BLOCK_OPT_CLUSTER_SIZE, 0)) {
cluster_size_hint = " (try using a larger cluster size)";
}
error_setg(errp, "The image size is too large for file format '%s'"
"%s", fmt, cluster_size_hint);
error_free(local_err);
local_err = NULL;
}
out:
qemu_opts_del(opts);
qemu_opts_free(create_opts);
error_propagate(errp, local_err);
}
AioContext *bdrv_get_aio_context(BlockDriverState *bs)
{
IO_CODE();
return bs ? bs->aio_context : qemu_get_aio_context();
}
AioContext *coroutine_fn bdrv_co_enter(BlockDriverState *bs)
{
Coroutine *self = qemu_coroutine_self();
AioContext *old_ctx = qemu_coroutine_get_aio_context(self);
AioContext *new_ctx;
IO_CODE();
/*
* Increase bs->in_flight to ensure that this operation is completed before
* moving the node to a different AioContext. Read new_ctx only afterwards.
*/
bdrv_inc_in_flight(bs);
new_ctx = bdrv_get_aio_context(bs);
aio_co_reschedule_self(new_ctx);
return old_ctx;
}
void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx)
{
IO_CODE();
aio_co_reschedule_self(old_ctx);
bdrv_dec_in_flight(bs);
}
void coroutine_fn bdrv_co_lock(BlockDriverState *bs)
{
AioContext *ctx = bdrv_get_aio_context(bs);
/* In the main thread, bs->aio_context won't change concurrently */
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
/*
* We're in coroutine context, so we already hold the lock of the main
* loop AioContext. Don't lock it twice to avoid deadlocks.
*/
assert(qemu_in_coroutine());
if (ctx != qemu_get_aio_context()) {
aio_context_acquire(ctx);
}
}
void coroutine_fn bdrv_co_unlock(BlockDriverState *bs)
{
AioContext *ctx = bdrv_get_aio_context(bs);
assert(qemu_in_coroutine());
if (ctx != qemu_get_aio_context()) {
aio_context_release(ctx);
}
}
void bdrv_coroutine_enter(BlockDriverState *bs, Coroutine *co)
{
IO_CODE();
aio_co_enter(bdrv_get_aio_context(bs), co);
}
static void bdrv_do_remove_aio_context_notifier(BdrvAioNotifier *ban)
{
GLOBAL_STATE_CODE();
QLIST_REMOVE(ban, list);
g_free(ban);
}
static void bdrv_detach_aio_context(BlockDriverState *bs)
{
BdrvAioNotifier *baf, *baf_tmp;
assert(!bs->walking_aio_notifiers);
GLOBAL_STATE_CODE();
bs->walking_aio_notifiers = true;
QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
if (baf->deleted) {
bdrv_do_remove_aio_context_notifier(baf);
} else {
baf->detach_aio_context(baf->opaque);
}
}
/* Never mind iterating again to check for ->deleted. bdrv_close() will
* remove remaining aio notifiers if we aren't called again.
*/
bs->walking_aio_notifiers = false;
if (bs->drv && bs->drv->bdrv_detach_aio_context) {
bs->drv->bdrv_detach_aio_context(bs);
}
if (bs->quiesce_counter) {
aio_enable_external(bs->aio_context);
}
bs->aio_context = NULL;
}
static void bdrv_attach_aio_context(BlockDriverState *bs,
AioContext *new_context)
{
BdrvAioNotifier *ban, *ban_tmp;
GLOBAL_STATE_CODE();
if (bs->quiesce_counter) {
aio_disable_external(new_context);
}
bs->aio_context = new_context;
if (bs->drv && bs->drv->bdrv_attach_aio_context) {
bs->drv->bdrv_attach_aio_context(bs, new_context);
}
assert(!bs->walking_aio_notifiers);
bs->walking_aio_notifiers = true;
QLIST_FOREACH_SAFE(ban, &bs->aio_notifiers, list, ban_tmp) {
if (ban->deleted) {
bdrv_do_remove_aio_context_notifier(ban);
} else {
ban->attached_aio_context(new_context, ban->opaque);
}
}
bs->walking_aio_notifiers = false;
}
/*
* Changes the AioContext used for fd handlers, timers, and BHs by this
* BlockDriverState and all its children and parents.
*
* Must be called from the main AioContext.
*
* The caller must own the AioContext lock for the old AioContext of bs, but it
* must not own the AioContext lock for new_context (unless new_context is the
* same as the current context of bs).
*
* @ignore will accumulate all visited BdrvChild objects. The caller is
* responsible for freeing the list afterwards.
*/
void bdrv_set_aio_context_ignore(BlockDriverState *bs,
AioContext *new_context, GSList **ignore)
{
AioContext *old_context = bdrv_get_aio_context(bs);
GSList *children_to_process = NULL;
GSList *parents_to_process = NULL;
GSList *entry;
BdrvChild *child, *parent;
g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());
GLOBAL_STATE_CODE();
if (old_context == new_context) {
return;
}
bdrv_drained_begin(bs);
QLIST_FOREACH(child, &bs->children, next) {
if (g_slist_find(*ignore, child)) {
continue;
}
*ignore = g_slist_prepend(*ignore, child);
children_to_process = g_slist_prepend(children_to_process, child);
}
QLIST_FOREACH(parent, &bs->parents, next_parent) {
if (g_slist_find(*ignore, parent)) {
continue;
}
*ignore = g_slist_prepend(*ignore, parent);
parents_to_process = g_slist_prepend(parents_to_process, parent);
}
for (entry = children_to_process;
entry != NULL;
entry = g_slist_next(entry)) {
child = entry->data;
bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
}
g_slist_free(children_to_process);
for (entry = parents_to_process;
entry != NULL;
entry = g_slist_next(entry)) {
parent = entry->data;
assert(parent->klass->set_aio_ctx);
parent->klass->set_aio_ctx(parent, new_context, ignore);
}
g_slist_free(parents_to_process);
bdrv_detach_aio_context(bs);
/* Acquire the new context, if necessary */
if (qemu_get_aio_context() != new_context) {
aio_context_acquire(new_context);
}
bdrv_attach_aio_context(bs, new_context);
/*
* If this function was recursively called from
* bdrv_set_aio_context_ignore(), there may be nodes in the
* subtree that have not yet been moved to the new AioContext.
* Release the old one so bdrv_drained_end() can poll them.
*/
if (qemu_get_aio_context() != old_context) {
aio_context_release(old_context);
}
bdrv_drained_end(bs);
if (qemu_get_aio_context() != old_context) {
aio_context_acquire(old_context);
}
if (qemu_get_aio_context() != new_context) {
aio_context_release(new_context);
}
}
static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
GSList **ignore, Error **errp)
{
GLOBAL_STATE_CODE();
if (g_slist_find(*ignore, c)) {
return true;
}
*ignore = g_slist_prepend(*ignore, c);
/*
* A BdrvChildClass that doesn't handle AioContext changes cannot
* tolerate any AioContext changes
*/
if (!c->klass->can_set_aio_ctx) {
char *user = bdrv_child_user_desc(c);
error_setg(errp, "Changing iothreads is not supported by %s", user);
g_free(user);
return false;
}
if (!c->klass->can_set_aio_ctx(c, ctx, ignore, errp)) {
assert(!errp || *errp);
return false;
}
return true;
}
bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
GSList **ignore, Error **errp)
{
GLOBAL_STATE_CODE();
if (g_slist_find(*ignore, c)) {
return true;
}
*ignore = g_slist_prepend(*ignore, c);
return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
}
/* @ignore will accumulate all visited BdrvChild objects. The caller is
* responsible for freeing the list afterwards. */
bool bdrv_can_set_aio_context(BlockDriverState *bs, AioContext *ctx,
GSList **ignore, Error **errp)
{
BdrvChild *c;
if (bdrv_get_aio_context(bs) == ctx) {
return true;
}
GLOBAL_STATE_CODE();
QLIST_FOREACH(c, &bs->parents, next_parent) {
if (!bdrv_parent_can_set_aio_context(c, ctx, ignore, errp)) {
return false;
}
}
QLIST_FOREACH(c, &bs->children, next) {
if (!bdrv_child_can_set_aio_context(c, ctx, ignore, errp)) {
return false;
}
}
return true;
}
int bdrv_child_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
BdrvChild *ignore_child, Error **errp)
{
GSList *ignore;
bool ret;
GLOBAL_STATE_CODE();
ignore = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
ret = bdrv_can_set_aio_context(bs, ctx, &ignore, errp);
g_slist_free(ignore);
if (!ret) {
return -EPERM;
}
ignore = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
bdrv_set_aio_context_ignore(bs, ctx, &ignore);
g_slist_free(ignore);
return 0;
}
int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
Error **errp)
{
GLOBAL_STATE_CODE();
return bdrv_child_try_set_aio_context(bs, ctx, NULL, errp);
}
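/*
* Usage sketch (illustrative only, not part of the original code): move a
* node to another AioContext and report a failure. "iothread_ctx" is a
* hypothetical AioContext supplied by the caller.
*
*     Error *local_err = NULL;
*
*     if (bdrv_try_set_aio_context(bs, iothread_ctx, &local_err) < 0) {
*         error_report_err(local_err);
*     }
*
* bdrv_can_set_aio_context() runs over the graph before anything is
* changed, so when -EPERM is returned no node has been moved.
*/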
void bdrv_add_aio_context_notifier(BlockDriverState *bs,
void (*attached_aio_context)(AioContext *new_context, void *opaque),
void (*detach_aio_context)(void *opaque), void *opaque)
{
BdrvAioNotifier *ban = g_new(BdrvAioNotifier, 1);
*ban = (BdrvAioNotifier){
.attached_aio_context = attached_aio_context,
.detach_aio_context = detach_aio_context,
.opaque = opaque
};
GLOBAL_STATE_CODE();
QLIST_INSERT_HEAD(&bs->aio_notifiers, ban, list);
}
void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
void (*attached_aio_context)(AioContext *,
void *),
void (*detach_aio_context)(void *),
void *opaque)
{
BdrvAioNotifier *ban, *ban_next;
GLOBAL_STATE_CODE();
QLIST_FOREACH_SAFE(ban, &bs->aio_notifiers, list, ban_next) {
if (ban->attached_aio_context == attached_aio_context &&
ban->detach_aio_context == detach_aio_context &&
ban->opaque == opaque &&
ban->deleted == false)
{
if (bs->walking_aio_notifiers) {
ban->deleted = true;
} else {
bdrv_do_remove_aio_context_notifier(ban);
}
return;
}
}
abort();
}
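/*
* Usage sketch (illustrative only): "my_attached", "my_detach" and "s" are
* hypothetical names for a user's callbacks and opaque state.
*
*     static void my_attached(AioContext *new_context, void *opaque) { ... }
*     static void my_detach(void *opaque) { ... }
*
*     bdrv_add_aio_context_notifier(bs, my_attached, my_detach, s);
*     ...
*     bdrv_remove_aio_context_notifier(bs, my_attached, my_detach, s);
*
* Removal must pass the same callback pair and opaque pointer that were
* registered. If no matching notifier that is not already marked deleted
* is found, bdrv_remove_aio_context_notifier() calls abort().
*/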
int bdrv_amend_options(BlockDriverState *bs, QemuOpts *opts,
BlockDriverAmendStatusCB *status_cb, void *cb_opaque,
bool force,
Error **errp)
{
GLOBAL_STATE_CODE();
if (!bs->drv) {
error_setg(errp, "Node is ejected");
return -ENOMEDIUM;
}
if (!bs->drv->bdrv_amend_options) {
error_setg(errp, "Block driver '%s' does not support option amendment",
bs->drv->format_name);
return -ENOTSUP;
}
return bs->drv->bdrv_amend_options(bs, opts, status_cb,
cb_opaque, force, errp);
}
/*
* This function checks whether the given @to_replace is allowed to be
* replaced by a node that always shows the same data as @bs. This is
* used for example to verify whether the mirror job can replace
* @to_replace by the target mirrored from @bs.
* To be replaceable, @bs and @to_replace may either be guaranteed to
* always show the same data (because they are only connected through
* filters), or some driver may allow replacing one of its children
* because it can guarantee that this child's data is not visible at
* all (for example, for dissenting quorum children that have no other
* parents).
*/
bool bdrv_recurse_can_replace(BlockDriverState *bs,
BlockDriverState *to_replace)
{
BlockDriverState *filtered;
GLOBAL_STATE_CODE();
if (!bs || !bs->drv) {
return false;
}
if (bs == to_replace) {
return true;
}
/* See what the driver can do */
if (bs->drv->bdrv_recurse_can_replace) {
return bs->drv->bdrv_recurse_can_replace(bs, to_replace);
}
/* For filters without their own implementation, we can recurse on our own */
filtered = bdrv_filter_bs(bs);
if (filtered) {
return bdrv_recurse_can_replace(filtered, to_replace);
}
/* Safe default */
return false;
}
/*
* Check whether the given @node_name can be replaced by a node that
* has the same data as @parent_bs. If so, return @node_name's BDS;
* NULL otherwise.
*
* @node_name must be a (recursive) child of @parent_bs (or this
* function will return NULL).
*
* The result (whether the node can be replaced or not) is only valid
* for as long as no graph or permission changes occur.
*/
BlockDriverState *check_to_replace_node(BlockDriverState *parent_bs,
const char *node_name, Error **errp)
{
BlockDriverState *to_replace_bs = bdrv_find_node(node_name);
AioContext *aio_context;
GLOBAL_STATE_CODE();
if (!to_replace_bs) {
error_setg(errp, "Failed to find node with node-name='%s'", node_name);
return NULL;
}
aio_context = bdrv_get_aio_context(to_replace_bs);
aio_context_acquire(aio_context);
if (bdrv_op_is_blocked(to_replace_bs, BLOCK_OP_TYPE_REPLACE, errp)) {
to_replace_bs = NULL;
goto out;
}
/* We don't want an arbitrary node of the BDS chain to be replaced, only the
* topmost non-filter node, in order to prevent data corruption.
* Another benefit is that this test excludes backing files, which are
* blocked by the backing blockers.
*/
if (!bdrv_recurse_can_replace(parent_bs, to_replace_bs)) {
error_setg(errp, "Cannot replace '%s' by a node mirrored from '%s', "
"because it cannot be guaranteed that doing so would not "
"lead to an abrupt change of visible data",
node_name, parent_bs->node_name);
to_replace_bs = NULL;
goto out;
}
out:
aio_context_release(aio_context);
return to_replace_bs;
}
/**
* Iterates through the list of runtime option keys that are said to
* be "strong" for a BDS. An option is called "strong" if it changes
* a BDS's data. For example, the null block driver's "size" and
* "read-zeroes" options are strong, but its "latency-ns" option is
* not.
*
* If a key returned by this function ends with a dot, all options
* starting with that prefix are strong.
*/
static const char *const *strong_options(BlockDriverState *bs,
const char *const *curopt)
{
static const char *const global_options[] = {
"driver", "filename", NULL
};
if (!curopt) {
return &global_options[0];
}
curopt++;
if (curopt == &global_options[ARRAY_SIZE(global_options) - 1] && bs->drv) {
curopt = bs->drv->strong_runtime_opts;
}
return (curopt && *curopt) ? curopt : NULL;
}
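/*
* Iteration sketch (illustrative only): strong_options() is a cursor-style
* iterator. Pass NULL to start, then feed back the previous return value
* until NULL is returned.
*
*     const char *const *opt = NULL;
*
*     while ((opt = strong_options(bs, opt))) {
*         // *opt is "driver", then "filename", then each entry of
*         // bs->drv->strong_runtime_opts (if the driver defines any)
*     }
*
* A key ending in '.' (the key "x." is a hypothetical example) stands for
* every option whose name starts with that prefix.
*/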
/**
* Copies all strong runtime options from bs->options to the given
* QDict. The set of strong option keys is determined by invoking
* strong_options().
*
* Returns true iff any strong option was present in bs->options (and
* thus copied to the target QDict) with the exception of "filename"
* and "driver". The caller is expected to use this value to decide
* whether the existence of strong options prevents the generation of
* a plain filename.
*/
static bool append_strong_runtime_options(QDict *d, BlockDriverState *bs)
{
bool found_any = false;
const char *const *option_name = NULL;
if (!bs->drv) {
return false;
}
while ((option_name = strong_options(bs, option_name))) {
bool option_given = false;
assert(strlen(*option_name) > 0);
if ((*option_name)[strlen(*option_name) - 1] != '.') {
QObject *entry = qdict_get(bs->options, *option_name);
if (!entry) {
continue;
}
qdict_put_obj(d, *option_name, qobject_ref(entry));
option_given = true;
} else {
const QDictEntry *entry;
for (entry = qdict_first(bs->options); entry;
entry = qdict_next(bs->options, entry))
{
if (strstart(qdict_entry_key(entry), *option_name, NULL)) {
qdict_put_obj(d, qdict_entry_key(entry),
qobject_ref(qdict_entry_value(entry)));
option_given = true;
}
}
}
/* While "driver" and "filename" need to be included in a JSON filename,
* their existence does not prohibit generation of a plain filename. */
if (!found_any && option_given &&
strcmp(*option_name, "driver") && strcmp(*option_name, "filename"))
{
found_any = true;
}
}
if (!qdict_haskey(d, "driver")) {
/* Drivers created with bdrv_new_open_driver() may not have a
* @driver option. Add it here. */
qdict_put_str(d, "driver", bs->drv->format_name);
}
return found_any;
}
/* Note: This function may return false positives; it may return true
* even if opening the backing file specified by bs's image header
* would result in exactly bs->backing. */
static bool bdrv_backing_overridden(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
if (bs->backing) {
return strcmp(bs->auto_backing_file,
bs->backing->bs->filename);
} else {
/* No backing BDS, so if the image header reports any backing
* file, it must have been suppressed */
return bs->auto_backing_file[0] != '\0';
}
}
/* Updates the following BDS fields:
* - exact_filename: A filename which may be used for opening a block device
* which (mostly) equals the given BDS (even without any
* other options; so reading and writing must return the same
* results, but caching etc. may be different)
* - full_open_options: Options which, when given when opening a block device
* (without a filename), result in a BDS (mostly)
* equalling the given one
* - filename: If exact_filename is set, it is copied here. Otherwise,
* full_open_options is converted to a JSON object, prefixed with
* "json:" (for use through the JSON pseudo protocol) and put here.
*/
void bdrv_refresh_filename(BlockDriverState *bs)
{
BlockDriver *drv = bs->drv;
BdrvChild *child;
BlockDriverState *primary_child_bs;
QDict *opts;
bool backing_overridden;
bool generate_json_filename; /* Whether our default implementation should
fill exact_filename (false) or not (true) */
GLOBAL_STATE_CODE();
if (!drv) {
return;
}
/* This BDS's file name may depend on any of its children's file names, so
* refresh those first */
QLIST_FOREACH(child, &bs->children, next) {
bdrv_refresh_filename(child->bs);
}
if (bs->implicit) {
/* For implicit nodes, just copy everything from the single child */
child = QLIST_FIRST(&bs->children);
assert(QLIST_NEXT(child, next) == NULL);
pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
child->bs->exact_filename);
pstrcpy(bs->filename, sizeof(bs->filename), child->bs->filename);
qobject_unref(bs->full_open_options);
bs->full_open_options = qobject_ref(child->bs->full_open_options);
return;
}
backing_overridden = bdrv_backing_overridden(bs);
if (bs->open_flags & BDRV_O_NO_IO) {
/* Without I/O, the backing file does not change anything.
* Therefore, in such a case (primarily qemu-img), we can
* pretend the backing file has not been overridden even if
* it technically has been. */
backing_overridden = false;
}
/* Gather the options QDict */
opts = qdict_new();
generate_json_filename = append_strong_runtime_options(opts, bs);
generate_json_filename |= backing_overridden;
if (drv->bdrv_gather_child_options) {
/* Some block drivers may not want to present all of their children's
* options, or name them differently from BdrvChild.name */
drv->bdrv_gather_child_options(bs, opts, backing_overridden);
} else {
QLIST_FOREACH(child, &bs->children, next) {
if (child == bs->backing && !backing_overridden) {
/* We can skip the backing BDS if it has not been overridden */
continue;
}
qdict_put(opts, child->name,
qobject_ref(child->bs->full_open_options));
}
if (backing_overridden && !bs->backing) {
/* Force no backing file */
qdict_put_null(opts, "backing");
}
}
qobject_unref(bs->full_open_options);
bs->full_open_options = opts;
primary_child_bs = bdrv_primary_bs(bs);
if (drv->bdrv_refresh_filename) {
/* Obsolete information is of no use here, so drop the old file name
* information before refreshing it */
bs->exact_filename[0] = '\0';
drv->bdrv_refresh_filename(bs);
} else if (primary_child_bs) {
/*
* Try to reconstruct valid information from the underlying
* file -- this only works for format nodes (filter nodes
* cannot be probed and as such must be selected by the user
* either through an options dict, or through a special
* filename which the filter driver must construct in its
* .bdrv_refresh_filename() implementation).
*/
bs->exact_filename[0] = '\0';
/*
* We can use the underlying file's filename if:
* - it has a filename,
* - the current BDS is not a filter,
* - the file is a protocol BDS, and
* - opening that file (as this BDS's format) will automatically create
* the BDS tree we have right now, that is:
* - the user did not significantly change this BDS's behavior with
* some explicit (strong) options
* - no non-file child of this BDS has been overridden by the user
* Both of these conditions are represented by generate_json_filename.
*/
if (primary_child_bs->exact_filename[0] &&
primary_child_bs->drv->bdrv_file_open &&
!drv->is_filter && !generate_json_filename)
{
strcpy(bs->exact_filename, primary_child_bs->exact_filename);
}
}
if (bs->exact_filename[0]) {
pstrcpy(bs->filename, sizeof(bs->filename), bs->exact_filename);
} else {
GString *json = qobject_to_json(QOBJECT(bs->full_open_options));
if (snprintf(bs->filename, sizeof(bs->filename), "json:%s",
json->str) >= sizeof(bs->filename)) {
/* Give user a hint if we truncated things. */
strcpy(bs->filename + sizeof(bs->filename) - 4, "...");
}
g_string_free(json, true);
}
}
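/*
* Illustration of the two possible outcomes of bdrv_refresh_filename()
* (a sketch under assumptions: the image name "img.qcow2" is hypothetical,
* and the key order and spacing of the JSON text may differ):
*
* - A qcow2 node opened from a plain file with no strong options and no
*   overridden backing file:
*       bs->exact_filename = "img.qcow2"
*       bs->filename       = "img.qcow2"
*
* - The same node with its backing file suppressed by the user
*   (backing_overridden is true and bs->backing is NULL):
*       bs->exact_filename = ""
*       bs->filename       = json:{"driver": "qcow2", "backing": null,
*                            "file": {"driver": "file",
*                                     "filename": "img.qcow2"}}
*/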
char *bdrv_dirname(BlockDriverState *bs, Error **errp)
{
BlockDriver *drv = bs->drv;
BlockDriverState *child_bs;
GLOBAL_STATE_CODE();
if (!drv) {
error_setg(errp, "Node '%s' is ejected", bs->node_name);
return NULL;
}
if (drv->bdrv_dirname) {
return drv->bdrv_dirname(bs, errp);
}
child_bs = bdrv_primary_bs(bs);
if (child_bs) {
return bdrv_dirname(child_bs, errp);
}
bdrv_refresh_filename(bs);
if (bs->exact_filename[0] != '\0') {
return path_combine(bs->exact_filename, "");
}
error_setg(errp, "Cannot generate a base directory for %s nodes",
drv->format_name);
return NULL;
}
/*
* Hot add/remove a BDS's child. So the user can take a child offline when
* it is broken and take a new child online
*/
void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
Error **errp)
{
GLOBAL_STATE_CODE();
if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) {
error_setg(errp, "The node %s does not support adding a child",
bdrv_get_device_or_node_name(parent_bs));
return;
}
if (!QLIST_EMPTY(&child_bs->parents)) {
error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
return;
}
parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp);
}
void bdrv_del_child(BlockDriverState *parent_bs, BdrvChild *child, Error **errp)
{
BdrvChild *tmp;
GLOBAL_STATE_CODE();
if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) {
error_setg(errp, "The node %s does not support removing a child",
bdrv_get_device_or_node_name(parent_bs));
return;
}
QLIST_FOREACH(tmp, &parent_bs->children, next) {
if (tmp == child) {
break;
}
}
if (!tmp) {
error_setg(errp, "The node %s does not have a child named %s",
bdrv_get_device_or_node_name(parent_bs),
bdrv_get_device_or_node_name(child->bs));
return;
}
parent_bs->drv->bdrv_del_child(parent_bs, child, errp);
}
int bdrv_make_empty(BdrvChild *c, Error **errp)
{
BlockDriver *drv = c->bs->drv;
int ret;
GLOBAL_STATE_CODE();
assert(c->perm & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED));
if (!drv->bdrv_make_empty) {
error_setg(errp, "%s does not support emptying nodes",
drv->format_name);
return -ENOTSUP;
}
ret = drv->bdrv_make_empty(c->bs);
if (ret < 0) {
error_setg_errno(errp, -ret, "Failed to empty %s",
c->bs->filename);
return ret;
}
return 0;
}
/*
* Return the child that @bs acts as an overlay for, and from which data may be
* copied in COW or COR operations. Usually this is the backing file.
*/
BdrvChild *bdrv_cow_child(BlockDriverState *bs)
{
IO_CODE();
if (!bs || !bs->drv) {
return NULL;
}
if (bs->drv->is_filter) {
return NULL;
}
if (!bs->backing) {
return NULL;
}
assert(bs->backing->role & BDRV_CHILD_COW);
return bs->backing;
}
/*
* If @bs acts as a filter for exactly one of its children, return
* that child.
*/
BdrvChild *bdrv_filter_child(BlockDriverState *bs)
{
BdrvChild *c;
IO_CODE();
if (!bs || !bs->drv) {
return NULL;
}
if (!bs->drv->is_filter) {
return NULL;
}
/* Only one of @backing or @file may be used */
assert(!(bs->backing && bs->file));
c = bs->backing ?: bs->file;
if (!c) {
return NULL;
}
assert(c->role & BDRV_CHILD_FILTERED);
return c;
}
/*
* Return either the result of bdrv_cow_child() or bdrv_filter_child(),
* whichever is non-NULL.
*
* Return NULL if both are NULL.
*/
BdrvChild *bdrv_filter_or_cow_child(BlockDriverState *bs)
{
BdrvChild *cow_child = bdrv_cow_child(bs);
BdrvChild *filter_child = bdrv_filter_child(bs);
IO_CODE();
/* Filter nodes cannot have COW backing files */
assert(!(cow_child && filter_child));
return cow_child ?: filter_child;
}
/*
* Return the primary child of this node: For filters, that is the
* filtered child. For other nodes, that is usually the child storing
* metadata.
* (A generally more helpful description is that this is (usually) the
* child that has the same filename as @bs.)
*
* Drivers do not necessarily have a primary child; for example quorum
* does not.
*/
BdrvChild *bdrv_primary_child(BlockDriverState *bs)
{
BdrvChild *c, *found = NULL;
IO_CODE();
QLIST_FOREACH(c, &bs->children, next) {
if (c->role & BDRV_CHILD_PRIMARY) {
assert(!found);
found = c;
}
}
return found;
}
static BlockDriverState *bdrv_do_skip_filters(BlockDriverState *bs,
bool stop_on_explicit_filter)
{
BdrvChild *c;
if (!bs) {
return NULL;
}
while (!(stop_on_explicit_filter && !bs->implicit)) {
c = bdrv_filter_child(bs);
if (!c) {
/*
* A filter that is embedded in a working block graph must
* have a child. Assert this here so this function does
* not return a filter node that is not expected by the
* caller.
*/
assert(!bs->drv || !bs->drv->is_filter);
break;
}
bs = c->bs;
}
/*
* Note that this treats nodes with bs->drv == NULL as not being
* filters (bs->drv == NULL should be replaced by something else
* anyway).
* The advantage of this behavior is that this function will thus
* always return a non-NULL value (given a non-NULL @bs).
*/
return bs;
}
/*
* Return the first BDS that has not been added implicitly or that
* does not have a filtered child down the chain starting from @bs
* (including @bs itself).
*/
BlockDriverState *bdrv_skip_implicit_filters(BlockDriverState *bs)
{
GLOBAL_STATE_CODE();
return bdrv_do_skip_filters(bs, true);
}
/*
* Return the first BDS that does not have a filtered child down the
* chain starting from @bs (including @bs itself).
*/
BlockDriverState *bdrv_skip_filters(BlockDriverState *bs)
{
IO_CODE();
return bdrv_do_skip_filters(bs, false);
}
/*
* For a backing chain, return the first non-filter backing image of
* the first non-filter image.
*/
BlockDriverState *bdrv_backing_chain_next(BlockDriverState *bs)
{
IO_CODE();
return bdrv_skip_filters(bdrv_cow_bs(bdrv_skip_filters(bs)));
}
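/*
* Worked example (illustrative only; the node names are hypothetical).
* Given the chain
*
*     filter1 -> top (qcow2) -> filter2 -> base (qcow2)
*
* where filter1 and filter2 are filter nodes and filter2 is attached as
* the backing (COW) child of top:
*
*     bdrv_skip_filters(filter1)        returns top
*     bdrv_skip_filters(top)            returns top
*     bdrv_backing_chain_next(filter1)  returns base
*     bdrv_backing_chain_next(base)     returns NULL (no COW child)
*
* bdrv_skip_implicit_filters() behaves like bdrv_skip_filters() except
* that it also stops at the first node that was not added implicitly.
*/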
/**
* Check whether [offset, offset + bytes) overlaps with the cached
* block-status data region.
*
* If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
* which is what bdrv_bsc_is_data()'s interface needs.
* Otherwise, *pnum is not touched.
*/
static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
int64_t offset, int64_t bytes,
int64_t *pnum)
{
BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
bool overlaps;
overlaps =
qatomic_read(&bsc->valid) &&
ranges_overlap(offset, bytes, bsc->data_start,
bsc->data_end - bsc->data_start);
if (overlaps && pnum) {
*pnum = bsc->data_end - offset;
}
return overlaps;
}
/**
* See block_int.h for this function's documentation.
*/
bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
{
IO_CODE();
RCU_READ_LOCK_GUARD();
return bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
}
/**
* See block_int.h for this function's documentation.
*/
void bdrv_bsc_invalidate_range(BlockDriverState *bs,
int64_t offset, int64_t bytes)
{
IO_CODE();
RCU_READ_LOCK_GUARD();
if (bdrv_bsc_range_overlaps_locked(bs, offset, bytes, NULL)) {
qatomic_set(&bs->block_status_cache->valid, false);
}
}
/**
* See block_int.h for this function's documentation.
*/
void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
{
BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
BdrvBlockStatusCache *old_bsc;
IO_CODE();
*new_bsc = (BdrvBlockStatusCache) {
.valid = true,
.data_start = offset,
.data_end = offset + bytes,
};
QEMU_LOCK_GUARD(&bs->bsc_modify_lock);
old_bsc = qatomic_rcu_read(&bs->block_status_cache);
qatomic_rcu_set(&bs->block_status_cache, new_bsc);
if (old_bsc) {
g_free_rcu(old_bsc, rcu);
}
}
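/*
* Worked example for the block-status cache (illustrative only; the
* numbers are arbitrary). The cached region is [data_start, data_end).
*
*     int64_t pnum;
*
*     bdrv_bsc_fill(bs, 4096, 8192);
*         // cache now describes [4096, 12288) as data
*
*     bdrv_bsc_is_data(bs, 6144, &pnum);
*         // returns true, pnum = 12288 - 6144 = 6144
*
*     bdrv_bsc_is_data(bs, 12288, &pnum);
*         // returns false (data_end is exclusive), pnum is not touched
*
*     bdrv_bsc_invalidate_range(bs, 0, 4097);
*         // [0, 4097) overlaps the cached region at byte 4096, so the
*         // whole cache entry is marked invalid
*/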