 d73abd6dcc
			
		
	
	
		d73abd6dcc
		
	
	
	
	
		
			
			This patch adds global variables, defines, function declarations, and function stubs for deterministic VM replay used by external modules. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Message-Id: <20150917162337.8676.41538.stgit@PASHA-ISP.def.inno> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
		
			
				
	
	
		
			169 lines
		
	
	
		
			8.4 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			169 lines
		
	
	
		
			8.4 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Copyright (c) 2010-2015 Institute for System Programming
 | |
|                         of the Russian Academy of Sciences.
 | |
| 
 | |
| This work is licensed under the terms of the GNU GPL, version 2 or later.
 | |
| See the COPYING file in the top-level directory.
 | |
| 
 | |
| Record/replay
 | |
| -------------
 | |
| 
 | |
| Record/replay functions are used for the reverse execution and deterministic
 | |
| replay of qemu execution. This implementation of deterministic replay can
 | |
| be used for deterministic debugging of guest code through a gdb remote
 | |
| interface.
 | |
| 
 | |
| Execution recording writes a non-deterministic events log, which can be later
 | |
| used for replaying the execution anywhere and for unlimited number of times.
 | |
| It also supports checkpointing for faster rewinding during reverse debugging.
 | |
| Execution replaying reads the log and replays all non-deterministic events
 | |
| including external input, hardware clocks, and interrupts.
 | |
| 
 | |
| Deterministic replay has the following features:
 | |
|  * Deterministically replays whole system execution and all contents of
 | |
|    the memory, state of the hardware devices, clocks, and screen of the VM.
 | |
|  * Writes execution log into the file for later replaying for multiple times
 | |
|    on different machines.
 | |
|  * Supports i386, x86_64, and ARM hardware platforms.
 | |
|  * Performs deterministic replay of all operations with keyboard and mouse
 | |
|    input devices.
 | |
| 
 | |
| Usage of the record/replay:
 | |
|  * First, record the execution, by adding the following arguments to the command line:
 | |
|    '-icount shift=7,rr=record,rrfile=replay.bin -net none'.
 | |
|    Block devices' images are not actually changed in the recording mode,
 | |
|    because all of the changes are written to the temporary overlay file.
 | |
|  * Then you can replay it by using another command
 | |
|    line option: '-icount shift=7,rr=replay,rrfile=replay.bin -net none'
 | |
|  * '-net none' option should also be specified if network replay patches
 | |
|    are not applied.
 | |
| 
 | |
| Papers with description of deterministic replay implementation:
 | |
| http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
 | |
| http://dl.acm.org/citation.cfm?id=2786805.2803179
 | |
| 
 | |
| Modifications of qemu include:
 | |
|  * wrappers for clock and time functions to save their return values in the log
 | |
|  * saving different asynchronous events (e.g. system shutdown) into the log
 | |
|  * synchronization of the bottom halves execution
 | |
|  * synchronization of the threads from thread pool
 | |
|  * recording/replaying user input (mouse and keyboard)
 | |
|  * adding internal checkpoints for cpu and io synchronization
 | |
| 
 | |
| Non-deterministic events
 | |
| ------------------------
 | |
| 
 | |
| Our record/replay system is based on saving and replaying non-deterministic
 | |
| events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
 | |
| from HDD or memory of the VM). Saving only non-deterministic events makes
 | |
| log file smaller, simulation faster, and allows using reverse debugging even
 | |
| for realtime applications.
 | |
| 
 | |
| The following non-deterministic data from peripheral devices is saved into
 | |
| the log: mouse and keyboard input, network packets, audio controller input,
 | |
| USB packets, serial port input, and hardware clocks (they are non-deterministic
 | |
| too, because their values are taken from the host machine). Inputs from
 | |
| simulated hardware, memory of VM, software interrupts, and execution of
 | |
| instructions are not saved into the log, because they are deterministic and
 | |
| can be replayed by simulating the behavior of virtual machine starting from
 | |
| initial state.
 | |
| 
 | |
| We had to solve three tasks to implement deterministic replay: recording
 | |
| non-deterministic events, replaying non-deterministic events, and checking
 | |
| that there is no divergence between record and replay modes.
 | |
| 
 | |
| We changed several parts of QEMU to make event log recording and replaying.
 | |
| Devices' models that have non-deterministic input from external devices were
 | |
| changed to write every external event into the execution log immediately.
 | |
| E.g. network packets are written into the log when they arrive into the virtual
 | |
| network adapter.
 | |
| 
 | |
| All non-deterministic events are coming from these devices. But to
 | |
| replay them we need to know at which moments they occur. We specify
 | |
| these moments by counting the number of instructions executed between
 | |
| every pair of consecutive events.
 | |
| 
 | |
| Instruction counting
 | |
| --------------------
 | |
| 
 | |
| QEMU should work in icount mode to use record/replay feature. icount was
 | |
| designed to allow deterministic execution in absence of external inputs
 | |
| of the virtual machine. We also use icount to control the occurrence of the
 | |
| non-deterministic events. The number of instructions elapsed from the last event
 | |
| is written to the log while recording the execution. In replay mode we
 | |
| can predict when to inject that event using the instruction counter.
 | |
| 
 | |
| Timers
 | |
| ------
 | |
| 
 | |
| Timers are used to execute callbacks from different subsystems of QEMU
 | |
| at the specified moments of time. There are several kinds of timers:
 | |
|  * Real time clock. Based on host time and used only for callbacks that
 | |
|    do not change the virtual machine state. For this reason real time
 | |
|    clock and timers does not affect deterministic replay at all.
 | |
|  * Virtual clock. These timers run only during the emulation. In icount
 | |
|    mode virtual clock value is calculated using executed instructions counter.
 | |
|    That is why it is completely deterministic and does not have to be recorded.
 | |
|  * Host clock. This clock is used by device models that simulate real time
 | |
|    sources (e.g. real time clock chip). Host clock is the one of the sources
 | |
|    of non-determinism. Host clock read operations should be logged to
 | |
|    make the execution deterministic.
 | |
|  * Real time clock for icount. This clock is similar to real time clock but
 | |
|    it is used only for increasing virtual clock while virtual machine is
 | |
|    sleeping. Due to its nature it is also non-deterministic as the host clock
 | |
|    and has to be logged too.
 | |
| 
 | |
| Checkpoints
 | |
| -----------
 | |
| 
 | |
| Replaying of the execution of virtual machine is bound by sources of
 | |
| non-determinism. These are inputs from clock and peripheral devices,
 | |
| and QEMU thread scheduling. Thread scheduling affect on processing events
 | |
| from timers, asynchronous input-output, and bottom halves.
 | |
| 
 | |
| Invocations of timers are coupled with clock reads and changing the state
 | |
| of the virtual machine. Reads produce non-deterministic data taken from
 | |
| host clock. And VM state changes should preserve their order. Their relative
 | |
| order in replay mode must replicate the order of callbacks in record mode.
 | |
| To preserve this order we use checkpoints. When a specific clock is processed
 | |
| in record mode we save to the log special "checkpoint" event.
 | |
| Checkpoints here do not refer to virtual machine snapshots. They are just
 | |
| record/replay events used for synchronization.
 | |
| 
 | |
| QEMU in replay mode will try to invoke timers processing in random moment
 | |
| of time. That's why we do not process a group of timers until the checkpoint
 | |
| event will be read from the log. Such an event allows synchronizing CPU
 | |
| execution and timer events.
 | |
| 
 | |
| Another checkpoints application in record/replay is instruction counting
 | |
| while the virtual machine is idle. This function (qemu_clock_warp) is called
 | |
| from the wait loop. It changes virtual machine state and must be deterministic
 | |
| then. That is why we added checkpoint to this function to prevent its
 | |
| operation in replay mode when it does not correspond to record mode.
 | |
| 
 | |
| Bottom halves
 | |
| -------------
 | |
| 
 | |
| Disk I/O events are completely deterministic in our model, because
 | |
| in both record and replay modes we start virtual machine from the same
 | |
| disk state. But callbacks that virtual disk controller uses for reading and
 | |
| writing the disk may occur at different moments of time in record and replay
 | |
| modes.
 | |
| 
 | |
| Reading and writing requests are created by CPU thread of QEMU. Later these
 | |
| requests proceed to block layer which creates "bottom halves". Bottom
 | |
| halves consist of callback and its parameters. They are processed when
 | |
| main loop locks the global mutex. These locks are not synchronized with
 | |
| replaying process because main loop also processes the events that do not
 | |
| affect the virtual machine state (like user interaction with monitor).
 | |
| 
 | |
| That is why we had to implement saving and replaying bottom halves callbacks
 | |
| synchronously to the CPU execution. When the callback is about to execute
 | |
| it is added to the queue in the replay module. This queue is written to the
 | |
| log when its callbacks are executed. In replay mode callbacks are not processed
 | |
| until the corresponding event is read from the events log file.
 | |
| 
 | |
| Sometimes the block layer uses asynchronous callbacks for its internal purposes
 | |
| (like reading or writing VM snapshots or disk image cluster tables). In this
 | |
| case bottom halves are not marked as "replayable" and do not saved
 | |
| into the log.
 |