Fix COLO-Proxy part of COLO architecture diagram
Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
| COarse-grained LOck-stepping Virtual Machines for Non-stop Service
 | ||
| ----------------------------------------
 | ||
| Copyright (c) 2016 Intel Corporation
 | ||
| Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
 | ||
| Copyright (c) 2016 Fujitsu, Corp.
 | ||
| 
 | ||
| This work is licensed under the terms of the GNU GPL, version 2 or later.
 | ||
| See the COPYING file in the top-level directory.
 | ||
| 
 | ||
| This document gives an overview of COLO's design and how to use it.
 | ||
| 
 | ||
| == Background ==
 | ||
| Virtual machine (VM) replication is a well-known technique for providing
 | ||
| application-agnostic software-implemented hardware fault tolerance,
 | ||
| also known as "non-stop service".
 | ||
| 
 | ||
| COLO (COarse-grained LOck-stepping) is a high availability solution.
 | ||
| Both the primary VM (PVM) and the secondary VM (SVM) run in parallel. They
 | ||
| receive the same requests from the client and generate responses in parallel.
 | ||
| If the response packets from the PVM and SVM are identical, they are released
 | ||
| immediately. Otherwise, an on-demand VM checkpoint is performed.
 | ||
| 
 | ||
| == Architecture ==
 | ||
| 
 | ||
| The architecture of COLO is shown in the diagram below.
 | ||
| It consists of a pair of networked physical nodes:
 | ||
| the primary node, which runs the PVM, and the secondary node, which runs the
 | ||
| SVM to maintain a valid replica of the PVM.
 | ||
| The PVM and SVM execute in parallel and generate response packets for
 | ||
| client requests according to the application semantics.
 | ||
| 
 | ||
| The incoming packets from the client or external network are received by the
 | ||
| primary node, and then forwarded to the secondary node, so that both the PVM
 | ||
| and the SVM are stimulated with the same requests.
 | ||
| 
 | ||
| COLO receives the outbound packets from both the PVM and SVM and compares them
 | ||
| before allowing the output to be sent to clients.
 | ||
| 
 | ||
| The SVM qualifies as a valid replica of the PVM as long as it generates
 | ||
| identical responses to all client requests. Once a difference between the PVM
 | ||
| and SVM outputs is detected, COLO withholds transmission of the outbound
 | ||
| packets until it has successfully synchronized the PVM state to the SVM.
 | ||
| 
 | ||
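The comparison rule described above can be illustrated with a small toy model.
This is only a sketch, not QEMU code: the class and method names are
hypothetical, packets are compared as raw bytes, and the checkpoint is reduced
to a counter. The real COLO proxy applies its own comparison rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutputComparator:
    """Toy model of COLO output comparison (hypothetical, not QEMU code)."""
    released: List[bytes] = field(default_factory=list)
    withheld: List[bytes] = field(default_factory=list)
    checkpoints: int = 0

    def on_response_pair(self, pvm_packet: bytes, svm_packet: bytes) -> bool:
        """Return True if the packet was released without a checkpoint."""
        if pvm_packet == svm_packet:
            # Identical output: the SVM is still a valid replica.
            self.released.append(pvm_packet)
            return True
        # Outputs diverged: hold the packet until the SVM is resynchronized.
        self.withheld.append(pvm_packet)
        self.checkpoint()
        return False

    def checkpoint(self) -> None:
        """Stand-in for synchronizing PVM state to the SVM."""
        self.checkpoints += 1
        # After a successful sync the withheld output may be sent.
        self.released.extend(self.withheld)
        self.withheld.clear()


comparator = OutputComparator()
comparator.on_response_pair(b"HTTP/1.1 200 OK", b"HTTP/1.1 200 OK")
comparator.on_response_pair(b"seq=10", b"seq=11")
print(comparator.checkpoints, len(comparator.released))
```

The point of the sketch is the ordering: divergent output is never released
before the checkpoint has completed.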
|   Primary Node                                                            Secondary Node
 | ||
| +------------+  +-----------------------+       +------------------------+  +------------+
 | ||
| |            |  |       HeartBeat       +<----->+       HeartBeat        |  |            |
 | ||
| | Primary VM |  +-----------+-----------+       +-----------+------------+  |Secondary VM|
 | ||
| |            |              |                               |               |            |
 | ||
| |            |  +-----------|-----------+       +-----------|------------+  |            |
 | ||
| |            |  |QEMU   +---v----+      |       |QEMU  +----v---+        |  |            |
 | ||
| |            |  |       |Failover|      |       |      |Failover|        |  |            |
 | ||
| |            |  |       +--------+      |       |      +--------+        |  |            |
 | ||
| |            |  |   +---------------+   |       |   +---------------+    |  |            |
 | ||
| |            |  |   | VM Checkpoint +-------------->+ VM Checkpoint |    |  |            |
 | ||
| |            |  |   +---------------+   |       |   +---------------+    |  |            |
 | ||
| |Requests<--------------------------\ /-----------------\ /--------------------->Requests|
 | ||
| |            |  |                   ^ ^ |       |       | |              |  |            |
 | ||
| |Responses+---------------------\ /-|-|------------\ /-------------------------+Responses|
 | ||
| |            |  |               | | | | |       |  | |  | |              |  |            |
 | ||
| |            |  | +-----------+ | | | | |       |  | |  | | +----------+ |  |            |
 | ||
| |            |  | | COLO disk | | | | | |       |  | |  | | | COLO disk| |  |            |
 | ||
| |            |  | |   Manager +---------------------------->| Manager  | |  |            |
 | ||
| |            |  | ++----------+ v v | | |       |  | v  v | +---------++ |  |            |
 | ||
| |            |  |  |+-----------+-+-+-++|       | ++-+--+-+---------+ |  |  |            |
 | ||
| |            |  |  ||   COLO Proxy     ||       | |   COLO Proxy    | |  |  |            |
 | ||
| |            |  |  || (compare packet  ||       | |(adjust sequence | |  |  |            |
 | ||
| |            |  |  ||and mirror packet)||       | |    and ACK)     | |  |  |            |
 | ||
| |            |  |  |+------------+---+-+|       | +-----------------+ |  |  |            |
 | ||
| +------------+  +-----------------------+       +------------------------+  +------------+
 | ||
| +------------+     |             |   |                                |     +------------+
 | ||
| | VM Monitor |     |             |   |                                |     | VM Monitor |
 | ||
| +------------+     |             |   |                                |     +------------+
 | ||
| +---------------------------------------+       +----------------------------------------+
 | ||
| |   Kernel         |             |   |  |       |   Kernel            |                  |
 | ||
| +---------------------------------------+       +----------------------------------------+
 | ||
|                    |             |   |                                |
 | ||
|     +--------------v+  +---------v---+--+       +------------------+ +v-------------+
 | ||
|     |   Storage     |  |External Network|       | External Network | |   Storage    |
 | ||
|     +---------------+  +----------------+       +------------------+ +--------------+
 | ||
| 
 | ||
| 
 | ||
| == Components introduction ==
 | ||
| 
 | ||
| The architecture diagram above contains several components.
 | ||
| Their functions are described below.
 | ||
| 
 | ||
| HeartBeat:
 | ||
| Runs on both the primary and secondary nodes to periodically check platform
 | ||
| availability. When the primary node suffers a hardware fail-stop failure,
 | ||
| its heartbeat stops responding, and the secondary node triggers a failover
 | ||
| as soon as it detects the absence.
 | ||
| 
 | ||
| COLO disk Manager:
 | ||
| When the primary VM writes data to its image, the COLO disk manager captures
 | ||
| this data and sends it to the secondary VM, which keeps the content of the
 | ||
| secondary VM's image consistent with the content of the primary VM's image.
 | ||
| For more details, please refer to docs/block-replication.txt.
 | ||
| 
 | ||
| Checkpoint/Failover Controller:
 | ||
| Modifies the save/restore flow to implement continuous migration, so that
 | ||
| the state of the VM on the secondary side is always consistent with the VM on
 | ||
| the primary side.
 | ||
| 
 | ||
| COLO Proxy:
 | ||
| Delivers packets to the primary and secondary, and then compares the responses
 | ||
| from both sides. It then decides whether to start a checkpoint according to
 | ||
| a set of rules. Please refer to docs/colo-proxy.txt for more information.
 | ||
| 
 | ||
| Note:
 | ||
| HeartBeat has not been implemented yet, so you need to trigger the failover
 | ||
| process manually with the 'x-colo-lost-heartbeat' command.
 | ||
| 
 | ||
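Because HeartBeat is not implemented, an external tool has to decide when the
peer is gone and then issue 'x-colo-lost-heartbeat'. The following is a minimal
sketch of timeout-based detection. The class name, the timeout value and the
way timestamps are supplied are all assumptions made for illustration; COLO
does not define them.

```python
from typing import Optional


class HeartbeatMonitor:
    """Hypothetical timeout-based peer detector (not part of QEMU)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Optional[float] = None

    def beat(self, now: float) -> None:
        """Record that a heartbeat from the peer arrived at time `now`."""
        self.last_seen = now

    def peer_lost(self, now: float) -> bool:
        """True once no heartbeat has arrived for longer than the timeout."""
        if self.last_seen is None:
            # Nothing received yet, so there is no basis for a failover.
            return False
        return now - self.last_seen > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat(now=100.0)
if monitor.peer_lost(now=104.5):
    # A real tool would now send {"execute": "x-colo-lost-heartbeat"}
    # to the surviving QEMU instance.
    print("peer lost, trigger failover")
```

Timestamps are passed in explicitly so that the logic can be exercised without
waiting; a real tool would likely read a monotonic clock.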
| == Test procedure ==
 | ||
| 1. Start up QEMU
 | ||
| Primary:
 | ||
| # qemu-kvm -enable-kvm -m 2048 -smp 2 -qmp stdio -vnc :7 -name primary \
 | ||
|   -device piix3-usb-uhci \
 | ||
|   -device usb-tablet -netdev tap,id=hn0,vhost=off \
 | ||
|   -device virtio-net-pci,id=net-pci0,netdev=hn0 \
 | ||
|   -drive if=virtio,id=primary-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,\
 | ||
|          children.0.file.filename=1.raw,\
 | ||
|          children.0.driver=raw -S
 | ||
| Secondary:
 | ||
| # qemu-kvm -enable-kvm -m 2048 -smp 2 -qmp stdio -vnc :7 -name secondary \
 | ||
|   -device piix3-usb-uhci \
 | ||
|   -device usb-tablet -netdev tap,id=hn0,vhost=off \
 | ||
|   -device virtio-net-pci,id=net-pci0,netdev=hn0 \
 | ||
|   -drive if=none,id=secondary-disk0,file.filename=1.raw,driver=raw,node-name=node0 \
 | ||
|   -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
 | ||
|          file.driver=qcow2,top-id=active-disk0,\
 | ||
|          file.file.filename=/mnt/ramfs/active_disk.img,\
 | ||
|          file.backing.driver=qcow2,\
 | ||
|          file.backing.file.filename=/mnt/ramfs/hidden_disk.img,\
 | ||
|          file.backing.backing=secondary-disk0 \
 | ||
|   -incoming tcp:0:8888
 | ||
| 
 | ||
| 2. On the Secondary VM's QEMU monitor, issue the following commands:
 | ||
| {'execute':'qmp_capabilities'}
 | ||
| { 'execute': 'nbd-server-start',
 | ||
|   'arguments': {'addr': {'type': 'inet', 'data': {'host': 'xx.xx.xx.xx', 'port': '8889'} } }
 | ||
| }
 | ||
| {'execute': 'nbd-server-add', 'arguments': {'device': 'secondary-disk0', 'writable': true } }
 | ||
| 
 | ||
| Note:
 | ||
|   a. The qmp commands nbd-server-start and nbd-server-add must be run
 | ||
|      before running the qmp command migrate on the primary QEMU.
 | ||
|   b. The active disk, hidden disk and NBD target must be the same
 | ||
|      length.
 | ||
|   c. It is better to put the active disk and hidden disk on a ramdisk.
 | ||
| 
 | ||
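The secondary-side commands from step 2 can also be generated by a script,
which avoids typing mistakes in the device name. The sketch below only builds
and prints the JSON; it does not talk to QEMU. The helper name and the example
address 192.0.2.10 are assumptions, and the device name is taken from the
secondary command line in step 1.

```python
import json
from typing import Any, Dict, List


def secondary_nbd_commands(host: str, port: int,
                           device: str = "secondary-disk0") -> List[Dict[str, Any]]:
    """Build the step 2 command sequence in the order it must be issued."""
    return [
        {"execute": "qmp_capabilities"},
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                # The document passes the port as a string.
                                "data": {"host": host, "port": str(port)}}}},
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


for command in secondary_nbd_commands("192.0.2.10", 8889):
    # One JSON object per line, ready to paste into the QMP session.
    print(json.dumps(command))
```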
| 3. On the Primary VM's QEMU monitor, issue the following commands:
 | ||
| {'execute':'qmp_capabilities'}
 | ||
| { 'execute': 'human-monitor-command',
 | ||
|   'arguments': {'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xx.xx.xx.xx,file.port=8889,file.export=secondary-disk0,node-name=nbd_client0'}}
 | ||
| { 'execute':'x-blockdev-change', 'arguments':{'parent': 'primary-disk0', 'node': 'nbd_client0' } }
 | ||
| { 'execute': 'migrate-set-capabilities',
 | ||
|       'arguments': {'capabilities': [ {'capability': 'x-colo', 'state': true } ] } }
 | ||
| { 'execute': 'migrate', 'arguments': {'uri': 'tcp:xx.xx.xx.xx:8888' } }
 | ||
| 
 | ||
| Note:
 | ||
|   a. There should be only one NBD client for each primary disk.
 | ||
|   b. xx.xx.xx.xx is the secondary physical machine's hostname or IP address.
 | ||
|   c. These qmp commands must be run after the qmp commands on the
 | ||
|      secondary QEMU.
 | ||
| 
 | ||
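The primary-side sequence from step 3 repeats the secondary's address in two
places, so building it from one variable helps keep the commands consistent.
The following is a sketch under the same assumptions as the document's example
(disk, export and node names as used in steps 1 to 3); the helper name and the
example address are hypothetical, and nothing is sent to QEMU.

```python
import json
from typing import Any, Dict, List


def primary_colo_commands(secondary_host: str,
                          nbd_port: int = 8889,
                          migrate_port: int = 8888,
                          export: str = "secondary-disk0",
                          parent: str = "primary-disk0",
                          node_name: str = "nbd_client0") -> List[Dict[str, Any]]:
    """Build the step 3 command sequence in the order it must be issued."""
    drive_add = ("drive_add -n buddy driver=replication,mode=primary,"
                 f"file.driver=nbd,file.host={secondary_host},"
                 f"file.port={nbd_port},file.export={export},"
                 f"node-name={node_name}")
    return [
        {"execute": "qmp_capabilities"},
        # Attach the NBD client that points at the secondary's disk.
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        # Add the NBD client as a child of the quorum disk.
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [{"capability": "x-colo",
                                         "state": True}]}},
        # Starting the migration comes last.
        {"execute": "migrate",
         "arguments": {"uri": f"tcp:{secondary_host}:{migrate_port}"}},
    ]


for command in primary_colo_commands("192.0.2.10"):
    print(json.dumps(command))
```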
| 4. After the above steps, whenever you make changes to the PVM, the SVM will be synced.
 | ||
| You can issue the command '{ "execute": "migrate-set-parameters" , "arguments":{ "x-checkpoint-delay": 2000 } }'
 | ||
| to change the checkpoint period.
 | ||
| 
 | ||
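As a small convenience, the checkpoint period command can be generated from a
value in seconds. This sketch assumes that x-checkpoint-delay is expressed in
milliseconds, which matches the value 2000 used above; the helper name is
hypothetical.

```python
import json


def checkpoint_delay_command(seconds: float) -> str:
    """Return the migrate-set-parameters command for a period in seconds."""
    if seconds <= 0:
        raise ValueError("checkpoint period must be positive")
    # Assumption: the parameter is an integer number of milliseconds.
    delay_ms = int(round(seconds * 1000))
    return json.dumps({"execute": "migrate-set-parameters",
                       "arguments": {"x-checkpoint-delay": delay_ms}})


print(checkpoint_delay_command(2))
```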
| 5. Failover test
 | ||
| You can kill the Primary VM and run 'x_colo_lost_heartbeat' in the Secondary VM's
 | ||
| monitor at the same time; the SVM will then fail over and the client will not
 | ||
| detect the change.
 | ||
| 
 | ||
| Before issuing the '{ "execute": "x-colo-lost-heartbeat" }' command, we have to
 | ||
| issue block-related commands to stop block replication.
 | ||
| Primary:
 | ||
|   Remove the nbd child from the quorum:
 | ||
|   { 'execute': 'x-blockdev-change', 'arguments': {'parent': 'primary-disk0', 'child': 'children.1'}}
 | ||
|   { 'execute': 'human-monitor-command','arguments': {'command-line': 'drive_del nbd_client0'}}
 | ||
|   Note: there is currently no qmp command to remove the blockdev
 | ||
| 
 | ||
| Secondary:
 | ||
|   The primary host is down, so we should do the following:
 | ||
|   { 'execute': 'nbd-server-stop' }
 | ||
| 
 | ||
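The failover ordering described in step 5 (stop block replication first, then
report the lost heartbeat) can be sketched as a helper that returns the
commands for the surviving side. The function name is hypothetical, and the
quorum parent, child and drive names are deliberately required arguments
because they depend on how the disks were set up.

```python
import json
from typing import Any, Dict, List, Optional


def failover_commands(surviving_side: str,
                      parent: Optional[str] = None,
                      child: Optional[str] = None,
                      drive: Optional[str] = None) -> List[Dict[str, Any]]:
    """Commands to run on the surviving QEMU, in order."""
    lost_heartbeat = {"execute": "x-colo-lost-heartbeat"}
    if surviving_side == "secondary":
        # The primary is gone, so the NBD server is no longer needed.
        return [{"execute": "nbd-server-stop"}, lost_heartbeat]
    if surviving_side == "primary":
        if not (parent and child and drive):
            raise ValueError("parent, child and drive are required")
        return [
            # Remove the NBD child from the quorum, then delete the drive.
            {"execute": "x-blockdev-change",
             "arguments": {"parent": parent, "child": child}},
            {"execute": "human-monitor-command",
             "arguments": {"command-line": f"drive_del {drive}"}},
            lost_heartbeat,
        ]
    raise ValueError("surviving_side must be 'primary' or 'secondary'")


for command in failover_commands("secondary"):
    print(json.dumps(command))
```

In both cases the block-related commands come before x-colo-lost-heartbeat,
as the text above requires.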
| == TODO ==
 | ||
| 1. Support continuous VM replication.
 | ||
| 2. Support shared storage.
 | ||
| 3. Develop the heartbeat part.
 | ||
| 4. Reduce the VM's downtime during checkpoints.
 |