Design of Practical System for Fault-Tolerant VMs
This is my review/summary of this famous paper on fault-tolerance using the state-machine approach.
Shipping every state change on the primary machine to the backup consumes enormous bandwidth.
Instead, model the servers as deterministic state machines that are kept in sync by starting them in the same initial state and giving them identical inputs in the same order. Extra coordination is required for non-deterministic inputs.
Using VMs to implement synchronized state machines is easier: the hypervisor can capture all information about non-deterministic operations on the primary VM and replay them on the backup VM.
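To make the state-machine idea concrete, here is a minimal sketch in Python. The `CounterMachine` class and its `add`/`mul` commands are invented for illustration and have nothing to do with the paper's implementation; the point is only that two replicas fed the same inputs in the same order end in the same state, and that reordering breaks this.

```python
from dataclasses import dataclass


@dataclass
class CounterMachine:
    """Toy deterministic state machine: the state depends only on the inputs."""
    total: int = 0

    def apply(self, command: str, value: int) -> int:
        if command == "add":
            self.total += value
        elif command == "mul":
            self.total *= value
        else:
            raise ValueError(f"unknown command: {command}")
        return self.total  # the "output" of this step


def run(machine: CounterMachine, inputs) -> list:
    """Feed the inputs in order and collect the outputs."""
    return [machine.apply(cmd, val) for cmd, val in inputs]


inputs = [("add", 3), ("mul", 4), ("add", 1)]
primary, backup = CounterMachine(), CounterMachine()
primary_out = run(primary, inputs)
backup_out = run(backup, inputs)

# Same initial state + same inputs in the same order => same outputs and state.
assert primary_out == backup_out
assert primary.total == backup.total

# The same inputs in a different order lead to a different state.
reordered = CounterMachine()
run(reordered, [("mul", 4), ("add", 3), ("add", 1)])
assert reordered.total != primary.total
```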
Design
To provide fault tolerance for a primary VM, we run a backup VM on a different physical server that is kept in sync and executes identically to the primary VM. The two VMs are in virtual lockstep. Only the primary VM advertises its presence on the network, hence all inputs go to the primary VM. All the inputs received by the primary VM are sent to the backup VM through the logging channel. The outputs of the backup VM are dropped by the hypervisor. The primary and backup VMs follow a protocol in which the backup acknowledges the information it receives from the primary.
Challenges
- correctly capturing all the input and non-determinism to ensure deterministic execution in the backup VM
- correctly applying the inputs and non-determinism to the backup VM
- doing so without degrading performance
Deterministic Replay Implementation
Deterministic replay records the inputs of a VM and all possible non-determinism associated with the VM execution in a stream of log entries written to a log file. The VM execution may be exactly replayed later by reading the log entries from the file. For non-deterministic operations, sufficient information is logged to allow the operation to be reproduced with the same state change and output. For non-deterministic events such as timer or IO completion interrupts, the exact instruction at which the event occurred is also recorded. During replay, the event is delivered at the same point in the instruction stream.
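The record/replay idea can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyVM`, the checksum-style state, and the use of a seeded random generator as a stand-in for real asynchronous events. What it shows is the key mechanism described above: each event is logged together with the instruction count at which it occurred, and replay delivers it at exactly that point.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    instruction: int  # instruction count at which the event was delivered
    value: int        # payload of the event (e.g. data carried by an interrupt)


class ToyVM:
    """Hypothetical guest whose state is a checksum of everything it has seen."""

    def __init__(self):
        self.state = 0

    def step(self, instruction: int):
        self.state = (self.state * 31 + instruction) % 1_000_003

    def interrupt(self, value: int):
        self.state = (self.state ^ value) % 1_000_003


def record(steps: int, seed: int):
    """Run the VM with random events and log where each one landed."""
    rng = random.Random(seed)  # stands in for real-world non-determinism
    vm, log = ToyVM(), []
    for i in range(steps):
        if rng.random() < 0.2:  # an asynchronous event arrives before instruction i
            value = rng.randrange(1 << 16)
            log.append(LogEntry(i, value))
            vm.interrupt(value)
        vm.step(i)
    return vm.state, log


def replay(steps: int, log):
    """Re-run the VM, delivering each logged event at the recorded instruction."""
    vm, pending = ToyVM(), list(log)
    for i in range(steps):
        while pending and pending[0].instruction == i:
            vm.interrupt(pending.pop(0).value)
        vm.step(i)
    return vm.state


recorded_state, log = record(steps=200, seed=7)
assert replay(200, log) == recorded_state
```

Replay never consults the random generator: the log alone is enough to reproduce the recorded execution.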
Fault Tolerance Protocol
Output requirement: if the backup VM ever takes over after a failure of the primary, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to the external world.
Output Rule: the primary VM may not send an output to the external world, until the backup VM has received and acknowledged the log entry associated with the operation producing the output.
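A rough sketch of the Output Rule follows. The `Primary` class, its sequence numbers, and the in-memory queues are all hypothetical; the real system works at the hypervisor level. The sketch only illustrates that an output is held back until the backup has acknowledged the log entry for the operation that produced it.

```python
from collections import deque


class Primary:
    """Hypothetical primary that holds outputs until their log entries are acked."""

    def __init__(self):
        self.next_seq = 0
        self.acked_through = -1       # highest log sequence number acked by the backup
        self.log_channel = deque()    # log entries on their way to the backup
        self.held_outputs = deque()   # (seq, packet) pairs waiting for an ack
        self.sent = []                # packets released to the external world

    def produce_output(self, packet: str):
        """Log the output operation first, then hold the packet."""
        seq = self.next_seq
        self.next_seq += 1
        self.log_channel.append((seq, f"output:{packet}"))
        self.held_outputs.append((seq, packet))
        self._release()

    def on_ack(self, seq: int):
        """Backup acknowledged every log entry up to and including seq."""
        self.acked_through = max(self.acked_through, seq)
        self._release()

    def _release(self):
        # Release outputs in order, but only those whose log entry is acked.
        while self.held_outputs and self.held_outputs[0][0] <= self.acked_through:
            self.sent.append(self.held_outputs.popleft()[1])


primary = Primary()
primary.produce_output("reply-1")
primary.produce_output("reply-2")
assert primary.sent == []                      # nothing leaves before an ack
primary.on_ack(0)
assert primary.sent == ["reply-1"]
primary.on_ack(1)
assert primary.sent == ["reply-1", "reply-2"]
```

Note that `produce_output` returns immediately: only the output is delayed, and the primary itself can keep executing while it waits for the acknowledgement.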
Detecting & Responding to Failure
- Backup fails
  - Primary goes live: it leaves recording mode (stops sending log entries over the logging channel) and starts executing normally
- Primary fails
  - Backup replays all remaining log entries, then goes live and starts executing as a normal VM
  - The MAC address of the new primary VM is advertised on the network so that switches know which server it now runs on
- Detecting failure
  - UDP heartbeating between the servers running the VMs, to detect server failures
  - A halt in logging-channel traffic or acknowledgements that lasts longer than a specified timeout
- Split-brain problem: if network connectivity between primary and backup is severed, both could go live and two primaries would run at the same time
  - When either the primary or backup VM wants to go live, it executes an atomic test-and-set operation on shared storage
    - Succeeds → takes over
    - Fails → the other VM is already live → halts itself (commits suicide)
    - Cannot access shared storage → waits until it can
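The go-live decision above can be sketched as follows. `SharedStorage` is a stand-in built on a `threading.Lock`, and `try_go_live` and its return strings are names made up for this example; the real test-and-set is performed against the shared virtual disk storage.

```python
import threading


class SharedStorage:
    """Stand-in for shared storage that offers an atomic test-and-set flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flag = False
        self.reachable = True

    def test_and_set(self) -> bool:
        """Return True if the caller set the flag, False if it was already set."""
        if not self.reachable:
            raise ConnectionError("shared storage unreachable")
        with self._lock:
            if self._flag:
                return False
            self._flag = True
            return True


def try_go_live(storage: SharedStorage) -> str:
    try:
        won = storage.test_and_set()
    except ConnectionError:
        return "wait"   # cannot reach storage: retry later, never go live blindly
    return "live" if won else "halt"


# Both VMs lose contact with each other and race to go live.
storage = SharedStorage()
results = []
threads = [
    threading.Thread(target=lambda: results.append(try_go_live(storage)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == ["halt", "live"]  # exactly one VM takes over
```

Whichever VM loses the race halts, so a network partition between the two hosts cannot leave two primaries running.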
Practical Implementation of Fault Tolerance
Cloning & Re-booting
FT VMotion clones the source VM to a remote host, sets up a logging channel, and causes the source VM to enter logging mode as the primary and the destination VM to enter replay mode as the new backup.
vSphere implements a clustering service that maintains management and resource information. When a failure happens, the primary informs the clustering service that it needs a new backup. The clustering service determines the best server on which to run the backup VM based on resource usage and invokes FT VMotion to create the backup VM.
Logging Channel
Primary VM writes to the log buffer as it executes. The contents in primary’s log buffer are flushed out to the logging channel ASAP. Log entries are read into backup’s buffer from the logging channel ASAP. The backup sends acknowledgements back to the primary each time it reads some log entries from the network into its log buffer.
Primary encounters full log buffer → stops execution until the log buffer is flushed out
Backup encounters empty log buffer → stops execution until a new log entry is available
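The two stall conditions can be illustrated with a pair of bounded buffers. `LogBuffer`, `flush`, and the capacity of 2 are assumptions chosen to keep the example small; they are not taken from the paper.

```python
from collections import deque


class LogBuffer:
    """Bounded FIFO of log entries; the capacity is an arbitrary illustrative value."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()

    def try_append(self, entry) -> bool:
        """Producer side. False means the buffer is full and the VM must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(entry)
        return True

    def try_pop(self):
        """Consumer side. None means the buffer is empty and the VM must stall."""
        return self.entries.popleft() if self.entries else None


def flush(src: LogBuffer, dst: LogBuffer) -> int:
    """Move as many entries as fit from src to dst (stand-in for the logging channel)."""
    moved = 0
    while src.entries and len(dst.entries) < dst.capacity:
        dst.entries.append(src.entries.popleft())
        moved += 1
    return moved  # the backup would acknowledge this many entries


primary_buf, backup_buf = LogBuffer(2), LogBuffer(2)
assert backup_buf.try_pop() is None          # backup stalls: nothing to replay yet
assert primary_buf.try_append("e1")
assert primary_buf.try_append("e2")
assert not primary_buf.try_append("e3")      # primary stalls: buffer is full
assert flush(primary_buf, backup_buf) == 2   # entries cross the logging channel
assert primary_buf.try_append("e3")          # primary resumes
assert backup_buf.try_pop() == "e1"          # backup resumes
```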
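Slowdown Mechanism: additional information is sent over the logging channel to determine the real-time execution lag between the primary and backup VMs. If the backup lags significantly, VMware FT slows down the primary by telling the scheduler to give it a slightly smaller share of the CPU; once the backup catches up, the primary's CPU share is gradually increased again.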
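One step of such a feedback loop might look like the sketch below. The function name, the thresholds, the 2% step, and the 20% floor are illustrative assumptions, not values taken from VMware FT.

```python
def adjust_cpu_limit(cpu_limit_pct: int, lag_seconds: float,
                     high_lag: float = 1.0, low_lag: float = 0.1,
                     step_pct: int = 2, floor_pct: int = 20) -> int:
    """Return the primary's new CPU limit (in percent) given the backup's lag.

    All thresholds and step sizes are hypothetical tuning knobs.
    """
    if lag_seconds > high_lag:
        # Backup is falling behind: throttle the primary a little.
        return max(floor_pct, cpu_limit_pct - step_pct)
    if lag_seconds < low_lag:
        # Backup has caught up: give CPU back to the primary gradually.
        return min(100, cpu_limit_pct + step_pct)
    return cpu_limit_pct  # lag is acceptable: leave the limit alone


limit = 100
for lag in [1.4, 1.2, 0.6, 0.05]:
    limit = adjust_cpu_limit(limit, lag)
assert limit == 98  # two steps down, one hold, one step back up
```

Small steps keep the loop slow on purpose, so the limit settles gradually instead of oscillating.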
Races: disk operations use DMA (Direct Memory Access) to transfer data directly to and from VM memory, so a disk operation can race with a memory access by the guest to the same pages. The outcome of such a race is non-deterministic, so the primary and backup could diverge.
Bounce buffer: a temporary buffer that has the same size as the memory being accessed by a disk operation. A disk read places its data in the bounce buffer, and the data is copied into guest memory only when the IO completion is delivered. A disk write first copies the data into the bounce buffer, and the disk then writes from the bounce buffer. The hypervisor only interrupts the VM after the copy is complete.
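A minimal sketch of the bounce-buffer idea, using byte arrays to stand in for guest memory and the disk. The function names and the closure-based "completion" are assumptions made for illustration, and offsets are assumed to be in range.

```python
def start_disk_read(disk: bytes, offset: int, length: int,
                    guest_memory: bytearray, dest: int):
    """Issue a read: data lands in a bounce buffer, not in guest memory."""
    bounce = bytearray(disk[offset:offset + length])

    def deliver_completion():
        # Copy into guest memory only at the point the completion is delivered.
        guest_memory[dest:dest + length] = bounce

    return deliver_completion


def start_disk_write(guest_memory: bytearray, src: int, length: int,
                     disk: bytearray, offset: int):
    """Issue a write: snapshot guest memory into a bounce buffer first."""
    bounce = bytes(guest_memory[src:src + length])

    def perform_io():
        # The disk sees the snapshot, not whatever the guest wrote afterwards.
        disk[offset:offset + length] = bounce

    return perform_io


memory = bytearray(b"........")
complete = start_disk_read(b"abcdefgh", 2, 4, memory, 0)
assert memory == b"........"   # guest memory untouched while the IO is in flight
complete()
assert memory == b"cdef...."

disk = bytearray(b"00000000")
io = start_disk_write(memory, 0, 4, disk, 4)
memory[0:4] = b"XXXX"          # guest scribbles on the pages after issuing the write
io()
assert disk == b"0000cdef"     # the write still carries the original data
```

Because guest memory changes only at the logged completion point, the primary and the backup observe the same memory contents at the same instruction.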
Network IO
Asynchronous network optimizations are disabled → all network input and output traps to the hypervisor. Two approaches help win back network performance:
- Clustering optimization to reduce VM traps and interrupts
- Reducing the transmit delay of sending a log message to the backup and getting an acknowledgement