Fault Tolerance 2.2
This guide explains how to prepare for failures and how to recover from them. These failures are categorized by their origin: the physical infrastructure (Host failures), the virtualized infrastructure (VM crashes), or the virtual infrastructure manager (OpenNebula crashes).
The following sections give recipes and best practices to prevent and deal with these errors.
When OpenNebula detects that a host is down, a hook can be triggered to deal with the situation. OpenNebula ships with an out-of-the-box script that can act as a hook to be triggered when a host enters the ERROR state. This can be very useful to limit the downtime of a service caused by a hardware failure, since it can redeploy the affected VMs on another host.
To set up this Host hook, triggered in the ERROR state, uncomment the following in $ONE_LOCATION/etc/oned.conf:
<xterm>
#-------------------------------------------------------------------------------
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "host_error.rb",
    arguments = "$HID -r n",
    remote    = no ]
#-------------------------------------------------------------------------------
</xterm>
We are defining a host hook, named “error”, that will execute the script 'host_error.rb' locally with the following arguments:
| Argument | Description |
|---|---|
| Host ID | ID of the host containing the VMs to treat. It is compulsory and best left as $HID, which OpenNebula automatically fills with the ID of the host that went down. |
| Action | The action to perform on the VMs that were running on the failed host: -r (resubmit) or -d (delete). |
| DoSuspended | Whether to also apply Action to suspended VMs belonging to the failed host: y (yes) or n (no). |
More information on hooks here.
Additionally, there is a corner case that should be taken into account in critical production environments. OpenNebula is also tolerant to network errors (up to a limit), so a spurious network error won't trigger the hook. But if the network error persists, the hook may be triggered and the VMs resubmitted. When (and if) the network comes back, there will be a potential clash between the old and the resubmitted VMs. To prevent this, a script can be placed in the cron of every host to detect the network error and shut the host down completely (or delete its VMs).
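Such a watchdog can be sketched as a small cron script. Everything here is illustrative, not part of OpenNebula: the front-end hostname, the state-file path, the threshold, and the script name are placeholders, and the reachability test is a plain ping.

```shell
#!/bin/sh
# Hypothetical cron watchdog (all names and paths are illustrative).
# It counts consecutive failed reachability checks against the OpenNebula
# front-end and powers the host off once a threshold is reached, so that
# resubmitted copies cannot clash with the original VMs if the network
# later recovers.

FRONTEND=${FRONTEND:-frontend.example.org}  # assumed front-end hostname
THRESHOLD=${THRESHOLD:-3}                   # consecutive failures tolerated
STATE=${STATE:-/var/run/one_net_failures}   # failure counter kept between runs

frontend_reachable() {
    ping -c 1 -W 2 "$FRONTEND" > /dev/null 2>&1
}

record_check() {
    # $1 is 0 when the check succeeded, non-zero otherwise.
    # Updates the counter file and prints the new consecutive-failure count.
    prev=$(cat "$STATE" 2>/dev/null || echo 0)
    if [ "$1" -eq 0 ]; then
        count=0
    else
        count=$((prev + 1))
    fi
    echo "$count" > "$STATE"
    echo "$count"
}

main() {
    frontend_reachable
    count=$(record_check $?)
    if [ "$count" -ge "$THRESHOLD" ]; then
        shutdown -h now    # or, alternatively, delete the local VMs
    fi
}

# Only act when explicitly enabled, e.g. from a crontab entry such as:
#   * * * * * WATCHDOG_RUN=yes /usr/local/sbin/one_net_watchdog
if [ "${WATCHDOG_RUN:-no}" = yes ]; then
    main
fi
```

The counter file makes the check stateful across cron runs, which is what turns "the network is down right now" into "the network has been down long enough that the front-end hook has probably fired".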
The Virtual Machine lifecycle management can fail at several points. The following two cases should cover them:
VMs that enter the FAILURE state can be automatically resubmitted by adding the following hook to $ONE_LOCATION/etc/oned.conf (and restarting oned):
<xterm>
#-------------------------------------------------------------------------------
VM_HOOK = [
    name      = "on_failure_resubmit",
    on        = "FAILURE",
    command   = "onevm resubmit",
    arguments = "$VMID" ]
#-------------------------------------------------------------------------------
</xterm>
OpenNebula can recover from a crash of its core daemon, since all the information regarding the infrastructure configuration and the state of the virtualized resources is stored in a persistent backend.
Therefore, the 'oned' daemon can be restarted after a crash: all running VMs will be reconnected and monitored from that point onwards, and pending machines will be placed on a suitable host just as before the crash, as will other non-transient states.
However, VMs not in a final state may need to be recovered manually, since in general the VM drivers are stateless. The following states should be dealt with:
In any of the above situations, the VM can be resubmitted or deleted. OpenNebula will take care of any pending clean-up operations, like removing image files or cancelling the VM if it is running. An external VM-collector script can be set up to automatically recover or delete such VMs when oned is restarted after a crash.
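As a sketch of such a collector, the fragment below filters captured `onevm list` output for VMs in the 'fail' state and prints a resubmit command for each. The function name is our own, and the column positions (ID in the 1st column, state in the 4th) and the 'fail' label are assumptions that should be checked against the actual `onevm list` output of your installation.

```shell
#!/bin/sh
# Hypothetical VM-collector sketch (function name and column layout are
# assumptions). Given the output of `onevm list`, print one recovery
# command per VM whose state column reads 'fail'.

failed_vm_commands() {
    # $1: captured `onevm list` output; the VM ID is assumed to be the
    # 1st column and the state the 4th (the header line is skipped).
    echo "$1" | awk 'NR > 1 && $4 == "fail" { print "onevm resubmit " $1 }'
}

# Typical use right after restarting oned (swap in "onevm delete" in the
# awk program above to drop the VMs instead of resubmitting them):
#   failed_vm_commands "$(onevm list)" | sh
```

Printing the commands rather than executing them directly lets an administrator review what the script is about to do before piping it to a shell.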
Due to its architecture, clouds deployed using OpenNebula feature very few points of failure. Some interesting facts on this: