Fault Tolerance 3.0

This guide aims to help you prepare for failures and recover from them. These failures are categorized depending on whether they come from the physical infrastructure (host failures), from the virtualized infrastructure (VM crashes) or from the virtual infrastructure manager (OpenNebula crashes).

The following sections give recipes and best practices to prevent and deal with these errors.

Host Failures

When OpenNebula detects that a host is down, a hook can be triggered to deal with the situation. OpenNebula ships with an out-of-the-box script that can act as a hook to be triggered when a host enters the ERROR state. This can be very useful to limit the downtime of a service due to a hardware failure, since it can redeploy the VMs on another host.

Let's see how to configure /etc/one/oned.conf to set up this host hook, to be triggered on the ERROR state. Uncomment the following in that configuration file:

<xterm>
#-------------------------------------------------------------------------------
HOST_HOOK = [
  name      = "error",
  on        = "ERROR",
  command   = "host_error.rb",
  arguments = "$HID -r n",
  remote    = no ]
#-------------------------------------------------------------------------------
</xterm>

We are defining a host hook, named “error”, that will execute the script 'host_error.rb' locally with the following arguments:

  • Host ID: ID of the host whose VMs are to be handled. It is compulsory and best left as $HID, which OpenNebula automatically fills in with the ID of the host that went down.
  • Action: The action to be performed on the VMs that were running on the failed host, either -r (resubmit) or -d (delete).
  • DoSuspended: Whether to also apply Action to the suspended VMs belonging to the failed host: y (yes) or n (no).
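
For example, based on the argument list above, a hook that deletes the failed host's VMs, including the suspended ones, would keep the same definition and change only the arguments (a sketch, not a second hook to be enabled alongside the first):

<xterm>
#-------------------------------------------------------------------------------
HOST_HOOK = [
  name      = "error",
  on        = "ERROR",
  command   = "host_error.rb",
  arguments = "$HID -d y",
  remote    = no ]
#-------------------------------------------------------------------------------
</xterm>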

More information on hooks here.

Additionally, there is a corner case that should be taken into account in critical production environments. OpenNebula is also tolerant to network errors (up to a limit), so a spurious network error won't trigger the hook. But if the network error persists, the hook may be triggered and the VMs resubmitted. When (and if) the network comes back, there will be a potential clash between the old VMs and their resubmitted copies. To prevent this, a script can be placed in the cron of every host to detect the network error and shut down the host completely (or delete its VMs), as in the sketch below.
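
A minimal sketch of such a watchdog, assuming the hosts can normally reach the OpenNebula front-end at frontend.example.com (a hypothetical address) and that losing that connectivity means the host should power itself off:

<xterm>
#!/bin/sh
# Hypothetical watchdog: run it periodically from root's cron on every host.
# FRONTEND is an assumption; replace it with your front-end's address.
FRONTEND=frontend.example.com

# If the front-end is unreachable, shut the host down so its VMs cannot
# clash with the copies resubmitted elsewhere by the host hook.
if ! ping -c 3 -W 5 "$FRONTEND" > /dev/null 2>&1; then
    /sbin/shutdown -h now
fi
</xterm>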

Virtual Machine Failures

The Virtual Machine lifecycle management can fail at several points. The following two cases cover them:

  • VM fails: This may be due to a network error that prevents the image from being staged into the node, a hypervisor-related issue, a migration problem, etc. The common symptom is that the VM enters the FAILED state. In order to deal with these errors, a Virtual Machine hook can be set to resubmit the failed VM (or, depending on the production scenario, delete it). This can be achieved by uncommenting the following in /etc/one/oned.conf (and restarting oned); the deletion hook is also present in the same file, and is shown after this block for reference:

<xterm>
#-------------------------------------------------------------------------------
VM_HOOK = [
  name      = "on_failure_resubmit",
  on        = "FAILURE",
  command   = "onevm resubmit",
  arguments = "$VMID" ]
#-------------------------------------------------------------------------------
</xterm>
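
For reference, the deletion variant mentioned above looks like this (a sketch mirroring the hook shipped commented out in the same file; the exact name in your oned.conf may differ):

<xterm>
#-------------------------------------------------------------------------------
VM_HOOK = [
  name      = "on_failure_delete",
  on        = "FAILURE",
  command   = "onevm delete",
  arguments = "$VMID" ]
#-------------------------------------------------------------------------------
</xterm>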

  • VM crash: This point concerns crashes that can happen to a VM after it has been successfully booted (note that here boot doesn't refer to the actual VM boot process, but to the OpenNebula boot process, which comprises staging and hypervisor deployment). OpenNebula is able to detect such crashes, reporting the VM as being in the UNKNOWN state. This failure can be recovered from using the “onevm restart” functionality, as shown below.
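
For example, to boot a crashed VM again (the VM ID 23 is just an illustration):

<xterm>
$ onevm restart 23
</xterm>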

OpenNebula Failures

OpenNebula can recover from a crash of its core daemon, since all the information regarding the infrastructure configuration and the state of the virtualized resources is stored in a persistent backend.

Therefore, the 'oned' daemon can simply be restarted after a crash: all the running VMs will be reconnected to and monitored from that point onwards, and pending machines will be placed on a suitable host just as before the crash, as will VMs in other non-transient states.
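
For example, in a self-contained installation (assuming the one script is in the oneadmin user's path; it lives under $ONE_LOCATION/bin in such installs):

<xterm>
$ one start
</xterm>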

However, VMs not in a final state may need to be recovered manually, since the VM drivers are in general stateless. The following states should be dealt with:

  • PROLOG: The image transfer from the image repository to the virtualization node was most likely disrupted.
  • BOOT: The boot process of the VM may not have completed correctly.
  • EPILOG: The image transfer from the virtualization node back to the image repository was most likely disrupted.
  • SHUTDOWN: The shutdown operation may not have completed.

In any of the above situations, the VM can be resubmitted or deleted. OpenNebula will take care of any pending clean-up operation, like removing image files or cancelling the VM in case it is still running. An external VM-collector script can be set up to automatically recover or delete such VMs when oned is restarted after a crash.
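
Manually, the recovery boils down to one of the following commands (the VM ID 23 is illustrative):

<xterm>
$ onevm resubmit 23    # retry the VM from scratch
$ onevm delete 23      # or remove it altogether
</xterm>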

Cloud Failures

Due to its architecture, clouds deployed using OpenNebula feature very few points of failure. Some interesting facts on this:

  • An OpenNebula installation has very few dependencies, thanks to its lightweight design and the in-house implementation of much of its basic functionality.
  • There are no software components residing in the nodes.
  • Communication from the front-end to the nodes uses the reliable and thoroughly tested SSH protocol.