Failover Configuration

OnApp lets you configure compute resource failover behavior. Failover settings are specified per compute zone. The instructions below explain how to manage failover processes for compute resources.

How Failover Works

The Requests before marked as failed parameter (default value = 12) specifies how many times a Compute resource can fail to reply before it is marked as offline. If a Compute resource is marked as offline and failover is enabled, the failover process starts. This parameter is configurable at Control Panel > Settings > Configuration; see the Failover Settings section below for details.

In addition, the Ping hosted virtual servers before initiating failover slider should be enabled so that the Control Panel contacts the VSs before initiating failover.

The first iteration tries to migrate all VSs according to the failover algorithm set for the Compute zone. If some VSs were not migrated, the next iteration starts. Iterations run once a minute until all VSs are migrated.

Failover can be turned on or off globally for the whole cloud in the /onapp/interface/config/on_app.yml file. To keep failover enabled, make sure that disable_hypervisor_failover is set to 'false'.
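The check described above can be scripted. The following is a minimal sketch, assuming the setting appears in on_app.yml as a plain `key: value` line; the function name and the parsing approach are illustrative and are not part of OnApp.

```python
import re
from typing import Optional

# Matches a line such as: disable_hypervisor_failover: false
# (assumption: the value is a single word, optionally quoted)
_FLAG_PATTERN = re.compile(
    r"^\s*disable_hypervisor_failover\s*:\s*[\"']?(\w+)", re.MULTILINE
)

def failover_globally_enabled(config_text: str) -> Optional[bool]:
    """Return True if failover is enabled, False if disabled,
    or None if the setting is not present in the given text."""
    match = _FLAG_PATTERN.search(config_text)
    if match is None:
        return None
    # Failover is enabled when the "disable" flag is false.
    return match.group(1).lower() == "false"

if __name__ == "__main__":
    sample = "disable_hypervisor_failover: false\n"
    print(failover_globally_enabled(sample))
```

In practice you would pass in the contents of /onapp/interface/config/on_app.yml read from the Control Panel server.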

Note that you should also check the Operating System Type option of the target compute resource before the VS migration. A Windows-based VS can only be migrated to a compute resource with the Any option or the Windows only option enabled. A Linux-based or FreeBSD-based VS can only be migrated to a compute resource with the Any option or the Non-Windows option enabled.
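The compatibility rule above can be expressed as a small check. This is an illustrative sketch only: the string labels ("any", "windows_only", "non_windows", "windows", "linux", "freebsd") are hypothetical names chosen for the example and are not OnApp identifiers.

```python
def can_migrate_to(vs_os: str, target_os_type: str) -> bool:
    """Return True if a VS with the given OS family may be migrated to a
    compute resource with the given Operating System Type option.

    vs_os: "windows", "linux" or "freebsd" (hypothetical labels)
    target_os_type: "any", "windows_only" or "non_windows" (hypothetical labels)
    """
    if target_os_type == "any":
        return True
    if vs_os == "windows":
        return target_os_type == "windows_only"
    # Linux-based and FreeBSD-based VSs need the Non-Windows option.
    return target_os_type == "non_windows"

if __name__ == "__main__":
    print(can_migrate_to("windows", "non_windows"))
    print(can_migrate_to("freebsd", "any"))
```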

Additional Considerations for Integrated Storage

In Integrated Storage, backend nodes are marked as inactive approximately three minutes after a backend node has stopped reporting its status. IS is a distributed system, and it takes some time to sync/converge metadata across nodes. If IS is used in the cloud, it is strongly recommended to set the “Requests before marked as failed” parameter in the Settings > Configuration menu to at least 18-20.

Failover Settings

To configure Compute zone failover settings:

  1. Go to your Control Panel's Settings menu, and click the Compute resource Zones icon.
    The screen that appears will show all zones currently set up in the cloud.

  2. Click the Actions button next to the required Compute zone, then click Edit and specify the following parameters:

Placement type - specify the Compute resource selection algorithm that will be used for virtual server provisioning and recovery, per Compute zone:
Take Compute resource with maximum free RAM (Round Robin) - set this type to select the Compute resource with maximum free RAM during VS recovery. This option allows faster migration of virtual servers, with fewer iterations during the failover.

This option behaves in different ways, depending on the event:

    1. On provisioning, the round-robin algorithm will be used on Compute resource selection.

    2. On recovery, the Compute resource with maximum free RAM will be selected.

Take Compute resource with minimum required free RAM - with this type, the system selects the Compute resource with minimum required free RAM. This option fills each Compute resource as tightly as possible before the system starts to use the next Compute resource in the zone.

Failover timeout - set how many minutes the system should keep trying to find an appropriate compute resource to which to migrate the VSs from the failed compute resource. The count starts the first time the system finds no compute resource to which the VSs can be migrated.
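The difference between the two placement types on recovery can be illustrated with a short sketch. It assumes that "minimum required free RAM" means the compute resource with the least free RAM that still fits the VS; that reading, the function names, and the data layout are assumptions made for illustration and do not reflect OnApp's internal implementation.

```python
from typing import Dict, Optional

def _candidates(free_ram_mb: Dict[str, int], vs_ram_mb: int) -> Dict[str, int]:
    # Keep only compute resources with enough free RAM for the VS.
    return {name: free for name, free in free_ram_mb.items() if free >= vs_ram_mb}

def pick_max_free_ram(free_ram_mb: Dict[str, int], vs_ram_mb: int) -> Optional[str]:
    """Recovery behavior of the maximum free RAM placement type."""
    fits = _candidates(free_ram_mb, vs_ram_mb)
    return max(fits, key=fits.get) if fits else None

def pick_min_required_free_ram(free_ram_mb: Dict[str, int], vs_ram_mb: int) -> Optional[str]:
    """Tightest fit: the smallest free RAM that still holds the VS (assumed reading)."""
    fits = _candidates(free_ram_mb, vs_ram_mb)
    return min(fits, key=fits.get) if fits else None

if __name__ == "__main__":
    zone = {"cr1": 4096, "cr2": 16384, "cr3": 8192}  # free RAM in MB, made-up values
    print(pick_max_free_ram(zone, 6000))
    print(pick_min_required_free_ram(zone, 6000))
```

When neither function finds a compute resource that fits, it returns None, which corresponds to the situation in which the failover timeout count begins.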

You can disable failover for each particular Compute resource in Compute resource settings:

  1. Go to your Control Panel's Settings menu.

  2. Click the Compute resources icon.

  3. Click the Actions button next to the Compute resource you want to edit, then click Edit.

  4. On the screen that follows, change the failover settings:

Disable failover - enable or disable the VS migration to another Compute resource if this Compute resource is marked as offline by the Control Panel server.

To configure the Requests before marked as failed parameter:

  1. Go to your Control Panel's Settings menu, and click the Configuration icon.

  2. Click the System tab to change the settings:

Requests before marked as failed - determines how many times the Control Panel server will attempt to contact a Compute resource before failover is initiated. For the Integrated Storage, we recommend increasing this parameter to 30, so that the storage platform has enough time to mark the Compute resources accordingly, and allow the VSs to start up after a failed Compute resource.

The time before the CP initiates failover may differ depending on the number of compute resources and their load.

Ping hosted virtual servers before initiating failover - move the slider to the right to enable contacting VSs before initiating failover for a particular compute resource. By default this slider is enabled.

Note that if you use Floating IPs in your environment, or if you have VSs with primary IPs that could respond to your Control Panel server from elsewhere on your network, we recommend disabling this setting to avoid a false-positive ICMP result.
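As a rough guide to choosing a value for Requests before marked as failed, the following sketch estimates the minimum time before a compute resource can be marked as failed. It assumes the 10-second check interval described in the Failover Algorithm section; the function name is illustrative. The real time may be longer, since it depends on the number of compute resources and their load.

```python
CHECK_INTERVAL_SECONDS = 10  # check interval stated in the Failover Algorithm section

def min_seconds_before_marked_failed(requests_before_failed: int = 12) -> int:
    """Lower-bound estimate: one failed request per check interval."""
    if requests_before_failed < 1:
        raise ValueError("requests_before_failed must be at least 1")
    return requests_before_failed * CHECK_INTERVAL_SECONDS

if __name__ == "__main__":
    for value in (12, 18, 30):
        print(value, "requests ->", min_seconds_before_marked_failed(value), "seconds")
```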

Failover Algorithm

The Control Panel daemon checks compute resource accessibility via the management network (using SNMP) every 10 seconds.

If the compute resource's SNMP service is still down after a certain number of attempts (set in the settings as Requests before marked as failed), the system verifies that the compute resource is actually offline.

Control Panel takes the following steps:

Option A

The Control Panel sends an snmpget request. If it is successful, the Control Panel connects to the compute resource over SSH and runs virsh list, and the failure count (the number of requests before the compute resource is marked as failed) is reset.

Option B

If the snmpget request fails, SSH is checked. If the SSH command is successful, the Control Panel connects to the compute resource over SSH, the services are checked (snmpd and snmptrapd, restart, etc.), and one more snmpget request is sent. If it is successful, option A is applied.

Option C

If option B is unsuccessful, one more snmpget request is sent. If it is successful, option A is applied. If it fails, you get an alert (stating that SNMP has an unusual configuration) and the failure count (the number of requests before the compute resource is marked as failed) is reset.

Option D

If the SSH check is unsuccessful, all booted VSs of the compute resource are pinged. This step is optional and depends on whether the Ping hosted virtual servers before initiating failover slider is enabled (it is enabled by default; see the Failover Settings section above).

Option E

If the ping of the VSs is successful, you get an alert and the failure count (the number of requests before the compute resource is marked as failed) is reset.

Option F

If the ping of the VSs is unsuccessful, failover is activated and the compute resource is marked as offline.

The commands mentioned above have the following meanings:

virsh list - gets the status of the virtualization system (Xen or KVM) to ensure that it works properly
snmpget - gets the uptime of the compute resource
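The decision flow in this section can be summarized as a small function. This is a simplified sketch, not OnApp code: the names are illustrative, each check is reduced to a boolean result, and it assumes that failover proceeds directly when the SSH check fails and the VS ping step is disabled.

```python
from enum import Enum
from typing import Tuple

class Outcome(Enum):
    RESET = "failure count reset"
    ALERT_AND_RESET = "alert raised, failure count reset"
    FAILOVER = "compute resource marked offline, failover started"

def decide_failover(
    snmp_ok: bool,
    ssh_ok: bool,
    snmp_retries_ok: Tuple[bool, bool],
    ping_enabled: bool,
    vs_ping_ok: bool,
) -> Outcome:
    """Map the results of the checks onto the outcomes described above."""
    if snmp_ok:
        return Outcome.RESET  # first snmpget succeeded
    if ssh_ok:
        # Services are checked and snmpget is retried up to two more times.
        if any(snmp_retries_ok):
            return Outcome.RESET
        return Outcome.ALERT_AND_RESET  # SNMP has an unusual configuration
    # SSH failed: optionally ping all booted VSs.
    if ping_enabled and vs_ping_ok:
        return Outcome.ALERT_AND_RESET
    # Assumption: with the ping step disabled, failover starts here as well.
    return Outcome.FAILOVER

if __name__ == "__main__":
    print(decide_failover(False, False, (False, False), True, False).value)
```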

Failover Logs

Failover logs list the failover processes that take place on the Compute zones in the cloud.

To view the list of failover processes:

  1. Go to Control Panel > Logs.

  2. Click the Failover Processes button. On the page that appears, you can see the following information for each failover log:

    • Failover number

    • Time when the failover started

    • Compute zone on which the failover happened

    • Time of the last iteration

    • Failover action status: active or completed

To view the failover transaction details, click its reference number.