Question


How should I manage a failure of CloudBoot compute resources?


Answer


The following example scenarios can help you resolve the issue:

  • The data center power fails and the generator then fails to start:

    If all servers went offline at the same time, and the disks and all compute resources are brought back online before any VSs are booted, the disks should not be degraded.
    In this case, the vDisks should appear as synced for all data store zones listed in the left sidebar. Once all the compute resources are stable, start powering on the VSs in small batches, then in progressively larger batches if no issues are identified.
    If the servers go down at different times (for example, because UPS runtimes in the cabinet differ) or if the VSs are booted before all the compute resources are back up, the disks can be in a degraded state and should be repaired.
    Wait until all the compute resources are back online. If some compute resources or disk drives fail to come back online, either try to bring the compute resource or disk back online or, if it is never coming back, forget the content on those disks. Once the system is stable again, repair the disks. To repair the disks, refer to the Diagnostics page.
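
The "small batches, then progressively larger" approach can be planned ahead of time. The following is a minimal sketch that only prints a batch plan and does not power anything on. The function name `plan_batches`, the doubling of the batch size, and the VS labels are assumptions made for illustration; they are not part of OnApp.

```shell
# Print a power-on plan in which each batch is twice the size of the previous one.
# Usage: plan_batches <first_batch_size> <vs_label>...
# The doubling factor is an assumption; choose a growth rate that suits your cloud.
plan_batches() {
    size=$1
    shift
    # Refuse a missing, non-numeric, or zero batch size.
    [ "$size" -ge 1 ] 2>/dev/null || return 1
    batch=1
    while [ "$#" -gt 0 ]; do
        line="batch ${batch}:"
        count=0
        # Take up to $size labels from the remaining arguments.
        while [ "$count" -lt "$size" ] && [ "$#" -gt 0 ]; do
            line="${line} $1"
            shift
            count=$((count + 1))
        done
        echo "$line"
        size=$((size * 2))
        batch=$((batch + 1))
    done
}

# Example with hypothetical VS labels.
plan_batches 2 vs1 vs2 vs3 vs4 vs5 vs6 vs7
```

Power on one batch at a time and confirm that no issues appear before moving on to the next, larger batch.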
  • Compute resource power supply fails:

    When the power supply of a compute resource fails, OnApp identifies the compute resource as offline. The failover process then starts and boots the VSs on other compute resources, provided those compute resources have sufficient resources. If the local read policy is enabled, the VSs start only on compute resources that hold the disk content for all stripes. Ensure that the failover timeout is set to more than two minutes so that the storage layer works correctly with failover. At this stage, any vDisks with content on the offline compute resource are degraded, but the VSs should be running.
  • If the compute resource cannot be fixed:

    Perform the following operations:
  1. On the backup server or another compute resource, run:

    onappstore forgetfromall forgetlist=<node_id>
  2. Repeat the command above for each node of the offline compute resource.

  3. On the Diagnostics page, repair all the disks with a partial member list.
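
Steps 1 and 2 can be combined into a small loop when the offline compute resource has several nodes. The following is a minimal sketch, not an official OnApp tool: the function name `forget_nodes`, the `DRY_RUN` switch, and the node IDs in the example are hypothetical. Only the `onappstore forgetfromall forgetlist=<node_id>` command itself comes from the steps above.

```shell
# Run the forget command once per node ID passed as an argument.
# DRY_RUN (hypothetical switch) defaults to 1, which only prints the commands
# so that they can be reviewed before anything is forgotten.
forget_nodes() {
    for node_id in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "onappstore forgetfromall forgetlist=${node_id}"
        else
            # Stop at the first failure so that it can be investigated.
            onappstore forgetfromall "forgetlist=${node_id}" || return 1
        fi
    done
}

# Example with placeholder node IDs; substitute the real IDs of the nodes
# that belonged to the offline compute resource.
forget_nodes node-id-1 node-id-2
```

After reviewing the printed commands, set `DRY_RUN=0` and call the function again on the backup server or another compute resource to run them.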

  • If the compute resource can be fixed:

    To fix the compute resource:

  1. Boot the compute resource back up.
  2. Check the Diagnostics page to make sure all nodes are active.
  3. Repair all the degraded vDisks.