Storage Health Check

This menu displays the results of diagnostic tests. Below you will find details on all possible results shown for the following resources:


Disk Health

Diagnostics

Details

Action

Degraded disks

Shows the list of VDisks in a degraded state, which means that one or more (but not all) replicas of a stripe are not fully synchronised. Degraded VDisks are listed with the OnApp vd_uuid and a repair option.

Use the Repair all option to queue the repairs. A repair resynchronises the content from an elected master to a slave. The Repair button starts a repair task whose duration depends on the data store, network, and disk drive configuration.

Disks with partial member list

Shows the list of VDisks that have an incomplete membership list due to disk failure, network failure, or another cause. Each VDisk should have (S)tripe * (R)eplica members.

Use the repair operation to repair the membership. This elects a new valid member from the suitable nodes in the data store. Once the membership is repaired, the VDisk stays in a degraded state until it is re-synced.

Stripes with no replica

Shows the list of VDisks that have lost all replicas for a stripe. There is no redundancy at this point for this stripe, and the data is lost. If a VDisk is in this category, the associated VS is likely broken unless the VDisk is a swap drive.

No repair action available.

Disks with no redundancy found

One or more VDisks do not have a replica stripe member on another compute resource. The VDisk is healthy, but all replicas of a stripe are on the same compute resource.

Use the Rebalance link in the Action column, which leads to the rebalance page for the VDisk. This allows the content of the VDisk to be rebalanced to another suitable disk drive.

Partially online Disks found

The list of VDisks that have at least one stripe online and at least one stripe offline. There must be an authoritative member for each stripe.

Use the Repair link in the Action column, which issues a special Storage API call (online refresh action) to fix this problem. Before the repair, the VDisk status shows offline, but one or more members show an online front end.

Degraded snapshots

The list of VDisk snapshots in degraded states (except those currently used for ongoing backups). Backups cannot be made from a degraded snapshot.

To resolve this, use the bulk Delete All link in the Action column, which creates a background task. This task unmounts, performs unkpartx, makes zombie snapshots offline on each compute resource in the zone, and then removes the snapshot. The task may leave some snapshot VDisks behind, so check for unremoved VDisks when the task completes.

Zombie snapshots found

The list of VDisk snapshots created during the backup procedure but still left after the backup is deleted.

To resolve this, use the bulk Delete All link in the Action column, which creates a background task. This task makes zombie snapshots offline on each compute resource in the zone, and then removes the snapshot.

Zombie disks found

The list of VDisks that are not associated with a VS. These may include VDisks created from the command line and VDisks created for benchmarks.

To resolve, use the bulk Delete All link in the Action column, which creates a background task. This task unmounts, performs unkpartx, makes zombie disks offline on each compute resource in the zone, and then removes the disk. The task may leave some zombie disks behind, so check for unremoved disks when the task completes.

Disks in other degraded states

The list of VDisks that are degraded but not in any of the other states above. These can be disks that have missing partial members, missing inactive members, missing active members, or missing unknown members.

No repair action available.

Stale cache volumes

Shows the list of stale cache volumes.

To resolve, use the Forget All button in the Action column to forget all items in the list. To remove only some items, click the Forget button next to the specific item.

Disks with inactive cache

Shows the list of disks with inactive cache.

No repair action available.
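
Several of the Disk Health results above follow from simple rules about stripe members. The sketch below illustrates those rules only; the `Member` model, the field names, and the `diagnose` function are hypothetical and are not part of the OnApp Storage API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One replica of one stripe (hypothetical model, not the OnApp schema)."""
    stripe: int            # stripe index, 0 .. stripes - 1
    compute_resource: str  # compute resource that holds this replica
    in_sync: bool          # True if the replica is fully synchronised


def diagnose(stripes: int, replicas: int, members: list[Member]) -> list[str]:
    """Return the Disk Health findings that apply to a single VDisk."""
    findings = []
    # Each VDisk should have (S)tripe * (R)eplica members.
    if len(members) < stripes * replicas:
        findings.append("partial member list")
    for stripe in range(stripes):
        copies = [m for m in members if m.stripe == stripe]
        if not copies:
            # All replicas of this stripe are lost.
            findings.append(f"stripe {stripe}: no replica")
            continue
        synced = [m for m in copies if m.in_sync]
        # Degraded: one or more (but not all) replicas are out of sync.
        if 0 < len(synced) < len(copies):
            findings.append(f"stripe {stripe}: degraded")
        # No redundancy: every replica is on the same compute resource.
        if len({m.compute_resource for m in copies}) == 1:
            findings.append(f"stripe {stripe}: no redundancy")
    return findings


# Illustrative input: 2 stripes x 2 replicas on made-up compute resources.
example = [
    Member(0, "hv1", True), Member(0, "hv2", False),
    Member(1, "hv1", True), Member(1, "hv1", True),
]
print(diagnose(stripes=2, replicas=2, members=example))
```

In this example, stripe 0 is reported as degraded because one of its two replicas is out of sync, and stripe 1 is reported as having no redundancy because both replicas are on the same compute resource.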



Drive Health

Diagnostics

Details

Action

Partial node found

The compute resource hosting the node is reachable and reports over the API that the node is running. The storage API is possibly not responding on the storage controller server.

To fix, restart the controller. Make sure that there is sufficient redundancy so that restarting controllers on one compute resource does not cause VS downtime.

Inactive nodes found

Either the compute resource hosting the node is not reachable, or it is reachable and reports that the storage controller for the node is not running.

Either power-cycle the compute resource or bring up the storage controller VS. This can be tricky if more than one storage controller is running on the same compute resource and only one has shut down.

Nodes with delayed ping found

The node is reachable over the storage API but is not sending out pings. The OnApp SAN Controller services are not responding on the node.

To fix this problem, restart the SAN Controller services from inside the storage controller server. The restart can be triggered from the UI.

Nodes with high utilization found

The list of nodes with disk utilization over 90%.

To improve, click the Rebalance link in the Action column. It leads to the list of disks located on the node, so that you can rebalance them away from it.

Out of space nodes found

Node utilisation is reported at 100% for one or more nodes.

The Repair action will forget the content of one of the VDisks that is compute resource redundant and in sync.

Missing drives found

The compute resource configuration has a drive selected that is not being reported to Integrated Storage.

No repair action available. If appropriate, deselect the drive on the compute resource configuration edit page, which can be opened from the reported error.

Extra Drives

The drives that are disk-hotplugged into the system.

No repair action available from UI.

Inactive controllers

The list of controllers that cannot be reached although the host compute resource is responding.

Restart the controller.

Unreferenced NBDs found

The list of NBD data paths that are active but not referenced by a device mapper.

To fix, use the Delete all action, which schedules a CP transaction that tries to clean up the unreferenced NBDs by disconnecting them from the frontend.

Reused NBDs found

The list of multiple uses of the same NBD connection.

No repair action available from UI.

Dangling device mappers found

The list of device mappers that are not in use.


Click the Clean all button to remove the device mappers that are not in use. You can also check the corresponding VS: if the VS is booted, do nothing; otherwise, try to unmount the vDisk and take it offline.
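
Some of the Drive Health results above reduce to threshold and set comparisons. The following sketch shows the utilization thresholds (over 90% and 100%) and the NBD checks in that form. All function names and the shape of the input data are assumptions made for illustration; they do not reflect how Integrated Storage collects or stores this information.

```python
from collections import Counter

HIGH_UTILIZATION = 90  # percent; nodes over this value are listed as high utilization
OUT_OF_SPACE = 100     # percent; nodes at this value are listed as out of space


def utilization_status(percent: float) -> str:
    """Map a node's disk utilization to the matching diagnostics result."""
    if percent >= OUT_OF_SPACE:
        return "out of space"
    if percent > HIGH_UTILIZATION:
        return "high utilization"
    return "ok"


def unreferenced_nbds(active: set[str], referenced: set[str]) -> set[str]:
    """Active NBD data paths that no device mapper refers to."""
    return active - referenced


def reused_nbds(connections: list[str]) -> list[str]:
    """NBD connections that are used more than once."""
    counts = Counter(connections)
    return sorted(name for name, uses in counts.items() if uses > 1)


# Illustrative values only; device names are made up.
print(utilization_status(93.5))
print(unreferenced_nbds({"nbd0", "nbd1", "nbd2"}, {"nbd0", "nbd2"}))
print(reused_nbds(["nbd0", "nbd1", "nbd0"]))
```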

S.M.A.R.T.

Our S.M.A.R.T. drive health diagnostics are based on the smartmontools utilities smartd and smartctl, which read the hardware-supported attributes from each drive.

Note that starting with ATA/ATAPI-4, revision 4, the meaning of these Attribute fields has been made entirely vendor-specific. However, most newer ATA/SATA disks seem to respect their meaning, so the option of printing the Attribute values is retained.

Solid-state drives use different meanings for some of the attributes. In this case, the attribute name printed by smartctl is incorrect unless the drive is already in the smartmontools drive database.

  • If your servers use RAID controllers, our S.M.A.R.T. check may not handle the attributes properly without additional customization.
  • For MegaRAID controllers, please add the following line to the /onappstore/onappstore.conf file on each CloudBoot Compute Resource: devhealthopt=[GENTYPE:sat+megaraid,0|SCANOPT:-d sat+megaraid,0|SCANTYPE:sat+megaraid,1|FORCETYPE:sat+megaraid,0]
  • For other controllers, please check the smartmontools page or contact OnApp support.
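
To make the structure of the devhealthopt line easier to read, the sketch below splits it into its option names and values. The grammar used here (square brackets, `|` between options, `:` between a name and its value) is inferred only from the MegaRAID example above and is an assumption; `parse_devhealthopt` is a hypothetical helper, not an OnApp tool.

```python
def parse_devhealthopt(line: str) -> dict[str, str]:
    """Split a devhealthopt line into {option name: value}.

    Assumed layout, inferred from the single example in this document:
    devhealthopt=[NAME:value|NAME:value|...]
    """
    key, _, value = line.partition("=")
    if key.strip() != "devhealthopt":
        raise ValueError("not a devhealthopt line")
    body = value.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected the options to be wrapped in [ ]")
    options = {}
    for field in body[1:-1].split("|"):
        # Split on the first ':' only, so values may contain further text.
        name, separator, setting = field.partition(":")
        if not separator:
            raise ValueError(f"option without a value: {field!r}")
        options[name] = setting
    return options


megaraid_line = (
    "devhealthopt=[GENTYPE:sat+megaraid,0|SCANOPT:-d sat+megaraid,0|"
    "SCANTYPE:sat+megaraid,1|FORCETYPE:sat+megaraid,0]"
)
for name, setting in parse_devhealthopt(megaraid_line).items():
    print(f"{name:10} {setting}")
```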

Since this is vendor-specific, not all drives support SMART. Nonetheless, most do, provided that SMART reporting is enabled in the BIOS and the hardware supports SMART.

If the drives are behind a RAID or another controller, the controller must also support SMART passthrough for SMART to work. Specific BIOS and firmware upgrades may enable SMART support; however, it remains very much hardware- and configuration-dependent.

SMART errors found

For one or more disk drives in the compute resource, the SMART built-in tests have reported one or more warnings. SMART errors occur when the drive has surpassed the threshold for reporting a failure.

Replace the drives in the maintenance window that appears.

SMART warnings found

SMART warnings occur when the failure attributes exist but are not at the threshold level - either Pre-failure or Old age. 
Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure.

Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wear-out, if the Attribute value is less than or equal to the threshold.

Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
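
The threshold rule described above can be sketched as a comparison of the normalized value against the threshold. The example below reads rows in the column layout printed by `smartctl -A` (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, ...). The two sample rows are made up for illustration and do not come from a real drive, and `parse_row` and `classify` are hypothetical helpers rather than the logic used by the product.

```python
def parse_row(row: str) -> tuple[str, str, int, int]:
    """Extract (name, type, normalized value, threshold) from one attribute row."""
    fields = row.split()
    # Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return fields[1], fields[6], int(fields[3]), int(fields[5])


def classify(attr_type: str, value: int, threshold: int) -> str:
    """Apply the rule: an attribute only signals a problem at or below its threshold."""
    if value > threshold:
        return "ok"
    if attr_type == "Pre-fail":
        return "pending disk failure"
    return "end of product life"  # Old_age (usage) attribute


# Made-up sample rows, for illustration only.
sample_rows = [
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0",
    "  9 Power_On_Hours          0x0032   020   020   025    Old_age   Always       -       70000",
]
for row in sample_rows:
    name, attr_type, value, threshold = parse_row(row)
    print(name, attr_type, classify(attr_type, value, threshold))
```

The first sample row is of type Pre-fail but is classified as ok, because its normalized value is above the threshold.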



Compute Resources

This diagnostics procedure checks the version of the storage packages on CloudBoot compute resources and reports the results in a daily or hourly storage health notification.

Diagnostics

Details

Action

Compute Resources have different storage versions

The list of CloudBoot compute resources and their storage versions.

Update the CloudBoot compute resources via the reboot or live upgrade procedure so that they have identical storage versions.

All Compute Resources have identical storage versions

Indicates that all CloudBoot compute resources have identical storage versions.

No repair action is required.
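
The version check above amounts to asking whether every compute resource reports the same storage version. A minimal sketch, assuming the versions are already available as a mapping from compute resource name to version string (the names, version strings, and helper functions below are hypothetical):

```python
from collections import defaultdict


def group_by_version(versions: dict[str, str]) -> dict[str, list[str]]:
    """Group compute resource names by the storage version they report."""
    groups = defaultdict(list)
    for resource, version in sorted(versions.items()):
        groups[version].append(resource)
    return dict(groups)


def storage_version_result(versions: dict[str, str]) -> str:
    """Return the diagnostics result that matches the reported versions."""
    if len(set(versions.values())) <= 1:
        return "All Compute Resources have identical storage versions"
    return "Compute Resources have different storage versions"


# Made-up compute resource names and version strings.
reported = {"cr-01": "1.0-1", "cr-02": "1.0-1", "cr-03": "1.0-2"}
print(storage_version_result(reported))
print(group_by_version(reported))
```

Grouping by version shows which compute resources need the reboot or live upgrade procedure to bring them in line with the rest.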