How Long Does It Take To Recover From vSAN Drive or Node Failures?

It is a perfectly valid question to ask any storage vendor: "How long before storage returns to normal status after a failure?" Most vendors claim high availability for their solutions, so surely this is something we all have nailed down. In practice, it is an extremely difficult question to answer, because the answer is influenced by many factors.

This is akin to someone asking, "How long does it take to travel from Singapore to KL?" The natural follow-up questions would be about the mode of transport, the traffic conditions (which also change depending on whether you mean KL City Center, Petaling Jaya, or simply the first KL signage you see) and the weather conditions.

Below I have listed a few of the factors that specifically affect recovery times on a typical vSAN cluster.

Size of the vSAN Cluster
The larger the cluster, the more nodes contribute concurrently to rebuilding and repairing the failed component or node. All else being equal, a rebuild will complete much faster in a 30-node cluster than in a 5-node cluster.
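
To make the effect of cluster size concrete, here is a deliberately naive back-of-the-envelope sketch. It assumes every surviving node contributes the same resync throughput and that the work spreads perfectly evenly, which a real cluster will not do. The function name, the 10TB of data and the 200 MB/s per-node figure are all hypothetical values chosen for illustration, not vSAN measurements.

```python
def estimate_rebuild_hours(data_tb: float, total_nodes: int, per_node_mb_s: float) -> float:
    """Naive estimate: surviving nodes share the rebuild work evenly.

    per_node_mb_s is an assumed resync throughput per node, not a vSAN figure.
    """
    surviving_nodes = total_nodes - 1  # one node has failed
    if surviving_nodes < 1:
        raise ValueError("need at least one surviving node")
    aggregate_mb_s = surviving_nodes * per_node_mb_s
    data_mb = data_tb * 1_000_000  # decimal TB -> MB
    return data_mb / aggregate_mb_s / 3600  # seconds -> hours


# Same 10TB to rebuild, same assumed per-node throughput, different cluster sizes.
for nodes in (5, 30):
    hours = estimate_rebuild_hours(data_tb=10, total_nodes=nodes, per_node_mb_s=200)
    print(f"{nodes}-node cluster: ~{hours:.1f} hours")
```

Under these assumptions the 30-node cluster finishes several times faster than the 5-node one, simply because 29 nodes share the work instead of 4. The other factors in this post will pull real numbers away from this idealised figure.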

Type of Devices Used
The cluster type matters, whether Hybrid or All-Flash: it goes without saying that an All-Flash cluster will almost always have a quicker recovery time. Even the specific choice of HDD or SSD/flash device has an impact, for example 7,200 RPM vs 10,000 RPM SAS drives, or NVMe vs a generic SSD.

Network Utilisation
Given that vSAN is likely sharing the network with VM data traffic and production workload, congestion on the network will have an impact on rebuild and repair times.

Production I/O
vSAN prioritises production workloads over resync/rebuild traffic. We have a feature called Adaptive Resync that automatically controls how much bandwidth is used for repair traffic: when it detects that latency on the production workload is suffering, it reduces repair traffic until latency recovers. Recently, we have also introduced manual throttling of vSAN resyncs, but its effect is also highly dependent on the other points I have noted in this post. As you can imagine, a busy cluster will mean a slower recovery time.
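
The general idea of backing off repair traffic when production latency rises can be illustrated with a toy feedback loop. This is not vSAN's actual Adaptive Resync algorithm. It is a minimal sketch of the concept, and the function name, the 5 ms latency target, and the floor and ceiling values are all hypothetical.

```python
def next_resync_share(current_share: float, latency_ms: float, target_ms: float,
                      floor: float = 0.1, ceiling: float = 0.8) -> float:
    """Toy throttle: return the bandwidth share to give resync traffic next.

    Halve the share while production latency exceeds the target, and
    step it back up gradually once latency is healthy again.
    """
    if latency_ms > target_ms:
        return max(floor, current_share / 2)  # back off, but never starve repairs
    return min(ceiling, current_share + 0.1)  # recover slowly up to the ceiling


share = 0.8
# Hypothetical production latency samples: quiet, busy, busier, quiet again.
for latency in (2.0, 9.0, 12.0, 3.0):
    share = next_resync_share(share, latency, target_ms=5.0)
    print(f"latency={latency}ms -> resync share {share:.2f}")
```

Even in this simplified form, the consequence is visible: the longer production latency stays above the target, the less bandwidth repairs receive, and the longer recovery takes.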

Capacity Utilisation
The actual utilisation of the failed node determines how much data needs to be rebuilt. This is where vSAN differs slightly from some traditional storage systems. If the node has 50TB of capacity and is 50% full, then 25TB will need to find its way onto other nodes. Similarly, if a 50TB node is only 10% full, only 5TB is moved, and moving 5TB is naturally much faster.
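
The arithmetic above can be written as a small helper. This is only a sketch of the calculation in the paragraph; the function name is hypothetical, and it assumes all used capacity on the failed node needs to be rebuilt elsewhere.

```python
def data_to_rebuild_tb(node_capacity_tb: float, used_percent: float) -> float:
    """Data that must be rebuilt elsewhere: the used portion of the failed node."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    return node_capacity_tb * used_percent / 100


print(data_to_rebuild_tb(50, 50))  # 25.0 (TB)
print(data_to_rebuild_tb(50, 10))  # 5.0 (TB)
```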

In a nutshell, these are just some of the key factors that determine how long a recovery takes on vSAN.

To counter the risk of data loss during recoveries, we have additional features such as Smart Repairs (which I will not detail here) that help the cluster get back online as soon as possible, as well as policy settings such as FTT=2 or FTT=3 that extend the resiliency of the cluster.

By now you may know where this is leading. Unfortunately, the right answer to the earlier question, "So how long does it really take?", is really a BIG "IT DEPENDS...".