60 Minutes - vSAN says "Mai Kan Cheong"


I'm pretty sure I would be (very) rich if I had a dollar for every time I had to answer the question (or counter the FUD) about why vSAN waits 60 minutes before it repairs a failed disk.

A little background on this: with vSAN, whenever a hard drive is lost unexpectedly, the system starts a timer and waits 60 minutes (the default) before it begins rebuilding the components that were on that drive.

For those coming from the traditional SAN world, or from some of our competitors, the reaction probably isn't too far from that picture on the right :)

Take a deep breath, chill, relax, "mai kan cheong" (local Singaporean dialect for "don't be uptight").

So what is actually happening here? 

Very often, when a hard drive starts to fail or degrade, it doesn't just suddenly decide to go offline (not saying it doesn't happen, but it's very rare). More commonly, it starts to exhibit soft or hard errors that eventually lead to a total drive failure. So on vSAN, when a drive just disappears without reporting any errors first, vSAN classifies it by default as a transient error. Transient issues can be caused by host reboots, accidentally unseating a drive from its caddy, or various other things. So we give the drive 60 minutes to recover from the transient error. During this time, the hard drive is marked as "Absent". If the drive still hasn't come back after 60 minutes, it is marked as "Degraded" and we start rebuilding the components.
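To make the timer logic above concrete, here is a minimal sketch of it in Python. This is only a toy model of the behaviour as described in this post, not how vSAN is actually implemented: the names `Drive`, `DriveState`, `drive_disappeared`, `drive_returned`, and `tick` are all made up for illustration, and the only real number in it is the 60-minute default.

```python
from dataclasses import dataclass
from enum import Enum

# Default repair delay described in this post, in minutes.
REPAIR_DELAY_MINUTES = 60


class DriveState(Enum):
    HEALTHY = "healthy"
    ABSENT = "absent"      # drive vanished without errors, timer is running
    DEGRADED = "degraded"  # timer expired, rebuild has been triggered


@dataclass
class Drive:
    """Hypothetical model of one drive, for illustration only."""
    state: DriveState = DriveState.HEALTHY
    minutes_absent: int = 0
    rebuild_started: bool = False


def drive_disappeared(drive: Drive) -> None:
    """A drive went away without reporting errors: treat it as transient."""
    drive.state = DriveState.ABSENT
    drive.minutes_absent = 0


def drive_returned(drive: Drive) -> None:
    """The drive came back before the timer expired: no rebuild needed."""
    if drive.state is DriveState.ABSENT:
        drive.state = DriveState.HEALTHY
        drive.minutes_absent = 0


def tick(drive: Drive, minutes: int,
         repair_delay: int = REPAIR_DELAY_MINUTES) -> None:
    """Advance the clock; start the rebuild once the delay has elapsed."""
    if drive.state is not DriveState.ABSENT:
        return
    drive.minutes_absent += minutes
    if drive.minutes_absent >= repair_delay:
        drive.state = DriveState.DEGRADED
        drive.rebuild_started = True


# A host reboot that takes 30 minutes never triggers a rebuild.
rebooting = Drive()
drive_disappeared(rebooting)
tick(rebooting, 30)
drive_returned(rebooting)

# A drive that stays away for the full 60 minutes does.
gone = Drive()
drive_disappeared(gone)
tick(gone, 60)
```

In this sketch, the drive that comes back after 30 minutes ends up healthy with no rebuild, while the one that stays away for the full delay is marked degraded and rebuilt. Passing a smaller `repair_delay` to `tick` simply shortens the wait.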

"We can't tolerate 60 minutes of wait time. Need it to be 5 minutes." 

No problem. This value can easily be changed from the CLI or through the vCenter Web Client. Details can be found here: Changing default repair delay.

I would like to think a bug or an issue wouldn't be so easily configurable from the GUI, don't you agree? :)