“Software Defined” has become the epithet of the Nutanix solution, but you will always need hardware and hardware will always fail.
Recently we installed a relatively new NX-3460 for a Proof of Concept (POC). All four nodes showed as up and running in the Prism management interface, with no alerts. However, the storage total looked a little light, and on investigating the hardware section of the management interface we noticed that 3 HDDs were missing from node A of the appliance. Not failed, just not there!
Reseating made no difference and there were no fault lights on the disks. Swapping disks into other bays showed that the disks themselves were not at fault. It’s important to note that this wasn’t a case of an in-use storage system losing disks, which would have thrown errors. Because no storage pool had been defined at that stage, the disks simply never came online during setup and so didn’t appear in the storage pool when it was created, hence the low total that alerted us to the issue.
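The check that caught this, comparing what each node reports against what the hardware should contain, can be sketched in a few lines of Python. This is only an illustration: the inventory dictionary is made-up data rather than output from Prism or any Nutanix API, and the expected count of 2 SSDs and 4 HDDs per node is an assumption to be replaced with the real hardware specification.

```python
from collections import Counter

# Assumed per-node disk layout (hypothetical); replace with the real spec.
EXPECTED = {"SSD": 2, "HDD": 4}


def missing_disks(inventory, expected=EXPECTED):
    """Return {node: {disk_type: shortfall}} for nodes reporting too few disks."""
    report = {}
    for node, disks in inventory.items():
        seen = Counter(disks)
        shortfall = {
            disk_type: count - seen[disk_type]
            for disk_type, count in expected.items()
            if seen[disk_type] < count
        }
        if shortfall:
            report[node] = shortfall
    return report


# Made-up inventory mirroring the situation described: node A is 3 HDDs short.
inventory = {
    "A": ["SSD", "SSD", "HDD"],
    "B": ["SSD", "SSD", "HDD", "HDD", "HDD", "HDD"],
    "C": ["SSD", "SSD", "HDD", "HDD", "HDD", "HDD"],
    "D": ["SSD", "SSD", "HDD", "HDD", "HDD", "HDD"],
}

print(missing_disks(inventory))
```

The point of counting rather than waiting for alerts is that a disk which never came online generates no failure event, so only a comparison against the expected layout reveals it.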
The SATA connector was assumed to be the problem, so a replacement node was arranged. When it arrived, the SATA DOM from which the node boots was moved across, and the new node was fitted and booted. During this time the cluster on the other 3 nodes carried on regardless, with just a few alerts complaining of node A’s disappearance.
This did not solve the problem: the three disks stubbornly refused to be seen. It was decided, therefore, that a total chassis replacement was the only option, as the chassis carries the passive mid-plane into which all the nodes hot plug, so one was duly ordered for next-day delivery.
That night both SSDs in node C went offline, which took out the node itself as the metadata disks were lost. However, neither the Controller VM nor the ESXi host actually failed, as both boot from the SATA DOM. The cluster continued in an initially degraded but still fully functional state! Data held on node C had duplicate blocks created on the surviving nodes for data consistency.
So now this still working cluster had 5 disks in trouble – 3 HDD on one node and 2 SSD on another, but it didn’t even stop for breath. Once the automatic recovery of node C was complete, the cluster wasn’t even degraded and could actually have taken another hit (say if node A played up again).
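The recovery behaviour described above, where blocks that lost a copy are duplicated again onto surviving nodes, can be illustrated with a toy model. This is a sketch of the general idea only, not Nutanix's actual algorithm: the function name `rereplicate`, the two-copies-per-block setting and the "least loaded node first" placement rule are all assumptions made for the example.

```python
def rereplicate(placement, failed, nodes, rf=2):
    """Toy model: drop the failed node and restore each block to rf copies.

    placement maps block id -> set of nodes holding a copy (hypothetical data).
    """
    survivors = [n for n in nodes if n != failed]
    # Track how many copies each surviving node already holds.
    load = {n: 0 for n in survivors}
    for holders in placement.values():
        for n in holders:
            if n in load:
                load[n] += 1

    new_placement = {}
    for block, holders in sorted(placement.items()):
        holders = set(holders) - {failed}
        if not holders:
            # No surviving copy: in this model the block is lost.
            new_placement[block] = set()
            continue
        while len(holders) < rf:
            candidates = [n for n in survivors if n not in holders]
            if not candidates:
                break  # not enough nodes left to reach rf copies
            target = min(candidates, key=lambda n: (load[n], n))
            holders.add(target)
            load[target] += 1
        new_placement[block] = holders
    return new_placement


nodes = ["A", "B", "C", "D"]
placement = {"b1": {"A", "C"}, "b2": {"B", "C"}, "b3": {"C", "D"}, "b4": {"A", "B"}}

after_c = rereplicate(placement, "C", nodes)
# Every block is back to two copies, so a further node loss is survivable.
print(all(len(h) == 2 for h in after_c.values()))
```

In this model, once the copies are restored the cluster is no longer degraded, which is why a second failure (say node A again) would still leave at least one copy of every block.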
The chassis was swapped out the next day by simply moving the nodes and disks across, one at a time. This was the only point (about 30 minutes) when the cluster was actually down! All disks were then confirmed visible again. The SSDs were more of an issue: they did not simply come back online and needed support intervention to resuscitate, but eventually all four nodes and all disks were up and happy.
Conclusion: Nutanix software continued with a working cluster even during a few days of multiple disk failures and even a collapsed node. No data was lost at any point.
The total outage of 30 minutes was just when the entire chassis was being swapped out.
Nutanix software is totally prepared for the inevitable hardware failure and, like the Energizer Bunny, just keeps on going and going…