WFC : Failback Policies

Issue

A clustered VM can have a “Preferred Owner”, and the VM failback policy gives us two options for returning the VM to that node:

1.       Immediate Failback
2.       Failback Window

Ideally, the moment the preferred node comes back up, the VM will try to Live Migrate back to it if Immediate Failback is set. The relevant properties are:

AutoFailbackType       :
FailbackWindowStart   :
FailbackWindowEnd      :
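To see what these properties are currently set to, something along the following lines can be used. This is a minimal sketch: the role name 'Failback' is only an assumption (it is the role name that appears in the event log quoted in this post), so replace it with the name of your own clustered VM role.

```powershell
Import-Module FailoverClusters

# Assumption: 'Failback' is the clustered role (group) name; replace with your own.
$roleName = 'Failback'

# Show the preferred owner list of the role.
Get-ClusterOwnerNode -Group $roleName

# Show the current owner and the failback-related properties.
Get-ClusterGroup -Name $roleName |
    Format-List Name, OwnerNode, AutoFailbackType, FailbackWindowStart, FailbackWindowEnd
```

An AutoFailbackType of 0 means failback is prevented and 1 means failback is allowed.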

However, on Nutanix this will fail because the storage service on the CVM initializes late. Since we use disk pass-through, we have to wait for the OS to initialize the controller and detect the devices before control is passed on to the CVM. The VM attempts to fail back according to two settings, FailoverThreshold and FailoverPeriod, but the attempt is bound to fail for the reason mentioned above.

FailoverThreshold      :
FailoverPeriod         :

These event logs clearly point to the issue:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          2/4/2017 1:13:45 AM
Event ID:      1069
Task Category: Resource Control Manager
Level:         Error
Keywords:
User:          SYSTEM
Computer:
Description:
Cluster resource 'Virtual Machine Configuration Failback' of type 'Virtual Machine Configuration' in clustered role 'Failback' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Log Name:      Microsoft-Windows-FailoverClustering/Diagnostic
Source:        Microsoft-Windows-FailoverClustering
Date:          2/4/2017 1:06:55 AM
Event ID:      2051
Task Category: None
Level:         Error
Keywords:
User:          SYSTEM
Computer:
Description:
[RES] Physical Disk: HardDiskpIsFileCSVFile(): Failed to execute DeviceIoControl for \\XXXXXXXX\smbcontainer\Failback\Failback\, status 50
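If you want to check whether your own nodes are logging the same resource failure, the System log can be filtered for Event ID 1069 from the FailoverClustering provider. The snippet below is only a sketch; the limit of 20 events is an arbitrary choice for illustration.

```powershell
# List the most recent Event ID 1069 entries (cluster resource failed)
# from the System log on the local node.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1069
} -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-List
```

Run it on the preferred node shortly after it comes back up and compare the timestamps with the time the CVM storage service became available.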

Conclusion

Immediate Failback will not work in a Nutanix environment because of the disk subsystem architecture (disk pass-through) and the resulting delay in disk initialization. This is a limitation of the Windows hypervisor, which doesn’t support PCI pass-through yet. The issue should be resolved in upcoming supported hypervisor versions.

The FailbackWindowStart and FailbackWindowEnd attributes define the hours of the day between which the VM should fail back to its preferred owner, so they need to be set to ensure that the VM migrates back to its preferred owner.
Each takes a value from 0 to 23, indicating the hour of the day (local cluster time) at which the failback window starts or ends.
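As a rough illustration of defining a failback window, the sketch below sets a preferred owner and a window from 01:00 to 03:00. The role name 'Failback', the node name 'NODE-1' and the chosen hours are all assumptions for the example, not values from a real cluster; adjust them to your environment.

```powershell
Import-Module FailoverClusters

# Assumptions: replace these with your own role and node names.
$roleName      = 'Failback'   # clustered VM role (group) name
$preferredNode = 'NODE-1'     # hypothetical preferred owner node

# Set the preferred owner of the role.
Set-ClusterOwnerNode -Group $roleName -Owners $preferredNode

# Allow failback and restrict it to a window (hours are example values).
$group = Get-ClusterGroup -Name $roleName
$group.AutoFailbackType    = 1   # 1 = allow failback, 0 = prevent failback
$group.FailbackWindowStart = 1   # window opens at 01:00 local cluster time
$group.FailbackWindowEnd   = 3   # window closes at 03:00 local cluster time

# Confirm the new values.
$group | Format-List Name, AutoFailbackType, FailbackWindowStart, FailbackWindowEnd
```

Choosing a window well after the node's usual maintenance time gives the CVM storage service time to initialize before the VM tries to move back.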

Some Handy Links

Preferred Owners in a Cluster
Failover behavior on clusters of three or more nodes
Understanding Hyper-V Virtual Machine (VM) Failover Policies
Modify the Failover Settings for a Clustered Service or Application
Configure Failover and Failback Settings for a Clustered Service or Application
AntiAffinityClassNames
Using Guest Clustering for High Availability

What’s New in Failover Clustering in Windows Server 2012 R2
What’s new in Failover Clustering in Windows Server 2016
Windows Server 2016 Failover Cluster Troubleshooting Enhancements – Cluster Log

Get-ClusterGroup -Name _________ | fl *

Cluster                : Cluster Name
IsCoreGroup            : Part of core group
OwnerNode              : Node Name where the Cluster group is hosted
State                  : Online/Offline
Name                   : Group Name
PersistentState        :
FailoverThreshold      :
FailoverPeriod         :
AutoFailbackType       :
FailbackWindowStart    :
FailbackWindowEnd      :
GroupType              :
Priority               : 2000 {1000, 2000, 3000}
DefaultOwner           : 1 {Node ID}
AntiAffinityClassNames : {}
StatusInformation      :