I recently ran into an issue that motivated me to write this post: high memory usage on the active node of a Windows Server 2003 (W2K3) cluster disrupted the whole cluster group and its resources. So I dug into the logs to find out what caused it. Keep in mind that this concept also applies to the latest version of Windows clustering (Windows Server 2012 R2).
So we have a 2-node cluster, N01B (active) and N01A (passive). Node B stopped responding, but the failover didn't take place.
Excerpts from the log summary:
- As per the logs on Node 01A, it lost communication with Node 01B on its interfaces because the Heartbeat and Public networks went down around 5:38:25 PM (actually 4:38:25 PM), and the nodes went into a dormant state.
- Eventually Node B was kicked out of the cluster and the cluster service stopped. The failover didn't occur because a threshold of 10 attempts over a period of 6 hours had been configured. I couldn't find anything further in the cluster log because it had been truncated.
- The cluster resources came up when Node B was rebooted, and Node B took ownership of them by 5:39:17 PM (4:39:17 PM).
The times in the logs are one hour ahead because of DST:

- Event ID 1124 (Warning, ClusSvc, Node Mgr), 2/16/2013 5:38:25 PM, Computer: N01A. Description: The node determined that its interface to network 'Heartbeat NIC' failed.
- Event ID 1123 (Warning, ClusSvc, Node Mgr), 2/16/2013 5:38:26 PM, Computer: 01A. Description: The node lost communication with cluster node 'N01B' on network 'HP Network Team'.
- Event ID 1135 (Warning, ClusSvc, Node Mgr), 2/16/2013 5:39:11 PM, Computer: 01A. Description: Cluster node N01B was removed from the active server cluster membership. Cluster service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active server cluster nodes.
- Event ID 1200 (Information, ClusSvc, Failover Mgr), 2/16/2013 5:39:11 PM, Computer: 01A. Description: The Cluster Service is attempting to bring online the Resource Group "Cluster Group".
- Event ID 1200 (Information, ClusSvc, Failover Mgr), 2/16/2013 5:39:12 PM, Computer: 01A. Description: The Cluster Service is attempting to bring online the Resource Group "SQL Group".
- Event ID 1201 (Information, ClusSvc, Failover Mgr), 2/16/2013 5:39:17 PM, Computer: 01A. Description: The Cluster Service brought the Resource Group "Cluster Group" online.
- Event ID 1201 (Information, ClusSvc, Failover Mgr), 2/16/2013 5:40:01 PM, Computer: 01A. Description: The Cluster Service brought the Resource Group "SQL Group" online.
I looked into the cluster group properties and found something interesting there.
Before I move on, let's understand these two terms, Failover Threshold and Failover Period:
FailoverThreshold: specifies the number of times the Cluster service attempts to fail over a group before it concludes that the group cannot be brought online anywhere in the cluster.
FailoverPeriod: specifies the interval (in hours) over which the Cluster service attempts to fail over a group.
In our case both the Cluster and SQL groups had these values configured (10 attempts over 6 hours). As I described earlier, communication between the two nodes was broken on both network paths, so the cluster kept trying to fail the groups over but couldn't because of the broken paths, and it didn't give up because of this configuration. Otherwise, in a normal scenario, it would have failed the groups after 2 attempts and we could easily have brought them up on Node A.
So once the failover threshold is crossed, the resources stay in a failed state and require manual intervention.
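The interplay between the two settings can be sketched as a sliding-window check. This is a minimal Python model of the idea, not the actual cluster service logic, and the function and variable names are mine:

```python
from datetime import datetime, timedelta

def should_attempt_failover(attempt_times, now, threshold=10, period_hours=6):
    """Return True if another failover attempt would still be allowed.

    attempt_times: timestamps of previous failover attempts for the group.
    A new attempt is allowed only while fewer than `threshold` attempts
    fall inside the trailing `period_hours` window; otherwise the group
    is left offline until an administrator intervenes.
    """
    window_start = now - timedelta(hours=period_hours)
    recent = [t for t in attempt_times if t >= window_start]
    return len(recent) < threshold

# Ten rapid-fire attempts inside one window exhaust the threshold.
start = datetime(2013, 2, 16, 16, 38)
attempts = [start + timedelta(minutes=i) for i in range(10)]
print(should_attempt_failover(attempts, start + timedelta(minutes=11)))  # False
# Once the window slides past the old attempts, failover is possible again.
print(should_attempt_failover(attempts, start + timedelta(hours=7)))     # True
```

This also shows why a shorter period helps: the old attempts age out of the window sooner, so the group gets a fresh set of attempts earlier.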
When we configure a group in a cluster, its availability is critical from the users' and production point of view. And since we are SLA bound, we should give a cluster group more chances to recover from a failure and fail over in case of any outage or issue on the owner (active) node.
There are three things to keep in mind when we have a group configured in the cluster:
- The Failover threshold
- The Failover period
- Whether or not a failure of one resource affects other resources in the group.
The failover threshold is the number of times the group can fail over within the number of hours specified by the failover period. In our case:
- The group's failover threshold was set to 10 and its failover period to 6 hours.
- So the Cluster service tried to fail over the group at most ten times within the six-hour period, which is the limit. Since an eleventh attempt was not allowed within that time frame, the Cluster service failed all the resources in the group and left the entire group offline instead of failing it over.
- My recommendation, keeping the SLA in mind: keep the same threshold but decrease the period to 1 hour. That way a cluster group gets up to 10 failover attempts within each 1-hour window, and if those are exhausted it can attempt another 10 in the next hour instead of staying locked out for six.
- By default, if one or more resources fail in a group, the Cluster service fails all the other resources in the group as well. We can clear the "Affect the group" setting for non-critical resources so that if those resources go into a failed state, the Cluster service leaves them in the offline state instead of failing all the other resources in the group. For instance, backup services (TSM Backup) or monitoring services (Tivoli Monitoring) configured in the group.
- That way, if these services somehow go into an offline state due to a failure, they will not bring down the whole group.
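The "Affect the group" behaviour can be sketched the same way. Again, this is an illustrative Python model with names of my own choosing, not the real cluster API:

```python
# Minimal model of the "Affect the group" resource setting.
class Resource:
    def __init__(self, name, affects_group=True):
        self.name = name
        self.affects_group = affects_group  # cleared for non-critical resources
        self.state = "Online"

def on_resource_failure(group, resource):
    """Mimic the cluster service's reaction to a single resource failure."""
    resource.state = "Failed"
    if resource.affects_group:
        # Default behaviour: the whole group is taken down for failover.
        for r in group:
            r.state = "Offline"
        return "group failover triggered"
    # Non-critical resource: leave only it offline, keep the rest running.
    resource.state = "Offline"
    return "resource left offline, group unaffected"

sql = Resource("SQL Server")                       # critical, affects the group
tsm = Resource("TSM Backup", affects_group=False)  # non-critical
group = [sql, tsm]

print(on_resource_failure(group, tsm))  # resource left offline, group unaffected
print(sql.state)                        # Online
```

With the flag cleared on TSM Backup, its failure takes only that resource offline while SQL Server keeps running, which is exactly the behaviour described above.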
Please do let me know if you have any questions or doubts.