What is the specific error to look for that guarantees WSFC initiated either a restart or a failover?



  • I am looking at the SQL Server error log and the cluster log (cluster.log) file for each node in the cluster.

    I can see a few errors scattered around, such as "failed a periodic health check on file share", "failing over group, because of 'Shutting down'", and "resource x is causing group y to failover".

    What is the specific error record to look for that guarantees that WSFC initiated either a restart or a failover?



  • You could use the Get-ClusterLog cmdlet (https://docs.microsoft.com/en-us/powershell/module/failoverclusters/get-clusterlog?view=windowsserver2022-ps) to generate the cluster log for each node.

    Example command:

    Get-ClusterLog -Node Node01, Node02, Node03 -Destination 'C:\Temp\' -UseLocalTime
    

    The above command generates one log file per node (for example Node03_cluster.log) and copies them to C:\Temp on the machine where you run it. Once you have the output, you can filter those files as follows:

    1. To filter only errors from the log file:
    Select-String -Path C:\Temp\Node03_cluster.log -Pattern ' ERR '
    
    2. To filter failoverCount messages:
    Select-String -Path C:\Temp\Node03_cluster.log -Pattern 'failoverCount'
    

    You can mix and match these string filters to narrow the output down to the information you are looking for.
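
    As a rough sketch of what combining filters could look like: Select-String accepts several patterns at once (a line matches if any of them matches), and its output can be piped to Where-Object to require a second condition. The path and the group name AG01 are taken from the examples in this answer; adjust them for your cluster.

    ```powershell
    # Per-node file produced by Get-ClusterLog (path assumed from the example above).
    $log = 'C:\Temp\Node03_cluster.log'

    # Several patterns at once: a line is returned if it matches any of them.
    Select-String -Path $log -Pattern 'MoveType::Failover', 'Failing over group'

    # Two conditions together: failoverCount lines that also mention group AG01.
    Select-String -Path $log -Pattern 'failoverCount' |
        Where-Object { $_.Line -match 'group AG01' }
    ```

    Patterns are treated as regular expressions by default; add -SimpleMatch if you want a literal string comparison instead.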

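    Here is a sample from one of our instances with the failoverCount filter (2) applied. In this sample, the failovers show up as the WARN "[RCM] Failing over group ..." records and the INFO "move of group ... of type MoveType::Failover" records: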

    Line 40: ObjectId,ObjectName,resources,_acceptOwnershipCounts,_pOwnerNode,_state,_stateCounter,_markedBusyBy,_failoverInProgress,_failoverCount,_lastFailoverTime,_beingForcefullyDeleted,_operationFlags,_bounceBackFlags,_waitStart,_placementAttempts,_groupType,_numPreemptions,_numBlindPreemptions,_waitingForFirstPlacement,_priority,_defaultOwner,_flags,_persistentState,_failoverThreshold,_failoverPeriod,_autoFailbackType,_failbackWindowStart,_failbackWindowEnd,_description,_groupStartDelay,_lastOnlineOffline,antiAffinityClassNames,PreferredSite,_preferredOwners,_isCore,_lastOnlineNode,_groupStatusInformation,_moveTarget,_moveTargetBirthdate,_failoverTarget,_moveType,_isTargetedMove,_previousOwner,_queuedTarget,_hasIssuedMoveWithThisQueuedTarget,_targetedQueue,_savedLastOperationStatusCodeDuringQueue,_onlineTime,lastStateChangeTime,lastSeenMoveTime_GetSystemTime,lastSeenMoveTime_NodeId,_coldStartSetting,_placementOptions,providers
        Line 10958: [Verbose] 000012e0.000036bc::2022/05/02-07:34:14.124 INFO  [RCM] move of group AG01 from Node02(2) to Node03(3) of type MoveType::Failover is about to succeed, failoverCount=1, lastFailoverTime=2022/05/02-07:31:16.711 targeted=false
        Line 15455: [Verbose] 000012e0.00002b08::2022/05/02-07:34:23.178 DBG   [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (1, 2022/05/02-07:31:16.711)
        Line 15457: [Verbose] 000012e0.00002b08::2022/05/02-07:34:23.178 WARN  [RCM] Failing over group AG01, failoverCount 2, last time 2022/05/02-07:31:16.711.
        Line 16165: [Verbose] 000012e0.0000139c::2022/05/02-07:34:24.140 INFO  [RCM] move of group AG01 from Node03(3) to Node02(2) of type MoveType::Failover is about to succeed, failoverCount=2, lastFailoverTime=2022/05/02-07:34:23.170 targeted=false
        Line 16994: [Verbose] 000012e0.000036bc::2022/05/02-07:37:58.017 INFO  [RCM] move of group AG01 from Node02(2) to Node03(3) of type MoveType::Failover is about to succeed, failoverCount=3, lastFailoverTime=2022/05/02-07:35:00.680 targeted=false
        Line 19109: [Verbose] 000012e0.00002a1c::2022/05/02-07:38:01.398 DBG   [RCM] rcm::RcmGroup::UpdateAndGetFailoverCount=> (3, 2022/05/02-07:35:00.680)
        Line 19111: [Verbose] 000012e0.00002a1c::2022/05/02-07:38:01.398 WARN  [RCM] Failing over group AG01, failoverCount 4, last time 2022/05/02-07:35:00.680.
        Line 19608: [Verbose] 000012e0.00001c6c::2022/05/02-07:38:02.370 INFO  [RCM] move of group AG01 from Node03(3) to Node02(2) of type MoveType::Failover is about to succeed, failoverCount=4, lastFailoverTime=2022/05/02-07:38:01.385 targeted=false
        Line 29278: [Verbose] 000012e0.00001d68::2022/05/02-08:27:51.277 INFO  [RCM] move of group Available Storage from Node03(3) to w8mssql110agp0(1) of type MoveType::Drain is about to succeed, failoverCount=0, lastFailoverTime=1601/01/01-00:00:00.000 targeted=false
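
    If you want a timeline rather than raw lines, the "move of group" records above can be parsed into objects. This is only a sketch: the regular expression is modelled on the sample lines shown here, and the log layout may differ on other Windows Server versions, so treat the pattern and the property names as assumptions.

    ```powershell
    # Per-node file produced by Get-ClusterLog (path assumed from the example above).
    $log = 'C:\Temp\Node03_cluster.log'

    # Named groups capture the timestamp, group, source node, target node, move type and failoverCount.
    $pattern = '::(?<Time>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+\w+\s+\[RCM\] move of group (?<Group>.+?) from (?<From>\S+)\(\d+\) to (?<To>\S+)\(\d+\) of type MoveType::(?<Type>\w+) .*failoverCount=(?<Count>\d+)'

    Select-String -Path $log -Pattern $pattern | ForEach-Object {
        $g = $_.Matches[0].Groups
        # One object per move record, so the result can be sorted, filtered or exported.
        [pscustomobject]@{
            Time          = $g['Time'].Value
            Group         = $g['Group'].Value
            From          = $g['From'].Value
            To            = $g['To'].Value
            MoveType      = $g['Type'].Value
            FailoverCount = [int]$g['Count'].Value
        }
    } | Format-Table -AutoSize
    ```

    Filtering the result on MoveType (for example with Where-Object MoveType -eq 'Failover') separates failovers from planned moves such as the Drain entry in the sample.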
    

    Hope this information helps.



