Cluster Random Failover
Our MSP upgraded our firmware to 12.6.2 on the 22nd/23rd of August, and since then the cluster has been randomly failing over. In the last few days it's been limited to once per day, but it's frustrating, and at the moment we don't have the time to rebuild the cluster, which was the last response we got from WatchGuard support. I am afraid it may come to that unless someone here has some advice.
Are there any recommendations on how to narrow this glitch down? I have increased the lost heartbeat threshold to 10, which may or may not have helped.
Dimension Server shows that the cluster ports changed status, which caused the switchover. We've changed these cables to new ones as well.
2020-09-01 14:27:01 networkd    4   [eth2 (Optional-1)] Interface link status changed to up
2020-09-01 14:27:01 networkd    4   [eth3 (Optional-2)] Interface link status changed to up
Here is today's Cluster HA Events:
Tue Sep  1 14:27:00 2020 Role: Member 801003B4F806D becomes IDLE. (devSt=14)
Tue Sep  1 14:27:00 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Tue Sep  1 14:27:03 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Tue Sep  1 14:27:09 2020 Election: cluster election event, Master, rcvd. Current opState=IDLE
Tue Sep  1 14:27:09 2020 Role: Member 801003B4F806D becomes MASTER. (devSt=14)
Tue Sep  1 14:27:26 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Tue Sep  1 14:27:40 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Tue Sep  1 14:27:48 2020 Formation: Member 801003D57354C: Device has joined the cluster.Device State=14
Tue Sep  1 14:28:09 2020 Role: Master 801003B4F806D assigns Member 801003D57354C as BACKUP Master, Mode PASSIVE
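To make the port flaps in the events above easier to compare from day to day, here is a minimal Python sketch that pairs each "HA port ... is DOWN" line with the next "UP" line for the same member and port. The regular expression and the function name port_outages are my own assumptions based only on the line shape shown here; this is not a WatchGuard tool, and other Fireware versions may format these lines differently.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the events above:
# "Tue Sep  1 14:27:00 2020 Formation: On <member>, HA port eth2 is DOWN"
PORT_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}) "
    r"Formation: On (?P<member>\w+), HA port (?P<port>\w+) is (?P<state>UP|DOWN)$"
)

def port_outages(lines):
    """Return (member, port, down_time, seconds_down) for each DOWN/UP pair."""
    down_since = {}
    outages = []
    for line in lines:
        m = PORT_RE.match(line.strip())
        if not m:
            continue  # not an HA port status line
        # Collapse the double space used for single-digit days before parsing.
        ts = datetime.strptime(" ".join(m["ts"].split()), "%a %b %d %H:%M:%S %Y")
        key = (m["member"], m["port"])
        if m["state"] == "DOWN":
            down_since[key] = ts
        elif key in down_since:
            start = down_since.pop(key)
            outages.append((key[0], key[1], start, (ts - start).total_seconds()))
    return outages

events = [
    "Tue Sep  1 14:27:00 2020 Role: Member 801003B4F806D becomes IDLE. (devSt=14)",
    "Tue Sep  1 14:27:00 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN",
    "Tue Sep  1 14:27:03 2020 Formation: On 801003B4F806D, HA port eth2 is UP",
    "Tue Sep  1 14:27:26 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN",
    "Tue Sep  1 14:27:40 2020 Formation: On 801003B4F806D, HA port eth2 is UP",
]

for member, port, start, secs in port_outages(events):
    print(f"{member} {port} down at {start:%H:%M:%S} for {secs:.0f}s")
```

For the events pasted above, this reports two eth2 outages on 801003B4F806D, of 3 and 14 seconds.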
Comments
There have been a number of issues reported for 12.6.2, which seems quite buggy.
Some sites have reverted to their earlier version, which stopped the problems.
Consider doing that.
Some issues with 12.6.2 are listed here:
https://community.watchguard.com/watchguard-community/discussion/1191/latest-software-releases-as-of-08-20-20
Others have been reported by Robert - RVilhelmsen
https://community.watchguard.com/watchguard-community/profile/discussions/RVilhelmsen
There are downgrade instructions in the Release Notes. Look there first.
For a cluster, from the docs:
Downgrade a FireCluster
To downgrade Fireware OS for a FireCluster, we recommend that you have the cluster members leave the cluster, use one of the methods to downgrade each Firebox separately, and then reconfigure the FireCluster. In Fireware 12.2.1 or higher, this enables you to restore a backup image when you downgrade each cluster member.
Downgrade Fireware OS
https://www.watchguard.com/help/docs/help-center/en-US/Content/en-US/Fireware/installation/version_downgrade_xtm_c.html
Obviously this requires breaking and then reforming the cluster, which is what support is suggesting that you do with 12.6.2.
re: 12.6.2 buggy - Update 1 is already out with 5 fixes for WSM Policy Manager.
Thanks @Bruce_Briggs for the details. If I wasn't running around with my head cut off on multiple things I would have thought to find that.
I have contacted our MSP asking that they review and advise, as they're the ones who usually manage the cluster and firmware upgrades for us. As far as I am aware, it's the first time they have upgraded us without waiting a week or two to see what issues came from a firmware upgrade for other, smaller customers.
Hi @NickSimpson
If your MSP hasn't done so already (or yourself, if you prefer, whatever works best for your situation):
I'd suggest opening a case. Cluster failovers could mean any number of things -- while /downgrading/ might fix it, without knowing the root cause of the issue, I can't say when it'd be ok to upgrade again.
You can do so via the support center link at the top right of this page. Support cases are free with devices that have LiveSecurity/Support, which is also required to upgrade the firmware on the device.
-James Carson
WatchGuard Customer Support
@James_Carson
FYI - per the initial post, there has been a support case opened on this:
"we don't have the time to rebuild the cluster which was the last response we got from watchguard support"
Any other thoughts?
Hey @Bruce_Briggs
Thanks for the heads up -- My fault, I missed that part.
@NickSimpson I searched the serial numbers in those log lines near the top, and found what I think is your case (1411940) -- there are crash logs listed in the support files you uploaded. If you can upload those crash logs and let the tech know that you've done so, they might be able to get more information about what's going wrong.
In the WebUI
-Go to the front panel, and on the right side of the screen under device information, there should be an area in red that says "Faults". Click this, then on the next screen, click "send all to watchguard."
In the Firebox System Manager.
-Connect to your cluster, then go to Tools -> Connect to member. Select the current master firewall.
-A new Firebox System Manager will open. In that window, go to the status report tab. Click the support button.
-Click "send all to watchguard," enter your admin password, and click OK.
If you think your case isn't moving forward, please let me know and I'd be happy to have it escalated for you.
Thank you,
-James Carson
WatchGuard Customer Support
Might be a stupid question, but what is the MSP's job here? Is it only managed one way - upgrades?
@James_Carson @Bruce_Briggs @RVilhelmsen
Thanks for following up. We cold booted both of the M470s today to see if it would make a difference. If we run into issues, we're planning to break the cluster and downgrade to the backup image on Monday, when all our locations are closed.
@James_Carson - I have sent the crash logs as per your advice. Thank you for chiming in.
@RVilhelmsen The MSP does most of the server patching and WatchGuard updates. This is the first time in 4 years we've run into an issue with an upgrade, so we're all just trying to figure out the best road to take, as we should be onsite when we go through this process.
I will update this over the next couple of days to let everyone know where we stand. Appreciate the comments and feedback
~ 25 hours up and then the random reboot
Looks like we'll be going forward with the downgrade on Monday
Cluster HA Events:
Wed Sep 2 07:31:26 2020 Bootup: System bootup
Wed Sep 2 07:31:35 2020 Formation: Member 801003B4F806D: Device has joined the cluster.Device State=14
Wed Sep 2 07:31:45 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Wed Sep 2 07:31:56 2020 Election: cluster election event, Backup Master, rcvd. Current opState=IDLE
Wed Sep 2 07:31:56 2020 Role: Member 801003B4F806D becomes BACKUP SYNC. (devSt=14)
Wed Sep 2 07:31:56 2020 Role: Member 801003B4F806D becomes BACKUP. (devSt=14)
Thu Sep 3 08:35:06 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Thu Sep 3 08:35:09 2020 Role: Member 801003B4F806D becomes IDLE. (devSt=14)
Thu Sep 3 08:35:11 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Thu Sep 3 08:35:17 2020 Election: cluster election event, Master, rcvd. Current opState=IDLE
Thu Sep 3 08:35:17 2020 Role: Member 801003B4F806D becomes MASTER. (devSt=14)
Thu Sep 3 08:35:34 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Thu Sep 3 08:35:47 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Thu Sep 3 08:35:56 2020 Formation: Member 801003D57354C: Device has joined the cluster.Device State=14
Thu Sep 3 08:36:17 2020 Role: Master 801003B4F806D assigns Member 801003D57354C as BACKUP Master, Mode PASSIVE
Checking Cluster member HA Events:
Tue Aug 25 06:01:41 2020 Role: Master 801003D57354C assigns Member 801003B4F806D as BACKUP Master, Mode PASSIVE
Thu Sep 3 08:35:47 2020 Bootup: System bootup
Thu Sep 3 08:35:56 2020 Formation: On 801003D57354C, HA port eth2 is UP
Thu Sep 3 08:35:57 2020 Formation: Member 801003D57354C: Device has joined the cluster.Device State=14
Thu Sep 3 08:36:18 2020 Election: cluster election event, Backup Master, rcvd. Current opState=IDLE
Thu Sep 3 08:36:18 2020 Role: Member 801003D57354C becomes BACKUP SYNC. (devSt=14)
Thu Sep 3 08:36:18 2020 Role: Member 801003D57354C becomes BACKUP. (devSt=14)
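As a quick check on the "~ 25 hours" figure, here is a small Python sketch that takes the Bootup line and the next "HA port eth2 is DOWN" line from the first member's events and computes the time between them. The helper name parse_ts is hypothetical, and it assumes every event line starts with a five-field timestamp like the ones pasted here.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"

def parse_ts(line):
    """Parse the leading 'Wed Sep 2 07:31:26 2020' timestamp of an HA event line."""
    parts = line.split()  # split() also absorbs the double space before 1-digit days
    return datetime.strptime(" ".join(parts[:5]), FMT)

boot = parse_ts("Wed Sep 2 07:31:26 2020 Bootup: System bootup")
port_down = parse_ts(
    "Thu Sep 3 08:35:06 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN"
)

uptime = port_down - boot
hours, rem = divmod(int(uptime.total_seconds()), 3600)
print(f"uptime before failover: {hours}h {rem // 60}m {rem % 60}s")
```

For these two lines the result is 25h 3m 40s, which matches the roughly 25 hours mentioned above.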
Your logs from Dimension must show a kernel crash, otherwise the device would not reboot. What is logged at that time?
here is what's in the diagnostics
DATE-TIME | PROCESS | MESSAGE
2020-09-03 08:35:14 | dyndns | Could not resolve server: dynupdate.no-ip.com
2020-09-03 08:35:14 | daas | on_wgapi_read: WGAPI_TYPE_NOTIFICATION/WGAPI_EVENT_INTERFACE_STATUS
2020-09-03 08:35:15 | crd | Clst_crd_InvokeCrdFsm:3698: Returned FAILURE
2020-09-03 08:35:18 | ctd | handle_CRD_ROLE_UPDATE_MSG: other:0, role:3 for ID:801003B4F806D
2020-09-03 08:35:18 | ctd | update kxp: ID:801003D57354C bitmap:0x3, ifIndex:5
2020-09-03 08:35:18 | ctd | update kxp: ID:801003B4F806D bitmap:0x0, ifIndex:4
2020-09-03 08:35:18 | ctd | handle_CRD_ROLE_UPDATE_MSG: other:1, role:4 for ID:801003D57354C
2020-09-03 08:35:18 | ccd | CCD_FlushSwitchChip: Failed to open procfs entry /proc/wg_dsa/flush
2020-09-03 08:35:18 | cdiagd.clstevts | Received cluster event.
2020-09-03 08:35:18 | daas | on_wgapi_read: WGAPI_TYPE_NOTIFICATION/WGAPI_EVENT_CLST_ROLE_CHANGE
2020-09-03 08:35:18 | cdiagd.clstevts | Role change, member id 801003B4F806D, pre_role 4, new role 3.
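If the diagnostic export gets long, a short Python sketch like the one below can split the pipe-delimited rows and keep only selected processes. Treating crd, ctd and ccd as the cluster-related processes is my assumption from the messages above, not something confirmed by WatchGuard, and the function names are made up for illustration.

```python
from collections import defaultdict

def parse_diag(lines):
    """Split 'DATE-TIME | PROCESS | MESSAGE' rows into (time, process, message)."""
    rows = []
    for line in lines:
        # maxsplit=2 keeps any '|' inside the message text intact
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3 or parts[0].startswith("DATE"):
            continue  # skip the header row and malformed lines
        rows.append(tuple(parts))
    return rows

def by_process(rows, wanted):
    """Group (time, message) pairs by process name, keeping only 'wanted' names."""
    grouped = defaultdict(list)
    for ts, proc, msg in rows:
        if proc in wanted:
            grouped[proc].append((ts, msg))
    return dict(grouped)

sample = [
    "DATE-TIME | PROCESS | MESSAGE",
    "2020-09-03 08:35:14 | dyndns | Could not resolve server: dynupdate.no-ip.com",
    "2020-09-03 08:35:15 | crd | Clst_crd_InvokeCrdFsm:3698: Returned FAILURE",
    "2020-09-03 08:35:18 |ctd | handle_CRD_ROLE_UPDATE_MSG: other:0, role:3 for ID:801003B4F806D",
    "2020-09-03 08:35:18 | ccd | CCD_FlushSwitchChip: Failed to open procfs entry /proc/wg_dsa/flush",
]

# Assumption: these three names are the cluster-related processes of interest.
cluster = by_process(parse_diag(sample), {"crd", "ctd", "ccd"})
for proc, entries in sorted(cluster.items()):
    for ts, msg in entries:
        print(ts, proc, msg)
```

The parser tolerates the uneven spacing around the pipes seen in the export, since each field is stripped after splitting.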
There must be more. That's not enough to cause a kernel crash.
There's nothing else in the logs that I can see for that time period unfortunately
Here are the events as well. Very frustrating, but it could be happening every 20 minutes like last Monday, which was a pain in the @55
2020-09-03 08:30:24 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:30:55 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:31:30 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:32:06 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:32:37 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:33:19 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:33:59 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:34:35 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:35:17 crd 6 Member 801003B4F806D changed role to MASTER.
2020-09-03 08:35:17 crd 6 Cluster member 801003B4F806D changed role from Idle to Master.
2020-09-03 08:35:17 crd 6 Member 801003B4F806D is now master.
Then if nothing is logged whatsoever other than HA failover events and links down, my first guess would be a power failure. But then again, that happening to both units at different times is very unlikely.
Does the "failed" device do a complete reboot? If so, then some kernel/daemon crash or power failure must have been logged.
the 354C device did a complete reboot this morning, the other that is now the master is showing 1d 3h uptime so we'll have to wait and see tomorrow if that does it as well.
Any idea where to get logging for the kernel crash if it happened?
Right after it happens, get a support log file, or look in your Dimension logs.
@RVilhelmsen fault reports show it under support files
Kernel Exception crash Sep 3, 2020, 11:35:37 AM
Just wanted to update: the cluster was rebuilt yesterday by my counterpart and our MSP. So far we've had two Kernel Exception crashes on 801003B4F806D, causing two failovers within 30 minutes. So it doesn't look like rebuilding the cluster has solved our issue.
Just to add I upgraded my A/P M470 FireCluster to 12.6.2 this weekend. Seeing same kernel exception random reboots and cluster failovers. Case open on it.
Just to add to the list:
I upgraded my A/P M470 firecluster to 12.6.2 this weekend. Seeing same kernel exception random reboots and cluster failovers.
Doesn't appear to be affecting my other M370 A/P Fireclusters on 12.6.2. So limited to M470 (at least in my case).
12.6.4 has a few fixes per the release notes.
In an active/passive FireCluster, DNSWatch no longer fails when the active cluster member has an expired DNSWatch license and the passive cluster member has an unexpired DNSWatch license. [FBX-17093]
This release resolves a FireCluster issue that caused the sslvpn_firecluster process to use high CPU on the backup master when no Mobile VPN with SSL client was connected. [FBX-20962]
An interface disconnected from a FireCluster no longer causes a fault report. [FBX-20143]