Cluster Random Failover

NickSimpson · September 2020

Our MSP upgraded our firmware on the 22/23rd of June AUG to 12.6.2 and since then the cluster has been randomly flipping over. In the last few days it's been limited to once per day but it's frustrating and at the moment we don't have the time to rebuild the cluster which was the last response we got from watchguard support. I am afraid it may come to that unless someone here has some advise.

Is there any recommendations on how to narrow this glitch down? I have increased the lost heartbeat threshold to 10 which may or may not have helped.

Dimension Server shows that the cluster ports changed status which caused the switchover. We've change these cable to new ones as well.

2020-09-01 14:27:01 networkd 4 [eth2 (Optional-1)] Interface link status changed to up
2020-09-01 14:27:01 networkd 4 [eth3 (Optional-2)] Interface link status changed to up

Here is today's Cluster HA Events:

Tue Sep 1 14:27:00 2020 Role: Member 801003B4F806D becomes IDLE. (devSt=14)
Tue Sep 1 14:27:00 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Tue Sep 1 14:27:03 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Tue Sep 1 14:27:09 2020 Election: cluster election event, Master, rcvd. Current opState=IDLE
Tue Sep 1 14:27:09 2020 Role: Member 801003B4F806D becomes MASTER. (devSt=14)
Tue Sep 1 14:27:26 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Tue Sep 1 14:27:40 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Tue Sep 1 14:27:48 2020 Formation: Member 801003D57354C: Device has joined the cluster.Device State=14
Tue Sep 1 14:28:09 2020 Role: Master 801003B4F806D assigns Member 801003D57354C as BACKUP Master, Mode PASSIVE

Bruce_Briggs · September 2020

There have been a number of issues reported for 12.6.2, which seems quite buggy.
Some sites have reverted to their earlier version, which stopped the problems.
Consider doing that.

Some issues with 12.6.2 are listed here:
https://community.watchguard.com/watchguard-community/discussion/1191/latest-software-releases-as-of-08-20-20

Others have been reported by Robert - RVilhelmsen
https://community.watchguard.com/watchguard-community/profile/discussions/RVilhelmsen

NickSimpson · September 2020

@Bruce_Briggs we were advised by our MSP that reverting the firmware is quite the exercise. I tried to do it with one of our T30s but didn't have much luck. Do you have any suggestions on how to do it?

Bruce_Briggs · September 2020

There are downgrade instructions in the Release Notes. Look there first.

For a cluster, from the docs:

Downgrade a FireCluster

To downgrade Fireware OS for a FireCluster, we recommend that you have the cluster members leave the cluster, use one of the methods to downgrade each Firebox separately, and then reconfigure the FireCluster. In Fireware 12.2.1 or higher, this enables you to restore a backup image when you downgrade each cluster member.

Downgrade Fireware OS
https://www.watchguard.com/help/docs/help-center/en-US/Content/en-US/Fireware/installation/version_downgrade_xtm_c.html

Obviously this requires breaking and then reforming the cluster, which is what support is suggesting that you do with 12.6.2.

re: 12.6.2 buggy - Update 1 is already out with 5 fixes for WSM Policy Manager.

NickSimpson · September 2020

Thanks @Bruce_Briggs for the details. If I wasn't running around with my head cut off on multiple things I would have thought to find that.

I have contacted our MSP asking that they review and advise as they're the one that manages the cluster and firmware upgrades usually for us. It's the first time they have upgraded us without waiting a week or 2 to see what issues came from a firmware upgrade for other smaller customers that I am aware of

james.carson · September 2020

Hi @NickSimpson
If your MSP hasn't done so already (or yourself, if you prefer, whatever works best for your situation):

I'd suggest opening a case. Cluster failovers could mean any number of things -- while /downgrading/ might fix it, without knowing the root cause of the issue, I can't say when it'd be ok to upgrade again.

You can do so via the support center link at the top right of this page. Support cases are free with devices that have LiveSecurity/Support, which is also required to upgrade the firmware on the device.

Bruce_Briggs · September 2020

@James_Carson

FYI - per the initial post, there has been a support case opened on this:
"we don't have the time to rebuild the cluster which was the last response we got from watchguard support"

Any other thoughts?

james.carson · September 2020

Hey @Bruce_Briggs
Thanks for the heads up -- My fault, I missed that part.

@NickSimpson I searched the serial numbers in those log lines near the top, and found what i think is your case (1411940) -- there's crash logs in the support files you uploaded. If you can upload those and let the tech know that you've done so, they might be able to get more information about what's going wrong.

In the WebUI
-Go to the front panel, and on the right side of the screen under device information, there should be an area in red that says "Faults". Click this, then on the next screen, click "send all to watchguard."

In the Firebox System Manager.
-Connect to your cluster, Then go to Tools -> Connect to member. Select the current master firewall.
-A new Firebox System Manager will open. In that window, go to the status report tab. Click the support button.
-Click "send all to watchguard," enter your admin password, and click OK.

If you think your case isn't moving forward, please let me know and I'd be happy to have it escalated for you.

Thank you,

rv@kaufmann.dk · September 2020

might be a stupid question, but what is the MSP job here? It´s only managed one way - upgrades?

NickSimpson · September 2020

@James_Carson @Bruce_Briggs @RVilhelmsen

Thanks for following up. We cold booted both of the M470's today to see if it would make a difference. If we run into issues we're planning to break the cluster and downgrade to the backup image on Monday when all our locations are closed.

@James_Carson - I have sent the crash logs as per your advise. Thank you for chiming in.

@RVilhelmsen MSP does most of the server patching and Watchguard updates, first time in 4 years we've run into an issue with an upgrade so we're all just trying to figure out the best road taken as we should be onsite when we go through this process.

I will update this over the next couple of days to let everyone know where we stand. Appreciate the comments and feedback

NickSimpson · September 2020

~ 25 hours up and then the random reboot

Looks like we'll be going forward with the downgrade on Monday

Cluster HA Events:

Wed Sep 2 07:31:26 2020 Bootup: System bootup
Wed Sep 2 07:31:35 2020 Formation: Member 801003B4F806D: Device has joined the cluster.Device State=14
Wed Sep 2 07:31:45 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Wed Sep 2 07:31:56 2020 Election: cluster election event, Backup Master, rcvd. Current opState=IDLE
Wed Sep 2 07:31:56 2020 Role: Member 801003B4F806D becomes BACKUP SYNC. (devSt=14)
Wed Sep 2 07:31:56 2020 Role: Member 801003B4F806D becomes BACKUP. (devSt=14)
Thu Sep 3 08:35:06 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Thu Sep 3 08:35:09 2020 Role: Member 801003B4F806D becomes IDLE. (devSt=14)
Thu Sep 3 08:35:11 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Thu Sep 3 08:35:17 2020 Election: cluster election event, Master, rcvd. Current opState=IDLE
Thu Sep 3 08:35:17 2020 Role: Member 801003B4F806D becomes MASTER. (devSt=14)
Thu Sep 3 08:35:34 2020 Formation: On 801003B4F806D, HA port eth2 is DOWN
Thu Sep 3 08:35:47 2020 Formation: On 801003B4F806D, HA port eth2 is UP
Thu Sep 3 08:35:56 2020 Formation: Member 801003D57354C: Device has joined the cluster.Device State=14
Thu Sep 3 08:36:17 2020 Role: Master 801003B4F806D assigns Member 801003D57354C as BACKUP Master, Mode PASSIVE

Checking Cluster member HA Events:

Tue Aug 25 06:01:41 2020 Role: Master 801003D57354C assigns Member 801003B4F806D as BACKUP Master, Mode PASSIVE
Thu Sep 3 08:35:47 2020 Bootup: System bootup
Thu Sep 3 08:35:56 2020 Formation: On 801003D57354C, HA port eth2 is UP
Thu Sep 3 08:35:57 2020 Formation: Member 801003D57354C: Device has joined the cluster.Device State=14
Thu Sep 3 08:36:18 2020 Election: cluster election event, Backup Master, rcvd. Current opState=IDLE
Thu Sep 3 08:36:18 2020 Role: Member 801003D57354C becomes BACKUP SYNC. (devSt=14)
Thu Sep 3 08:36:18 2020 Role: Member 801003D57354C becomes BACKUP. (devSt=14)

rv@kaufmann.dk · September 2020

Your logs from Dimension must show kernel crash else the device would not reboot. What is logget at that time?

NickSimpson · September 2020

here is what's in the diagnostics

rv@kaufmann.dk · September 2020

there must be more. That´s not enough to do a kernel crash

NickSimpson · September 2020

There's nothing else in the logs that I can see for that time period unfortunately

Here is the events as well. Very frustrating but it could be happening every 20 minutes like last Monday which was a pain in the @55

2020-09-03 08:30:24 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:30:55 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:31:30 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:32:06 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:32:37 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:33:19 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:33:59 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:34:35 loggerd 6 Archived log file /var/log/traffic.log which reached max size
2020-09-03 08:35:17 crd 6 Member 801003B4F806D changed role to MASTER.
2020-09-03 08:35:17 crd 6 Cluster member 801003B4F806D changed role from Idle to Master.
2020-09-03 08:35:17 crd 6 Member 801003B4F806D is now master.

rv@kaufmann.dk · September 2020

Then if nothing is logged what so ever other than HA failover events and links down my first guess would be power failure. But then agin to both units at different times is very unlikely.
Do the "failed" device do a complete reboot? If so - then there must be logged some kernel/deamon crash or power failure.

NickSimpson · September 2020

the 354C device did a complete reboot this morning, the other that is now the master is showing 1d 3h uptime so we'll have to wait and see tomorrow if that does it as well.

Any idea where to get logging for the kernal crash if it happened?

rv@kaufmann.dk · September 2020

right after get a support logfile or in your dimension logs

NickSimpson · September 2020

@RVilhelmsen fault reports show it under support files

Kernel Exception crash Sep 3, 2020, 11:35:37 AM

NickSimpson · September 2020

Just wanted to update, Cluster was rebuilt yesterday by my counterpart and our MSP. So far we've had two Kernel Exception crashes on 801003B4F806D causing two fail-overs within 30 minutes. So it doesn't look like rebuilding the cluster hasn't solved our issue.

dlm_parr · October 2020

Just to add I upgraded my A/P M470 firecluster to 12.6.2 this weekend. Seeing same kernel exception radmom reboots and cluster failovers. Case open on it.

dlm_parr · October 2020

Just to add to the list:
I upgraded my A/P M470 firecluster to 12.6.2 this weekend. Seeing same kernel exception random reboots and cluster failovers.

Doesn't appear to be affecting my other M370 A/P Fireclusters on 12.6.2. So limited to M470 (at least in my case).

phanaaekIT · February 2021

12.6.4 has a few fixes per the release notes.

In an active/passive FireCluster, DNSWatch no longer fails when the active cluster member has an expired DNSWatch license and the passive cluster member has an unexpired DNSWatch license. [FBX-17093]
This release resolves a FireCluster issue that caused the sslvpn_firecluster process to use high CPU on the backup master when no Mobile VPN with SSL client was connected. [FBX-20962]
An interface disconnected from a FireCluster no longer causes a fault report. [FBX-20143]

Cluster Random Failover

Comments