Cluster Backup member not joining after power maintenance #7979
Hi guys,
We have an A/P cluster of XTM330s running 12.1.3 B608021, and this is about the third time that, after a scheduled power maintenance, one of the members (Firewall "A") does not sync back to the cluster when power comes back up; it stays "inactive". Strangely, when we connect to that member in FSM, the member status appears as "idle". We use WSM 12.5.1 to manage these appliances.
THE FIRST TIME, apart from having an inactive member, the cluster operated fine, although FSM looked like this and we were unable to log in to the Web UI:
https://snipboard.io/LvVhse.jpg
After some back and forth, we realized the inactive member was able to act as master once we took it off the network and managed it individually. Since this member, shown as "inactive" in the cluster, was not showing the same strange behavior as Firewall "A", we moved all the network cables over to this device, Firewall "B".
Somehow Firewall "A" was glitching FSM, even though it wasn't acting as the master member.
We completely factory-reset Firewall "A", re-applied 12.1.3 B608021 (to match what we have in this cluster), put it in safe mode, and the cluster synced fine once again.
THE SECOND TIME was a bit nasty, but we were lucky. After the power maintenance, when power came back up, Firewall "A" did not sync and showed the same "inactive" status. Strangely, when we connected to that member in FSM, the member status appeared as "idle" again.
This time there was no glitch in FSM (it did show status) and the Web UI was not returning an HTTP server error. After some back and forth, we discovered that the management IP address of Firewall "A" was being answered by another host on the network. Since the local network manager didn't know which host it was, we decided to pick a new, unused IP address as the management IP for Firewall "A". The firewalls synced after that, nice!
Now we are on the THIRD TIME. It seems we do not have an IP conflict this time and, once again, there is no glitch in FSM or the Web UI, but after a power maintenance, when power came back up, Firewall "A" once again did not sync to the cluster and remained in the "inactive" state. Strangely, when we connected to that member in FSM, the member status appeared as "idle" again. I am starting to think that part is just normal behavior.
We put Firewall "A" into safe mode and then rejoined it to the cluster successfully. Rebooting Firewall "A" or running Discover Member did not help: even when FSM reported success, the member stayed "inactive" until we forced it into safe mode.
Next Sunday we'll have another power maintenance; it seems to be a routine task, and we expect the cluster not to return to normal after power comes back up.
Things I am researching right now:
1) WatchGuard Forum
I haven't found a similar issue here, which is why I'm posting my own.
2) UserMac log
We have several log entries with the following pattern, which I'm clueless about:
024-01-17 11:17:57 Secundario firewall sess_event: Session event "Add" has no "UserMac" parameter Debug 2024-01-17 11:17:57 Secundario firewall sess_event: Session event "Del" has no "UserMac" parameter
3) I've seen we are using ID 5 for the cluster
The standard is 1. I'll ask whether there's another multicast MAC in the network; I don't recall why it's set to 5.
4) Strange local subnetting
The firewall management interface and local network are on 10.0.0.1/23, but some local computers I access are configured on the 10.0.0.0/24 or 10.0.1.0/24 subnets. I don't recall applying this strange setting to the stations; the firewall could simply have secondary networks if they want to avoid broadcast storms, need more local IPs, and don't want to deal with VLANs.
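To make the mask mismatch concrete, here is a minimal sketch using Python's standard `ipaddress` module. The firewall address comes from the post; the station address, the peer address, and the helper name `on_link` are hypothetical and only serve as an illustration.

```python
import ipaddress

def on_link(interface, target):
    """Return True if `target` is inside the subnet configured on `interface`."""
    return ipaddress.ip_address(target) in ipaddress.ip_interface(interface).network

firewall = "10.0.0.1/23"   # firewall interface, as described in the post
station = "10.0.0.50/24"   # hypothetical station configured with a /24 mask
peer = "10.0.1.20"         # hypothetical host in the upper half of the /23

# Both /24 ranges are contained in the firewall's /23 network.
firewall_net = ipaddress.ip_interface(firewall).network
for subnet in ("10.0.0.0/24", "10.0.1.0/24"):
    print(subnet, "inside", firewall_net, ":",
          ipaddress.ip_network(subnet).subnet_of(firewall_net))

# The firewall treats the peer as directly reachable, the /24 station does not.
print("firewall sees peer on-link:", on_link(firewall, peer))
print("station sees peer on-link:", on_link(station, peer))
```

With these masks, a /24 station would send traffic for the other /24 through its default gateway instead of resolving it directly, while the firewall considers both ranges local.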
5) Cluster log messages
The first three times, I looked at the event logs around the time of the incident, but on reviewing the firewall log settings I found that cluster logging was off; I've set it to Error now.
I also tried to leave a tcpdump/Wireshark capture running from the CLI, saving packets to files and filtering for DHCP traffic (their DHCP server, or another one on the network, may be leasing IPs that conflict with the management IPs?), and to look for an IP conflict in the dump (I still have to work out which pattern to filter for in Wireshark). For some reason, the local network manager has not allowed me to run this capture from a local computer. I'll see how well it works from the firewall's perspective, since at least some of the DHCP packets I want are broadcast, so I can capture them without port mirroring or a similar feature.
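One way to look for a conflict once a capture exists is to group the observed (IP, MAC) pairs and flag any IP claimed by more than one MAC. This is only a sketch: how the pairs are exported from the capture is not covered, the function name is hypothetical, and the addresses below are made up for illustration. More than one MAC for an IP is a hint to investigate, not proof of a conflict.

```python
from collections import defaultdict

def find_ip_conflicts(observations):
    """Return {ip: sorted list of MACs} for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}

# Hypothetical (IP, MAC) pairs, e.g. taken from ARP replies or DHCP ACKs
# exported from a capture. None of these values come from the real network.
observed = [
    ("10.0.0.2", "00:90:7F:AA:BB:01"),
    ("10.0.0.2", "3c:52:82:11:22:33"),
    ("10.0.0.10", "3c:52:82:44:55:66"),
]

for ip, macs in find_ip_conflicts(observed).items():
    print(ip, "answered by", ", ".join(macs))
```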
6) Look for cluster bugs fixed in newer Fireware versions
Last but not least, I'll look through the release notes of newer Fireware versions for cluster bugs that could match what I'm experiencing here. I know the XTM330 is EOL, but if there is an unfixed bug like this, it will help the ongoing effort to get this network to update its hardware. They are also considering opening a new office; this old cluster could be reallocated there and a new one would stay at HQ.
Any hints, guys?
Regards,
Rafael da Costa
Comments
For the last event, from Feb 25th 3pm (the firewalls went down at 3:18pm) until Feb 26th 1pm, we have 34914 records with the pattern '"UserMac" parameter'. The firewalls were rejoined on Feb 26th at 12:06pm.
We have these annoying logs until 12:59:58pm. I'll check whether we still have some of them right now and update this post if so; I believe we will.
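A count like this can be reproduced from an exported log with a short script. The sketch below assumes the log has already been saved as plain text; the function name is hypothetical, and the pattern is copied from the messages quoted in the original post.

```python
import re
from collections import Counter

# Pattern taken from the log messages quoted in this thread.
USERMAC_RE = re.compile(r'Session event "(\w+)" has no "UserMac" parameter')

def count_usermac_events(log_text):
    """Count matching messages per event type (for example "Add" or "Del")."""
    # findall scans the whole text, so it also copes with several
    # records that were pasted onto a single line.
    return Counter(USERMAC_RE.findall(log_text))

sample = (
    '2024-01-17 11:17:57 Secundario firewall sess_event: '
    'Session event "Add" has no "UserMac" parameter\n'
    '2024-01-17 11:17:57 Secundario firewall sess_event: '
    'Session event "Del" has no "UserMac" parameter\n'
)
counts = count_usermac_events(sample)
print(dict(counts), "total:", sum(counts.values()))
```

Splitting the total by event type shows whether the "Add" and "Del" messages arrive in pairs, which may help when comparing the period before and after the rejoin.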
Regards,
Rafael da Costa
Hi @RafaelFerreira
I would not suggest running these devices. You're not running 12.1.3 Update 5 (B640446) or better, which means this device is susceptible to the Cyclops Blink issue.
See: https://techsearch.watchguard.com/KB?type=Article&SFDCID=kA16S000000SOCGSA4&lang=en_US
12.1.3 Update 3 (B608021 - the version you're running) was last updated on 2 December 2019. That's over 4 years with no security updates.
I would suggest that the cost savings of running these very old devices are not worth the security impact of running this old software.
Should you need to get these devices running, I'd suggest the following:
-Connect to the WebUI (https:// IP of firewall :8080 )
-Once logged in, go to System Status -> Diagnostics.
-Click to download a support log file.
(you'll need a program that can open TGZ files, such as 7-zip, and a program that can read UNIX style text files, like Notepad++ for the next steps.)
-Open the support file.
-In the file, navigate to Fireware_XTM_Support.tgz\Fireware_XTM_Support.tar\support\system\system_status.txt
-Look for the section labeled Cluster Snapshot - this may provide more information.
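If you'd rather script the last three steps, something like the sketch below may work. It assumes the downloaded support file is a regular gzip-compressed tar that contains a member ending in `system_status.txt`; the function name is hypothetical, and the exact member path inside the archive is not verified here, so the code matches on the file name only.

```python
import tarfile

def read_cluster_snapshot(tgz_path, marker="Cluster Snapshot"):
    """Return the text of system_status.txt from the marker line onwards."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Match on the file name only; the directory layout is an assumption.
            if member.isfile() and member.name.endswith("system_status.txt"):
                raw = archive.extractfile(member).read()
                break
        else:
            raise FileNotFoundError("system_status.txt not found in archive")
    text = raw.decode("utf-8", errors="replace")
    start = text.find(marker)
    # An empty string means the section label was not found.
    return text[start:] if start != -1 else ""
```

Calling `read_cluster_snapshot("Fireware_XTM_Support.tgz")` should then return everything from the Cluster Snapshot label to the end of the file, which can be printed or saved for comparison after the next incident.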
-James Carson
WatchGuard Customer Support
Hi James, I appreciate your reply. The cluster status seems fine to me. If the issue arises again, I'll check this same status report again.
Cluster Snapshot
------ Cluster Info ------
cluster is Enabled
mode is Active/Passive
ha1_if_num=eth3
ha2_if_num=eth4
mgmt_if_num=eth1
cluster_id=5
lb_algorithm=Least-Connections
hb_threshold=5
progressive_vmac=1
cluster_uptime=0
nwMode=ROUTER MODE
tpMode ip=0.0.0.0/0
local logLevel=0
hhi_enabled=1
monitor_dev_if count = 5
*** monitor_dev_if ***
eth0
eth1
eth2
eth5
eth6
*** monitor eth interface_list ***
eth: 0 1 2 5 6
*** monitor_la_interface_list ***
bond:
cfg_hash=44ae727015af82ab9e398d0ccd7c8afc44823e1f
Member: Primario ID: Firewall "A"
(local): Online
Role: BACKUP_MASTER
device_state=Active (6, 14)
device_mode=PASSIVE
device_model=XTM330
ha1_ip=10.20.30.1
ha1_netmask=255.255.255.248
ha2_ip=10.20.25.1
ha2_netmask=255.255.255.248
mgmt_ip=10.0.0.2
mgmt_netmask=255.255.254.0
priority=1
cluster_version=2
software_version=12.1.3.B608021
Cluster Dynamic Information
Cluster is Enabled
Mode = Active/Passive
Member Id (self) = Firewall "A"
Master Member Id = Firewall "B"
Internal Device State (self) = 14
Member Cluster Role = BACKUP_MASTER
Flags [ hhi_cfg_on ]
HHI is enabled
SysTime at Clst up = Sun Feb 25 19:22:47 2024
Current sysTime = Thu Feb 29 21:04:44 2024
Cluster UP for = 97 hr(s): 41 mins(s): 57 sec(s)
Monitored ETH Interfaces (1-Enabled, 0-Not Enabled):
eth0=1 eth1=1 eth2=1 eth5=1 eth6=1
ETH Interface Status (1-UP, 0-DOWN):
eth0=1 eth1=1 eth2=1 eth3=1 eth4=1 eth5=1 eth6=1
Monitored LA Interfaces (1-Enabled, 0-Not Enabled):
LA Interface Status (1-UP, 0-DOWN):
System Health Index (SHI) = 100
Hardware Health Index (HHI) = 100
Monitored Ports Health Index (MPHI) = 100
Weighted Avg Index (WAI) = 100
Mbr Info Sync Flag = Done
CTD Channel(tcp connection)Status:
--->To Member Firewall "B" is UP
Member's info
Member [Primario] Id = Firewall "A"
Member Device state = 14
Member cluster Role = BACKUP_MASTER
Priority = 1
Mode = PASSIVE
Member [Secundario] Id = Firewall "B"
Member Device state = 14
Member cluster Role = MASTER
Priority = 1
Mode = ACTIVE
FSS Registration Info On MemberID = Firewall "A"
ModuleID = 3e, IPC Id = b4000733, FSS Reg = 1, IS Reg = 0
ModuleID = 69, IPC Id = 6880067c, FSS Reg = 1, IS Reg = 0
ModuleID = 38, IPC Id = 2400000, FSS Reg = 1, IS Reg = 1
ModuleID = 48, IPC Id = 6bc0071f, FSS Reg = 1, IS Reg = 0
ModuleID = 31, IPC Id = 630006b7, FSS Reg = 1, IS Reg = 0
ModuleID = 2, IPC Id = e2400722, FSS Reg = 1, IS Reg = 0
FSS Registration Info On MemberID = Firewall "B"
ModuleID = 3e, IPC Id = b4000733, FSS Reg = 1, IS Reg = 0
ModuleID = 69, IPC Id = 6880067c, FSS Reg = 1, IS Reg = 0
ModuleID = 38, IPC Id = 2400000, FSS Reg = 1, IS Reg = 1
ModuleID = 48, IPC Id = 6bc0071f, FSS Reg = 1, IS Reg = 0
ModuleID = 31, IPC Id = 630006b7, FSS Reg = 1, IS Reg = 0
ModuleID = 2, IPC Id = e2400722, FSS Reg = 1, IS Reg = 0
Cluster Health
Member Id = Firewall "A"
Member cluster Role = 2
System Health Index (SHI) = 100
Hardware Health Index (HHI) = 100
Monitored Ports Health Index (MPHI) = 100
Weighted Avg Index (WAI) = 100
NOTE: Failover occurs only when member's weighted avg index[WAI] is greater than master's weighted avg index[WAI]
Cluster HA event
Member Id (self) = Firewall "A"
Cluster Role = BACKUP_MASTER
Mon Feb 26 14:57:45 2024 Formation: Member Firewall "A": Device has joined the cluster.Device State=14
Mon Feb 26 11:57:54 2024 Election: cluster election event, Backup Master, rcvd. Current opState=IDLE
Mon Feb 26 11:57:55 2024 Role: Member Firewall "A" becomes BACKUP SYNC. (devSt=14)
Mon Feb 26 11:57:55 2024 Role: Member Firewall "A" becomes BACKUP. (devSt=14)
Cluster Manager Status
operation: none
state: idle
protocol: 0
state: idle
result: ok
start:
end:
msg:
member:
name: Primario
member_id: Firewall "A"
role: backup
state: idle
owner: no
member:
name: Secundario
member_id: Firewall "B"
role: master
state: idle
owner: yes
Cluster Load Balance
Connection state
echo 0 > conn_stat to dump the stat for default clb policy
echo 1 > conn_stat to dump the stat for sslvpn clb policy
default clb policy: algorithm = 0, rr_next = 0
================== ========== ========= ====== ========== =========
Firewall "B" 0 2 1 00000000 0
Firewall "A" 0 5 0 00000001 0
SA state
sa load balance algorithm = 0, rr_next = 0
================== ========== ========= ====== ==========
Firewall "B" 00000000 2 1 00000000
Firewall "A" 00000000 5 0 00000001
Destination Policy IP
echo 0 > dstPcy to dump the complete table, or
echo ip > dstPcy to dump an entry
================== ================
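To compare this report with one taken during the next incident, the health indices can be pulled out of the text automatically. The sketch below is only an illustration: the function names are hypothetical, the threshold of 100 is simply the value seen in this healthy report, and the regular expression is based on the line format shown above.

```python
import re

# Matches lines such as "System Health Index (SHI) = 100".
INDEX_RE = re.compile(r"\((SHI|HHI|MPHI|WAI)\)\s*=\s*(\d+)")

def parse_health_indices(section_text):
    """Return e.g. {'SHI': 100, ...}; a later occurrence overwrites an earlier one."""
    return {name: int(value) for name, value in INDEX_RE.findall(section_text)}

def below_threshold(indices, threshold=100):
    """List the index names whose value is under the (assumed) threshold."""
    return sorted(name for name, value in indices.items() if value < threshold)

report = """System Health Index (SHI) = 100
Hardware Health Index (HHI) = 100
Monitored Ports Health Index (MPHI) = 100
Weighted Avg Index (WAI) = 100
"""
indices = parse_health_indices(report)
print(indices, "below threshold:", below_threshold(indices))
```

Pass only one member's section at a time, since the same index names repeat for each member in the full report.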
Regards,
Rafael da Costa