Cluster Backup member not joining after power maintenance #7979

Hi guys,

we have an A/P cluster of XTM330s on 12.1.3 B608021, and this is about the third time that, after a scheduled power maintenance, one of the members (Firewall "A") does not sync back to the cluster when power comes back up; it stays "inactive". Strangely, when we connect to that member in FSM, the member status appears as "idle". We use WSM 12.5.1 to manage these appliances.

THE FIRST TIME, apart from having an inactive member, the cluster operated fine, although FSM looked like the screenshot below and we were unable to log in to the Web UI.
https://snipboard.io/LvVhse.jpg

After some back and forth, we realized the inactive member was able to act as master once we took it off the network and managed it individually. Since this member, shown as "inactive" in the cluster, did not show the same strange behavior as Firewall "A", we took all the network cables and plugged them into this device, Firewall "B".

Somehow Firewall "A" was glitching FSM, even though it wasn't acting as the master member.

On Firewall "A", we did a full factory reset, re-applied 12.1.3 B608021 (to match what we have in this cluster scenario), put it in safe mode, and the cluster synced fine once again.

THE SECOND TIME was a bit nasty, but we were lucky. After the power maintenance, when power came back up, Firewall "A" did not sync and showed the same "inactive" status. Strangely, when we connected to that member in FSM, the member status appeared as "idle" again.

This time there was no glitch: FSM did not show an empty status and the Web UI did not give an HTTP server error. After some back and forth, we discovered that another host on the network was responding on the management IP address of Firewall "A". Since the local network manager didn't know which host it was, we decided to pick a never-used IP address as the management IP for this Firewall "A" member. The firewalls synced after that, nice!

Now we are on the THIRD TIME. It seems we do not have an IP conflict this time and, once again, there's no glitch on FSM/Web UI, but after a power maintenance, when power came back up, Firewall "A" once again did not sync to the cluster and remained in the "inactive" state. Strangely, when we connected to that member in FSM, the member status appeared as "idle" again. I am starting to think this part is just normal behavior.

We put Firewall "A" into safe mode and then rejoined it to the cluster successfully. Rebooting Firewall "A" or running Discover Member did not help: even when we got a success message in FSM, the member would sit "inactive" until we forced it into safe mode.

This coming Sunday we'll have another power maintenance; it seems to be a routine task, and we expect the cluster not to go back to normal after power comes back up.

Things I am researching right now:
1) WatchGuard Forum
I haven't found a similar issue here, which is why I'm posting my own.

2) UserMac log
We have several logs with the following pattern, which I'm clueless about:
2024-01-17 11:17:57 Secundario firewall sess_event: Session event "Add" has no "UserMac" parameter Debug
2024-01-17 11:17:57 Secundario firewall sess_event: Session event "Del" has no "UserMac" parameter

3) I've seen we are using ID 5 for the cluster
The standard is 1. I'll ask if there's another multicast MAC on the network; I don't recall why it's set to 5.

4) Strange local subnetting
The firewall management interface and local network are 10.0.0.1/23, but some local computers I access are configured on the 10.0.0.0/24 or 10.0.1.0/24 subnet. I don't recall applying this strange setting to the stations; the firewall could simply have secondary networks if they want to avoid broadcast storms but need more local IPs and don't want to deal with VLANs.

5) Cluster log messages
The first three times, I looked at the event logs around the time of the incident, but when reviewing the firewall log settings I found that cluster logs were off; I've set them to Error now.
I also tried leaving a tcpdump running over the CLI, saving packets to files for Wireshark and filtering on DHCP traffic (their DHCP server, or another one on the network, may be leasing IPs that conflict with the management IPs?), and trying to spot an IP conflict in the dump (I still have to try it out to see which pattern I should look for in Wireshark). For some reason, the local network manager has not allowed me to run this capture from a local computer, so I'll see how well it works from the firewall's perspective: at least some of the DHCP packets I want are broadcast, so I can capture them without port mirroring or a similar feature.

6) Look for cluster bugs in upcoming Fireware versions

Last but not least, I'll look through the release notes for cluster bugs in upcoming Fireware versions that could match what I'm experiencing here. I know the XTM330 is EOL, but if there is an unfixed bug like this, it will support the ongoing effort at this network to update their hardware. They are also considering opening a new office; this old cluster could be relocated there and a new one would stay at HQ.
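
For the IP-conflict check in item 5, one pattern to look for is the same IP address answering from more than one MAC address. Below is a minimal Python sketch, assuming the (IP, MAC) pairs have already been exported from the capture. The function name find_ip_conflicts and the sample MAC addresses are made up for illustration; only 10.0.0.2 comes from this setup.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so the same MAC written two ways is not a false conflict.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical observations, e.g. sender IP/MAC pairs copied from ARP replies.
sample = [
    ("10.0.0.2", "02:00:00:00:00:01"),
    ("10.0.0.2", "02:00:00:00:00:99"),
    ("10.0.0.3", "02:00:00:00:00:02"),
]
print(find_ip_conflicts(sample))
```

This only makes sense for addresses that should belong to a single device, such as a member's management IP. In Wireshark itself, the display filter arp.duplicate-address-detected should, as far as I know, show the ARP frames it has already flagged as duplicate-address use.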

Any hint, guys?

Regards,
Rafael da Costa

Comments

  • From the last event, on Feb 25th at 3pm (firewalls went down at 3:18pm) until Feb 26th at 1pm, we have 34914 records matching the pattern '"UserMac" parameter'; the firewalls were rejoined on Feb 26th at 12:06pm.

    These annoying logs continue until 12:59:58pm. I'll check whether we still have some of them right now and update this post if so; I believe we will.
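
A count like this can be reproduced from an exported log file and broken down per hour, to see whether the messages cluster around the outage. Here is a minimal Python sketch, assuming each record starts with a YYYY-MM-DD HH:MM:SS timestamp like the sample lines in the original post. The function name and the file name in the usage comment are illustrative only.

```python
import re
from collections import Counter

# Captures "YYYY-MM-DD HH" from lines that carry the UserMac message.
LINE_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}\b.*has no "UserMac" parameter'
)


def count_usermac_by_hour(lines):
    """Count matching log lines, keyed by date and hour."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Usage with a hypothetical export file:
# with open("firewall_export.log", encoding="utf-8", errors="replace") as handle:
#     for hour, total in sorted(count_usermac_by_hour(handle).items()):
#         print(hour, total)
```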

    Regards,
    Rafael da Costa

  • james.carson Moderator, WatchGuard Representative

    Hi @RafaelFerreira
    I would not suggest running these devices. You're not running 12.1.3 Update 5 (B640446) or better, which means this device is susceptible to the Cyclops Blink issue.

    See: https://techsearch.watchguard.com/KB?type=Article&SFDCID=kA16S000000SOCGSA4&lang=en_US

    12.1.3 Update 3 (B608021 - the version you're running) was last updated on 2 December 2019. That's over 4 years with no security updates.

    I would suggest that the cost savings of running these very old devices is not worth the security impact of running this old software.

    Should you need to get these devices running, I'd suggest the following:
    -Connect to the WebUI (https:// IP of firewall :8080 )
    -Once logged in, go to System Status -> Diagnostics.
    -Click to download a support log file.
    (you'll need a program that can open TGZ files, such as 7-zip, and a program that can read UNIX style text files, like Notepad++ for the next steps.)
    -Open the support file.
    -In the file, navigate to Fireware_XTM_Support.tgz\Fireware_XTM_Support.tar\support\system\system_status.txt
    -Look for the section labeled Cluster Snapshot - this may provide more information.
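
The manual steps above can also be scripted. This is a rough Python sketch, assuming the support file is a gzip-compressed tar archive and that the section heading contains the text "Cluster Snapshot". The function names and the file name in the usage comment are placeholders, not anything provided by WatchGuard.

```python
import tarfile


def read_system_status(tgz_path):
    """Return the text of system_status.txt from a support .tgz file."""
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Match on the file name only; the exact directory prefix
            # inside the archive is an assumption, so it is not hard-coded.
            if member.isfile() and member.name.endswith("system_status.txt"):
                with archive.extractfile(member) as handle:
                    return handle.read().decode("utf-8", errors="replace")
    raise FileNotFoundError("system_status.txt not found in archive")


def cluster_snapshot(status_text, heading="Cluster Snapshot"):
    """Return everything from the heading line to the end of the text."""
    lines = status_text.splitlines()
    for index, line in enumerate(lines):
        if heading in line:
            return "\n".join(lines[index:])
    return ""


# Usage with a hypothetical file name:
# print(cluster_snapshot(read_system_status("support.tgz")))
```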

    -James Carson
    WatchGuard Customer Support

  • Hi James, I appreciate your reply. The cluster status seems fine to me. If the issue arises again, I'll check this same status report again.

    Cluster Snapshot

    ------ Cluster Info ------
    cluster is Enabled
    mode is Active/Passive
    ha1_if_num=eth3
    ha2_if_num=eth4
    mgmt_if_num=eth1
    cluster_id=5
    lb_algorithm=Least-Connections
    hb_threshold=5
    progressive_vmac=1
    cluster_uptime=0
    nwMode=ROUTER MODE
    tpMode ip=0.0.0.0/0
    local logLevel=0
    hhi_enabled=1
    monitor_dev_if count = 5
    *** monitor_dev_if ***
    eth0
    eth1
    eth2
    eth5
    eth6
    *** monitor eth interface_list ***
    eth: 0 1 2 5 6
    *** monitor_la_interface_list ***
    bond:
    cfg_hash=44ae727015af82ab9e398d0ccd7c8afc44823e1f
    Member: Primario ID: Firewall "A"
    (local): Online
    Role: BACKUP_MASTER
    device_state=Active (6, 14)
    device_mode=PASSIVE
    device_model=XTM330
    ha1_ip=10.20.30.1
    ha1_netmask=255.255.255.248
    ha2_ip=10.20.25.1
    ha2_netmask=255.255.255.248
    mgmt_ip=10.0.0.2
    mgmt_netmask=255.255.254.0
    priority=1
    cluster_version=2
    software_version=12.1.3.B608021

    Cluster Dynamic Information

    Cluster is Enabled
    Mode = Active/Passive
    Member Id (self) = Firewall "A"
    Master Member Id = Firewall "B"
    Internal Device State (self) = 14
    Member Cluster Role = BACKUP_MASTER
    Flags [ hhi_cfg_on ]
    HHI is enabled
    SysTime at Clst up = Sun Feb 25 19:22:47 2024
    Current sysTime = Thu Feb 29 21:04:44 2024
    Cluster UP for = 97 hr(s): 41 mins(s): 57 sec(s)

    Monitored ETH Interfaces (1-Enabled, 0-Not Enabled):
    eth0=1 eth1=1 eth2=1 eth5=1 eth6=1
    ETH Interface Status (1-UP, 0-DOWN):
    eth0=1 eth1=1 eth2=1 eth3=1 eth4=1 eth5=1 eth6=1
    Monitored LA Interfaces (1-Enabled, 0-Not Enabled):

    LA Interface Status (1-UP, 0-DOWN):

    System Health Index (SHI) = 100
    Hardware Health Index (HHI) = 100
    Monitored Ports Health Index (MPHI) = 100
    Weighted Avg Index (WAI) = 100

    Mbr Info Sync Flag = Done

    CTD Channel(tcp connection)Status:
    --->To Member Firewall "B" is UP

    Member's info

    Member [Primario] Id = Firewall "A"
    Member Device state = 14
    Member cluster Role = BACKUP_MASTER
    Priority = 1
    Mode = PASSIVE

    Member [Secundario] Id = Firewall "B"
    Member Device state = 14
    Member cluster Role = MASTER
    Priority = 1
    Mode = ACTIVE

    FSS Registration Info On MemberID = Firewall "A"
    ModuleID = 3e, IPC Id = b4000733, FSS Reg = 1, IS Reg = 0
    ModuleID = 69, IPC Id = 6880067c, FSS Reg = 1, IS Reg = 0
    ModuleID = 38, IPC Id = 2400000, FSS Reg = 1, IS Reg = 1
    ModuleID = 48, IPC Id = 6bc0071f, FSS Reg = 1, IS Reg = 0
    ModuleID = 31, IPC Id = 630006b7, FSS Reg = 1, IS Reg = 0
    ModuleID = 2, IPC Id = e2400722, FSS Reg = 1, IS Reg = 0

    FSS Registration Info On MemberID = Firewall "B"
    ModuleID = 3e, IPC Id = b4000733, FSS Reg = 1, IS Reg = 0
    ModuleID = 69, IPC Id = 6880067c, FSS Reg = 1, IS Reg = 0
    ModuleID = 38, IPC Id = 2400000, FSS Reg = 1, IS Reg = 1
    ModuleID = 48, IPC Id = 6bc0071f, FSS Reg = 1, IS Reg = 0
    ModuleID = 31, IPC Id = 630006b7, FSS Reg = 1, IS Reg = 0
    ModuleID = 2, IPC Id = e2400722, FSS Reg = 1, IS Reg = 0

    Cluster Health

    Member Id = Firewall "A"
    Member cluster Role = 2

    System Health Index (SHI) = 100
    Hardware Health Index (HHI) = 100
    Monitored Ports Health Index (MPHI) = 100
    Weighted Avg Index (WAI) = 100

    NOTE: Failover occurs only when member's weighted avg index[WAI] is greater than master's weighted avg index[WAI]

    Cluster HA event

    Member Id (self) = Firewall "A"
    Cluster Role = BACKUP_MASTER

    Mon Feb 26 11:56:59 2024 System bootup <<<

    Mon Feb 26 14:57:45 2024 Formation: Member Firewall "A": Device has joined the cluster.Device State=14
    Mon Feb 26 11:57:54 2024 Election: cluster election event, Backup Master, rcvd. Current opState=IDLE
    Mon Feb 26 11:57:55 2024 Role: Member Firewall "A" becomes BACKUP SYNC. (devSt=14)
    Mon Feb 26 11:57:55 2024 Role: Member Firewall "A" becomes BACKUP. (devSt=14)

    Fri Feb 4 13:41:53 2022 System bootup <<<

    Cluster Manager Status

    operation: none
    state: idle
    protocol: 0
    state: idle
    result: ok
    start:
    end:
    msg:
    member:
    name: Primario
    member_id: Firewall "A"
    role: backup
    state: idle
    owner: no
    member:
    name: Secundario
    member_id: Firewall "B"
    role: master
    state: idle
    owner: yes

    Cluster Load Balance


    Connection state

    echo 0 > conn_stat to dump the stat for default clb policy
    echo 1 > conn_stat to dump the stat for sslvpn clb policy
    default clb policy: algorithm = 0, rr_next = 0

    member_id          conn_cnt   flags     status kxp_handle total_cnt
    ================== ========== ========= ====== ========== =========
    Firewall "B"       0          2         1      00000000   0
    Firewall "A"       0          5         0      00000001   0

    SA state

    sa load balance algorithm = 0, rr_next = 0

    member_id          sa_cnt     flags     status kxp_handle
    ================== ========== ========= ====== ==========
    Firewall "B"       00000000   2         1      00000000
    Firewall "A"       00000000   5         0      00000001

    Destination Policy IP

    echo 0 > dstPcy to dump the complete table, or
    echo ip > dstPcy to dump an entry

    dstPcyIp           member_id
    ================== ================

    Regards,
    Rafael da Costa
