
Why is the cluster going crazy?

We are using an HA active/passive cluster (2x M270).
Each cluster member works properly for about 2-3 hours, and then a failover occurs.
After another 2-3 hours, a failover occurs again.
The machines are on protected power (UPS), and the temperature is fine, about 18 °C.

Cause of failover: heartbeat lost.
What should I do? Recreate the cluster?
I do not believe that both machines are damaged in the same way.
I have already replaced the cluster cables.

I have attached:

Cluster status (power is OK, there is no reason for a restart, and nobody has restarted the machines for many days):

Master 801xxxxxxxxxX Online 3h 54m 1s 1% 57%
Backup 801xxxxxxxxxX Online 1h 18m 44s 0% 28%

2021-10-20 10:12:58 AM Failover Heartbeat Lost N/A
2021-10-20 11:47:33 AM Failover Heartbeat Lost N/A
2021-10-20 12:50:16 PM Failover Heartbeat Lost N/A
2021-10-20 01:30:37 PM Failover Unknown N/A
2021-10-20 01:40:05 PM Failover Heartbeat Lost N/A
2021-10-20 02:57:12 PM Failover Heartbeat Lost N/A
2021-10-20 03:53:37 PM Failover Heartbeat Lost N/A
2021-10-20 05:27:30 PM Failover Heartbeat Lost N/A
2021-10-20 06:34:39 PM Failover Heartbeat Lost N/A
2021-10-20 08:05:52 PM Failover Heartbeat Lost N/A
2021-10-21 12:10:56 AM Failover Heartbeat Lost N/A
2021-10-21 12:43:09 AM Failover Heartbeat Lost N/A
2021-10-21 01:42:01 AM Failover Heartbeat Lost N/A
2021-10-21 02:13:10 AM Failover Heartbeat Lost N/A
2021-10-21 03:14:17 AM Failover Heartbeat Lost N/A
2021-10-21 05:22:16 AM Failover Heartbeat Lost N/A
2021-10-21 07:55:22 AM Failover Heartbeat Lost N/A
2021-10-21 09:57:25 AM Failover Heartbeat Lost N/A
2021-10-21 11:58:53 AM Failover Heartbeat Lost N/A
2021-10-21 01:33:22 PM Failover Heartbeat Lost N/A
2021-10-21 02:07:00 PM Failover Heartbeat Lost N/A
2021-10-21 03:07:59 PM Failover Heartbeat Lost N/A
2021-10-21 03:38:50 PM Failover Heartbeat Lost N/A
2021-10-21 04:40:16 PM Failover Heartbeat Lost N/A
2021-10-21 05:10:24 PM Failover Heartbeat Lost N/A
2021-10-21 06:11:28 PM Failover Heartbeat Lost N/A
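
To check how regular the failovers really are, the gaps between consecutive log entries can be computed. Below is a minimal Python sketch, assuming the log lines keep exactly the format shown above (date, 12-hour time, AM/PM, event, cause, trailing N/A). The function names `parse_events` and `gaps_in_minutes` are made up for illustration, and only the first five log entries are included; the rest can be pasted into `LOG`.

```python
from datetime import datetime

# First five entries copied from the failover log above; paste the rest here.
LOG = """\
2021-10-20 10:12:58 AM Failover Heartbeat Lost N/A
2021-10-20 11:47:33 AM Failover Heartbeat Lost N/A
2021-10-20 12:50:16 PM Failover Heartbeat Lost N/A
2021-10-20 01:30:37 PM Failover Unknown N/A
2021-10-20 01:40:05 PM Failover Heartbeat Lost N/A
"""


def parse_events(text):
    """Return a list of (timestamp, cause) tuples, one per well-formed line."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or truncated lines
        # Date, time and AM/PM marker form the timestamp.
        stamp = datetime.strptime(" ".join(parts[:3]), "%Y-%m-%d %I:%M:%S %p")
        # Everything between "Failover" and the trailing "N/A" is the cause.
        cause = " ".join(parts[4:-1])
        events.append((stamp, cause))
    return events


def gaps_in_minutes(events):
    """Return the time between consecutive events, in minutes."""
    stamps = [stamp for stamp, _ in events]
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    events = parse_events(LOG)
    for (stamp, cause), gap in zip(events[1:], gaps_in_minutes(events)):
        print(f"{stamp}  {cause:<15} {gap:6.1f} min since previous failover")
```

If the gaps turn out to cluster around a fixed interval, that would point toward something periodic (a scheduled task or a timer) rather than random hardware faults, though this is only a rough diagnostic aid.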

Comments

  • If the M270 devices are connected directly to each other, I would suspect a hardware failure on one of the devices, maybe a NIC issue.

    Have you checked the NIC status on each device?

    Do you have other equipment running on the same LAN with the same VRRP address?

  • james.carson (Moderator, WatchGuard Representative)

    If the boxes are losing heartbeat, it's likely due to a process crash or the devices rebooting for some reason. I'd suggest opening a support case so that our team can dig into the boxes in more depth and see what might be causing the problem.

    -James Carson
    WatchGuard Customer Support
