Why is the cluster going crazy?
We use an HA active/passive cluster (2x M270).
Each member of the cluster works properly for about 2-3 hours, and then a failover occurs.
After another 2-3 hours, a failover occurs again.
The machines are on protected power (UPS), and the temperature is fine at about 18 degrees Celsius.
Cause of failover: heartbeat lost
What to do? Recreate the cluster?
I do not believe that both machines were damaged in the same way.
I have already replaced the cluster cables.
I attached:
Cluster status (power is OK, there is no reason for the machines to restart, and nobody restarts them for days at a time):
Master 801xxxxxxxxxX Online 3h 54m 1s 1% 57%
Backup 801xxxxxxxxxX Online 1h 18m 44s 0% 28%
2021-10-20 10:12:58 AM Failover Heartbeat Lost N/A
2021-10-20 11:47:33 AM Failover Heartbeat Lost N/A
2021-10-20 12:50:16 PM Failover Heartbeat Lost N/A
2021-10-20 01:30:37 PM Failover Unknown N/A
2021-10-20 01:40:05 PM Failover Heartbeat Lost N/A
2021-10-20 02:57:12 PM Failover Heartbeat Lost N/A
2021-10-20 03:53:37 PM Failover Heartbeat Lost N/A
2021-10-20 05:27:30 PM Failover Heartbeat Lost N/A
2021-10-20 06:34:39 PM Failover Heartbeat Lost N/A
2021-10-20 08:05:52 PM Failover Heartbeat Lost N/A
2021-10-21 12:10:56 AM Failover Heartbeat Lost N/A
2021-10-21 12:43:09 AM Failover Heartbeat Lost N/A
2021-10-21 01:42:01 AM Failover Heartbeat Lost N/A
2021-10-21 02:13:10 AM Failover Heartbeat Lost N/A
2021-10-21 03:14:17 AM Failover Heartbeat Lost N/A
2021-10-21 05:22:16 AM Failover Heartbeat Lost N/A
2021-10-21 07:55:22 AM Failover Heartbeat Lost N/A
2021-10-21 09:57:25 AM Failover Heartbeat Lost N/A
2021-10-21 11:58:53 AM Failover Heartbeat Lost N/A
2021-10-21 01:33:22 PM Failover Heartbeat Lost N/A
2021-10-21 02:07:00 PM Failover Heartbeat Lost N/A
2021-10-21 03:07:59 PM Failover Heartbeat Lost N/A
2021-10-21 03:38:50 PM Failover Heartbeat Lost N/A
2021-10-21 04:40:16 PM Failover Heartbeat Lost N/A
2021-10-21 05:10:24 PM Failover Heartbeat Lost N/A
2021-10-21 06:11:28 PM Failover Heartbeat Lost N/A
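To check how regular the failovers really are, the time between consecutive events can be computed from the log. The following is only a rough sketch in Python. It assumes the log is pasted as plain text in exactly the format shown above, and the names `parse_events`, `intervals` and `short_gaps` are made up for this example; they are not part of any WatchGuard tool. Only the first five log lines are included as sample data.

```python
from datetime import datetime, timedelta

# First five entries copied from the failover log above (sample only).
SAMPLE = """\
2021-10-20 10:12:58 AM Failover Heartbeat Lost N/A
2021-10-20 11:47:33 AM Failover Heartbeat Lost N/A
2021-10-20 12:50:16 PM Failover Heartbeat Lost N/A
2021-10-20 01:30:37 PM Failover Unknown N/A
2021-10-20 01:40:05 PM Failover Heartbeat Lost N/A
"""


def parse_events(text):
    """Return a list of (timestamp, reason) tuples, one per log line."""
    events = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or incomplete lines
        # Fields: date, time, AM/PM, event type, reason (1+ words), last column.
        stamp = datetime.strptime(" ".join(parts[:3]), "%Y-%m-%d %I:%M:%S %p")
        reason = " ".join(parts[4:-1])
        events.append((stamp, reason))
    return events


def intervals(events):
    """Return the time elapsed between each pair of consecutive events."""
    return [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]


def short_gaps(events, threshold=timedelta(minutes=30)):
    """Return events that happened sooner than `threshold` after the previous one."""
    return [
        later
        for earlier, later in zip(events, events[1:])
        if later[0] - earlier[0] < threshold
    ]


if __name__ == "__main__":
    events = parse_events(SAMPLE)
    for (stamp, reason), gap in zip(events[1:], intervals(events)):
        print(f"{stamp}  {reason:<15} {gap} after previous failover")
```

For the five sample lines, the gaps range from about 9 minutes to about 1 hour 35 minutes, so at least in this part of the log the failovers come more often than every 2-3 hours.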
Comments
If the M270 devices are connected directly to each other, I would suspect a hardware failure on one of the devices, maybe a NIC issue.
Have you checked the NIC status on each device?
Do you have other equipment running on the same LAN with the same VRRP address?
If the boxes are losing heartbeat, it's likely due to a process crash or the devices rebooting for some reason. I'd suggest opening a case so that our team can dig into the boxes in more depth and see what might be causing the problem.
-James Carson
WatchGuard Customer Support