HA VRRP address stops responding

Hi,

I am in the process of setting up a FireboxV cluster on VMware.

VMware ESXi, 7.0.3, 19482531
Fireware 12.7.2 U2

Promiscuous mode = enabled
MAC address changes = enabled
Forged transmits = enabled (Accept)
GARP allowed

The MAC addresses on the members are:

HA1:
00:50:56:00:00:00
00:50:56:00:00:01
00:50:56:00:00:02
00:50:56:00:00:03

HA2:
00:50:56:22:22:20
00:50:56:22:22:21
00:50:56:22:22:22
00:50:56:22:22:23

Both use the VMXNET3 adapter.

Everything is set up and the cluster (ID 50) is running with no problems reported. External access to the public IP works no matter which member is master. The cluster has been running for 2 weeks.

Today we powered up a VM behind the cluster, but we could not get traffic routed through the cluster. The cluster VRRP IP address did not respond to ping, but both members' management addresses did respond, and the Firebox ARP table listed the VM's MAC/IP address. Management access from external to the public IP was working.

We are renting a virtual datacenter, and my provider asked me to turn off one of the members. As soon as I turned off the cluster master, the internal VRRP address started to respond to ping and traffic was routed through the new master without problems.

Afterwards I tried every combination of master/slave, power on/off, and failover, and it all works. I am clueless.

Of course I will have to test further, but does anyone have an idea why the cluster stopped accepting traffic to the internal VRRP address?

/Robert

Comments

  • I have been searching, and the above would indicate that promiscuous mode had not been enabled on the vSwitches before the Fireboxes were booted up. I can see that Fireware checks for promiscuous mode:

    Checking VMAC stuff for MGMT IF vlan1
    PROMISC mode is enable.
    Check ok, 00:50:56:00:00:01 is in maddr list of vlan1
    Done check VMAC stuff. [Passed]

    This is my untagged VLAN, which also holds the VRRP address.

    Let's say promiscuous mode was disabled when the Fireboxes were booted up. That would keep the virtual MAC addresses from working. Then we powered off the master device, the standby became the new master, and the VRRP IP address started to respond to traffic.

    If this is true, it means Fireware switches from using a virtual MAC address when both cluster members are up and running to using the actual MAC address assigned by the vSwitch, and thus is not using VRRP.

    If not, I do not understand why this happened.

    How does Fireware assign a MAC address when it is running as a single cluster member vs. with both members active?

  • I can answer my own question: whichever way the cluster is running, the VMAC is used. In my case it is 00:00:5E:00:01:33.
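
For anyone decoding a VMAC like the one above: standard IPv4 VRRP builds the virtual MAC as 00-00-5E-00-01-{VRID}, so the last octet is the virtual router ID. A minimal Python sketch of that mapping follows. The function names are made up for illustration. The VMAC above decodes to 51, while the cluster ID in the first post is 50; how Fireware derives the VRID from the cluster ID and interface is not covered in this thread, so the sketch makes no assumption about it.

```python
# Sketch only: standard IPv4 VRRP virtual MAC layout, 00-00-5E-00-01-{VRID}.
# Function names are hypothetical; nothing here is Fireware-specific.
VRRP_IPV4_PREFIX = "00:00:5e:00:01"


def vrrp_virtual_mac(vrid: int) -> str:
    """Return the IPv4 VRRP virtual MAC for a virtual router ID (1..255)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1..255")
    return f"{VRRP_IPV4_PREFIX}:{vrid:02x}"


def vrid_from_mac(mac: str) -> int:
    """Recover the VRID from an IPv4 VRRP virtual MAC, or raise ValueError."""
    normalized = mac.strip().lower().replace("-", ":")
    prefix, _, last_octet = normalized.rpartition(":")
    if prefix != VRRP_IPV4_PREFIX:
        raise ValueError(f"{mac} is not an IPv4 VRRP virtual MAC")
    return int(last_octet, 16)


if __name__ == "__main__":
    print(vrid_from_mac("00:00:5E:00:01:33"))
    print(vrrp_virtual_mac(51))
```

A MAC from the VMware range, such as the interface MACs listed in the first post, makes `vrid_from_mac` raise `ValueError`, which is a quick way to tell a member's own MAC from the VMAC in a capture.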

  • It happened again today after 2 days of uptime. A failover solved it. I know a GARP is sent out on failover, but since the vSwitches do not know the actual MAC address in use, only the MACs of the interfaces connected to them, it should not matter which MAC address is used as long as promiscuous mode is enabled.

    At the same time, I can ping the VRRP address from the firewall itself while the issue is occurring. Very odd.

    A case has been opened: 01705766.

  • tcpdump shows the backup master replying to ARP requests for the primary VRRP address, which it MUST NOT do while in backup state (per the VRRP RFC).

    This would explain why I am seeing network issues.

    So am I the first customer to run HA on VMware?
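
The check described above can be sketched in Python for anyone reviewing their own captures. This assumes the ARP replies have already been pulled out of a capture by hand into records; the record layout, the function name, and all sample IP addresses are hypothetical (the IPs come from the 192.0.2.0/24 documentation range, and HA1/HA2 are the member labels from the first post). It only encodes the rule that a member in backup state should not answer ARP for a virtual router address.

```python
# Sketch only: flag ARP replies for a virtual IP sent by a member in backup state.
# Record layout, names, and sample data are hypothetical, not taken from a real capture.
from typing import Iterable, List, NamedTuple, Set


class ArpReply(NamedTuple):
    member: str      # cluster member whose interface sent the reply
    vrrp_state: str  # "master" or "backup" at capture time
    sender_ip: str   # IP address the reply answers for


def backup_violations(replies: Iterable[ArpReply], virtual_ips: Set[str]) -> List[ArpReply]:
    """Return replies where a backup member answered for a virtual IP."""
    return [
        reply
        for reply in replies
        if reply.vrrp_state == "backup" and reply.sender_ip in virtual_ips
    ]


if __name__ == "__main__":
    virtual_ips = {"192.0.2.1"}  # hypothetical cluster VRRP address
    observed = [
        ArpReply("HA1", "master", "192.0.2.1"),   # expected: master answers for the VIP
        ArpReply("HA2", "backup", "192.0.2.12"),  # fine: backup answers for its own address
        ArpReply("HA2", "backup", "192.0.2.1"),   # violation: backup answers for the VIP
    ]
    for reply in backup_violations(observed, virtual_ips):
        print(f"{reply.member} ({reply.vrrp_state}) replied for {reply.sender_ip}")
```

The records are keyed by member rather than by source MAC because this thread does not say which source MAC the backup used in its replies.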

  • To everyone else following this:
    FBX-23368: FireboxV FireCluster backup master responding to ARP requests with VRRP address

  • edited June 2022

    Did you get a fix for this problem? Can't find the article with this number.

  • Hi @Dantheman

    "Did you get a fix for this problem? Can't find the article with this number."

    No, it has only been acknowledged as a bug. I guess it is low priority and no other customers are running HA in a VMware environment.

    /Robert

  • james.carson Moderator, WatchGuard Representative

    Hi @rv@kaufmann.dk @Dantheman
    FBX-23368 is currently medium (the normal priority for a bug). Bugs of this nature generally take a bit longer to resolve, as quite a bit of research has to go into whether potential fixes will cause issues elsewhere in the system.

    @Dantheman If you're running into this issue, I'd suggest creating a case and mentioning FBX-23368 somewhere in the case -- that'll allow us to automatically notify you of any developments on the issue and/or offer any pre-release fixes for the issue if they're available. It also allows our management team to see how many customers are impacted by each issue.

    -James Carson
    WatchGuard Customer Support

  • @james.carson

    Can you see whether there is any new progress on bug FBX-23368?
    I am still only receiving notifications that it is waiting for engineering.

    /Robert

  • james.carson Moderator, WatchGuard Representative

    Hi @Robert_Vilhelmsen
    That specific bug is currently unresolved. It's currently working through the development team's queue. There is not an ETA for this bug at this time.

    -James Carson
    WatchGuard Customer Support

  • Thank you. I guess I'm the only customer trying to run HA in a VMware environment, then.

  • james.carson Moderator, WatchGuard Representative

    Hi @Robert_Vilhelmsen
    I do know that clusters in VMware are rare, but I don't have figures on how many people use them.

    -James Carson
    WatchGuard Customer Support
