UDP Connection Issues upon route change or link failover

Hi All,

We have an issue with UDP connections not being torn down after a link failure or routing change on a firewall.

We identified this issue some time ago. We have an IP-based telephony system, and the handsets use port 5588/udp to communicate back to the PABX (hosted in a centralized DC). This system was implemented 8 years ago.

We have around 15 WG devices, from T35s through to M470s. They all have the same issue and have had it for as long as we've been using them (9+ years).

To give some further detail: when there is a routing change on an interface on the WatchGuard itself (either at the edge or the DC), our phones go offline because the WG won't tear down the UDP connection and re-create it using the changed link. This happens consistently for our phone system, both for the proprietary PABX comms with the handsets and for SIP comms (5060/udp).

It has really become an issue since we recently redesigned our network: we are moving off expensive MPLS to high-bandwidth fibre-based internet connections, with BOVPN Virtual Interfaces and BGP to manage routing. We initially wanted 2 active BOVPN VIFs at each site, terminating in 2 different locations. Previously we had tried using BOVPN VIFs as failover for the MPLS.

Unfortunately, we have had to forgo the automatic BGP routing failover because the phones go offline if the route changes.

When this does happen, the only thing we can do is turn off the handsets or disable our SIP comms completely, wait 10 minutes, and then turn the handsets back on or re-enable comms. A fresh UDP connection is then established down the "new" path and everything works again.

It's quite unfortunate, as it's not really an issue until it is, and it has forced us to rethink how we go about our network redundancy.

I just found this from Cisco (we don't use Cisco), and it appears to almost replicate the issue we have with the WatchGuards - https://www.cisco.com/c/en/us/support/docs/security/asa-5500-x-series-next-generation-firewalls/113592-udp-traffic-fails-00.html

TL;DR

Essentially, it looks like when the routing table changes for egress UDP connections, the WG won't tear down the existing connection. Because SIP and other telephony devices are always sending keepalives, the connection never idles long enough to time out and be re-established, even though the route has changed.
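
To illustrate the mechanism I think is at play, here is a minimal Python sketch of a stateful flow table that caches the egress path per flow and only does a fresh route lookup once the flow has idled out. This is a toy model and not WatchGuard's actual implementation; the 60-second idle timeout, the interface names, the 30-second keepalive interval, and the addresses are all made up for illustration.

```python
from dataclasses import dataclass

UDP_IDLE_TIMEOUT = 60  # seconds; hypothetical value, not a WatchGuard default


@dataclass
class FlowEntry:
    egress: str       # path chosen when the entry was created
    last_seen: float  # time of the most recent packet on this flow


class FlowTable:
    """Toy stateful firewall that pins each flow to the path chosen at creation."""

    def __init__(self, route: str):
        self.route = route  # current best path according to the routing table
        self.flows = {}     # flow key (5-tuple) -> FlowEntry

    def forward(self, flow_key: tuple, now: float) -> str:
        entry = self.flows.get(flow_key)
        if entry is not None and now - entry.last_seen < UDP_IDLE_TIMEOUT:
            entry.last_seen = now  # every keepalive resets the idle timer
            return entry.egress    # cached path is reused; no new route lookup
        # No entry, or it idled out: look up the route again and start fresh.
        self.flows[flow_key] = FlowEntry(egress=self.route, last_seen=now)
        return self.route


def simulate() -> list:
    fw = FlowTable(route="bovpn-vif-1")
    phone = ("10.0.1.50", 5588, "10.9.0.10", 5588, "udp")  # made-up addresses
    log = []
    # Keepalives every 30 s; the route changes once t reaches 100 s.
    for t in range(0, 300, 30):
        if t >= 100:
            fw.route = "bovpn-vif-2"
        log.append((t, fw.forward(phone, t)))
    # The phone is switched off for 10 minutes, then sends again.
    t = 270 + 600
    log.append((t, fw.forward(phone, t)))
    return log


if __name__ == "__main__":
    for t, path in simulate():
        print(f"t={t:>4}s  egress={path}")
```

In this model the flow stays on the old path for as long as keepalives arrive more often than the idle timeout, and only moves to the new path after the phone has been silent for longer than the timeout, which matches what we see when we power the handsets off and wait.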

I haven't raised a support ticket, and maybe I should, but I thought I'd post here to see if anyone else has experienced the same issue. I'm surprised it has never been addressed in the 9 years we've had WG devices, so I suspect not many have come across it.

Comments

  • james.carson Moderator, WatchGuard Representative

    Hi @Ben_G
    Thanks for reaching out.

    First, if you'd like to verify any of this, I would suggest opening a support case. At minimum, it would get a bug or feature request logged that is specific to your issue.

    With that said, I'd suggest looking into the sticky connections settings here:
    https://www.watchguard.com/help/docs/help-center/en-US/Content/en-US/Fireware/policies/sticky_connection_add_c.html
    The behavior of this feature sounds very similar to what you're describing, and turning it down may get it to snap over to the other connection more quickly.

    Note that an interface failover (where the firewall determines the interface is down, or the link actually goes down) should result in a quick cutover. The firewall can't force the phone to re-register via the new connection, though, so if the SIP keep-alive on the phone is set very high (often 600 seconds/10 minutes, but I've seen as high as 3600 seconds/1 hr), nothing the firewall does will make the phone re-register to the server using the new IP.

    -James Carson
    WatchGuard Customer Support

  • Hi James, Thanks for the post.

    I haven't tested the sticky connection policy settings yet; when I get a chance, I will.

    Regarding interface down, that is correct: if the interface on the FW goes offline then it seems to work OK, but I haven't tested this fully.

    The issue is that I can only reproduce this in a production environment. We don't have a lab for the phone system so I can't easily repeat the issue on-demand without impacting the wider business.

    Kind regards,
    Ben
