2 minute network outage most days (1 per day)
XTM 330 (12.1.B548280). I know it's old and not updated; they wouldn't pay for updates any longer. Most days the Internet is unavailable for around 2 minutes. I know this because I run a PowerShell loop that pings 8.8.8.8 and 8.8.4.4 and records the results with a timestamp, where I see:
Pinging 8.8.4.4 with 32 bytes of data:
Reply from 8.8.4.4: bytes=32 time=2ms TTL=116
Request timed out. .....
...and it stays down for about 2 minutes, then comes back and stays up until the next day. Last week it happened Monday through Wednesday at 09:01 AM and Thursday at 04:26.
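For anyone wanting to turn a timestamped ping log like this into a list of outage windows to hand to the ISP, here is a minimal Python sketch. It assumes the log has already been parsed into (timestamp, replied) pairs; the function name, the 30-second threshold, and the sample data are made up for illustration and are not from the PowerShell loop described above.

```python
from datetime import datetime, timedelta

def outage_windows(samples, min_length=timedelta(seconds=30)):
    """Return (start, end) pairs for runs of failed pings lasting at least min_length.

    samples: (datetime, bool) pairs in time order; the bool is True when the
    ping got a reply. start is the first failed sample of a run and end is the
    first successful sample after it. A run still open at the end of the log
    is ignored.
    """
    windows = []
    start = None
    for when, ok in samples:
        if not ok and start is None:
            start = when                      # a run of failures begins
        elif ok and start is not None:
            if when - start >= min_length:    # skip single dropped packets
                windows.append((start, when))
            start = None
    return windows

# Hypothetical data: one ping every 30 seconds, failing for two minutes.
t0 = datetime(2021, 3, 1, 9, 0, 0)
replies = [True, True, False, False, False, False, True, True]
log = [(t0 + timedelta(seconds=30 * i), ok) for i, ok in enumerate(replies)]
for start, end in outage_windows(log):
    print(start, "to", end, "lasting", end - start)
```

The threshold keeps an occasional lost packet from being reported as an outage; only sustained failures make the list.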
A month ago the service provider told me their ONT was rebooting at the times I provided. They replaced their equipment and the problem disappeared for a couple of weeks, but then came back. They now claim my firewall is the problem, so I am trying to figure out a way to prove they are still having issues. My XTM 330 shows online, and my logs show the external interface receives no connections during the time I see the outage. Here is the question: how can I prove the problem is or isn't related to the ONT or the XTM 330? I currently have five offices which all need Fireboxes, but this customer is only willing if I can prove this is not related to the XTM 330.
Thank You,
David
Comments
The only way that I can think of is to have a switch between your ISP device and your firewall external interface, have a switch port mirroring what is going to/from the ISP interface and have a laptop recording the traffic.
That way you are recording the traffic outside the firewall.
So if incoming packets stop from the ISP, then it is their cause.
You can also ask the ISP how they determined that your firewall is the cause, and whether they have any evidence beyond their opinion that their equipment is fine.
Thanks Bruce. Yeah, I started to put that in place over the weekend, but the computer wasn't getting an IP address, and I didn't realize the computer could still sniff packets without an IP address assigned.
I believe that the switch should send packets to the laptop regardless of the laptop's IP address.
Give it a try.
Wireshark captures in promiscuous mode by default, which should be all that you need.
You can use the TCPDump function on the firewall to get a packet capture of what's going on on the external interface. It works best if you have WSM (the WebUI's session length for this is limited.)
If you don't have it already, you can get WSM here:
https://cdn.watchguard.com/SoftwareCenter/Files/WSM/12_6_4/WSM12_6_4.exe
This will only work if you're connected to the firewall, so you'll need to be behind the firewall, as you'll lose connection if you're external to it.
-First, start a ping to something. I'd suggest an IP address that you're not pinging anywhere else so it'll be easy to find.
-Open WSM and log into your firewall using the status user by going file -> connect to device.
-Once you're logged into your firewall, right click on it and select Firebox System Manager.
-Once FSM opens, go to Tools -> Diagnostic Tasks.
-In the Network tab, in the Task drop down menu, choose TCP DUMP.
-Tick the advanced options checkbox at the bottom of the window.
-In arguments type in "-nei eth0" without the quotes.
-Tick the "stream data to file" checkbox, and choose a place to save the file.
-Click RUN TASK when the issue starts.
-Click STOP TASK when you have enough data (I'd suggest the whole two minutes if you have time/space to do this.) This may take a moment to stop, only click it once.
You can then use Wireshark (wireshark.org) to look at what was being sent/received on the external interface. A filter like "ip.addr==4.2.2.1" in Wireshark would filter for the IP 4.2.2.1. Use the IP you were pinging.
If you can see that traffic leaving the firewall, it's very likely not the firewall causing the issue. You could also potentially see something like the firewall ARPing for the gateway -- if the gateway is not responding, that also points at the ISP device.
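The reasoning in the paragraph above can be sketched as a small decision function. This is only a rough illustration in Python: it assumes you have already gone through the capture in Wireshark and noted which kinds of packets appeared on the external interface during the outage. The event names and result codes are hypothetical labels, not anything produced by the firewall or Wireshark.

```python
from collections import Counter

def assess_capture(events):
    """Roughly classify an external-interface capture taken during an outage.

    events: iterable of labels noted from the capture, each one of
    "echo_request", "echo_reply", "arp_request", "arp_reply".
    """
    seen = Counter(events)
    if seen["echo_request"] == 0:
        # The test ping never made it out of the external interface.
        return "not_leaving"
    if seen["echo_reply"] > 0:
        # Replies came back, so the path worked for this capture.
        return "path_up"
    if seen["arp_request"] > 0 and seen["arp_reply"] == 0:
        # Firewall keeps asking for the gateway and gets no answer.
        return "gateway_silent"
    # Pings leave, nothing returns: very likely not the firewall.
    return "no_replies"

print(assess_capture(["echo_request", "echo_request", "arp_request"]))
```

A result of "gateway_silent" or "no_replies" matches the cases described above that point away from the firewall, while "not_leaving" means the firewall or the LAN behind it deserves a closer look.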
As a side note, I would suggest pushing back on the ISP to provide details (as in how specifically they're determining that this is a Firewall/Router issue.) Any information they might have can help you troubleshoot this, and if they don't have any I'd suggest you get it escalated on their side until you can find out that information.
-James Carson
WatchGuard Customer Support
Thank You James and Bruce!
James, if I understood your great instructions, this approach requires on-site access as the error occurs. If so, this is not realistic, as the problem is intermittent, occurring at different times of the day, and sometimes skips a day or two. Curiously, when I call ISP support, the problem always goes away for multiple days, up to a few weeks. Using the log manager connected to the XTM 330, I can see the exact time the external interface traffic stops: 30 or so internal->external messages (HTTP requests and proxy HTTPS requests) in the seconds before, and then nothing for 1 minute 45 seconds.
ProxyHTTPSReq 2021-03-01 14:13:15 192.168.10.19 1-Trusted xxx.xxx.xxx.xxx 443 0-External https/tcp HTTPS-proxy-00
ProxyHTTPSReq 2021-03-01 14:13:15 192.168.10.240 1-Trusted xxx.xxx.xxx.xxx 443 0-External https/tcp HTTPS-proxy-00
HTTP Request 2021-03-01 14:14:58 192.168.10.124 1-Trusted www.msftconnecttest.com 80 0-External http/tcp HTTP-proxy-00
Deny 2021-03-01 14:14:59 xxx.xxx.xxx.xxx 0-External xxx.xxx.xxx.xxx 7600 Firebox 7600/tcp Unhandled-External-Packet-00
If you notice, during the interruption above there are no external packets "denied" at the external interface, which looks to me like they aren't coming through the ONT. I don't think we normally see 1 minute go by without some denied packet.
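Finding the quiet period in an exported log does not have to be done by eye. Below is a minimal Python sketch that pulls the timestamps out of log lines like the excerpt above and reports the longest silence. It assumes each line contains a date and time in the form shown in the excerpt; the function name is made up, and the sample lines are copied from the excerpt with the addresses still masked.

```python
import re
from datetime import datetime

# Date and time as they appear in the excerpt; tolerate a missing space between them.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s*(\d{2}:\d{2}:\d{2})")

def largest_gap(lines):
    """Return (before, after) datetimes around the longest silence, or None."""
    stamps = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            text = match.group(1) + " " + match.group(2)
            stamps.append(datetime.strptime(text, "%Y-%m-%d %H:%M:%S"))
    stamps.sort()
    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        if best is None or later - earlier > best[1] - best[0]:
            best = (earlier, later)
    return best

excerpt = [
    "ProxyHTTPSReq 2021-03-01 14:13:15 192.168.10.19 1-Trusted xxx.xxx.xxx.xxx 443 0-External https/tcp HTTPS-proxy-00",
    "ProxyHTTPSReq 2021-03-01 14:13:15 192.168.10.240 1-Trusted xxx.xxx.xxx.xxx 443 0-External https/tcp HTTPS-proxy-00",
    "HTTP Request 2021-03-01 14:14:58 192.168.10.124 1-Trusted www.msftconnecttest.com 80 0-External http/tcp HTTP-proxy-00",
    "Deny 2021-03-01 14:14:59 xxx.xxx.xxx.xxx 0-External xxx.xxx.xxx.xxx 7600 Firebox 7600/tcp Unhandled-External-Packet-00",
]
before, after = largest_gap(excerpt)
print("quiet from", before, "to", after, "=", (after - before).total_seconds(), "seconds")
```

Run against a full day of logs, the same function would show whether the silence lines up with the ping failures recorded on the workstation.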
Thank You James and Bruce,
David
Pardon me if I missed you already doing this, but I suggest running more scripts at the same time you run a PowerShell loop recording results and timestamp while pinging 8.8.8.8 and 8.8.4.4. Run one to ping the LAN IP of the 330, one to ping the WAN IP, one to ping the ISP's gateway, and one to ping the ISP's DNS servers (which should be closer to your 330 than Google's DNS).
Ideally, you could run the pings to their DNS and gateway, plus your WAN, from an external computer, also. Set the 330 to trust your home computer's IP and run the scripts from there as well as from behind the 330. Make sure that the computers running the scripts have the exact same time.
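Once several ping loops are running side by side, the useful question at the moment of an outage is which target stopped answering first along the path. A minimal Python sketch of that comparison follows. It assumes the per-target results for one moment have been collected into a dictionary; the hop names and their ordering are hypothetical labels for the targets listed above, and the interpretation in the comments is only a rough guide.

```python
# Targets ordered from nearest to the test PC to farthest away.
HOPS = ["firebox_lan", "firebox_wan", "isp_gateway", "isp_dns", "public_dns"]

def first_failing_hop(results):
    """Return the nearest hop that did not answer, or None if all answered.

    results: dict mapping each name in HOPS to True (replied) or False.
    """
    for hop in HOPS:
        if not results[hop]:
            return hop
    return None

# Rough reading of the answer:
#   firebox_lan  -> trouble between the PC and the firewall (switch, cabling)
#   firebox_wan  -> look at the firewall itself
#   isp_gateway  -> link between the firewall and the ISP device
#   isp_dns      -> inside the ISP's network
#   public_dns   -> beyond the ISP
snapshot = {
    "firebox_lan": True,
    "firebox_wan": True,
    "isp_gateway": False,
    "isp_dns": False,
    "public_dns": False,
}
print(first_failing_hop(snapshot))
```

This only works if the clocks on the machines running the loops agree, which is why matching time on every test computer matters.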
Gregg Hill
@funkywinkerbean
If it's random and you can't be on site, you'll want to use a port mirror and Wireshark to catch it. Trying to run a packet capture via the WSM tool is going to chew up a lot of resources on the firewall for an extended period, and also possibly fill whatever storage device you're using.
I believe Wireshark can be set to keep a rolling log of an hour or so.
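One way to get a rolling capture is dumpcap, the capture tool that ships with Wireshark, using its ring buffer options. The Python sketch below only builds the command line so the numbers are easy to adjust; it does not start a capture. The interface name "Ethernet", the output file name, and the ten-minute file size are assumptions to replace with your own values.

```python
def ring_buffer_command(interface, out_path, seconds_per_file=600, files_to_keep=6):
    """Build a dumpcap command that keeps roughly the most recent capture files.

    -b duration:N switches to a new file every N seconds and -b files:N
    wraps around after N files, so older files are overwritten.
    """
    return [
        "dumpcap",
        "-i", interface,
        "-w", out_path,
        "-b", f"duration:{seconds_per_file}",
        "-b", f"files:{files_to_keep}",
    ]

# Hypothetical interface and file name; 6 files of 10 minutes is about an hour.
command = ring_buffer_command("Ethernet", "mirror.pcapng")
print(" ".join(command))
```

Passing the list to subprocess.run would start the capture on the mirror-port laptop. Keeping the files short makes it easier to open just the one covering the outage.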
-James Carson
WatchGuard Customer Support
Thank you. Each of you offered very good ideas. Over this weekend, I'm going to set up Wireshark on a computer connected to a switch between the ONT and the XTM 330.
Update: On a computer on the secure (trusted) network of the XTM 330, I'm running a PowerShell loop that logs a timestamp, the result of pinging an external site, and an nmap -sP scan of the local network. Lately the pings fail at 10:51 AM and are back up by 10:53 AM. When the nmap scan and pings run normally, nmap usually logs 15 devices. When the ping fails, the nmap scan shows 2 devices (itself and one other computer) until the network comes back. I now believe this outage is on my network, not the ISP's. I'm also running Wireshark, where during the outage I see packets originating only from the machine running the ping loops. I'm not sure at this point if this is an XTM 330 problem or something else... continuing to dig.
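The comparison between ping results and the nmap device count can be written down as a simple rule. Here is a minimal Python sketch, assuming each loop iteration has been reduced to a (ping_ok, hosts_seen) pair; the function name, the result labels, and the 50 percent threshold are made up for illustration.

```python
def classify_outage(samples, baseline_hosts, drop_ratio=0.5):
    """Guess whether failed pings coincide with the local network collapsing.

    samples: list of (ping_ok, hosts_seen) pairs, one per loop iteration.
    baseline_hosts: device count nmap normally reports.
    Returns "none" if no ping failed, "lan" if every failed ping came with a
    device count below baseline_hosts * drop_ratio, otherwise "beyond_lan".
    """
    during_failure = [hosts for ok, hosts in samples if not ok]
    if not during_failure:
        return "none"
    threshold = baseline_hosts * drop_ratio
    if all(hosts < threshold for hosts in during_failure):
        return "lan"          # local devices vanished too
    return "beyond_lan"       # LAN looked intact: firewall or ISP side

# Counts taken from the update above: 15 devices normally, 2 during the outage.
observed = [(True, 15), (False, 2), (False, 2), (True, 15)]
print(classify_outage(observed, baseline_hosts=15))
```

With the numbers from this update the result is "lan", which is consistent with the suspicion that the outage is on the internal network.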
Thank You,
David
Local devices on the same subnet reply to each other directly; that traffic does not pass through the firewall.
So the nmap scan going from 15 to 2 devices suggests to me that the switch to which the trusted network is connected is involved.
Is this a managed switch, with port status & logs etc.?
@funkywinkerbean
If you have any free ports on the firewall, configuring another interface as trusted and plugging directly into it (to bypass the switch(es)) can help determine if that blip is the firewall itself.
Nmap is doing layer 2 scans (so stuff communicating by MAC address.) 2 seconds is how long the firewall or switch would probably take to rebuild its ARP table if it's clearing for some reason.
-James Carson
WatchGuard Customer Support
Also, do a Wireshark capture to see if something is saturating the switch at that time.
Thank you both, I'm on it and will be back with the next update!
Really appreciate your suggestions and help,
David
Bruce, I have Wireshark running on the trusted side (thinking this is an internal issue). Are you referring to the anonymous capture you mentioned previously, connected to a switch between the ONT and the Firebox?
Appreciate Ya'll,
David
Question, gentlemen: if, as James recommended above, I configure another port as trusted, connect only one computer to it, and run a few loop tests, then if it doesn't have an issue at the same time, that would confirm the problem is on the internal network. If it has the same errors, the problem would be the XTM or the external ISP. Agreed?
Let me know please if I missed your point, James.
Thank You,
David
Hi @funkywinkerbean
That sounds correct, barring any circumstances I'm not accounting for. It'd at the very least rule the internal network in/out.
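The test being agreed on here boils down to a small truth table. A minimal Python sketch of it is below; the function name and the result strings are made up, and it assumes both test loops run over the same time window.

```python
def interpret(direct_port_failed, switched_lan_failed):
    """Interpret the isolated-port test.

    direct_port_failed: the PC plugged straight into the new trusted port
    lost connectivity. switched_lan_failed: the PC behind the existing
    switch(es) lost connectivity during the same window.
    """
    if not switched_lan_failed:
        return "no outage observed, keep testing"
    if direct_port_failed:
        return "firewall or ISP"      # bypassing the switches did not help
    return "internal network"         # only the switched side went down

print(interpret(direct_port_failed=False, switched_lan_failed=True))
```

As noted, this at least rules the internal network in or out; it does not by itself separate the firewall from the ISP.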
-James Carson
WatchGuard Customer Support
For my recent suggestion on the use of Wireshark: this is on a trusted PC, to see if something is saturating your trusted switch, or if during or just after the timeout you see a bunch of ARP packets from the switch or the firewall.
OK, I am here to update y'all. The final answer: a switch was plugged into an APC Smart-UPS 1400 that was running a daily self-test, which caused the switch to restart. I can't explain why some days we didn't see the outage. Anyway, I really appreciate your time, and beg your pardon for reading more into this than was necessary. I pulled the UPS and everything worked fine. I don't know how the UPS went into this daily test mode, and can't say I really like the concept.
not so Smart a UPS as it turns out ;-)
Even with a daily test, it should not drop the switch if the battery is good. HOWEVER, testing a battery that frequently can shorten its life. A normal schedule may be once a month. I would replace the battery if the UPS is less than three years old, or replace the whole UPS if more than three years old. A LOT of techs I know have changed from APC to Eaton for their UPS purchases.
Gregg Hill