Authentication gateway high cpu and dies randomly

rv@kaufmann.dk · February 2021

Hi

I have Authentication Gateway (12.5.4) installed on two domain controllers running Windows Server 2012 R2. Each machine has 2 vCPUs at 2.6 GHz and 5 GB of memory. The Authentication Gateway is configured to use ELM (Event Log Monitor) as primary and the SSO client as secondary, since we do not have the SSO client installed on any machines.

The Fireboxes are also configured to allow SSO through VPN tunnels.

I have 2 issues with auth gateway:
1. The CPU usage on the authentication gateways is pretty high: approx. 30-60% CPU time is used by the auth service alone when most Fireboxes are connected.
2. At random times the authentication gateway service dies. It gets stuck at 60% CPU usage and stops responding to requests. Opening the SSO tool is extremely slow, and the status monitor finally shows nothing until I kill the service and restart it.

https://axelkaufmann-my.sharepoint.com/:i:/g/personal/rv_kaufmann_dk/EdJ1lmG8oKZAi7kjk9bu6l0B8md3pr4-MLquVMiw4Pb5rw?e=xGjvgM

Has anybody else experienced the same using ELM?

/Robert

james.carson · February 2021

Hi @RVilhelmsen
Event log monitor is a bit of a resource hog -- it's literally (trying to) parse the event logs on every single machine and on your servers to determine what users are logged in. In networks with over 100 users, this starts to get really CPU intensive.

A few tips:
-Make sure you make SSO exclusions for anything that is contacting the firewall that won't be authenticating: printers, copiers, smart devices, anything that's not logging into AD. The exclusion stops the ELM portion of the firewall from looking for it.

You can see more about that here:
https://www.watchguard.com/help/docs/help-center/en-US/Content/en-US/Fireware/authentication/sso_enable_c.html

-In your SSO Gateway app (on the server) consider ordering the contact domain settings so they're like
1. SSO Client
2. Event Log monitor

Install the SSO client on any PCs that you can. This effectively lets the client workstations check in themselves, instead of the event log monitor scrubbing logs looking for that information.

If you keep running into issues, I'd suggest opening a case so that our support team can look into the resource usage. The above recommendations are just general, and I don't know what your unique config looks like.

rv@kaufmann.dk · February 2021

Hi @James_Carson
Thank you. I'll start with the exclusions, as we have many subnets which do not need SSO, and I will work towards getting the SSO client installed.

rv@kaufmann.dk · February 2021

@James_Carson

I have an additional question regarding the SSO client. What do the SSO client and agent do if the client is installed on a machine that is not joined to a domain?

/Robert

james.carson · February 2021

Hi @RVilhelmsen

The Agent takes queries from the firewall and identifies a user using one of the methods listed (Event Log Monitor, Client, Exchange Monitor.)

The Client sits on a client machine and identifies what user is currently logged in when asked by the agent.

There's a detailed overview here:
https://www.watchguard.com/help/docs/help-center/en-US/Content/en-US/Fireware/authentication/sso_about_c.html

FFMLTD · February 2021

We too are experiencing this issue. I agree with the general suggestion of reducing the agent's work to the minimum, but there appears to be something else going on. For us this began with the 12.5.4 agent.

If it were just a simple resource issue, then restarting the agent during 'peak' usage periods, when its usage had spiked to 100%, should put you back at 100% CPU usage within 10 minutes. This is not the case. The rise in resource usage is gradual and fairly consistent, and leads to the agent's eventual failure to process auth requests. This is more indicative of an underlying programming issue. Whether it is a race condition or a leak is unclear at this time, though the lack of a corresponding memory spike at the time of failure points more to the former.
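One way to put numbers on "gradual" versus "immediately saturated" is to log the agent's CPU percentage at regular intervals after a restart and look at the trend. The following is only a rough sketch: the sample values, the 90% threshold, and both function names are made up for illustration, and collecting the samples (for example from Performance Monitor) is left out.

```python
def cpu_trend(samples):
    """Least-squares slope in CPU percent per minute.

    samples: list of (minutes_since_restart, cpu_percent) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


def minutes_to_reach(samples, threshold=90.0):
    """First sample time at which CPU is at or above threshold, else None."""
    for minutes, cpu in sorted(samples):
        if cpu >= threshold:
            return minutes
    return None


if __name__ == "__main__":
    # Hypothetical readings taken every 30 minutes after a restart.
    readings = [(0, 12.0), (30, 20.0), (60, 31.0), (90, 45.0), (120, 95.0)]
    print("slope (% per minute):", round(cpu_trend(readings), 3))
    print("minutes to reach 90%:", minutes_to_reach(readings))
```

If the load alone explained the problem, the time to reach the threshold after a restart at peak should be a matter of minutes; a climb that takes hours under similar load fits the gradual pattern described above.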

Do your symptoms match the ones we have experienced?

rv@kaufmann.dk · February 2021

I have pushed the agent out to all our Windows clients, which might have helped somewhat; I cannot say for sure yet. What I believe helped the load on the gateways very much was making sure I have excluded each and every IP subnet/address that is not working with SSO, or where SSO traffic is blocked by a firewall anyway.

After doing this, the pending list on the Authentication Gateway has only a few IP addresses. I think that if the Authentication Gateway has a big pending list, there is a bug that causes it to use more and more resources and eventually halt.
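For anyone wanting to check the same thing, here is a minimal sketch that compares addresses copied from the pending list against the configured exclusions and prints the ones no exclusion covers. It assumes you paste both lists in by hand; the addresses below and the function names are placeholders, not anything read from the Authentication Gateway or the Firebox.

```python
import ipaddress

# Placeholder exclusion entries; replace with the subnets and hosts
# actually configured as SSO exclusions.
EXCLUSIONS = ["10.0.20.0/24", "10.0.30.15", "192.168.50.0/23"]


def parse_exclusions(entries):
    """Turn single addresses or CIDR ranges into network objects."""
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def uncovered(pending_ips, exclusions):
    """Return the pending addresses that no exclusion entry covers."""
    networks = parse_exclusions(exclusions)
    result = []
    for ip in pending_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            result.append(ip)
    return result


if __name__ == "__main__":
    # Placeholder addresses, as if copied from the pending list.
    pending = ["10.0.20.7", "10.0.30.15", "10.0.40.9", "192.168.51.200"]
    for ip in uncovered(pending, EXCLUSIONS):
        print(ip)
```

Anything it prints is either a device that should get an exclusion or a machine that really should be answering SSO queries.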

FFMLTD · February 2021

@RVilhelmsen

Your observed symptoms appear exactly as I have seen as well. There seems to be some threshold or combination of circumstances beyond which the wagsvc.exe process's CPU usage increases exponentially. The process then becomes unstable and eventually ceases to function, and the CPU use remains stuck until the service is restarted. It also does not 'crash' completely, and therefore does not fail over to a functioning Authentication Gateway, which would be nice if you have redundancy.

I am currently working through an upgrade issue ticket so we are still on a special Fireware 12.5.4 build, but when we are finally able to upgrade to the current build and if the problem still exists, we will open a ticket for this. In the meantime, we don't poke the bear.
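Because the hung service never exits, something external has to notice the stuck state. Below is a minimal sketch of the detection half of such a watchdog, assuming CPU readings for wagsvc.exe are collected elsewhere at a fixed interval. The class name, the 55% threshold, and the window size are illustrative guesses, and what to do once it fires (alert, restart the service) is deliberately left out.

```python
from collections import deque


class StuckDetector:
    """Flags a process whose CPU stays at or above a threshold
    for a full window of consecutive samples."""

    def __init__(self, threshold=55.0, window=10):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def add(self, cpu_percent):
        """Record one reading and report whether the process looks stuck."""
        self.samples.append(cpu_percent)
        return self.is_stuck()

    def is_stuck(self):
        # Only decide once the window is full, so a short burst
        # right after startup is not mistaken for a hang.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and min(self.samples) >= self.threshold


if __name__ == "__main__":
    detector = StuckDetector(threshold=55.0, window=3)
    for reading in [30.0, 60.0, 61.0, 59.0]:
        print(reading, detector.add(reading))
```

A single low reading anywhere in the window resets the verdict, which keeps normal busy periods from triggering it as long as the window is long enough.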
