Diagnosing Engine Issues
This section explains how to read Engine-related log messages and describes solutions to some common issues.
If problems occur with one particular Engine on the grid and the cause is not immediately obvious, it might be easier to reinstall the Engine than to undertake a lengthy troubleshooting exercise. If the problem persists after you reinstall the Engine, investigate the network, application, and machine setup.
For information about managing Engines, see Managing Engines.
Engine Logins, Restarts, and Failures
After GridServer starts, Engine Daemons and Engines log on to the Manager. When this happens, messages similar to the following appear in the Manager log:
Info: [EngineEvent] EngineDaemon:S08048-10.103.8.48:Added
Info: [EngineEvent] Engine:Joe-0:Added
The Broker sends periodic heartbeats to the Engine. If these heartbeats fail, you see messages similar to the following:
Info: [ProxyMonitorPlugin] Killing proxy S08049-10.103.8.49 on EngineDaemonServicePlugin
Info: [EngineEvent] Engine:S08048-0:Logoff:Killed by the proxy monitor
Warning: [EmploymentOfficePlugin] Engine:S08048-0:Died
If the Engine cannot send a heartbeat to the Broker, then after three retries you see the following message:
Warning: HeartbeatPlugin Couldn't send a heart beat to the Manager failure to process HTTP request in POST: Connect failed, so the client logs off.
If the Engine Daemon fails, you see the following message:
Warning: [EngineDaemonServicePlugin] Engine Daemon:S08049-10.103.8.49:Died
You can lengthen the period between heartbeats at Admin > System Admin > Manager Configuration > Communication.
If the Engine fails, by default it restarts and tries to log in again. Failure messages in the Manager log look similar to the following:
Info: [Scheduler] Engine:NCSILS9027B1GRD-0:Logoff:Ping failed on local webserver, restarting instance in one minute
Fine: [EngineProxy] Logging off: NCSILS9027B1GRD-0
Fine: [EngineLoginManagerPlugin] Logging off proxy + NCSILS9027B1GRD-0 code=3 reason=Ping failed on local webserver, restarting instance in one minute
Info: [EngineEvent] Engine:NCSILS9027B1GRD-0:Removed
The Engine Daemon log reports messages similar to the following:
Info: [Scheduler] Engine:nldn8347dww-0:NotifyKillTask:1208119245372388313-1208119245372388313-0
Info: [Scheduler] Engine:nldn8347dww-0:TaskDied:1208119245372388313-1208119245372388313-0
Info: [Scheduler] Engine:nldn8347dww-0:Logoff:Killed by the proxy monitor
Warning: [EngineEvent] Engine:nldn8347dww-0:Died
Info: [EngineEvent] Engine:nldn8347dww-0:Removed
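When a log is large, a small script can help pick out these events. The following is a minimal Python sketch and not part of GridServer. It assumes each event line has the shape Level: [Source] Kind:Name:Event[:Detail], which is inferred only from the sample lines in this section; real logs might carry timestamps or other prefixes that it does not handle. The names parse_event and Event are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample log lines in this section (an assumption).
EVENT_RE = re.compile(
    r"^(?P<level>\w+): \[(?P<source>\w+)\] "
    r"(?P<kind>Engine ?Daemon|Engine):(?P<name>[^:]+):(?P<event>[^:]+)"
    r"(?::(?P<detail>.*))?$"
)


class Event(NamedTuple):
    level: str             # for example Info, Warning, Fine
    source: str            # bracketed component, for example EngineEvent
    kind: str              # Engine, EngineDaemon, or "Engine Daemon"
    name: str              # Engine or Engine Daemon identifier
    event: str             # for example Added, Died, Logoff, Removed
    detail: Optional[str]  # text after the event name, if present


def parse_event(line: str) -> Optional[Event]:
    """Parse one log line; return None if it is not an Engine event line."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    return Event(**match.groupdict())


def failures(lines):
    """Yield parsed events whose event name is Died or Logoff."""
    for line in lines:
        parsed = parse_event(line)
        if parsed is not None and parsed.event in ("Died", "Logoff"):
            yield parsed
```

For example, passing the lines of a saved log file to failures() yields one Event per Died or Logoff message, with the logoff reason in the detail field.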
JVM Issues
If there are messages similar to the following in the Manager log, the JVM might be running out of memory:
Severe: [HeartbeatPlugin] while sending heartbeat java.lang.OutOfMemoryError: unable to create new native thread
You can increase the Engine JVM maximum heap size in the Engine configurations.
If an Engine fails, or the logs on an Engine end abruptly, the cause might be a JVM crash. Check for Java HotSpot fatal error logs in the Engine root directory; they have names like hs_err_pidXXXX.log and contain information about problems in native code. You can use this information in a web search to see whether it is a known problem. Also check whether the failing application calls any native C code.
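To locate these files quickly, a short script can list them. This is a minimal Python sketch under the assumption that the logs sit directly in the Engine root directory, as described above; the function name find_hotspot_error_logs is hypothetical, and the path you pass in is whatever your Engine installation directory is.

```python
from pathlib import Path
from typing import List


def find_hotspot_error_logs(engine_root: str) -> List[Path]:
    """Return hs_err_pid*.log files directly under engine_root, newest first."""
    root = Path(engine_root)
    logs = [p for p in root.glob("hs_err_pid*.log") if p.is_file()]
    # Sort by modification time so the most recent crash is listed first.
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)
```

The first entry in the returned list is the most recently modified log, which is usually the one to read first after an abrupt failure.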
Connection and Firewall Problems
A common problem is that Drivers or Engines cannot connect to the Director or Broker. This is typically due to firewall or DNS issues. Correct DNS configuration is essential in GridServer installations. Use the telnet command to test connections from the Manager to the Engine and vice versa.
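Where telnet is not installed, the same check can be scripted. The following is a minimal Python sketch, not a GridServer tool; it separates a name-resolution failure from a refused or blocked connection. The function name check_endpoint and the three result strings are hypothetical, and the host and port you pass are those of your own Manager or Engine.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> str:
    """Return 'ok', 'dns_failure', or 'connect_failure' for host:port."""
    # Step 1: resolve the name. A failure here points at DNS, not the firewall.
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    # Step 2: open a TCP connection. A refusal or timeout here suggests a
    # firewall rule, a closed port, or a process that is not running.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "connect_failure"
```

Run it on the Manager machine against the Engine, and on the Engine machine against the Manager, since a firewall can block one direction only.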
All supported Windows versions enable the Windows Firewall by default. This automatically blocks any incoming traffic. To make sure your Engine can properly communicate, the inbound port for the Engine’s File Server must be open to traffic. By default, this port is set to 27159; it can be changed in the Engine Configuration. Configure your Windows Firewall to enable use of this port by your Driver machines.
Another possibility is that one of the components is assigning ephemeral ports outside of the range that can be opened. Sometimes systems assign ports outside the range of 49152-65535. You can check this by using netstat -a.
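Reading long netstat output by eye is error-prone, so the check can be scripted. The following Python sketch is a heuristic only: it assumes numeric output (for example from netstat -an) in which the first address:port token on each TCP line is the local address, and it treats any local port that also appears on a LISTEN line as a service port rather than an ephemeral one. The function names are hypothetical.

```python
from typing import List, Optional, Set

EPHEMERAL_RANGE = range(49152, 65536)  # 49152-65535 inclusive


def _is_tcp(line: str) -> bool:
    """True if the first column of the line names a TCP protocol."""
    parts = line.split()
    return bool(parts) and parts[0].lower().startswith("tcp")


def _local_port(line: str) -> Optional[int]:
    """Return the port of the first address:port token on the line, if any."""
    for token in line.split():
        _, sep, port = token.rpartition(":")
        if sep and port.isdigit():
            return int(port)
    return None


def ports_outside_ephemeral_range(netstat_text: str) -> List[int]:
    """List local TCP ports of non-listening sockets outside EPHEMERAL_RANGE."""
    tcp_lines = [ln for ln in netstat_text.splitlines() if _is_tcp(ln)]
    # Ports with a listener are services, not ephemeral client ports.
    listening: Set[int] = set()
    for line in tcp_lines:
        if "LISTEN" in line:
            port = _local_port(line)
            if port is not None:
                listening.add(port)
    flagged: Set[int] = set()
    for line in tcp_lines:
        if "LISTEN" in line:
            continue
        port = _local_port(line)
        if port is not None and port not in listening and port not in EPHEMERAL_RANGE:
            flagged.add(port)
    return sorted(flagged)
```

Any port the function returns is a candidate for an ephemeral port assigned outside 49152-65535; confirm it against the process that owns the connection before changing firewall rules.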
Engine Daemon Cannot Log On to Manager
If an Engine Daemon won’t connect to a Manager, check the URL in the intranet.dat file in the root of the Engine installation directory and see if you can make a connection to it from the Engine machine.
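One way to test the connection from the Engine machine is a short script. The following is a minimal Python sketch; because the layout of intranet.dat is not described here, it assumes you copy the URL out of the file by hand and pass it in. The function name can_reach is hypothetical, and the sketch only checks that a TCP connection to the URL's host and port succeeds, not that the Manager answers correctly.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    host = parts.hostname
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if host is None or port is None:
        raise ValueError(f"cannot determine host and port from {url!r}")
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A False result from the Engine machine, combined with a True result from another machine, points at a firewall or DNS problem local to that Engine.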
Thread Dumps on Engines
To get thread dumps on Engines, use the Java VisualVM tool. It is available at https://visualvm.github.io/.
Using Fusion to Debug .NET Assembly Load Failures
With C# code, a runtime library load failure can take a number of forms (such as a FileLoadException) and might be difficult to debug. The notification states only that the assembly load failed, not why it failed.
To obtain more detailed debugging information about assembly load / bind failures, use the Microsoft Fusion logging system, included in Visual Studio .NET:
1. Start FUSLOGVW.EXE before launching your application.
2. Launch your application.
3. After the failure has occurred, click the Refresh button in the Fusion logging window. An entry related to the process you just ran should appear.
4. Highlight this entry and click View Log to get a detailed report of the .NET Framework’s attempts to load your assemblies.