Driver Timeout and Failure

When a client application fails, the Broker detects the failure when the Client neither sends a heartbeat nor logs back in within a specified time.

This time is defined as the Client Timeout Minutes plus the Driver Heartbeat Timeout. The Driver Heartbeat Timeout is the Max Millis Per Heartbeat property multiplied by the Timeout Factor property. Note that Max Millis Per Heartbeat is a maximum: the actual heartbeat interval is chosen at random between half this value and the full value. (For example, with the default of 15000 milliseconds, the actual interval is between 7500 and 15000 milliseconds.)

For example, by default, Client Timeout Minutes is 5 minutes (300,000 ms), Max Millis Per Heartbeat is 15,000 ms and Timeout Factor is 15. This means the client timeout is between 300,000 + 7,500 * 15 and 300,000 + 15,000 * 15 ms.
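The arithmetic above can be sketched as a small helper. This is only an illustration of the formula described in this section, not part of the GridServer API; the function name and parameter names are hypothetical. The call uses a Client Timeout Minutes of 5 and a Timeout Factor of 15 from the example, together with the 15,000 ms quoted earlier as the default Max Millis Per Heartbeat.

```python
def client_timeout_window_ms(client_timeout_minutes, max_millis_per_heartbeat, timeout_factor):
    """Return the (shortest, longest) client timeout in milliseconds.

    Hypothetical helper: it only mirrors the formula in the text,
    Client Timeout Minutes + (heartbeat interval * Timeout Factor).
    """
    base_ms = client_timeout_minutes * 60 * 1000
    # The actual heartbeat interval lies between half the maximum and the maximum.
    min_heartbeat_ms = max_millis_per_heartbeat / 2
    shortest = base_ms + min_heartbeat_ms * timeout_factor
    longest = base_ms + max_millis_per_heartbeat * timeout_factor
    return shortest, longest


low, high = client_timeout_window_ms(5, 15000, 15)
print(f"Client timeout is between {low:.0f} ms and {high:.0f} ms")
```

With these inputs the window runs from 412,500 ms to 525,000 ms, or roughly 6.9 to 8.75 minutes after the last heartbeat.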

The Client Timeout Minutes is set on the Manager at Admin > System Admin > Engines and Clients > Client Management > Client Timeout Minutes. The Driver Heartbeat Timeout is the Max Millis Per Heartbeat property times the Timeout Factor property; both are set at Admin > System Admin > Communication > HTTP Connections > Driver Heartbeat.

When a client timeout occurs, any currently running Services are canceled. In that case, failure recovery or restart is the responsibility of your application.

The exception to cancellation is fully submitted Services of type Collection.LATER or Collection.NEVER. Also, if a Client is collecting results from a Collection.LATER type Service, none of the outputs are removed until all are collected and the Client destroys the Service, so that if a Client fails during collection it can restart and recollect the outputs.

It is also possible to collect tasks from a failed client application for Services of type Collection.AFTER_SUBMIT and Collection.IMMEDIATELY if cancellation has not yet occurred. To do this, you must collect and pass in the Driver session ID, in the same fashion as collection of a Collection.LATER Service. For more information about collection of Collection.LATER Services, see "Deferred Collection (Collect Later)" in the GridServer Developer’s Guide.

All Driver file servers return a Server Unavailable code with instructions to retry if they are processing too many concurrent requests. This significantly reduces the chance of a Service invocation failing due to a temporarily overloaded Driver.
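The retry behavior described above follows a common pattern, sketched below in generic form. This is not GridServer's implementation: it assumes the "Server Unavailable" code corresponds to HTTP status 503 and that the retry instruction can be read as a delay in seconds. The `send_request` callable and all parameter names are hypothetical.

```python
import time


def call_with_retry(send_request, max_attempts=5, default_delay_s=1.0, sleep=time.sleep):
    """Call send_request() until it returns a status other than 503.

    send_request is a hypothetical callable returning a tuple of
    (status_code, retry_after_seconds_or_None, body).
    """
    status, body = None, None
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = send_request()
        if status != 503:
            return status, body
        if attempt < max_attempts:
            # Honor the server's suggested delay if it gave one (assumption).
            sleep(retry_after if retry_after is not None else default_delay_s)
    # All attempts returned 503; hand the last response back to the caller.
    return status, body


# Simulated server: overloaded twice, then succeeds.
responses = iter([(503, 0.0, None), (503, None, None), (200, None, "payload")])
result = call_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

Bounding the number of attempts keeps a persistently overloaded server from stalling the caller indefinitely; the caller still sees the final 503 and can decide how to proceed.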