Subsystem Failure

If a subsystem fails, the in-memory workflow waits and retries (based on configuration parameters) until the subsystem(s) become operational. While it waits, incoming messages are consumed but no action is taken on them.

The Wait and Retry processes differ for transactional workflows and in-memory workflows.

  • For transactional workflows, the retry mechanism is handled at the activity level, as the scope of the transaction is the activity.
  • For in-memory workflows, the retry mechanism is handled at the process level, as the scope of the transaction is the entire workflow.
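The difference in retry scope can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: `SubsystemDown`, `run_with_retry`, `run_transactional`, and `run_in_memory` are hypothetical names, activities are modeled as plain callables, and the waiting between attempts is left out.

```python
from typing import Callable, List

Activity = Callable[[], None]


class SubsystemDown(Exception):
    """Hypothetical error raised when an activity hits a failed subsystem."""


def run_with_retry(unit: Callable[[], None], max_retries: int) -> None:
    """Run one unit of work, retrying it up to max_retries times on failure."""
    for attempt in range(max_retries + 1):
        try:
            unit()
            return
        except SubsystemDown:
            if attempt == max_retries:
                raise


def run_transactional(activities: List[Activity], max_retries: int = 3) -> None:
    # Transaction scope is the activity: only the failed activity is retried.
    for activity in activities:
        run_with_retry(activity, max_retries)


def run_in_memory(activities: List[Activity], max_retries: int = 3) -> None:
    # Transaction scope is the whole workflow: in this sketch a failure
    # anywhere re-runs the process from its first activity.
    def whole_process() -> None:
        for activity in activities:
            activity()

    run_with_retry(whole_process, max_retries)
```

In this sketch, a failure in the second of two activities causes only that activity to be repeated in the transactional case, while the in-memory case repeats both.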

When a subsystem fails, the following steps are executed:

The transaction is rolled back, which reverts all changes to the record data.

Record data written to the distributed cache must also be rolled back. Because the cache system does not support transactional semantics, the rollback is simulated: all record data modifications associated with the process are removed from the cache.
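One way to simulate such a rollback is to track which cache keys each process has written and evict them on failure. The sketch below assumes a simple key/value cache; `ProcessAwareCache` and its methods are hypothetical names and do not reflect the actual cache API.

```python
from collections import defaultdict
from typing import Any, Dict, Set


class ProcessAwareCache:
    """Hypothetical cache wrapper that remembers which keys each process wrote."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self._keys_by_process: Dict[str, Set[str]] = defaultdict(set)

    def put(self, process_id: str, key: str, value: Any) -> None:
        self._store[key] = value
        self._keys_by_process[process_id].add(key)

    def get(self, key: str) -> Any:
        return self._store.get(key)

    def simulate_rollback(self, process_id: str) -> int:
        """Evict every entry written by process_id; return how many keys were tracked."""
        keys = self._keys_by_process.pop(process_id, set())
        for key in keys:
            self._store.pop(key, None)
        return len(keys)
```

Entries are evicted rather than restored to their previous values, so a later read would have to reload the record from its system of record. Whether the product reloads in this way is an assumption of the sketch.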

The retry logic checks whether all subsystems are up and running, using a utility that checks the major subsystems (database, file system, JMS, and cache). The process is re-executed only when all of these subsystems are running.
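A wait-and-retry loop of this kind might look like the sketch below. The function names, the `max_retries` and `wait_seconds` parameters, and the idea of one boolean check per subsystem are assumptions made for illustration; the actual configuration parameters are not named in this section.

```python
import time
from typing import Callable, Dict


def all_subsystems_up(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Return True only if every subsystem check passes."""
    return all(check() for check in checks.values())


def wait_for_subsystems(
    checks: Dict[str, Callable[[], bool]],
    max_retries: int,
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until all subsystems are up or the retry budget is exhausted."""
    for attempt in range(max_retries):
        if all_subsystems_up(checks):
            return True
        # Do not sleep after the final failed attempt.
        if attempt < max_retries - 1:
            sleep(wait_seconds)
    return False
```

The `checks` dictionary would hold one entry per major subsystem (for example `"database"`, `"file_system"`, `"jms"`, and `"cache"`), and the process is re-executed only if the function returns `True`. The `sleep` parameter is injectable so the loop can be tested without real delays.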

When the subsystems are running, a new message is generated from the event, document, and process-related information of the current process and published to the workflow queue. This message carries a special flag, redelivered, which is set to true.

The event, document, and process-related information is then cleared from the local cache.
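The republish-and-clear steps can be sketched as follows. This is an illustration only: `WorkflowMessage`, `republish_for_recovery`, the dictionary used as the local cache, and the list used as the workflow queue are all hypothetical stand-ins for the real message and queue types.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class WorkflowMessage:
    event: Any
    document: Any
    process_info: Dict[str, Any]
    redelivered: bool = False


def republish_for_recovery(
    process_id: str,
    local_cache: Dict[str, Dict[str, Any]],
    workflow_queue: List[WorkflowMessage],
) -> WorkflowMessage:
    """Publish a redelivery message for process_id, then drop its local state."""
    state = local_cache[process_id]
    message = WorkflowMessage(
        event=state["event"],
        document=state["document"],
        process_info=state["process_info"],
        redelivered=True,
    )
    # Publish first, so the local state is still available if publishing fails.
    workflow_queue.append(message)
    # Only then clear the event, document, and process information locally.
    del local_cache[process_id]
    return message
```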

When a message with the redelivered flag set to true is received, the regular workflow recovery mechanism takes over.

  • If no process in the system is associated with the incoming message, a new process is created and executed from the first activity.
  • If a process is associated with the incoming message, the process is resumed from the last activity at which the process-related information was persisted.
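The two recovery cases above can be sketched as a small decision function. The names `RecoveryPlan` and `plan_recovery`, and the dictionary mapping each process ID to its last persisted activity, are assumptions made for this example rather than part of the product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveryPlan:
    process_id: str
    is_new_process: bool
    start_activity: str


def plan_recovery(
    process_id: str,
    activities: List[str],
    persisted_checkpoints: Dict[str, str],
) -> RecoveryPlan:
    """Decide where a redelivered message should start executing."""
    checkpoint = persisted_checkpoints.get(process_id)
    if checkpoint is None:
        # No process is known for this message: create one and start
        # at the first activity.
        return RecoveryPlan(process_id, True, activities[0])
    # A process exists: resume from the last activity whose
    # process information was persisted.
    return RecoveryPlan(process_id, False, checkpoint)
```

For a workflow with activities `["validate", "approve", "publish"]`, a message for an unknown process would start at `validate`, while a process last persisted at `approve` would resume there.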