Common Scheduler Features

The features in this section are common to both scheduler modes.

Grid Library Aware Scheduling

The GridServer scheduler does not schedule a task to an Engine if that Engine does not yet have the root Grid Library for that Service. Additionally, when an Engine logs in, it does not wait until it is synchronized to run tasks; rather, it works on any Services it can while it is synchronizing any new libraries. This allows for straightforward library deployment on large grids where it might take hours to fully sync; Services can be started at any time instead of waiting until all Engines are synced.

Engine Blacklisting

If a Service sets the option engineBlacklisting (ENGINE_BLACKLISTING) to true, then Engines that fail on a task from that Service do not receive any other tasks from that Service. The default is false. “Fail” means any action that results in a failed task being sent back to the Manager, regardless of whether that failure was due to Engine hardware, Engine environment, or Service implementation code. It does not include events such as the Engine going offline due to user activity, since that does not result in a task failure.

You can also set the option failuresBeforeBlacklist (FAILURES_BEFORE_BLACKLIST) to the number of task failures allowed before an Engine is blacklisted.

Blacklisted Engines are excluded for a particular Service Session only; they can freely accept tasks from any other Service, regardless of Service Type, assuming the other Services haven’t also blacklisted the Engine and have no Conditions in place that prevent it. Blacklisted Engines can also be shared with other Brokers that need Engines.

To remove an Engine from all blacklists, go to Grid Components > Engines > Daemon Admin and select Clear from Blacklists from the Actions list.

You can get a list of blacklisted Engines using the GridServer API. In Java, the getBlackListedEngines method in com.datasynapse.gridserver.admin.ServiceAdmin retrieves a list of Engines that have been blacklisted for a given Service. It is also available for C++ and .NET. See the GridServer API for more information.

Engine Greylisting

If your Tasks are more heterogeneous, you might not want to use blacklisting, because a single Task failure might not imply failure of all Tasks. Greylisting enables you to make it less likely that an Engine works on a Service Session than other Engines, without completely excluding that Engine. This is done by lowering affinity for an Engine for the Task retry.

To use Engine Greylisting, set the Service option engineGreylisting (ENGINE_GREYLISTING) to true. When a Task fails, the Service Affinity for that Engine is reduced by a configurable amount, which by default is 5. To change this amount, go to Admin > System Admin > Manager Configuration, and under the Affinity heading, change Greylist Affinity to a negative value.

Greylisting can be used with blacklisting, if the failuresBeforeBlacklist is greater than zero. The Service Affinity is reduced upon each Task failure until failuresBeforeBlacklist is reached and the Engine is blacklisted.
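The interaction between greylisting and blacklisting can be illustrated with a minimal sketch. It is a standalone model and does not use the GridServer API: the class GreylistSketch, its methods, and the Engine identifiers are hypothetical. It also assumes that an Engine is blacklisted as soon as its failure count reaches failuresBeforeBlacklist, and that each failure lowers affinity by the same fixed amount.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one Service Session's grey/blacklist state.
// This is an illustration only, not the GridServer implementation.
class GreylistSketch {
    private final int greylistPenalty;          // affinity removed per failure (5 by default in the text)
    private final int failuresBeforeBlacklist;  // 0 or less means blacklisting is not combined with greylisting
    private final Map<String, Integer> failures = new HashMap<>();

    GreylistSketch(int greylistPenalty, int failuresBeforeBlacklist) {
        this.greylistPenalty = greylistPenalty;
        this.failuresBeforeBlacklist = failuresBeforeBlacklist;
    }

    // Record a failed task that an Engine sent back to the Manager.
    void recordFailure(String engineId) {
        failures.merge(engineId, 1, Integer::sum);
    }

    // Assumption: the Engine is excluded once the threshold is reached.
    boolean isBlacklisted(String engineId) {
        return failuresBeforeBlacklist > 0
                && failures.getOrDefault(engineId, 0) >= failuresBeforeBlacklist;
    }

    // Cumulative affinity change for this Engine (zero or negative).
    int affinityAdjustment(String engineId) {
        return -greylistPenalty * failures.getOrDefault(engineId, 0);
    }

    public static void main(String[] args) {
        GreylistSketch session = new GreylistSketch(5, 3);
        for (int i = 1; i <= 3; i++) {
            session.recordFailure("engine-1");
            System.out.println("failures=" + i
                    + " adjustment=" + session.affinityAdjustment("engine-1")
                    + " blacklisted=" + session.isBlacklisted("engine-1"));
        }
    }
}
```

In this model the Engine remains eligible but less preferred after the first two failures, and is excluded from the Session on the third.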

The Clear from Blacklists action described above also clears greylists.

Engine Properties

Engines have a number of intrinsic properties, such as available memory or disk space, performance (megaflops), operating system, and so forth, that Conditions can use to define eligibility. Custom properties can also be defined on the grid, and property values can be assigned on a per-Engine basis. These properties are used in a number of the following features.

Engine Tiers

Customers often have distinct sets of resources that they want to use in a preferred order. Engine tiers provide a mechanism to specify the order in which groups of Engines are scheduled.

For example, you might have a set of high-performance dedicated blades, a pool of older servers, and a group of desktop computers used as part-time grid resources when idle. Ideally, you would not use the older servers unless the high-performance blades are completely in use, and would avoid using the desktop machines unless both other groups are busy.

Engine tiers are defined on the Broker at Admin > System Admin > Manager Configuration > Services, under the Scheduling heading, with the Engine Tiers property. The property’s format is an ordered, comma-delimited list of Engine property name-value pairs. For example, for the above scenario, if you had a type Engine property, you might use type=blade,type=server,type=desktop. The list is ordered from highest-tier to lowest-tier. Engines matching the first tier are always scheduled before those in the second tier, and so on.

If an Engine matches more than one tier, it is placed in the highest tier that it matches. If it matches none of the defined tiers, it is scheduled after all other tiers.
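The tier-matching rule can be sketched as follows. This is a standalone illustration, not the Broker's implementation: the class EngineTierSketch, its methods, and the name property used to label Engines are hypothetical, and Engine properties are modeled as a simple string map.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of ordering Engines by an Engine Tiers value.
class EngineTierSketch {
    // Parse "name=value,name=value" into an ordered list of {name, value} pairs.
    static List<String[]> parseTiers(String spec) {
        List<String[]> tiers = new ArrayList<>();
        for (String pair : spec.split(",")) {
            String[] nameValue = pair.trim().split("=", 2);
            if (nameValue.length != 2) {
                throw new IllegalArgumentException("Expected name=value but got: " + pair);
            }
            tiers.add(nameValue);
        }
        return tiers;
    }

    // Index of the highest tier the Engine matches (0 is the highest);
    // tiers.size() if it matches none, so it sorts after every defined tier.
    static int tierOf(Map<String, String> engineProperties, List<String[]> tiers) {
        for (int i = 0; i < tiers.size(); i++) {
            String[] nameValue = tiers.get(i);
            if (nameValue[1].equals(engineProperties.get(nameValue[0]))) {
                return i;
            }
        }
        return tiers.size();
    }

    public static void main(String[] args) {
        List<String[]> tiers = parseTiers("type=blade,type=server,type=desktop");
        List<Map<String, String>> engines = new ArrayList<>(List.of(
                Map.of("name", "d1", "type", "desktop"),
                Map.of("name", "x1", "type", "laptop"),
                Map.of("name", "b1", "type", "blade")));
        // Stable sort: Engines in the same tier keep their relative order.
        engines.sort(Comparator.comparingInt(e -> tierOf(e, tiers)));
        for (Map<String, String> engine : engines) {
            System.out.println(engine.get("name")); // prints b1, then d1, then x1
        }
    }
}
```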

Conditions

Conditions are rules that are applied to Services and tasks that affect how they are scheduled to Engines.

A Discriminator Condition limits the execution of tasks to a subset of Engines. If an Engine is ineligible to take the next waiting task, it is assigned the first task it is eligible to take. An Affinity Condition, like intrinsic affinity, provides for prioritized routing of tasks to Engines, but does not prevent any Engines from taking tasks. These conditions typically are based on Engine Properties. Users can also implement custom versions of these conditions. (Note that Affinity is only used in Priority Mode with the Usage algorithm.)
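The way a Discriminator Condition lets an Engine skip ahead to the first task it is eligible for can be sketched as below. The Task class, the takeNext method, and the property names are hypothetical and stand in for the real Condition classes, which are described in the GridServer Developer’s Guide.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration of discriminator-based task selection.
class DiscriminatorSketch {
    static final class Task {
        final String id;
        final Predicate<Map<String, String>> discriminator; // true if the Engine may run the task

        Task(String id, Predicate<Map<String, String>> discriminator) {
            this.id = id;
            this.discriminator = discriminator;
        }
    }

    // Remove and return the first waiting task whose discriminator accepts the Engine.
    static Optional<Task> takeNext(List<Task> waiting, Map<String, String> engineProperties) {
        Iterator<Task> it = waiting.iterator();
        while (it.hasNext()) {
            Task task = it.next();
            if (task.discriminator.test(engineProperties)) {
                it.remove();
                return Optional.of(task);
            }
        }
        return Optional.empty(); // the Engine is ineligible for every waiting task
    }

    public static void main(String[] args) {
        List<Task> waiting = new ArrayList<>();
        waiting.add(new Task("t1", props -> "linux".equals(props.get("os"))));
        waiting.add(new Task("t2", props -> true)); // no restriction
        Map<String, String> windowsEngine = Map.of("os", "windows");
        // The Engine cannot take t1, so it receives t2; t1 stays at the head of the queue.
        System.out.println(takeNext(waiting, windowsEngine).map(t -> t.id).orElse("none")); // prints t2
    }
}
```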

Task Affinity provides the ability to run a set of Tasks on the same Engine or set of Engines. For example, if a Task loads a large dataset, and a number of subsequent Tasks use the same dataset, you can add a Task Affinity Condition to those Tasks so they prefer to run on the Engine that ran the first Task.

When a QueueJump Condition is added to a task submission, the task is added to the front of the Session’s queue so that it is the next task taken.
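The queue-jump behavior amounts to inserting at the head of the Session’s queue instead of the tail. A minimal sketch, assuming the queue is modeled as a double-ended queue of task identifiers (the class QueueJumpSketch and the submit method are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of a queue-jumping submission.
class QueueJumpSketch {
    // Tasks normally join the back of the queue; a queue-jumping task goes to the front.
    static void submit(Deque<String> sessionQueue, String taskId, boolean queueJump) {
        if (queueJump) {
            sessionQueue.addFirst(taskId);
        } else {
            sessionQueue.addLast(taskId);
        }
    }

    public static void main(String[] args) {
        Deque<String> sessionQueue = new ArrayDeque<>();
        submit(sessionQueue, "t1", false);
        submit(sessionQueue, "t2", false);
        submit(sessionQueue, "t3", true);             // jumps ahead of t1 and t2
        System.out.println(sessionQueue.pollFirst()); // prints t3
    }
}
```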

Dependency Conditions are used to create workflow amongst Sessions and tasks. A task can be set to wait until another task or Session completes; likewise with Sessions. Dependent tasks or Sessions can also optionally fail if the dependency fails.

For more information about using Conditions, see the GridServer Developer’s Guide.

Redundant Rescheduling

Redundant rescheduling addresses the situation in which a handful of tasks, running on less-capable processors, might significantly delay or prevent Service completion. The basic idea is to launch redundant instances of long-running tasks. The Broker accepts the first result to return. Remaining instances are not immediately canceled; each one either runs to completion or continues until the Service finishes. Redundant rescheduling is also useful when completion of long-running tasks is critical.

By default, redundant task rescheduling is not enabled. With pools of more capable or nearly identical Engines, fastest task execution occurs when there is no redundancy from rescheduling. In general, rescheduling is only appropriate when there are widely different capabilities in Engines. To enable redundant rescheduling, you must enable one of the three strategies, and set the REDUNDANT_RESCHEDULING_ENABLED Service option to true on each Service you want to redundantly reschedule.

Note 

In situations where a group of Engines might slow down a task run, using discrimination can be more efficient than redundant rescheduling.

Three separate strategies, running in parallel, govern rescheduling. Tasks are rescheduled whenever one or more of the three corresponding criteria are satisfied. However, none of the rescheduling strategies apply for any Service until a certain percentage of tasks within that Service have completed; the Strategy Effective Percent property determines this percentage.

The rescheduler scans the pending task list for each Service at regular intervals, as determined by the Poll Period property. Each Service has an associated taskMaxTime, after which tasks within that Service are rescheduled. When the strategies are active (based on the Strategy Effective Percent), the Broker tracks the mean and standard deviation of the (clock) times consumed by each completed task within the Service. Each of the three strategies uses one or both of these statistics to define a strategy-specific time limit for rescheduling tasks.

Each time the rescheduler scans the pending list, it checks the elapsed computation time for each pending task. Initially, rescheduling is driven solely by the taskMaxTime for the Service; after enough tasks complete, and the strategies are active, the rescheduler also compares the elapsed time for each pending task against the three strategy-specific limits. If any of the limits is exceeded, it adds a redundant instance of the task to the waiting list. (The Broker resets the elapsed time for that task when it gives the redundant instance to an Engine.)

The Reschedule First flag determines whether the redundant task instance is placed at the front or the back of the waiting list. If Reschedule First is true, rescheduled tasks are placed at the front of the queue so that they are distributed before other waiting tasks. The default setting is false, which results in less aggressive rescheduling.

Each of the three strategies computes its corresponding limit as follows:

The Percent Completed Strategy waits until the Service nears completion (as determined by the Remaining Task Percent setting), after which it begins rescheduling pending tasks that are taking longer than the average completion time for tasks within the Service.
The Average Strategy returns the product of the mean completion time and the Average Limit property. That is, this strategy reschedules tasks when their elapsed time exceeds some multiple (as determined by the Average Limit) of the mean completion time.
The Standard Dev Strategy returns the mean plus the product of the Standard Dev Limit property and the standard deviation of the completion times. That is, this strategy reschedules tasks when their elapsed time exceeds the mean by some multiple (as determined by the Standard Dev Limit) of the standard deviation.
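The three limits and the taskMaxTime check can be combined in a short sketch. This is an illustration under stated assumptions, not the Broker's implementation: the class, field, and method names are hypothetical, the default values shown are placeholders rather than product defaults, all three strategies are treated as enabled, and the standard deviation is computed as a population standard deviation.

```java
// Hypothetical illustration of the redundant-rescheduling decision for one pending task.
class RescheduleSketch {
    // Placeholder settings; the names mirror the properties in the text, the values are invented.
    double strategyEffectivePercent = 50.0; // strategies apply once this percent of tasks completed
    double remainingTaskPercent = 10.0;     // Percent Completed Strategy threshold
    double averageLimit = 3.0;              // multiple of the mean
    double standardDevLimit = 2.0;          // multiple of the standard deviation

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Assumption: population standard deviation of completed task times.
    static double stdDev(double[] values) {
        double m = mean(values);
        double squares = 0.0;
        for (double v : values) squares += (v - m) * (v - m);
        return Math.sqrt(squares / values.length);
    }

    boolean shouldReschedule(double elapsed, double taskMaxTime,
                             double[] completedTimes, int totalTasks) {
        if (elapsed > taskMaxTime) return true; // always applies, even before strategies are active
        double completedPercent = 100.0 * completedTimes.length / totalTasks;
        if (completedTimes.length == 0 || completedPercent < strategyEffectivePercent) return false;

        double mean = mean(completedTimes);
        double sd = stdDev(completedTimes);
        boolean nearCompletion = (100.0 - completedPercent) <= remainingTaskPercent;

        boolean percentCompleted = nearCompletion && elapsed > mean;
        boolean average = elapsed > mean * averageLimit;
        boolean standardDev = elapsed > mean + standardDevLimit * sd;
        return percentCompleted || average || standardDev; // any one criterion is sufficient
    }

    public static void main(String[] args) {
        RescheduleSketch rescheduler = new RescheduleSketch();
        double[] completed = {8, 12, 8, 12}; // mean 10, standard deviation 2
        // Standard Dev limit is 10 + 2 * 2 = 14; Average limit is 10 * 3 = 30.
        System.out.println(rescheduler.shouldReschedule(13, 1000, completed, 8)); // prints false
        System.out.println(rescheduler.shouldReschedule(15, 1000, completed, 8)); // prints true
    }
}
```

With these placeholder values, a task that has run for 15 time units is rescheduled because it exceeds the Standard Dev limit of 14, even though it is well under the Average limit of 30.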