Data Movement Examples

As an example of using the data movement mechanisms discussed above, consider the problem of determining the value of a financial instrument. This example uses a computation method named value. The value method takes two arguments: a deal and a pricing scenario. The deal argument contains all information specific to a financial instrument needed to determine its value, such as coupon and maturity date. The pricing scenario argument contains all other determinants of the deal’s value, such as interest rates and prices of underlying instruments. The output of the value function is a single number representing the value of the deal under the given pricing scenario.
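In code, the computation has roughly this shape. This is a hypothetical Python stub, not the real value method (which belongs to whatever Service implementation you deploy); the dictionary fields and the toy formula are illustrative assumptions.

```python
def value(deal, pricing_scenario):
    """Value of one deal under one pricing scenario (illustrative stub).

    `deal` carries instrument-specific data (e.g., coupon, maturity date);
    `pricing_scenario` carries everything else (e.g., interest rates,
    prices of underlying instruments). Returns a single number.
    The formula below is a placeholder, not a real pricing model.
    """
    return deal["notional"] * pricing_scenario["discount_factor"]

print(value({"notional": 1_000_000}, {"discount_factor": 0.5}))  # → 500000.0
```

The essential point is the interface: one deal plus one pricing scenario in, one number out, which is what makes the computation easy to farm out across many Engines.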

Typical applications require the value of many deals over one or several pricing scenarios. To distribute and parallelize this computation, we execute the value function simultaneously on many Engines. We assume the code for the value function is available to each Engine (whether by Resource Update or over a network file system). We also assume that the numbers returned by the value function make their way back to the client through the standard Service return value mechanism. The question we want to consider is how to get the deal and pricing scenario information to the Engines.

Database Access

We first look at the deal information itself, stored in a database or data server somewhere on the network. Compare the two scenarios in Figure 8-1. In the first diagram, on the left, the Driver loads the deal information from the data server and sends it to the Engines. In the second diagram, on the right, the Driver sends just the unique identifier and has each Engine access the data server on its own.

Figure 8-1. Data flow between a Driver, two Engines, and a Data Server

The second choice is better because it requires fewer data moves across the network to accomplish the same result. In the first choice, the data crosses the network twice: once from the data server to the Driver, and again from the Driver to the Engines. In the second choice, the data crosses the network only once, from the data server to each Engine. The data also needs marshaling and unmarshaling only once.

The second choice also increases parallelism at the data server. In the first choice, only the Driver is attempting to load data from the data server. In the second, multiple Engines attempt to load data concurrently. Assuming that the data server can handle the load, the second choice increases parallelism.
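The savings can be made concrete with a back-of-the-envelope model. The sketch below is plain Python with made-up numbers; the sizes and counts are illustrative assumptions, not GridServer parameters. It tallies the bytes that cross the network under each design.

```python
# Back-of-the-envelope model of network traffic for the two designs in
# Figure 8-1. All sizes and counts are illustrative assumptions.
DEAL_SIZE = 10_000   # bytes of deal data per deal (assumed)
ID_SIZE = 16         # bytes per deal identifier (assumed)
NUM_DEALS = 1_000

# Choice 1: the Driver loads each deal from the data server, then
# forwards it to an Engine, so each deal crosses the network twice.
choice1 = NUM_DEALS * DEAL_SIZE   # data server -> Driver
choice1 += NUM_DEALS * DEAL_SIZE  # Driver -> Engines

# Choice 2: the Driver sends only identifiers; each Engine fetches its
# deals directly, so the deal data crosses the network once.
choice2 = NUM_DEALS * ID_SIZE     # Driver -> Engines (identifiers only)
choice2 += NUM_DEALS * DEAL_SIZE  # data server -> Engines

print(choice1, choice2)  # the second design moves roughly half the bytes
```

Because identifiers are tiny relative to deal records, the second design approaches half the traffic of the first, and the gap only grows as deal records get larger.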

Single Pricing Scenario

We now consider the case in which you use a single pricing scenario to evaluate many deals. Here is one (suboptimal) way to organize this computation. We assume throughout that you have already deployed and registered a Service containing the value function.

Algorithm 1 (suboptimal):

1. Create a Service Session of the value Service.
2. For each deal, submit the deal identifier and the pricing scenario as an asynchronous request to the Service Session.
3. Wait for results.

Although this algorithm gets the job done, it needlessly sends the same pricing scenario multiple times.

This is an ideal application of Service Session state:

Algorithm 2:

1. Create a Service Session of the value Service, initialized with the pricing scenario.
2. For each deal, submit the deal identifier as an asynchronous request to the Service Session.
3. Wait for results.

By making the pricing scenario part of the session's state, it is transmitted only as many times as there are Engines working on the session, rather than once per request. GridServer never allocates more Engines to a Service Session than there are requests for it, so Algorithm 2 never moves more data than Algorithm 1. And in the likely event that there are many more requests than Engines (the Task Duration section below argues why this is desirable), Algorithm 2 moves much less data than Algorithm 1.
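A quick count makes the comparison concrete. The sketch below is plain Python; the scenario size, deal count, and Engine count are illustrative assumptions, not GridServer defaults.

```python
# How many bytes of pricing-scenario data cross the network under each
# algorithm? Counts and sizes are illustrative assumptions.
SCENARIO_SIZE = 500_000  # bytes per pricing scenario (assumed)
NUM_DEALS = 10_000       # requests submitted
NUM_ENGINES = 100        # Engines allocated to the session (assumed)

# Algorithm 1: the scenario travels with every request.
alg1_bytes = NUM_DEALS * SCENARIO_SIZE

# Algorithm 2: the scenario travels once per Engine, as session state.
# GridServer never allocates more Engines than there are requests,
# so min() captures the worst case.
alg2_bytes = min(NUM_ENGINES, NUM_DEALS) * SCENARIO_SIZE

print(alg1_bytes // alg2_bytes)  # scenario traffic shrinks 100x here
```

With 100 times more requests than Engines, Algorithm 2 moves 100 times less scenario data; when requests and Engines are equal in number, the two algorithms tie, which is the worst case for Algorithm 2.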

Several Pricing Scenarios

What if the application needs to value the portfolio of deals for more than one pricing scenario? One approach is simply to repeat Algorithm 2 several times, creating a new Service Session for each pricing scenario. It is also possible to use a single session and call the updateState method of the Service client API to transmit each successive pricing scenario to the Engines running the session. If the differences between pricing scenarios are small, then sending just those differences as the update, rather than the whole scenarios, can yield considerable data movement savings; even if the full scenarios are sent as updates, this approach is still likely to be superior to using separate sessions.
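The difference-based update can be sketched as follows. Here a pricing scenario is modeled as a plain Python dict, and the diff helper is ours; GridServer's updateState simply delivers whatever update object you choose, so the representation and helper below are illustrative assumptions.

```python
# Sketch of why shipping scenario *differences* is cheap. The dict
# representation of a scenario and the diff helper are illustrative
# stand-ins, not part of the GridServer API.
def diff(old, new):
    """Keys whose values changed or were added between two scenarios."""
    return {k: v for k, v in new.items() if old.get(k) != v}

base = {f"rate_{i}": 0.05 for i in range(1000)}  # full pricing scenario
bumped = dict(base, rate_42=0.055)               # one rate moves

update = diff(base, bumped)
print(len(bumped), len(update))  # → 1000 1: the update is tiny

# An Engine already holding `base` reconstructs the new scenario:
assert dict(base, **update) == bumped
```

Sending the one-entry update instead of the thousand-entry scenario is where the savings come from; the Engine merges the update into the state it already holds.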

Multiple Pricing Scenarios Available Early

Now let us add the following wrinkle: we still want to compute the value of many deals over many pricing scenarios, but the pricing scenarios are available to us sometime before we can run the application. For instance, pricing scenario information is available at 4 PM, but we cannot start the nightly report until 5:30 PM, to avoid interfering with daily work. In this situation, we can exploit the time gap to push information to the Engines before the computation starts. One approach would be to use File Update to put all the pricing scenario data on all the Engines. Another would be to put the pricing scenario data into GridCache and run a “primer” Service that copies the data to the Engines. The trade-offs between these two approaches were discussed above under Data Movement Mechanisms.
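The priming idea can be sketched with plain Python objects standing in for the grid. The cache dict, the engine list, and the scenario names below are stand-ins, not GridCache or Engine APIs.

```python
# Toy model of priming: a shared cache is populated when scenarios
# become available (4 PM), and a "primer" pass copies entries into each
# Engine's local store before the real computation begins (5:30 PM).
# The cache and engine objects are stand-ins for GridCache and Engines.
grid_cache = {}                       # stands in for GridCache
engines = [dict() for _ in range(4)]  # each Engine's local data store

# 4 PM: pricing scenarios become available and are cached.
for name in ("base", "rates_up_100bp", "rates_down_100bp"):
    grid_cache[name] = f"scenario data for {name}"

# Primer pass: each Engine copies the cached scenarios locally.
for local in engines:
    local.update(grid_cache)

# 5:30 PM: the value computation can start with data already in place.
assert all(set(local) == set(grid_cache) for local in engines)
```

The File Update approach replaces the cache-and-primer pair with a direct push of scenario files to every Engine; either way, the scenario transfers happen in the idle window rather than on the computation's critical path.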

Deal-Pricing Scenario Symmetry

Finally, we point out that deals and pricing scenarios are for the most part symmetric in these examples (the main difference being that pricing scenarios are less likely to be indexed by primary key in a database, so the discussion of deal identifiers versus deal data does not apply to them). For instance, if deals are available to you early, you can use File Update or GridCache to push deal information to Engines before your application starts.