 
About the TDV MPP Engine Configuration
TDV allows you to publish large federated data lakes and data warehouses as a single virtual data lake and make it available to consuming applications. To support this, TDV implements an execution engine for MPP-style processing. The Massively Parallel Processing (MPP) execution engine provides the acceleration that is essential for the analytical processing your business needs. For large data volume workloads, the MPP Engine dynamically distributes queries across multiple processors spanning the entire cluster. When optimizing for MPP-style execution, TDV takes into account the dataset volume, the compute capability of the underlying source, the cardinality statistics available from the source, and the compute capacity of TDV itself. TDV employs a hybrid approach to scheduling queries, determining which engine (MPP or Classic) is most appropriate for each query. The use of the MPP Engine is transparent and does not require any rewrite of existing DV artifacts.
The MPP Execution Engine optimizes operators such as JOIN, UNION, AGGREGATE, and SORT by distributing the processing across multiple processors. Unlike traditional single-node query execution, the MPP Engine can leverage the entire cluster’s compute and memory for these expensive operators. The decision to use MPP execution takes into account the volume of the dataset being retrieved and whether the query involves a single data source or multiple data sources. For a dataset to be processed by the MPP Engine, its data source must be configured to accept partitioned queries. To do this, set the concurrentRequestLimit data source setting in the Advanced Connection Settings to a value greater than 0.
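To make the prerequisite concrete, the following Python sketch illustrates how one logical scan could be decomposed into range-bounded sub-queries that the source then services concurrently, which is what setting concurrentRequestLimit above 0 allows. The table name, column, and date boundaries are purely illustrative and do not reflect TDV internals.

# One logical scan of an illustrative "orders" table, decomposed into
# range-bounded sub-queries. A source whose concurrentRequestLimit is
# greater than 0 can accept several of these partitioned requests at once.
BASE_QUERY = (
    "SELECT region, amount FROM orders "
    "WHERE order_date >= {lo} AND order_date < {hi}"
)

def partitioned_scans(bounds):
    """Yield one sub-query per pair of adjacent partition boundaries."""
    for lo, hi in zip(bounds, bounds[1:]):
        yield BASE_QUERY.format(lo=repr(lo), hi=repr(hi))

for sql in partitioned_scans(["2023-01-01", "2023-04-01", "2023-07-01", "2023-10-01"]):
    print(sql)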
If the datasets involved in a query are small, using the MPP Engine incurs a penalty in setup time and in the network cost of moving data between nodes, and traditional single-node query execution is much faster in that situation. The tunable Minimum Partition Volume property controls the threshold beyond which the optimizer considers MPP execution.
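The engine choice itself can be pictured as a simple threshold test. The Python sketch below is a conceptual illustration only; the real optimizer also weighs source compute capability, cardinality statistics, and TDV’s own capacity, and the function name and threshold value are assumptions made for the example.

def choose_engine(estimated_rows, min_partition_volume, source_accepts_partitioned_queries):
    """Conceptual sketch of the hybrid MPP/Classic decision described above."""
    if not source_accepts_partitioned_queries:
        return "classic"   # the source cannot service partitioned queries
    if estimated_rows < min_partition_volume:
        return "classic"   # setup time and network hops would outweigh the gain
    return "mpp"           # large enough to amortize the distribution overhead

print(choose_engine(500, 100_000, True))         # small result set -> classic
print(choose_engine(25_000_000, 100_000, True))  # large result set -> mpp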
In the MPP configuration, data is partitioned across multiple server nodes, with each node having its own memory and processors to manage its data locally. All communication takes place over the network; there is no disk-level sharing or contention. TDV recommends that the nodes be homogeneous in terms of CPU cores and memory.
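As a minimal sketch of that shared-nothing layout, the snippet below hash-partitions row keys across a hypothetical two-node cluster; the node names and keys are illustrative only.

import hashlib

NODES = ["node-1", "node-2"]   # illustrative homogeneous two-node cluster

def owning_node(key, nodes=NODES):
    """Assign a row to the node that manages it locally; rows move only over the network."""
    digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

for customer_id in (101, 102, 103, 104):
    print(customer_id, "->", owning_node(customer_id))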
The following image illustrates how the MPP Engine distributes processing of a query in parallel on two nodes of a cluster.
In a clustered environment, when a query is executed:
1. The MPP Engine generates an execution plan for the partitioned queries.
2. One of the nodes receives the top-level request for execution.
3. One or more child requests are generated as parallel virtual scan queries, and these are executed on all the nodes in the cluster.
4. Once the data is retrieved, the MPP Engine continues processing the operation; depending on the complexity of the query (for example, the number of JOINs and GROUP BYs), data is exchanged between the nodes.
5. Once the operation is complete, the node that received the top-level request receives the result set.
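The sketch below walks through these steps end to end for a distributed SUM with GROUP BY, using Python threads to stand in for the cluster nodes. The node names, local rows, and hash-based exchange are assumptions made for illustration; they are not the MPP Engine implementation.

from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

NODES = ["node-1", "node-2"]

# Rows that each node scans locally (step 3: parallel virtual scan queries).
local_rows = {
    "node-1": [("east", 4), ("west", 9), ("east", 1)],
    "node-2": [("west", 6), ("north", 2), ("east", 5)],
}

def scan_and_partial_aggregate(node):
    """Each node aggregates its local rows before any data leaves the node."""
    totals = Counter()
    for region, amount in local_rows[node]:
        totals[region] += amount
    return totals

def exchange(partials):
    """Step 4: redistribute partial results so each group key lands on one node."""
    shuffled = defaultdict(Counter)
    for totals in partials:
        for region, amount in totals.items():
            owner = NODES[hash(region) % len(NODES)]
            shuffled[owner][region] += amount
    return shuffled

# Steps 2-3: the coordinating node fans the request out; the nodes scan in parallel.
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    partials = list(pool.map(scan_and_partial_aggregate, NODES))

# Steps 4-5: exchange between nodes, then the coordinating node assembles the result set.
final = Counter()
for node_totals in exchange(partials).values():
    final.update(node_totals)
print(dict(final))   # {'east': 10, 'west': 15, 'north': 2}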
You can monitor the requests through either the Studio Manager or the Web Manager. In the Studio Manager, click the Requests panel to view the requests. In the Web Manager, choose Monitoring -> Requests.
Note: Some MPP Engine operations, such as the processing of the JOIN and GROUP BY clauses between the two nodes, cannot be monitored in the Manager interface because they are handled by the drillbit.