Spark Node Fusion

You can use Spark Node Fusion to run multiple operators within a single Spark job (also called a "Spark context"). This allows the job to run faster because it avoids creating a new job and persisting intermediate results to HDFS at each analytical step.

Note: Spark Node Fusion is applicable only to workflows that use Hadoop.
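The performance difference is easiest to see in plain Spark code. The following sketch is a conceptual illustration only, not Team Studio's actual implementation: the unfused version writes each intermediate result to HDFS and reads it back, while the fused version chains the same operators inside one Spark job. The paths and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NodeFusionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NodeFusionSketch").getOrCreate()

    // Unfused: each analytical step persists its result to HDFS,
    // and the next step pays the cost of reading it back.
    val step1 = spark.read.parquet("hdfs:///data/input") // hypothetical path
      .filter(col("amount") > 0)
    step1.write.parquet("hdfs:///tmp/step1")
    spark.read.parquet("hdfs:///tmp/step1")
      .groupBy("customer").agg(sum("amount").as("total"))
      .write.parquet("hdfs:///data/output_unfused")

    // Fused: the same operators chained in a single Spark job.
    // Intermediate results stay in memory; nothing is written to
    // HDFS until the final step.
    spark.read.parquet("hdfs:///data/input")
      .filter(col("amount") > 0)
      .groupBy("customer").agg(sum("amount").as("total"))
      .write.parquet("hdfs:///data/output_fused")

    spark.stop()
  }
}
```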

Runtime performance is crucial, regardless of your workflow's size or the number of operators it contains. It must be easy to enable node fusion on an existing workflow and then revert to the previous setting. You can do both with the Use Spark property. For more information, see Convert to Spark/Revert to Non-Spark.

When a workflow with Spark operators is run through the job scheduler, the results are not made visible to the user, because persisting them would slow the job significantly. If you want to view the results anyway, set Store Results to true before you run the job.

The following operators have been updated to use Spark Node Fusion. Prior to Team Studio version 6.4, these operators typically used the MapReduce or Pig execution frameworks.