Spark Node Fusion

You can use Spark Node Fusion to run multiple operators within a single Spark job (also called a "Spark context"). This allows the job to run faster because it avoids creating a new job and persisting intermediate results to HDFS at each analytical step.

Note: Spark Node Fusion is applicable only to workflows that use Hadoop.
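The performance difference is easiest to see in plain Spark code. The following sketch is a conceptual illustration only, not Team Studio's actual implementation: the unfused version writes each intermediate result to HDFS and reads it back, while the fused version chains the same operators inside one Spark job. The paths and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NodeFusionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NodeFusionSketch").getOrCreate()

    // Unfused: each analytical step persists its result to HDFS,
    // and the next step pays the cost of reading it back.
    val step1 = spark.read.parquet("hdfs:///data/input") // hypothetical path
      .filter(col("amount") > 0)
    step1.write.parquet("hdfs:///tmp/step1")
    spark.read.parquet("hdfs:///tmp/step1")
      .groupBy("customer").agg(sum("amount").as("total"))
      .write.parquet("hdfs:///data/output_unfused")

    // Fused: the same operators chained in a single Spark job.
    // Intermediate results stay in memory; nothing is written to
    // HDFS until the final step.
    spark.read.parquet("hdfs:///data/input")
      .filter(col("amount") > 0)
      .groupBy("customer").agg(sum("amount").as("total"))
      .write.parquet("hdfs:///data/output_fused")

    spark.stop()
  }
}
```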

Runtime performance is crucial, regardless of your workflow's size or the number of operators it contains. It must be easy to enable node fusion on an existing workflow and then revert to the previous setting. You can do both with the Use Spark property. For more information, see Convert to Spark/Revert to Non-Spark.

When a workflow with Spark operators is run through the job scheduler, the results are not made visible to the user, because persisting them would slow the job significantly. If you want to view the results anyway, set Store Results to true before you run the job.

The following operators have been updated to use Spark Node Fusion. Prior to Team Studio version 6.4, these operators typically used the MapReduce or Pig execution frameworks.