Configuring Connection Parameters

To run a New Workflow, you must have a shared file system between TIBCO Data Virtualization and the Spark cluster. The following shared file systems are available:

  • NFS shared drive when using the Apache Spark Standalone cluster.

  • Amazon S3 bucket when using the EMR cluster.

  • HDFS folder when using the Cloudera cluster.

The system administrator should create separate directories for models and outputs on the shared volume. Intermediate results from the Model operators are stored in the Models directory and the output tables from the operators are stored in the Output directory.
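
For example, the shared volume might contain two directories such as the following (the paths are illustrative; the actual locations are chosen by the system administrator):

/mnt/shared/models/ for the intermediate results from the Model operators

/mnt/shared/output/ for the output tables from the operators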

In the Configure Connection Parameters dialog, add the following parameters to store the intermediate results and output of the operators:

  • tds.datavirt.sharedDataVolumes: Specifies the directory where the system administrator wants to store the output tables from the operators. You can also provide multiple shared volumes. For more information, see Accessing Data in TIBCO Data Virtualization from TIBCO Data Science - Team Studio.

  • tds.runtime.sharedTempVolume: Specifies the directory where the system administrator wants to store the intermediate results from the Model operators.

Note: If you are using an EMR cluster as the Spark cluster, you can use the HDFS file system on EMR for better performance.

The URL format depends on the shared file system that you are using:

  • Amazon S3 bucket: s3a://<directory_path>

  • NFS shared drive: file://<directory_path>

  • HDFS shared drive: hdfs://<directory_path>

For example:

tds.datavirt.sharedDataVolumes = s3a://qat3/output2/

tds.runtime.sharedTempVolume = s3a://qat3/models/
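
Similarly, if you are using an NFS shared drive, the values might look like the following (the paths are illustrative):

tds.datavirt.sharedDataVolumes = file:///mnt/shared/output/

tds.runtime.sharedTempVolume = file:///mnt/shared/models/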

If you are using EMR as the Spark cluster, then add the following parameters:

  • spark.yarn.populateHadoopClasspath = true: Appends the YARN classpath on the EMR cluster.

  • spark.hadoop.fs.s3a.aws.credentials.provider = com.amazonaws.auth.InstanceProfileCredentialsProvider: Enables authentication to Amazon S3 through the EC2 instance profile credentials.

  • spark.yarn.stagingDir: If you are using the Amazon EMR 6.7 cluster, the administrator must configure the Hadoop YARN data source with this parameter. It specifies the URL of the staging directory that the execution submitter uses when submitting Spark jobs.

    Note: Enter a valid HDFS directory on the EMR cluster and make sure that the Hadoop user configured in the data source has read, write, and execute permissions on this directory.

    Example: spark.yarn.stagingDir=hdfs://<ip_address>/user/foobar/.sparkStaging
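
Taken together, the EMR connection parameters might look like the following (the staging directory path is illustrative):

spark.yarn.populateHadoopClasspath = true

spark.hadoop.fs.s3a.aws.credentials.provider = com.amazonaws.auth.InstanceProfileCredentialsProvider

spark.yarn.stagingDir = hdfs://<ip_address>/user/foobar/.sparkStaging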

If you are using an Apache Spark Standalone cluster, then add the following parameters:

  • tds.executions.sparkClusterVersion: Specifies the version of the Spark cluster.

    Example: 3.2.1

  • spark.dynamicAllocation.disabled: By default, the platform enables dynamic allocation. To disable dynamic allocation, set this parameter to true.
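
For example, to run against a Spark 3.2.1 Standalone cluster with dynamic allocation disabled (the version shown is illustrative; use the version of your own cluster):

tds.executions.sparkClusterVersion = 3.2.1

spark.dynamicAllocation.disabled = true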