Configuring Connection Parameters
To run a New Workflow, you must have a shared file system between TIBCO Data Virtualization and the Spark cluster. The following shared file systems are available:
- NFS shared drive when using the Apache Spark Standalone cluster.
- Amazon S3 bucket when using the EMR cluster.
- HDFS folder when using the Cloudera cluster.
The system administrator should create separate directories for models and outputs on the shared volume. Intermediate results from the Model operators are stored in the Models directory, and the output tables from the operators are stored in the Output directory.
In the Configure Connection Parameters dialog, add the following parameters to store the intermediate results and output of the operators:
| Parameter | Description |
|---|---|
| tds.datavirt.sharedDataVolumes | This parameter specifies the directory where the system administrator wants to store the output tables from the operators. You can also provide multiple shared volumes. For more information, see Accessing Data in TIBCO Data Virtualization from TIBCO Data Science - Team Studio. |
| tds.runtime.sharedTempVolume | This parameter specifies the directory where the system administrator wants to store the intermediate results from the Model operators. Note: If you are using an EMR cluster as the Spark cluster, you can use the HDFS file system on EMR for better performance. |
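The following is a minimal sketch of how these two parameters might be entered in the Configure Connection Parameters dialog. The mount points /mnt/shared/outputs and /mnt/shared/models are hypothetical placeholders, not values from the product documentation; replace them with directories that exist on the volume shared between TIBCO Data Virtualization and the Spark cluster.

```
# Hypothetical example values; adjust the paths to directories
# created by the system administrator on the shared volume.
tds.datavirt.sharedDataVolumes = /mnt/shared/outputs
tds.runtime.sharedTempVolume = /mnt/shared/models
```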
If you are using EMR as the Spark cluster, then add the following parameters:

| Parameter | Description |
|---|---|
| spark.yarn.populateHadoopClasspath = true | This parameter appends the YARN classpath on the EMR cluster. |
| spark.hadoop.fs.s3a.aws.credentials.provider = com.amazonaws.auth.InstanceProfileCredentialsProvider | This parameter enables authentication to Amazon S3 by using the EC2 instance profile credentials. |
| spark.yarn.stagingDir | If you are using the Amazon EMR 6.7 cluster, then the administrator must configure the Hadoop YARN data source with this parameter. This parameter specifies the URL of the staging directory that the execution submitter uses while submitting the Spark jobs. Note: Enter a valid HDFS directory on the EMR cluster and make sure that the Hadoop user configured in the data source has read, write, and run permissions on this directory. |
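As a hedged illustration, the EMR parameters might look like the sketch below. The staging directory URL hdfs:///user/hadoop/spark-staging is a hypothetical placeholder; substitute an HDFS directory on your EMR cluster for which the Hadoop user configured in the data source has read, write, and run permissions.

```
# EMR example; the stagingDir value is a hypothetical placeholder.
spark.yarn.populateHadoopClasspath = true
spark.hadoop.fs.s3a.aws.credentials.provider = com.amazonaws.auth.InstanceProfileCredentialsProvider
spark.yarn.stagingDir = hdfs:///user/hadoop/spark-staging
```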
If you are using an Apache Spark Standalone cluster, then add the following parameters:
| Parameter | Description |
|---|---|
| tds.executions.sparkClusterVersion | This parameter specifies the version of the Spark cluster. Example: 3.2.1 |
| spark.dynamicAllocation.disabled | By default, the platform enables dynamic allocation. If you want to disable dynamic allocation, set this parameter to true. |
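A minimal sketch of the Standalone-cluster parameters, assuming Spark 3.2.1 (the example version shown above) and that dynamic allocation should be turned off; adjust both values to match your cluster:

```
# Standalone cluster example; 3.2.1 is the version used as an example in this guide.
tds.executions.sparkClusterVersion = 3.2.1
# Setting this to true disables dynamic allocation (an assumption based on the parameter name).
spark.dynamicAllocation.disabled = true
```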