Set Operations (HD)

Information at a Glance

Parameter	Description
Category	Transform
Data source type	HD
Send output to other operators	Yes
Data processing tool	MapReduce / Spark

Note: The Set Operations (HD) operator is for Hadoop data only. For database data, use the Set Operations (DB) operator.

Input

Two or more data sets.

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Sets	Click Define Sets to display the Define Sets dialog. For more information, see the Define Sets dialog.

Store Results?	Specifies whether to store the results. true - results are stored. false - the data set is passed to the next operator without storing.
Results Location	The HDFS directory where the results of the operator are stored. This is the main directory, the sub-directory of which is specified in Results Name. Click Choose File to open the Hadoop File Explorer dialog and browse to the storage location. Do not edit the text directly.
Results Name	The name of the file in which to store the results.
Overwrite	Specifies whether to delete existing data at that path and file name. Yes - if the path exists, the existing file is deleted and the results are saved. No - the operator fails if the path already exists.

Storage Format	Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.
Compression	Select the type of compression for the output. The available compression options depend on the storage format (Parquet or Avro).
Use Spark	If Yes (the default), uses Spark to optimize calculation time.
Advanced Spark Settings Automatic Optimization	Yes - use the default Spark optimization settings. No - provide customized Spark optimization settings; click Edit Settings to customize them. For more information, see the Advanced Settings dialog.
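
The interaction between Store Results?, Results Location, Results Name, and Overwrite can be easier to follow as code. The following is a minimal sketch only, not the operator's implementation: it uses a local directory as a stand-in for HDFS, and the function names `resolve_output_path` and `store_results` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_output_path(results_location: str, results_name: str) -> Path:
    """Combine the main directory (Results Location) with Results Name."""
    return Path(results_location) / results_name


def store_results(rows, results_location, results_name, overwrite):
    """Write rows as comma-separated lines, honoring the Overwrite setting.

    Assumption: a local path stands in for the HDFS path.
    """
    target = resolve_output_path(results_location, results_name)
    if target.exists():
        if not overwrite:
            # Overwrite = No: fail if the path already exists.
            raise FileExistsError(f"Output path already exists: {target}")
        # Overwrite = Yes: delete the existing data before saving.
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
    target.parent.mkdir(parents=True, exist_ok=True)
    lines = [",".join(str(value) for value in row) for row in rows]
    target.write_text("\n".join(lines) + "\n")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = store_results([(1, "a")], tmp, "set_ops_out", overwrite=False)
        print("stored at", first)
        try:
            store_results([(2, "b")], tmp, "set_ops_out", overwrite=False)
        except FileExistsError as err:
            print("Overwrite = No:", err)
        store_results([(2, "b")], tmp, "set_ops_out", overwrite=True)
        print("Overwrite = Yes: existing data replaced")
```

When Store Results? is false, nothing like `store_results` would run at all; the data set is simply passed to the next operator.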

Visual Output

The data rows of the output table or view (up to 200 rows are displayed).

Data Output

A data set of the joined data sets. This operator always creates a CSV output.
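
The set operations themselves are chosen in the Define Sets dialog, which this page does not describe. As a rough illustration of what combining data sets with set semantics and emitting CSV can look like, the sketch below assumes three common set operations (union, intersect, except); these are an assumption, not a list confirmed by this page, and all function names are hypothetical.

```python
import csv
import io


def _distinct(rows):
    """Return rows with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def union(left, right):
    """Distinct rows that appear in either data set."""
    return _distinct(list(left) + list(right))


def intersect(left, right):
    """Distinct rows that appear in both data sets."""
    right_rows = set(right)
    return _distinct(row for row in left if row in right_rows)


def except_(left, right):
    """Distinct rows that appear in left but not in right."""
    right_rows = set(right)
    return _distinct(row for row in left if row not in right_rows)


def to_csv(rows):
    """Serialize rows to CSV text, mirroring the operator's CSV output."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "b")]
    right = [(2, "b"), (3, "c")]
    print(to_csv(union(left, right)), end="")
```

Rows are modeled as tuples so that they can be compared and hashed as whole records, which is what set semantics require.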