Initializing PySpark for Spark Cluster

If you have configured a Spark compute cluster in the Chorus Data section, you can use PySpark to read and write table data. The notebook can then be exposed as an operator that downstream operators in a workflow can use.

This method is far more efficient than the alternatives for medium and large data sets, and it is the only viable option for reading data sets larger than a few GB.

Note:

Only one data set is initialized per data source.

You can initialize and use PySpark in your Jupyter Notebooks for Team Studio.

To do this, start in the Notebooks environment in TIBCO Data Science – Team Studio. This option reads data that is available on an Apache Spark 3.2 or later cluster.
    Procedure
  1. Create a new notebook.
  2. Click Data, and then select Initialize PySpark for Spark Cluster.

    Data tab - Initialize PySpark for Spark Cluster options

  3. A SELECT DATA SOURCE TO CONNECT dialog appears. Select an existing Spark Cluster data source, and then click Add Data Source.

    SELECT DATA SOURCE TO CONNECT dialog

  4. A snippet of code is inserted into your notebook. It handles the communication between the data source and your notebook. If you want to read more data sources, repeat steps 1 through 3 (a Python Execute operator supports a maximum of three inputs). To run this code, press Shift+Enter or click Run.

    Now, you can run other commands by referring to the comments in the inserted code.

    The commands use the cc object, an instance of the ChorusCommander class created with the parameters required for its methods to work correctly. The generated code sets the sqlContext argument of the cc.read_input_table method call to the initialized Spark session. You can use the spark_options dictionary argument to pass additional options; its driver property is set to the auto-detected JDBC driver class.
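
    The sketch below suggests the general shape of this generated code. It is illustrative only: the actual variable names, table identifiers, Spark session variable, and driver class come from the snippet that Team Studio inserts, and my_table, my_table_props, spark, and com.example.jdbc.Driver are placeholders.

      # Illustrative sketch only; the real snippet is generated by Team Studio.
      # cc is the ChorusCommander instance created by the inserted code.

      # The _props dictionary is passed as spark_options; its driver value is
      # the auto-detected JDBC driver class (placeholder shown here).
      # my_table_props = {'driver': 'com.example.jdbc.Driver'}

      # Read the data set into a Spark Data Frame; sqlContext is set to the
      # initialized Spark session (placeholder name spark).
      # my_table_df = cc.read_input_table('my_table',
      #                                   sqlContext=spark,
      #                                   spark_options=my_table_props)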

  5. To read a data set, uncomment the lines in the generated code for that data set, along with the _props variable that holds the corresponding spark_options.

  6. To use the notebook as a Python Execute operator, change the use_input_substitution parameter from False to True and add the execution_label parameter for each data set to be read. The execution_label value starts at the string '1', followed by '2' and '3' for subsequent data sets. For more information, see help(cc.read_input_table).
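
    For example, a pair of read calls adjusted for use as Python Execute operator inputs might look like the following sketch. The table names, _props dictionaries, and the spark session variable are placeholders; the exact signature and argument order are described by help(cc.read_input_table).

      # Sketch: with use_input_substitution=True, the operator's first and
      # second workflow inputs substitute for these tables at run time.
      first_props = {'driver': 'com.example.jdbc.Driver'}    # placeholder _props dictionary
      second_props = {'driver': 'com.example.jdbc.Driver'}   # placeholder _props dictionary
      first_df = cc.read_input_table('my_first_table',
                                     sqlContext=spark,
                                     spark_options=first_props,
                                     use_input_substitution=True,
                                     execution_label='1')
      second_df = cc.read_input_table('my_second_table',
                                      sqlContext=spark,
                                      spark_options=second_props,
                                      use_input_substitution=True,
                                      execution_label='2')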

  7. The generated cc.read_input_table method call returns a Spark Data Frame. You can modify, copy, or perform any other operations on the Data Frame as required.
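
    For instance, standard PySpark transformations can be applied to the returned Data Frame before it is written out. In the sketch below, first_df is the Data Frame read above, and the column names status and amount are placeholders.

      # Standard PySpark transformations; column names are placeholders.
      from pyspark.sql import functions as F

      output_df = (first_df
                   .filter(F.col('status') == 'ACTIVE')
                   .withColumn('amount_rounded', F.round(F.col('amount'), 2))
                   .select('status', 'amount_rounded'))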

  8. Once the required output Spark Data Frame has been created, write it to a target table by using the cc.write_output_table method.

    • To enable the use of output in downstream operators, set use_output_substitution=True.

    • To drop any existing table having the same name as the target, set the drop_if_exists parameter to True.

    • Use the spark_options argument, as in read_input_table, to supply the driver. You can also write to a different target data source by setting cc.datasource_name = '...' to the required value and updating the driver in spark_options accordingly.

    For more information, see help(cc.write_output_table).
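
    A hedged sketch of such a write call, using placeholder names and the parameters described above, follows; check help(cc.write_output_table) for the exact signature and argument order.

      # Sketch: write the output Data Frame to a target table. The table name,
      # data source name, and driver class are placeholders, and the argument
      # order shown here is assumed.
      # cc.datasource_name = 'my_target_datasource'  # only when writing to a different data source
      output_props = {'driver': 'com.example.jdbc.Driver'}
      cc.write_output_table(output_df,
                            'my_output_table',
                            use_output_substitution=True,  # expose the output to downstream operators
                            drop_if_exists=True,           # drop an existing table with the same name
                            spark_options=output_props)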

  9. Run the notebook manually so that the metadata can be determined and the notebook can be used as an operator in legacy workflows.