Setting Up the Spark Job

Only one function needs to be overridden here: onExecution().

This function contains all the code we need for the Spark job. It takes in five parameters:
  • sparkContext - The Spark context that is created when the job is submitted.
  • appConf - A map that contains system-related parameters (rather than parameters for the operator itself). This includes all Spark parameters and workflow-level variables.
  • input - The IOBase object defined as the input to the operator. In our example, this operator takes no input, so this is set to IONone.
  • params - The operator parameters chosen by the user, passed from the GUI node.
  • listener - A listener object that allows you to send messages to the Team Studio GUI during the Spark job. You can use this to post error messages or provide status reports in the Team Studio console.

Because our operator returns a tabular HDFS file, we set the output type to HdfsTabularDataset.
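
The input and output types are fixed by the declaration of the Spark job class itself. As a rough sketch (the base-class name SparkIOTypedPluginJob, the example class name HelloWorldSparkJob, and the import paths are assumptions; check your SDK version for the exact names), the declaration looks like this:

    // Base class and type names assumed from the custom operator SDK;
    // verify the package paths in your SDK version.
    class HelloWorldSparkJob extends SparkIOTypedPluginJob[IONone, HdfsTabularDataset] {
      // onExecution() is overridden here. Its input parameter is typed IONone
      // and its return type is HdfsTabularDataset, matching the type arguments above.
    }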

Procedure

  • Create the skeleton of the onExecution() method as follows:
    override def onExecution(
                             sparkContext: SparkContext,
                             appConf: mutable.Map[String, String],
                             input: IONone,
                             params: OperatorParameters,
                             listener: OperatorListener): HdfsTabularDataset = {
      // Job logic is added in the following steps; ??? is a placeholder
      // so the skeleton compiles until the body is filled in.
      ???
    }
    We need to do the following in this Spark job (a sketch of both steps follows the list):
    1. Create a small list of data in memory.
    2. Transform it into a DataFrame with the output schema we defined so that the custom operator framework can export it as an HdfsTabularDataset.
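    As a rough sketch of those two steps (the column names and types below are placeholders; use the output schema you defined earlier), the body of onExecution() could build the DataFrame like this:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Step 1: a small list of data created in memory.
    val rows = Seq(Row("alpha", 1), Row("beta", 2), Row("gamma", 3))

    // Step 2: shape it into a DataFrame. The schema here is a placeholder;
    // it should match the output schema defined for the operator.
    val schema = StructType(Seq(
      StructField("label", StringType, nullable = false),
      StructField("value", IntegerType, nullable = false)
    ))
    val sqlContext = new SQLContext(sparkContext)
    val dataFrame = sqlContext.createDataFrame(sparkContext.parallelize(rows), schema)

    // The custom operator framework's utilities can then write this DataFrame
    // to HDFS and return it as an HdfsTabularDataset.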