Creating a Spark Job

In this section, you implement the logic of the Spark job.

This is where the actual algorithm of your operator lives. In our case, the Spark job creates a small list of rows in memory, converts it to a Spark SQL DataFrame, and then uses the SparkRuntimeUtils class to save that DataFrame and return an HdfsTabularDataset object corresponding to it.
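
The in-memory pattern itself is plain Spark SQL, independent of the SDK. The following standalone sketch (the column names and values are hypothetical, and it assumes an existing SparkContext named sparkContext) shows the rows-to-DataFrame step:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // A handful of rows built on the driver (hypothetical content).
    val rows = Seq(Row(1, "thing-1"), Row(2, "thing-2"), Row(3, "thing-3"))

    // The schema that describes each row.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = false)))

    // Distribute the in-memory rows and attach the schema.
    val sqlContext = new SQLContext(sparkContext)
    val dataFrame = sqlContext.createDataFrame(sparkContext.parallelize(rows), schema)

Because parallelize ships the whole collection from the driver, this pattern only works for data small enough to build in driver memory, which is why the note below caps the parameter.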

Note: This operator is intended to show you how to use the Team Studio Custom Operator SDK. The process used to create the dataset in this example is NOT scalable, because all of the data is created in the driver's memory before being distributed. Capping the "number of things" parameter at 100 ensures that the generated data always fits in driver memory. For a more practical example of dataset generation, see our SparkRandomDatasetGenerator example.

Prerequisites

You must have completed the prior tasks for building a source operator.

Procedure

  • Start by adding the following code:
    class SimpleDatasetGeneratorJob extends SparkIOTypedPluginJob[IONone, HdfsTabularDataset] {
    }
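
For orientation, here is a hedged sketch of how this class might be fleshed out in later steps. The onExecution signature, the import paths, and the SparkRuntimeUtils save helper (saveAsTSV here) are assumptions based on the Plugin SDK and can differ across SDK versions; treat this as a sketch, not the tutorial's final code.

    import scala.collection.mutable

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    import com.alpine.plugin.core.{OperatorListener, OperatorParameters}
    import com.alpine.plugin.core.io.{HdfsTabularDataset, IONone}
    import com.alpine.plugin.core.spark.SparkIOTypedPluginJob
    import com.alpine.plugin.core.spark.utils.SparkRuntimeUtils

    class SimpleDatasetGeneratorJob extends SparkIOTypedPluginJob[IONone, HdfsTabularDataset] {

      override def onExecution(sparkContext: SparkContext,
                               appConf: mutable.Map[String, String],
                               input: IONone,
                               operatorParameters: OperatorParameters,
                               listener: OperatorListener): HdfsTabularDataset = {
        val sparkUtils = new SparkRuntimeUtils(sparkContext)

        // Build a small dataset on the driver (same pattern as the earlier snippet).
        val rows = (1 to 10).map(i => Row(i, s"thing-$i"))
        val schema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = false)))
        val sqlContext = new SQLContext(sparkContext)
        val dataFrame = sqlContext.createDataFrame(sparkContext.parallelize(rows), schema)

        // Hypothetical output path; a real operator would read this from its parameters.
        val outputPath = "/tmp/simple_dataset_generator"

        // Assumption: saveAsTSV writes the DataFrame to HDFS and returns the matching
        // tabular dataset object. Check your SDK version for the exact helper and signature.
        sparkUtils.saveAsTSV(outputPath, dataFrame, Some(operatorParameters.operatorInfo))
      }
    }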