Creating a Spark Job

In this section, you implement the logic of the Spark job.

This is where the actual algorithm of your operator lives. In our case, the Spark job creates a small list of rows in memory, converts it to a Spark SQL DataFrame, and then uses the SparkRuntimeUtils class to save that DataFrame and return an HdfsTabularDataset object corresponding to it.
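
The in-memory pattern itself is plain Spark SQL, independent of the SDK. The following standalone sketch (the column names and values are hypothetical, and it assumes an existing SparkContext named sparkContext) shows the rows-to-DataFrame step:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // A handful of rows built on the driver (hypothetical content).
    val rows = Seq(Row(1, "thing-1"), Row(2, "thing-2"), Row(3, "thing-3"))

    // The schema that describes each row.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = false)))

    // Distribute the in-memory rows and attach the schema.
    val sqlContext = new SQLContext(sparkContext)
    val dataFrame = sqlContext.createDataFrame(sparkContext.parallelize(rows), schema)

Because parallelize ships the whole collection from the driver, this pattern only works for data small enough to build in driver memory, which is why the note below caps the parameter.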

Note: This operator is intended to show you how to use the Team Studio Custom Operator SDK. The process used to create the dataset in this example is NOT scalable, because all of the data is created in the driver's memory before being distributed. Capping the "number of things" parameter at 100 ensures that the generated data always fits in driver memory. For a more practical example of dataset generation, see our SparkRandomDatasetGenerator example.

Prerequisites

You must have completed the prior tasks for building a source operator.

Procedure

  • Start by adding the following code:
    class SimpleDatasetGeneratorJob extends SparkIOTypedPluginJob[IONone, HdfsTabularDataset] {
    }
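
For orientation, here is a hedged sketch of how this class might be fleshed out in later steps. The onExecution signature, the import paths, and the SparkRuntimeUtils save helper (saveAsTSV here) are assumptions based on the Plugin SDK and can differ across SDK versions; treat this as a sketch, not the tutorial's final code.

    import scala.collection.mutable

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    import com.alpine.plugin.core.{OperatorListener, OperatorParameters}
    import com.alpine.plugin.core.io.{HdfsTabularDataset, IONone}
    import com.alpine.plugin.core.spark.SparkIOTypedPluginJob
    import com.alpine.plugin.core.spark.utils.SparkRuntimeUtils

    class SimpleDatasetGeneratorJob extends SparkIOTypedPluginJob[IONone, HdfsTabularDataset] {

      override def onExecution(sparkContext: SparkContext,
                               appConf: mutable.Map[String, String],
                               input: IONone,
                               operatorParameters: OperatorParameters,
                               listener: OperatorListener): HdfsTabularDataset = {
        val sparkUtils = new SparkRuntimeUtils(sparkContext)

        // Build a small dataset on the driver (same pattern as the earlier snippet).
        val rows = (1 to 10).map(i => Row(i, s"thing-$i"))
        val schema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = false)))
        val sqlContext = new SQLContext(sparkContext)
        val dataFrame = sqlContext.createDataFrame(sparkContext.parallelize(rows), schema)

        // Hypothetical output path; a real operator would read this from its parameters.
        val outputPath = "/tmp/simple_dataset_generator"

        // Assumption: saveAsTSV writes the DataFrame to HDFS and returns the matching
        // tabular dataset object. Check your SDK version for the exact helper and signature.
        sparkUtils.saveAsTSV(outputPath, dataFrame, Some(operatorParameters.operatorInfo))
      }
    }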