Exporting the Dataset

The first thing to do is create a SparkRuntimeUtils object.

This object provides utility functions for communicating with the Spark context, and it makes saving and exporting the DataFrame much easier.

Prerequisites

You must have created the dataset (the rowRDD of Spark SQL Rows built in the previous section).
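
The code in this section assumes that rowRDD is already in scope. The following is a purely illustrative stand-in, not part of the tutorial's code; the real RDD is built in the previous section and its rows must match the output schema:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Hypothetical placeholder for the dataset created in the previous section:
    // an RDD of Rows whose values line up with the output schema's columns.
    val rowRDD: RDD[Row] = sparkContext.parallelize(Seq(Row("a", 1), Row("b", 2)))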

Procedure

  1. Add the following code:
    val sparkUtils = new SparkRuntimeUtils(sparkContext)
  2. Team Studio needs to know where on HDFS to store our new file; that information comes from the user's configuration of the operator parameters. Retrieve it from the params variable by adding the following code:
    // retrieve the storage parameters using the HdfsParameterUtils class
    val outputPath = HdfsParameterUtils.getOutputPath(params)
    val overwrite = HdfsParameterUtils.getOverwriteParameterValue(params)
    val storageFormat = HdfsParameterUtils.getHdfsStorageFormatType(params)
  3. Use the DatasetGeneratorUtils class we created earlier to access the output schema:
    val outputSchema = DatasetGeneratorUtils.getOutputSchema(params)
  4. Finally, convert the RDD to a Spark SQL DataFrame so that our SparkRuntimeUtils object can work with it, and then save it to HDFS using the parameters we just retrieved, by adding the following code. (A sketch showing how all of these snippets fit together follows this procedure.)
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sparkContext)
     
    /*
     Create a Spark DataFrame from the rowRDD and a Spark SQL schema. The SparkRuntimeUtils
     class provides methods to convert the tabular schema from Alpine's format to the
     Spark SQL schema. By creating one schema and converting it, we ensure that the
     runtime and design-time schemas will match.
     */
    val outputDF =
      sqlContext.createDataFrame(rowRDD, sparkUtils.convertTabularSchemaToSparkSQLSchema(outputSchema))
     
    /*
     * Use the SparkRuntimeUtils class to save the DataFrame and create an
     * HdfsTabularDataSet object.
     */
    sparkUtils.saveDataFrame(
      outputPath,
      outputDF,
      storageFormat,
      overwrite,
      None,                   // sourceOperatorInfo
      Map[String, AnyRef](),  // addendum
      TSVAttributes.defaultCSV
    )
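
For orientation, the sketch below shows how these snippets might fit together in a single method. It is illustrative only: the surrounding runtime class and its real method signature come from the earlier Runtime section, and the name exportDataset is a hypothetical helper introduced here, not an SDK API.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    // SDK classes (SparkRuntimeUtils, HdfsParameterUtils, OperatorParameters,
    // TSVAttributes) are imported as in the earlier sections of this tutorial.

    // Hypothetical helper gathering the steps above in order.
    def exportDataset(sparkContext: SparkContext,
                      params: OperatorParameters,
                      rowRDD: RDD[Row]) = {
      val sparkUtils = new SparkRuntimeUtils(sparkContext)

      // Where and how to write, from the user's parameter choices.
      val outputPath    = HdfsParameterUtils.getOutputPath(params)
      val overwrite     = HdfsParameterUtils.getOverwriteParameterValue(params)
      val storageFormat = HdfsParameterUtils.getHdfsStorageFormatType(params)

      // One schema, defined once and converted, so design time and runtime agree.
      val outputSchema = DatasetGeneratorUtils.getOutputSchema(params)
      val sqlContext = new SQLContext(sparkContext)
      val outputDF = sqlContext.createDataFrame(
        rowRDD,
        sparkUtils.convertTabularSchemaToSparkSQLSchema(outputSchema))

      // Save to HDFS and return the resulting dataset object.
      sparkUtils.saveDataFrame(
        outputPath,
        outputDF,
        storageFormat,
        overwrite,
        None,                   // sourceOperatorInfo
        Map[String, AnyRef](),  // addendum
        TSVAttributes.defaultCSV)
    }

In the finished operator, the value returned by saveDataFrame (an HdfsTabularDataSet describing the new file) is typically what the runtime's execution method returns, so that downstream operators can consume the output.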

Now that you've written the Signature, GUI Node, and the Runtime Spark job, you are ready to compile and test your operator on Team Studio. Follow the instructions in Compiling and Running the Sample Operators if you are not familiar with this process. To see a finished copy of the code, see SimpleDatasetGenerator.scala.