Exporting the Dataset
In this task, you create a SparkRuntimeUtils object and then use its utility functions to communicate with the Spark context. The object makes saving and exporting the DataFrame much easier.
Before you begin
Creating the Dataset.
Procedure
- Create the SparkRuntimeUtils object by adding the following code:
val sparkUtils = new SparkRuntimeUtils(sparkContext)
- TIBCO Data Science - Team Studio needs to know where on HDFS to store the file. That information comes from the user's configuration of the operator parameters. Get that information from the params variable by adding the following code:

// retrieve the storage parameters using the HdfsParameterUtils class
val outputPath = HdfsParameterUtils.getOutputPath(params)
val overwrite = HdfsParameterUtils.getOverwriteParameterValue(params)
val storageFormat = HdfsParameterUtils.getHdfsStorageFormatType(params)
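The HdfsParameterUtils lookups above read values that the user set in the operator's parameter dialog. Conceptually, this amounts to looking up keys in a key/value store with sensible defaults. The sketch below illustrates that idea in plain Scala; the key names and default values are illustrative assumptions, not the SDK's actual keys.

```scala
// Hypothetical stand-in for the operator's parameter store.
// Key names and values here are illustrative only.
val params = Map(
  "outputDirectory" -> "/tmp/operator-output",
  "overwrite"       -> "true"
)

// Look up each setting, falling back to a default when it is absent,
// mirroring what the HdfsParameterUtils helpers do for real parameters.
val outputPath = params.getOrElse("outputDirectory", "/tmp/default")
val overwrite  = params.getOrElse("overwrite", "false").toBoolean

println(s"Writing to $outputPath (overwrite = $overwrite)")
```

In the real operator you should always go through HdfsParameterUtils rather than reading raw keys, because the helpers handle validation and defaulting consistently across operators.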
- Use the DatasetGeneratorUtils class to access the output schema:

val outputSchema = DatasetGeneratorUtils.getOutputSchema(params)
- Export the RDD to a Spark SQL DataFrame so that our SparkUtils can read it, and then save it to HDFS using the parameters retrieved in step 2, by adding the following code:

val sqlContext = new SQLContext(sparkContext)

/*
 * Create a Spark DataFrame from the rowRDD and a Spark SQL schema.
 * The SparkRuntimeUtils class provides methods to convert the tabular
 * schema from Alpine's format to the Spark SQL schema. By creating one
 * schema and converting it, you ensure that the runtime and design-time
 * schemas match.
 */
val outputDF = sqlContext.createDataFrame(
  rowRDD,
  sparkUtils.convertTabularSchemaToSparkSQLSchema(outputSchema)
)

/*
 * Use the SparkRuntimeUtils class to save the DataFrame and create an
 * HdfsTabularDataSet object.
 */
sparkUtils.saveDataFrame(
  outputPath,
  outputDF,
  storageFormat,
  overwrite,
  sourceOperatorInfo = None,
  addendum = Map[String, AnyRef](),
  TSVAttributes.defaultCSV
)
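The key idea in the step above is that one schema definition drives both design time and runtime: convertTabularSchemaToSparkSQLSchema maps each tabular column definition to its Spark SQL equivalent. The following self-contained sketch shows that kind of name-and-type mapping in plain Scala, without Spark on the classpath; ColumnDef, typeMapping, and convertSchema are illustrative names invented for this example, not part of the SDK.

```scala
// Simplified, hypothetical model of a tabular column definition.
case class ColumnDef(name: String, columnType: String)

// Illustrative mapping from tabular column types to Spark SQL type names.
// (The real utility produces StructField instances inside a StructType.)
val typeMapping = Map(
  "Int"    -> "IntegerType",
  "Long"   -> "LongType",
  "Double" -> "DoubleType",
  "String" -> "StringType"
)

// Convert each column to a (name, sparkSqlTypeName) pair, defaulting
// unknown types to StringType, as a lowest-common-denominator fallback.
def convertSchema(columns: Seq[ColumnDef]): Seq[(String, String)] =
  columns.map(c => (c.name, typeMapping.getOrElse(c.columnType, "StringType")))

val schema = Seq(ColumnDef("id", "Long"), ColumnDef("name", "String"))
println(convertSchema(schema))
// → List((id,LongType), (name,StringType))
```

Deriving the Spark SQL schema from the same definition used at design time, rather than writing it twice, is what prevents the runtime output from drifting out of sync with the columns the GUI promised.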
What to do next
Now that you have written the Signature, the GUI Node, and the Runtime Spark job, you are ready to compile and test your operator on TIBCO Data Science - Team Studio. If you are not familiar with this process, follow the instructions in Installing the Custom Sample Operator for your Version. To see a finished copy of the code, see SimpleDatasetGenerator.scala. (You created this file as part of Setting Up Your Environment.)