Creating the Dataset

To create a set of rows in memory, you need to know how many to create. This value comes from the "number of things" parameter that the user provides while configuring the operator. Get that value from the params variable.

Before you begin
Set up the Spark job.
    Procedure
  1. Add the following code:
    // Use the constant defined in the Utils class to get the right parameter
    val numberOfRows = params.getIntValue(DatasetGeneratorUtils.numberRowsParamKey)

    After you have the value, you can check that it is correct by printing it to the TIBCO Data Science - Team Studio console at runtime. This is an example of how to use the listener object to communicate with the TIBCO Data Science - Team Studio console while an operator is running.

  2. Add the following code:
    listener.notifyMessage("number of rows is: " + numberOfRows)
  3. Build the rows of data. They should look like this:
    Thing, 1
    Thing, 2
    Thing, 3
    ...
    Thing, n
  4. Create a Seq of Rows and map each row number to a line that says "Thing". Continue until you reach the specified number of Rows, as shown in the following example:
    val rowSeq : Seq[Row] =
      Seq.range(1, numberOfRows+1).map(rowNumber => Row.fromTuple("Thing", rowNumber))
    Note: If you are still learning Scala, the above code might look unfamiliar to you. For Java users, a more familiar way to write this code would be:
    val rowArray = Array.ofDim[org.apache.spark.sql.Row](numberOfRows)
    var i = 0
    while(i < numberOfRows) {
        // Add one here so that the row numbers are indexed from one
        rowArray(i) = Row.fromTuple("Thing", i+1)
        i = i+1
    }

    Both snippets do the same thing, but the former takes advantage of Scala's functional features to express the logic more concisely.
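    If you want to convince yourself that the two styles produce the same rows, the following self-contained sketch compares them side by side. It uses plain `(String, Int)` tuples in place of `org.apache.spark.sql.Row` (an assumption made so the example runs without a Spark dependency); the structure of each loop is otherwise the same as above.

    ```scala
    // Spark-free sketch comparing the functional and imperative row-building
    // styles. Plain tuples stand in for org.apache.spark.sql.Row so the code
    // runs without Spark on the classpath.
    object RowEquivalenceSketch {
      // Functional style: build the range 1..numberOfRows and map each
      // row number to a ("Thing", rowNumber) pair
      def functionalStyle(numberOfRows: Int): Seq[(String, Int)] =
        Seq.range(1, numberOfRows + 1).map(rowNumber => ("Thing", rowNumber))

      // Imperative style: preallocate an array and fill it with a while loop
      def imperativeStyle(numberOfRows: Int): Seq[(String, Int)] = {
        val rowArray = Array.ofDim[(String, Int)](numberOfRows)
        var i = 0
        while (i < numberOfRows) {
          // Add one so that row numbers are indexed from one
          rowArray(i) = ("Thing", i + 1)
          i = i + 1
        }
        rowArray.toSeq
      }
    }
    ```

    Calling either method with the same argument yields the same sequence, for example `RowEquivalenceSketch.functionalStyle(3)` gives `("Thing", 1), ("Thing", 2), ("Thing", 3)`.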

    At this point, the rowSeq is just a Seq of data stored in memory. You must translate that to a Resilient Distributed Dataset (RDD), and then to an HDFS data set.

  5. To translate it to an RDD, which is the type of data structure Spark operates on, add the following code:
    val rowRDD = sparkContext.parallelize(rowSeq)
What to do next
Export the RDD to HDFS.