Creating the Dataset

To create a set of rows in memory, we need to know how many to create. This value comes from the "number of things" parameter that the user fills in while configuring the operator. We take that value from the params variable.
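
For reference, the lookup key for that parameter is the constant defined in the Utils class earlier in this tutorial. A minimal sketch of that definition is shown below; the exact key string used here is an assumption and must match whatever key you used when you added the parameter to the operator dialog.

    // Sketch of the shared constant in the Utils class. The key string is an
    // assumption; use the same key the operator dialog was defined with.
    object DatasetGeneratorUtils {
      val numberRowsParamKey = "numberOfThings"
    }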

Prerequisites

You must have set up the Spark job.

Procedure

  1. Add the following code:
    // Use the constant we defined in our Utils class to look up the right parameter
    val numberOfRows = params.getIntValue(DatasetGeneratorUtils.numberRowsParamKey)

    Once we have that value, we can visually confirm that it is correct by outputting it to the Team Studio console at runtime. This is an example of how to use the listener object to communicate with the Team Studio console while an operator is running.

  2. Add the following code:
    listener.notifyMessage("number of rows is: " + numberOfRows)
  3. Now you'll build the rows of data themselves. Remember, we want them to look like this:
    Thing, 1
    Thing, 2
    Thing, 3
    ...
    Thing, n
  4. We create a Seq of Rows by mapping each row number to a Row that contains the word "Thing" and that number, continuing until we reach the specified number of Rows, as shown in the following example:
    val rowSeq : Seq[Row] =
      Seq.range(1, numberOfRows+1).map(rowNumber => Row.fromTuple("Thing", rowNumber))
    Note: If you are still learning Scala, the above code might look unfamiliar to you. For Java users, a more familiar way to write this code would be:
    val rowArray = Array.ofDim[org.apache.spark.sql.Row](numberOfRows)
    var i = 0
    while(i < numberOfRows) {
        // The array is indexed from zero, so add one to produce row numbers starting at one
        rowArray(i) = Row.fromTuple("Thing", i+1)
        i = i+1
    }

    Both snippets do the same thing, but the former takes advantage of Scala's functional features to express the same logic more concisely.

    At this point, rowSeq is just a Seq of data stored in memory. We need to translate it to an RDD (Resilient Distributed Dataset), and then to an HDFS dataset.

  5. To translate it to an RDD, which is the type of data structure Spark operates on, add the following code:
    val rowRDD = sparkContext.parallelize(rowSeq)

    Now we are ready to export it to HDFS.
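
If you want to sanity-check the row-generation logic outside Team Studio, the following stand-alone sketch runs the same Seq-to-RDD steps against a local SparkContext. Everything specific to this sketch (the object name, the local master setting, and the hard-coded row count) is an illustrative assumption and is not part of the operator code itself.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.Row

    // Stand-alone sketch for local verification only; the operator itself receives
    // its SparkContext and row count from Team Studio at runtime.
    object RowGenerationCheck {
      def main(args: Array[String]): Unit = {
        val sparkContext = new SparkContext(
          new SparkConf().setAppName("row-generation-check").setMaster("local[*]"))
        val numberOfRows = 5 // hard-coded here; the operator reads this from params
        val rowSeq: Seq[Row] =
          Seq.range(1, numberOfRows + 1).map(rowNumber => Row.fromTuple(("Thing", rowNumber)))
        val rowRDD = sparkContext.parallelize(rowSeq)
        rowRDD.collect().foreach(println) // prints [Thing,1] through [Thing,5]
        sparkContext.stop()
      }
    }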