val rowSeq : Seq[Row] =
Seq.range(1, numberOfRows+1).map(rowNumber => Row.fromTuple("Thing", rowNumber))
Note: If you are still learning Scala, the above code might look unfamiliar to you. For Java users, a more familiar way to write this code would be:
val rowArray = Array.ofDim[org.apache.spark.sql.Row](numberOfRows)
var i = 0
while (i < numberOfRows) {
  // Adding one here so that the row number is indexed from one,
  // while the array itself is indexed from zero
  rowArray(i) = Row.fromTuple("Thing", i + 1)
  i = i + 1
}
Both snippets produce the same rows, but the former uses Scala's functional features to express the logic more concisely.
At this point,
rowSeq is just a
Seq of data stored in the driver's local memory. We need to translate it into an RDD (Resilient Distributed Dataset), and then write it out as an HDFS dataset.
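A minimal sketch of those two steps might look like the following. It assumes a SparkSession named spark is available in the driver program, uses a two-column schema invented here to match the ("Thing", rowNumber) tuples, and writes to a hypothetical HDFS path; the column names and output location are placeholders, not anything prescribed by the original text.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumes a SparkSession is already configured for your cluster;
// in a standalone driver you would build one roughly like this.
val spark: SparkSession = SparkSession.builder()
  .appName("RowSeqToHdfs")
  .getOrCreate()

val numberOfRows = 100
val rowSeq: Seq[Row] =
  Seq.range(1, numberOfRows + 1).map(rowNumber => Row.fromTuple("Thing", rowNumber))

// Step 1: distribute the in-memory Seq across the cluster as an RDD.
val rowRdd = spark.sparkContext.parallelize(rowSeq)

// Step 2: attach a schema (column names are illustrative) and
// write the result out as an HDFS dataset. The path is a placeholder.
val schema = StructType(Seq(
  StructField("label", StringType, nullable = false),
  StructField("rowNumber", IntegerType, nullable = false)
))
spark.createDataFrame(rowRdd, schema)
  .write
  .parquet("hdfs:///tmp/things")
```

Until the write (or another action) runs, parallelize only defines the distributed dataset; Spark evaluates it lazily.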