Big Data Import Using Apache Spark and HDFS

The big data import feature achieves high throughput by running the import on an Apache Spark cluster, which processes the file in parallel across its worker nodes. The import file is also uploaded to the Apache Hadoop Distributed File System (HDFS) so that very large files can be read with high throughput.
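As an illustration, the parallel read can be pictured with the following minimal Java sketch. The master URL, application name, HDFS path, and file format are placeholders and are not values defined by TIBCO MDM.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HdfsImportReadSketch {
    public static void main(String[] args) {
        // Connect to the Spark cluster; the master URL and application name are illustrative.
        SparkSession spark = SparkSession.builder()
                .appName("mdm-bulk-import")
                .master("spark://spark-master:7077")
                .getOrCreate();

        // Read the import file from HDFS; the path is a placeholder.
        // Spark splits the file into partitions that the worker nodes process in parallel.
        Dataset<Row> records = spark.read()
                .option("header", "true")
                .csv("hdfs://namenode:8020/mdm/import/products.csv");

        System.out.println("Records to import: " + records.count());
        spark.stop();
    }
}
```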

Before you start a big data import, Apache Spark and HDFS must be up and running. The import can be triggered from any of the TIBCO MDM servers. After the import is triggered, the Spark cluster decides how to balance the load among the available worker nodes. The Spark worker nodes need to know the database and cache configuration. After the Spark cluster completes the jobs, the import event is updated in TIBCO MDM.
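How the database and cache configuration reaches the worker nodes is installation specific. The sketch below only illustrates one common approach, passing the endpoints as Spark configuration properties; the property names, URLs, and the local master are made up for illustration and are not TIBCO MDM settings.

```java
import org.apache.spark.sql.SparkSession;

public class ImportJobConfigSketch {
    public static void main(String[] args) {
        // Pass database and cache endpoints to the worker nodes as Spark
        // configuration properties. The keys and values are placeholders.
        SparkSession spark = SparkSession.builder()
                .appName("mdm-bulk-import")
                .master("local[*]") // local master for illustration; a real deployment points at the cluster
                .config("spark.mdm.db.url", "jdbc:postgresql://db-host:5432/mdm")
                .config("spark.mdm.cache.hosts", "cache-host-1:6379,cache-host-2:6379")
                .getOrCreate();

        // Worker-side code can read these properties from the active session.
        String dbUrl = spark.conf().get("spark.mdm.db.url");
        System.out.println("Workers will connect to: " + dbUrl);
        spark.stop();
    }
}
```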

  • HDFS: When you upload the import file to TIBCO MDM, it is copied to HDFS, and Apache Spark reads the file from HDFS (see the sketch after this list).
  • Spark Worker Nodes: The Spark worker nodes act as clients.
  • Cache: An external cache.
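The upload step in the first bullet can be pictured with the Hadoop FileSystem API. This is a minimal sketch with a placeholder namenode URI and paths; it is not the TIBCO MDM upload code.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        // Copy a local import file to HDFS so that Spark workers can read it.
        // The namenode URI and both paths are placeholders.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            fs.copyFromLocalFile(
                    new Path("/tmp/products.csv"),            // local source
                    new Path("/mdm/import/products.csv"));    // HDFS destination
        }
    }
}
```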