Big Data Import Using Apache Spark and HDFS
The big data import feature uses an Apache Spark cluster to achieve high throughput. It takes advantage of Spark's parallelism by processing the import file concurrently across the worker nodes. The file is also uploaded to the Apache Hadoop Distributed File System (HDFS), which provides high-throughput access when the files are very large.
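To illustrate the parallel read, the following sketch shows a Spark job reading an import file directly from HDFS and processing it across the worker nodes. The HDFS path, file format, application name, and the placeholder transformation are assumptions for the example, not part of the product.

```scala
import org.apache.spark.sql.SparkSession

object BigDataImportReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mdm-big-data-import") // hypothetical application name
      .getOrCreate()

    // Spark reads the uploaded file directly from HDFS; each worker node
    // processes its own partitions of the file in parallel.
    val records = spark.read
      .option("header", "true")
      .csv("hdfs:///mdm/imports/customers.csv") // hypothetical HDFS path

    // Placeholder step standing in for the real import logic: drop incomplete
    // rows and count what remains.
    val complete = records.na.drop()
    println(s"Records read from HDFS: ${complete.count()}")

    spark.stop()
  }
}
```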
Before you start a big data import, Apache Spark and HDFS must be up and running. The import can be triggered from any of the ibi MDM servers. The Spark worker nodes must know the database and cache configuration. After the import is triggered in ibi MDM and the Spark cluster completes the jobs (balancing the load among the worker nodes), the import event is updated.
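One hedged way to make the database and cache settings available to the worker nodes is to pass them as Spark configuration properties at submit time and use them inside per-partition processing. The property names (spark.mdm.db.url, spark.mdm.cache.hosts), the HDFS path, and the processing shown are hypothetical, not the product's actual settings.

```scala
import org.apache.spark.sql.SparkSession

object ImportWorkerConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // Values supplied at submit time, for example:
    //   spark-submit --conf spark.mdm.db.url=jdbc:... --conf spark.mdm.cache.hosts=cache1:11211 ...
    val dbUrl      = spark.conf.get("spark.mdm.db.url")      // hypothetical property name
    val cacheHosts = spark.conf.get("spark.mdm.cache.hosts") // hypothetical property name

    val lines = spark.read.textFile("hdfs:///mdm/imports/customers.csv") // hypothetical path

    // Each worker node handles its own partitions. A real implementation would
    // open a database connection with dbUrl and a cache client with cacheHosts,
    // write each record, then close both; the println stands in for that work.
    lines.rdd.foreachPartition { rows =>
      println(s"Processing ${rows.size} records against $dbUrl and cache at $cacheHosts")
    }

    spark.stop()
  }
}
```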
- HDFS: When you upload a file to ibi MDM, the file is stored in HDFS, and Apache Spark reads it from HDFS (see the upload sketch after this list).
- Spark Worker Nodes: The worker nodes act as clients of the database and the cache while they process the import jobs.
- Cache: This is an external cache that the worker nodes connect to during the import.
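As a companion to the HDFS item above, the following sketch shows one way an uploaded file could be copied into HDFS with the Hadoop client API before the Spark job is triggered. The NameNode address and both paths are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsUploadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // hypothetical NameNode address

    val fs = FileSystem.get(conf)

    // Copy the file received by the application server into HDFS so that every
    // Spark worker node can read it with high throughput.
    fs.copyFromLocalFile(
      new Path("/tmp/uploads/customers.csv"),  // hypothetical local upload location
      new Path("/mdm/imports/customers.csv"))  // hypothetical HDFS target path

    fs.close()
  }
}
```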