Setting up Apache Spark

For the Big Data Import feature of TIBCO MDM, you must download and configure Apache Spark. An Apache Spark cluster consists of a single master node and any number of worker nodes. For optimal performance, configure four or five worker nodes.

Prerequisites

  • For the recommended platform, see Platform Limitations for Apache Spark.
  • Share the $MQ_HOME, $MQ_COMMON_DIR, and MQ_CONFIG_FILE locations across all nodes, that is, from the TIBCO MDM host machine to the Apache Spark master and worker machines.

    For information about sharing the directory in the cluster environment, see Clustering Set Up.
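    For illustration only, a minimal sketch of sharing these locations over NFS is shown below; the export path /opt/mdm, the mount options, and the host names are assumptions, so follow Clustering Set Up for the supported procedure.

      # On the TIBCO MDM host: export the shared location (add to /etc/exports),
      # then re-export with: exportfs -ra
      /opt/mdm  spark1(rw,sync) spark2(rw,sync) spark3(rw,sync)

      # On each Apache Spark node: mount the export at the same path
      mount -t nfs <mdm-host>:/opt/mdm /opt/mdm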

Procedure

  1. Add Entries in the Hosts File (master and worker)
    1. On the Apache Spark master machine, navigate to /etc.
    2. In the hosts file, specify the host name of each of the three nodes.
      Consider a scenario with three servers: one for the master node and two for worker nodes. In other words, these are three Linux machines that host the Apache Spark cluster.
      spark1: master node
      spark2: worker node
      spark3: worker node 
      				
      Repeat steps a and b on all worker node machines.
      Note: Ensure that all the servers can ping each other by host name and by IP address; sample hosts entries are shown below.
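      For example, the hosts file on each machine might contain entries such as the following (the addresses are placeholders; use the actual IP addresses of your machines):
      10.XXX.XXX.101  spark1
      10.XXX.XXX.102  spark2
      10.XXX.XXX.103  spark3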
  2. Generate Key
    1. Generate an SSH key pair. For background on passwordless SSH login, see http://www.thegeekstuff.com/2008/11/3-steps-to-perform-ssh-login-without-password-using-ssh-keygen-ssh-copy-id/.
      The key is required to access the worker nodes without a password. After you generate the key, you can copy it to the worker node machines and log in to them without being prompted for a password.
  3. Set Up SSH Passwordless Login
    1. On the command line, enter the following commands to generate the key pair and copy the public key to each node:
      ssh-keygen
      
      ssh-copy-id -i ./.ssh/id_rsa.pub spark1
      ssh-copy-id -i ./.ssh/id_rsa.pub spark2
      ssh-copy-id -i ./.ssh/id_rsa.pub spark3
      
  4. Check Login to Worker Nodes Using SSH Keys
    1. Enter the following command from the master node and verify that the worker nodes are accessible:
      ssh 10.XXX.XXX.XXX (the IP address of a worker node) 
    You can now log in to the worker node machine without being prompted for a password.
  5. Install Spark
    1. To download Apache Spark, navigate to the Apache Spark website.
    2. In the Archived Releases section, click Spark Release archives.
    3. From the version list, click the spark-2.2.0 version.
    4. Download spark-2.2.0-bin-hadoop2.7.tgz and extract its contents to the /home/username folder.
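      Equivalently, you can download and extract the release from the command line; the URL below points to the Apache release archive for Spark 2.2.0.
      cd /home/username
      wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
      tar -xzf spark-2.2.0-bin-hadoop2.7.tgz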
  6. Set Up Apache Spark Cluster
    1. Navigate to $SPARK_HOME/conf.
    2. If the slaves file does not exist, copy slaves.template to a file named slaves in the same directory.
    3. List the following slave (worker) nodes in the slaves file, one host name per line (a command sketch follows the list):
      • spark1
      • spark2
      • spark3
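      A minimal command sequence for this step, assuming the default template shipped with Spark:
      cd $SPARK_HOME/conf
      cp slaves.template slaves
      # Edit slaves and replace the default "localhost" entry with
      # one worker host name per line: spark1, spark2, spark3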
  7. Configure Spark Master
    1. Navigate to $SPARK_HOME/conf.
    2. If the spark-env.sh file does not exist, copy spark-env.sh.template to spark-env.sh.
    3. Open the spark-env.sh file and add the IP address and port number of the Spark master node and the number of worker instances:
      SPARK_MASTER_IP=10.XXX.XXX.XXX (an IP address of the master node)
      SPARK_MASTER_PORT=7077
      SPARK_WORKER_INSTANCES=1
      				
    4. Set MQ_LOG in the spark-env.sh file.
      Note: Before setting the MQ_LOG parameter, ensure that the location referenced by MQ_CONFIG_FILE is shared with the Apache Spark master node machine and has full permissions. If the Apache Spark cluster and TIBCO MDM are installed on different machines, you must also share $MQ_HOME from the TIBCO MDM host machine to the Apache Spark master and worker machines.
    5. Set JAVA_HOME in the spark-env.sh file.
      export JAVA_HOME=/home/apps/JAVA8/jdk1.8.0_112
      export PATH=$PATH:$JAVA_HOME/bin
    6. Copy the $SPARK_HOME folder (that is, spark-2.2.0-bin-hadoop2.7) from /home/username/ to the same location on all the other worker node machines, as shown in the sketch below.
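      Taken together, the additions to spark-env.sh might look like the following sketch; the MQ_LOG path and the Java location are examples, so substitute your own shared paths. The scp commands show one way to copy the installation to the workers.
      # $SPARK_HOME/conf/spark-env.sh (example values)
      SPARK_MASTER_IP=10.XXX.XXX.XXX
      SPARK_MASTER_PORT=7077
      SPARK_WORKER_INSTANCES=1
      export MQ_LOG=/shared/mdm/log    # hypothetical shared log location
      export JAVA_HOME=/home/apps/JAVA8/jdk1.8.0_112
      export PATH=$PATH:$JAVA_HOME/bin
      
      # Copy the Spark installation to each worker (run on the master):
      scp -r /home/username/spark-2.2.0-bin-hadoop2.7 spark2:/home/username/
      scp -r /home/username/spark-2.2.0-bin-hadoop2.7 spark3:/home/username/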
  8. Start Apache Spark Cluster
    1. On the master node, run start-all.sh from $SPARK_HOME/sbin (that is, /home/username/spark-2.2.0-bin-hadoop2.7/sbin).
      The master and worker nodes start.
    2. Optional: To stop Apache Spark, navigate to the $SPARK_HOME/sbin folder and run stop-all.sh.
  9. Check Whether Services Have Been Started
    1. Run the jps command from any path to list the running processes. Running jps on the master machine shows the master node process; running it on a worker machine shows the worker node process.
      [username@<master machine name> ~]# cd $SPARK_HOME/sbin
      [username@<master machine name> sbin]# jps
      5495 Master
      5607 Jps
      
      [username@<worker machine name> ~]# cd $SPARK_HOME/sbin
      [username@<worker machine name> sbin]# jps
      1234 Worker
      5678 Jps
      
      Apache Spark is set up.
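      You can also open the Spark master web UI, available by default at http://<master host>:8080, to confirm that all worker nodes are registered with the master.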

What to do next

To verify that the Apache Spark setup works correctly, run the samples provided in the $SPARK_HOME/examples/src/main/scala/org/apache/spark/examples directory.
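For example, a quick smoke test is to submit the bundled SparkPi example to the cluster; the jar name below is the one shipped with spark-2.2.0-bin-hadoop2.7, and the master URL uses the address and port configured in spark-env.sh.

  $SPARK_HOME/bin/spark-submit \
    --master spark://10.XXX.XXX.XXX:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 100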