Setting up Hadoop Distributed File System

Apache Spark is compatible with Hadoop data. You can run Spark on a Hadoop cluster through YARN or in Spark's standalone mode, and it can process data stored in the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and well suited to parallel data processing: it splits incoming data into blocks and distributes those blocks, with replicas, across the nodes in a cluster.

For more information on HDFS, see the HDFS documentation.
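
As a rough illustration of how HDFS splits data into blocks, the following shell sketch estimates how many blocks a file of a given size occupies. It assumes the Hadoop 2.x default block size of 128 MB (dfs.blocksize = 134217728). The block_count name is a hypothetical helper, not part of Hadoop.

```shell
# Hypothetical helper: estimate how many HDFS blocks a file of a given size occupies.
# Assumes the Hadoop 2.x default block size of 128 MB unless a second argument is given.
block_count() {
  file_bytes=$1
  block_bytes=${2:-134217728}
  # Ceiling division: a partially filled last block still counts as one block.
  echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

block_count 1073741824   # a 1 GiB file; prints 8
```

Each of these blocks is stored on a DataNode, which is why the DataNode directory created in the prerequisites needs enough disk space for the data you plan to load.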

Prerequisites

Create the hdfs/namenode and hdfs/datanode directories under your home directory (/home/username).

For example, /home/username/hadoop/hdfs/namenode and /home/username/hadoop/hdfs/datanode.
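
One possible way to create these directories from a shell is sketched below. The HDFS_BASE variable is an assumption introduced for this example; it defaults to $HOME/hadoop/hdfs to match the example paths.

```shell
# Assumption: HDFS_BASE defaults to $HOME/hadoop/hdfs, matching the example paths.
HDFS_BASE="${HDFS_BASE:-$HOME/hadoop/hdfs}"

# -p creates missing parent directories and does not fail if they already exist.
mkdir -p "$HDFS_BASE/namenode" "$HDFS_BASE/datanode"

# Confirm that both directories exist.
ls -ld "$HDFS_BASE/namenode" "$HDFS_BASE/datanode"
```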

Procedure

  1. Download Apache Hadoop version 2.7.3 from https://hadoop.apache.org/releases.html and extract its contents to the /home/username folder. The extracted directory is referred to as $HADOOP_HOME in the following steps.
  2. Navigate to $HADOOP_HOME/etc/hadoop and modify the hdfs-site.xml file to set the path for namenode and datanode.
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/username/hadoop/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/username/hadoop/hdfs/datanode</value>
      </property>
    </configuration>
    
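  3. Modify the core-site.xml file to specify the HDFS host name and, optionally, the port, in the form hdfs://hostname:port. If you omit the port, the NameNode default port (8020 in Hadoop 2.7.3) is used.
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://hostname</value>
      </property>
    </configuration>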
    
  4. Modify the hadoop-env.sh file to set JAVA_HOME.
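  5. Navigate to the $HADOOP_HOME/sbin directory and run start-dfs.sh. If this is the first time you are starting HDFS with these directories, first format the NameNode by running $HADOOP_HOME/bin/hdfs namenode -format.
    The HDFS setup is complete and the HDFS processes are started.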

    Optional: To stop the HDFS processes, run stop-dfs.sh from the same directory.
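
If you prefer to generate the two configuration files rather than edit them by hand, steps 2 and 3 can be sketched as a small shell function. This is a minimal sketch, not part of Hadoop or TIBCO MDM: write_hdfs_config is a hypothetical helper, its three arguments are assumptions made for this example, and it overwrites any existing hdfs-site.xml and core-site.xml in the target directory.

```shell
# Hypothetical helper (not part of Hadoop): write minimal hdfs-site.xml and
# core-site.xml files into a configuration directory, overwriting existing ones.
write_hdfs_config() {
  conf_dir=$1    # for example, $HADOOP_HOME/etc/hadoop
  hdfs_base=$2   # for example, /home/username/hadoop/hdfs
  fs_uri=$3      # for example, hdfs://hostname

  # NameNode and DataNode storage directories (step 2).
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>$hdfs_base/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>$hdfs_base/datanode</value>
  </property>
</configuration>
EOF

  # HDFS host name and optional port (step 3).
  cat > "$conf_dir/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$fs_uri</value>
  </property>
</configuration>
EOF
}
```

For example, calling write_hdfs_config "$HADOOP_HOME/etc/hadoop" /home/username/hadoop/hdfs hdfs://hostname produces files equivalent to the ones shown in the procedure. Back up any existing configuration files before you try it.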

What to do next

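  • To verify that HDFS is running correctly, run the jps command from any path. The output lists the running Java processes with their IDs. NameNode, DataNode, and SecondaryNameNode are the HDFS processes; NodeManager and ResourceManager are YARN processes and appear only if YARN is also running: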
    [username@TIBCO MDM master machine name sbin]# jps
    5607 Jps
    4634 DataNode
    4842 SecondaryNameNode
    5132 NodeManager
    4527 NameNode
    5023 ResourceManager
  • Configure TIBCO MDM with Apache Spark. For more information, see the "Configuration Properties for Apache Spark" and "Required JAR Files for Apache Spark" sections in the TIBCO MDM User's Guide.
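
The jps check can also be scripted. The following sketch reads jps output on standard input and reports any HDFS daemon that is missing. The check_hdfs_daemons name is a hypothetical helper introduced for this example, and it assumes the usual two-column jps output (process ID, then process name).

```shell
# Hypothetical helper: read jps output on standard input and report whether the
# three HDFS daemons are present. Returns 0 if all are found, 1 otherwise.
check_hdfs_daemons() {
  # The second column of jps output is the process name.
  names=$(awk '{print $2}')
  status=0
  for daemon in NameNode DataNode SecondaryNameNode; do
    # -x matches whole lines, so "NameNode" does not match "SecondaryNameNode".
    if ! printf '%s\n' "$names" | grep -qx "$daemon"; then
      echo "missing: $daemon"
      status=1
    fi
  done
  return $status
}
```

A typical invocation would be jps | check_hdfs_daemons, which prints nothing and returns 0 when all three HDFS processes are running.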