Setting up Hadoop Distributed File System
Apache Spark is compatible with Hadoop data. You can run Spark in Hadoop clusters through YARN or in Apache Spark's standalone mode, and it can process data stored in the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and efficient for parallel data processing. HDFS takes in data, breaks it into separate blocks, and distributes the blocks to different nodes in a cluster.
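The block splitting and distribution described above can be observed on a running cluster. The following is a sketch, assuming the Hadoop bin directory is on the PATH and a local file named sample.txt exists (both are assumptions, not part of this guide):

```shell
# Copy a local file into HDFS; HDFS splits it into blocks
# (128 MB by default) and replicates them across DataNodes.
hdfs dfs -mkdir -p /user/username
hdfs dfs -put sample.txt /user/username/sample.txt

# Report how the file was split into blocks and where each
# block replica is stored.
hdfs fsck /user/username/sample.txt -files -blocks -locations
```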
Prerequisites
Create the local directories in which HDFS stores NameNode metadata and DataNode blocks. For example, home/username/hadoop/hdfs/namenode and home/username/hadoop/hdfs/datanode.
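These directories are typically referenced from the hdfs-site.xml configuration file. A minimal sketch using the standard Hadoop property names, assuming the example directories above live under /home/username:

```xml
<configuration>
  <!-- Where the NameNode stores its metadata (fsimage, edit logs) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/username/hadoop/hdfs/namenode</value>
  </property>
  <!-- Where DataNodes store the actual data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/username/hadoop/hdfs/datanode</value>
  </property>
  <!-- Replication factor; 1 is common for a single-node setup -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```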
Procedure
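Setting up HDFS typically involves formatting the NameNode once and then starting the HDFS and YARN daemons. A sketch of the standard Hadoop commands, assuming the HADOOP_HOME environment variable points to the Hadoop installation:

```shell
# Format the NameNode (run once, on first setup only;
# reformatting destroys existing HDFS metadata).
$HADOOP_HOME/bin/hdfs namenode -format

# Start the HDFS daemons: NameNode, DataNode, SecondaryNameNode.
$HADOOP_HOME/sbin/start-dfs.sh

# Start the YARN daemons: ResourceManager and NodeManager
# (needed to run Spark on YARN).
$HADOOP_HOME/sbin/start-yarn.sh
```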
What to do next
- To verify that HDFS is running correctly, run the jps command from any path. The following processes with their IDs are displayed:
[username@TIBCO MDM master machine name sbin]# jps
5607 Jps
4634 DataNode
4842 SecondaryNameNode
5132 NodeManager
4527 NameNode
5023 ResourceManager
- Configure TIBCO MDM with Apache Spark. For information, see the "Configuration Properties for Apache Spark" and "Required JAR Files for Apache Spark" sections in TIBCO MDM User's Guide.
Copyright © Cloud Software Group, Inc. All rights reserved.