Hadoop Connection Prerequisites

This checklist is provided to help ensure that all Team Studio components of a typical Hadoop-based installation are accounted for and completed.

The Hadoop connection configuration requires the HDFS host, HDFS port, JobTracker host, and JobTracker port. All Hadoop node hostnames must resolve to the correct computers from the Team Studio server. If the inputs provided in the form below are not valid, Team Studio needs access to a Hadoop administrator or to anyone who can read the Hadoop configuration files (*-site.xml). If the Hadoop hostnames do not resolve, the hosts file of the Team Studio server might also have to be changed.
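The hostname requirement can be checked before installation with a short script run on the Team Studio server. The following is a minimal sketch, not part of Team Studio: the function name check_hosts and the example hostnames are hypothetical, and it only reports whether each name resolves to some address, not whether that address is the correct computer.

```python
import socket


def check_hosts(hostnames, resolve=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None if it does not resolve.

    `resolve` is injectable so the check can be exercised without a network.
    """
    results = {}
    for host in hostnames:
        try:
            results[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


# Example use (hypothetical hostnames; replace with the cluster's node names):
#   for host, address in check_hosts(["namenode.example.com",
#                                     "jobtracker.example.com"]).items():
#       print(host, address or "DOES NOT RESOLVE")
```

Any hostname reported as unresolved is a candidate for an entry in the hosts file of the Team Studio server.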

If the Hadoop cluster is not configured for Kerberos, the connection takes approximately two hours to configure and test. If the cluster is configured for Kerberos, the user running Team Studio on the Team Studio server must have a keytab to authenticate in Kerberos, and the keytabs for the NameNode and JobTracker must be located on the Team Studio server. If any of these three keytabs is missing or invalid, a Hadoop administrator must be available to contact during installation. Configuring the initial connection to a cluster configured for Kerberos takes approximately four hours.

Hadoop Cluster
Question | Response | For Reference
Which version of Hadoop is installed? | |
Will a Hadoop administrator be available during installation? | |
Is the NameNode or the resource manager enabled for high availability? | |
Is the cluster configured for Kerberos? | |
Is the cluster running MapReduce (MRv1) or YARN (MRv2)? | |
Do the HDFS and JobTracker/resource manager hostnames resolve to the correct computers from the Team Studio server? | | If they do not, configure the hosts file so that these Hadoop hosts resolve properly.
Hadoop Cluster without High Availability
Question | Response | For Reference
What are the HDFS host and port? | | Can be found in core-site.xml as fs.default.name: hdfs://HDFSHOST:HDFSPORT
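The HDFS host and port can also be read from core-site.xml programmatically. The sketch below assumes the standard Hadoop layout of property elements with name and value children; the function names and the sample hostname are hypothetical. It also falls back to fs.defaultFS, the name that newer Hadoop releases use for the same setting.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def read_site_properties(xml_text):
    """Return the name/value pairs of a Hadoop *-site.xml document as a dict."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def hdfs_host_and_port(core_site_xml):
    """Split the default file system URI (hdfs://HOST:PORT) into (host, port)."""
    props = read_site_properties(core_site_xml)
    # fs.default.name is the property named in the checklist;
    # fs.defaultFS is its newer equivalent.
    uri = props.get("fs.default.name") or props.get("fs.defaultFS")
    if not uri:
        raise KeyError("fs.default.name / fs.defaultFS not set")
    parts = urlsplit(uri)
    return parts.hostname, parts.port


# Sample content with a hypothetical host; a real file is read from disk instead.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

print(hdfs_host_and_port(SAMPLE_CORE_SITE))  # ('namenode.example.com', 8020)
```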
Hadoop Cluster with High Availability
Question | Response | For Reference
What is the name of the name service? | | Can be found in hdfs-site.xml as dfs.nameservices (for example, nameservice1, which is addressed as hdfs://nameservice1).
What is the value for dfs.ha.namenodes.<nameservice>? | | Can be found in hdfs-site.xml using the name of the name service.
What are the values for dfs.namenode.rpc-address.<nameservice>.<namenode>? | | Can be found in hdfs-site.xml using the name of the name service and each NameNode specified in the previous row.
What is the value for dfs.client.failover.proxy.provider.<nameservice>? | | Can be found in hdfs-site.xml using the name of the name service.
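The four high-availability answers depend on each other: the name service gives the NameNode list, and each NameNode gives an RPC address. The following sketch shows that lookup chain against hdfs-site.xml. It is illustrative only: ha_settings and the sample hostnames are hypothetical, and it assumes every property in the chain is present.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return the name/value pairs of a Hadoop *-site.xml document as a dict."""
    return {
        prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
        for prop in ET.fromstring(xml_text).iter("property")
        if prop.findtext("name") is not None
    }


def ha_settings(hdfs_site_xml):
    """Collect the HA properties from the checklist, keyed by name service."""
    props = read_site_properties(hdfs_site_xml)
    summary = {}
    for ns in (s.strip() for s in props["dfs.nameservices"].split(",")):
        namenodes = [n.strip() for n in props[f"dfs.ha.namenodes.{ns}"].split(",")]
        summary[ns] = {
            "namenodes": {
                nn: props[f"dfs.namenode.rpc-address.{ns}.{nn}"] for nn in namenodes
            },
            "failover_proxy_provider": props[
                f"dfs.client.failover.proxy.provider.{ns}"
            ],
        }
    return summary


def _prop(name, value):
    return f"<property><name>{name}</name><value>{value}</value></property>"


# Sample content with hypothetical hosts; a real file is read from disk instead.
SAMPLE_HDFS_SITE = "<configuration>" + "".join([
    _prop("dfs.nameservices", "nameservice1"),
    _prop("dfs.ha.namenodes.nameservice1", "nn1,nn2"),
    _prop("dfs.namenode.rpc-address.nameservice1.nn1", "nn1.example.com:8020"),
    _prop("dfs.namenode.rpc-address.nameservice1.nn2", "nn2.example.com:8020"),
    _prop("dfs.client.failover.proxy.provider.nameservice1",
          "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"),
]) + "</configuration>"

print(ha_settings(SAMPLE_HDFS_SITE))
```

A KeyError from ha_settings points at the property that is missing from the file, which is the value to request from the Hadoop administrator.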
MapReduce (MRv1)
Question | Response | For Reference
What are the JobTracker host and port? | | Can be found in mapred-site.xml as mapred.job.tracker: JOBHOST:JOBPORT
YARN (MRv2)
Question | Response | For Reference
What is the YARN resource manager address? | | Can be found in yarn-site.xml as yarn.resourcemanager.address
Kerberos (Ignore if Kerberos is not enabled)
Question | Response | For Reference
Is there a keytab that authenticates the Team Studio server? | |
Are the required keytab files (merged or unmerged) located on the Team Studio server? | |
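Whether the keytab files are present can be given a quick first check with a script. The sketch below is an assumption-laden sanity check, not a Team Studio feature: keytab_problems is a hypothetical helper, and it only confirms that each file exists, is readable, and begins with the two-byte header used by the MIT keytab format (0x05 followed by a version byte of 1 or 2). It does not verify that the keys authenticate; a Kerberos tool such as klist -k -t or kinit -k -t is still needed for that.

```python
import os
import tempfile


def keytab_problems(paths):
    """Return {path: problem} for keytab files that look missing or malformed."""
    problems = {}
    for path in paths:
        if not os.path.isfile(path):
            problems[path] = "missing"
            continue
        try:
            with open(path, "rb") as handle:
                header = handle.read(2)
        except OSError as exc:
            problems[path] = f"unreadable: {exc}"
            continue
        # MIT-format keytabs start with byte 0x05, then a version byte (1 or 2).
        if len(header) < 2 or header[0] != 0x05 or header[1] not in (1, 2):
            problems[path] = "not a keytab file"
    return problems


# Demonstration with throwaway files; real paths depend on the installation.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.keytab")
    bad = os.path.join(tmp, "bad.keytab")
    absent = os.path.join(tmp, "absent.keytab")
    with open(good, "wb") as f:
        f.write(b"\x05\x02placeholder")
    with open(bad, "wb") as f:
        f.write(b"plain text")
    print(keytab_problems([good, bad, absent]))
```

An empty result means every listed file passed this superficial check.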