Hadoop Connection Prerequisites

Use this checklist to confirm that every TIBCO Data Science - Team Studio component of a typical Hadoop-based installation is accounted for and complete.

The Hadoop connection configuration requires the HDFS host, HDFS port, JobTracker host, and JobTracker port. All Hadoop node hostnames must resolve to the correct computers from the TIBCO Data Science - Team Studio server. If the values provided in the form below are not valid, TIBCO Data Science - Team Studio needs access to a Hadoop administrator, or to anyone with access to the Hadoop configuration files (*-site.xml). If the Hadoop hostnames do not resolve, the hosts file on the TIBCO Data Science - Team Studio server might also have to be changed.
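The hostname requirement above can be checked from the TIBCO Data Science - Team Studio server before installation starts. The following is a minimal sketch, not part of the product: the function name `check_hosts` and the `resolver` parameter are illustrative assumptions, and the hostnames passed to it must be the real node names of the cluster.

```python
import socket


def check_hosts(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None if lookup fails."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror and socket.herror are OSError subclasses
            results[name] = None
    return results
```

Calling `check_hosts` with the list of Hadoop node names returns a dictionary. Any entry whose value is `None` is a host that does not resolve and probably needs an entry in the hosts file. The result only shows that a name resolves, not that it resolves to the correct computer, so compare the returned addresses with the ones the Hadoop administrator expects.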

If the Hadoop cluster is not configured for Kerberos, the connection takes approximately two hours to configure and test. If the cluster is configured for Kerberos, the user running TIBCO Data Science - Team Studio on the TIBCO Data Science - Team Studio server must have a keytab to authenticate with Kerberos, and keytabs for the NameNode and JobTracker must be present on the TIBCO Data Science - Team Studio server. If any of these three keytabs is missing or invalid, a Hadoop administrator must be available to contact during installation. Configuring the initial connection to a cluster configured for Kerberos takes approximately four hours.

Hadoop Cluster
| Question | Response | For Reference |
| --- | --- | --- |
| Which version of Hadoop is installed? | | |
| Will a Hadoop administrator be available during installation? | | |
| Is the NameNode or the resource manager enabled for high availability? | | |
| Is the cluster configured for Kerberos? | | |
| Is the cluster running MapReduce (MRv1) or YARN (MRv2)? | | |
| Do the HDFS and JobTracker/resource manager hostnames resolve to the correct computers from the TIBCO Data Science - Team Studio server? | | If they do not, configure the hosts file so that these Hadoop hosts resolve properly. |
Hadoop Cluster without High Availability
| Question | Response | For Reference |
| --- | --- | --- |
| What are the HDFS host and port? | | Can be found in core-site.xml as fs.default.name: hdfs://HDFSHOST:HDFSPORT |
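The lookup in core-site.xml can be scripted when a copy of the file is available. This is a minimal sketch using only the Python standard library. The helper names `read_property` and `hdfs_host_and_port` are hypothetical, and the fallback to `fs.defaultFS` assumes a newer Hadoop release where that name replaces `fs.default.name`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def read_property(site_xml_text, name):
    """Return the value of a named property from Hadoop *-site.xml text, or None."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def hdfs_host_and_port(core_site_text):
    """Extract (host, port) from fs.default.name, falling back to fs.defaultFS."""
    uri = read_property(core_site_text, "fs.default.name") or read_property(
        core_site_text, "fs.defaultFS"
    )
    if uri is None:
        return None
    parsed = urlparse(uri)  # e.g. hdfs://HDFSHOST:HDFSPORT
    return parsed.hostname, parsed.port
```

Pass the contents of core-site.xml as a string. The function returns `None` when neither property is present, which is a sign that the wrong file was supplied or that the administrator should be consulted.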
Hadoop Cluster with High Availability
| Question | Response | For Reference |
| --- | --- | --- |
| What is the name of the name service? | | Can be found in hdfs-site.xml as dfs.nameservices: hdfs://nameservice1 |
| What is the value for dfs.ha.namenodes.<nameservice>? | | Can be found in hdfs-site.xml using the name of the name service. |
| What are the values for dfs.namenode.rpc-address.<nameservice>.<namenode>? | | Can be found in hdfs-site.xml using the name of the name service and each NameNode specified in the previous row. |
| What is the value for dfs.client.failover.proxy.provider.<nameservice>? | | Can be found in hdfs-site.xml using the name of the name service. |
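The four high-availability values depend on one another: the name service determines the NameNode list, and each NameNode determines an RPC address key. A short script can follow that chain through hdfs-site.xml. This is a sketch under assumptions: `load_properties` and `ha_settings` are hypothetical helper names, and the code assumes `dfs.nameservices` and `dfs.ha.namenodes.<nameservice>` hold comma-separated lists.

```python
import xml.etree.ElementTree as ET


def load_properties(site_xml_text):
    """Return all name/value pairs from Hadoop *-site.xml text as a dict."""
    root = ET.fromstring(site_xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name:
            props[name] = prop.findtext("value", default="").strip()
    return props


def split_list(value):
    """Split a comma-separated value into non-empty, trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def ha_settings(hdfs_site_text):
    """Collect the HA values the checklist asks for, keyed by name service."""
    props = load_properties(hdfs_site_text)
    settings = {}
    for ns in split_list(props.get("dfs.nameservices", "")):
        # Some notes record the name service with an hdfs:// prefix; strip it.
        if ns.startswith("hdfs://"):
            ns = ns[len("hdfs://"):]
        namenodes = split_list(props.get(f"dfs.ha.namenodes.{ns}", ""))
        settings[ns] = {
            "namenodes": namenodes,
            "rpc_addresses": {
                nn: props.get(f"dfs.namenode.rpc-address.{ns}.{nn}")
                for nn in namenodes
            },
            "failover_proxy_provider": props.get(
                f"dfs.client.failover.proxy.provider.{ns}"
            ),
        }
    return settings
```

A `None` value anywhere in the result marks a property that is missing from the file and should be raised with the Hadoop administrator.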
MapReduce (MRv1)
| Question | Response | For Reference |
| --- | --- | --- |
| What are the JobTracker host and port? | | Can be found in mapred-site.xml as mapred.job.tracker: JOBHOST:JOBPORT |
YARN (MRv2)
| Question | Response | For Reference |
| --- | --- | --- |
| What is the YARN resource manager's address? | | Can be found in yarn-site.xml as yarn.resourcemanager.address |
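Once the JobTracker or resource manager address is known, it can help to confirm that the port is reachable from the TIBCO Data Science - Team Studio server. The sketch below is an illustration only: `split_address` and `port_is_open` are hypothetical helpers, the address is assumed to be in host:port form, and the five-second timeout is an arbitrary choice.

```python
import socket


def split_address(address):
    """Split 'host:port' into (host, port), ignoring a leading scheme if present."""
    if "://" in address:
        address = address.split("://", 1)[1]
    host, _, port = address.strip().rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and lookup failures
        return False
```

For example, `port_is_open(*split_address(value))` can be run with the value read from `yarn.resourcemanager.address` or `mapred.job.tracker`. A `False` result does not identify the cause; a firewall, an unresolved hostname, and a stopped service all look the same at this level.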
Kerberos (Ignore if Kerberos is not enabled)
| Question | Response | For Reference |
| --- | --- | --- |
| Is there a keytab that authenticates the TIBCO Data Science - Team Studio server? | | |
| Are the required keytab files for the NameNode and JobTracker (merged or unmerged) on the TIBCO Data Science - Team Studio server? | | |
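A quick file-level check can answer the second Kerberos question before anyone attempts to authenticate. This sketch assumes the keytab paths are already known. `check_keytabs` is a hypothetical helper, and it only verifies that each file exists, is readable by the current user, and is not empty. It does not validate the principals stored inside.

```python
import os


def check_keytabs(paths):
    """Report whether each keytab path is an existing, readable, non-empty file."""
    report = {}
    for path in paths:
        report[path] = (
            os.path.isfile(path)
            and os.access(path, os.R_OK)
            and os.path.getsize(path) > 0
        )
    return report
```

Run the check as the user that runs TIBCO Data Science - Team Studio, because read permission is evaluated for the current user. Any `False` entry is a keytab to raise with the Hadoop administrator.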