Adding a Hadoop Data Source from the User Interface

To add an HDFS data source, first make sure the Team Studio server can connect to the hosts, and then use the Add Data Source dialog box to add it to Team Studio.

Supported Hadoop distributions are listed in Team Studio System Requirements.

Prerequisites

You must have data administrator or higher privileges to add a data source. Ensure that you have the correct permissions before continuing.

Procedure

  1. From the menu, select Data.
  2. Select Add Data Source.
  3. Choose Hadoop Cluster as the data source type.

  4. Specify the following data source attributes. (A sample configuration follows this list.)
    Data Source Name: Set a user-facing name for the data source. This should be something meaningful for your team (for example, "Dev_CDH5_cluster").
    Description: Enter a description for your data source.
    Hadoop Version: Select the Hadoop distribution that matches your data source.
    Use High Availability: Select this check box to enable High Availability for the Hadoop cluster.
    Disable Kerberos Impersonation: If this check box is selected and Kerberos is enabled on your data source, the workflow uses the user account configured here as the Hadoop Credentials.

    If this box is cleared, the workflow uses the user account of the person running the workflow.

    If you do not have Kerberos enabled on your data source, you do not need to select this box. All workflows run using the account configured as the Hadoop Credentials.

    NameNode Host: Enter a single active NameNode to start. Instructions for enabling High Availability are in Step 10.

    To verify that the NameNode is active, check its web interface. (The default is http://namenodehost.localhost:50070/) A command-line alternative is shown in the example after this list.

    NameNode Port: Enter the port that your NameNode uses. The default port is 8020.
    Job Tracker/Resource Manager Host: For MapReduce v1, specify the JobTracker host. For YARN, specify the Resource Manager host.
    Job Tracker/Resource Manager Port: Common ports are 8021, 9001, 8012, or 8032.
    Workspace Visibility: There are two options:
    • Public - Visible and available to all workspaces.
    • Limited - Visible and available only to the workspaces with which the data source is associated.

    To learn more about associating a data source to a workspace, see Data Visibility.

    Hadoop Credentials: Specify the user or service account used to run MapReduce jobs. This user must be able to run MapReduce jobs from the command line.
    Group List: Enter the group to which the Hadoop account belongs.
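
    For reference, the following is a hypothetical set of values for a non-HA cluster. The hostnames and account names are placeholders, not values from your environment:

      Data Source Name: Dev_CDH5_cluster
      Description: Development CDH 5 cluster
      Hadoop Version: Cloudera CDH5
      NameNode Host: namenode1.example.com
      NameNode Port: 8020
      Job Tracker/Resource Manager Host: resourcemanager1.example.com
      Job Tracker/Resource Manager Port: 8032
      Hadoop Credentials: yarn
      Group List: hadoop

    To check from the command line that a NameNode is active, one option is to query the JMX status endpoint exposed on its web interface port (assuming the default port 50070 and the hypothetical host above):

      curl 'http://namenode1.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'

    An active NameNode reports "State" : "active" in the response.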
  5. For further configuration, choose Configure Connection Parameters.
  6. Specify key-value pairs for YARN on the Team Studio server. Selecting Load Configuration from Resource Manager attempts to populate the configuration values automatically. (A sample set of parameters follows this list.)
    Required:
    • yarn.resourcemanager.scheduler.address
    • yarn.app.mapreduce.am.staging-dir


    Note: Be sure that the directory specified in yarn.app.mapreduce.am.staging-dir is writable by the Team Studio user. Spark jobs produce errors if the user cannot write to this directory.
    Required if different from default:
    • yarn.application.classpath
      • The yarn.application.classpath does not need to be updated if the Hadoop cluster is installed in a default location.
      • If the Hadoop cluster is installed in a non-default location and yarn.application.classpath has a value different from the default, the YARN job might fail with a "cannot find the class AppMaster" error. In this case, check the yarn-site.xml file in the cluster configuration folder, and configure the same key-value pairs in the UI using the Configure Connection Parameters option.
    • yarn.app.mapreduce.job.client.port-range
      • This specifies the range of ports to which the application can bind. It is useful when operating behind a restrictive firewall that allows only specific ports.
    Recommended:
    • mapreduce.jobhistory.address = FQDN:10020
      Caution: Operators that use Pig for processing do not show the correct row count in output if mapreduce.jobhistory.address is not configured correctly. For more information, see Pig operators do not show row count output correctly.
    • yarn.resourcemanager.hostname = FQDN
    • yarn.resourcemanager.address = FQDN
    • yarn.resourcemanager.scheduler.address = FQDN:8030
    • yarn.resourcemanager.resource-tracker.address = FQDN:8031
    • yarn.resourcemanager.admin.address = FQDN:8033
    • yarn.resourcemanager.webapp.address = FQDN:8088
    • mapreduce.jobhistory.webapp.address = FQDN:19888
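
    As a concrete sketch, the parameters for a hypothetical cluster whose YARN and job history services all run on hadoop-master.example.com might look like the following. The hostname is a placeholder, and the ports shown are the common defaults; substitute the values for your cluster:

      yarn.resourcemanager.scheduler.address = hadoop-master.example.com:8030
      yarn.app.mapreduce.am.staging-dir = /tmp/hadoop-yarn/staging
      mapreduce.jobhistory.address = hadoop-master.example.com:10020
      yarn.resourcemanager.hostname = hadoop-master.example.com
      yarn.resourcemanager.address = hadoop-master.example.com:8032
      yarn.resourcemanager.resource-tracker.address = hadoop-master.example.com:8031
      yarn.resourcemanager.admin.address = hadoop-master.example.com:8033
      yarn.resourcemanager.webapp.address = hadoop-master.example.com:8088
      mapreduce.jobhistory.webapp.address = hadoop-master.example.com:19888

    To confirm that the staging directory is writable by the Team Studio user, you can attempt a test write from the command line, for example: hdfs dfs -touchz /tmp/hadoop-yarn/staging/write_test (and then remove the test file with hdfs dfs -rm).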
  7. Save the configuration.
    Note: Additional configuration might be needed for different Hadoop distributions.

    To connect to a Cloudera (CDH) cluster, follow the instructions above.

    To connect to an MIT KDC Kerberized cluster, a MapR cluster, a PHD cluster, or a YARN-enabled cluster, see the connection instructions for that cluster type.

    If you do not have all of these parameters yet, you can save your data source as "Incomplete" while working on it. For more information, see Data Source States.

  8. To perform a series of automated tests on the data source, click Test Connection.

  9. Click Save Configuration to confirm the changes.
  10. After connectivity to the active NameNode is established in the previous steps, set up NameNode High Availability (HA) if it is enabled on your cluster. (A sample HA configuration follows the parameter list.)

    Required:

    • dfs.ha.namenodes.nameservice1
    • dfs.namenode.rpc-address.nameservice1.namenode<id> (required for each namenode id)
    • dfs.nameservices
    • dfs.client.failover.proxy.provider.nameservice1
    Recommended:
    • ha.zookeeper.quorum
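
    For example, for a hypothetical nameservice named nameservice1 with two NameNodes (namenode1 and namenode2) and a three-node ZooKeeper quorum, the entries might look like the following. The hostnames are placeholders; the failover proxy provider class shown is the standard Hadoop client-side provider:

      dfs.nameservices = nameservice1
      dfs.ha.namenodes.nameservice1 = namenode1,namenode2
      dfs.namenode.rpc-address.nameservice1.namenode1 = nn1.example.com:8020
      dfs.namenode.rpc-address.nameservice1.namenode2 = nn2.example.com:8020
      dfs.client.failover.proxy.provider.nameservice1 = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      ha.zookeeper.quorum = zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181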
    Note: Support for Resource Manager HA is available.

    To configure it, add failover_resource_manager_hosts to the advanced connection parameters, and list the available Resource Managers, as in the following example.
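
    For example, assuming two Resource Managers on the hypothetical hosts rm1.example.com and rm2.example.com (a comma-separated list is assumed here; check the exact format for your release):

      failover_resource_manager_hosts = rm1.example.com,rm2.example.com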

    If the active Resource Manager fails while a job is running, you must re-run the job, but you no longer must reconfigure the data source. If the active Resource Manager fails while no job is running, you do not need to do anything; Team Studio uses another available Resource Manager instead.