HDFS Connection

The HDFS Connection shared resource contains all the parameters that are necessary to connect to HDFS. It can be used by the HDFS Operation, ListFileStatus, Read, and Write activities, and by the HCatalog Connection shared resource.

General

In the General panel, you can specify the package that stores the HDFS Connection shared resource and the shared resource name.

The following table lists the fields in the General panel of the HDFS Connection shared resource:

Field Module Property? Description
Package No The name of the package where the shared resource is located.
Name No The name used as the label for the shared resource in the process.
Description No A short description for the shared resource.

HDFSConnection

In the HDFSConnection Configuration panel, provide the information that is required to connect the plug-in to HDFS. You can also connect to a Kerberized HDFS server. The HDFS Connection shared resource also supports the Knox Gateway security system provided by Hortonworks and connectivity with Azure Data Lake Storage Gen1.
Note: You must configure Azure Data Lake Storage Gen1 with the service-to-service OAuth 2.0 authentication mechanism.

The following table lists the fields in the HDFSConnection panel of the HDFS Connection shared resource:

Condition Applicable Field Module Property? Description
N/A Connection Type No The connection type used to connect to HDFS. The following types of connections are available:
  • Namenode
  • Gateway
  • Azure Data Lake Storage Gen1

The default connection type is Namenode when a new connection is created.

Available only when you select Namenode as the connection type HDFS Url Yes The WebHDFS URL that is used to connect to HDFS. The default value is http://localhost:50070.

The plug-in supports HttpFS and HttpFS with SSL. You can enter an HttpFS URL with HTTP or HTTPS in this field. For example:

http://httpfshostname:14000

https://httpfshostname:14000

Note: To set up high availability for your cluster, enter two comma-separated URLs in this field. Make sure that there are no spaces between the comma and the second URL. The plug-in designates the first entry to be the primary node and the second entry to be a secondary node.
Available only when you select Gateway as the connection type Gateway Url Yes The Knox Gateway URL is used to connect to HDFS. For example, enter Knox Gateway URL as https://localhost:8443/gateway/default, where default is the topology name.
Note: The topology name appended to the end of the Gateway URL varies based on the topology that is used.
Password Yes The password that is used to connect to HDFS.

Available only when you select Namenode or Gateway as the connection type User Name Yes The user name that is used to connect to HDFS.
SSL No Select the check box to enable the SSL configuration. By default, the SSL check box is not selected.
Available only when you enable SSL configuration Key File Yes Select the server certificate for HDFS.
Key Password Yes The password for the server certificate.
Trust File Yes Select the client certificate for HDFS.
Trust Password Yes The password for the client certificate.
Available only when you select Namenode or Gateway as the connection type Enable Kerberos No Select this check box to connect to a Kerberized HDFS server.
By default, the Enable Kerberos check box is not selected.
Note: If your server uses the AES-256 encryption, you must install Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files on your machine. For more details, see Installing JCE Policy Files.
Available only when you enable Kerberos Kerberos Method No The Kerberos authentication method that is used to authorize access to HDFS. Select an authentication method from the list:
  • Keytab: specify a keytab file to authorize access to HDFS.
  • Cached: use the cached credentials to authorize access to HDFS.
  • Password: enter the name of the Kerberos principal and a password for the Kerberos principal.
Available only when you enable Kerberos and select either Keytab, Cached, or Password as the Kerberos method Kerberos Principal Yes The Kerberos principal name that is used to connect to HDFS.
Available only when you enable Kerberos and select Keytab as the Kerberos method Kerberos Keytab Yes The keytab that is used to connect to HDFS.
Login Module File Yes The login module file is used to authorize access to WebHDFS. Each LoginModule-specific item specifies a LoginModule, a flag value, and options to be passed to the LoginModule.
Note: You can leave the Kerberos Principal and Kerberos Keytab fields empty if a login module file is provided. If both are populated, the login module file takes precedence over the principal and keytab fields.

The login module file for the HDFS client has the following format:

HDFSClient
{ 
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=false 
  debug=true 
  keyTab="<keytab file path>" 
  principal="<Principal>";
};
Available only when you enable Kerberos and select Password as the Kerberos method Kerberos Password Yes The password for the Kerberos principal.
Available only when you select Azure Data Lake Storage Gen1 as the connection type Data Lake Name Yes The name of the Azure Data Lake Storage Gen1 resource that you created on the Azure portal.
Authentication Type Yes The authentication type that you want to use.

For the Azure Data Lake Storage Gen1 connection type, the plug-in uses default OAuth 2.0 resource values. To override the default values, set the -Dcom.tibco.bw.webhdfs.oauthtoken.resource system property (see the example after this table).

Directory (Tenant) ID Yes The tenant ID of the Azure Active Directory application.
Application (Client) ID Yes The client ID of the Azure Active Directory application.
Client Secret Yes The client secret registered under the Certificates and Secrets section of the Azure Active Directory application configuration.
Token Refresh Time (min) Yes The time, in minutes, after which the plug-in refreshes the authentication access token.

The default token refresh time is 60 minutes. By default, an Azure Active Directory application OAuth 2.0 access token is valid for 60 minutes.

To refresh the token a set amount of time before it expires, set the -Dcom.tibco.bw.webhdfs.oauthtoken.minbeforeexpiry system property (see the example after this table).
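
The following is a minimal illustration of overriding these settings by passing the system properties as JVM arguments to the engine that runs the plug-in. The values shown are placeholders, not product defaults; replace them with the OAuth 2.0 resource URI and the number of minutes before expiry that you want to use, and add the arguments wherever your runtime configuration accepts JVM properties.

-Dcom.tibco.bw.webhdfs.oauthtoken.resource=<OAuth 2.0 resource URI>
-Dcom.tibco.bw.webhdfs.oauthtoken.minbeforeexpiry=<minutes before expiry>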

Test Connection

You can click Test Connection to test whether the specified configuration fields result in a valid connection.

Setting up High Availability

You can set up high availability for your cluster in this panel. To do so, enter two URLs as comma-separated values (with no space between the comma and the second URL) in the HDFS Url field under the HDFS Connection section of this panel. The plug-in designates the first entry as the primary node and the second entry as the secondary node. If the primary node goes down, the plug-in automatically connects and routes requests to the secondary node.
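
For example, a high-availability value for the HDFS Url field might look like the following (the host names are placeholders):

http://namenode1.example.com:50070,http://namenode2.example.com:50070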

To check the status of a node, use the API <HDFS URL>/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus. For example: http://cdh571.na.tibco.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus
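
If you want to verify a node manually, the following is a minimal Java sketch that calls this JMX endpoint and prints the response. The host name is a placeholder, and the snippet is not part of the plug-in; the plug-in routes requests between nodes automatically.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NameNodeStatusCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode host and port; replace with your own values.
        String hdfsUrl = "http://namenode1.example.com:50070";
        URL url = new URL(hdfsUrl + "/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read the JSON response; its "State" attribute reports "active" or "standby".
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        }
        System.out.println(body);
    }
}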